# Python and R

In [3]:
%load_ext rpy2.ipython
%load_ext autoreload
%autoreload 2

%matplotlib inline  
from matplotlib import rcParams
rcParams['figure.figsize'] = (16, 100)

import warnings
from rpy2.rinterface import RRuntimeWarning
warnings.filterwarnings("ignore")  # Ignore all warnings
# warnings.filterwarnings("ignore", category=RRuntimeWarning)  # Alternative: ignore only R runtime warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML

In [4]:
%%javascript
// Disable auto-scrolling
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}


In [5]:
%%R

# My commonly used R imports

require('tidyverse')


R[write to console]: Loading required package: tidyverse



── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.4 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.1.4      ✔ stringr 1.4.0 
✔ readr   2.1.3      ✔ forcats 0.5.1 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



# The Data

For this repository, we'll be working with FiveThirtyEight's Pollster Ratings database. First, watch these primer videos that will help you better understand public opinion polling, and the pollster ratings data in particular:

- https://www.youtube.com/watch?v=TambSayfCOE
- https://www.youtube.com/watch?v=fzzX9jHDK4k

While FiveThirtyEight publishes [pollster ratings](https://projects.fivethirtyeight.com/pollster-ratings/), I ask that you not consult them while you do this assignment. I also ask that you not do outside reading about the reputation of individual pollsters. This assignment is an exercise in discovering insights about this dataset. When we debrief, we will look at the kinds of insights that journalists have been able to produce.

I have pulled the underlying data into a file called `raw-polls.csv`, which you will find in this repository. It contains polls from the final 21 days of Presidential, Senate, and Gubernatorial elections.

Let's start by looking at presidential elections only:

In [6]:
# EXAMPLE PYTHON CELL
df = pd.read_csv('raw-polls.csv')\
    .query("type_simple=='Pres-G'")
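
As a rough sketch of what that filter does, the snippet below counts polls per election type and then applies the same `Pres-G` filter two ways, with `.query()` and with a boolean mask. It reads a tiny inline sample instead of `raw-polls.csv` so it runs on its own. The rows, the pollster names, and the `Sen-G` and `Gov-G` labels are invented for illustration and are assumptions, not values taken from the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for raw-polls.csv: invented rows, not real polls.
sample_csv = io.StringIO(
    "poll_id,year,location,type_simple,pollster\n"
    "1,2000,US,Pres-G,Pollster A\n"
    "2,2000,FL,Sen-G,Pollster B\n"
    "3,2004,OH,Pres-G,Pollster C\n"
    "4,2004,TX,Gov-G,Pollster A\n"
)
polls = pd.read_csv(sample_csv)

# How many polls fall under each election type before filtering.
type_counts = polls["type_simple"].value_counts()

# The same filter written two ways: a query string and a boolean mask.
pres_query = polls.query("type_simple == 'Pres-G'")
pres_mask = polls[polls["type_simple"] == "Pres-G"]

print(type_counts)
print("Both filters agree:", pres_query.equals(pres_mask))
```

Running `value_counts()` on the real file before filtering is a quick way to check which election types it actually contains.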

You can find documentation about the meanings of these columns here: 
https://github.com/fivethirtyeight/data/tree/master/pollster-ratings

Please note that I have strategically removed some columns, so you won't find all of the columns in the original dataset here in this assignment.
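
Because some columns were removed, it can help to compare the columns in the file against the names you are looking up in the data dictionary. The snippet below is a minimal sketch of that check. The inline sample stands in for `raw-polls.csv`, its rows are invented, and the `documented` set is a hypothetical partial list rather than the real data dictionary.

```python
import io

import pandas as pd

# Hypothetical stand-in for raw-polls.csv: invented rows, not real polls.
sample_csv = io.StringIO(
    "poll_id,question_id,year,type_simple,pollster,methodology\n"
    "1,11,2000,Pres-G,Pollster A,Live Phone\n"
    "2,12,2004,Pres-G,Pollster B,Mail\n"
)
polls = pd.read_csv(sample_csv)

# Column names you expect from the data dictionary (hypothetical partial list).
documented = {
    "poll_id", "question_id", "year", "type_simple",
    "pollster", "methodology", "partisan",
}

present = set(polls.columns)
missing_from_file = sorted(documented - present)  # documented but not in the file
undocumented = sorted(present - documented)       # in the file but not in the list

print(polls.dtypes)
print("Documented but absent:", missing_from_file)
print("Present but not listed:", undocumented)
```

With the real data, replace the inline sample with `pd.read_csv('raw-polls.csv')` and fill `documented` with the column names from the data dictionary.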

Here is how we can load that same data in R and filter down to just presidential polls (just as we did above in Python).

In [8]:
%%R

# EXAMPLE R CELL
df <- read_csv('raw-polls.csv') %>% 
    filter(type_simple=='Pres-G')

df

Rows: 10776 Columns: 31
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): race, location, type_simple, type_detail, pollster, methodology, p...
dbl (17): poll_id, question_id, race_id, year, pollster_rating_id, samplesiz...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 2,940 × 31
   poll_id questio…¹ race_id  year race  locat…² type_…³ type_…⁴ polls…⁵ polls…⁶
     <dbl>     <dbl>   <dbl> <dbl> <chr> <chr>   <chr>   <chr>   <chr>     <dbl>
 1    6478      7947      40  2000 2000… US      Pres-G  Pres-G  Zogby …     395
 2    6483      7952     815  2000 2000… FL      Pres-G  Pres-G  McLaug…     203
 3    6470      7939     820  2000 2000… IL      Pres-G  Pres-G  KRC Re…     160
 4    6473      7942     820  2000 2000… IL      Pres-G  Pres-G  Resear…     281
 5    6474      7943     836  2000 2000… NH      Pres-G 

Before we dive into exploratory data viz, make sure you understand what all the columns in the spreadsheet are. There is a data dictionary on FiveThirtyEight’s GitHub page for this dataset that you can reference. If there is something you don’t understand, try asking your classmates in the #algorithms23 Slack channel.

Answer the questions below about the limitations and possibilities of what you can and cannot learn using this dataset.

## 👉 Possibilities and Limitations

_Whenever you see the 👉 icon, that means you need to write an answer._

👉 Questions about the accuracy of election polling in the U.S. that this dataset should allow me to answer (answer in bullet points below):

1. Which pollsters made the best predictions?

2. In which type of election were the predictions furthest off, and why? (Answering the "why" would require further reporting beyond looking at the pollsters' sample size, methodology, bias, etc.)

3. Do Dem-leaning polls have better or worse predictions than Rep-leaning polls?


👉 Questions about the accuracy of election polling in the U.S. that I won’t be able to answer with this dataset alone (answer in bullet points below):

Why are predictions better or worse in some years than in others? (I guess that would require more reporting about the political climate at the time.)


👉 Questions I have about this dataset. What don't you know about it that you'd like to know in order to responsibly use this data to report on elections?

1. Is one methodology better than another? Should I trust pollsters who used 'Live Phone' more (or less) than those who used 'Mail' or anything else?

2. What's question_id?

3. Some of the comments say 'among registered voters'. What does that mean?
