# Python and R

In [1]:
%load_ext rpy2.ipython
%load_ext autoreload
%autoreload 2

%matplotlib inline  
from matplotlib import rcParams
rcParams['figure.figsize'] = (16, 100)

import warnings
from rpy2.rinterface import RRuntimeWarning
warnings.filterwarnings("ignore") # Ignore all warnings
# warnings.filterwarnings("ignore", category=RRuntimeWarning) # Show some warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML



In [2]:
%%javascript
// Disable auto-scrolling
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}


In [3]:
%%R

# My commonly used R imports

require('tidyverse')


R[write to console]: Loading required package: tidyverse



-- Attaching packages ------------------------------------------------ tidyverse 1.3.2 --
v ggplot2 3.4.0      v purrr   1.0.1 
v tibble  3.1.8      v dplyr   1.0.10
v tidyr   1.2.1      v stringr 1.4.1 
v readr   2.1.3      v forcats 0.5.2 
-- Conflicts --------------------------------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()



# The Data

For this repository, we'll be working with FiveThirtyEight's Pollster Ratings database. First, watch these primer videos that will help you better understand public opinion polling, and the pollster ratings data in particular:

- https://www.youtube.com/watch?v=TambSayfCOE
- https://www.youtube.com/watch?v=fzzX9jHDK4k

While FiveThirtyEight publishes [pollster ratings](https://projects.fivethirtyeight.com/pollster-ratings/), I ask that you do not consult those as you do this assignment. I also ask that you not do outside reading about the reputation of individual pollsters. This assignment is an exercise in discovering insights about this dataset. When we debrief, we will look at the kinds of insights that journalists have been able to produce from this data.

I have pulled the underlying data into a file called `raw-polls.csv`, which you will find in this repository. It contains polls from the final 21 days of Presidential, Senate, and Gubernatorial elections.

Let's start by looking at presidential elections only:

In [5]:
# EXAMPLE PYTHON CELL
df = pd.read_csv('raw-polls.csv')\
    .query("type_simple=='Pres-G'")

You can find documentation about the meanings of these columns here: 
https://github.com/fivethirtyeight/data/tree/master/pollster-ratings

Please note that I have strategically removed some columns, so you won't find all of the columns from the original dataset in this assignment.

Here is how we can load that same data in R and filter down to just presidential polls (just as we did above in Python).

In [6]:
%%R

# EXAMPLE R CELL
df <- read_csv('raw-polls.csv') %>% 
    filter(type_simple=='Pres-G')

df

Rows: 10776 Columns: 31
-- Column specification --------------------------------------------------------
Delimiter: ","
chr (14): race, location, type_simple, type_detail, pollster, methodology, p...
dbl (17): poll_id, question_id, race_id, year, pollster_rating_id, samplesiz...

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 2,940 x 31
   poll_id questio~1 race_id  year race  locat~2 type_~3 type_~4 polls~5 polls~6
     <dbl>     <dbl>   <dbl> <dbl> <chr> <chr>   <chr>   <chr> 

Before we dive into exploratory data viz, make sure you understand what all the columns in the spreadsheet are. There is a data dictionary on FiveThirtyEight’s GitHub page for this dataset that you can reference. If there is something you don’t understand, try asking your classmates in the #algorithms23 Slack channel.
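As a first pass at understanding the columns, it can help to list each column's type and count its missing values. The sketch below does this on a small made-up table rather than on `raw-polls.csv`: the rows are invented for illustration, and only the column names `type_simple`, `year`, and `margin_poll` are taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the polls table; the values are invented for illustration.
toy = pd.DataFrame({
    "type_simple": ["Pres-G", "Pres-G", "Sen-G"],
    "year": [2016, 2020, 2020],
    "margin_poll": [4.0, None, 1.5],
})

# One row per column: its dtype and how many values are missing.
overview = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_missing": toy.isna().sum(),
})
print(overview)

# How many polls fall into each race type.
type_counts = toy["type_simple"].value_counts()
print(type_counts)
```

The same two steps should carry over to the `df` loaded in the Python cell above, assuming it is still in memory.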

Answer the questions below about the limitations and possibilities of what you can and cannot learn using this dataset.

## 👉 Possibilities and Limitations

_Whenever you see the 👉 icon, that means you need to write an answer._

👉 Questions about the accuracy of election polling in the U.S. that this dataset should allow me to answer (answer in bullet points below):
- The `rightcall` column lets me answer a question about the overall accuracy of polls: how often do they correctly predict the winner of an individual race?
- Subtracting `margin_poll` from `margin_actual` lets me analyze how far off each poll was. If the polled margin and the actual margin diverge even when the poll predicted the winner correctly, that adds a layer of caution about the usefulness of the poll.
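The two ideas above can be sketched in a few lines of pandas. This is a minimal illustration on invented numbers, not on the real polls, and it assumes `rightcall` is coded as 1 for a correct call and 0 for a miss; check the data dictionary for the actual coding.

```python
import pandas as pd

# Invented example rows; only the column names come from the dataset.
polls = pd.DataFrame({
    "margin_poll":   [4.0, -2.0,  1.0, 6.0],
    "margin_actual": [2.0, -5.0, -1.0, 7.0],
    "rightcall":     [1.0,  1.0,  0.0, 1.0],  # assumed coding: 1 = correct, 0 = miss
})

# Share of polls that picked the eventual winner.
call_rate = polls["rightcall"].mean()

# Signed difference between the actual and polled margins, and its size.
polls["margin_diff"] = polls["margin_actual"] - polls["margin_poll"]
mean_abs_error = polls["margin_diff"].abs().mean()

print(f"share of correct calls: {call_rate:.2f}")
print(f"mean absolute margin error: {mean_abs_error:.2f} points")
```

In this toy table the second poll calls the winner correctly yet misses the margin by 3 points, which is the kind of case the second bullet describes.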


👉 Questions about the accuracy of election polling in the U.S. that I won’t be able to answer with this dataset alone (answer in bullet points below):

- No good methodological question comes to mind, but I would assume that polls reflect voter preferences more accurately when they are taken closer to election day and the race is well defined.


👉 Questions I have about this dataset. What don't you know about it that you'd like to in order to responsibly use this data to report on elections?
- Are some polls methodologically stronger than others? Should the "better" polls receive more weight when aggregating results? I would assume that all polls meet a minimum quality threshold.
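
One way to make the weighting question concrete is to compare a plain average of polled margins with a weighted one. The `weight` column below is hypothetical (it is not in the dataset) and the numbers are invented; how to derive sensible weights is exactly the open question.

```python
import numpy as np
import pandas as pd

# Invented polls of a single race; `weight` is a hypothetical quality score.
polls = pd.DataFrame({
    "margin_poll": [4.0, -2.0, 6.0],
    "weight":      [1.0,  1.0, 2.0],
})

# A plain average treats every poll the same.
simple_mean = polls["margin_poll"].mean()

# Weighted average: sum(margin * weight) / sum(weight).
weighted_mean = np.average(polls["margin_poll"], weights=polls["weight"])

print(f"simple mean:   {simple_mean:.2f}")
print(f"weighted mean: {weighted_mean:.2f}")
```

Giving the third poll twice the weight pulls the aggregate toward its margin, which shows how much the choice of weights can matter.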