# Introduction

In this problem you will be analysing polling data. Specifically, the data that you will be working with has been obtained from http://projects.fivethirtyeight.com/2016-election-forecast/. If you're interested, you can go through some interesting visualizations and analyses they have performed on top of this data.

# Polling Data
The polling data consists of several polls (uniquely identified by `poll_id`) conducted by pollsters. Each pollster is associated with a `poll_wt` and a `grade`, which reflect the confidence in the results of the polls they conduct. The raw poll counts are the actual results of the poll, while the adjusted polls are the forecasts made by http://projects.fivethirtyeight.com using one of its three models (given in the `type` column). For more details on the attributes, see [this](http://fivethirtyeight.com/features/a-users-guide-to-fivethirtyeights-2016-general-election-forecast/) page.

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as st

## Q1. Load data [2pts]
You will load the data from a CSV file into a pandas dataframe, following the specifications given below.

### Specifications
1. State must be one of the 50 states (or U.S.), without being further subdivided (e.g., see Maine).
2. Some polls may not have the option of Johnson or McMullin, so their raw and adjusted poll counts would be empty. We need to treat these as zeros.
3. Assign grade 'E' to any ungraded pollster.
4. Create a new column called 'adj_poll_weight' which takes into account both the poll_wt and the sample size. This will be used as the *adjusted weight* for each poll in all weighted mean/variance/confidence interval/p-value computations.

 adj_poll_weight = poll_wt * $\log_{10}$(samplesize)
5. If samplesize is missing or is 0, simply ignore that row so that the above expression makes sense.
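
As a quick illustration of specifications 2 to 5, here is a minimal sketch on a hypothetical three-row frame. The values are invented for illustration; only the column names come from the data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; values are made up for illustration only.
toy = pd.DataFrame({
    'grade': ['A', None, 'B'],
    'samplesize': [1000.0, 100.0, np.nan],
    'poll_wt': [2.0, 1.5, 3.0],
    'rawpoll_johnson': [5.0, np.nan, 4.0],
})

# Specs 2 and 3: missing candidate counts become 0, missing grades become 'E'.
toy = toy.fillna({'rawpoll_johnson': 0.0, 'grade': 'E'})

# Spec 5: drop rows whose sample size is missing or zero (NaN > 0 is False).
toy = toy[toy['samplesize'] > 0]

# Spec 4: adjusted weight = poll_wt * log10(samplesize).
toy = toy.assign(adj_poll_weight=toy['poll_wt'] * np.log10(toy['samplesize']))
```

The third row is dropped because its sample size is missing, so only two rows remain, with adjusted weights 2.0 * 3 and 1.5 * 2.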

In [120]:
def load_data(file_name):
    """ Loads data from the input CSV file and processes it as mentioned above
    Inputs:
        file_name (str): file name
    Outputs:
        pd.DataFrame: processed data frame containing polls
    """
    df = pd.DataFrame.from_csv(file_name)
    df[['startdate', 'enddate', 'createddate']] = df[['startdate', 'enddate', 'createddate']].apply(pd.to_datetime)
    df['state'] = df['state'].apply(lambda s: 'Maine' if s.startswith('Maine') else 'Nebraska' if s.startswith('Nebraska') else s)
    df = df.fillna({'rawpoll_mcmullin' : 0.0, 'adjpoll_mcmullin':0.0, 'rawpoll_johnson':0.0, 'adjpoll_johnson': 0.0, 'grade': 'E',})
    df = df[df['samplesize'] > 0]
    df = df.assign(adj_poll_weight=df['poll_wt'] * np.log10(df['samplesize']))
    return df.reset_index()

#AUTOLAB_IGNORE_START
df = load_data('polls.csv')
# print df.dtypes
# print df.head()
# print df.shape
print df['state'].unique()
#AUTOLAB_IGNORE_STOP

['U.S.' 'Florida' 'California' 'Pennsylvania' 'Missouri' 'New Hampshire'
 'New Mexico' 'North Carolina' 'Iowa' 'Arizona' 'Minnesota' 'Texas'
 'Virginia' 'Georgia' 'Nevada' 'Idaho' 'Ohio' 'Vermont' 'Colorado'
 'Wisconsin' 'Illinois' 'South Carolina' 'Michigan' 'Montana' 'New York'
 'Kansas' 'South Dakota' 'Washington' 'Maine' 'Utah' 'Louisiana'
 'Massachusetts' 'Oregon' 'Maryland' 'Mississippi' 'Oklahoma' 'Indiana'
 'New Jersey' 'Nebraska' 'Connecticut' 'Tennessee' 'Delaware' 'Arkansas'
 'Hawaii' 'Rhode Island' 'Alaska' 'Wyoming' 'District of Columbia'
 'Alabama' 'West Virginia' 'Kentucky' 'North Dakota']


Our reference yields the following output:

```python
>>> df.dtypes
type                        object
state                       object
startdate           datetime64[ns]
enddate             datetime64[ns]
pollster                    object
grade                       object
samplesize                 float64
population                  object
poll_wt                    float64
rawpoll_clinton            float64
rawpoll_trump              float64
rawpoll_johnson            float64
rawpoll_mcmullin           float64
adjpoll_clinton            float64
adjpoll_trump              float64
adjpoll_johnson            float64
adjpoll_mcmullin           float64
url                         object
poll_id                      int64
question_id                  int64
createddate         datetime64[ns]
adj_poll_weight            float64
dtype: object
```

```python
>>> df.head()
         type    state  startdate    enddate  \
0  polls-plus     U.S. 2016-10-20 2016-10-24   
1  polls-plus     U.S. 2016-10-20 2016-10-25   
2  polls-plus  Florida 2016-10-20 2016-10-24   
3  polls-plus     U.S. 2016-10-22 2016-10-25   
4  polls-plus     U.S. 2016-10-25 2016-10-27   

                                            pollster grade  samplesize  \
0                            Google Consumer Surveys     B     21240.0   
1                                Pew Research Center    B+      2120.0   
2                                          SurveyUSA     A      1251.0   
3  Fox News/Anderson Robbins Research/Shaw & Comp...     A      1221.0   
4                           ABC News/Washington Post    A+       956.0   

  population   poll_wt  rawpoll_clinton       ...         rawpoll_mcmullin  \
0         lv  5.237220            38.54       ...                      0.0   
1         rv  3.623270            46.00       ...                      0.0   
2         lv  3.584933            48.00       ...                      0.0   
3         lv  3.561260            44.00       ...                      0.0   
4         lv  3.471576            47.00       ...                      0.0   

   adjpoll_clinton  adjpoll_trump  adjpoll_johnson  adjpoll_mcmullin  \
0         43.46984       39.98077         5.426960               0.0   
1         45.46572       41.28611         3.960040               0.0   
2         46.66093       44.43937         2.152259               0.0   
3         44.91556       41.45449         6.522821               0.0   
4         45.13807       43.87921         4.487876               0.0   

                                                 url  poll_id question_id  \
0  https://datastudio.google.com/u/0/#/org//repor...    47407       74188   
1  http://www.people-press.org/2016/10/27/as-elec...    47616       74519   
2  http://www.baynews9.com/content/news/baynews9/...    47465       74252   
3  http://www.foxnews.com/politics/interactive/20...    47542       74365   
4  http://www.langerresearch.com/wp-content/uploa...    47711       74693   

   createddate  adj_poll_weight  
0   2016-10-25        22.662260  
1   2016-10-27        12.052213  
2   2016-10-25        11.103460  
3   2016-10-26        10.992597  
4   2016-10-29        10.346886  

[5 rows x 22 columns]
```

# Sample Statistics

Recall from the class slides that the sample mean and variance for a list of observations $x_i$ for $i=1,\ldots,m$ are given by
$$ \bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i $$

$$ s^2 = \frac{1}{m-1} \sum_{i=1}^{m} (x_i - \bar{x})^2 $$
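
The two formulas can be checked numerically with a few lines of NumPy. This is a small sketch on illustrative observations (not taken from the polling data); `ddof=1` selects the $m-1$ divisor used above.

```python
import numpy as np

# Illustrative observations (not taken from the polling data).
x = np.array([2.0, 4.0, 4.0, 6.0])
m = len(x)

x_bar = x.sum() / m                        # sample mean
s2 = ((x - x_bar) ** 2).sum() / (m - 1)    # sample variance with the m - 1 divisor

# NumPy's built-ins agree when ddof=1 selects the m - 1 divisor.
assert np.isclose(x_bar, np.mean(x))
assert np.isclose(s2, np.var(x, ddof=1))
```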

## Q2. Sample mean and variance [4pts]
These unweighted observations can be treated as weighted observations, each with weight 1. Your first task is to extend sample mean and variance to weighted observations and implement them.
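
One natural extension, and one that is consistent with the outputs shown for this question, replaces the count $m$ by the total weight $\sum_i w_i$. This is a sketch of one possible convention, not the only way to define a weighted variance:

$$ \bar{x}_w = \frac{\sum_{i=1}^{m} w_i x_i}{\sum_{i=1}^{m} w_i} $$

$$ s_w^2 = \frac{\sum_{i=1}^{m} w_i (x_i - \bar{x}_w)^2}{\sum_{i=1}^{m} w_i - 1} $$

Setting every $w_i = 1$ recovers the unweighted formulas above.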

In [35]:
def sample_mean(x, w):
    """ Estimates weighted mean of data
    Inputs:
        x (array-like): array of data points/observations
        w (array-like): array of weights for these observations
    Outputs:
        (float) - weighted mean of data
    """
    return float((x*w).sum()) / w.sum()

def sample_variance(x, w):
    """ Estimates sample variance of data
    Inputs:
        x (array-like): array of data points/observations
        w (array-like): array of weights for these observations
    Outputs:
        (float) - sample variance of data (according to the given weights)
    """
    mean = sample_mean(x, w)
    return float((w*((x-mean)**2)).sum())/(w.sum()-1)
    
#AUTOLAB_IGNORE_START
print sample_mean(np.arange(10), np.arange(10))
print sample_variance(np.arange(10), np.arange(10))
#AUTOLAB_IGNORE_STOP

6.33333333333
5.0


Our implementation yields:

```python
>>> sample_mean(np.arange(10), np.arange(10))
6.33333333333
>>> sample_variance(np.arange(10), np.arange(10))
5.0
```

## Q3. Confidence Interval [4pts]
Using the sample mean and variance computed above, implement the following function to compute the two-sided confidence interval of the mean, using Student's t distribution or the normal distribution depending on the total weight of the observations. Use the rule of thumb given in the slides (after adapting it for weighted samples).
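
To make the choice of distribution concrete, here is a minimal unweighted sketch. The helper name `unweighted_ci` and the cutoff of 30 observations are assumptions for illustration (30 is a common rule of thumb; check the slides for the exact rule). A weighted version would substitute the total weight for the sample size.

```python
import numpy as np
import scipy.stats as st

def unweighted_ci(x, alpha=0.05, cutoff=30):
    """Two-sided (1 - alpha) confidence interval for the mean of x.

    Hypothetical helper: uses Student's t below `cutoff` observations
    and the normal distribution otherwise.
    """
    x = np.asarray(x, dtype=float)
    m = len(x)
    mean = x.mean()
    std_err = np.sqrt(x.var(ddof=1) / m)
    if m < cutoff:
        crit = st.t.ppf(1.0 - alpha / 2.0, m - 1)   # t quantile, m - 1 degrees of freedom
    else:
        crit = st.norm.ppf(1.0 - alpha / 2.0)       # standard normal quantile
    return mean - crit * std_err, mean + crit * std_err

low, high = unweighted_ci([2.0, 4.0, 4.0, 6.0])
```

With only four observations the t quantile is used, which gives a wider interval than the normal quantile would.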

In [67]:
def two_sided_confidence_interval_of_mean(x, w, alpha=0.05):
    """ Estimates confidence interval of mean of data using the Student's T distribution or normal distribution.
    Inputs:
        x (array-like): array of data points/observations
        w (array-like): array of weights for these observations
        alpha (float): significance level (the interval has confidence level 1 - alpha)
    Outputs:
        (float, float) - lower and upper limit of the confidence interval of mean of data 
                        (according to the given weights)
    """
    mean = sample_mean(x, w)
    variance = sample_variance(x, w)
    m = w.sum()
    if m < 30:
        ci = np.sqrt(variance / m) * st.t(m - 1).ppf(1.0-alpha/2.0)
    else:
        ci = np.sqrt(variance / m) * st.norm().ppf(1.0-alpha/2.0)
    
    return mean-ci, mean+ci

#AUTOLAB_IGNORE_START
print two_sided_confidence_interval_of_mean(np.arange(5), np.arange(5))
print two_sided_confidence_interval_of_mean(np.arange(50), np.arange(50))
#AUTOLAB_IGNORE_STOP

(2.2459476124196693, 3.7540523875803307)
(32.34667867181998, 33.65332132818002)


Our implementation yields:

```python
>>> print two_sided_confidence_interval_of_mean(np.arange(5), np.arange(5))
(2.2459476124196693, 3.7540523875803307)
>>> two_sided_confidence_interval_of_mean(np.arange(50), np.arange(50))
(32.34667867181998, 33.65332132818002)
```

## Q4. Swing states [7pts]
In this part, you will first implement a function to compute the confidence intervals of the raw poll results of all candidates in a given state. In doing so, make sure to count each poll (identified uniquely by `poll_id`) exactly once. (A single poll appears once under each `type`, with the same raw poll values but different adjusted poll values.)
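
One way to count each poll once is `drop_duplicates` on `poll_id`. The sketch below uses hypothetical rows with invented values to show the effect.

```python
import pandas as pd

# Hypothetical rows: poll 1 appears once per model type with identical raw counts.
toy = pd.DataFrame({
    'poll_id': [1, 1, 1, 2],
    'type': ['polls-plus', 'polls-only', 'now-cast', 'polls-plus'],
    'rawpoll_clinton': [46.0, 46.0, 46.0, 44.0],
    'adjpoll_clinton': [45.5, 45.7, 45.9, 44.2],
})

# Keep the first row for each poll_id so every poll contributes exactly once.
unique_polls = toy.drop_duplicates(subset='poll_id', keep='first')
```

Because the raw counts are the same under every `type`, it does not matter which of the duplicate rows is kept.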

In [69]:
def poll_confidence_intervals(df, state='Florida', alpha=0.05):
    """ Estimates confidence intervals for raw polls of clinton, trump, johnson and mcmullin
    Inputs:
        df (pd.DataFrame) - data frame with polls data
        state (str) - state for which confidence interval of mean has to be computed
        alpha (float) - significance level (the intervals have confidence level 1 - alpha)
    Outputs:
        dict: keys are candidate names and values are the confidence intervals (i.e., tuples of floats, 
                indicating the lower and upper bounds of the interval, respectively)
    """
    df = df[df['state'] == state]
    df = df.drop_duplicates(subset='poll_id', keep='first')
    candidates = ['clinton', 'trump', 'johnson', 'mcmullin']
    return {c: two_sided_confidence_interval_of_mean(df['rawpoll_' + c], df['adj_poll_weight'], alpha) for c in candidates}

#AUTOLAB_IGNORE_START
print poll_confidence_intervals(df, state='Florida', alpha=0.05)
print poll_confidence_intervals(df, state='Maine', alpha=0.1)
#AUTOLAB_IGNORE_STOP

{'clinton': (45.034797316193895, 45.904042863492315), 'trump': (42.591138848195598, 43.480477966774188), 'johnson': (3.3155643710340765, 4.0095495564481176), 'mcmullin': (0.0, 0.0)}
{'clinton': (41.414017061498278, 44.0333275406038), 'trump': (35.861329660838202, 38.177661364411641), 'johnson': (5.3307907923279281, 7.8243833448439855), 'mcmullin': (0.0, 0.0)}


Our implementation yields:
```python
>>> print poll_confidence_intervals(df, state='Florida', alpha=0.05)
{'clinton': (45.034797316193895, 45.904042863492315), 'trump': (42.591138848195598, 43.480477966774188), 'johnson': (3.3155643710340765, 4.0095495564481176), 'mcmullin': (0.0, 0.0)}
>>> print poll_confidence_intervals(df, state='Maine', alpha=0.1)
{'clinton': (41.414017061498278, 44.0333275406038), 'trump': (35.861329660838202, 38.177661364411641), 'johnson': (5.3307907923279281, 7.8243833448439855), 'mcmullin': (0.0, 0.0)}
```

Now, let us define a **swing state** as a state (excluding U.S. and the District of Columbia) where the confidence intervals (at a specified confidence level) of the leading candidates (we will assume them to be `trump` and `clinton` in all states) overlap. Report the set of all swing states by filling in the function below.
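
The overlap condition itself is small enough to sketch separately. The helper name `intervals_overlap` is hypothetical; the Florida bounds are rounded from the output shown above.

```python
def intervals_overlap(a, b):
    """Return True if closed intervals a = (low, high) and b = (low, high) share a point."""
    a_low, a_high = a
    b_low, b_high = b
    # Two intervals overlap exactly when each one starts before the other ends.
    return a_low <= b_high and b_low <= a_high

# Florida intervals at alpha = 0.05, rounded from the output above.
florida_clinton = (45.03, 45.90)
florida_trump = (42.59, 43.48)

# Clinton's interval lies entirely above Trump's, so Florida is not a swing state here.
florida_is_swing = intervals_overlap(florida_clinton, florida_trump)
```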

In [81]:
def swing_states(df, alpha=0.05):
    """ Determines the set of swing states at a given confidence level
    Inputs:
        df (pd.DataFrame) - data frame with polls data
        alpha (float) - significance level (the intervals have confidence level 1 - alpha)
    Outputs:
        set(str) - set of swing states
    """
    result = set()
    for state in df['state'].unique():
        if state in ['U.S.', 'District of Columbia']:
            continue
        
        poll_interval = poll_confidence_intervals(df, state, alpha)
        c_l, c_h = poll_interval['clinton']
        t_l, t_h = poll_interval['trump']
        # Two intervals overlap exactly when each one starts before the other ends.
        if c_l <= t_h and t_l <= c_h:
            result.add(state)
    return result
    

#AUTOLAB_IGNORE_START
print swing_states(df, alpha=0.05)
#AUTOLAB_IGNORE_STOP

set(['Ohio', 'Arizona', 'Iowa'])


Our implementation gives:
```python
>>> swing_states(df, alpha=0.05)
set(['Ohio', 'Arizona', 'Iowa'])
```

## Q5. Significant differences [8pts]
The adjusted polls in data have been determined using three models:
- *polls-plus*: What polls, the economy and historical data tell us about Nov. 8
- *polls-only*: What polls alone tell us about Nov. 8
- *now-cast*: Who would win the election if it were held today

Is there a difference in the adjusted polls using the three models? For a given pair of models (`type1` and `type2`), take the null hypothesis to be that the adjusted polls of both types have the same (weighted) mean (for a given candidate), and the alternative hypothesis to be that they have different (weighted) means. Report the p-value using Welch's $t$-test from the class slides.
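
For reference, here is a minimal sketch of the standard unweighted Welch's $t$-test on illustrative samples (not taken from the polling data), computed by hand and cross-checked against `scipy.stats.ttest_ind` with `equal_var=False`. The weighted version asked for here replaces the sample sizes with total weights, which SciPy's function does not do.

```python
import numpy as np
import scipy.stats as st

# Illustrative unweighted samples (not taken from the polling data).
a = np.array([45.0, 46.5, 44.0, 47.0, 45.5])
b = np.array([43.0, 44.5, 46.0, 42.5, 44.0, 45.0])

va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)   # squared standard errors
t_stat = (a.mean() - b.mean()) / np.sqrt(va + vb)

# Welch-Satterthwaite approximation to the degrees of freedom.
dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))

# Two-sided p-value: probability of a statistic at least this extreme under the null.
p_manual = 2.0 * st.t.sf(np.abs(t_stat), dof)

# SciPy's built-in Welch test (equal_var=False) should agree.
t_scipy, p_scipy = st.ttest_ind(a, b, equal_var=False)
```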

In [107]:
def difference_model(df, type1='polls-only', type2='now-cast', candidate='clinton'):
    """
    Inputs:
        df (pd.DataFrame) - data frame with polls data
        type1 (str) - model type 1
        type2 (str) - model type 2
        candidate (str) - candidate whose poll is being compared
    Outputs:
        float - p-value of hypothesis
    """
    df1 = df[df['type'] == type1]
    df2 = df[df['type'] == type2]
    mean_1 = sample_mean(df1['adjpoll_'+candidate], df1['adj_poll_weight'])
    s2_1 = sample_variance(df1['adjpoll_'+candidate], df1['adj_poll_weight'])
    m_1 = df1['adj_poll_weight'].sum(axis=0)
    mean_2 = sample_mean(df2['adjpoll_'+candidate], df2['adj_poll_weight'])
    s2_2 = sample_variance(df2['adjpoll_'+candidate], df2['adj_poll_weight'])
    m_2 = df2['adj_poll_weight'].sum(axis=0)
    
   
    t = (mean_1 - mean_2) / np.sqrt(s2_1 / m_1 + s2_2 / m_2)
    d = (s2_1 / m_1 + s2_2 / m_2) ** 2 / (((s2_1 / m_1) ** 2) / (m_1 - 1) + ((s2_2 / m_2) ** 2 / (m_2 - 1)))
    return 2 * st.t(d-1).cdf(-np.abs(t))
    
#AUTOLAB_IGNORE_START
print difference_model(df, type1='polls-only', type2='now-cast', candidate='clinton')
#AUTOLAB_IGNORE_STOP

0.77527543606


Our implementation gives a p-value of 0.77527543606.

Perform a similar hypothesis test on the *raw polls* (again, use each `poll_id` exactly once) using pollsters of different grades. For a given pair of grades (`grade1` and `grade2`), your null hypothesis is that the (weighted) means of the raw polls from pollsters of the two grades are identical, and your alternative hypothesis is that they are unequal.

In [110]:
def difference_grade(df, grade1='A+', grade2='B+', candidate='clinton'):
    """
    Inputs:
        df (pd.DataFrame) - data frame with polls data
        grade1 (str) - grade 1
        grade2 (str) - grade 2
        candidate (str) - candidate whose poll is being compared
    Outputs:
        float - p-value of hypothesis
    """
    df = df.drop_duplicates(subset='poll_id', keep='first')
    df1 = df[df['grade'] == grade1]
    df2 = df[df['grade'] == grade2]
    
    mean_1 = sample_mean(df1['rawpoll_'+candidate], df1['adj_poll_weight'])
    s2_1 = sample_variance(df1['rawpoll_'+candidate], df1['adj_poll_weight'])
    m_1 = df1['adj_poll_weight'].sum(axis=0)
    
    mean_2 = sample_mean(df2['rawpoll_'+candidate], df2['adj_poll_weight'])
    s2_2 = sample_variance(df2['rawpoll_'+candidate], df2['adj_poll_weight'])
    m_2 = df2['adj_poll_weight'].sum(axis=0)
    
    t = (mean_1 - mean_2) / np.sqrt(s2_1 / m_1 + s2_2 / m_2)
    d = (s2_1 / m_1 + s2_2 / m_2) ** 2 / (((s2_1 / m_1) ** 2) / (m_1 - 1) + ((s2_2 / m_2) ** 2 / (m_2 - 1)))
    return 2 * st.t(d).cdf(-np.abs(t))

#AUTOLAB_IGNORE_START
print difference_grade(df, grade1='A+', grade2='B+', candidate='clinton')
#AUTOLAB_IGNORE_STOP

0.0307326598542


Our implementation gives a p-value of 0.0307300897182.