# Homework 4: Can we predict the Midterm Elections?

---


## Introduction


I will use the [HuffPost Pollster API](http://elections.huffingtonpost.com/pollster/api/v2) to extract the polls for the current 2014 Senate Midterm Elections and provide a final prediction of the result in each state.

#### Data

We will use the polls from the [2014 Senate Midterm Elections](http://elections.huffingtonpost.com/pollster) from the [HuffPost Pollster API](http://elections.huffingtonpost.com/pollster/api/v2). 

---

## Problem 1: Data Wrangling

We will read in the polls from the [2014 Senate Midterm Elections](http://elections.huffingtonpost.com/pollster) from the [HuffPost Pollster API](http://elections.huffingtonpost.com/pollster/api/v2) and create a dictionary of DataFrames as well as a master table of information for each race.

#### Problem 1(a)

Read in [this JSON object](http://elections.huffingtonpost.com/pollster/api/charts/?topic=2014-senate) containing the polls for the 2014 Senate Elections using the HuffPost API. Call this JSON object `info`. This JSON object is imported as a list in Python where each element contains the information for one race. Use the function `type` to confirm that `info` is a list.

In [1]:
import requests
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from bs4 import BeautifulSoup
from collections import defaultdict

In [2]:
info = requests.get('http://elections.huffingtonpost.com/pollster/api/v2/charts/',
                    params = {'tags': '2014-senate'}).json()

#### Problem 1(b)

For each element of the list in `info` extract the state. We should have one poll per state, but we do not. Why?

**Hint**: Use the internet to find out information on the races in each state that has more than one entry. Eliminate entries of the list that represent races that are not happening.

In [None]:
new_info = defaultdict(list)
for i in info['items']:
    state = i['question']['slug'].split('-')[1]
    candidates = set([c['label'] for c in i['question']['responses']])
    candidates.discard('Undecided')
    candidates.discard('Other')
    if (state not in new_info
        or not set([c['label'] for s in new_info[state] for c in s['question']['responses']]) & candidates):
        # Append a race if its state is not in our dictionary
        # or if this race is a special election ie. no overlap between candidate lists
        new_info[state].append(i)
    else:
        for idx, v in enumerate(new_info[state]):
            if (set([c['label'] for c in v['question']['responses']]) & candidates
                and i['question']['n_polls'] > v['question']['n_polls']):
                # Replace a current race with one that has larger n_polls
                # if there's an overlap between candidate lists
                new_info[state][idx] = i
unpacked_new_info = [rc for v in new_info.values() for rc in v]

**My answer here:** There are polls for races that never happened, and there are special elections in some states.
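A quick way to spot the states with more than one entry is to count the state codes embedded in the question slugs. The sketch below uses a small list of hypothetical slugs (not real API output); the only thing it takes from the cell above is the assumption that the state code is the second `-`-separated token.

```python
from collections import Counter

# Hypothetical question slugs, made up for illustration only.
# Assumption carried over from the cell above: the state code is token [1].
slugs = [
    '14-AA-sen-x-vs-y',
    '14-BB-sen-p-vs-q',
    '14-AA-sen-x-vs-z',
    '14-CC-sen-m-vs-n',
]

# Count how many entries each state has
state_counts = Counter(slug.split('-')[1] for slug in slugs)

# States with more than one entry need a manual look
duplicated = sorted(state for state, n in state_counts.items() if n > 1)
print(duplicated)
```

Each state flagged this way can then be checked by hand to decide whether the extra entry is a special election or a matchup that never took place.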

#### Problem 1(c)

Create a dictionary of pandas DataFrames called `polls` keyed by the name of the election (a string). Each value in the dictionary should contain the polls for one of the races.

In [None]:
polls = {}
for rc in unpacked_new_info:
    polls[rc['slug']] = pd.read_csv('http://elections.huffingtonpost.com/pollster/api/v2/questions/'
                                    + rc['question']['slug'] + '/poll-responses-clean.tsv', sep = '\t')

#### Problem 1(d)

Now create a master table containing information about each race. Create a pandas DataFrame called `candidates` with one row per race. The `candidates` DataFrame should have the following columns:

1. `State` = the state where the race is being held
2. `R` = name of the Republican candidate
3. `D` = name of the non-Republican candidate (Democrat or Independent)
4. `incumbent` = R, D or NA

**Hint**: We will need a considerable amount of data wrangling for this.

In [None]:
rows = []
for rc in (rc['question'] for v in new_info.values() for rc in v):
    # Build one row per race so the columns stay aligned even when a race
    # has no Republican or no Democrat among its responses
    row = {'State': rc['slug'].split('-')[1], 'R': None, 'D': None, 'incumbent': 'NA'}
    for rspns in rc['responses']:
        if rspns['party'] == 'Republican':
            row['R'] = rspns['label']
            if rspns['incumbent']:
                row['incumbent'] = 'R'
        elif rspns['party'] == 'Democrat':
            # Note: only responses labelled 'Democrat' are captured here
            row['D'] = rspns['label']
            if rspns['incumbent']:
                row['incumbent'] = 'D'
    rows.append(row)
candidates = pd.DataFrame(rows, columns = ['State', 'R', 'D', 'incumbent'])
candidates.head()

## Problem 2: Confidence Intervals

Compute a 99% confidence interval for each state. 

#### Problem 2(a)

Assume we have $M$ polls with sample sizes $n_1, \dots, n_M$. If the polls are independent, what is the average of the variances of each poll if the true proportion is $p$?

**My answer here:** Each poll $i$ has variance $p(1-p)/n_i$, so the average over the $M$ polls is $$ \frac{p(1-p)(1/n_1 + \dots + 1/n_M)}{M} $$
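As a numerical check of the formula above, here is a minimal sketch that averages the per-poll variances $p(1-p)/n_i$. The function name `average_poll_variance` and the input values are hypothetical, chosen only for illustration.

```python
import numpy as np

def average_poll_variance(p, sample_sizes):
    """Average of the per-poll variances p(1-p)/n_i over M polls."""
    n = np.asarray(sample_sizes, dtype=float)
    return p * (1 - p) * np.mean(1 / n)

# Hypothetical inputs: true proportion 0.5 and three polls
avg_var = average_poll_variance(0.5, [400, 625, 1000])
avg_sd = np.sqrt(avg_var)   # square root of the averaged variance
print(avg_var, avg_sd)
```

Note that the square root of the averaged variance is not the same quantity as the average of the per-poll standard deviations, although the two are usually close when the sample sizes are similar.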

#### Problem 2(b)

Compute the square root of the values in Problem 2(a) for the Republican candidates in each race. Then, compute the standard deviations of the observed poll results for each race.

In [None]:
average_std = {}
poll_std = {}
results = BeautifulSoup(requests.get('http://elections.huffingtonpost.com/2014/results').text, 'lxml')
for rc in unpacked_new_info:
    # Use a separate name so the `candidates` DataFrame from Problem 1(d) is not overwritten
    labels = set([c['label'].strip() for c in rc['question']['responses']])
    p = 0
    slug = rc['slug']
    for r in results.find_all('div', attrs = {'class': 'race closed',
                                              'data-state': rc['question']['slug'].split('-')[1]}):
        if set([c.contents[0].strip() for c in r.find_all('span', class_ = 'nowrap')]) & labels:
            # Look for the table that contains the result for our candidates
            for prf in r.find_all('tr'):
                if prf.find('td', class_ = 'party').span['class'][0] == 'gop':
                    # Extract the result if the Party is Republican
                    p = float(prf.find('td', class_ = 'pct').text.strip('%'))/100
                    break
            break
    # Calculate the theoretical standard deviation of the winning proportion
    average_std[rc['slug']] = ((p * (1 - p)) ** (1 / 2) * (1 / polls[rc['slug']].observations ** (1 / 2)).sum()
                               / len(polls[rc['slug']]))
    for r in rc['question']['responses']:
        if r['party'] == 'Republican':
            # Calculate the sample standard deviation of the winning proportion
            poll_std[slug] = (polls[slug][r['label']] / 100.).std()
            break
average_std

#### Problem 2(c) 

Plot observed versus theoretical (average of the theoretical SDs) with the area of the point proportional to number of polls. How do these compare?

In [None]:
races = sorted(list(average_std))
plt.scatter([average_std[rc] for rc in races], [poll_std[rc] for rc in races], s = [len(polls[rc]) for rc in races])
plt.xlabel('Theoretical Std')
plt.ylabel('Actual Std')
plt.show()

**My answer here:** The theoretical and observed standard deviations do not match, and the observed values tend to be higher. One likely reason is that the data are non-stationary, i.e. voter opinion changed over the polling period.
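The non-stationarity argument can be illustrated with a small simulation. This is only a sketch under made-up assumptions (200 polls of 800 respondents each, and a true proportion that either stays at 0.48 or drifts linearly from 0.44 to 0.52); none of these numbers come from the poll data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_polls, n_resp = 200, 800   # hypothetical number of polls and respondents per poll

# Stationary case: every poll samples the same true proportion
p_fixed = np.full(n_polls, 0.48)
# Drifting case: the true proportion moves linearly from 0.44 to 0.52
p_drift = np.linspace(0.44, 0.52, n_polls)

def observed_sd(p_true):
    # Each poll result is a binomial count divided by the sample size
    return (rng.binomial(n_resp, p_true) / n_resp).std(ddof=1)

# Binomial standard deviation for a single poll at p = 0.48
theoretical_sd = np.sqrt(0.48 * 0.52 / n_resp)
sd_fixed = observed_sd(p_fixed)
sd_drift = observed_sd(p_drift)
print(theoretical_sd, sd_fixed, sd_drift)
```

In the stationary case the observed spread stays close to the binomial value, while the drift adds its own variance on top of the sampling noise, which is the pattern seen in the scatter plot.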

#### Problem 2(d)

Repeat Problem 2(c) but include only the most recent polls from the last two months. Do they match better or worse or the same? Can we just trust the theoretical values?

In [None]:
recent_poll_std = {}
for rc in unpacked_new_info:
    slug = rc['slug']
    for r in rc['question']['responses']:
        if r['party'] == 'Republican':
            # Calculate the std for polls from the last two months if the Party is Republican
            recent_poll_std[slug] = (polls[slug][pd.to_datetime(polls[slug].end_date)
                                                 > pd.Timestamp(2014, 9, 3)][r['label']] / 100).std()
            break

plt.scatter([average_std[rc] for rc in races], [recent_poll_std[rc] for rc in races],
            # Increase the dot size of those races with a larger number of polls
            s = [len(polls[rc][pd.to_datetime(polls[rc].end_date) > pd.Timestamp(2014, 9, 3)]) for rc in races])
plt.xlabel('Theoretical Std')
plt.ylabel('Actual Std')
plt.show()

**My answer here:** They match better, but not by much. The theoretical standard deviation seems to break down because the proportion $p$ that each pollster is effectively estimating differs between pollsters.

#### Problem 2(e)

Create a scatter plot with each point representing one state. Are there one or more races that are outliers, in that they have much larger variability than expected? Explore the original poll data and explain the discrepancy.

In [None]:
race_states = {rc['slug']: rc['question']['slug'].split('-')[1] for rc in unpacked_new_info}
plt.scatter([average_std[rc] for rc in races], [poll_std[rc] for rc in races], s = 0)
plt.xlabel('Theoretical Std')
plt.ylabel('Actual Std')
for rc in races:
    # Label each (theoretical, observed) point with its state code
    plt.text(average_std[rc], poll_std[rc], race_states[rc])
plt.show()

**My answer here:** Some states, such as Texas, have much larger variability than expected.

#### Problem 2(f)

Construct confidence intervals for the difference in each race. Use either theoretical or data driven estimates of the standard error depending on our answer to this question. Use the results in Problem 2(e), to justify our choice.


In [None]:
confidence = defaultdict(list)
for rc in unpacked_new_info:
    slug = rc['slug']
    confidence['race'].append(slug)
    for r in rc['question']['responses']:
        if r['party'] == 'Republican':
            gop = r['label']
        elif r['party'] == 'Democrat':
            dem = r['label']
    confidence['diff_mean'].append((polls[slug][gop] - polls[slug][dem]).mean() / 100)
    confidence['diff_std'].append(((polls[slug][gop] - polls[slug][dem]) / 100).std())
    confidence['n_polls'].append(len(polls[slug]))
confidence = pd.DataFrame(confidence)
confidence['ste'] = confidence.diff_std / confidence.n_polls**(1 / 2)
# Two-sided critical t value for the 99% interval requested in Problem 2,
# stored as a column so it stays aligned with each race after sorting
confidence['t_crit'] = stats.t.ppf(.995, confidence.n_polls - 1)
confidence['lower_bound'] = confidence.diff_mean - confidence.ste * confidence.t_crit
confidence['upper_bound'] = confidence.diff_mean + confidence.ste * confidence.t_crit
confidence.sort_values('diff_mean', inplace = True)
confidence.reset_index(drop = True, inplace = True)

plt.figure(figsize = (17, 8))
plt.xticks(confidence.index, confidence.race, rotation = 90)
plt.xlim(-1, confidence.shape[0])
plt.errorbar(confidence.index, confidence.diff_mean, fmt = 'o', ms = 4,
             yerr = confidence.ste * confidence.t_crit)
plt.axhline(0, linewidth = .5, color = 'black')
plt.ylabel('Winning Margin of GOP')
plt.show()
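For a single race, the data-driven interval can be written as a small helper. The following is a sketch, not the notebook's own code: the function name `t_confidence_interval` and the sample margins are hypothetical, and the default level of 0.99 follows the problem statement.

```python
import numpy as np
from scipy import stats

def t_confidence_interval(values, level=0.99):
    """Two-sided t interval for the mean of `values` at the given level."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]                    # drop missing poll entries
    n = x.size
    ste = x.std(ddof=1) / np.sqrt(n)       # standard error of the mean
    t_crit = stats.t.ppf(0.5 + level / 2, df=n - 1)
    return x.mean() - t_crit * ste, x.mean() + t_crit * ste

# Hypothetical GOP-minus-Democrat margins (as proportions) from five polls
margins = [0.03, 0.05, -0.01, 0.04, 0.02]
lower, upper = t_confidence_interval(margins)
lower95, upper95 = t_confidence_interval(margins, level=0.95)
print(lower, upper)
```

A race is called for one side at the chosen level only when the whole interval lies on one side of zero; the 95% interval is narrower than the 99% one for the same data.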

## Problem 3: Prediction and Posterior Probabilities

Perform a Bayesian analysis to predict the probability of Republicans winning in each state, then provide a posterior distribution of the number of Republicans in the Senate.

#### Problem 3(a)

First, we define a Bayesian model for each race. The prior for the difference $\theta$ between Republicans and Democrats will be $N(\mu,\tau^2)$. Say before seeing poll data we have no idea who is going to win. What should $\mu$ be? How about $\tau$: should it be large or small?

**My answer here:** $\mu = 0$. $\tau$ should be small, as the empirical variance of $\theta$ is not large and we also do not want to set too extreme a prior.

#### Problem 3(b)

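What is the distribution of $d$ conditioned on $\theta$? What is the posterior distribution of $\theta \mid d$? Here $d$ is the observed average difference across the $M$ polls of a race, and $\sigma$ is the standard deviation of a single poll's difference.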

**Hint**: Use normal approximation. 

**My answer here:**

$d \mid \theta \sim N(\theta, \sigma^2/M)$.

$\theta \mid d \sim N\left(B\mu + (1-B)d, (1-B)\sigma^2/M\right)$, where $B = \frac{1/\tau^2}{M/\sigma^2 + 1/\tau^2}$.
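The posterior above translates directly into a few lines of code. This is a minimal sketch; the function name `normal_posterior`, the default `tau=0.1`, and the example inputs are hypothetical and not taken from the poll data.

```python
import numpy as np

def normal_posterior(d, sigma, M, mu=0.0, tau=0.1):
    """Posterior mean and sd of theta given the poll average d.

    Assumes d | theta ~ N(theta, sigma^2 / M) and theta ~ N(mu, tau^2).
    """
    # Shrinkage weight on the prior mean
    B = (1 / tau**2) / (M / sigma**2 + 1 / tau**2)
    post_mean = B * mu + (1 - B) * d
    post_sd = np.sqrt((1 - B) * sigma**2 / M)
    return post_mean, post_sd

# Hypothetical race: average margin 0.04, poll-to-poll sd 0.05, 25 polls
post_mean, post_sd = normal_posterior(d=0.04, sigma=0.05, M=25)
print(post_mean, post_sd)
```

With these inputs the data precision $M/\sigma^2$ is about 100 times the prior precision $1/\tau^2$, so $B$ is close to 0 and the posterior mean barely moves from $d$.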

#### Problem 3(c)

The prior represents what we think beforehand. We do not know who is expected to win, so we assume $\mu=0$. For this problem estimate $\tau$ using the observed differences across states (Hint: $\tau$ represents the standard deviation of a typical difference). Compute the posterior mean for each state and plot it against the original average. Is there much change? Why or why not?

In [None]:
diffs = []
for rc in unpacked_new_info:
    slug = rc['slug']
    for r in rc['question']['responses']:
        if r['party'] == 'Republican':
            gop = r['label']
        elif r['party'] == 'Democrat':
            dem = r['label']
    diffs += list((polls[slug][gop] - polls[slug][dem]) / 100)
tau0 = np.nanstd(diffs)
# Calculate the posterior mean
confidence['post_diff_mean'] = (confidence.diff_mean * confidence.n_polls / confidence.diff_std ** 2
                                / (1 / tau0 ** 2 + confidence.n_polls / confidence.diff_std ** 2))
# Calculate the posterior std
confidence['post_diff_std'] = (1 / tau0 ** 2 + confidence.n_polls / confidence.diff_std ** 2) ** (-1 / 2)
sns.regplot(x = confidence.diff_mean, y = confidence.post_diff_mean)
plt.show()

**My answer here:** There is not much change. In most states the data precision $M/\sigma^2$ is much larger than the prior precision $1/\tau^2$, so the weight $B$ on the prior is close to 0 and the posterior mean stays close to the poll average.

#### Problem 3(d)

For each state, report a probability of Republicans winning. How does our answer here compare to the other aggregators?

In [None]:
confidence['gop_win_proba'] = stats.norm.sf(0, confidence.post_diff_mean, confidence.post_diff_std)
confidence

**My answer here:** My answer is quite in line with other aggregators.
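To make the win-probability step concrete: under a normal posterior, the probability that the margin is positive is the upper-tail area above 0. The sketch below uses a hypothetical posterior (mean 0.02, standard deviation 0.01) rather than a value from the table.

```python
from scipy import stats

# Hypothetical posterior for one race: mean margin 0.02, sd 0.01
post_mean, post_sd = 0.02, 0.01

# P(theta > 0) under N(post_mean, post_sd^2), via the survival function
p_win = stats.norm.sf(0, loc=post_mean, scale=post_sd)

# Equivalent form: standard normal CDF evaluated at the z-score
p_win_alt = stats.norm.cdf(post_mean / post_sd)
print(p_win, p_win_alt)
```

A posterior mean two standard deviations above zero already gives a win probability near 0.98, so small posterior standard deviations produce very confident calls.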

#### Problem 3(e)

Use the posterior distributions in a Monte Carlo simulation to generate election results. In each simulation compute the total number of seats the Republicans control. Show a histogram of these results.

In [None]:
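gop_seats = []
for i in range(100000):
    # 30 is the number of Republican seats not up for election in 2014;
    # each race is an independent Bernoulli draw with its win probability
    gop_seats.append(30 + stats.bernoulli.rvs(confidence.gop_win_proba, random_state = i).sum())
sns.histplot(gop_seats, discrete = True)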
plt.xlabel('Number of Republican Seats')
plt.ylabel('Frequency')
plt.title('Monte Carlo simulation of Number of Seats in Republican Control')
plt.show()
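The same simulation can be vectorized with NumPy, which avoids calling the random generator once per iteration. This is a sketch with hypothetical win probabilities for five races; `safe_seats = 30` mirrors the constant used in the cell above, and races are assumed independent, as they are there.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical win probabilities for five contested races
win_proba = np.array([0.95, 0.60, 0.50, 0.20, 0.99])
safe_seats = 30      # seats assumed not up for election, as in the cell above
n_sims = 100_000

# One row per simulated election, one column per race:
# a race is won when its uniform draw falls below its win probability
wins = rng.random((n_sims, win_proba.size)) < win_proba
seat_totals = safe_seats + wins.sum(axis=1)

# Empirical distribution of the seat totals
values, counts = np.unique(seat_totals, return_counts=True)
seat_dist = dict(zip(values.tolist(), (counts / n_sims).tolist()))
print(seat_dist)
```

Because the races are drawn independently, this understates the spread if polling errors are correlated across states.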

---