# Week 6-1: Election Prediction through simulation

This is the first of two classes on election prediction. We'll be using simulation throughout to build our model.

All data is downloaded from [Huffington Post Pollster](http://elections.huffingtonpost.com/pollster#historical-charts).

Further references for your enjoyment:

- [The Real Story of 2016](https://fivethirtyeight.com/features/the-real-story-of-2016/) - FiveThirtyEight
- Buzzfeed's post-election [forecast grades](https://www.buzzfeednews.com/article/jsvine/2016-election-forecast-grades)
- [Putting the Polling Miss of the 2016 Election in Perspective](https://www.nytimes.com/interactive/2016/11/13/upshot/putting-the-polling-miss-of-2016-in-perspective.html) - The Upshot
- [After 2016, Can We Ever Trust the Polls Again?](https://newrepublic.com/article/139158/2016-can-ever-trust-polls-again) - The New Republic 

And finally, the single biggest reason that the simple election prediction model in this file misses so badly (predicting Clinton's chances in the high 90s): it does not take into account the [correlations between polling errors](https://www.quantamagazine.org/why-nate-silver-sam-wang-and-everyone-else-were-wrong-part-2-20161111/) in different states. If we fix this one factor, even our simple model will give Trump substantially higher chances.
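
To make the effect of correlated errors concrete, here is a minimal sketch on a made-up map. Everything in it is an assumption for illustration: 11 hypothetical states with one vote each, a 3-point lead for the favorite in every state, and a 3-point standard deviation for each state's polling error. The only thing that changes between the two runs is whether half of that error variance is shared across all states.

```python
import numpy as np

rng = np.random.default_rng(0)      # seed value is arbitrary
n_sims, n_states = 20_000, 11       # hypothetical map: 11 states, one vote each
lead, state_sd = 3.0, 3.0           # favorite leads by 3 points everywhere (made up)

# Independent errors: every state gets its own polling error.
indep = lead + rng.normal(0, state_sd, size=(n_sims, n_states))

# Correlated errors: half of each state's error variance is a single
# national shift shared by all states, so the per-state spread is unchanged.
part_sd = state_sd / np.sqrt(2)
shared = rng.normal(0, part_sd, size=(n_sims, 1))
own = rng.normal(0, part_sd, size=(n_sims, n_states))
corr = lead + shared + own

def underdog_win_share(margins):
    # The underdog takes a state when the favorite's margin is negative,
    # and wins the election with at least 6 of the 11 states.
    return np.mean((margins < 0).sum(axis=1) >= 6)

p_indep = underdog_win_share(indep)
p_corr = underdog_win_share(corr)
print(f"independent errors: {p_indep:.3f}")
print(f"correlated errors:  {p_corr:.3f}")
```

With independent errors the underdog needs many separate upsets at once, which is very unlikely. With a shared error, one bad national miss flips many states together, so the underdog's chances come out much higher.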

## Part 1: Simulating one poll

Here we'll produce simulated election outcomes from a single poll. We are uncritically taking the poll results as an unbiased indicator of the election result. This assumes that the people who are polled ("likely voters") are a good representation of the people who actually vote. It is possible to adjust for these sorts of factors later, but let's begin with the basics.

In [None]:
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Set a random seed so this whole notebook becomes deterministic (for teaching purposes)


In [None]:
# Load national polling data. It's a TSV file, so we have to tell read_csv it's separated by tabs


In [None]:
# Keep only polls of "likely voters" (as opposed to registered voters, or Republicans/Democrats)


In [1]:
# We need a margin of error to do our simulation, so drop any rows that don't have it


There are a lot of polls here! Let's pick just one, near the end of the polling period (just before the election), and look at the outcomes it implies.

We will pick a poll which showed a close race, because it better demonstrates how the margin of error works.

The poll results and the margin of error define a probability distribution of "true" survey results -- that is, the result that the pollster would get if they could ask every single "likely voter" in the country. This distribution is a "normal" distribution.

Now we can take samples from a normal distribution with this mean and standard deviation, to simulate what the underlying "true" voting pattern would be. For example,
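
One way to draw such a sample, as a minimal sketch. It uses the 46-44 split and the 2.8-point margin of error quoted later in this section for the chosen poll. Using the margin of error directly as the standard deviation of the Clinton-Trump difference is an assumption of this sketch, and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed value is arbitrary

clinton, trump = 46.0, 44.0      # poll shares quoted in the text
margin_of_error = 2.8            # margin of error quoted in the text
lead = clinton - trump           # mean of the Clinton-Trump difference

# Assumption: the margin of error is used as the standard deviation.
one_draw = rng.normal(loc=lead, scale=margin_of_error)
print(one_draw)
```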

To interpret this number, recall that we're simulating the Clinton-Trump difference. So positive means it goes for Clinton, who our poll says is ahead 46-44. Given this, we would expect more of the simulation results to go for Clinton. Let's make 1000 and see what happens.

Sure enough, the center of this distribution is at 2, the lead given by the poll. But many values are negative as well, meaning that Clinton doesn't always win (again, assuming the actual voters split 46-44, as this poll suggests).

Let's see how often Trump wins, according to this model
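
A minimal sketch of that count, assuming a lead of 2 points and a standard deviation equal to the 2.8-point margin of error (the second is an assumption of this sketch). A negative Clinton-Trump difference counts as a Trump win. The normal CDF is included as a cross-check on the simulated share.

```python
import math
import numpy as np

rng = np.random.default_rng(0)   # seed value is arbitrary
lead, sd = 2.0, 2.8              # Clinton-Trump lead and assumed standard deviation

samples = rng.normal(loc=lead, scale=sd, size=1000)
trump_share = np.mean(samples < 0)   # fraction of draws where Trump is ahead

# Exact probability that a normal(lead, sd) draw is below zero.
exact = 0.5 * (1 + math.erf((0 - lead) / (sd * math.sqrt(2))))

print(f"simulated: {trump_share:.3f}")
print(f"exact:     {exact:.3f}")
```

The exact value under these assumptions is about 0.238, and a 1000-draw simulation typically lands within a few percentage points of it.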

So about 24% according to this model. This makes sense because the margin of error (2.8%) is pretty wide relative to Clinton's lead in the poll (2%).

If we run the simulation again, we'll get slightly different results.

The more samples we take, the less variation we'll see in this number. To demonstrate this, let's plot a histogram of the results for various numbers of samples.
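
The sketch below illustrates the same point numerically rather than with a plot. It repeats the simulation 200 times for each sample size and prints how much the estimated Trump win share varies between runs. The lead of 2, the standard deviation of 2.8, the sample sizes and the number of repeats are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed value is arbitrary
lead, sd = 2.0, 2.8              # assumed mean and standard deviation
n_repeats = 200                  # how many times each simulation is rerun

spread = {}
for n_samples in (100, 1_000, 10_000):
    draws = rng.normal(lead, sd, size=(n_repeats, n_samples))
    trump_share = (draws < 0).mean(axis=1)     # one estimate per rerun
    spread[n_samples] = trump_share.std()      # run-to-run variation
    print(n_samples, round(spread[n_samples], 4))
```

Each `trump_share` array could be passed to `plt.hist` to draw the corresponding histogram. The spread should shrink roughly with the square root of the number of samples.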

We can get a reliable win percentage by counting the wins in a large sample:

## Part 2: The Electoral College

Part 1 shows how to interpret the uncertainty in a single poll -- at least the uncertainty captured by the margin of error. There are two major directions to go from here:

1) In a real election prediction model, we would combine all the polls according to a weighted average of poll reliability. What "reliability" really means is how well the poll matched (predicted) previous election results. We can figure out the optimal combination of poll weights using, for example, linear regression.

2) The US uses an electoral college system, where each state contributes a fixed number of votes (out of a total of 538). We definitely need to simulate that to get anything like a reasonable election prediction.

So for the next step, let's see how to combine polls in the electoral college.

Our first task will be to pick out one poll in each state. We'll use the last dated "Likely Voter" poll.
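
The selection rule can be sketched with pandas on a small made-up table. The column names (`Pollster`, `End Date`, `Population`, `Clinton`, `Trump`) and every value below are assumptions for illustration, not the actual layout of the downloaded files.

```python
import pandas as pd

# Hypothetical polls for one state; names and numbers are made up.
polls = pd.DataFrame(
    {
        "Pollster": ["A", "B", "C", "D"],
        "End Date": ["2016-10-20", "2016-11-05", "2016-11-06", "2016-11-01"],
        "Population": [
            "Likely Voters",
            "Registered Voters",
            "Likely Voters",
            "Likely Voters",
        ],
        "Clinton": [47.0, 45.0, 46.0, 48.0],
        "Trump": [43.0, 44.0, 45.0, 42.0],
    }
)
polls["End Date"] = pd.to_datetime(polls["End Date"])

# Keep likely-voter polls, then take the one with the latest end date.
likely = polls[polls["Population"] == "Likely Voters"]
latest = likely.sort_values("End Date").iloc[-1]
print(latest["Pollster"], latest["End Date"].date())
```

On this toy table the rule selects pollster C, whose poll ended on 2016-11-06.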

In [None]:
# Load a CSV of electoral college votes for each state. 
# Ref: https://www.archives.gov/federal-register/electoral-college/allocation.html


In [None]:
# We'll use a little Pandas trick to make merging in the poll data easier: 
# set the index to the abbreviation

# And add the columns we'll need: Trump, Clinton, margin_of_error, all initially blank


In [2]:
# Not all polls have reported margins of error, but we can figure it out if we know the number of people surveyed.
# This function calculates the 95% margin of error, using the classic formula. 
# Ref: https://onlinecourses.science.psu.edu/stat100/node/56/
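
A minimal sketch of such a function, assuming the standard formula for a proportion, 1.96 × sqrt(p(1 − p)/n), with the conservative choice p = 0.5. The function name and the `share` parameter are hypothetical, and the rounded shortcut 1/sqrt(n) is a common alternative; which variant the original cell used is not shown here.

```python
import math

def margin_of_error_95(n_respondents, share=0.5):
    """Approximate 95% margin of error, in percentage points.

    share=0.5 is the conservative (widest) case.
    """
    standard_error = math.sqrt(share * (1 - share) / n_respondents)
    return 100 * 1.96 * standard_error

print(round(margin_of_error_95(1000), 1))  # 3.1
```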


In [None]:
# Now we'll load polls for each state and pick one poll:
#  - it must be likely voters
#  - it should have the latest possible end date (we're predicting right before the election)
#  - we don't have polling data for DC, so we'll use final results and zero MOE
for abbr in states.abbr:


In [None]:
states

Now we simulate an election by drawing a sample from each state election independently, then tallying the electoral college votes. Instead of looking at the distribution of the Clinton-Trump vote difference, we'll just look at the distribution of EC votes for Clinton.
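
The procedure can be sketched on a toy table before filling in the real function. The three "states", their numbers, the `votes` column name and the function name `simulate_election_sketch` are all hypothetical. As in Part 1, using the margin of error as the standard deviation of the Clinton-Trump difference is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical three-state table; every number is made up.
toy_states = pd.DataFrame(
    {
        "votes": [10, 6, 4],                  # electoral votes (assumed column name)
        "Clinton": [48.0, 44.0, 46.0],
        "Trump": [44.0, 47.0, 46.0],
        "margin_of_error": [3.0, 3.5, 4.0],
    },
    index=["AA", "BB", "CC"],
)

def simulate_election_sketch(states, n_times, rng):
    lead = (states["Clinton"] - states["Trump"]).to_numpy()
    sd = states["margin_of_error"].to_numpy()
    votes = states["votes"].to_numpy()
    # One row per simulated election, one column per state,
    # with every state drawn independently of the others.
    draws = rng.normal(loc=lead, scale=sd, size=(n_times, len(states)))
    clinton_wins = (draws > 0).astype(int)
    # Sum the electoral votes of the states Clinton carries in each run.
    return clinton_wins @ votes

rng = np.random.default_rng(0)   # seed value is arbitrary
clinton_ec_votes = simulate_election_sketch(toy_states, 1000, rng)
print(clinton_ec_votes[:10])
```

Drawing all states in one call keeps the simulation vectorized. A row with a margin of error of zero just returns its lead unchanged, so that state always goes to whoever is ahead.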

In [None]:
def simulate_election(n_times):
    clinton_ec_votes = np.zeros(n_times)

    return clinton_ec_votes

In [None]:
# Run 10 simulated elections and look at the results

In [None]:
# Run many, many simulated elections and plot histogram of results


To get a Clinton win probability out of this, we can calculate the percentage where she receives 270 or more.
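
As a minimal sketch of that calculation, the array below stands in for the output of the simulation. Its five values are made up so the arithmetic is easy to check by hand.

```python
import numpy as np

# Hypothetical EC totals for Clinton from five simulated elections.
clinton_ec_votes = np.array([310, 265, 270, 290, 240])

# Fraction of simulations in which Clinton reaches 270 or more.
clinton_win_probability = np.mean(clinton_ec_votes >= 270)
print(clinton_win_probability)  # 0.6
```

Three of the five totals are at least 270, so the estimated win probability is 0.6.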