# Predicting Elections

## PART 1: A Simple Election Model

Let us import a table of probabilities that PredictWise made on October 2, 2012 for the US presidential election. PredictWise aggregated polling data and, for each state, estimated the probability that Obama or Romney would win. We import the required libraries first.

In [26]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

In [27]:
import pandas as pd # imports a library to handle data as dataframe

In [31]:
!wget https://raw.githubusercontent.com/bitanb1999/HalfDayHacking/main/data/predictwise.csv

--2020-12-26 06:15:32--  https://raw.githubusercontent.com/bitanb1999/HalfDayHacking/main/data/predictwise.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1365 (1.3K) [text/plain]
Saving to: ‘predictwise.csv.7’


2020-12-26 06:15:32 (56.1 MB/s) - ‘predictwise.csv.7’ saved [1365/1365]



In [36]:
predictwise = pd.read_csv('predictwise.csv.7').set_index('States')
predictwise.head(10)

| States | Obama | Romney | Votes |
|---|---|---|---|
| Alabama | 0.0 | 1.0 | 9 |
| Alaska | 0.0 | 1.0 | 3 |
| Arizona | 0.062 | 0.938 | 11 |
| Arkansas | 0.0 | 1.0 | 6 |
| California | 1.0 | 0.0 | 55 |
| Colorado | 0.807 | 0.193 | 9 |
| Connecticut | 1.0 | 0.0 | 7 |
| Delaware | 1.0 | 0.0 | 3 |
| District of Columbia | 1.0 | 0.0 | 3 |
| Florida | 0.72 | 0.28 | 29 |


Each row gives the probability, as predicted by PredictWise, that Obama or Romney would win a state. The `Votes` column lists the number of electoral college votes for that state.

In the case of tossed coins, even though our model says that the probability of heads is 0.5, there are sequences of flips in which more or fewer than half the flips come up heads. Similarly, here, a probability of 0.938 for Romney winning Arizona means that if we could somehow replay the election 1000 times, Romney would win Arizona in about 938 of those replays **on average**. Some sets of 1000 replays would have Romney winning more often, and some less. We cannot rerun an election in real life, but we can run these **simulated** universes on a computer.
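To make the idea of replications concrete, here is a minimal sketch. The seed, the number of replications and the variable names are our own choices for illustration; only the probability 0.938 comes from the table.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

p_romney_arizona = 0.938   # PredictWise probability from the table above
n_elections = 1000         # simulated elections per replication
n_replications = 5         # how many times we repeat the whole experiment

# Each row is one replication of 1000 simulated Arizona elections.
draws = rng.uniform(size=(n_replications, n_elections))
romney_wins = (draws < p_romney_arizona).sum(axis=1)

print(romney_wins)          # five counts scattered around 938
print(romney_wins.mean())   # their average
```

The individual counts differ from one replication to the next, but they cluster around 938.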

How did these probabilities come about? It does not matter. We might have fit a deep neural network or a Bayesian model, or just asked people to bet (that's what PredictWise did). It's all just a model.

### Single simulation for a particular state

Let us consider the sixth state, `Colorado`, and do one random simulation for this state. We use the function `np.random.uniform` to draw one sample from a uniform distribution on the interval \[a,b), in which every value is equally likely. The default values of a and b are 0 and 1, giving the interval \[0,1). Documentation for the function is [here](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.uniform.html).

#### Uniform Distribution (in numpy)

`np.random.uniform` gives you a random number between 0 and 1, uniformly. In other words, the number is equally likely to be between 0 and 0.1, 0.1 and 0.2, and so on. This is a very intuitive idea, but it is formalized by the notion of the **Uniform Distribution**.

We then say:

$$X \sim Uniform([0,1)),$$

which is to be read as **X has distribution Uniform([0,1))**.
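A quick way to check the "equally likely" claim is to draw many samples and count how many land in each tenth of the interval. This is only a sketch: the seed and the sample size are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.uniform(size=100_000)   # draws from Uniform([0,1))

# Count how many draws land in each of ten equal-width bins.
counts, edges = np.histogram(samples, bins=10, range=(0.0, 1.0))
fractions = counts / samples.size

print(np.round(fractions, 3))   # each entry should be close to 0.1
```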

In [37]:
print(predictwise.loc['Colorado'])
sim = np.random.uniform()
print(sim)

Obama     0.807
Romney    0.193
Votes     9.000
Name: Colorado, dtype: float64
0.5316207922093074


Obama wins Colorado in this particular simulation if his win probability is higher than the random number drawn uniformly between 0 and 1. With an Obama probability of 0.807 and a draw of about 0.53, Obama wins in this particular case.

In [38]:
obama_win = (predictwise.loc['Colorado'].Obama > sim)*1
print(obama_win)

1


Run this cell many times:

In [45]:
obama_win = (predictwise.loc['Colorado'].Obama > np.random.uniform())*1
print(obama_win)

0


Drawing from `np.random.uniform` and comparing the draw to a probability threshold to get a 0 or 1 is nothing but a **Bernoulli Random Variable** for the state of Colorado.

A Bernoulli Random Variable is the simplest kind of random variable. It can take on two values,
1 and 0. It takes on a 1 if an experiment with probability *p* resulted in success and a 0 otherwise. A coin toss is another example of a Bernoulli Random Variable.  

Here is what we gain by assigning the state to Obama whenever the random number **drawn** from the Uniform distribution is less than the PredictWise probability of Obama winning (which is the Bernoulli parameter). Let's say we run a large number of simulations with $p_{Obama}=0.8$. If we draw lots of numbers uniformly between 0 and 1, then about 80% of the time the number drawn will be below 0.8. Thus we end up with many simulations in which Obama either wins or loses, but wins about 80% of the time.
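The thresholding trick can be wrapped in a small helper. This is a sketch under our own naming: `bernoulli` is a hypothetical function, not part of numpy, and the seed is arbitrary.

```python
import numpy as np

def bernoulli(p, size, rng):
    """Return `size` draws that are 1 with probability p and 0 otherwise."""
    # A uniform draw falls below p with probability p.
    return (rng.uniform(size=size) < p).astype(int)

rng = np.random.default_rng(2)
flips = bernoulli(0.8, 100_000, rng)

print(flips[:10])     # a run of zeros and ones
print(flips.mean())   # fraction of ones, close to 0.8
```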

### Multiple simulations for a particular State

Now let's make 1000 simulations for Colorado. This is like making 1000 coin tosses with a biased coin. We expect Obama to win in about 80% of these.

In [46]:
print(predictwise.loc['Colorado'])
sims = np.random.uniform(size=1000)
print(sims[:100])

Obama     0.807
Romney    0.193
Votes     9.000
Name: Colorado, dtype: float64
[0.46986387 0.77994041 0.04546838 0.52294812 0.508379   0.67593763
 0.53126381 0.01721429 0.77658355 0.4124641  0.52100117 0.83919358
 0.23379954 0.19984961 0.34898706 0.06544632 0.58864528 0.34661758
 0.00475483 0.90357386 0.24223671 0.55775565 0.99303856 0.61688829
 0.98636743 0.23131516 0.39484712 0.33865405 0.01695222 0.09904553
 0.78958856 0.7107905  0.39878765 0.27924475 0.6432539  0.16112811
 0.81025238 0.39146455 0.53992245 0.52089473 0.32244398 0.60598036
 0.33291724 0.09179842 0.96285332 0.65880944 0.20623071 0.01935672
 0.32707239 0.25880251 0.4973693  0.19427831 0.33468734 0.96460956
 0.31231053 0.89688888 0.63705607 0.54255178 0.04172197 0.04122353
 0.36436562 0.81304641 0.21273656 0.30165728 0.60346704 0.44252593
 0.5367778  0.40049229 0.75180245 0.86552224 0.04037653 0.59802713
 0.40419171 0.87559356 0.79751564 0.05592303 0.72931091 0.78638919
 0.19122794 0.97752462 0.51995272 0.63569028 0.952

With an Obama probability of about 0.8, we can pick out the simulations in which Obama wins Colorado by checking in how many of them the random number drawn is less than that probability.

We are simply making use of numpy broadcasting semantics here:

In [47]:
wins = sims < predictwise.loc['Colorado'].Obama
wins[:100]

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True, False,  True,  True,  True,  True,  True,  True,
        True, False,  True,  True, False,  True, False,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
       False,  True,  True,  True,  True,  True,  True,  True, False,
        True,  True,  True,  True,  True,  True,  True,  True, False,
        True, False,  True,  True,  True,  True,  True, False,  True,
        True,  True,  True,  True,  True,  True, False,  True,  True,
        True, False,  True,  True,  True,  True,  True, False,  True,
        True, False,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
       False])

This is a boolean numpy array. Summing coerces `True` to 1 and `False` to 0, and gives us the number of simulations in which Obama wins.

In [48]:
np.sum(wins)

842

Obama wins in 842 of the 1000 simulations, or roughly 84%, in Colorado.
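How far can the winning fraction stray from 0.807 when we only run 1000 simulations? A rough answer is the binomial standard error $\sqrt{p(1-p)/n}$. The sketch below computes it and compares it with the spread seen when the 1000-simulation experiment is repeated; the seed and the number of repeats are our own illustrative choices.

```python
import numpy as np

p = 0.807        # PredictWise probability for Obama in Colorado
n_sims = 1000

# Standard deviation of the winning fraction over n_sims independent draws.
standard_error = np.sqrt(p * (1 - p) / n_sims)
print(round(standard_error, 4))

# Empirical check: repeat the 1000-simulation experiment 2000 times.
rng = np.random.default_rng(5)
fractions = (rng.uniform(size=(2000, n_sims)) < p).mean(axis=1)
print(round(fractions.std(), 4))   # should be close to the value above
```

A standard error of a bit more than one percentage point means that fractions a few points away from 80.7% are not surprising in a single run.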

## Simulating the election model

To do this, we will assume that the outcome in each state is the result of an independent coin flip whose probability of coming up Obama is given by the PredictWise state-wise win probabilities. Let's write a function `simulate_election` that uses this **predictive model** to simulate the outcome of the election given a table of probabilities.

But first, what do we need to do to achieve this? We need to take a random draw for each state and compare it to the PredictWise probability for that state. Then we need to do this many times. Then we need to multiply each win by the electoral votes of that state. And then sum over the states.

### Simulations on all states
Let's do it for all the states now, with 1000 simulations per state. We'll create 51 rows of 1000 random numbers (1000 columns) between 0 and 1, and see how many votes Obama gathers in each simulation.

In [50]:
predictwise = pd.read_csv('predictwise.csv.7').set_index('States')
predictwise.head()

| States | Obama | Romney | Votes |
|---|---|---|---|
| Alabama | 0.0 | 1.0 | 9 |
| Alaska | 0.0 | 1.0 | 3 |
| Arizona | 0.062 | 0.938 | 11 |
| Arkansas | 0.0 | 1.0 | 6 |
| California | 1.0 | 0.0 | 55 |


In [51]:
sims = np.random.uniform(size=(51, 1000))
print(sims.shape)
sims

(51, 1000)


array([[0.55521927, 0.90809758, 0.71120403, ..., 0.67960345, 0.4535865 ,
        0.77268543],
       [0.5777497 , 0.1494576 , 0.41051479, ..., 0.38620356, 0.29804423,
        0.28936349],
       [0.75818912, 0.32085217, 0.20763672, ..., 0.49658537, 0.63629118,
        0.15811647],
       ...,
       [0.33363822, 0.21397908, 0.62524781, ..., 0.61487478, 0.35965417,
        0.13485122],
       [0.94280304, 0.15912167, 0.00512678, ..., 0.42113937, 0.64033516,
        0.51151344],
       [0.16894432, 0.15999267, 0.50982789, ..., 0.72189652, 0.00700977,
        0.33322739]])

Now we just need to compare each row of 1000 random numbers with the probability for that state to find how many of the simulations Obama wins:

In [52]:
sims < predictwise.Obama.values

ValueError: ignored

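Doing this raises an error because the shapes of **sims**, `(51, 1000)`, and **predictwise.Obama.values**, `(51,)`, are not compatible for broadcasting: numpy aligns shapes starting from the last dimension, and 1000 does not match 51. We can use `numpy.reshape` to change the shape of our array.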
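The same failure can be reproduced on a toy array, which makes the rule easier to see. In this sketch the small arrays are hypothetical stand-ins for `sims` and the probability vector.

```python
import numpy as np

draws = np.zeros((3, 4))                      # stand-in for sims: 3 states, 4 simulations
per_column = np.array([0.1, 0.2, 0.3, 0.4])   # shape (4,): matches the last axis
per_row = np.array([0.2, 0.5, 0.9])           # shape (3,): does not match the last axis

ok = draws < per_column                       # broadcasts fine, result has shape (3, 4)
print(ok.shape)

try:
    draws < per_row                           # last dimensions are 4 and 3
    failed = False
except ValueError as err:
    failed = True
    print("ValueError:", err)
```

A one-dimensional array is lined up against the last axis, so a vector with one entry per state (per row) cannot be compared directly.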

## Comparing simulations and numpy.reshape

Reshape gives a new shape to the array without changing the data. Detailed documentation about reshape is [here](https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html). Let us take an example to show how reshape works.

In [53]:
a= np.arange(6)
print(a)
print(a.shape)

[0 1 2 3 4 5]
(6,)


In [54]:
a.reshape(2,3)

array([[0, 1, 2],
       [3, 4, 5]])

In [55]:
print(a.reshape(-1,1))   # -1 as a parameter simply means that it is an unknown dimension and we want numpy to figure it out.
print('New shape is: ', a.reshape(-1,1).shape)

[[0]
 [1]
 [2]
 [3]
 [4]
 [5]]
New shape is:  (6, 1)


If we now apply `reshape` to our PredictWise Obama probabilities, we get 51 rows of 1000 Trues and Falses:

In [56]:
how_many_won = sims < predictwise.Obama.values.reshape(-1,1) # reshape for broadcasting
how_many_won

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False,  True,  True, ...,  True,  True,  True],
       [False, False, False, ..., False, False, False]])

In [57]:
how_many_won.shape

(51, 1000)

Multiply by the number of electoral Votes in each state

In [58]:
how_many_electoral_votes = how_many_won*predictwise.Votes.values.reshape(-1,1)
how_many_electoral_votes

array([[ 0,  0,  0, ...,  0,  0,  0],
       [ 0,  0,  0, ...,  0,  0,  0],
       [ 0,  0,  0, ...,  0,  0,  0],
       ...,
       [ 0,  0,  0, ...,  0,  0,  0],
       [ 0, 10, 10, ..., 10, 10, 10],
       [ 0,  0,  0, ...,  0,  0,  0]])

If we sum the columns now (that is over the 50 states and DC), we get the total electoral votes in the country for Obama in each simulation

In [59]:
total_electoral_votes_per_simulation = np.sum(how_many_electoral_votes, axis=0)
total_electoral_votes_per_simulation[:20]

array([278, 309, 347, 316, 341, 338, 328, 288, 290, 312, 314, 332, 332,
       332, 347, 316, 334, 351, 300, 336])
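The multiply-then-sum step is easier to follow on a toy table. The three states, four simulations and their outcomes below are made up for illustration.

```python
import numpy as np

# Hypothetical toy example: 3 states (rows), 4 simulations (columns).
won = np.array([[True,  False, True,  True],
                [False, False, True,  False],
                [True,  True,  True,  False]])
votes = np.array([9, 3, 11])

electoral = won * votes.reshape(-1, 1)   # a state's votes where won, 0 where lost
totals = electoral.sum(axis=0)           # collapse the state axis: one total per simulation

print(electoral)
print(totals)   # totals are 20, 11, 23 and 9
```

Summing with `axis=0` adds down each column, so the result has one entry per simulation rather than one per state.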

This is what Nate Silver and other forecasters will quote you. In how many simulations does Obama win? There are 538 electoral votes in the US, so 269 is the break-even point (an exact tie; 270 wins outright). Let's compare against that threshold:

In [60]:
np.sum(total_electoral_votes_per_simulation >= 269)

995

That's 995 out of 1000 simulations, or a 99.5% chance!

## Refactoring the simulation code

Here then is the predictive model we are playing with, in short: each column simulates a single outcome for the 50 states + DC by choosing a random number between 0 and 1 for each state. Obama wins a state in that simulation if its random number is $<$ his win probability there. If he wins the state, we add in its electoral votes; otherwise we don't. We do this `n_sim` times and return a list of total Obama electoral votes in each simulation.

### Q1:  *Write a function that uses this **predictive model** to simulate the outcome of the election given a Dataframe of probabilities*.



In [None]:
"""
Function
--------
simulate_election

Inputs
------
model : DataFrame
    A DataFrame summarizing an election forecast. The dataframe has 51 rows -- one for each state and DC
    It has the following columns:
       Obama : Forecasted probability that Obama wins the state
       Votes : Electoral votes for the state
    The DataFrame is indexed by state (i.e., model.index is an array of state names)
    
n_sim : int
   Number of simulations to run
   
Returns
-------
results : Numpy array with n_sim elements
   Each element stores the number of electoral college votes Obama wins in each simulation.   
"""

# your code here
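If you want something to compare your answer against, here is one possible sketch that follows the steps above. The optional `rng` argument and the toy table are our own additions for reproducibility and testing; they are not part of the exercise statement.

```python
import numpy as np
import pandas as pd

def simulate_election(model, n_sim, rng=None):
    """One possible implementation; `rng` is an extra, optional argument."""
    if rng is None:
        rng = np.random.default_rng()
    probs = model.Obama.values.reshape(-1, 1)     # column vector, one row per state
    votes = model.Votes.values.reshape(-1, 1)
    draws = rng.uniform(size=(len(model), n_sim)) # one uniform draw per state and simulation
    obama_wins = draws < probs                    # broadcast across the simulations
    return (obama_wins * votes).sum(axis=0)       # total votes per simulation

# Tiny hypothetical table, only to exercise the function.
toy = pd.DataFrame({"Obama": [1.0, 0.0, 0.5], "Votes": [10, 20, 5]},
                   index=["A", "B", "C"])
toy_result = simulate_election(toy, 1000, rng=np.random.default_rng(3))
print(np.unique(toy_result))   # only 10 (state C lost) or 15 (state C won) can occur
```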


The following code takes the necessary probabilities for the Predictwise data, and runs 10000 simulations. If you think of this in terms of our coins, think of it as having 51 biased coins, one for each state, and tossing them 10,000 times each.

We use the results to count the simulations in which, according to this predictive model, Obama wins the election (i.e., receives 269 or more electoral college votes). Dividing that count by the number of simulations gives the estimated probability of an Obama win.

In [None]:
result = simulate_election(predictwise, 10000)
print(type(result))
np.sum(result >= 269)

There are roughly only 50 simulations in which Romney wins the election!

### Displaying the prediction

Now, let's visualize the simulation. We will build a histogram from the result of `simulate_election`. We will **normalize** the histogram by dividing the frequency of a vote tally by the number of simulations. We'll overplot the "victory threshold" of 269 votes as a vertical black line and the actual result (Obama winning 332 votes) as a vertical red line.

We also compute the number of votes at the 5th and 95th percentiles, call the difference between them the spread, and display it (this is an estimate of the outcome's uncertainty). By 5th percentile we mean that if we sorted the number of votes Obama gets in each simulation in increasing order, the 5th percentile is the number below which 5\% of the simulations lie.
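As a small sanity check of what `np.percentile` returns, consider a made-up array of tallies running from 1 to 100 (an assumption chosen only because the answer is easy to verify by eye):

```python
import numpy as np

tallies = np.arange(1, 101)            # hypothetical tallies 1, 2, ..., 100

p05 = np.percentile(tallies, 5)        # value below which about 5% of tallies lie
p95 = np.percentile(tallies, 95)       # value below which about 95% of tallies lie
spread = p95 - p05

print(p05, p95, spread)
print((tallies < p05).mean())          # fraction of tallies below the 5th percentile
```

By default `np.percentile` interpolates linearly between neighbouring values, so the percentiles need not be values that actually occur in the array.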

We also display the probability of an Obama victory, and use a new histogram feature, "density" plotting. Here we normalize the histogram so that its total area is 1, which turns large-number frequencies into our notion of probability.

In [None]:
def plot_simulation(simulation):
    fig, ax = plt.subplots(figsize=(12,6))
    ax.hist(simulation, bins=np.arange(200, 538, 1), 
             label='simulations', align='left', density=True)
    ax.axvline(332, 0, .5, color='r', label='Actual Outcome')
    ax.axvline(269, 0, .5, color='k', label='Victory Threshold')
    p05 = np.percentile(simulation, 5.)
    p95 = np.percentile(simulation, 95.)
    iq = int(p95 - p05)
    pwin = ((simulation >= 269).mean() * 100)
    ax.set_title("Chance of Obama Victory: %0.2f%%, Spread: %d votes" % (pwin, iq))
    ax.legend(frameon=False, loc='upper left')
    ax.set_xlabel("Obama Electoral College Votes")
    ax.set_ylabel("Probability")
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

In [None]:
plot_simulation(result)

The model created by combining the probabilities we obtained from PredictWise with the simulation of a biased coin flip corresponding to the win probability in each state leads us to a histogram of election outcomes. We are plotting the probabilities of a prediction, so we call this distribution over outcomes the **predictive distribution**. Simulating from our model and plotting a histogram allows us to visualize this predictive distribution. In general, such a set of probabilities is called a **probability distribution** or **probability mass function**.
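The probability mass function can also be tabulated directly, without a plot. The sketch below uses a hypothetical two-state model (9 votes won with probability 0.8, 29 votes won with probability 0.7) so that the estimated probabilities can be compared with values computed by hand.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical two-state model, chosen so the exact answer is easy to work out.
probs = np.array([0.8, 0.7]).reshape(-1, 1)
votes = np.array([9, 29]).reshape(-1, 1)
totals = ((rng.uniform(size=(2, 50_000)) < probs) * votes).sum(axis=0)

counts = np.bincount(totals, minlength=39)   # frequency of each tally 0..38
pmf = counts / counts.sum()                  # normalize so the probabilities sum to 1

for tally in np.flatnonzero(pmf):
    print(tally, round(pmf[tally], 3))
```

Because the two states are independent, the exact probabilities are 0.06 for 0 votes, 0.24 for 9, 0.14 for 29 and 0.56 for 38, and the simulated frequencies should land close to those.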

## PART 2: Estimating Probabilities

Now we will try to **estimate** our own win probabilities to plug into our predictive model.

We will start with a simple forecast model. We will try to predict the outcome of the election based on the estimated proportion of people in each state who identify with one political party or the other.

Gallup measures the political leaning of each state, based on asking random people which party they identify or affiliate with. [Here's the data](http://www.gallup.com/poll/156437/heavily-democratic-states-concentrated-east.aspx#2) they collected from January-June of 2012:

In [62]:
!wget https://raw.githubusercontent.com/nehal-sharma/HalfDayHacking/main/data/g12.csv

--2020-12-26 07:11:37--  https://raw.githubusercontent.com/nehal-sharma/HalfDayHacking/main/data/g12.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1537 (1.5K) [text/plain]
Saving to: ‘g12.csv’


2020-12-26 07:11:37 (22.9 MB/s) - ‘g12.csv’ saved [1537/1537]



In [63]:
gallup_2012=pd.read_csv("g12.csv").set_index('State')
gallup_2012["Unknown"] = 100 - gallup_2012.Democrat - gallup_2012.Republican
gallup_2012.head()

| State | Democrat | Republican | Dem_Adv | N | Unknown |
|---|---|---|---|---|---|
| Alabama | 36.0 | 49.6 | -13.6 | 3197 | 14.4 |
| Alaska | 35.9 | 44.3 | -8.4 | 402 | 19.8 |
| Arizona | 39.8 | 47.3 | -7.5 | 4325 | 12.9 |
| Arkansas | 41.5 | 40.8 | 0.7 | 2071 | 17.7 |
| California | 48.3 | 34.6 | 13.7 | 16197 | 17.1 |


Each row lists a state, the percent of surveyed individuals who identify as Democrat/Republican, the percent whose identification is unknown or who haven't made an affiliation yet, the margin between Democrats and Republicans (`Dem_Adv`: the percentage identifying as Democrats minus the percentage identifying as Republicans), and the number `N` of people surveyed.

### QUESTION 2

This survey can be used to predict the outcome of each State's election. The simplest forecast model assigns 100% probability that the state will vote for the majority party.  *Implement this simple forecast*.

In [64]:
"""
Function
--------
simple_gallup_model

A simple forecast that predicts an Obama (Democratic) victory with
0 or 100% probability, depending on whether a state
leans Republican or Democrat.

Inputs
------
gallup : DataFrame
    The Gallup dataframe above

Returns
-------
model : DataFrame
    A dataframe with the following column
     * Obama: probability that the state votes for Obama. All values should be 0 or 1
    model.index should be set to gallup.index (that is, it should be indexed by state name)
    
Examples
---------
>>> simple_gallup_model(gallup_2012).loc['Florida']
Obama    1
Name: Florida, dtype: float64
>>> simple_gallup_model(gallup_2012).loc['Arizona']
Obama    0
Name: Arizona, dtype: float64
"""

# your code here
gallup_2012.Dem_Adv.value_counts()



 0.7     2
 3.6     2
-8.4     2
 9.8     1
 17.4    1
 8.1     1
 25.6    1
-11.9    1
-30.3    1
-3.0     1
 21.2    1
 4.4     1
-6.0     1
 26.0    1
 11.5    1
-3.9     1
-1.5     1
-4.2     1
 15.4    1
-19.0    1
 11.1    1
-7.0     1
-4.0     1
 13.7    1
-5.8     1
 20.2    1
 13.9    1
 5.2     1
-4.8     1
-5.2     1
-29.9    1
 12.6    1
 2.8     1
 0.4     1
-39.3    1
 66.3    1
-13.2    1
 0.2     1
 1.8     1
-13.6    1
 19.1    1
 3.4     1
-13.7    1
 10.2    1
-9.4     1
-6.7     1
-16.9    1
-7.5     1
Name: Dem_Adv, dtype: int64

In [65]:
gallup_simple = gallup_2012.copy()

In [67]:
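# A Python `if ... else` cannot be applied to a whole column at once;
# np.where makes the same choice element by element.
gallup_simple['Obama'] = np.where(gallup_simple['Dem_Adv'] < 0, 0.0, 1.0)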
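The same idea can be packaged as the `simple_gallup_model` function that the cells below call. This is one possible sketch rather than the reference solution; assigning an exact tie (`Dem_Adv` equal to 0) to Obama is our assumption, carried over from the `< 0` test in the cell above.

```python
import pandas as pd

def simple_gallup_model(gallup):
    """One possible implementation: probability 1 where Dem_Adv >= 0, else 0."""
    obama = (gallup["Dem_Adv"] >= 0).astype(float)   # boolean column turned into 0.0 / 1.0
    return pd.DataFrame({"Obama": obama}, index=gallup.index)

# A few rows copied from the Gallup table above, to exercise the function.
sample = pd.DataFrame(
    {"Dem_Adv": [-13.6, -7.5, 13.7]},
    index=pd.Index(["Alabama", "Arizona", "California"], name="State"),
)
sample_model = simple_gallup_model(sample)
print(sample_model)
```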

Now, we run the simulation with this model, and plot it.

In [None]:
predictwise.head()

In [None]:
model = simple_gallup_model(gallup_2012)
model = model.join(predictwise[['Votes']])
prediction = simulate_election(model, 10000)

plot_simulation(prediction)

## PART 3:  Adding Polling Uncertainty to the Predictive Model

The model above is brittle -- it includes no accounting for uncertainty, and thus makes predictions with 100% confidence. This is clearly wrong -- there are numerous sources of uncertainty in estimating election outcomes from a poll of affiliations. 

The most obvious source of error in the Gallup data is the finite sample size -- Gallup did not poll *everybody* in America, and thus the party affiliations are subject to sampling errors. How much uncertainty does this introduce?

On their [webpage](http://www.gallup.com/poll/156437/heavily-democratic-states-concentrated-east.aspx#2) discussing these data, Gallup notes that the sampling error for the states is between 3 and 6%, with it being 3% for most states. (The calculation of the sampling error itself is an exercise in statistics. It's fun to think of how you could arrive at the sampling error if it was not given to you. One way to do it would be to assume this was a two-choice situation and use binomial sampling error for the non-unknown answers, and further model the error for those who answered 'Unknown'.)
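As a rough illustration of the binomial idea, the sketch below treats the Democratic share in a state as a simple binomial proportion and computes an approximate 95% margin of error from the sample size `N`. This is a simplification of ours (it ignores the 'Unknown' answers and any survey design effects), so it should not be expected to reproduce Gallup's quoted figures; `binomial_margin` is a hypothetical helper name.

```python
import numpy as np

def binomial_margin(share, n, z=1.96):
    """Approximate 95% margin of error, in percentage points, for a share given in percent."""
    p = share / 100.0
    return 100.0 * z * np.sqrt(p * (1 - p) / n)

# Democrat share and N taken from the Gallup table above.
alaska = binomial_margin(35.9, 402)
california = binomial_margin(48.3, 16197)
print(round(alaska, 1), round(california, 1))
```

The margin shrinks like one over the square root of `N`, which is why the small Alaska sample is much less precise than the large California one.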

This is indeed done even in direct polls just before the election. There too you can't ask everyone and must worry about sampling.

### QUESTION 3:

Use Gallup's estimate of 3% to build a Gallup model with some uncertainty. Assume that the `Dem_Adv` column represents the mean of a Gaussian, whose standard deviation is 3%. Build the model in the function `uncertain_gallup_model`. *Return a forecast where the probability of an Obama victory is given by the probability that a sample from the `Dem_Adv` Gaussian is positive.*

**Hint**
The probability that a sample from a Gaussian with mean $\mu$ and standard deviation $\sigma$ exceeds a threshold $z$ is $1 - CDF(z)$, where the Cumulative Distribution Function of a Gaussian is:

$$
CDF(z) = \frac1{2}\left(1 + {\rm erf}\left(\frac{z - \mu}{\sqrt{2 \sigma^2}}\right)\right) 
$$


In [None]:
"""
Function
--------
uncertain_gallup_model

A forecast that predicts an Obama (Democratic) victory if the random variable drawn
from a Gaussian with mean Dem_Adv and standard deviation 3% is >0

Inputs
------
gallup : DataFrame
    The Gallup dataframe above

Returns
-------
model : DataFrame
    A dataframe with the following column
     * Obama: probability that the state votes for Obama.
    model.index should be set to gallup.index (that is, it should be indexed by state name)
"""
# your code here
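One possible sketch, following the hint: the probability of a positive draw is one minus the Gaussian CDF evaluated at 0. The `sigma` keyword and the sample rows are our own additions, and we use `math.erf` from the standard library rather than assuming any particular statistics package.

```python
import math
import pandas as pd

def uncertain_gallup_model(gallup, sigma=3.0):
    """One possible implementation; `sigma` defaults to Gallup's 3% figure."""
    def prob_positive(mu):
        # P(X > 0) = 1 - CDF(0) for X ~ Normal(mu, sigma)
        cdf_at_zero = 0.5 * (1.0 + math.erf((0.0 - mu) / math.sqrt(2 * sigma**2)))
        return 1.0 - cdf_at_zero

    obama = [prob_positive(mu) for mu in gallup["Dem_Adv"]]
    return pd.DataFrame({"Obama": obama}, index=gallup.index)

# A few rows copied from the Gallup table above, to exercise the function.
sample = pd.DataFrame(
    {"Dem_Adv": [-13.6, 0.7, 13.7]},
    index=pd.Index(["Alabama", "Arkansas", "California"], name="State"),
)
uncertain_sample = uncertain_gallup_model(sample)
print(uncertain_sample)
```

A state with a small advantage, such as Arkansas at 0.7 points, ends up only slightly better than a coin flip, while states with double-digit margins stay very close to 0 or 1.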


In [None]:
model = uncertain_gallup_model(gallup_2012)
model = model.join(predictwise[['Votes']])

In [None]:
prediction = simulate_election(model, 10000)
plot_simulation(prediction)