# 2020 US Presidential Elections Winner Simulator
This project uses state-by-state predictions from [PredictIt](https://www.predictit.org/) to simulate many hypothetical elections and produce an overall prediction: *how often Joe Biden will win the US election*.

If you are unfamiliar with PredictIt, it is a financial prediction market where traders exchange shares tied to political events around the world. The price of a share reflects what traders believe is the likelihood of that event.

## Imports

In [1]:
import csv
import io
import json
import numpy
import urllib.request
import re

We need two sets of data: the number of electoral votes allocated to each state (plus DC), and the market data from PredictIt.

## Download Electoral Votes Data
Get the number of electoral votes assigned to each state. Remember there are 50 states plus DC, so we expect 51 entries.

In [2]:
url_open = urllib.request.urlopen("https://raw.githubusercontent.com/peterhhchan/us-elections-2020/master/notebooks/PredictItSimulator/states.csv")
reader = csv.reader(io.TextIOWrapper(url_open, encoding = 'utf-8'), delimiter=',')

electoral_votes = {}
for v,n,a in reader:
    electoral_votes[a]=int(v)
    
print (electoral_votes)

{'CA': 55, 'TX': 38, 'FL': 29, 'NY': 29, 'IL': 20, 'PA': 20, 'OH': 18, 'GA': 16, 'MI': 16, 'NC': 15, 'NJ': 14, 'VA': 13, 'WA': 12, 'AZ': 11, 'IN': 11, 'MA': 11, 'TN': 11, 'MD': 10, 'MN': 10, 'MO': 10, 'WI': 10, 'AL': 9, 'CO': 9, 'SC': 9, 'KY': 8, 'LA': 8, 'CT': 7, 'OK': 7, 'OR': 7, 'AR': 6, 'IA': 6, 'KS': 6, 'MS': 6, 'NV': 6, 'UT': 6, 'NE': 5, 'NM': 5, 'WV': 5, 'HI': 4, 'ID': 4, 'ME': 4, 'NH': 4, 'RI': 4, 'AK': 3, 'DE': 3, 'MT': 3, 'ND': 3, 'SD': 3, 'VT': 3, 'WY': 3, 'DC': 3}
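A quick sanity check on the downloaded data is worth having. The sketch below copies the dictionary printed above and confirms that it has 51 entries summing to 538 electoral votes; `votes_to_win` is a name introduced here for illustration.

```python
# Electoral votes per state, copied from the output above.
electoral_votes = {
    'CA': 55, 'TX': 38, 'FL': 29, 'NY': 29, 'IL': 20, 'PA': 20, 'OH': 18,
    'GA': 16, 'MI': 16, 'NC': 15, 'NJ': 14, 'VA': 13, 'WA': 12, 'AZ': 11,
    'IN': 11, 'MA': 11, 'TN': 11, 'MD': 10, 'MN': 10, 'MO': 10, 'WI': 10,
    'AL': 9, 'CO': 9, 'SC': 9, 'KY': 8, 'LA': 8, 'CT': 7, 'OK': 7, 'OR': 7,
    'AR': 6, 'IA': 6, 'KS': 6, 'MS': 6, 'NV': 6, 'UT': 6, 'NE': 5, 'NM': 5,
    'WV': 5, 'HI': 4, 'ID': 4, 'ME': 4, 'NH': 4, 'RI': 4, 'AK': 3, 'DE': 3,
    'MT': 3, 'ND': 3, 'SD': 3, 'VT': 3, 'WY': 3, 'DC': 3,
}

# 50 states plus DC should give 51 entries and 538 votes in total.
num_contests = len(electoral_votes)
total_votes = sum(electoral_votes.values())
assert num_contests == 51, num_contests
assert total_votes == 538, total_votes

# A candidate needs a strict majority of the total to win outright.
votes_to_win = total_votes // 2 + 1
print(votes_to_win)  # 270
```

This is why the simulation later checks for more than 269 votes.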


## Download Market Data
Grab the latest share prices on PredictIt. Prices range from 1 cent to 99 cents (0.01 - 0.99). Roughly, a price of 1 cent implies Biden has a 1% chance of winning that state, and a price of 99 cents implies Biden has a 99% chance of winning that state.

In [3]:
prices = {}
markets = json.loads(urllib.request.urlopen("https://www.predictit.org/api/marketdata/all/").read())['markets']
for m in markets:
    sn = m['shortName']
    match = re.search(r"^Which party will win ([A-Z]{2})( in)? 2020\?$", sn)
    if match:
        for c in m['contracts']:
            if c['name'] == 'Democratic':
                prices[match[1]] = c['lastTradePrice']

print (prices)

{'WI': 0.74, 'PA': 0.72, 'FL': 0.59, 'MI': 0.78, 'AZ': 0.68, 'MN': 0.8, 'NH': 0.81, 'NC': 0.54, 'OH': 0.47, 'NV': 0.79, 'VA': 0.91, 'IA': 0.45, 'GA': 0.46, 'CO': 0.9, 'TX': 0.33, 'ME': 0.89, 'IN': 0.1, 'NM': 0.9, 'NJ': 0.94, 'MO': 0.15, 'OR': 0.94, 'UT': 0.08, 'TN': 0.08, 'CT': 0.94, 'AK': 0.23, 'KY': 0.06, 'MD': 0.96, 'MA': 0.96, 'AR': 0.06, 'WA': 0.96, 'MT': 0.13, 'SC': 0.17, 'CA': 0.96, 'NY': 0.93, 'IL': 0.94, 'WV': 0.04, 'OK': 0.04, 'LA': 0.08, 'ID': 0.04, 'NE': 0.07, 'AL': 0.06, 'KS': 0.11, 'MS': 0.09, 'RI': 0.96, 'HI': 0.96, 'WY': 0.04, 'VT': 0.96, 'DE': 0.95, 'ND': 0.04, 'SD': 0.06, 'DC': 0.98}
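To see what the regular expression accepts without calling the API, here is a small offline sketch. The three short names are made up for this example and are not taken from PredictIt.

```python
import re

# Same pattern as in the cell above, written as a raw string.
pattern = re.compile(r"^Which party will win ([A-Z]{2})( in)? 2020\?$")

# Hypothetical short names, invented for illustration only.
samples = [
    "Which party will win PA in 2020?",
    "Which party will win DC 2020?",
    "Which party will win the 2020 presidential election?",
]

matched = {}
for name in samples:
    m = pattern.search(name)
    # Group 1 is the two-letter state code; None means the market is skipped.
    matched[name] = m.group(1) if m else None

print(matched)
```

Only names with a two-letter upper-case code right after "win" are kept, so national markets are ignored.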


## Calculate the Implied Probability
PredictIt charges a 10% fee on profits, so a savvy bettor understands that to break even on a 50-cent contract, the contract must win more than 52.6% of the time. Here we convert each price to its implied probability.

In [4]:
win_prob = {}
for s, price in prices.items():
    p = float (price)
    ## predictit charges a 10% fee on profits
    ## the net profit on a 50 cent contract is 45 cents
    ## the bettor must win 52.63%+ of the time to breakeven
    win_prob[s] = p / ((1 - p) * 0.9 + p)
print (win_prob)

{'WI': 0.7597535934291582, 'PA': 0.7407407407407407, 'FL': 0.6152241918665275, 'MI': 0.7975460122699387, 'AZ': 0.7024793388429753, 'MN': 0.8163265306122449, 'NH': 0.8256880733944955, 'NC': 0.5660377358490567, 'OH': 0.49630411826821536, 'NV': 0.8069458631256384, 'VA': 0.9182643794147326, 'IA': 0.47619047619047616, 'GA': 0.48625792811839325, 'CO': 0.9090909090909092, 'TX': 0.3536977491961415, 'ME': 0.8998988877654196, 'IN': 0.10989010989010989, 'NM': 0.9090909090909092, 'NJ': 0.9456740442655935, 'MO': 0.16393442622950818, 'OR': 0.9456740442655935, 'UT': 0.0881057268722467, 'TN': 0.0881057268722467, 'CT': 0.9456740442655935, 'AK': 0.24918743228602383, 'KY': 0.06622516556291391, 'MD': 0.963855421686747, 'MA': 0.963855421686747, 'AR': 0.06622516556291391, 'WA': 0.963855421686747, 'MT': 0.14238773274917854, 'SC': 0.18538713195201745, 'CA': 0.963855421686747, 'NY': 0.9365558912386708, 'IL': 0.9456740442655935, 'WV': 0.04424778761061947, 'OK': 0.04424778761061947, 'LA': 0.0881057268722467, 'ID
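The formula comes from a break-even argument: a winning contract bought at price `p` pays a net profit of `0.9 * (1 - p)`, and a losing one costs `p`. The sketch below wraps this in a helper and checks that the expected value is zero at the implied probability; `implied_probability` and its `fee` parameter are names introduced here for illustration.

```python
def implied_probability(price, fee=0.10):
    """Break-even win probability for a contract bought at `price` dollars."""
    net_profit = (1 - price) * (1 - fee)  # winnings after the fee on profits
    return price / (net_profit + price)

q = implied_probability(0.50)

# Expected value at the break-even probability:
# win 45 cents with probability q, lose the 50-cent stake otherwise.
ev = q * 0.45 - (1 - q) * 0.50

print(round(q, 4))      # 0.5263
print(abs(ev) < 1e-12)  # True
```

With `fee=0` the helper returns the price unchanged, which is the naive reading of the market.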

## Adjust the Probabilities 
If you think the markets are incorrect, this section is where you could adjust the predictions.

Here we apply a global bias. If we think the markets are overvaluing Biden's chances by 25%, we would set our bias to `0.8`.

In [5]:
bias = 1
def apply_bias(probs):
    for s, prob in probs.items():
        probs [s] = prob * bias
apply_bias(win_prob)
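Note that `apply_bias` changes `win_prob` in place, and a bias above 1 could push a probability past 1. A minimal alternative sketch, assuming we prefer a new dictionary and a cap at 1.0; `scaled_probs` is a hypothetical helper, not part of the notebook.

```python
def scaled_probs(probs, bias):
    """Return a new dict with every probability multiplied by `bias`, capped at 1."""
    return {s: min(1.0, p * bias) for s, p in probs.items()}

example = {"FL": 0.60, "CA": 0.96}      # made-up values for illustration

lowered = scaled_probs(example, 0.8)    # markets overvalue Biden by 25%
raised = scaled_probs(example, 1.25)    # markets undervalue Biden

print(lowered)
print(raised)
```

Because the input is left untouched, different bias values can be compared without downloading the prices again.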

Next, we can smooth out our simulations by applying some assumptions to our predictions.

 * If the market prediction is above `win_threshold`, Biden will always win that state.
 * If the market prediction is below `lose_threshold`, Biden will always lose that state.

An interesting set of parameters to use is to set `win_threshold` to **0.8**, and `lose_threshold` to **0.5**.

In [6]:
win_threshold  = 0.95
lose_threshold = 0.05

In [7]:
def apply_thresholds(probs):
    for s, p in probs.items():
        if p > win_threshold:   
            probs [s] = 1.0
        elif p < lose_threshold: 
            probs [s] = 0

apply_thresholds(win_prob) 
print (win_prob)

{'WI': 0.7597535934291582, 'PA': 0.7407407407407407, 'FL': 0.6152241918665275, 'MI': 0.7975460122699387, 'AZ': 0.7024793388429753, 'MN': 0.8163265306122449, 'NH': 0.8256880733944955, 'NC': 0.5660377358490567, 'OH': 0.49630411826821536, 'NV': 0.8069458631256384, 'VA': 0.9182643794147326, 'IA': 0.47619047619047616, 'GA': 0.48625792811839325, 'CO': 0.9090909090909092, 'TX': 0.3536977491961415, 'ME': 0.8998988877654196, 'IN': 0.10989010989010989, 'NM': 0.9090909090909092, 'NJ': 0.9456740442655935, 'MO': 0.16393442622950818, 'OR': 0.9456740442655935, 'UT': 0.0881057268722467, 'TN': 0.0881057268722467, 'CT': 0.9456740442655935, 'AK': 0.24918743228602383, 'KY': 0.06622516556291391, 'MD': 1.0, 'MA': 1.0, 'AR': 0.06622516556291391, 'WA': 1.0, 'MT': 0.14238773274917854, 'SC': 0.18538713195201745, 'CA': 1.0, 'NY': 0.9365558912386708, 'IL': 0.9456740442655935, 'WV': 0, 'OK': 0, 'LA': 0.0881057268722467, 'ID': 0, 'NE': 0.07717750826901874, 'AL': 0.06622516556291391, 'KS': 0.12074643249176728, 'MS':
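To compare threshold settings side by side, a non-mutating variant can help. This is a sketch only; `with_thresholds` is a hypothetical helper and the four probabilities are made up.

```python
def with_thresholds(probs, win_threshold, lose_threshold):
    """Return a copy of `probs` with near-certain states snapped to 1.0 or 0.0."""
    out = {}
    for s, p in probs.items():
        if p > win_threshold:
            out[s] = 1.0      # treated as a certain win
        elif p < lose_threshold:
            out[s] = 0.0      # treated as a certain loss
        else:
            out[s] = p        # left as a toss-up
    return out

example = {"MD": 0.96, "PA": 0.74, "OH": 0.47, "WY": 0.04}  # made-up values

strict = with_thresholds(example, 0.95, 0.05)
loose = with_thresholds(example, 0.80, 0.50)

print(strict)  # {'MD': 1.0, 'PA': 0.74, 'OH': 0.47, 'WY': 0.0}
print(loose)   # {'MD': 1.0, 'PA': 0.74, 'OH': 0.0, 'WY': 0.0}
```

With the looser setting, any state below 50% is handed to the other candidate outright.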

Alternatively, you can simply set a state's probability directly, e.g. `win_prob['FL'] = 0.5`, which means Biden has a 50% chance of winning Florida.

In [8]:
## Override Democrat's chances of winning
# win_prob['FL'] = 0.5
# win_prob['AZ'] = 0.5
# win_prob['NC'] = 0.5
# win_prob['WI'] = 0.5
# win_prob['MI'] = 0.5
# win_prob['PA'] = 0.5
# win_prob['MN'] = 0.5

## Prepare the Data
Arrange the data alphabetically by state abbreviation.

In [9]:
win_prob_sorted = sorted(win_prob.items())
print (win_prob_sorted)

[('AK', 0.24918743228602383), ('AL', 0.06622516556291391), ('AR', 0.06622516556291391), ('AZ', 0.7024793388429753), ('CA', 1.0), ('CO', 0.9090909090909092), ('CT', 0.9456740442655935), ('DC', 1.0), ('DE', 1.0), ('FL', 0.6152241918665275), ('GA', 0.48625792811839325), ('HI', 1.0), ('IA', 0.47619047619047616), ('ID', 0), ('IL', 0.9456740442655935), ('IN', 0.10989010989010989), ('KS', 0.12074643249176728), ('KY', 0.06622516556291391), ('LA', 0.0881057268722467), ('MA', 1.0), ('MD', 1.0), ('ME', 0.8998988877654196), ('MI', 0.7975460122699387), ('MN', 0.8163265306122449), ('MO', 0.16393442622950818), ('MS', 0.099009900990099), ('MT', 0.14238773274917854), ('NC', 0.5660377358490567), ('ND', 0), ('NE', 0.07717750826901874), ('NH', 0.8256880733944955), ('NJ', 0.9456740442655935), ('NM', 0.9090909090909092), ('NV', 0.8069458631256384), ('NY', 0.9365558912386708), ('OH', 0.49630411826821536), ('OK', 0), ('OR', 0.9456740442655935), ('PA', 0.7407407407407407), ('RI', 1.0), ('SC', 0.185387131952017

Create a `numpy` array of the win probabilities

In [10]:
ps_sorted = numpy.fromiter(dict(win_prob_sorted).values(), dtype=float)
print (ps_sorted)

[0.24918743 0.06622517 0.06622517 0.70247934 1.         0.90909091
 0.94567404 1.         1.         0.61522419 0.48625793 1.
 0.47619048 0.         0.94567404 0.10989011 0.12074643 0.06622517
 0.08810573 1.         1.         0.89989889 0.79754601 0.81632653
 0.16393443 0.0990099  0.14238773 0.56603774 0.         0.07717751
 0.82568807 0.94567404 0.90909091 0.80694586 0.93655589 0.49630412
 0.         0.94567404 0.74074074 1.         0.18538713 0.06622517
 0.08810573 0.35369775 0.08810573 0.91826438 1.         1.
 0.75975359 0.         0.        ]


Create a `numpy` array of the electoral votes

In [11]:
vs_sorted = numpy.fromiter ((v for _,v in sorted(electoral_votes.items())), int)    
print (vs_sorted)

[ 3  9  6 11 55  9  7  3  3 29 16  4  6  4 20 11  6  8  8 11 10  4 16 10
 10  6  3 15  3  5  4 14  5  6 29 18  7  7 20  4  9  3 11 38  6 13  3 12
 10  5  3]


Combining the above steps into a function

In [12]:
def update_arrays():
    global win_prob_sorted, ps_sorted, vs_sorted
    win_prob_sorted = sorted(win_prob.items())
    ps_sorted = numpy.fromiter(dict(win_prob_sorted).values(), dtype=float)
    vs_sorted = numpy.fromiter ((v for _,v in sorted(electoral_votes.items())), int)    
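The two arrays only line up if both dictionaries contain exactly the same states. If PredictIt has no market for a state, the arrays would silently shift against each other. A defensive sketch, assuming we would rather fail loudly; `aligned_arrays` is a hypothetical helper.

```python
import numpy

def aligned_arrays(win_prob, electoral_votes):
    """Build probability and vote arrays that share one state ordering."""
    missing = sorted(set(electoral_votes) - set(win_prob))
    if missing:
        raise KeyError(f"no market price for: {missing}")
    states = sorted(electoral_votes)
    # Index both dictionaries with the same key list so positions match.
    ps = numpy.array([win_prob[s] for s in states], dtype=float)
    vs = numpy.array([electoral_votes[s] for s in states], dtype=int)
    return states, ps, vs

# Tiny made-up example with two "states".
states, ps, vs = aligned_arrays({"B": 0.2, "A": 0.9}, {"A": 3, "B": 5})
print(states, ps, vs)
```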

## Create and Run the Simulation
The outcomes of the state votes are not independent events. If a candidate does well in one state, he has likely done well in other states too. Here, `corr_coef` (correlation coefficient) shifts each simulated election randomly towards one candidate or the other. One rough way to interpret this variable is that Biden's chances increase or decrease nationally by up to roughly `corr_coef * 0.5` in any one simulated election.

In [13]:
corr_coef = 0.4 ## should be between 0 and 1
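To check that a shared shift really does link the states, we can simulate two toss-up states and measure how correlated their outcomes are. This is an illustrative sketch using the same "own draw plus shared shift" idea as the simulation; `outcome_correlation` and its parameters are names introduced here.

```python
import numpy

def outcome_correlation(corr_coef, p=0.5, n=200_000, seed=0):
    """Correlation between the win/lose outcomes of two states with probability `p`."""
    rng = numpy.random.default_rng(seed)
    shared = rng.random(n) * corr_coef            # one shift per election
    own = rng.random((n, 2)) * (1 - corr_coef)    # one draw per state
    wins = (p > own + shared[:, None]).astype(float)
    return numpy.corrcoef(wins[:, 0], wins[:, 1])[0, 1]

independent = outcome_correlation(0.0)
correlated = outcome_correlation(0.4)

print(independent, correlated)
```

With `corr_coef = 0` the two outcomes should be close to uncorrelated, and the correlation should grow as `corr_coef` increases.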

Next, we generate one random number per state for each of the `num_simulations` elections. Each random number represents the outcome of one state in one simulated election.

In [14]:
num_simulations = 10000
rs = numpy.random.rand(num_simulations, len(win_prob_sorted)) * (1 - corr_coef)

Each step of our loop is a simulated election. In each step of the loop we:
1. Generate a `bias` which represents which way the vote shifted in that election.
2. Decide each state by comparing its implied probability with that state's random number plus the `bias`: if the probability is greater, Biden wins that state's electoral votes.
3. Sum up all the votes and see if that number is bigger than 269
4. Run steps 1-3 repeatedly and calculate how often Biden wins

In [15]:
sims_won = 0
for r in rs:
    bias = numpy.random.random() * corr_coef
    wins = numpy.greater(ps_sorted , r + bias)
    total_votes_won = numpy.sum(numpy.multiply(wins, vs_sorted))
    if total_votes_won > 269: ## Tie breaks go to republicans
        sims_won+=1

Create a function for the simulation so we can repeat the process.

In [16]:
def simulate(n):
    update_arrays()
    
    corr_coef = 0.4
    rs = numpy.random.rand(n, len(win_prob_sorted)) * (1 - corr_coef)
    sims_won = 0
    for r in rs:
        bias = numpy.random.random() * corr_coef
        wins = numpy.greater(ps_sorted , r + bias)
        total_votes_won = numpy.sum(numpy.multiply(wins, vs_sorted))
        if total_votes_won > 269: ## Tie breaks go to republicans
            sims_won+=1

    return sims_won / n
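The same simulation can be written without a Python loop by letting `numpy` broadcast the shared shift across all states. A minimal sketch, assuming the arrays are passed in explicitly; `simulate_vectorized` and its `seed` parameter are hypothetical additions, not part of the notebook.

```python
import numpy

def simulate_vectorized(ps, vs, n, corr_coef=0.4, seed=None):
    """Fraction of `n` simulated elections in which Biden gets more than 269 votes."""
    ps = numpy.asarray(ps, dtype=float)
    vs = numpy.asarray(vs, dtype=int)
    rng = numpy.random.default_rng(seed)
    rs = rng.random((n, len(ps))) * (1 - corr_coef)   # per-state draws
    shift = rng.random((n, 1)) * corr_coef            # one shared shift per election
    wins = ps > rs + shift                            # shape (n, number of states)
    votes = (wins * vs).sum(axis=1)                   # electoral votes per election
    return float(numpy.mean(votes > 269))             # ties still count as losses

# Made-up two-state check: a certain 300-vote win always clears 269.
always = simulate_vectorized([1.0, 0.0], [300, 238], 1000, seed=1)
never = simulate_vectorized([0.0, 1.0], [300, 238], 1000, seed=1)
print(always, never)  # 1.0 0.0
```

Passing a fixed `seed` makes a run repeatable, which is useful when comparing parameter changes.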

Run the simulation!

In [17]:
simulate (num_simulations)

0.9397

Biden winning the election 94% of the time seems like an outrageous claim. One problem with our simulation is that it assumes Biden has a real chance of winning states like Missouri and Utah.

Note: Your results may differ because of changes in the market data.
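Results also vary a little from run to run because the simulation is random. A rough way to size that noise is the standard error of a proportion; the sketch below applies it to the figure above. `standard_error` is a name introduced here for illustration.

```python
import math

def standard_error(win_rate, n):
    """Standard error of a win rate estimated from `n` independent simulations."""
    return math.sqrt(win_rate * (1 - win_rate) / n)

se = standard_error(0.9397, 10000)
print(round(se, 4))  # 0.0024
```

So with 10,000 simulations the estimate is only good to about a quarter of a percentage point, before any doubts about the model itself.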

Let's see if we can make our model better.

## Tweaking the Model

We think the market is favoring Biden too much, so we reduce his probabilities by 10%.

In [18]:
bias = 0.9
apply_bias(win_prob)

Let's also assume that Biden has no chance of winning any state where his probability is below 40%, and that he is certain to win any state where it is above 90%.

In [19]:
win_threshold  = 0.9
lose_threshold = 0.4
apply_thresholds(win_prob)

Finally, let's refine our assumptions even more, and set our own predictions for some of the states.

In [20]:
win_prob['FL'] = 0.40
win_prob['AZ'] = 0.45
win_prob['NC'] = 0.45
win_prob['WI'] = 0.55
win_prob['MI'] = 0.55
win_prob['PA'] = 0.45
win_prob['MN'] = 0.55

Running the model again:

In [21]:
simulate(100000)

0.60826

There we go: we've made a couple of adjustments, and now Biden wins only about 61% of the simulations, down from about 94%.

## Limitations
1. The state-by-state results are correlated events: if Biden wins a state like Florida, it is likely that he wins Arizona as well.  Likewise, if Trump wins Michigan, it is likely he wins Pennsylvania, Wisconsin and Minnesota.  This simulator uses a simple method of applying a global correlation between the events, a more sophisticated model 

2. Nebraska and Maine use the congressional district method to assign their votes. However, their electoral votes are assigned using the winner-take-all system in our simulations.
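One possible way to lift the second limitation is to treat the statewide vote (2 electoral votes) and each congressional district (1 vote each) as separate contests. The sketch below is an assumption-laden illustration: the contest labels, the `split_states` helper and all district probabilities are placeholders, not market data, and drawing the statewide result separately from the districts is itself a simplification.

```python
# Electoral votes when Nebraska and Maine are split into separate contests.
split_votes = {
    "NE-AL": 2, "NE-1": 1, "NE-2": 1, "NE-3": 1,   # Nebraska: 5 votes
    "ME-AL": 2, "ME-1": 1, "ME-2": 1,              # Maine: 4 votes
}

def split_states(electoral_votes, win_prob, district_probs):
    """Replace the NE and ME entries with statewide and district contests."""
    votes = {s: v for s, v in electoral_votes.items() if s not in ("NE", "ME")}
    probs = {s: p for s, p in win_prob.items() if s not in ("NE", "ME")}
    votes.update(split_votes)
    probs.update(district_probs)
    return votes, probs

# Placeholder probabilities, invented for illustration only.
district_probs = {
    "NE-AL": 0.07, "NE-1": 0.10, "NE-2": 0.60, "NE-3": 0.02,
    "ME-AL": 0.89, "ME-1": 0.95, "ME-2": 0.50,
}

votes, probs = split_states(
    {"NE": 5, "ME": 4, "NH": 4},
    {"NE": 0.07, "ME": 0.89, "NH": 0.81},
    district_probs,
)
print(sum(votes.values()))  # 13, the same total as before the split
```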

## Further Reading
 * [270towin](https://www.270towin.com/) One of the most useful election sites. If you think a candidate will win, show your work. Build your own path to victory for each of the candidates.
 * [Economist 2020 Forecast](https://projects.economist.com/us-2020-forecast/president)  - some interesting data including state-by-state correlations
 * [Opportunity of a Lifetime?](https://pointysquares.wordpress.com/2020/10/07/opportunity-of-a-lifetime/) My blog on why I think Biden is a huge value bet.

## Special Thanks
[/u/Abe738](https://www.reddit.com/user/Abe738) for showing a simple method of applying a correlation coefficient to the model. Previously, my model naively treated the state-by-state results as independent events.