# Homework 6-1: "Fundamentals" based election prediction

In this homework you will explore an alternative election prediction model, using various economic and political indicators instead of polling data -- and also deal with the challenges of model building when there is very little training data. Political scientists have long analyzed these types of "fundamentals" models, and they can be reasonably accurate. For example, fundamentals [slightly favored](https://fivethirtyeight.com/features/it-wasnt-clintons-election-to-lose/) the Republicans in 2016.

Data sources which I used to generate `election-fundamentals.csv`:

- Historical presidential approval ratings (highest and lowest for each president) from [Wikipedia](https://en.wikipedia.org/wiki/United_States_presidential_approval_rating) 
- GDP growth in election year from [World Bank](https://data.worldbank.org/indicator/NY.GDP.MKTP.KD.ZG?locations=US)

Note that there are some timing issues here which more careful forecasts would avoid. The presidential approval rating is for the entire presidential term. The GDP growth is for the entire election year. These variables might have higher predictive power if they were (for example) sampled in the last quarters before the election.

For a comprehensive view of election prediction from non-poll data, and how well it might or might not be able to do, try [this](https://fivethirtyeight.com/features/models-based-on-fundamentals-have-failed-at-predicting-presidential-elections/) from FiveThirtyEight.

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

In [3]:
# First, import data/election-fundamentals.csv and take a look at what we have

df = pd.read_csv('data/election-fundamentals.csv')
df.head()

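| | year | incumbent_president | incumbent_party | term | highest_approval | lowest_approval | year_gdp_growth | winner |
|---|---|---|---|---|---|---|---|---|
| 0 | 1960 | Esienhower | R | 2 | 79 | 47 | 2.6 | D |
| 1 | 1964 | Johnson | D | 1 | 79 | 34 | 5.8 | D |
| 2 | 1968 | Johnson | D | 2 | 79 | 34 | 4.8 | R |
| 3 | 1972 | Nixon | R | 1 | 66 | 24 | 5.3 | R |
| 4 | 1976 | Nixon | R | 2 | 66 | 24 | 5.4 | D |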


In [4]:
# How many elections do we have data for?

# 15 elections

df.shape

(15, 8)

In [6]:
# Rather than predicting the winning party, we're going to predict whether the same party stays in power or flips
# This is going to be the target variable
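# Use bracket assignment: `df.flips = ...` only sets a Python attribute and does not create a column
df['flips'] = df.winner != df.incumbent_party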



In [7]:
# Pull out all other numeric columns as features, then create the features and target
# (both are pandas objects; use .values when a numpy array is needed)
fields = ['term', 'highest_approval', 'lowest_approval', 'year_gdp_growth']
features = df[fields]
target = df.flips
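
The target and feature construction above can be checked by hand on the five rows shown by `df.head()`. The sketch below retypes those rows into a small frame (the name `sample` is introduced here for illustration and is not part of the assignment) and applies the same two steps.

```python
import pandas as pd

# The five rows displayed by df.head() earlier, retyped by hand for illustration.
sample = pd.DataFrame({
    "year": [1960, 1964, 1968, 1972, 1976],
    "incumbent_party": ["R", "D", "D", "R", "R"],
    "term": [2, 1, 2, 1, 2],
    "highest_approval": [79, 79, 79, 66, 66],
    "lowest_approval": [47, 34, 34, 24, 24],
    "year_gdp_growth": [2.6, 5.8, 4.8, 5.3, 5.4],
    "winner": ["D", "D", "R", "R", "D"],
})

# True when the party holding the White House loses the election.
sample["flips"] = sample["winner"] != sample["incumbent_party"]

# Select the numeric predictors and convert both pieces to numpy arrays.
fields = ["term", "highest_approval", "lowest_approval", "year_gdp_growth"]
X = sample[fields].values
y = sample["flips"].values

print(sample[["year", "incumbent_party", "winner", "flips"]])
print(X.shape, y.shape)
```

Because `flips` is assigned with brackets, it shows up in `sample.columns` and survives operations such as slicing or saving the frame.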

In [8]:
features.head()

| | term | highest_approval | lowest_approval | year_gdp_growth |
|---|---|---|---|---|
| 0 | 2 | 79 | 47 | 2.6 |
| 1 | 1 | 79 | 34 | 5.8 |
| 2 | 2 | 79 | 34 | 4.8 |
| 3 | 1 | 66 | 24 | 5.3 |
| 4 | 2 | 66 | 24 | 5.4 |


In [9]:
# Use 3-fold cross validation to see how well we can do with a RandomForestClassifier. 
# Print out the scores

rf = RandomForestClassifier()
rf.fit(features.values, target.values)

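# Request 3 folds explicitly; cv=None uses the library default, which is 5 in recent scikit-learn versions
scores = cross_val_score(rf, features, y=target, cv=3)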

In [10]:
scores

array([0.66666667, 0.4       , 1.        ])
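
One way to summarize the fold scores is to look at their mean and spread next to a reference level. The sketch below uses the three accuracies printed above and assumes 0.5 as the coin-flip reference, which is only a fair comparison if flips and holds are roughly balanced in the data.

```python
import numpy as np

# Fold accuracies copied from the cross_val_score output above.
scores = np.array([0.66666667, 0.4, 1.0])

mean_accuracy = scores.mean()
fold_spread = scores.std()

# Assumed reference: random guessing between two equally likely outcomes.
coin_flip = 0.5

print(f"mean accuracy:           {mean_accuracy:.3f}")
print(f"std across folds:        {fold_spread:.3f}")
print(f"margin over a coin flip: {mean_accuracy - coin_flip:+.3f}")
```

With only 15 elections, each fold holds a handful of test cases, so a single election moves a fold's accuracy a lot. A random forest is also randomized, so rerunning the cell will usually produce different scores.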

How predictable are election results just from these variables, as compared to a coin flip?

(your answer here)

In [11]:
# Now create a logistic regression using all the data
# Normally we'd split into test and training years, but here we're only interested in the coefficients

lr = LogisticRegression()
lr.fit(features.values,target.values)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [12]:
# What is the influence of each feature?
# Remember to use np.exp to turn the lr coefficients into odds ratios

coeffs = pd.DataFrame(np.exp(lr.coef_), columns=features.columns)
coeffs

| | term | highest_approval | lowest_approval | year_gdp_growth |
|---|---|---|---|---|
| 0 | 3.383662 | 1.025426 | 0.957309 | 0.573158 |
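
As a reading aid for the odds ratios above: logistic regression is linear in the log-odds, so `np.exp(coef)` is the factor by which the odds of a flip are multiplied when that feature rises by one unit, holding the others fixed. The sketch below reuses the printed values (the dictionary `odds_ratios` is introduced here for illustration) and shows that effects compound multiplicatively.

```python
import numpy as np

# Odds ratios copied from the output above (np.exp of the fitted coefficients).
odds_ratios = {
    "term": 3.383662,
    "highest_approval": 1.025426,
    "lowest_approval": 0.957309,
    "year_gdp_growth": 0.573158,
}

for name, ratio in odds_ratios.items():
    coef = np.log(ratio)  # recovers the raw log-odds coefficient
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: coef={coef:+.3f}; one extra unit {direction} "
          f"the odds of a flip (multiplies them by {ratio:.3f})")

# Effects multiply: two extra points of GDP growth scale the odds by ratio ** 2.
two_point_gdp_effect = odds_ratios["year_gdp_growth"] ** 2
print(f"two extra points of GDP growth multiply the odds by {two_point_gdp_effect:.3f}")
```

Note that the features are on different scales (a term is 1 or 2, approval ranges over tens of points), so a one-unit change does not mean the same thing for each column.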


Describe the effect of each of our features on whether or not the party in power flips. Which feature has the biggest effect? How does economic growth relate? Are there any factors that operate backwards from what you would expect, and if so what do you think is happening?

(your answer here)