# Cleaning the Data

The goal here is to get our data into a format that we can use to make predictions. The target variable is the outcome of the Senate election in each state (the margin by which the Democratic or Republican candidate won), and the remaining columns hold other factors of interest (generic ballot, presidential approval, characteristics of the state, etc.).

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
pd.options.display.max_columns = None
%matplotlib inline
import datetime as dt

## Senate Election Results 

Senate election data is taken from the MIT Election Data and Science Lab and is hosted on the Harvard Dataverse. It can be accessed [here](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/PEJ5QU).

In [2]:
senate_elections = pd.read_csv('1976-2020-senate.csv', encoding = "ISO-8859-1")

In [23]:
senate_elections.head()

Unnamed: 0,year,state,state_po,state_fips,state_cen,state_ic,office,district,stage,special,candidate,party_detailed,writein,mode,candidatevotes,totalvotes,unofficial,version,party_simplified
0,1976,ARIZONA,AZ,4,86,61,US SENATE,statewide,gen,False,SAM STEIGER,REPUBLICAN,False,total,321236,741210,False,20210114,REPUBLICAN
1,1976,ARIZONA,AZ,4,86,61,US SENATE,statewide,gen,False,WM. MATHEWS FEIGHAN,INDEPENDENT,False,total,1565,741210,False,20210114,OTHER
2,1976,ARIZONA,AZ,4,86,61,US SENATE,statewide,gen,False,DENNIS DECONCINI,DEMOCRAT,False,total,400334,741210,False,20210114,DEMOCRAT
3,1976,ARIZONA,AZ,4,86,61,US SENATE,statewide,gen,False,ALLAN NORWITZ,LIBERTARIAN,False,total,7310,741210,False,20210114,LIBERTARIAN
4,1976,ARIZONA,AZ,4,86,61,US SENATE,statewide,gen,False,BOB FIELD,INDEPENDENT,False,total,10765,741210,False,20210114,OTHER


In [24]:
# Start with elections from 2008 onward, since they are the most recent. To keep the data simple and
# consistent, also ignore special elections.
small_results = senate_elections[(senate_elections['special'] == False) & (senate_elections['year'] >= 2008)]
# California's top-two primary advances the two leading candidates regardless of party affiliation, so
# two Democrats often run against each other in the general election. Consequently, drop California from the model.
small_results = small_results[small_results['state'] != 'CALIFORNIA']

In [5]:
def get_single_year(year):
    ''' Creates a dataframe from a single year in small_results with two columns:
            state: the election state
            partisan_score_<year>: the Democratic margin as a fraction of the total vote.
                Negative values mean the Republican won.

        args:
            year (int): the election year.
    '''
    year_data = small_results[small_results['year'] == year].copy()
    # Each candidate's share of the total vote in their race.
    year_data['perc'] = year_data['candidatevotes'] / year_data['totalvotes']
    # Keep only candidates who won between 20% and 80% of the vote.
    year_data = year_data[(year_data['perc'] > .2) & (year_data['perc'] < .8)]
    # Count any remaining 'OTHER' candidates as Democrats.
    year_data['party_simplified'] = year_data['party_simplified'].str.replace('OTHER', 'DEMOCRAT')
    # One row per state and one column per party; shares within a party are averaged.
    year_data = year_data.pivot_table(index='state', columns='party_simplified', values='perc')
    # Name columns by party rather than by position, so a third party cannot be mislabeled.
    year_data.columns = year_data.columns.str.lower()
    year_data = year_data.reset_index()
    score_col = 'partisan_score_{}'.format(year)
    year_data[score_col] = year_data['democrat'] - year_data['republican']
    return year_data[['state', score_col]]
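
As a worked example of the margin calculation, the sketch below applies the same steps to three of the 1976 Arizona rows shown in the `senate_elections.head()` output. That year is used only because its rows are printed above; the notebook itself restricts `small_results` to 2008 onward.

```python
import pandas as pd

# Three of the 1976 Arizona rows from the senate_elections.head() output.
rows = pd.DataFrame({
    'state': ['ARIZONA', 'ARIZONA', 'ARIZONA'],
    'party_simplified': ['REPUBLICAN', 'OTHER', 'DEMOCRAT'],
    'candidatevotes': [321236, 1565, 400334],
    'totalvotes': [741210, 741210, 741210],
})

# Vote share per candidate; the 20%-80% filter drops the minor candidate.
rows['perc'] = rows['candidatevotes'] / rows['totalvotes']
major = rows[(rows['perc'] > .2) & (rows['perc'] < .8)]

# One column per party, then Democratic share minus Republican share.
shares = major.pivot_table(index='state', columns='party_simplified', values='perc')
margin = shares['DEMOCRAT'] - shares['REPUBLICAN']
print(margin.round(4))  # ARIZONA: about 0.1067, a Democratic win of roughly 10.7 points
```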

In [6]:
def create_3_past(year):
    '''Creates a dataframe with each state's partisan score for the given year and for the
       elections two, four, and six years earlier. Scores for years without a race in a state are NaN.
    '''
    current_year = get_single_year(year)
    two_ago = get_single_year(year - 2)
    four_ago = get_single_year(year - 4)
    six_ago = get_single_year(year - 6)
    current_year = current_year.merge(two_ago, how='outer', on='state')
    current_year = current_year.merge(four_ago, how='outer', on='state')
    current_year = current_year.merge(six_ago, how='outer', on='state')
    current_year.columns = ['state', 'partisan_score', 'two_ago_score', 'four_ago_score', 'six_ago_score']
    return current_year
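
Because only about a third of Senate seats are up in any given year, most states are missing from at least one of the four frames being merged. A minimal sketch of how the outer merge handles this, using hypothetical states and margins that are not taken from the data:

```python
import pandas as pd

# Hypothetical margins, for illustration only.
current = pd.DataFrame({'state': ['A', 'B'], 'partisan_score_2020': [0.10, -0.05]})
previous = pd.DataFrame({'state': ['B', 'C'], 'partisan_score_2018': [-0.02, 0.20]})

# how='outer' keeps every state that appears in either frame and
# fills the score it lacks with NaN.
merged = current.merge(previous, how='outer', on='state')
print(merged)
```

State `A` ends up with a missing 2018 score and state `C` with a missing 2020 score, which is why the later averaging step has to cope with NaN values.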

In [36]:
def average_3_past(year):
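    ''' Averages the partisan scores of the previous three election cycles and returns a new dataframe.
        States missing the current score or all three past scores are dropped.
    '''
    df = create_3_past(year)
    # mean() skips missing values, so the average uses whichever past cycles are available.
    df['old_score_avg'] = df[['two_ago_score', 'four_ago_score', 'six_ago_score']].mean(axis=1)
    # Drop incomplete rows without inplace=True, which acts on a slice and triggers SettingWithCopyWarning.
    df = df[['state','partisan_score','old_score_avg']].dropna().copy()
    df['year'] = year
    return df[['state','year','partisan_score','old_score_avg']]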
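
The averaging step relies on `mean(axis=1)` skipping missing values by default. The sketch below shows that behavior on hypothetical scores (not taken from the data): a state is kept as long as it has a current score and at least one past score.

```python
import numpy as np
import pandas as pd

# Hypothetical scores, for illustration only.
df = pd.DataFrame({
    'state': ['A', 'B', 'C'],
    'partisan_score': [0.10, np.nan, -0.05],
    'two_ago_score': [np.nan, 0.04, np.nan],
    'four_ago_score': [0.06, np.nan, np.nan],
    'six_ago_score': [0.02, 0.00, np.nan],
})

past = ['two_ago_score', 'four_ago_score', 'six_ago_score']
# Row-wise mean over the available past scores; all-NaN rows yield NaN.
df['old_score_avg'] = df[past].mean(axis=1)

# B has no current score and C has no past scores, so only A survives.
kept = df[['state', 'partisan_score', 'old_score_avg']].dropna()
print(kept)
```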

In [60]:
average_3_past(2020)

Unnamed: 0,state,year,partisan_score,old_score_avg
0,ALABAMA,2020,-0.203587,-0.280915
1,ALASKA,2020,-0.127032,-0.021296
3,COLORADO,2020,0.093214,0.018603
4,DELAWARE,2020,0.215405,0.178679
5,GEORGIA,2020,-0.01779,-0.107166
6,IDAHO,2020,-0.293759,-0.345323
7,ILLINOIS,2020,0.160676,0.129675
8,IOWA,2020,-0.064782,-0.163868
9,KANSAS,2020,-0.114371,-0.202779
10,KENTUCKY,2020,-0.195338,-0.15008


## Generic Ballot

For the national generic ballot, we will use the average of generic ballot polling for the few weeks preceding the election. The data can be accessed on the Real Clear Politics website [here](https://www.realclearpolitics.com/epolls/other/2020_generic_congressional_vote-6722.html#polls) and on the Center for Politics website [here](https://centerforpolitics.org/crystalball/articles/the-key-to-forecasting-midterms-the-generic-ballot/). 
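
The averages used below were taken from those sources rather than computed in this notebook. For reference, a pre-election average of this kind could be computed roughly as follows; the polls, the column names, and the three-week window are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical polls, for illustration only (not real polling data).
polls = pd.DataFrame({
    'end_date': pd.to_datetime(['2020-09-15', '2020-10-20', '2020-10-27', '2020-11-01']),
    'dem': [48, 50, 49, 51],
    'rep': [44, 43, 42, 44],
})

election_day = pd.Timestamp('2020-11-03')
window_start = election_day - pd.Timedelta(weeks=3)  # assumed window length

# Keep polls that finished inside the window before election day.
recent = polls[(polls['end_date'] >= window_start) & (polls['end_date'] < election_day)]

# Democratic lead as a fraction, on the same scale as the partisan scores.
generic_ballot = ((recent['dem'] - recent['rep']) / 100).mean()
print(round(generic_ballot, 3))
```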

In [63]:
# Democratic lead in the generic ballot as a fraction of the vote (negative = Republican lead).
generic_ballot_avgs = {
    2020: .068,
    2018: .073,
    2016: .006,
    2014: -.024,
    2012: -.002,
    2010: -.1,
    2008: .1,
    2006: .12,
    2004: 0
}

In [58]:
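def add_generic_ballot(year):
    ''' Adds the national generic ballot average for the given year to the output of average_3_past.
    '''
    df = average_3_past(year)
    df['generic_ballot'] = generic_ballot_avgs[year]
    return df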
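
If several years are later stacked into a single dataframe, the same lookup could be done in one step with `Series.map`. This is only a sketch of that alternative, using two of the averages above and hypothetical state rows:

```python
import pandas as pd

# Two of the averages from generic_ballot_avgs.
generic_ballot_avgs = {2020: .068, 2018: .073}

# Hypothetical state rows, for illustration only.
df = pd.DataFrame({'state': ['A', 'B', 'A'], 'year': [2020, 2020, 2018]})

# Look up each row's year in the dictionary; years not in the dictionary would become NaN.
df['generic_ballot'] = df['year'].map(generic_ballot_avgs)
print(df)
```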

## Presidential Approval Score

For presidential approval polling, we will use data from the University of California, Santa Barbara's American Presidency Project. The data all comes from Gallup polling and is available [here](https://www.presidency.ucsb.edu/statistics/data).

In [66]:
approval_file = 'American Presidency Project - Approval Ratings for POTUS.xlsx'
trump = pd.read_excel(approval_file, sheet_name='Donald Trump')
trump['president'] = 'trump'
obama = pd.read_excel(approval_file, sheet_name='Barack Obama')
obama['president'] = 'obama'
bush = pd.read_excel(approval_file, sheet_name='George W. Bush')
bush['president'] = 'bush'

In [67]:
pres_approval = pd.concat([trump, obama, bush])

In [68]:
# Net approval: percent approving minus percent disapproving.
pres_approval['pos_score'] = pres_approval['Approving'] - pres_approval['Disapproving']

In [69]:
pres_approval

Unnamed: 0,Start Date,End Date,Approving,Disapproving,Unsure/NoData,president,pos_score
0,2021-01-04,2021-01-15,34,62,4,trump,-28
1,2020-12-01,2020-12-17,39,57,4,trump,-18
2,2020-11-05,2020-11-19,43,55,2,trump,-12
3,2020-10-16,2020-10-27,46,52,2,trump,-6
4,2020-09-30,2020-10-15,43,55,2,trump,-12
...,...,...,...,...,...,...,...
277,2001-03-09,2001-03-11,58,29,13,bush,29
278,2001-03-05,2001-03-07,63,22,15,bush,41
279,2001-02-19,2001-02-21,62,21,17,bush,41
280,2001-02-09,2001-02-11,57,25,18,bush,32
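
One possible way to turn this table into a per-election feature is to take the net approval from the last poll that ended before election day. The sketch below does this for 2020 using the first five rows of the output above; choosing the last pre-election poll is an assumption, not necessarily the approach used in the rest of the analysis.

```python
import pandas as pd

# The first five rows of the pres_approval output.
sample = pd.DataFrame({
    'Start Date': pd.to_datetime(['2021-01-04', '2020-12-01', '2020-11-05',
                                  '2020-10-16', '2020-09-30']),
    'End Date': pd.to_datetime(['2021-01-15', '2020-12-17', '2020-11-19',
                                '2020-10-27', '2020-10-15']),
    'pos_score': [-28, -18, -12, -6, -12],
})

election_day = pd.Timestamp('2020-11-03')

# Keep polls that finished before election day, then take the most recent one.
before = sample[sample['End Date'] < election_day]
last_poll = before.sort_values('End Date').iloc[-1]
print(last_poll['pos_score'])  # -6, from the poll ending 2020-10-27
```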
