##### 2018 primary data found at: https://github.com/MEDSL/primaries.
#### Goal: Analyze the relationship between the number of votes cast in congressional general elections and the number of votes cast in presidential elections. It is widely accepted that turnout is higher for presidential elections than for congressional elections. This project examines congressional and presidential voting records to gain insight into how congressional voting patterns affect presidential election results.


### The first dataset I will use is the U.S. House 1976-2018 dataset from the MIT Election Data and Science Lab: https://electionlab.mit.edu/data

In [6]:
import pandas as pd

raw_data = pd.read_csv("1976-2018-house.csv")
raw_data.head()

Unnamed: 0,year,state,state_po,state_fips,state_cen,state_ic,office,district,stage,runoff,special,candidate,party,writein,mode,candidatevotes,totalvotes,unofficial,version
0,1976,Alabama,AL,1,63,41,US House,1,gen,False,False,Bill Davenport,democrat,False,total,58906,157170.0,False,20171005
1,1976,Alabama,AL,1,63,41,US House,1,gen,False,False,Jack Edwards,republican,False,total,98257,157170.0,False,20171005
2,1976,Alabama,AL,1,63,41,US House,1,gen,False,False,,,True,total,7,157170.0,False,20171005
3,1976,Alabama,AL,1,63,41,US House,2,gen,False,False,J. Carole Keahey,democrat,False,total,66288,156362.0,False,20171005
4,1976,Alabama,AL,1,63,41,US House,2,gen,False,False,,,True,total,5,156362.0,False,20171005


## Preprocessing
#### First, I will drop the columns that are superfluous or outside the scope of this project. I will also create a single column that identifies both the state and the district.

In [8]:
print(raw_data.columns)

Index(['year', 'state', 'state_po', 'state_fips', 'state_cen', 'state_ic',
       'office', 'district', 'stage', 'runoff', 'special', 'candidate',
       'party', 'writein', 'mode', 'candidatevotes', 'totalvotes',
       'unofficial', 'version'],
      dtype='object')


In [12]:
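# Keep only the columns needed for this analysis. The .copy() makes `data` an
# independent DataFrame, so later column assignments do not trigger
# SettingWithCopyWarning.
data = raw_data[['year', 'state', 'state_po', 'district', 'stage', 'candidate',
       'party', 'candidatevotes', 'totalvotes']].copy()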

In [14]:
print(data.dtypes)

year                int64
state              object
state_po           object
district            int64
stage              object
candidate          object
party              object
candidatevotes     object
totalvotes        float64
dtype: object


In [18]:
# Cast the identifier columns to strings (district must be a string to be concatenated).
str_cols = ['state', 'state_po', 'district', 'stage', 'candidate', 'party']
data[str_cols] = data[str_cols].astype(str)

In [29]:
# Strip commas from the vote counts, then convert the cleaned strings to integers.
data['candidatevotes'] = data['candidatevotes'].str.replace(',', '', regex=False)
data['candidatevotes'] = data['candidatevotes'].astype(int)
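
The conversion above assumes that every value in `candidatevotes` is a clean digit string once commas are removed. As a minimal sketch of a more forgiving variant, assuming some values could be missing or malformed, `pd.to_numeric` with `errors="coerce"` turns unparseable entries into missing values instead of raising. The series below is a toy example and is not read from the real file.

```python
import pandas as pd

# Toy values for illustration only; not read from 1976-2018-house.csv.
votes = pd.Series(["58,906", "98257", "7", None, "n/a"], dtype=object)

# Strip thousands separators, then coerce anything unparseable to NaN.
cleaned = pd.to_numeric(votes.str.replace(",", "", regex=False), errors="coerce")

# The nullable Int64 dtype keeps integer values while allowing missing entries.
cleaned = cleaned.astype("Int64")
print(cleaned)
```

Whether the real column contains such entries is not established here. If it does, the missing values should be inspected before deciding to drop or fill them.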

In [30]:
# Build a single key such as "AL_1" that identifies both the state and the district.
data["state_dist"] = data["state_po"] + "_" + data["district"]
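
If the combined key is later sorted or joined against another table, its exact format matters. The following is a hedged sketch, not part of the original notebook, that zero-pads the district number so the keys sort in numeric order. The toy frame and the two-digit width are assumptions made for illustration.

```python
import pandas as pd

# Toy frame for illustration; the column names mirror the notebook's.
df = pd.DataFrame({"state_po": ["CA", "CA", "AL"], "district": [12, 2, 1]})

# Zero-pad the district to two digits so "CA_02" sorts before "CA_12".
df["state_dist"] = df["state_po"] + "_" + df["district"].astype(str).str.zfill(2)

print(sorted(df["state_dist"]))
```

Without the padding, a plain string sort would place "CA_12" before "CA_2".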

In [31]:
data.head()

Unnamed: 0,year,state,state_po,district,stage,candidate,party,candidatevotes,totalvotes,state_dist
0,1976,Alabama,AL,1,gen,Bill Davenport,democrat,58906,157170.0,AL_1
1,1976,Alabama,AL,1,gen,Jack Edwards,republican,98257,157170.0,AL_1
2,1976,Alabama,AL,1,gen,,,7,157170.0,AL_1
3,1976,Alabama,AL,2,gen,J. Carole Keahey,democrat,66288,156362.0,AL_2
4,1976,Alabama,AL,2,gen,,,5,156362.0,AL_2
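
One quick sanity check on the cleaned table, sketched here under the assumption that `totalvotes` should equal the sum of `candidatevotes` within each district and year: the three AL_1 rows shown above add up to 157170, which matches their `totalvotes`. The frame below re-creates only those three rows by hand.

```python
import pandas as pd

# The three 1976 AL_1 rows from the output above, re-created by hand.
sample = pd.DataFrame({
    "year": [1976, 1976, 1976],
    "state_dist": ["AL_1", "AL_1", "AL_1"],
    "candidatevotes": [58906, 98257, 7],
    "totalvotes": [157170.0, 157170.0, 157170.0],
})

# Sum candidate votes per district and year, and compare with the reported total.
check = sample.groupby(["year", "state_dist"]).agg(
    summed=("candidatevotes", "sum"),
    reported=("totalvotes", "first"),
)
check["matches"] = check["summed"] == check["reported"]
print(check)
```

Running the same groupby on the full `data` frame would flag any district and year where the totals disagree. The `head()` output shows only two of the AL_2 rows, so this sketch says nothing about whether such mismatches exist in the real file.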
