# Disparities in State Voter Turnout: Culture or Policy?

For decades, political scientists have debated whether low voter turnout in the United States is a matter of cultural norms or of voting policy. Many believe that even if voting in the United States were made substantially easier, turnout would not rise much because a culture of voting does not exist. The debate often draws on state-by-state data: some political scientists argue that states that make voting easier have higher turnout (e.g. Minnesota), while states that do not have lower turnout (e.g. Georgia). Opponents of this theory counter that Minnesotans have always voted at high rates, and that this culture of wanting to vote drove the policy changes, not the other way around. Traditionally, the question has been studied by grouping states into “cultural/geographic” segments and comparing turnout among them.
	
Our plan is to use data from previous elections to test whether turnout differences among states are driven by culture or by policy. We will construct a data set that quantifies cultural factors (e.g. demographics, geography, history of voter turnout, education rates, and other forms of political participation) and use it to predict voter turnout rates. We will create training sets for a few regional groups, train a model on each, and then apply it to different test sets to see whether predicted turnout differs from actual turnout. If a state’s predicted turnout is much higher than its actual turnout, one could argue that cultural factors are less important there and that policies in place are preventing people from voting.
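
The comparison described above can be sketched on synthetic data. This is a minimal illustration under stated assumptions, not the project’s pipeline: the feature matrix, coefficients, noise level, and the 0.10 turnout penalty are all made-up stand-ins, and a plain linear regression is assumed as the predictor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 training counties with 3 "cultural" features (assumed).
X_train = rng.normal(size=(200, 3))
true_coef = np.array([0.05, 0.03, -0.02])  # made-up effect sizes
y_train = 0.55 + X_train @ true_coef + rng.normal(scale=0.01, size=200)

model = LinearRegression().fit(X_train, y_train)

# Test counties follow the same relationship, except that the second half
# has turnout lowered by 0.10 to mimic a restrictive voting policy.
X_test = rng.normal(size=(50, 3))
y_test = 0.55 + X_test @ true_coef + rng.normal(scale=0.01, size=50)
y_test[25:] -= 0.10

# Positive gap: actual turnout is lower than the cultural features predict.
gap = model.predict(X_test) - y_test
print("mean gap, unrestricted:", gap[:25].mean())
print("mean gap, restricted:  ", gap[25:].mean())
```

In this toy setup the gap is near zero for the first group and close to the injected penalty for the second, which is the pattern the study would read as evidence of a policy effect.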

First, we import the libraries we will use throughout the project:

In [3]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LinearRegression

Here we read in the 2009 and 2014 datasets, which will serve as the training and testing data, respectively.

In [5]:
# Read in 2009 and 2014 datasets
filename2009 = 'https://raw.githubusercontent.com/prathmj/SocialCapitalTurnoutStudy/master/Social%20Capital%202009%20with%20turnout.csv'
filename2014 = 'https://raw.githubusercontent.com/prathmj/SocialCapitalTurnoutStudy/master/Social%20Capital%202014%20with%20turnout.csv'
data2009 = pd.read_csv(filename2009)
data2014 = pd.read_csv(filename2014)

print("2009:")
print(data2009.head())
print("2014:")
print(data2014.head())

2009:
   fips  statcode     areaname  relig09  civic09  bus09  pol09  prof09  \
0  1001         1  Autauga, AL     50.0      7.0    3.0    NaN     1.0   
1  1003         1  Baldwin, AL    161.0     21.0    7.0    NaN     1.0   
2  1005         1  Barbour, AL     17.0      1.0    1.0    NaN     NaN   
3  1007         1     Bibb, AL     27.0      NaN    1.0    NaN     NaN   
4  1009         1   Blount, AL     42.0      1.0    1.0    NaN     NaN   

   labor09  bowl09  fitns09  golf09  sport09     pop09  respn10  nccs09  \
0      5.0     1.0      4.0     2.0      NaN   50756.0     0.78   182.0   
1      2.0     2.0     18.0     8.0      NaN  179878.0     0.73   737.0   
2      1.0     1.0      1.0     2.0      NaN   29737.0     0.63   107.0   
3      NaN     NaN      1.0     1.0      NaN   21587.0     0.58    59.0   
4      1.0     NaN      3.0     4.0      NaN   58345.0     0.80   121.0   

     assn09   pvote08  
0  1.438254  0.635648  
1  1.223051  0.608996  
2  0.807075  0.512425  
3 

Before we start to train any models, a bit of preprocessing is required, as performed below:

In [7]:
# Impute missing values with 0
data2009 = data2009.fillna(value=0)
data2014 = data2014.fillna(value=0)

# Separate into x and y
X2009 = data2009.iloc[:, :-1]  # features
y2009 = data2009.iloc[:, -1]  # class (turnout)

X2014 = data2014.iloc[:, :-1]  # features
y2014 = data2014.iloc[:, -1]  # class (turnout)

# Encode turnout values as integer class labels (the encoder is refit for each year)
le = LabelEncoder()
le.fit(y2009)
y2009 = le.transform(y2009)
le.fit(y2014)
y2014 = le.transform(y2014)

print("2009:")
print(data2009.head())
print("2014:")
print(data2014.head())

2009:
   fips  statcode     areaname  relig09  civic09  bus09  pol09  prof09  \
0  1001         1  Autauga, AL     50.0      7.0    3.0    0.0     1.0   
1  1003         1  Baldwin, AL    161.0     21.0    7.0    0.0     1.0   
2  1005         1  Barbour, AL     17.0      1.0    1.0    0.0     0.0   
3  1007         1     Bibb, AL     27.0      0.0    1.0    0.0     0.0   
4  1009         1   Blount, AL     42.0      1.0    1.0    0.0     0.0   

   labor09  bowl09  fitns09  golf09  sport09     pop09  respn10  nccs09  \
0      5.0     1.0      4.0     2.0      0.0   50756.0     0.78   182.0   
1      2.0     2.0     18.0     8.0      0.0  179878.0     0.73   737.0   
2      1.0     1.0      1.0     2.0      0.0   29737.0     0.63   107.0   
3      0.0     0.0      1.0     1.0      0.0   21587.0     0.58    59.0   
4      1.0     0.0      3.0     4.0      0.0   58345.0     0.80   121.0   

     assn09   pvote08  
0  1.438254  0.635648  
1  1.223051  0.608996  
2  0.807075  0.512425  
3 
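
The preprocessing above keeps identifier columns such as `areaname` in the feature matrix and passes the continuous turnout column through `LabelEncoder`. If the goal is to regress on turnout rates, one possible alternative is to drop the identifiers and leave the target as a float. The sketch below assumes that `fips`, `statcode`, and `areaname` are identifiers rather than features, and `split_features_target` is a hypothetical helper, not part of the notebook. The three rows are copied from the 2009 preview, with only a subset of columns kept.

```python
import numpy as np
import pandas as pd

# Three rows from the 2009 preview above (subset of columns).
sample = pd.DataFrame({
    "fips": [1001, 1003, 1005],
    "statcode": [1, 1, 1],
    "areaname": ["Autauga, AL", "Baldwin, AL", "Barbour, AL"],
    "relig09": [50.0, 161.0, 17.0],
    "prof09": [1.0, 1.0, np.nan],
    "pvote08": [0.635648, 0.608996, 0.512425],
})

def split_features_target(df, target_col,
                          id_cols=("fips", "statcode", "areaname")):
    """Hypothetical helper: numeric features plus a continuous target."""
    df = df.fillna(0)  # same zero-imputation as the notebook
    y = df[target_col].astype(float)  # keep turnout as a rate, not a label
    X = df.drop(columns=[target_col, *id_cols]).select_dtypes(include="number")
    return X, y

X_sample, y_sample = split_features_target(sample, "pvote08")
print(X_sample.columns.tolist())
print(y_sample.tolist())
```

Keeping the target continuous means predicted and actual turnout stay on the same scale, so their difference can be read directly as a turnout gap.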

Next, we will visualize the data using various graphs: