# Lab 7: Predicting Voter Turnout

Welcome to Lab 7! This week, we will build models to predict voter turnout, using the same dataset as last week's lab (note that I removed the `treat` column).

In [17]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("gerber_huber_2014_data.csv")
data.head()

Unnamed: 0,id,voted06,voted08,voted09,voted10,voted11,voted12,voted13,voted14,age,race_afam,race_hispanic,race_other,race_white,female
0,1989703,0,0,0,0,0,0,0,0,26.0,0,0,0,1,1
1,555323,0,0,0,0,0,0,0,0,37.0,0,0,0,1,1
2,915202,1,1,0,1,0,0,0,1,26.0,1,0,0,0,0
3,839095,0,1,0,0,0,0,0,1,46.0,0,0,0,1,1
4,197647,0,0,0,0,0,0,0,0,21.0,0,0,0,1,0


Logistic regression is a classification algorithm used to predict the probability of a binary outcome. The dependent variable is coded as 1 (voted, yes, success, etc.) or 0 (did not vote, no, failure, etc.). In other words, the logistic regression model predicts P(Y=1), the probability that a person voted, as a function of X (a set of covariates).
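As a minimal sketch of what "predicts P(Y=1) as a function of X" means, the snippet below passes a linear combination of covariates through the logistic (sigmoid) function. The function name `logistic` and the coefficient and covariate values are made up for illustration; they are not taken from the lab's data or model.

```python
import numpy as np

def logistic(z):
    """Map any real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and covariates, chosen only for illustration.
beta = np.array([0.8, -0.02])   # one coefficient per covariate
x = np.array([1.0, 40.0])       # e.g. a binary covariate and an age

z = x @ beta                    # linear predictor (about 0 for these values)
p = logistic(z)                 # P(Y=1 | x)
print(p)
```

A linear predictor of zero corresponds to a probability of one half; larger positive values push the probability toward 1, and larger negative values push it toward 0.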

(You could use other algorithms to build a model. We are using logistic regression because it is fast and easy.)

Our goal is to predict whether an individual voter *i* voted in the 2014 election as a function of other features we know about them (age, past vote history, race, and gender). With a model of voting in the 2014 election, a campaign in the future (such as in 2018) could better target voters.

To get started, let's see if there are any big differences between people who voted in 2014 and people who did not. Run the code below.

In [18]:
data.groupby('voted14').mean()

voted14,id,voted06,voted08,voted09,voted10,voted11,voted12,voted13,age,race_afam,race_hispanic,race_other,race_white,female
0,1202580.0,0.227614,0.409396,0.0,0.161182,0.0,0.216056,0.0,38.491964,0.049669,0.027084,0.100088,0.823158,0.698015
1,1214615.0,0.396483,0.661373,0.0,0.470221,0.0,0.723908,0.0,45.64962,0.04325,0.017867,0.143222,0.795661,0.766024


**Question 1.** What do you find? Do any individual variables seem to be more or less predictive of voting in 2014? Interpret the above table.

**Answer the question here.**

We are now going to build a simple model. We first need to define our outcome variable (did someone vote in 2014 or not, call this *Y*) and our set of predictor variables (call this a matrix *X*). In this first simple case, we will only include `voted12` and `age` in X.

In [29]:
cols = ['voted12', 'age']
X = data[cols]
X.head()

Unnamed: 0,voted12,age
0,0,26.0
1,0,37.0
2,0,26.0
3,0,46.0
4,0,21.0


In [20]:
y = data['voted14']
y.head()

0    0
1    0
2    1
3    1
4    0
Name: voted14, dtype: int64

We will now build our model. Note that we are now using a new module in Python called `statsmodels`. You can learn more about this module [here](https://www.statsmodels.org/stable/index.html).

In [21]:
import statsmodels.api as sm

## ignore the warning; nothing to worry about
# Note: sm.Logit does not add an intercept automatically, so this model has no constant term.
logit_model = sm.Logit(y, X)
result = logit_model.fit()
print(result.summary2())

Optimization terminated successfully.
         Current function value: 0.531522
         Iterations 6
                         Results: Logit
Model:              Logit            No. Iterations:   6.0000    
Dependent Variable: voted14          Pseudo R-squared: 0.030     
Date:               2019-03-04 13:44 AIC:              31599.7950
No. Observations:   29722            BIC:              31616.3943
Df Model:           1                Log-Likelihood:   -15798.   
Df Residuals:       29720            LL-Null:          -16285.   
Converged:          1.0000           Scale:            1.0000    
------------------------------------------------------------------
              Coef.   Std.Err.     z      P>|z|    [0.025   0.975]
------------------------------------------------------------------
voted12       1.7948    0.0300   59.8963  0.0000   1.7361   1.8535
age          -0.0395    0.0005  -81.6226  0.0000  -0.0405  -0.0386



**Question 2.** Can you interpret this output? What do you think this means? (You might want to give [this](https://www.juanshishido.com/logisticcoefficients.html) a read.)

**Answer the question here.**

With the code below, we can construct model scores. Each score can be read as the predicted probability that someone voted in 2014, given their age and whether they voted in 2012.

In [22]:
y_pred = result.predict(X)
y_pred.head()

0    0.263463
1    0.188010
2    0.263463
3    0.139572
4    0.303572
dtype: float64
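To see where these scores come from, the sketch below recomputes the first five predictions by hand from the rounded coefficients printed in the model summary (1.7948 for `voted12`, -0.0395 for `age`; the summary lists no intercept term). Because the coefficients are rounded to four decimals, the results should match `result.predict(X)` only approximately. The variable names are illustrative.

```python
import numpy as np

# Rounded coefficients copied from the model summary.
coef_voted12 = 1.7948
coef_age = -0.0395

# The first five rows of X shown earlier (voted12, age).
voted12 = np.array([0, 0, 0, 0, 0])
age = np.array([26.0, 37.0, 26.0, 46.0, 21.0])

# Linear predictor, then the logistic function to turn it into a probability.
z = coef_voted12 * voted12 + coef_age * age
manual_pred = 1.0 / (1.0 + np.exp(-z))
print(manual_pred.round(3))
```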

**Question 3.** Make a plot showing the relationship between the model's predicted probabilities (`y_pred`) and whether someone actually voted in 2014. Make sure you interpret this plot in text.

In [23]:
# insert plot and interpretation

To more formally assess this model, let's make a 2x2 confusion matrix, like we did in class. The confusion matrix needs to have 4 cells: number of true positives, number of true negatives, number of false positives, number of false negatives. (If you want a reminder on a confusion matrix, see the slides from lecture or read [this](https://www.python-course.eu/confusion_matrix.php).)

For this exercise, we are going to define our threshold as 0.5. That means that if someone's predicted turnout (`y_pred`) is greater than 0.5, we will say the model predicts that they voted. If their score is less than or equal to 0.5, we will say the model predicts that they did not vote. (Note that this threshold is somewhat arbitrary. We could define different cut-offs.)
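As a rough illustration of the thresholding rule, the sketch below applies a 0.5 cut-off to a handful of made-up scores and counts the four confusion-matrix cells. The arrays `toy_pred` and `toy_actual` are hypothetical and are not drawn from the lab data.

```python
import numpy as np

# Made-up predicted probabilities and true outcomes (not from the lab data).
toy_pred = np.array([0.9, 0.6, 0.5, 0.4, 0.2, 0.7])
toy_actual = np.array([1, 0, 1, 0, 0, 1])

threshold = 0.5
# Strictly greater than the threshold counts as a predicted voter,
# so a score of exactly 0.5 is labeled 0.
toy_label = (toy_pred > threshold).astype(int)

tp = int(((toy_label == 1) & (toy_actual == 1)).sum())  # true positives
tn = int(((toy_label == 0) & (toy_actual == 0)).sum())  # true negatives
fp = int(((toy_label == 1) & (toy_actual == 0)).sum())  # false positives
fn = int(((toy_label == 0) & (toy_actual == 1)).sum())  # false negatives
print(tp, tn, fp, fn)
```

The four counts always sum to the number of observations, which is a quick sanity check on any confusion matrix.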

**Question 4.** Make the confusion matrix. Calculate the number of true positives, number of true negatives, number of false positives, number of false negatives. Fill in the table.

In [24]:
# insert your code here

Fill in the table below with the correct numbers.

|                 | Predicted Negative | Predicted Positive |
|-----------------|--------------------|--------------------|
| **Actual Negative** | ?                  | ?                  |
| **Actual Positive** | ?                  | ?                  |

**Question 5.** Based on this table, calculate the model's accuracy, precision, and recall. Interpret these results.
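As a reminder of the standard definitions, the sketch below computes the three metrics from confusion-matrix counts. The helper name `classification_metrics` and the counts passed to it are hypothetical; substitute your own numbers from the table above.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    # Accuracy: share of all predictions that were correct.
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Precision: share of predicted positives that were actual positives.
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    # Recall: share of actual positives that were predicted positive.
    recall = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
    return accuracy, precision, recall

# Hypothetical counts, not the lab's actual results.
acc, prec, rec = classification_metrics(tp=40, tn=30, fp=10, fn=20)
print(acc, prec, rec)
```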

In [25]:
# insert your code here

**Put your interpretation here.**

**Question 6.** Can a different model do better? It is now your turn to build a model from scratch. Follow the same steps as above. Select your predictor variables. Build your model. Construct a confusion matrix. Calculate accuracy, precision, and recall. How does this model do? Does it do better or worse than the original model? Would you use it?

In [26]:
# insert your code here

**Put your interpretation here.**

**Question 7.** Find a way to plot the differences between the two models we built.

In [27]:
# insert your code here

# Congratulations!

You are done with the lab. Before you finish and submit, please fill out this brief evaluation:

- I spent around XXXX hours on this lab.
- This lab was (too easy, too hard, just about the right difficulty).

**To turn in your lab, you will need to submit a PDF through Canvas. You can download a notebook by opening it, turning Edit mode on, then navigating to File -> Download as -> PDF.**