# Lab 7: Predicting Voter Turnout

---

Welcome to Lab 7! This week, we will be constructing models to predict voter turnout. We will be using the same dataset as last week's lab (note that I removed `treat`).

# Before You Begin: Disable AI Assistance

To ensure you learn the concepts and complete this assignment based on your own understanding, you are *strongly* encouraged to turn off Google Colab's built-in generative AI features before you begin. This will also help you prepare for the midterms and final project.

**Follow these steps:**
1.  Go to the **Edit** menu at the top of the page.
2.  Click on **Notebook settings**.
3.  Check the box next to **"Hide generative AI features"**.
4.  Click **Save**.

This will prevent Google Gemini from suggesting or writing code for you, allowing you to focus on solving the problems yourself.




# Set-up

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("https://raw.githubusercontent.com/joshuakalla/data_science_campaigns/master/Colab/Lab7/gerber_huber_2014_data.csv")
data.head()

# 1. Model-based predictions





Logistic Regression is a classification algorithm that is used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (voted, yes, success, etc.) or 0 (did not vote, no, failure, etc.). In other words, the logistic regression model predicts P(Y=1) (probability that a person voted) as a function of X (many different covariates).


(You could use other algorithms to build a model. We are using logistic regression because it is fast and easy.)

Our goal is to predict whether an individual voter *i* voted in the 2014 election as a function of other features we know about them (age, past vote history, race, and gender). With a model of voting in the 2014 election, a campaign in the future (such as in 2018) could better target voters.

To get started, let's see whether there are any big differences between people who voted in 2014 and people who did not. Run the code below.

In [None]:
data.groupby('voted14').mean()

**Question 1.** What do you find? Do any individual variables seem to be more or less predictive of voting in 2014? Interpret the above table.

**Answer the question here.**

We are now going to build a simple model. We first need to define our outcome variable (did someone vote in 2014 or not, call this *Y*) and our set of predictor variables (call this a matrix *X*). In this first simple case, we will only include `voted12` and `age` in X.

In [None]:
cols = ['voted12', 'age', 'intercept']
data['intercept'] = 1 # add a column of 1's for the intercept term
X = data[cols]
X.head()

In [None]:
y = data['voted14']
y.head()

We will now build our model. Note that we are now using a new module in Python called `statsmodels`. You can learn more about this module [here](https://www.statsmodels.org/stable/index.html).

In [None]:
import statsmodels.api as sm
## ignore the warning; nothing to worry about
logit_model = sm.Logit(y, X)
result = logit_model.fit()
print(result.summary2())

**Question 2.** Can you interpret this output? What do you think this means? (You might want to give [this](https://www.juanshishido.com/logisticcoefficients.html) a read.)

**Answer the question here.**

With the below code, we can construct model scores. Remember that the output of this can be understood as the probability that someone votes in 2014 as a function of their age and whether they voted in 2012.

In [None]:
y_pred = result.predict(X)
y_pred.head()

**Question 3.** Make a plot showing the relationship between your predicted model (`y_pred`) and whether someone actually voted in 2014. In text, make sure you interpret this plot.

*Hint.* You will probably want to use the [`pandas.cut()` function](https://www.geeksforgeeks.org/pandas-cut-method-in-python/), which lets you create bins of `y_pred`. Remember the figures we discussed in the Likely Voter slides; you will probably want to make a similar plot, like the figure below. (You don't need to copy it exactly; it is just an example of what you could do.) **This is a challenging question.**

**This example shows more or less how the final plots should look.**

![](https://raw.githubusercontent.com/joshuakalla/data_science_campaigns/master/Colab/Lab7/sample_plot.png)
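As a reminder of how binning works, here is a minimal sketch on made-up data (not the lab data set). The names `toy`, `score`, `outcome`, and `bin_rates` are hypothetical and chosen only for illustration; the simulated outcomes are constructed so that the true probability equals the score.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT the lab data set
rng = np.random.default_rng(0)
toy = pd.DataFrame({"score": rng.uniform(0, 1, size=1000)})
# Simulate 0/1 outcomes whose true probability equals the score
toy["outcome"] = (rng.uniform(0, 1, size=1000) < toy["score"]).astype(int)

# Cut the scores into 5 equal-width bins between 0 and 1
bin_edges = np.linspace(0, 1, 6)
toy["score_bin"] = pd.cut(toy["score"], bins=bin_edges, include_lowest=True)

# Share of positive outcomes within each bin
bin_rates = toy.groupby("score_bin", observed=False)["outcome"].mean()
print(bin_rates)
```

If the scores are informative, the share of positive outcomes should rise from the lowest bin to the highest. Turning a table like this into a plot, and doing the same with `y_pred`, is left to you.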


In [None]:
# insert plot and interpretation

# 2. Confusion matrix and model metrics

To more formally assess this model, let's make a 2x2 confusion matrix, like we did in class. The confusion matrix needs to have 4 cells: number of true positives, number of true negatives, number of false positives, number of false negatives. (If you want a reminder on a confusion matrix, see the slides from lecture or read [this](https://www.python-course.eu/confusion_matrix.php).)

For this exercise, we are going to define our threshold as 0.5. That means that if someone's predicted turnout (`y_pred`) is greater than 0.5, we are going to say that we predicted they vote. If their score is less than or equal to 0.5, we are going to say we predicted they did not vote. (Note that this threshold is somewhat arbitrary. We could define different cut-offs.)
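To make the threshold rule concrete, here is a small sketch on six made-up scores (not the lab data). The names `scores`, `actual`, `predicted`, and `table` are hypothetical. It uses `pd.crosstab`, which is one of several ways to tabulate actual against predicted values.

```python
import pandas as pd

# Hypothetical scores and outcomes for six people (NOT the lab data)
scores = pd.Series([0.10, 0.45, 0.50, 0.51, 0.80, 0.95])
actual = pd.Series([0, 1, 0, 1, 0, 1])

threshold = 0.5
# Strictly greater than the threshold counts as a predicted positive,
# so a score of exactly 0.50 is a predicted negative
predicted = (scores > threshold).astype(int)

# Rows are actual values, columns are predicted values
table = pd.crosstab(actual, predicted, rownames=["actual"], colnames=["predicted"])
print(table)
```

Each of the four cells in the printed table corresponds to one of the four outcomes described above; working out which is which is part of Question 4.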

**Question 4.** Make the confusion matrix. Calculate the number of true positives, number of true negatives, number of false positives, number of false negatives. Fill in the table.

In [None]:
# insert your code here

Fill in the below table with the correct numbers.

- Actual Negative and Predicted Negative (what do we call this?):
- Actual Negative and Predicted Positive (what do we call this?):
- Actual Positive and Predicted Negative (what do we call this?):
- Actual Positive and Predicted Positive (what do we call this?):

**Question 5.** Based on this table, calculate the model's accuracy, precision, and recall. Interpret this.

In [None]:
# insert your code here

**Put your interpretation here.**

**Question 6.** Can a different model do better? It is now your turn to build a model from scratch. Follow the same steps as above. Select your predictor variables. Build your model. Construct a confusion matrix. Calculate accuracy, precision, and recall. How does this model do? Does it do better or worse than the original model? Would you use it?

In [None]:
# insert your code here

**Put your interpretation here.**

**Question 7.** Find a way to plot the differences between the two models we built.

In [None]:
# insert your code here

# 3. Are we overfitting?

If you don't remember what overfitting means, review your notes from class or give this a read: https://www.ibm.com/topics/overfitting

Let's see how these models do out-of-sample. Are we overfitting?

To test this, you will need to run your model on a new data set (test set). You can then assess the confusion matrix and accuracy/precision/recall of your model on that test set.

Note that a logistic regression is written as:

$p(x) = \frac{1}{1+e^{-(\beta_0 + \beta_1 x)}}$, where $e$ is Euler's number, approximately 2.71828. In Python, `np.exp(z)` computes $e^z$ (the constant itself is available as `np.e`).
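Here is a minimal sketch of that formula in code. The helper name `logistic` and the coefficient values `beta_0` and `beta_1` are hypothetical and chosen only for illustration; they are not estimates from any model in this lab.

```python
import numpy as np

def logistic(z):
    """Map a linear predictor z to a probability between 0 and 1."""
    return 1 / (1 + np.exp(-z))

# Hypothetical coefficients, chosen only for illustration
beta_0, beta_1 = -1.0, 0.5
x = np.array([0.0, 2.0, 10.0])

# Linear predictor first, then the logistic transformation
p = logistic(beta_0 + beta_1 * x)
print(p)

# np.exp(1) and np.e are the same constant
print(np.isclose(np.exp(1), np.e))
```

When the linear predictor is exactly 0 (here, at `x = 2`), the predicted probability is 0.5; larger values of the linear predictor push the probability toward 1.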

We will need to manually re-create this logistic regression. From my model, the coefficients are:

- voted12: 2.1814
- age: 0.0145
- intercept: -2.7879

We could therefore recreate our logistic regression by calculating:

$p(\text{voted14}) = \frac{1}{1+e^{-(-2.7879 + 0.0145 \cdot \text{age} + 2.1814 \cdot \text{voted12})}}$

Let's see this in action.

In [None]:
# Calculate y_pred by hand; call it y_pred2
y_pred2 = 1/(1+np.exp(-(-2.787942 + 2.181396 *X["voted12"] + .0145461 * X["age"])))

# Confirm that y_pred is the same as y_pred2
y_preds = pd.concat([pd.DataFrame(y_pred), pd.DataFrame(y_pred2)], axis = 1)
y_preds.columns = ['y_pred', 'y_pred2']
y_preds['diff'] = abs(y_preds['y_pred'] - y_preds['y_pred2'])  # absolute difference between the two predictions

# Print out this data frame
print(y_preds.head())
print(y_preds["diff"].mean()) # Small differences due to rounding, but this is essentially a 0
# These columns are the same

I now want you to test if your model is over-fit. Don't cheat by recreating your model. Just use what you have above. I care about the process, not the actual performance.

The data you used above came from South Dakota. I want you to see how the model performs in Wisconsin.

In [None]:
test = pd.read_csv("https://raw.githubusercontent.com/joshuakalla/data_science_campaigns/master/Colab/Lab7/test_data.csv")
test.head()

**Question 8.** Evaluate how your model performs out-of-sample using `test_data.csv` and the coefficients from your model. Create a confusion matrix and calculate the accuracy/precision/recall. How does your model perform out-of-sample?

# Congratulations!

You are done with the lab. Before you finish and submit, please fill out this brief evaluation:

- I spent around XXXX hours on this lab.
- This lab was (too easy, too hard, just about the right difficulty).

All assignments in the course will be distributed as notebooks like this one, and you will submit your work as a PDF.  

# How to Convert your Colab notebook to a PDF and download it

Follow these instructions exactly to make sure your notebook is correctly converted to a PDF and saved to your computer.

---

## 1. Check everything is in order and all the code runs

Before starting the conversion process, make sure your notebook is complete and error-free.

1. **Open your notebook** in Google Colab.
2. **Save your work**:
   - Go to **File → Save** or press `Ctrl + S` (Windows) / `Cmd + S` (Mac).
   - This ensures that all your recent changes are stored.
3. **Run all the cells** to confirm there are no errors:
   - In the top menu, select **Runtime → Run all**.
   - Colab will execute every cell in order.
4. **Watch for errors**:
   - If you see any red error messages, fix them before proceeding.
   - A notebook with errors will **not** convert to PDF correctly.
5. Once all cells run without errors, you can proceed to the next step.

---

## 2. Make sure your notebook is saved in Google Drive and named correctly
Before converting, confirm that your notebook is stored in your Google Drive inside the `Colab Notebooks` folder.

1. Look at the top-left corner of the Colab page — you’ll see the notebook’s current name next to two yellow circle icons.
2. Rename the file by clicking directly on the name, typing the new name, and pressing **Enter**.
  - Please rename the notebook to `LASTNAME_FIRSTNAME_LAB#.ipynb`. So for this lab, I would call it `Alberto_Stefanelli_Lab7.ipynb`.
  - The name must end with `.ipynb`
3. Ensure the notebook is in your Google Drive (not in Colab’s temporary session storage):  
   - In the menu, click **File → Locate in Drive**.  
   - This will open the folder in Drive where the notebook is stored.  
   - If it’s not in `My Drive/Colab Notebooks`, move it there for easier access.

---

## 3. Install the required tools

We wrote some code (see below) to automatically convert your notebook to PDF. When you run the provided code cell, the first step will install some essential pieces of software (i.e., Pandoc and LaTeX) inside your Colab environment. You do not need to understand exactly what is happening.

**Important:**
- This installation will take between 2 and 5 minutes.
- Do **not** close or refresh the Colab page while it runs.

---

## 4. Mount your Google Drive in Colab

After installing the requirements, the code will ask Colab to **mount** your Google Drive. You can use either your personal account or your Yale account. This is needed because the notebook you are converting must be saved in Drive before it can be converted to PDF.

You will see a pop-up with a **link**:
1. Click the link.
2. Sign in to your Google account (use the same account where your Colab notebook is saved).
3. If prompted, copy the long **authorization code** provided.
  - Paste that code into the input box in Colab and press **Enter**.
4. This will connect your Google Drive to Colab.
5. If the link does not appear, make sure your browser is not blocking pop-ups.

---

## 5. Enter your notebook’s file name

The code will now ask you to enter your notebook’s exact file name.

1. Type the **full name** of your notebook, including the `.ipynb` ending.  
  - Example: `Alberto_Stefanelli_Lab7.ipynb`
2. Make sure the name matches exactly, including capitalization and underscores.
3. Press **Enter**.

---

## 6. Convert the notebook to PDF and download it

The code will convert the notebook file to a PDF. After the PDF is created, your browser will show a download pop-up or automatically save the file to your Downloads folder. You can now open the PDF with any PDF reader. Once you have your PDF and have made sure everything is in order, you can upload it to Canvas.

---

## 7. If you see an error:

Double-check that:

- All cells run without errors.
- Google Drive mounted without errors.
- The notebook is saved to Google Drive, not locally or on GitHub.
- You entered the correct notebook file name, ending in `.ipynb`.
- The conversion completed without any errors.
- The PDF downloaded to your computer (downloads and pop-ups are not blocked by your browser).


**Fallback:** If the PDF export method below fails (for example, due to LaTeX or pandoc errors), you can use https://convert.ploomber.io/ as a fallback option. However, I strongly suggest trying the method below first and using this fallback only as a last resort.

**If you run into any issues, please reach out for help.**

In [None]:
# Install requirements
!apt-get -qq update
!apt-get install -y pandoc texlive-xetex texlive-fonts-recommended texlive-plain-generic

from google.colab import drive, files

# Mount Google Drive
drive.mount('/content/drive')

# Ask for the notebook name
notebook_name = input(
    "Enter your notebook’s exact file name,\n"
    "exactly as shown in the top-left corner of the Colab page (next to the two yellow circle icons): "
)

# Build paths
input_path = f"/content/drive/MyDrive/Colab Notebooks/{notebook_name}"
output_path = input_path.replace(".ipynb", ".pdf")

# Convert to PDF
!jupyter nbconvert --to pdf "{input_path}"

# Download the PDF
files.download(output_path)
