# Midterm 2: 2024

This midterm has two parts. In the first part, I want you to analyze how well 538's *2022* US House model predicts the winners of the *2024* US House elections.

The second part is based on a real experiment I conducted during the 2022 election.

Before turning in your work, please re-run all of the code and download as a PDF. **Review your PDF** before uploading to make sure everything is printing properly. Once you are finished, **please upload to Canvas**. You are responsible for properly uploading and turning in your work before the end of class. Please make sure we can read and grade all of your code and results. If an answer does not print, you will not get credit. If necessary, it is ok to take screenshots of your answers and upload those to Canvas.

This midterm is open book, notes, prior labs, and Internet. You can use any resource (including ChatGPT and Gemini) except for another live human being.

This midterm has six questions (one in Part 1; five in Part 2). The midterm is out of 100 points, with the following point breakdown:

- Part 1: 25%
- Part 2, Question 1: 15%
- Part 2, Question 2: 15%
- Part 2, Question 3: 15%
- Part 2, Question 4: 15%
- Part 2, Question 5: 15%

If you cannot quite get the code right, you are strongly encouraged to add comments explaining the approach you were attempting. Let us know what you were trying to do and you may be able to get partial credit.

You have the entire 110 minutes of class time to complete this exam. Good luck!

## Part 1: 538's Model
I am giving you [538's 2022 model predictions](https://projects.fivethirtyeight.com/2022-election-forecast/house/) from October 1, 2022. I want you to determine how good 538's 2022 model was at predicting the 2024 US House elections. (Not a typo. I am giving you 2022 predictions and 2024 election results for you to test how stable predictions and election results are across cycles.)

The first file (`538_data.csv`) has two columns.

| Variable Name      | Description |
| ----------- | ----------- |
| `district`      | The district, written as state abbreviation and district number, e.g. `AL-1` (there are 435 districts)       |
| `prob_rep_wins`   | A 0-1 probability the Republican wins the election        |

The second file (`winners_2024_ush.csv`) has three columns.

| Variable Name      | Description |
| ----------- | ----------- |
| `state` | The congressional district's state |
| `cd`      | The district number (there are 410 districts)       |
| `winner`   | The party of the actual winner        |

For this task, you first need to find a way to merge `538_data.csv` with `winners_2024_ush.csv`. Note that `winners_2024_ush.csv` has only 410 districts because, at the time this midterm was written, ballots were still being counted. For this exercise, include only the 410 districts with a declared winner in your analysis. You can drop the remaining 25 districts.
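One possible way to build a shared key is sketched below on a few toy rows rather than the real files. It assumes that `cd` is stored as an integer and that pasting `state` and `cd` together reproduces 538's `district` labels exactly; that assumption should be checked on the real data (for example, by confirming that the merged table has 410 rows).

```python
import pandas as pd

# Toy stand-ins for the two files; only a few rows, for illustration.
data538 = pd.DataFrame({
    "district": ["AK-1", "AL-1", "AL-2", "AL-3"],
    "prob_rep_wins": [0.373125, 0.999775, 0.999975, 1.0],
})
results24 = pd.DataFrame({
    "state": ["AL", "AL", "AL"],
    "cd": [1, 2, 3],
    "winner": ["R", "D", "R"],
})

# Build a key in the same "ST-N" format used by the forecast file.
results24["district"] = results24["state"] + "-" + results24["cd"].astype(str)

# An inner merge keeps only districts present in both tables, so forecast
# rows without a declared winner are dropped. validate= raises an error
# if a district unexpectedly appears more than once on either side.
merged = data538.merge(results24, on="district", how="inner",
                       validate="one_to_one")
print(merged)
```

In this toy example, `AK-1` has no matching row in the results table, so it does not appear in `merged`.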

**Analyze the 538 model. How did it perform?** Your answer should be a mix of both code and interpretation.
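As a hedged starting point, two common summaries of a probabilistic forecast are its accuracy (how often the favored party won) and its Brier score (the mean squared gap between the predicted probability and the 0/1 outcome; lower is better). The sketch below uses made-up values in a hypothetical `merged` table and assumes `winner` is coded `R` for Republican wins; these are not the only reasonable metrics.

```python
import numpy as np
import pandas as pd

# Toy merged table with made-up values, for illustration only.
merged = pd.DataFrame({
    "prob_rep_wins": [0.9, 0.8, 0.3, 0.1],
    "winner": ["R", "D", "D", "D"],
})

# 1 if the Republican actually won, 0 otherwise.
rep_won = (merged["winner"] == "R").astype(int)

# Treat a probability above 0.5 as a predicted Republican win.
pred_rep = (merged["prob_rep_wins"] > 0.5).astype(int)

# Share of districts where the favored party won.
accuracy = (pred_rep == rep_won).mean()

# Brier score: mean squared error of the predicted probabilities.
brier = ((merged["prob_rep_wins"] - rep_won) ** 2).mean()

print(f"accuracy = {accuracy:.3f}, Brier score = {brier:.4f}")
```

Accuracy ignores how confident each prediction was, while the Brier score penalizes confident misses more heavily, so reporting both gives a fuller picture.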

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
data538 = pd.read_csv("https://raw.githubusercontent.com/joshuakalla/data_science_campaigns/refs/heads/master/data/538_data.csv")
data538.head()

|  | district | prob_rep_wins |
| ----------- | ----------- | ----------- |
| 0 | AK-1 | 0.373125 |
| 1 | AL-1 | 0.999775 |
| 2 | AL-2 | 0.999975 |
| 3 | AL-3 | 1.0 |
| 4 | AL-4 | 1.0 |


In [None]:
results24 = pd.read_csv("https://raw.githubusercontent.com/joshuakalla/data_science_campaigns/refs/heads/master/data/winners_2024_ush.csv")
results24.head()

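|  | state | cd | winner |
| ----------- | ----------- | ----------- | ----------- |
| 0 | AL | 1 | R |
| 1 | AL | 2 | D |
| 2 | AL | 3 | R |
| 3 | AL | 4 | R |
| 4 | AL | 5 | R |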


## Part 2: Experiment

In 2022, I conducted a randomized experiment in partnership with a non-profit organization to increase the voter turnout of US citizens living abroad in Canada. In this experiment, registered US voters living abroad in Canada were randomly assigned to three experimental conditions:

- Control: no contact from the partner organization (`treat = Control`).
- How to Mailer: one letter from the partner organization that provided instructions on how to vote from abroad (`treat = HowToOnly`).
- How to Mailer + GOTV Mailer: The above "How To" mailer and a second get-out-the-vote reminder mailer (`treat = BothMailers`).

After the election, I consulted the publicly-available state voter files to measure whether the mailers increased voter turnout.

This part of the midterm has five questions:

1. Was the experiment properly implemented? You should check for pre-treatment covariate balance across the three experimental conditions (`Control`; `HowToOnly`; `BothMailers`).
2. Did the `HowToOnly` condition increase turnout compared to `Control`? Is this increase statistically significant?
3. Did the `BothMailers` condition increase turnout compared to `Control`? Is this increase statistically significant?
4. Did the `BothMailers` condition increase turnout compared to `HowToOnly`? Is this increase statistically significant?
5. Compared to `Control`, what is the cost per vote of the `HowToOnly` condition? What is the cost per vote of the `BothMailers` condition? `HowToOnly` cost \$0.50 per person treated and `BothMailers` cost \$1 per person treated.

You must use permutation inference when calculating statistical significance. You can use either one-tailed or two-tailed tests when calculating your p-values.
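A minimal sketch of permutation inference for a difference in means is shown below, using a toy outcome vector rather than the experiment data. The function name `permutation_p_value`, the number of permutations, and the seed are illustrative choices, not requirements of the exam.

```python
import numpy as np

def permutation_p_value(outcome, in_treatment, n_perm=5000, seed=0):
    """Return the observed difference in means and a two-tailed p-value."""
    outcome = np.asarray(outcome, dtype=float)
    in_treatment = np.asarray(in_treatment, dtype=bool)

    # Observed treatment-minus-comparison difference in mean outcomes.
    observed = outcome[in_treatment].mean() - outcome[~in_treatment].mean()

    rng = np.random.default_rng(seed)
    perm_diffs = np.empty(n_perm)
    for i in range(n_perm):
        # Shuffle the labels, keeping group sizes fixed, to mimic a world
        # in which assignment has no effect on the outcome.
        shuffled = rng.permutation(in_treatment)
        perm_diffs[i] = outcome[shuffled].mean() - outcome[~shuffled].mean()

    # Share of shuffled differences at least as extreme as the observed one.
    p_two_tailed = np.mean(np.abs(perm_diffs) >= abs(observed))
    return observed, p_two_tailed

# Toy data: 10 people, the first 5 assigned to treatment.
voted = [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]
treated = [True] * 5 + [False] * 5
observed, p_value = permutation_p_value(voted, treated)
print(f"difference = {observed:.2f}, p = {p_value:.3f}")
```

To compare two of the three conditions, one option is to first subset the data to rows in just those two conditions and then pass the turnout column and a True/False indicator for one of them to a function like this.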

To get started, you will need the file `experiment.csv` for this midterm. Below we load this file and inspect it.

In [None]:
experiment = pd.read_csv("https://raw.githubusercontent.com/joshuakalla/data_science_campaigns/refs/heads/master/data/experiment.csv")

In [None]:
experiment.head()

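|  | bl_id | vf_age | vf_vote_gen_12 | vf_vote_gen_16 | vf_vote_gen_18 | vf_vote_gen_20 | vf_female | vf_white | voted_2022 | treat |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| 0 | 1 | 68 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | BothMailers |
| 1 | 2 | 46 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | HowToOnly |
| 2 | 3 | 56 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | BothMailers |
| 3 | 4 | 51 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | HowToOnly |
| 4 | 5 | 63 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | HowToOnly |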


| Variable Name      | Description |
| ----------- | ----------- |
| `bl_id`      | A unique identifier for each person      |
| `vf_age`   | An individual's age, from the voter file.       |
| `vf_vote_gen_12`   | An individual's turnout in the 2012 general election, from the voter file. One of: 0 (did not vote) or 1 (voted)      |
| `vf_vote_gen_16`   | An individual's turnout in the 2016 general election, from the voter file. One of: 0 (did not vote) or 1 (voted)      |
| `vf_vote_gen_18`   | An individual's turnout in the 2018 general election, from the voter file. One of: 0 (did not vote) or 1 (voted)      |
| `vf_vote_gen_20`   | An individual's turnout in the 2020 general election, from the voter file. One of: 0 (did not vote) or 1 (voted)      |
| `vf_female`   | An indicator if the voter is female, from the voter file.       |
| `vf_white`   | An indicator if the voter is white, from the voter file.       |
| `voted_2022`   | An individual's turnout in the 2022 general election, from the voter file. One of: 0 (did not vote) or 1 (voted)      |
| `treat`   | Experimental condition, see above for description     |

## Part 2, Question 1: Balance Check

Do the three experimental conditions look similar to one another? Conduct a balance check on the pre-treatment covariates. A complete answer will provide a table and a few sentences answering the question (interpretation).
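One simple form of balance table reports the mean of each pre-treatment covariate within each condition. The sketch below builds such a table on a small made-up data frame. It assumes that every column whose name starts with `vf_` is a pre-treatment covariate, which matches the variable descriptions above, and it deliberately excludes the outcome `voted_2022`.

```python
import pandas as pd

# Toy data with made-up values and a subset of the real column names.
experiment = pd.DataFrame({
    "treat": ["Control", "Control", "HowToOnly",
              "HowToOnly", "BothMailers", "BothMailers"],
    "vf_age": [60, 40, 50, 52, 45, 55],
    "vf_female": [1, 0, 1, 1, 0, 1],
    "voted_2022": [0, 1, 0, 1, 1, 1],
})

# Pre-treatment covariates: voter-file columns, not the 2022 outcome.
covariates = [c for c in experiment.columns if c.startswith("vf_")]

# Mean of each covariate by condition; transpose so that covariates are
# rows and the three conditions are columns.
balance = experiment.groupby("treat")[covariates].mean().T
print(balance)
```

If randomization worked, the means in each row should be close to one another across the three columns; large gaps would be worth investigating further.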

## Part 2, Question 2: Did the `HowToOnly` condition increase turnout compared to `Control`? Is this increase statistically significant?

Answer the question. A complete answer will provide a numerical answer and a few sentences answering the question (interpretation).

## Part 2, Question 3: Did the `BothMailers` condition increase turnout compared to `Control`? Is this increase statistically significant?

Answer the question. A complete answer will provide a numerical answer and a few sentences answering the question (interpretation).

## Part 2, Question 4: Did the `BothMailers` condition increase turnout compared to `HowToOnly`? Is this increase statistically significant?

Answer the question. A complete answer will provide a numerical answer and a few sentences answering the question (interpretation).

## Part 2, Question 5: Compared to `Control`, what is the cost per vote of the `HowToOnly` condition? What is the cost per vote of the `BothMailers` condition? `HowToOnly` cost \$0.50 per person treated and `BothMailers` cost \$1 per person treated.

Answer the question. A complete answer will provide a numerical answer and a few sentences answering the question (interpretation).
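A common way to define cost per vote is the cost per person treated divided by the treatment effect on turnout (the turnout rate in the treated condition minus the turnout rate in `Control`). The sketch below uses made-up turnout rates, not results from this experiment, and the helper name `cost_per_vote` is hypothetical.

```python
def cost_per_vote(cost_per_person, turnout_treated, turnout_control):
    """Dollars spent per additional vote, given turnout rates in [0, 1]."""
    effect = turnout_treated - turnout_control
    if effect <= 0:
        # With no positive effect, no spending level produces extra votes.
        return float("inf")
    return cost_per_person / effect

# Made-up example: a $0.50 mailer that raises turnout from 10% to 12%
# costs roughly $25 per additional vote.
example = cost_per_vote(0.50, 0.12, 0.10)
print(f"${example:.2f} per additional vote")
```

Because the estimated effect sits in the denominator, small or noisy effects produce very large and unstable cost-per-vote figures, which is worth mentioning in the interpretation.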