# Lab 6: Analyzing an Experiment

Welcome to Lab 6! This week, we will go over analyzing an experiment. Much of this material is covered in [Chapter 12](https://www.inferentialthinking.com/chapters/12/Comparing_Two_Samples.html) of the textbook as well as Chapter 2 of Gerber and Green.

For this lab, we are going to be re-analyzing the experiment presented in "[The Generalizability of Social Pressure Effects on Turnout Across High-Salience Electoral Contexts: Field Experimental Evidence From 1.96 Million Citizens in 17 States](https://journals.sagepub.com/doi/10.1177/1532673X16686556)" by Alan Gerber, Greg Huber, Albert Fang, and Andrew Gooch.

Here is the abstract of the paper:

> Prior experiments show that campaign communications revealing subjects’ past turnout and applying social pressure to vote (the “Self” treatment) increase turnout. However, nearly all existing studies are conducted in low-salience elections, raising concerns that published findings are not generalizable and are an artifact of sample selection and publication bias. Addressing the need for further replication in high-salience elections, we analyze a field experiment involving 1.96 million subjects where a nonpartisan campaign randomly sent Self treatment mailers, containing a subject’s vote history and a comparison of each subject’s history with their state median registrant’s turnout behavior, in high-salience elections across 17 states in 2014. Sending the Self mailer increases turnout by 0.7 points or 2.2%. This effect is largely consistent across states, with somewhat larger effects observed in states with lower ex ante election salience. Our study provides precise evidence that social pressure effects on turnout are generalizable.

Voters were randomly assigned to a control group that received no mail or to a treatment group that received the below mailer, with the goal of increasing their turnout in 2014:

![](https://raw.githubusercontent.com/joshuakalla/data_science_campaigns/master/Colab/Lab6/mailer.png)

We are going to analyze the data from South Dakota. In this lab, you will answer three broad questions:

- Was the experiment properly implemented?
- What was the average effect of the mail on increasing turnout in 2014?
- Was the mail especially effective or ineffective among certain subgroups?

**All of your answers in this lab should have a mix of both code and text. You need to make sure you interpret what you find.**


# Before You Begin: Disable AI Assistance

To ensure you learn the concepts and complete this assignment based on your own understanding, you are *strongly* encouraged to turn off Google Colab's built-in generative AI features before you begin. This will also help you prepare for the midterms and final project.

**Follow these steps:**
1.  Go to the **Edit** menu at the top of the page.
2.  Click on **Notebook settings**.
3.  Check the box next to **"Hide generative AI features"**.
4.  Click **Save**.

This will prevent Google Gemini from suggesting or writing code for you, allowing you to focus on solving the problems yourself.

# Set-up

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

To begin, let's load the data. The most important variables are `treat` (1 = received mail; 0 = control) and `voted14` (1 = voted in 2014; 0 = didn't vote).

In [None]:
data = pd.read_csv("https://raw.githubusercontent.com/joshuakalla/data_science_campaigns/master/Colab/Lab6/gerber_huber_2014_data.csv")
data.head()

## 1. Was the experiment properly implemented?

**Question 1.** Describe who was in the experiment using the **pre-treatment covariates**. What are their demographics? Do you think this is representative of voters in South Dakota (see [Census data](https://www.census.gov/library/visualizations/2016/comm/citizen_voting_age_population/cb16-tps18_sd.html))? Why or why not? Compare and contrast `voted10` and `voted11`. (Hint: see what elections were taking place during which years.)

Note that this answer should **not** be separated by control/treat, but overall for the entire data.

**Question 2.** In expectation, if the experiment was properly implemented, the treatment and control groups should look similar on observed demographics. Check to see if they do. Make a table where the columns are treatment and control and the rows are means for each **pre-treatment covariate** included in the data. Can you calculate these means by writing a function rather than taking the mean by hand many times? (This should be separated by control/treat.) You can use [groupby](https://realpython.com/pandas-groupby/) in your answer to this question.

The table might look like this:

|  | Treatment | Control |
|-|-|-|
| voted08 | Calculate treatment group mean for voted08 and put here. | Calculate control group mean for voted08 and put here. |
| age | Calculate treatment group mean for age and put here. | Calculate control group mean for age and put here. |
| etc. | Calculate treatment group mean for remaining variables and put here. | Calculate control group mean for remaining variables and put here. |

Before answering, let me give you three hints.

1. To create a data frame called `table_name` in pandas with a column `c1` that holds the values of `voted09`, you would use the following sample code:

In [None]:
# Start from an empty data frame
table_name = pd.DataFrame()
# Add a column named 'c1' that holds the values of voted09
table_name['c1'] = data["voted09"]
table_name

2. When using the `apply` function, make sure you understand the `axis` argument:
   - `table_name.apply(function)` applies the function to each column of your table (`axis=0`, the default).
   - `table_name.apply(function, axis=1)` applies the function to each row of your table.

3. An elegant way to solve this would be through the use of the groupby command in Pandas. ([Example](https://towardsdatascience.com/pandas-groupby-explained-453692519d0))
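
As a rough illustration of hints 2 and 3, the sketch below uses a small made-up table rather than the lab data. The column names `group`, `x`, and `y` are assumptions chosen for the example, so you will need to adapt the idea to the real covariates yourself.

```python
import pandas as pd

# Toy data; the column names are made up for illustration only
toy = pd.DataFrame({
    "group": [1, 1, 0, 0],
    "x": [1.0, 3.0, 2.0, 4.0],
    "y": [10.0, 30.0, 20.0, 40.0],
})

# axis=0 (the default): the function receives each column in turn
col_means = toy[["x", "y"]].apply(lambda col: col.mean())

# axis=1: the function receives each row in turn
row_sums = toy[["x", "y"]].apply(lambda row: row["x"] + row["y"], axis=1)

# groupby: one mean per group, then transpose so the groups become columns
group_means = toy.groupby("group")[["x", "y"]].mean().T

print(col_means)
print(row_sums)
print(group_means)
```

Transposing with `.T` is only one way to get covariates in the rows and groups in the columns; it is a layout choice, not a requirement.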


In [None]:
# Answer the question here.

**Apply Function Explained:**

In the previous lab, we needed to set `raw=True` when using the `apply` function. This is because, by default, apply converts each row into a Series object. However, in Lab 5, the function we were working with couldn't evaluate a Series object.

By setting `raw=True`, each row or column is passed as a NumPy array instead of a Series. This can be useful in the following cases:

1. Your function works with array-like data (such as NumPy arrays) rather than Series objects.
2. Your function doesn’t need to reference column names.

If you use `raw=False` (the default setting), your function will receive a Series, which includes both the data values and their associated labels (i.e., column names).

Here’s an example to illustrate this:

In [None]:
# Create a DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

# Define a function that works with a Series (default behavior)
def my_function(row):
    return row['A'] + row['B']

# Apply the function row-wise (axis=1) using Series objects
# (raw=False, default behaviour)
result_series = df.apply(my_function, axis=1)
print("Result with Series (raw=False):")
print(result_series)


In [None]:
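# Try the same function with raw=True. Each row now arrives as a NumPy array,
# so looking up row['A'] by label is expected to fail.
try:
    result_raw = df.apply(my_function, axis=1, raw=True)
    print("\nResult with raw data (raw=True):")
    print(result_raw)
except Exception as err:
    print("my_function failed with raw=True:", type(err).__name__, err)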

In [None]:
# The previous call does not work because my_function expects a Series.
# Let's see how we can modify the function to work with raw data:
def my_function_raw(row):
    return row[0] + row[1]

# Apply the function row-wise (axis=1) using raw data (raw=True)
result_raw = df.apply(my_function_raw, axis=1, raw=True)
print("\nResult with raw data (raw=True):")
print(result_raw)

## 2. What was the average effect of the mail on increasing turnout in 2014?

**Question 3.** What was the average turnout rate in 2014 for the treatment group? For the control group? What was the average treatment effect of the mail?
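
As a reminder of the notation (a sketch using symbols introduced here for illustration, not taken from the paper), the difference-in-means estimate of the average treatment effect can be written as follows, where $\bar{Y}$ is the mean of `voted14` within a group:

$$
\widehat{\text{ATE}} = \bar{Y}_{\text{treat}=1} - \bar{Y}_{\text{treat}=0}
$$

Because `voted14` is coded 0/1, each group mean is a turnout rate, so the difference is in percentage points once multiplied by 100.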

In [None]:
# Answer the question here.

**Question 4.** Can you find a way to visually display your answer?

In [None]:
# Answer the question here.

## 3. Is this effect statistically significant?
Before answering this question, please be sure to re-read this section: https://inferentialthinking.com/chapters/12/1/AB_Testing.html. Note that the code in that section uses the `datascience` library in Python, while we use `pandas` and `NumPy` instead. You will need to adapt the textbook code in order to get it to run for you.
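
To give a sense of what that translation can look like, here is a minimal sketch of a permutation test written with `pandas` and `NumPy` on synthetic data. The column names `group` and `outcome`, the helper `diff_in_means`, the number of repetitions, and the one-sided p-value are all assumptions made for this example; they are not the lab's columns or a required design.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: 'group' and 'outcome' are made-up names, not lab columns
toy = pd.DataFrame({
    "group": np.repeat([1, 0], 50),
    "outcome": np.concatenate(
        [rng.binomial(1, 0.6, 50), rng.binomial(1, 0.4, 50)]
    ),
})

def diff_in_means(df, label_col, outcome_col):
    """Mean outcome where label == 1 minus mean outcome where label == 0."""
    means = df.groupby(label_col)[outcome_col].mean()
    return means.loc[1] - means.loc[0]

observed = diff_in_means(toy, "group", "outcome")

# Shuffle the group labels many times to simulate the null hypothesis
n_reps = 1000
simulated = np.empty(n_reps)
for i in range(n_reps):
    shuffled = toy.assign(
        shuffled_group=rng.permutation(toy["group"].to_numpy())
    )
    simulated[i] = diff_in_means(shuffled, "shuffled_group", "outcome")

# Share of simulated differences at least as large as the observed one
p_value = np.mean(simulated >= observed)
print(observed, p_value)
```

Shuffling the labels with `rng.permutation` plays the role of sampling without replacement in the textbook code. Whether a one-sided or two-sided comparison is appropriate depends on the alternative hypothesis you state.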

**Question 5.** What is the null hypothesis? What is the alternative hypothesis? What is the test statistic you will use to determine if the effect is statistically significant? Use permutation tests.

***Answer the question here (this needs text; no code. To write text double click on this text block):***




**Question 6.** Was the effect of the mail statistically significant?

In [None]:
# Answer the question here.

## 4. Was the mail especially effective or ineffective among certain subgroups?

**Question 7.** Pick 3 different demographic groups that you think might have bigger or smaller treatment effects than the overall average. First, explain why you chose these three groups. What is your theory? Can you justify your expectations by pointing to prior research?

***Answer the question here (this needs text; no code. To write text double click on this text block):***



**Question 8.** Now looking at the data, do these 3 groups have substantially bigger or smaller treatment effects than the overall average? Explain what you find.

In [None]:
# Answer the question here.

**Question 9.** Can you find a way to visually display your answer?

In [None]:
# Answer the question here.


# Congratulations!

You are done with the lab. Before you finish and submit, please fill out this brief evaluation:

- I spent around XXXX hours on this lab.
- This lab was (too easy, too hard, just about the right difficulty).

All assignments in the course will be distributed as notebooks like this one, and you will submit your work as a PDF.  

# How to Convert your Colab notebook to a PDF and download it

Follow these instructions exactly to make sure your notebook is correctly converted to a PDF and saved to your computer.

---

## 1. Check everything is in order and all the code runs

Before starting the conversion process, make sure your notebook is complete and error-free.

1. **Open your notebook** in Google Colab.
2. **Save your work**:
   - Go to **File → Save** or press `Ctrl + S` (Windows) / `Cmd + S` (Mac).
   - This ensures that all your recent changes are stored.
3. **Run all the cells** to confirm there are no errors:
   - In the top menu, select **Runtime → Run all**.
   - Colab will execute every cell in order.
4. **Watch for errors**:
   - If you see any red error messages, fix them before proceeding.
   - A notebook with errors will **not** convert to PDF correctly.
5. Once all cells run without errors, you can proceed to the next step.

---

## 2. Make sure your notebook is saved in Google Drive and named correctly
Before converting, confirm that your notebook is stored in your Google Drive inside the `Colab Notebooks` folder.

1. Look at the top-left corner of the Colab page — you’ll see the notebook’s current name next to two yellow circle icons.
2. Rename the file by clicking directly on the name, typing the new name, and pressing **Enter**.
   - Please rename the notebook to `LASTNAME_FIRSTNAME_LAB#.ipynb`. For this lab, I would call it `Alberto_Stefanelli_Lab6.ipynb`.
   - The name must end with `.ipynb`.
3. Ensure the notebook is in your Google Drive (not in Colab’s temporary session storage):  
   - In the menu, click **File → Locate in Drive**.  
   - This will open the folder in Drive where the notebook is stored.  
   - If it’s not in `My Drive/Colab Notebooks`, move it there for easier access.

---

## 3. Install the required tools

We wrote some code (see below) to automatically convert your notebook to PDF. When you run the provided code cell, the first step will install some essential pieces of software (i.e., Pandoc and LaTeX) inside your Colab environment. You do not need to understand exactly what is happening.

**Important:**
- This installation will take between 2 and 5 minutes.
- Do **not** close or refresh the Colab page while it runs.

---

## 4. Mount your Google Drive in Colab

After installing the requirements, the code will ask Colab to **mount** your Google Drive. You can use either your personal or your Yale account. This is needed because the notebook you are converting must be saved in Drive before it can be converted to PDF.

You will see a pop-up with a **link**:
1. Click the link.
2. Sign in to your Google account (use the same account where your Colab notebook is saved).
3. If prompted, copy the long **authorization code** provided.
   - Paste that code into the input box in Colab and press **Enter**.
4. This will connect your Google Drive to Colab.
5. If the link does not appear, make sure your browser is not blocking pop-ups.

---

## 5. Enter your notebook’s file name

The code will now ask you to enter your notebook’s exact file name.

1. Type the **full name** of your notebook, including the `.ipynb` ending.  
   - Example: `Alberto_Stefanelli_Lab6.ipynb`
2. Make sure the name matches exactly, including capitalization and underscores.
3. Press **Enter**.
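
To see why the exact name matters, the sketch below shows how a file name turns into the input and output paths. `build_paths` is a hypothetical helper written for this illustration, and the default folder is assumed to match the `Colab Notebooks` location used by the conversion cell at the end of this notebook.

```python
from pathlib import Path

# Hypothetical helper for illustration; not part of the provided conversion cell
def build_paths(notebook_name, folder="/content/drive/MyDrive/Colab Notebooks"):
    """Return (input_path, output_path) for a notebook file name."""
    name = notebook_name.strip()
    if not name.endswith(".ipynb"):
        raise ValueError("The file name must end with .ipynb")
    input_path = Path(folder) / name
    # The PDF has the same name as the notebook, with a different suffix
    return input_path, input_path.with_suffix(".pdf")

input_path, output_path = build_paths("Alberto_Stefanelli_Lab6.ipynb")
print(input_path)
print(output_path)
# False until Drive is mounted and the file is saved in that folder
print(input_path.exists())
```

If `input_path.exists()` prints `False` after Drive is mounted, the most likely causes are a typo in the name or a notebook saved in a different Drive folder.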

---

## 6. Convert the notebook to PDF and download it

The code will convert the notebook file to a PDF. After the PDF is created, your browser will show a download pop-up or automatically save the file to your Downloads folder. You can now open the PDF with any PDF reader. Once you have your PDF and have made sure everything is in order, you can upload it to Canvas.

---

## 7. If you see an error

Double-check that:

- All cells run without errors.
- Google Drive mounted without errors.
- The notebook is saved to Google Drive, not locally or on GitHub.
- You entered the correct notebook file name, ending in `.ipynb`.
- The conversion completed without any errors.
- The PDF downloaded to your computer (downloads and pop-ups are not blocked by your browser).


**Fallback:** If the PDF export method below fails (for example, due to LaTeX or pandoc errors), you can use https://convert.ploomber.io/ instead. However, I strongly suggest trying the method below first and using this fallback only as a last resort.

**If you run into any issues, please reach out for help.**



In [None]:
# Install requirements
!apt-get -qq update
!apt-get install -y pandoc texlive-xetex texlive-fonts-recommended texlive-plain-generic

from google.colab import drive, files

# Mount Google Drive
drive.mount('/content/drive')

# Ask for the notebook name
notebook_name = input(
    "Enter your notebook’s exact file name,\n"
    "exactly as shown in the top-left corner of the Colab page (next to the two yellow circle icons): "
)

# Build paths
input_path = f"/content/drive/MyDrive/Colab Notebooks/{notebook_name}"
output_path = input_path.replace(".ipynb", ".pdf")

# Convert to PDF
!jupyter nbconvert --to pdf "{input_path}"

# Download the PDF
files.download(output_path)