# 🗳️ HW6 Lab: Regularization and causal inference

## ✅ Setup and data import
In this lab, we will apply regularization methods to a voting prediction problem.

We will also have a short problem on causal inference related to breast cancer screenings.

In [None]:
# Load packages
library(tidyverse)
library(lubridate)

if (! require(ROCR)) {
  install.packages('ROCR')
}
library(ROCR)

# Takes a couple minutes to install glmnet in Google Colab
if (! require(glmnet)) {
  install.packages('glmnet')
}
library(glmnet)

# Load in helper functions for fitting lasso and ridge, and computing AUC.
source('https://jdgrossman.com/assets/hw6-helpers.R')

# Use three digits past the decimal point,
# and don't use scientific notation.
options(digits = 3, scipen = 999)

# Format plots with a white background and dark features.
theme_set(theme_bw())

# Increase the default text size of plots.
# If you are *not* working in Google Colab, we recommend commenting
# out this line of code.
theme_update(text = element_text(size = 20))

# Increase the default plot width and height.
# If you are *not* working in Google Colab, we recommend commenting
# out this line of code.
options(repr.plot.width=12, repr.plot.height=8)

# Read in the data
data = read_csv('https://jdgrossman.com/assets/survey_small_regularization.csv')

# Peek at 10 random rows
sample_n(data, 10)

## 🚀 Exercise 1

Write a function that does the following:
1. Separates the voter data into an 80% training set and a 20% validation set.
2. Fits a logistic regression model to the training dataset. The model should predict whether `voted_for_candidate` is true based on all available features.
3. Calculates the AUC of the model on the held-out validation set. You may find the `compute_auc` function helpful.

The function should only have one input: a dataframe of voter data.

Run your function once, and print the resulting validation set AUC.
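
The three steps above can be sketched on simulated data. This is a minimal illustration, not the lab's reference solution: the data frame `toy`, its columns, and the rank-based `toy_auc` helper (a stand-in for `compute_auc`) are all hypothetical names introduced here.

```r
# Simulated stand-in for the voter data (all names here are hypothetical).
set.seed(1)
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, size = 1, prob = plogis(1.5 * toy$x1 - toy$x2))

# Rank-based AUC (Mann-Whitney form): the probability that a randomly chosen
# positive gets a higher predicted score than a randomly chosen negative.
toy_auc <- function(preds, actual) {
  r <- rank(preds)  # average ranks handle ties
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

holdout_auc <- function(df) {
  # Randomly pick 80% of the row indices for training; the rest is validation.
  train_idx <- sample(nrow(df), size = floor(0.8 * nrow(df)))
  train <- df[train_idx, ]
  valid <- df[-train_idx, ]

  # Logistic regression of y on all other columns.
  fit <- glm(y ~ ., data = train, family = binomial)

  # type = "response" returns probabilities rather than log-odds.
  preds <- predict(fit, newdata = valid, type = "response")
  toy_auc(preds, valid$y)
}

auc_once <- holdout_auc(toy)
print(auc_once)
```

Because the split is random, calling `holdout_auc(toy)` again will generally return a different value.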

In [None]:
# Your code here!

# `compute_auc(preds, actual)` will compute the AUC for a vector of
# predicted probabilities (`preds`) and a vector of true labels (`actual`).



## 🚀 Exercise 2

Using the data and model specification from Exercise 1, write a function that uses k-fold cross-validation (CV) to estimate the model's out-of-sample AUC.

> Do not import an external library that runs CV for you. Write your own code to perform CV.

The function should only have two inputs: a dataframe of voter data, and the number of folds.

Run your function once with 5 folds, and print the resulting AUC.
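
One way to hand-roll cross-validation is sketched below on simulated data. It assumes the per-fold AUCs are simply averaged (pooling all held-out predictions before scoring is another reasonable choice), and `toy`, `toy_auc`, and `cv_auc` are hypothetical names rather than anything defined in the helper file.

```r
# Simulated stand-in for the voter data (all names here are hypothetical).
set.seed(2)
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, size = 1, prob = plogis(1.5 * toy$x1 - toy$x2))

# Rank-based AUC, used here in place of the lab's `compute_auc`.
toy_auc <- function(preds, actual) {
  r <- rank(preds)
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

cv_auc <- function(df, k) {
  # Assign each row to one of k folds of near-equal size, in random order.
  fold <- sample(rep(seq_len(k), length.out = nrow(df)))

  fold_aucs <- sapply(seq_len(k), function(i) {
    # Hold out fold i; train on the remaining k - 1 folds.
    train <- df[fold != i, ]
    valid <- df[fold == i, ]
    fit <- glm(y ~ ., data = train, family = binomial)
    preds <- predict(fit, newdata = valid, type = "response")
    toy_auc(preds, valid$y)
  })

  # Average the k held-out AUCs into a single estimate.
  mean(fold_aucs)
}

auc_cv <- cv_auc(toy, k = 5)
print(auc_cv)
```

Every row is used for validation exactly once, so the estimate draws on all of the data rather than a single 20% slice.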

In [None]:
# Your code here!



## 🚀 Exercise 3

Run your functions from Exercises 1 and 2 one hundred times each, and save the resulting AUC estimates. Plot the AUC estimates on one set of axes. 

> Make sure that your plot makes it easy to distinguish between estimates generated from the single validation set approach versus the CV approach.

How do the estimates compare for the validation set approach versus the cross-validation approach? What is one advantage of cross-validation versus a single validation set? What is one disadvantage? Answer in no more than four sentences.
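
A sketch of how repeated estimates from two procedures might be collected into one long data frame for plotting is shown below. It runs on simulated data with only 20 repetitions to stay fast, and every name in it (`toy`, `score_split`, `holdout_auc`, `cv_auc`, `results`) is hypothetical.

```r
set.seed(3)
n <- 400
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, size = 1, prob = plogis(1.5 * toy$x1 - toy$x2))

# Rank-based AUC, used here in place of the lab's `compute_auc`.
toy_auc <- function(preds, actual) {
  r <- rank(preds)
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Fit a logistic regression on `train` and return its AUC on `valid`.
score_split <- function(train, valid) {
  fit <- glm(y ~ ., data = train, family = binomial)
  toy_auc(predict(fit, newdata = valid, type = "response"), valid$y)
}

holdout_auc <- function(df) {
  idx <- sample(nrow(df), floor(0.8 * nrow(df)))
  score_split(df[idx, ], df[-idx, ])
}

cv_auc <- function(df, k) {
  fold <- sample(rep(seq_len(k), length.out = nrow(df)))
  mean(sapply(seq_len(k), function(i) {
    score_split(df[fold != i, ], df[fold == i, ])
  }))
}

n_reps <- 20  # the exercise asks for 100; fewer here to keep the sketch fast
results <- rbind(
  data.frame(method = "single validation set",
             auc = replicate(n_reps, holdout_auc(toy))),
  data.frame(method = "5-fold CV",
             auc = replicate(n_reps, cv_auc(toy, 5)))
)

# Mean and standard deviation of the estimates for each method.
print(aggregate(auc ~ method, data = results,
                FUN = function(v) c(mean = mean(v), sd = sd(v))))
```

With one row per estimate and a `method` column, the data frame can be passed to a plotting function that maps `method` to color or to the x-axis, which keeps the two approaches easy to tell apart.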

In [None]:
# Your code here!



In [None]:
# Write your written answer as a code comment here!
#
#



## 🚀 Exercise 4

Next, we will explore ridge (L2) regularization.

The `ridge_glm` function can be used to fit an L2 regularized regression model, and the `lasso_ridge_predict` function can be used to generate predictions from a model obtained from `ridge_glm`.

> For example, `ridge_glm(y ~ ., data=df, lambda=1)` will fit a ridge regression model with outcome `y` using all other columns in dataframe `df` as covariates, using a `lambda` value of 1.
>
> `ridge_lasso_predict(ridge_glm_model, newdata=new_df)` will generate estimated probabilities for the observations in dataframe `new_df` using the fitted model `ridge_glm_model`. No need to use `type='response'`.
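
For reference, one common way to write the L2-penalized logistic regression objective is given below. The notation ($n$ observations, $m$ features, intercept $\beta_0$, coefficients $\beta_j$, fitted probabilities $\pi_i$) is assumed here for illustration and is not taken from the helper code; software such as glmnet rescales the two terms, so its `lambda` values are not directly comparable to the $\lambda$ in this form.

$$
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta_0,\, \beta} \left\{ -\sum_{i=1}^{n} \Big[ y_i \log \pi_i + (1 - y_i) \log (1 - \pi_i) \Big] + \lambda \sum_{j=1}^{m} \beta_j^2 \right\},
\qquad
\pi_i = \frac{1}{1 + e^{-(\beta_0 + x_i^\top \beta)}}.
$$

The first term is the negative log-likelihood of ordinary logistic regression. The second term grows with the squared size of the coefficients, so larger values of $\lambda$ pull the coefficients more strongly toward zero. The intercept $\beta_0$ is left unpenalized in this formulation.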

Suppose you want to fit a ridge regression model that predicts whether someone voted for the candidate based on all other columns in the voter data. 

To fit a performant model, you should first identify which of the following lambda values produces the highest cross-validated AUC estimate:

```
lambda = 10^seq(-10, 10, 1)
```

When estimating the CV AUC, use 10 folds. 

> You may find it helpful to re-use and modify the function you constructed in Exercise 2.
>
> If there is an approximate tie between multiple `lambda` values, it is generally good practice to choose the largest `lambda` value of the ties.

To show the optimal value of lambda, make a plot with `log10(lambda)` on the x-axis and `AUC` on the y-axis.
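
The tie-breaking rule can be sketched as follows. The AUC values are made up purely for illustration, and the tolerance `tol` that defines an "approximate tie" is an arbitrary assumption, not a recommended setting.

```r
# Hypothetical CV results: one AUC per candidate lambda (values are made up).
cv_results <- data.frame(
  lambda = 10^seq(-3, 2, 1),
  auc    = c(0.801, 0.802, 0.8015, 0.790, 0.700, 0.500)
)

# Treat any lambda whose AUC is within `tol` of the best as tied.
# `tol` is an arbitrary illustrative threshold.
tol <- 0.002
best_auc <- max(cv_results$auc)
tied <- cv_results[cv_results$auc >= best_auc - tol, ]

# Among the ties, prefer the largest lambda (the most regularized model).
chosen_lambda <- max(tied$lambda)
print(chosen_lambda)
```

Here the three smallest lambda values are within the tolerance of the best AUC, so the rule picks the largest of them, 0.1.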

In [None]:
# Your code here!



## 🚀 Exercise 5

Repeat Exercise 4, but fit a lasso model instead of a ridge model.
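
For reference, a common way to write the lasso (L1-penalized) logistic regression objective is given below. The notation ($n$ observations, $m$ features, intercept $\beta_0$, coefficients $\beta_j$, fitted probabilities $\pi_i$) is assumed here for illustration and is not taken from the helper code.

$$
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta_0,\, \beta} \left\{ -\sum_{i=1}^{n} \Big[ y_i \log \pi_i + (1 - y_i) \log (1 - \pi_i) \Big] + \lambda \sum_{j=1}^{m} |\beta_j| \right\},
\qquad
\pi_i = \frac{1}{1 + e^{-(\beta_0 + x_i^\top \beta)}}.
$$

The likelihood term is the same as in unpenalized logistic regression; only the penalty changes, from the sum of squared coefficients used by ridge to the sum of their absolute values.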

In [None]:
# Your code here!



## 🚀 Exercise 6

Create a plot comparing the coefficients of your optimal ridge model, your optimal lasso model, and a logistic regression model fit with the same specification as the lasso and ridge models. Each model should be fit to the entirety of the voter data.

> You may want to create an additional zoomed-in plot to see changes in small coefficients.

In no more than three sentences, describe similarities and differences among the coefficients of each model.

In [None]:
# Your code here!



In [None]:
# Write your text answer as a code comment here!
#
#



## 🚀 Exercise 7

As it turns out, there is a much larger dataset of voters drawn from the same population. The data can be found at [https://jdgrossman.com/assets/survey_complete_regularization.csv](https://jdgrossman.com/assets/survey_complete_regularization.csv).

What is the AUC of each of the three models from Exercise 6 when applied to this test dataset? Answer in a one-sentence code comment.

> Note that you should not fit any new models in this exercise.

In [None]:
# Write your code and written answer in this cell!



## 🚀 Exercise 8

We will now change gears and complete a short problem on causal inference.

(Adapted from
<a href="http://www.cambridge.org/us/academic/subjects/politics-international-relations/research-methods-politics/natural-experiments-social-sciences-design-based-approach">
Natural Experiments in the Social Sciences</a>, Chapter 5, Problem 5.2)

In the 1960s, the Health Insurance Plan of Greater New York clinical trial
studied the effects of screening for breast cancer. 

* Researchers invited
about 31,000 women between the ages of 40 and 64 for annual clinical visits
and mammographies, which are X-rays designed to detect breast cancer. 

* About 20,200 of these women (roughly two-thirds) accepted the invitation to be
screened, while the remaining one-third refused. 

* In the control group, 31,000 women
received the status quo health care. (None of them received mammographies on
their own initiative; screening for breast cancer was rare in the 1960s.)

* Among the 62,000 women in the study group, the invitation for screening was
issued at random. 

The table below shows numbers of deaths and death rates
from breast cancer five years after the start of the trial. It also shows
deaths from other causes, among women in the treatment group who accepted
the invitation for screening and those who refused.

<div class="datatable-begin"></div>

| | Group size | Deaths from breast cancer | Death rate from breast cancer, per 1,000 women | Deaths from other causes | Death rate from other causes, per 1,000 women |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| **Assigned to treatment** |  |  |  |  |  |
| Accepted screening | 20,200 | 23 | 1.14 | 428 | 21.19 |
| Refused screening | 10,800 | 16 | 1.48 | 409 | 37.87 |
| Total | 31,000 | 39 | 1.26 | 837 | 27.00 |
| **Assigned to control** |  |  |  |  |  |
| Would have accepted screening | N/A | N/A | N/A | N/A | N/A |
| Would have refused screening | N/A | N/A | N/A | N/A | N/A |
| Total | 31,000 | 63 | 2.03 | 879 | 28.35 |

<div class="datatable-end"></div>
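
The per-1,000 rates in the table are the number of deaths divided by the group size, times 1,000. A short check of the breast cancer column for the assigned-to-treatment rows is sketched below; the data frame `hip` and its column names are hypothetical, and the counts are copied from the table.

```r
# Rows of the assigned-to-treatment group, copied from the table above.
hip <- data.frame(
  group     = c("Accepted screening", "Refused screening", "Total"),
  size      = c(20200, 10800, 31000),
  bc_deaths = c(23, 16, 39)
)

# Death rate per 1,000 women = deaths / group size * 1,000.
hip$bc_rate_per_1000 <- hip$bc_deaths / hip$size * 1000
print(round(hip$bc_rate_per_1000, 2))  # 1.14 1.48 1.26
```

The same calculation applies to any other row or column of the table.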

## 🚀 Exercise 8a

It might seem natural to compare women who were screened with women who
were not screened. 

* Why, in general, is this a bad idea? 

* Is there any specific evidence in the table that suggests this is in fact a bad idea?

**There is no maximum sentence length for any of the parts of Exercise 8, but you should be able to answer each question in no more than a few sentences.** 

In [None]:
# Write your answer as a code comment here! 
#



## 🚀 Exercise 8b

In class, we learned about the average treatment effect ($\text{ATE}$) and the average
treatment effect on the treated ($\text{ATT}$, also known as $\text{ATE}_\text{c}$, where $\text{c}$ denotes compliers).

* Calculate the intention-to-treat estimate (i.e., the average effect of being assigned to treatment).

* What is a potential limitation of intention-to-treat analysis?
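
In symbols, the intention-to-treat effect can be written as below. The notation is an assumption made here for illustration: $Z_i$ indicates whether woman $i$ was assigned to treatment (invited for screening), and $Y_i$ is her outcome.

$$
\text{ITT} = \mathbb{E}[\,Y_i \mid Z_i = 1\,] - \mathbb{E}[\,Y_i \mid Z_i = 0\,].
$$

It is estimated by the difference in mean outcomes between everyone assigned to treatment and everyone assigned to control, regardless of whether those assigned to treatment actually received it.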

In [None]:
# Write any necessary calculations and your text-based answer here! 



## 🚀 Exercise 8c

In the first column of the table, there are two unobserved quantities among the women assigned to the control group: (1) the number of women who would have accepted screening, and (2) the number who would have refused.

* Why are these quantities unobserved? 

* Find an unbiased
estimate for each of these two quantities and fill in the corresponding
cells of the table with these estimates. 

* What is the rationale for your
estimates (i.e., why are they unbiased)?

In [None]:
# Print any necessary calculations and write your text-based answer here! 



## 🚀 Exercise 8d

What are the proportions of always-treats, never-treats, and compliers in
the study group?

In [None]:
# Write your answer as a code comment here! 
#
# 



## 🚀 Exercise 8e

What is the death rate from breast cancer among compliers in the assigned-to-treatment
group? (The death rate per 1,000 women is simply the number of deaths
divided by the group size, times 1,000.)

In [None]:
# Calculate and print your answer here!



## 🚀 Exercise 8f

Now, estimate the death rate from breast cancer among compliers and,
separately,
among never-treats in the control group. Proceed as follows:

* First, estimate the number of never-treats in the control group
who died from breast cancer. Why is this quantity unobserved?
What is the rationale for your estimate?

* Now, use this information to estimate the number of deaths from
breast cancer among compliers in the control group. 

* Finally,
estimate the death rate per 1,000 women among compliers in the
control group,
and also estimate the death rate per 1,000 women among
never-treats in the control group.

In [None]:
# Calculate and write your answers here!



## 🚀 Exercise 8g

Estimate the effect of treatment on compliers in terms of death rates,
using the information computed above.

In [None]:
# Calculate and print your answer here!



## 🚀 Exercise 8h

Using several of the quantities you derived above, find the average
treatment effect for
compliers by directly applying the instrumental-variables estimator:

$$\frac{[\text{mean outcome in treatment group}] - [\text{mean outcome
in control group}]}{[\text{fraction treated in treatment group}] - [\text{fraction treated
in control group}]}.$$

This should be identical to your answer in Exercise 8g. 
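
The ratio above (often called the Wald estimator) translates directly into a small function. The sketch below uses a hypothetical function name and made-up inputs that are deliberately not the trial's numbers.

```r
# Instrumental-variables (Wald) estimator:
# difference in mean outcomes divided by difference in fraction treated.
iv_estimate <- function(mean_treat, mean_control,
                        frac_treated_treat, frac_treated_control) {
  (mean_treat - mean_control) / (frac_treated_treat - frac_treated_control)
}

# Made-up inputs for illustration only (not the trial's numbers):
# outcomes of 10 vs. 16 per 1,000, with 60% vs. 0% of each group treated.
toy_effect <- iv_estimate(mean_treat = 10, mean_control = 16,
                          frac_treated_treat = 0.6, frac_treated_control = 0)
print(toy_effect)
```

In this toy case the assignment effect of -6 per 1,000 is scaled up by the 60% who actually took the treatment, giving -10 per 1,000.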

In [None]:
# Calculate and print your answer here!

