In [1]:
import numpy as np
import copy
from tqdm import tqdm
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(0)

## Assignment 1

In [2]:
def biased_ate_estimation():
    # Confounder: parental education, uniform on [0, 4].
    parent_education = np.random.uniform(0, 4, 1000)
    # Treatment: the probability of attending an Ivy rises with parental education.
    child_ivy = np.random.binomial(n=1, p=0.1 + parent_education / 10, size=1000)
    # Untreated potential outcome Y(0) also rises with parental education.
    y_0 = np.random.normal(100000 + 50000 * parent_education, 10000)
    # Note: this is a single scalar draw, so every treated child in one
    # simulated dataset receives the same income shift.
    treatment_effect = np.random.normal(50000, 10000)
    # Observed outcome: Y(0) for untreated children, Y(0) + shift for treated ones.
    child_future_income = y_0 + treatment_effect * child_ivy

    income_no_ivy = child_future_income[child_ivy == 0]
    income_ivy = child_future_income[child_ivy == 1]

    # Naive difference in means with a normal-approximation 95% CI.
    two_means_estimator = np.mean(income_ivy) - np.mean(income_no_ivy)
    std_err = np.sqrt(np.var(income_no_ivy) / income_no_ivy.size + np.var(income_ivy) / income_ivy.size)
    ci_lower = two_means_estimator - 1.96 * std_err
    ci_upper = two_means_estimator + 1.96 * std_err
    return two_means_estimator, ci_lower, ci_upper

biases = []
coverage = 0
ate_estimates = []
for _ in range(100):
    ate, ci_lower, ci_upper = biased_ate_estimation()
    ate_estimates.append(ate)
    if ci_lower < 50000 < ci_upper:
        coverage += 1
    biases.append(ate - 50000)
print("Average ATE estimate: ", np.mean(ate_estimates))
print("Average bias: ", np.mean(biases))
print("Coverage: ", coverage / 100)
print("Standard deviation of ATE estimates: ", np.std(ate_estimates))



Average ATE estimate:  82560.76963818411
Average bias:  32560.769638184105
Coverage:  0.01
Standard deviation of ATE estimates:  10292.931602241


### Data-Generating Process:
1. **Parent Education** ($P$): 
   - $P \sim \text{Uniform}(0, 4)$
2. **Treatment Variable** ($D$): 
   - $D \sim \text{Bernoulli}(0.1 + \frac{P}{10})$
3. **Potential Outcomes**:
   - $Y(0) = 100000 + 50000 \cdot P + \epsilon, \quad \epsilon \sim \text{Normal}(0, 10000^2)$
   - $Y(1) = Y(0) + 50000 + \nu, \quad \nu \sim \text{Normal}(0, 10000^2)$, where $\nu$ is drawn once per simulated dataset and shared by all treated units, so the true ATE is 50000 in expectation.

### Explanation of Bias:
The treatment assignment $D$ is correlated with $P$, which also affects $Y$. As a result:
- The treated group ($D = 1$) is not comparable to the untreated group ($D = 0$), since $P$ influences both treatment assignment and potential outcomes.
- The observed difference between treated and untreated outcomes confounds the true treatment effect with differences driven by $P$. Because treated children have higher $P$ on average and income increases with $P$, the naive difference in means is biased upward.
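
The size of this confounding bias can be worked out directly from the data-generating process, since the naive difference in means converges to $50000 + 50000 \cdot (E[P \mid D=1] - E[P \mid D=0])$. The sketch below is an illustration rather than part of the original analysis: the variable names are hypothetical, and it assumes the treatment effect is independent of $P$, as in the simulation.

```python
from fractions import Fraction

# P ~ Uniform(0, 4): first and second moments.
low, high = Fraction(0), Fraction(4)
mean_p = (low + high) / 2
second_moment_p = (low * low + low * high + high * high) / 3

# P(D = 1 | P) = 0.1 + P / 10.
base, slope = Fraction(1, 10), Fraction(1, 10)
p_treated = base + slope * mean_p                     # P(D = 1)
e_p_times_d = base * mean_p + slope * second_moment_p  # E[P * D]

e_p_given_treated = e_p_times_d / p_treated
e_p_given_untreated = (mean_p - e_p_times_d) / (1 - p_treated)

# Income rises by 50000 per unit of P, so the gap in P translates into bias.
confounding_bias = 50000 * (e_p_given_treated - e_p_given_untreated)
print(round(float(confounding_bias), 2))  # 31746.03
```

Under these assumptions the expected bias is about 31746, which is close to the average bias observed across the 100 simulated datasets.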

### Real-World Example:
A real-world example with a similar data-generating process is a study of the impact of an Ivy League education on income. Here, parental education is a confounding variable that affects both the likelihood of attending an Ivy League school and future income. If the study fails to account for parental education, it will tend to overestimate the effect of an Ivy League education on income.
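
To illustrate what accounting for parental education could look like, the sketch below regresses income on the treatment indicator and the confounder using ordinary least squares. This is one possible adjustment under the assumption that the outcome model is linear in $P$ (true for this simulated process, not necessarily for real data). The function name `adjusted_ate_estimate` and the use of a local random generator are hypothetical choices, not part of the original notebook.

```python
import numpy as np

def adjusted_ate_estimate(rng, n=1000):
    """Simulate the confounded process once and adjust for parental education via OLS."""
    parent_education = rng.uniform(0, 4, n)
    child_ivy = rng.binomial(1, 0.1 + parent_education / 10, n)
    y_0 = rng.normal(100000 + 50000 * parent_education, 10000)
    shift = rng.normal(50000, 10000)  # one shared shift per dataset, as in the simulation
    income = y_0 + shift * child_ivy

    # Design matrix: intercept, treatment indicator, confounder.
    X = np.column_stack([np.ones(n), child_ivy, parent_education])
    coef, *_ = np.linalg.lstsq(X, income, rcond=None)
    return coef[1], shift

rng = np.random.default_rng(0)
estimate, realized_shift = adjusted_ate_estimate(rng)
print(f"adjusted estimate: {estimate:.0f}, realized shift: {realized_shift:.0f}")
```

The coefficient on the treatment indicator should land near the shift realized in that dataset, because conditioning on $P$ removes the part of the income gap that is driven by parental education.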

### Results

- **Coverage:** 0.01
- **Bias:** 32560.77
- **Standard Deviation:** 10292.93

## Assignment 2

In [17]:
# Calculation for age groups 16-17:

NV = 58
NU = 61
RV = 0/NV
RU = 1/NU
VE = (RU - RV) / RU
print(f"Overall vaccine efficacy is {VE:.4f}")


Overall vaccine efficacy is 1.0000


Yes, in this case for the age group 16-17 years we recover the same vaccine efficacy (100%) as the CDC.

In [18]:
# Calculation for age groups 18-64:

NV = 14445
NU = 14566
RV = 8/NV
RU = 149/NU
VE = (RU - RV) / RU
print(f"Overall vaccine efficacy is {VE:.4f}")

Overall vaccine efficacy is 0.9459


In this case, the vaccine efficacy we obtain matches the CDC's figure after rounding (ours: 94.59%, CDC: 94.6%).

In [24]:
# Delta-method CI for vaccine efficacy, VE = 1 - RV / RU.

def delta_method_ci(cases_v, NV, cases_u, NU):
    n = NV + NU
    RV = cases_v / NV
    RU = cases_u / NU
    VE = (RU - RV) / RU

    # Asymptotic variance of sqrt(n) * (RV_hat, RU_hat): the Bernoulli variance
    # within each arm divided by that arm's share of the sample.
    V_hat = np.diag([RV * (1 - RV) / (NV / n), RU * (1 - RU) / (NU / n)])

    # Gradient of VE with respect to (RV, RU), in the same order as V_hat.
    G = np.array([-1 / RU, RV / RU**2])

    # V_hat is on the sqrt(n) scale, so dividing by n once gives Var(VE_hat).
    std_err_delta = np.sqrt(G @ V_hat @ G / n)
    return VE - 1.96 * std_err_delta, VE + 1.96 * std_err_delta

# Age group 16-17:
lower, upper = delta_method_ci(0, 58, 1, 61)
print(f"CI for vaccine efficacy in age group 16-17 is ({lower:.3f}, {upper:.3f})")

# Age group 18-64:
lower, upper = delta_method_ci(8, 14445, 149, 14566)
print(f"CI for vaccine efficacy in age group 18-64 is ({lower:.3f}, {upper:.3f})")


CI for vaccine efficacy in age group 16-17 is (1.000, 1.000)
CI for vaccine efficacy in age group 18-64 is (0.907, 0.984)


The vaccine efficacy is $VE = \frac{RU-RV}{RU}$. Using the observed risks, we can construct the plug-in estimate $\hat{VE} = \frac{\hat{RU}-\hat{RV}}{\hat{RU}} = 1 - \frac{\hat{RV}}{\hat{RU}}$.

The gradient of this expression with respect to the risks $(RV, RU)$, evaluated at the observed risks, is $G = \left( -\frac{1}{\hat{RU}}, \frac{\hat{RV}}{\hat{RU}^2} \right)$.

For the variance, we use the delta method, which gives the approximation $\sqrt{n}(\hat{VE} - VE) \overset{a}{\sim} \mathcal{N}(0, GVG^T)$, where $V$ is the asymptotic variance-covariance matrix of the observed risks.

Under mild regularity conditions, the observed risks $\hat{R} = (\hat{RV}, \hat{RU})$ satisfy $\sqrt{n}(\hat{R} - \mu) \overset{a}{\sim} \mathcal{N}(0, V)$, where $\mu$ is the vector of true population risks and $\hat{V}$ is a consistent estimate of $V$. With $D = 1$ denoting the vaccinated arm and the entries ordered to match $G$, $\hat{V} = \begin{pmatrix} \frac{Var(Y|D=1)}{P(D=1)} & 0 \\ 0 & \frac{Var(Y|D=0)}{P(D=0)} \end{pmatrix}$. The off-diagonal entries are zero because the two arms are independent.

To obtain the final CI based on the delta method, we therefore compute $\hat{VE} \pm 1.96 \cdot \sqrt{G\hat{V}G^T / n}$, where $n$ is the total number of patients in the respective age group and $\sqrt{G\hat{V}G^T / n}$ is the standard error. Dividing by $n$ exactly once is correct because $\hat{V}$ is on the $\sqrt{n}$ scale: $\frac{Var(Y|D=d)}{P(D=d)} \cdot \frac{1}{n} = \frac{Var(Y|D=d)}{n_d}$, the usual variance of a sample mean in arm $d$.
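
As a sanity check on the delta-method standard error, one option is to compare it with the spread of $\hat{VE}$ across simulated trials. The sketch below does this for the 18-64 counts, treating the observed risks as if they were the true ones. The helper name `delta_se`, the number of simulated trials, and the seed are assumptions made for illustration only.

```python
import numpy as np

def delta_se(cases_v, n_v, cases_u, n_u):
    """Delta-method standard error of VE = 1 - RV / RU for two independent arms."""
    rv, ru = cases_v / n_v, cases_u / n_u
    grad = np.array([-1 / ru, rv / ru**2])  # d VE / d (RV, RU)
    # Variance of each estimated risk: Bernoulli variance over the arm size.
    cov = np.diag([rv * (1 - rv) / n_v, ru * (1 - ru) / n_u])
    return np.sqrt(grad @ cov @ grad)

n_v, n_u = 14445, 14566
cases_v, cases_u = 8, 149
se_delta = delta_se(cases_v, n_v, cases_u, n_u)

# Simulate many trials with the observed risks taken as the truth.
rng = np.random.default_rng(0)
sim_rv = rng.binomial(n_v, cases_v / n_v, size=20000) / n_v
sim_ru = rng.binomial(n_u, cases_u / n_u, size=20000) / n_u
se_sim = np.std(1 - sim_rv / sim_ru)

print(f"delta-method SE: {se_delta:.4f}, simulated SD: {se_sim:.4f}")
```

The two numbers should be close but not identical, since the delta method is a first-order approximation and only 8 cases were observed in the vaccinated arm.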

## Assignment 3