# 3. Potential Outcomes and RCTs (R Implementation)

This notebook implements an analysis of potential outcomes and randomized controlled trials (RCTs) in R.

## Assignment Requirements:
1. **Data Simulation (3 points)**: Simulate a dataset with covariates, treatment, and outcome
2. **Estimating Average Treatment Effect (3 points)**: Simple and controlled regression estimates
3. **LASSO and Variable Selection (3 points)**: Use LASSO for covariate selection and ATE estimation

In [None]:
# Load required libraries
library(glmnet)
library(dplyr)
library(ggplot2)
library(broom)
library(knitr)

# Set random seed for reproducibility
set.seed(123)

cat("📦 Libraries loaded successfully\n")

## 3.1 Data Simulation (3 points)

We simulate a dataset with n = 1000 individuals with:
- Covariates X₁, X₂, X₃, X₄ (continuous or binary)
- Treatment assignment D ~ Bernoulli(0.5)
- Outcome variable: Y = 2D + 0.5X₁ - 0.3X₂ + 0.2X₃ + ε, where ε ~ N(0,1)
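
In potential-outcomes notation, the data-generating process above can be written as follows. This is a sketch: the symbols $Y_i(d)$ and $\tau$ are introduced here for illustration and do not appear in the assignment text. Note that X₄ does not enter the outcome equation.

```latex
\begin{aligned}
Y_i(d) &= 2d + 0.5\,X_{1i} - 0.3\,X_{2i} + 0.2\,X_{3i} + \varepsilon_i, \qquad d \in \{0, 1\},\\
Y_i &= D_i\,Y_i(1) + (1 - D_i)\,Y_i(0),\\
\tau &= \mathbb{E}\left[Y_i(1) - Y_i(0)\right] = 2 .
\end{aligned}
```

Because D is assigned independently of the covariates and the error term, the difference in observed group means, $\mathbb{E}[Y_i \mid D_i = 1] - \mathbb{E}[Y_i \mid D_i = 0]$, equals $\tau$.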

In [None]:
# Set parameters
n <- 1000

# Generate covariates
X1 <- rnorm(n, mean = 2, sd = 1)        # Continuous covariate
X2 <- rnorm(n, mean = 0, sd = 1.5)      # Continuous covariate  
X3 <- rbinom(n, 1, 0.3)                 # Binary covariate
X4 <- rbinom(n, 1, 0.6)                 # Binary covariate

# Generate treatment assignment
D <- rbinom(n, 1, 0.5)                  # Treatment ~ Bernoulli(0.5)

# Generate error term
epsilon <- rnorm(n, 0, 1)

# Generate outcome variable: Y = 2D + 0.5X1 - 0.3X2 + 0.2X3 + ε
Y <- 2*D + 0.5*X1 - 0.3*X2 + 0.2*X3 + epsilon

# Create data frame
data <- data.frame(
  Y = Y,
  D = D,
  X1 = X1,
  X2 = X2,
  X3 = X3,
  X4 = X4
)

cat("📊 Dataset simulated successfully\n")
cat("Sample size:", nrow(data), "\n")
cat("Treatment group size:", sum(data$D), "\n")
cat("Control group size:", sum(1 - data$D), "\n\n")

# Display first few rows
head(data)

### Balance Check (1 point)

We perform a balance check by comparing the means of X₁, X₂, X₃, X₄ across treatment and control groups.

In [None]:
# Balance check: compare means across treatment groups
balance_results <- data.frame(
  Variable = c("X1", "X2", "X3", "X4"),
  Control_Mean = c(
    mean(data$X1[data$D == 0]),
    mean(data$X2[data$D == 0]),
    mean(data$X3[data$D == 0]),
    mean(data$X4[data$D == 0])
  ),
  Treatment_Mean = c(
    mean(data$X1[data$D == 1]),
    mean(data$X2[data$D == 1]),
    mean(data$X3[data$D == 1]),
    mean(data$X4[data$D == 1])
  ),
  Difference = c(
    mean(data$X1[data$D == 1]) - mean(data$X1[data$D == 0]),
    mean(data$X2[data$D == 1]) - mean(data$X2[data$D == 0]),
    mean(data$X3[data$D == 1]) - mean(data$X3[data$D == 0]),
    mean(data$X4[data$D == 1]) - mean(data$X4[data$D == 0])
  )
)

# Perform t-tests for each covariate
t_test_results <- list(
  X1 = t.test(data$X1[data$D == 1], data$X1[data$D == 0]),
  X2 = t.test(data$X2[data$D == 1], data$X2[data$D == 0]),
  X3 = t.test(data$X3[data$D == 1], data$X3[data$D == 0]),
  X4 = t.test(data$X4[data$D == 1], data$X4[data$D == 0])
)

# Add p-values to balance results
balance_results$p_value <- c(
  t_test_results$X1$p.value,
  t_test_results$X2$p.value,
  t_test_results$X3$p.value,
  t_test_results$X4$p.value
)

cat("🔍 Balance Check Results:\n")
print(balance_results)

cat("\n📈 Balance is good if differences are small and p-values are > 0.05\n")
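
With four separate t-tests, one p-value below 0.05 can occur by chance even under correct randomization. As a complement, the sketch below computes standardized mean differences (a common rule of thumb treats absolute values below about 0.1 as well balanced) and a joint F-test of whether the covariates together predict treatment. It re-simulates the data so that it runs on its own; the helper `smd` and the names `df`, `balance_fit`, and `joint_p` are illustrative and not part of the assignment.

```r
# Self-contained sketch: re-simulate covariates and treatment as in Section 3.1
set.seed(123)
n  <- 1000
df <- data.frame(
  X1 = rnorm(n, mean = 2, sd = 1),
  X2 = rnorm(n, mean = 0, sd = 1.5),
  X3 = rbinom(n, 1, 0.3),
  X4 = rbinom(n, 1, 0.6),
  D  = rbinom(n, 1, 0.5)
)

# Standardized mean difference: (treated mean - control mean) / pooled SD
# (hypothetical helper, defined here for illustration)
smd <- function(x, d) {
  pooled_sd <- sqrt((var(x[d == 1]) + var(x[d == 0])) / 2)
  (mean(x[d == 1]) - mean(x[d == 0])) / pooled_sd
}
smd_values <- sapply(df[c("X1", "X2", "X3", "X4")], smd, d = df$D)
print(round(smd_values, 3))

# Joint test: do the covariates together predict treatment assignment?
balance_fit <- lm(D ~ X1 + X2 + X3 + X4, data = df)
f_stat  <- summary(balance_fit)$fstatistic
joint_p <- pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
              lower.tail = FALSE)
cat("Joint F-test p-value:", round(joint_p, 4), "\n")
```

A large joint p-value is consistent with successful randomization; it does not prove balance, it only fails to find evidence against it.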

## 3.2 Estimating the Average Treatment Effect (3 points)

We estimate the Average Treatment Effect (ATE) using two approaches:
1. Simple regression: Y ~ D
2. Controlled regression: Y ~ D + X₁ + X₂ + X₃ + X₄

In [None]:
# 1. Simple regression: Y ~ D
simple_model <- lm(Y ~ D, data = data)
simple_summary <- summary(simple_model)

cat("📊 Simple Regression Results (Y ~ D):\n")
print(simple_summary)

# Extract ATE and standard error
simple_ate <- coef(simple_model)["D"]
simple_se <- simple_summary$coefficients["D", "Std. Error"]

cat("\n🎯 Simple ATE estimate:", round(simple_ate, 4), "\n")
cat("📏 Standard Error:", round(simple_se, 4), "\n")

In [None]:
# 2. Controlled regression: Y ~ D + X1 + X2 + X3 + X4
controlled_model <- lm(Y ~ D + X1 + X2 + X3 + X4, data = data)
controlled_summary <- summary(controlled_model)

cat("📊 Controlled Regression Results (Y ~ D + X1 + X2 + X3 + X4):\n")
print(controlled_summary)

# Extract ATE and standard error
controlled_ate <- coef(controlled_model)["D"]
controlled_se <- controlled_summary$coefficients["D", "Std. Error"]

cat("\n🎯 Controlled ATE estimate:", round(controlled_ate, 4), "\n")
cat("📏 Standard Error:", round(controlled_se, 4), "\n")

In [None]:
# Compare the two estimates
comparison <- data.frame(
  Model = c("Simple (Y ~ D)", "Controlled (Y ~ D + X1 + X2 + X3 + X4)"),
  ATE_Estimate = c(simple_ate, controlled_ate),
  Standard_Error = c(simple_se, controlled_se),
  R_Squared = c(simple_summary$r.squared, controlled_summary$r.squared)
)

cat("📋 Comparison of ATE Estimates:\n")
print(comparison)

cat("\n🔍 Analysis:\n")
cat("• ATE change:", round(controlled_ate - simple_ate, 4), "\n")
cat("• Standard error change:", round(controlled_se - simple_se, 4), "\n")
cat("• The true ATE is 2.0 (from our data generating process)\n")
cat("• Both estimators are unbiased under randomization; controlling for covariates mainly improves precision\n")
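
Under randomization both regressions target the same ATE, so the gain from adding covariates shows up in the spread of the estimates rather than in their center. The Monte Carlo sketch below illustrates this by repeating the simulation many times. The replication count `n_reps`, the per-draw sample size `n_sim`, and the helper `one_draw` are illustrative choices, not part of the assignment.

```r
# Monte Carlo sketch: compare the two estimators over repeated simulated trials.
# n_reps and n_sim are illustrative choices, not values from the assignment.
set.seed(123)
n_reps <- 500
n_sim  <- 500

# One simulated trial using the same data-generating process as Section 3.1;
# returns the coefficient on D from the simple and the controlled regression
one_draw <- function(n) {
  X1 <- rnorm(n, mean = 2, sd = 1)
  X2 <- rnorm(n, mean = 0, sd = 1.5)
  X3 <- rbinom(n, 1, 0.3)
  X4 <- rbinom(n, 1, 0.6)
  D  <- rbinom(n, 1, 0.5)
  Y  <- 2 * D + 0.5 * X1 - 0.3 * X2 + 0.2 * X3 + rnorm(n)
  c(simple     = unname(coef(lm(Y ~ D))["D"]),
    controlled = unname(coef(lm(Y ~ D + X1 + X2 + X3 + X4))["D"]))
}

# Rows are replications, columns are the two estimators
estimates <- t(replicate(n_reps, one_draw(n_sim)))

# Mean of each estimator across replications, and its spread
mc_summary <- data.frame(
  Mean_Estimate = colMeans(estimates),
  SD_Estimate   = apply(estimates, 2, sd)
)
print(round(mc_summary, 4))
```

Both means should sit close to 2, while the controlled estimator should show the smaller standard deviation, because the covariates absorb part of the outcome variance.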

## 3.3 LASSO and Variable Selection (3 points)

We use LASSO to select covariates and then re-estimate the ATE with only the selected variables.

In [None]:
# Prepare data for LASSO (excluding treatment D)
X_matrix <- as.matrix(data[, c("X1", "X2", "X3", "X4")])
Y_vector <- data$Y

# Fit LASSO model using cross-validation
cv_lasso <- cv.glmnet(X_matrix, Y_vector, alpha = 1, nfolds = 10)

# Plot cross-validation results
plot(cv_lasso, main = "LASSO Cross-Validation")

# Get optimal lambda
lambda_min <- cv_lasso$lambda.min
lambda_1se <- cv_lasso$lambda.1se

cat("🎯 Optimal lambda (min):", lambda_min, "\n")
cat("🎯 Optimal lambda (1se):", lambda_1se, "\n")

In [None]:
# Get coefficients at lambda_min
lasso_coef <- coef(cv_lasso, s = "lambda.min")
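# Drop the intercept by name rather than by position, so a zero intercept cannot remove a real variable
selected_vars <- setdiff(rownames(lasso_coef)[lasso_coef[, 1] != 0], "(Intercept)")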

cat("📊 LASSO Coefficients at lambda_min:\n")
print(as.matrix(lasso_coef))

cat("\n✅ Variables selected by LASSO:", selected_vars, "\n")

# Check if any variables were selected
if(length(selected_vars) == 0) {
  cat("⚠️  No variables selected by LASSO at lambda_min\n")
  cat("Trying lambda_1se...\n")

  lasso_coef_1se <- coef(cv_lasso, s = "lambda.1se")
  selected_vars <- setdiff(rownames(lasso_coef_1se)[lasso_coef_1se[, 1] != 0], "(Intercept)")
  cat("Variables selected at lambda_1se:", selected_vars, "\n")
}

In [None]:
# Re-estimate ATE with LASSO-selected covariates
if(length(selected_vars) > 0) {
  # Create formula with selected variables
  formula_str <- paste("Y ~ D +", paste(selected_vars, collapse = " + "))
  lasso_formula <- as.formula(formula_str)
  
  # Fit model with selected variables
  lasso_model <- lm(lasso_formula, data = data)
  lasso_summary <- summary(lasso_model)
  
  cat("📊 LASSO-Selected Model Results:", formula_str, "\n")
  print(lasso_summary)
  
  # Extract ATE and standard error
  lasso_ate <- coef(lasso_model)["D"]
  lasso_se <- lasso_summary$coefficients["D", "Std. Error"]
  
  cat("\n🎯 LASSO ATE estimate:", round(lasso_ate, 4), "\n")
  cat("📏 Standard Error:", round(lasso_se, 4), "\n")
  
} else {
  cat("⚠️  No variables selected by LASSO - using simple model\n")
  lasso_ate <- simple_ate
  lasso_se <- simple_se
}

In [None]:
# Final comparison of all three estimates
final_comparison <- data.frame(
  Model = c("Simple", "Controlled", "LASSO-Selected"),
  ATE_Estimate = c(simple_ate, controlled_ate, lasso_ate),
  Standard_Error = c(simple_se, controlled_se, lasso_se),
  Variables_Used = c("None", "X1, X2, X3, X4", 
                    ifelse(length(selected_vars) > 0, paste(selected_vars, collapse = ", "), "None"))
)

cat("📋 Final Comparison of All ATE Estimates:\n")
print(final_comparison)

cat("\n🔍 Discussion:\n")
cat("• True ATE: 2.0\n")
cat("• LASSO helps with variable selection in high-dimensional settings\n")
cat("• In this case, we know X4 has no true effect (coefficient = 0)\n")
cat("• LASSO should ideally select X1, X2, X3 and exclude X4\n")
cat("• Benefits of LASSO: reduces overfitting, improves interpretability\n")
cat("• LASSO may improve precision by removing irrelevant variables\n")
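
One possible variation on the workflow above, not the approach this notebook takes, is to include D in the LASSO design matrix but exempt it from shrinkage, so that the covariates are selected conditional on treatment. The sketch below assumes glmnet's `penalty.factor` argument, where a factor of 0 leaves a column unpenalized. It re-simulates the data so that it runs on its own, and the use of `lambda.1se` and the names `XD`, `pen`, `kept`, and `post_fit` are illustrative choices.

```r
library(glmnet)

# Re-simulate the data as in Section 3.1 so the sketch is self-contained
set.seed(123)
n  <- 1000
X1 <- rnorm(n, mean = 2, sd = 1)
X2 <- rnorm(n, mean = 0, sd = 1.5)
X3 <- rbinom(n, 1, 0.3)
X4 <- rbinom(n, 1, 0.6)
D  <- rbinom(n, 1, 0.5)
Y  <- 2 * D + 0.5 * X1 - 0.3 * X2 + 0.2 * X3 + rnorm(n)

# Design matrix with D in the first column
XD <- cbind(D = D, X1 = X1, X2 = X2, X3 = X3, X4 = X4)

# A penalty factor of 0 leaves D unpenalized, so it is always retained
pen    <- c(0, 1, 1, 1, 1)
cv_fit <- cv.glmnet(XD, Y, alpha = 1, nfolds = 10, penalty.factor = pen)

# Covariates with non-zero coefficients at lambda.1se (illustrative choice)
coef_vec <- as.matrix(coef(cv_fit, s = "lambda.1se"))[, 1]
kept     <- setdiff(names(coef_vec)[coef_vec != 0], c("(Intercept)", "D"))
cat("Covariates kept at lambda.1se:", kept, "\n")

# Post-LASSO step: refit OLS on D plus the kept covariates for the ATE estimate
post_data <- data.frame(Y = Y, XD)
post_fit  <- lm(reformulate(c("D", kept), response = "Y"), data = post_data)
print(summary(post_fit)$coefficients["D", ])
```

Refitting by OLS after selection avoids reading the ATE off a shrunken LASSO coefficient; the reported standard error still ignores the selection step.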