# 3. Potential Outcomes and RCTs (Julia Implementation)

This notebook simulates a randomized controlled trial (RCT) and estimates treatment effects within the potential outcomes framework, using Julia.

## Assignment Requirements:
1. **Data Simulation (3 points)**: Simulate a dataset with covariates, treatment, and outcome
2. **Estimating Average Treatment Effect (3 points)**: Simple and controlled regression estimates
3. **LASSO and Variable Selection (3 points)**: Use LASSO for covariate selection and ATE estimation

In [None]:
# Import required packages
using DataFrames
using GLMNet
using GLM
using Distributions
using Random
using Statistics
using Plots
using StatsBase
using HypothesisTests
using Printf

# Set random seed for reproducibility
Random.seed!(123)

println("📦 Packages loaded successfully")

## 3.1 Data Simulation (3 points)

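We simulate a dataset of n = 1000 individuals with:
- Covariates: X₁ ~ Normal(mean 2, sd 1) and X₂ ~ Normal(mean 0, sd 1.5) are continuous; X₃ ~ Bernoulli(0.3) and X₄ ~ Bernoulli(0.6) are binary
- Treatment assignment D ~ Bernoulli(0.5), drawn independently of the covariates
- Outcome variable: Y = 2D + 0.5X₁ - 0.3X₂ + 0.2X₃ + ε, where ε ~ N(0,1); X₄ does not enter the outcome equation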
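
In potential-outcomes terms, this outcome equation implies Y(1) = Y(0) + 2 for every individual, and only one of the two is ever observed. The following is a minimal sketch of that idea, not the notebook's own simulation: the names `Y0`, `Y1`, `Y_obs` and the seed are assumptions made for illustration.

```julia
using Random, Statistics

rng = MersenneTwister(123)            # illustrative seed (assumption)
n = 1000

X1 = 2 .+ randn(rng, n)               # Normal(mean 2, sd 1)
X2 = 1.5 .* randn(rng, n)             # Normal(mean 0, sd 1.5)
X3 = rand(rng, n) .< 0.3              # Bernoulli(0.3)
D  = rand(rng, n) .< 0.5              # randomized treatment, Bernoulli(0.5)
ε  = randn(rng, n)

# Potential outcomes: untreated and treated versions of the same unit
Y0 = 0.5 .* X1 .- 0.3 .* X2 .+ 0.2 .* X3 .+ ε
Y1 = Y0 .+ 2.0

# Only the outcome matching the assigned treatment is observed
Y_obs = ifelse.(D, Y1, Y0)

true_ate   = mean(Y1 .- Y0)                       # 2 by construction
diff_means = mean(Y_obs[D]) - mean(Y_obs[.!D])    # feasible estimator

println("True ATE: ", true_ate)
println("Difference in means: ", round(diff_means, digits=4))
```

Because D is randomized, the difference in observed group means should land close to the true ATE, which is what the regressions in Section 3.2 estimate.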

In [None]:
# Set parameters
n = 1000

# Generate covariates
X1 = rand(Normal(2, 1), n)          # Continuous covariate
X2 = rand(Normal(0, 1.5), n)       # Continuous covariate
X3 = rand(Bernoulli(0.3), n)       # Binary covariate
X4 = rand(Bernoulli(0.6), n)       # Binary covariate

# Generate treatment assignment
D = rand(Bernoulli(0.5), n)        # Treatment ~ Bernoulli(0.5)

# Generate error term
ε = rand(Normal(0, 1), n)

# Generate outcome variable: Y = 2D + 0.5X1 - 0.3X2 + 0.2X3 + ε
Y = 2 .* D .+ 0.5 .* X1 .- 0.3 .* X2 .+ 0.2 .* X3 .+ ε

# Create DataFrame
data = DataFrame(
    Y = Y,
    D = D,
    X1 = X1,
    X2 = X2,
    X3 = X3,
    X4 = X4
)

println("📊 Dataset simulated successfully")
println("Sample size: ", nrow(data))
println("Treatment group size: ", sum(data.D))
println("Control group size: ", sum(1 .- data.D))
println()

# Display first few rows
println("First 5 rows of the dataset:")
first(data, 5)

### Balance Check (1 point)

We perform a balance check by comparing the means of X₁, X₂, X₃, X₄ across treatment and control groups.
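
The comparison relies on a Welch (unequal-variance) t-test. As a rough sketch of what that statistic measures, the difference in means can be divided by its estimated standard error by hand; the helper name `welch_t` and the toy vectors are hypothetical and not part of the notebook.

```julia
using Statistics

# Hypothetical helper: Welch's t statistic for a difference in group means
function welch_t(treated::AbstractVector{<:Real}, control::AbstractVector{<:Real})
    se = sqrt(var(treated) / length(treated) + var(control) / length(control))
    return (mean(treated) - mean(control)) / se
end

# Toy data (illustrative only): means differ by 1, each group has variance 1
t_stat = welch_t([2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
println("Welch t statistic: ", round(t_stat, digits=4))
```

A statistic near zero indicates that the covariate has similar means in both groups, which is what randomization should deliver on average.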

In [None]:
# Balance check: compare means across treatment groups
control_group = data[data.D .== 0, :]
treatment_group = data[data.D .== 1, :]

balance_results = DataFrame(
    Variable = String[],
    Control_Mean = Float64[],
    Treatment_Mean = Float64[],
    Difference = Float64[],
    p_value = Float64[]
)

for var in [:X1, :X2, :X3, :X4]
    control_mean = mean(control_group[!, var])
    treatment_mean = mean(treatment_group[!, var])
    difference = treatment_mean - control_mean
    
    # Perform t-test
    t_test = UnequalVarianceTTest(treatment_group[!, var], control_group[!, var])
    p_val = pvalue(t_test)
    
    push!(balance_results, (
        String(var),
        control_mean,
        treatment_mean,
        difference,
        p_val
    ))
end

println("🔍 Balance Check Results:")
balance_results

In [None]:
println("📈 Balance looks good when differences are small and p-values exceed 0.05 (with four tests, one small p-value can still occur by chance)")
println("All p-values > 0.05: ", all(balance_results.p_value .> 0.05))

## 3.2 Estimating the Average Treatment Effect (3 points)

We estimate the Average Treatment Effect (ATE) using two approaches:
1. Simple regression: Y ~ D
2. Controlled regression: Y ~ D + X₁ + X₂ + X₃ + X₄
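
With a binary treatment, the slope in the simple regression is numerically identical to the difference in group means. The sketch below checks this with a plain least-squares solve instead of `lm`; the data, the seed, and the variable names are assumptions chosen for illustration.

```julia
using Random, Statistics

rng = MersenneTwister(1)              # illustrative seed (assumption)
n = 200
D = rand(rng, n) .< 0.5               # binary treatment indicator
Y = 2.0 .* D .+ randn(rng, n)         # outcome with a true effect of 2

# Design matrix with an intercept column and the treatment indicator
X = [ones(n) D]
β = X \ Y                             # least-squares solution

diff_means = mean(Y[D]) - mean(Y[.!D])

println("OLS slope on D:      ", round(β[2], digits=6))
println("Difference in means: ", round(diff_means, digits=6))
```

Adding covariates changes the estimate only through chance imbalances in the sample, since randomization makes D independent of them.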

In [None]:
# 1. Simple regression: Y ~ D
simple_model = lm(@formula(Y ~ D), data)

println("📊 Simple Regression Results (Y ~ D):")
println(simple_model)

# Extract ATE and standard error
simple_ate = coef(simple_model)[2]  # Coefficient for D
simple_se = stderror(simple_model)[2]  # Standard error for D

println("\n🎯 Simple ATE estimate: ", round(simple_ate, digits=4))
println("📏 Standard Error: ", round(simple_se, digits=4))

In [None]:
# 2. Controlled regression: Y ~ D + X1 + X2 + X3 + X4
controlled_model = lm(@formula(Y ~ D + X1 + X2 + X3 + X4), data)

println("📊 Controlled Regression Results (Y ~ D + X1 + X2 + X3 + X4):")
println(controlled_model)

# Extract ATE and standard error
controlled_ate = coef(controlled_model)[2]  # Coefficient for D
controlled_se = stderror(controlled_model)[2]  # Standard error for D

println("\n🎯 Controlled ATE estimate: ", round(controlled_ate, digits=4))
println("📏 Standard Error: ", round(controlled_se, digits=4))

In [None]:
# Compare the two estimates
comparison = DataFrame(
    Model = ["Simple (Y ~ D)", "Controlled (Y ~ D + X1 + X2 + X3 + X4)"],
    ATE_Estimate = [simple_ate, controlled_ate],
    Standard_Error = [simple_se, controlled_se],
    R_Squared = [r2(simple_model), r2(controlled_model)]
)

println("📋 Comparison of ATE Estimates:")
comparison

In [None]:
println("\n🔍 Analysis:")
println("• ATE change: ", round(controlled_ate - simple_ate, digits=4))
println("• Standard error change: ", round(controlled_se - simple_se, digits=4))
println("• The true ATE is 2.0 (from our data generating process)")
println("• Because D is randomized, both estimates are unbiased; controlling for predictive covariates mainly improves precision")
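
The expected precision gain can be approximated from the data generating process itself. The sketch below uses the large-sample, homoskedastic approximation SE ≈ sqrt(σ² / (n · p · (1 - p))), where σ² is the residual variance of the regression and p = 0.5 is the treatment probability. This is an approximation, so the standard errors reported by `lm` will differ somewhat; the variable names are illustrative.

```julia
# Variance contributed to Y by each covariate (coefficient² × covariate variance)
var_x1 = 0.5^2 * 1.0^2              # X1 has sd 1
var_x2 = 0.3^2 * 1.5^2              # X2 has sd 1.5
var_x3 = 0.2^2 * 0.3 * (1 - 0.3)    # X3 is Bernoulli(0.3)
var_ε  = 1.0

# Residual variance: covariates are left in the error term of the simple model
σ2_simple     = var_x1 + var_x2 + var_x3 + var_ε
σ2_controlled = var_ε

n, p = 1000, 0.5
se_simple     = sqrt(σ2_simple / (n * p * (1 - p)))
se_controlled = sqrt(σ2_controlled / (n * p * (1 - p)))

println("Approximate SE, simple model:     ", round(se_simple, digits=4))
println("Approximate SE, controlled model: ", round(se_controlled, digits=4))
```

Under these assumptions the controlled regression shrinks the standard error by roughly a sixth, while the point estimate stays centered on 2 in both models.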

## 3.3 LASSO and Variable Selection (3 points)

We use LASSO to select covariates and then re-estimate the ATE with only the selected variables.
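
LASSO can drop variables because its L1 penalty shrinks small coefficients exactly to zero. In the special case of an orthonormal design with objective ½‖y − Xβ‖² + λ‖β‖₁, the LASSO solution is the soft-thresholded OLS coefficient. The sketch below illustrates that operator only; the helper name `soft_threshold`, the coefficient values, and λ are hypothetical, and the GLMNet fit used in this notebook solves the full problem rather than this simplified case.

```julia
# Hypothetical helper: shrink z toward zero by λ, and set it to zero if |z| ≤ λ
soft_threshold(z::Real, λ::Real) = sign(z) * max(abs(z) - λ, zero(z))

ols_coefs = [0.5, -0.3, 0.2, 0.01]   # illustrative values, not notebook estimates
λ = 0.05

lasso_coefs = soft_threshold.(ols_coefs, λ)
selected = findall(!iszero, lasso_coefs)

println("Shrunken coefficients: ", lasso_coefs)
println("Indices of selected variables: ", selected)
```

Here the fourth coefficient falls below the threshold and is dropped, while the other three are kept but shrunk toward zero.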

In [None]:
# Prepare data for LASSO (excluding treatment D)
X_matrix = Matrix([data.X1 data.X2 data.X3 data.X4])
Y_vector = data.Y

# Standardize features for LASSO
X_scaled = StatsBase.standardize(StatsBase.ZScoreTransform, X_matrix, dims=1)

# Fit LASSO model using cross-validation
cv_result = glmnetcv(X_scaled, Y_vector, alpha=1.0, nfolds=10)

# Locate the λ with the lowest mean cross-validated loss
best_idx = argmin(cv_result.meanloss)
optimal_lambda = cv_result.lambda[best_idx]
println("🎯 Optimal λ: ", optimal_lambda)

# Plot cross-validation results (kept as the last expression so the notebook displays the figure)
plot(cv_result.lambda, cv_result.meanloss, 
     xscale=:log10, xlabel="λ (log scale)", ylabel="Mean Squared Error",
     title="LASSO Cross-Validation", linewidth=2, label="CV Error")
vline!([optimal_lambda], 
       linestyle=:dash, color=:red, 
       label="Optimal λ = $(round(optimal_lambda, digits=6))")

In [None]:
# Get coefficients at optimal lambda
lasso_fit = glmnet(X_scaled, Y_vector, alpha=1.0, lambda=[optimal_lambda])
lasso_coef = lasso_fit.betas[:, 1]

variable_names = ["X1", "X2", "X3", "X4"]
selected_vars = String[]

println("📊 LASSO Coefficients:")
for (i, var) in enumerate(variable_names)
    coef_val = lasso_coef[i]
    println("$var: $(round(coef_val, digits=6))")
    if abs(coef_val) > 1e-6
        push!(selected_vars, var)
    end
end

println("\n✅ Variables selected by LASSO: ", selected_vars)

if length(selected_vars) == 0
    println("⚠️  No variables selected by LASSO at optimal λ")
    println("This might indicate that the penalty is too strong or variables are not predictive")
end

In [None]:
# Re-estimate ATE with LASSO-selected covariates
if !isempty(selected_vars)
    # Build Y ~ D + <selected covariates> programmatically from the selected names
    covariate_terms = term.(Symbol.(selected_vars))
    formula_str = "Y ~ D + " * join(selected_vars, " + ")
    lasso_model = lm(term(:Y) ~ +(term(:D), covariate_terms...), data)
    
    println("📊 LASSO-Selected Model Results: ", formula_str)
    println(lasso_model)
    
    # Extract ATE and standard error
    lasso_ate = coef(lasso_model)[2]  # Coefficient for D
    lasso_se = stderror(lasso_model)[2]  # Standard error for D
    
    println("\n🎯 LASSO ATE estimate: ", round(lasso_ate, digits=4))
    println("📏 Standard Error: ", round(lasso_se, digits=4))
else
    # Fall back to the simple model when LASSO drops every covariate
    println("⚠️  No variables selected by LASSO - using simple model")
    lasso_ate = simple_ate
    lasso_se = simple_se
end

In [None]:
# Final comparison of all three estimates
selected_vars_str = length(selected_vars) > 0 ? join(selected_vars, ", ") : "None"

final_comparison = DataFrame(
    Model = ["Simple", "Controlled", "LASSO-Selected"],
    ATE_Estimate = [simple_ate, controlled_ate, lasso_ate],
    Standard_Error = [simple_se, controlled_se, lasso_se],
    Variables_Used = ["None", "X1, X2, X3, X4", selected_vars_str]
)

println("📋 Final Comparison of All ATE Estimates:")
final_comparison

In [None]:
println("\n🔍 Discussion:")
println("• True ATE: 2.0")
println("• LASSO helps with variable selection in high-dimensional settings")
println("• In this case, we know X4 has no true effect (coefficient = 0)")
println("• LASSO should ideally select X1, X2, X3 and exclude X4")
println("• Benefits of LASSO: reduces overfitting, improves interpretability")
println("• LASSO may improve precision by removing irrelevant variables")

# Visualization of results
models = final_comparison.Model
estimates = final_comparison.ATE_Estimate
errors = final_comparison.Standard_Error

p = scatter(1:3, estimates, yerror=errors, 
           xlabel="Model", ylabel="ATE Estimate",
           title="Comparison of ATE Estimates with Error Bars",
           xticks=(1:3, models), markersize=6, 
           label="ATE Estimates")
hline!([2.0], linestyle=:dash, color=:red, linewidth=2, label="True ATE = 2.0")
p