## **Potential Outcomes and RCTs**

# Data Simulation
Simulate a dataset with $n = 1000$ individuals. Generate:
*   Covariates $X_1$, $X_2$, $X_3$, $X_4$ (continuous or binary)
*   Treatment assignment $D \sim \text{Bernoulli}(0.5)$
*   Outcome variable:
$Y = 2D + 0.5X_1 - 0.3X_2 + 0.2X_3 + \epsilon, \quad \epsilon \sim N(0, 1)$

Save everything in a `data.frame`.

Perform a balance check: compare the means of ($X_1$, $X_2$, $X_3$, $X_4$) across treatment and control groups (e.g., using `t.test` or regression).
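In potential-outcomes notation, the outcome equation above can be read as follows. This is a sketch that assumes the equation is structural, so the same covariates and the same $\epsilon_i$ enter both potential outcomes; the $Y_i(d)$ notation is introduced here for illustration.

$$
Y_i(d) = 2d + 0.5X_{1i} - 0.3X_{2i} + 0.2X_{3i} + \epsilon_i, \qquad d \in \{0, 1\}
$$

$$
Y_i(1) - Y_i(0) = 2 \quad \Longrightarrow \quad \text{ATE} = E[Y_i(1) - Y_i(0)] = 2
$$

Since $D_i$ is drawn independently of $(X_{1i}, \dots, X_{4i}, \epsilon_i)$, the difference in group means identifies the ATE:

$$
E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0] = E[Y_i(1)] - E[Y_i(0)] = 2
$$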

In [1]:
# Data Simulation
set.seed(42)
n <- 1000

# Generate covariates
X1 <- rnorm(n, 0, 1)
X2 <- rnorm(n, 0, 1)
X3 <- rnorm(n, 0, 1)
X4 <- rnorm(n, 0, 1)

# Treatment assignment (Bernoulli)
D <- rbinom(n, 1, 0.5)

# Error term
epsilon <- rnorm(n, 0, 1)

# Outcome variable
Y <- 2 * D + 0.5 * X1 - 0.3 * X2 + 0.2 * X3 + epsilon

# Data frame
df <- data.frame(Y = Y, D = D, X1 = X1, X2 = X2, X3 = X3, X4 = X4)
df

Y,D,X1,X2,X3,X4
<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>
-0.1644381,0,1.37095845,2.32505849,0.25057807,-0.68566166
-1.7095530,0,-0.56469817,0.52412218,-0.27792405,-0.79271447
-0.4741293,0,0.36312841,0.97073342,-1.72473573,-0.40700415
-0.9520352,0,0.63286260,0.37697340,-2.00670494,-1.14867061
1.4316401,0,0.40426832,-0.99593340,-1.29180833,1.11576047
3.7095170,1,-0.10612452,-0.59748291,0.36583823,-0.87945678
1.5053523,1,1.51152200,0.16525142,-0.15220325,1.27932294
1.3904665,1,-0.09465904,-2.92847718,-0.73409409,-1.45427725
0.6221851,0,2.01842371,-0.84791423,-0.78197206,0.84222877
-0.2203141,0,-0.06271410,0.79858451,0.55156703,-0.60321501


In [2]:
# Balance check
cat("Balance Check\n")
cat(paste(rep("=", 60), collapse = ""), "\n")
cat(sprintf("%-10s %-12s %-10s %-12s %-10s\n", "Variable", "Treated", "Control", "Difference", "p-value"))
cat(paste(rep("-", 60), collapse = ""), "\n")

covariate_names <- c("X1", "X2", "X3", "X4")

for (cov_name in covariate_names) {
    mean_treated <- mean(df[df$D == 1, cov_name])
    mean_control <- mean(df[df$D == 0, cov_name])
    diff <- mean_treated - mean_control

    # Regress demeaned D on the demeaned covariate (no intercept);
    # the slope's p-value tests whether the covariate predicts treatment
    cov_data <- df[, cov_name] - mean(df[, cov_name])
    D_data <- df$D - mean(df$D)
    model <- lm(D_data ~ 0 + cov_data)
    p_value <- summary(model)$coefficients[1, 4]

    cat(sprintf("%-10s %-12.4f %-10.4f %-12.4f %-10.4f\n",
                cov_name, mean_treated, mean_control, diff, p_value))
}

Balance Check
Variable   Treated      Control    Difference   p-value   
------------------------------------------------------------ 
X1         -0.0093      -0.0410    0.0317       0.6176    
X2         0.0010       -0.0111    0.0121       0.8458    
X3         0.0430       -0.0458    0.0888       0.1730    
X4         -0.0792      0.0312     -0.1103      0.0780    
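As a cross-check on the balance table, the sketch below runs a Welch `t.test` per covariate, computes a standardized mean difference, and adds a joint F-test of whether the covariates predict `D` at all. The names `balance`, `smd`, and `joint` are illustrative and not part of the original notebook. The data are re-simulated with the same seed and draw order so the block runs on its own.

```r
# Rebuild the simulated data (same seed and draw order as the notebook)
set.seed(42)
n  <- 1000
X1 <- rnorm(n); X2 <- rnorm(n); X3 <- rnorm(n); X4 <- rnorm(n)
D  <- rbinom(n, 1, 0.5)
Y  <- 2 * D + 0.5 * X1 - 0.3 * X2 + 0.2 * X3 + rnorm(n)
df <- data.frame(Y, D, X1, X2, X3, X4)

covariate_names <- c("X1", "X2", "X3", "X4")

# One row per covariate: raw difference, standardized difference, Welch p-value
balance <- do.call(rbind, lapply(covariate_names, function(v) {
  x_t <- df[df$D == 1, v]
  x_c <- df[df$D == 0, v]
  tt  <- t.test(x_t, x_c)  # Welch two-sample t-test (unequal variances)
  data.frame(
    variable = v,
    diff     = mean(x_t) - mean(x_c),
    smd      = (mean(x_t) - mean(x_c)) / sqrt((var(x_t) + var(x_c)) / 2),
    p_welch  = tt$p.value
  )
}))
print(balance, digits = 3)

# Joint test: regress D on all covariates and read off the overall F-test
joint   <- lm(D ~ X1 + X2 + X3 + X4, data = df)
f       <- summary(joint)$fstatistic
joint_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
cat("Joint F-test p-value:", joint_p, "\n")
```

The joint test is useful because running four separate tests raises the chance that at least one looks borderline purely by chance.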


# Estimating the Average Treatment Effect
Estimate the treatment effect (ATE) using a simple regression:
$Y \sim D$

Estimate the ATE controlling for all covariates:
$Y \sim D + X_1 + X_2 + X_3 + X_4$

Compare the two estimates. Answer the following:
*   Does the ATE change?
*   What happens to the standard errors?

In [3]:
# ATE estimation
cat("\n", paste(rep("=", 60), collapse = ""), "\n")
cat("Average Treatment Effect Estimation\n")
cat(paste(rep("=", 60), collapse = ""), "\n")

# Model 1: simple regression Y ~ D
cat("Model 1: Y ~ D\n")
model1 <- lm(Y ~ D, data = df)
print(summary(model1)$coefficients)
cat("\n", paste(rep("-", 40), collapse = ""), "\n")

# Model 2: regression with covariates Y ~ D + X1 + X2 + X3 + X4
cat("Model 2: Y ~ D + X1 + X2 + X3 + X4\n")
model2 <- lm(Y ~ D + X1 + X2 + X3 + X4, data = df)
print(summary(model2)$coefficients)


Average Treatment Effect Estimation
Model 1: Y ~ D
               Estimate Std. Error    t value      Pr(>|t|)
(Intercept) -0.04010874 0.05212529 -0.7694679  4.417977e-01
D            2.06436233 0.07523638 27.4383545 5.864527e-124

 ---------------------------------------- 
Model 2: Y ~ D + X1 + X2 + X3 + X4
               Estimate Std. Error     t value      Pr(>|t|)
(Intercept) -0.01580038 0.04439831  -0.3558780  7.220074e-01
D            2.03682807 0.06417105  31.7406078 3.038903e-153
X1           0.49611755 0.03192753  15.5388652  6.586741e-49
X2          -0.35610521 0.03250887 -10.9540937  1.923414e-26
X3           0.16206514 0.03111858   5.2079864  2.320027e-07
X4          -0.01588207 0.03244319  -0.4895348  6.245711e-01
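With a binary regressor, the OLS slope in `Y ~ D` is numerically the difference between the treated and control group means of `Y`. The sketch below checks this on the re-simulated data. The names `ols_coef` and `diff_means` are illustrative and not from the original notebook.

```r
# Rebuild the simulated data (same seed and draw order as the notebook)
set.seed(42)
n  <- 1000
X1 <- rnorm(n); X2 <- rnorm(n); X3 <- rnorm(n); X4 <- rnorm(n)
D  <- rbinom(n, 1, 0.5)
Y  <- 2 * D + 0.5 * X1 - 0.3 * X2 + 0.2 * X3 + rnorm(n)
df <- data.frame(Y, D, X1, X2, X3, X4)

# Slope on D from the simple regression
ols_coef <- unname(coef(lm(Y ~ D, data = df))["D"])

# Raw difference in group means
diff_means <- mean(df$Y[df$D == 1]) - mean(df$Y[df$D == 0])

cat("OLS coefficient on D:", ols_coef, "\n")
cat("Difference in means: ", diff_means, "\n")
cat("Equal up to floating-point error:",
    isTRUE(all.equal(ols_coef, diff_means)), "\n")
```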


# **Comparison of the ATE Estimates**

### Does the ATE change?

Yes, the estimated ATE changes slightly.
- **Model 1** (no controls): ATE = 2.0644
- **Model 2** (with controls): ATE = 2.0368

The difference is about 0.028. Because $D$ is randomly assigned, both estimators are unbiased for the true effect of 2.0 used in the simulation. The small shift reflects chance imbalance in the covariates (notably $X_3$ and $X_4$ in the balance table), which the adjusted regression absorbs. In this sample the adjusted estimate happens to land closer to 2.0.

### What happens to the standard errors?

The standard errors decrease.
- **Model 1**: SE = 0.0752
- **Model 2**: SE = 0.0642

The standard error falls by about 14.7%. This implies:

1. Greater precision: the ATE estimate is more precise once the controls are included.
2. Narrower 95% confidence intervals:
   - Model 1: [1.917, 2.212]
   - Model 2: [1.911, 2.163]
3. Lower residual variance: the controls explain part of the variation in $Y$ that Model 1 leaves in the error term.
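The standard-error comparison can be reproduced directly from the fitted models. The sketch below pulls the standard errors for `D`, computes the percentage reduction, and uses `confint` for t-based 95% intervals. The names `se1`, `se2`, `pct_reduction`, and `ci` are illustrative and not from the original notebook.

```r
# Rebuild the simulated data (same seed and draw order as the notebook)
set.seed(42)
n  <- 1000
X1 <- rnorm(n); X2 <- rnorm(n); X3 <- rnorm(n); X4 <- rnorm(n)
D  <- rbinom(n, 1, 0.5)
Y  <- 2 * D + 0.5 * X1 - 0.3 * X2 + 0.2 * X3 + rnorm(n)
df <- data.frame(Y, D, X1, X2, X3, X4)

model1 <- lm(Y ~ D, data = df)
model2 <- lm(Y ~ D + X1 + X2 + X3 + X4, data = df)

# Standard errors of the coefficient on D
se1 <- summary(model1)$coefficients["D", "Std. Error"]
se2 <- summary(model2)$coefficients["D", "Std. Error"]
pct_reduction <- 100 * (1 - se2 / se1)
cat(sprintf("SE reduction: %.1f%%\n", pct_reduction))

# 95% confidence intervals for the ATE under each model
ci <- rbind(confint(model1, "D"), confint(model2, "D"))
rownames(ci) <- c("Model 1", "Model 2")
print(round(ci, 3))

# Residual standard deviations: the controls shrink the unexplained variation
cat("Residual SD, Model 1:", sigma(model1), "\n")
cat("Residual SD, Model 2:", sigma(model2), "\n")
```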

### Conclusion

Including relevant covariates (especially $X_1$, $X_2$, $X_3$, which predict $Y$) improves the precision of the estimator. It does not correct a bias, since randomization already makes the unadjusted estimator unbiased. It only adjusts for chance imbalance between the groups, which in this sample moves the estimate closer to the true causal effect.
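A single sample cannot separate bias from sampling noise, so a small Monte Carlo experiment helps. The sketch below redraws the data 500 times from the same data-generating process and compares the two estimators. The seed, the number of replications, and the helper `sim_once` are assumptions chosen for illustration and are not part of the original notebook.

```r
set.seed(2024)   # arbitrary seed, not from the original notebook
n_sims <- 500    # number of simulated experiments (illustrative choice)
n      <- 1000

# One simulated RCT: returns the unadjusted and adjusted ATE estimates
sim_once <- function() {
  X1 <- rnorm(n); X2 <- rnorm(n); X3 <- rnorm(n); X4 <- rnorm(n)
  D  <- rbinom(n, 1, 0.5)
  Y  <- 2 * D + 0.5 * X1 - 0.3 * X2 + 0.2 * X3 + rnorm(n)
  c(unadjusted = unname(coef(lm(Y ~ D))["D"]),
    adjusted   = unname(coef(lm(Y ~ D + X1 + X2 + X3 + X4))["D"]))
}

# Rows are replications, columns are the two estimators
est <- t(replicate(n_sims, sim_once()))

cat("Mean of estimates (true ATE = 2):\n")
print(round(colMeans(est), 4))
cat("Standard deviation across replications:\n")
print(round(apply(est, 2, sd), 4))
```

Across replications, both means should sit near 2, while the adjusted estimator should show the smaller spread. That is the pattern expected if adjustment buys precision rather than removing bias.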

# LASSO and Variable Selection


- Use `cv.glmnet` to fit a LASSO model of $Y$ on the covariates $X_1, \dots, X_q$ (here $q = 4$), excluding the treatment.  
  - Report which covariates are selected at $\lambda_{\min}$.

- Re-estimate the ATE with only the covariates selected by LASSO:  

  $Y \sim D + X_{\text{selected}}$

- Compare this estimate with those from Part B. Discuss whether the accuracy changes and what advantages using LASSO might have in this context.


In [5]:
install.packages("glmnet")
library(glmnet)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘iterators’, ‘foreach’, ‘shape’, ‘RcppEigen’


Loading required package: Matrix

Loaded glmnet 4.1-10



In [16]:
# Covariate matrix (excludes D) and the outcome
X <- as.matrix(df[, c("X1", "X2", "X3", "X4")])
y <- df$Y

# LASSO with cross-validation
set.seed(123)
cv.lasso <- cv.glmnet(X, y, alpha = 1, nfolds = 10, standardize = TRUE)

lambda_min <- cv.lasso$lambda.min
coef_min <- coef(cv.lasso, s = "lambda.min")

# Selected variables (nonzero coefficients)
sel_idx <- which(coef_min[-1, 1] != 0)  # -1 drops the intercept
selected_vars <- colnames(X)[sel_idx]

cat("λ_min:", lambda_min, "\n")
cat("Selected covariates:",
    if (length(selected_vars) > 0) paste(selected_vars, collapse = ", ") else "None", "\n")

# Re-estimate the ATE with D plus the selected covariates
if (length(selected_vars) > 0) {
  fml <- as.formula(paste("Y ~ D +", paste(selected_vars, collapse = " + ")))
} else {
  fml <- Y ~ D
}

model3 <- lm(fml, data = df)
summary(model3)

λ_min: 0.002294139 
Selected covariates: X1, X2, X3, X4 



Call:
lm(formula = fml, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3777 -0.6501  0.0112  0.6876  3.3780 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.01580    0.04440  -0.356    0.722    
D            2.03683    0.06417  31.741  < 2e-16 ***
X1           0.49612    0.03193  15.539  < 2e-16 ***
X2          -0.35611    0.03251 -10.954  < 2e-16 ***
X3           0.16207    0.03112   5.208 2.32e-07 ***
X4          -0.01588    0.03244  -0.490    0.625    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.011 on 994 degrees of freedom
Multiple R-squared:  0.5892,	Adjusted R-squared:  0.5871 
F-statistic: 285.1 on 5 and 994 DF,  p-value: < 2.2e-16


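Comparing Model 2 (3.2) with Model 3 (LASSO at λ_min), the ATE results are identical (2.0368 with SE 0.0642): LASSO retains all four covariates, so Model 3 is the same regression as Model 2 and the precision does not change. Although X4 does not enter the true model for Y, LASSO keeps it because λ_min minimizes cross-validated prediction error, which tolerates small nonzero coefficients, and because X4 shows a slight spurious correlation with Y in this simulated sample. The coefficient on X4 is not statistically significant (p = 0.625), so including it does not alter the interpretation of the ATE. With only four candidate covariates LASSO adds little here; its advantage would appear with many candidate controls, where it offers a data-driven way to keep those that predict Y.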
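One way to probe the selection is to compare `lambda.min` with the more conservative `lambda.1se`, which `cv.glmnet` also reports. The sketch below does this on the re-simulated data and re-estimates the ATE under each selection. It assumes `glmnet` is already installed. The helpers `selected_at` and `fit_ate` are illustrative names and not from the original notebook.

```r
library(glmnet)  # assumes the package is already installed

# Rebuild the simulated data (same seed and draw order as the notebook)
set.seed(42)
n  <- 1000
X1 <- rnorm(n); X2 <- rnorm(n); X3 <- rnorm(n); X4 <- rnorm(n)
D  <- rbinom(n, 1, 0.5)
Y  <- 2 * D + 0.5 * X1 - 0.3 * X2 + 0.2 * X3 + rnorm(n)
df <- data.frame(Y, D, X1, X2, X3, X4)

X <- as.matrix(df[, c("X1", "X2", "X3", "X4")])

set.seed(123)
cv_fit <- cv.glmnet(X, df$Y, alpha = 1, nfolds = 10)

# Names of covariates with nonzero coefficients at a given lambda rule
selected_at <- function(s) {
  b <- as.matrix(coef(cv_fit, s = s))[-1, 1]  # drop the intercept
  names(b)[b != 0]
}
sel_min <- selected_at("lambda.min")
sel_1se <- selected_at("lambda.1se")
cat("Selected at lambda.min:", paste(sel_min, collapse = ", "), "\n")
cat("Selected at lambda.1se:", paste(sel_1se, collapse = ", "), "\n")

# Post-LASSO OLS: estimate and standard error of the coefficient on D
fit_ate <- function(vars) {
  fml <- reformulate(c("D", vars), response = "Y")
  summary(lm(fml, data = df))$coefficients["D", c("Estimate", "Std. Error")]
}
print(rbind(lambda.min = fit_ate(sel_min), lambda.1se = fit_ate(sel_1se)))
```

Because `lambda.1se` applies a heavier penalty, it tends to keep fewer covariates, but whether it drops `X4` in this particular sample has to be read from the output rather than assumed.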