## **Potential Outcomes and RCTs**


# Data Simulation
Simulate a dataset with $n = 1000$ individuals. Generate:
*   Covariates $X_1$, $X_2$, $X_3$, $X_4$ (continuous or binary)
*   Treatment assignment $D \sim \text{Bernoulli}(0.5)$
*   Outcome variable:
$Y = 2D + 0.5X_1 - 0.3X_2 + 0.2X_3 + \epsilon, \quad \epsilon \sim N(0, 1)$

Save everything in a data frame (a pandas `DataFrame` in the code below).

Perform a balance check: compare the means of ($X_1$, $X_2$, $X_3$, $X_4$) across treatment and control groups (e.g., using `t.test` or regression).
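
The outcome equation can also be read in potential-outcomes terms. The following is only a restatement of the data-generating process above, with $Y(0)$ and $Y(1)$ introduced here as notation for the outcome without and with treatment:

$$
Y(0) = 0.5X_1 - 0.3X_2 + 0.2X_3 + \epsilon, \qquad Y(1) = Y(0) + 2
$$

$$
Y = D\,Y(1) + (1 - D)\,Y(0), \qquad \text{ATE} = E[Y(1) - Y(0)] = 2
$$

Because $D$ is drawn independently of the covariates and of $\epsilon$, the difference in group means $E[Y \mid D = 1] - E[Y \mid D = 0]$ equals the ATE. Note that $X_4$ does not enter the outcome equation at all.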

In [None]:
# Import libraries
import numpy as np
import pandas as pd
from statsmodels import api as sms

In [None]:
# Data Simulation
np.random.seed(42)
n = 1000

# Generate covariates (all four drawn as independent standard normals)
X1 = np.random.normal(0, 1, n)
X2 = np.random.normal(0, 1, n)
X3 = np.random.normal(0, 1, n)
X4 = np.random.normal(0, 1, n)

# Treatment assignment: Bernoulli(0.5), independent of the covariates
D = np.random.binomial(1, 0.5, n)

# Error term
epsilon = np.random.normal(0, 1, n)

# Outcome variable
Y = 2 * D + 0.5 * X1 - 0.3 * X2 + 0.2 * X3 + epsilon

# Data frame
df = pd.DataFrame({
    'Y': Y,
    'D': D,
    'X1': X1,
    'X2': X2,
    'X3': X3,
    'X4': X4
})
df

Unnamed: 0,Y,D,X1,X2,X3,X4
0,-0.456602,0,0.496714,1.399355,-0.675178,-1.907808
1,-1.639432,0,-0.138264,0.924634,-0.144519,-0.860385
2,2.633656,1,0.647689,0.059630,-0.792420,-0.413606
3,0.944632,1,1.523030,-0.646937,-0.307962,1.887688
4,-0.755624,1,-0.234153,0.698223,-1.893615,0.556553
...,...,...,...,...,...,...
995,-0.351478,0,-0.281100,1.070150,0.077481,0.028458
996,-0.667194,0,1.797687,-0.026521,0.257753,-2.077812
997,-2.169940,0,0.640843,-0.881875,-1.241761,-0.320298
998,2.127628,1,-0.571179,-0.163067,0.334176,1.643378


In [None]:
# Balance check
print("Balance Check")
print("=" * 60)
print(f"{'Variable':<10} {'Treated':<12} {'Control':<10} {'Difference':<12} {'p-value':<10}")
print("-" * 60)

covariate_names = ['X1', 'X2', 'X3', 'X4']

for cov_name in covariate_names:
    mean_treated = df.loc[df['D'] == 1, cov_name].mean()
    mean_control = df.loc[df['D'] == 0, cov_name].mean()
    diff = mean_treated - mean_control
    # Regress the demeaned covariate on the demeaned treatment (no intercept).
    # The slope equals the difference in means; its p-value tests balance.
    cov = df[cov_name].values.reshape(-1, 1)
    model = sms.OLS(cov - cov.mean(), df['D'].values.reshape(-1, 1) - df['D'].mean())
    results = model.fit()
    p_value = results.pvalues[0]

    print(f"{cov_name:<10} {mean_treated:>10.4f} {mean_control:>10.4f} {diff:>12.4f} {p_value:>10.4f}")

Balance Check
Variable   Treated      Control    Difference   p-value   
------------------------------------------------------------
X1            -0.0493     0.0858      -0.1350     0.0291
X2             0.0929     0.0495       0.0434     0.4915
X3             0.0197    -0.0076       0.0274     0.6599
X4            -0.0075    -0.0296       0.0222     0.7329
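
As a complement to p-values, balance is often summarized with standardized mean differences (SMDs), which do not depend on sample size. The sketch below is illustrative only: the helper name `standardized_mean_difference` is hypothetical, and the data are re-simulated with a separate generator under the same assumptions as the simulation cell, so the numbers will not match `df`.

```python
import numpy as np


def standardized_mean_difference(x, d):
    """Difference in group means divided by the pooled standard deviation."""
    x = np.asarray(x, dtype=float)
    d = np.asarray(d).astype(bool)
    treated, control = x[d], x[~d]
    pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2.0)
    return (treated.mean() - control.mean()) / pooled_sd


# Illustrative data: independent N(0, 1) covariates and Bernoulli(0.5)
# treatment, as in the simulation cell (assumption: a different seed and
# generator are used here, so values differ from `df`).
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(0.0, 1.0, size=(n, 4))
D = rng.binomial(1, 0.5, size=n)

smd = {f"X{k + 1}": standardized_mean_difference(X[:, k], D) for k in range(4)}
for name, value in smd.items():
    print(f"{name}: SMD = {value:+.4f}")
```

A frequently cited rule of thumb treats an absolute SMD above 0.1 as worth a closer look, although under randomization any such imbalance arises by chance.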


# Estimating the Average Treatment Effect
Estimate the average treatment effect (ATE) using a simple regression:
$Y \sim D$

Estimate the ATE controlling for all covariates:
$Y \sim D + X_1 + X_2 + X_3 + X_4$

Compare the two estimates. Answer the following:
*   Does the ATE change?
*   What happens to the standard errors?

In [None]:
# ATE estimation
print("\n" + "=" * 60)
print("Average Treatment Effect Estimation")
print("=" * 60)

# Model 1: simple regression Y ~ D
print("Model 1: Y ~ D")
model1 = sms.OLS(df['Y'], sms.add_constant(df['D']))
results1 = model1.fit()
print(results1.summary().tables[1])
print("\n" + "-" * 40)

# Model 2: regression with covariates Y ~ D + X1 + X2 + X3 + X4
print("Model 2: Y ~ D + X1 + X2 + X3 + X4")
model2 = sms.OLS(df['Y'], sms.add_constant(df[['D', 'X1', 'X2', 'X3', 'X4']]))
results2 = model2.fit()
print(results2.summary().tables[1])


Average Treatment Effect Estimation
Model 1: Y ~ D
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0268      0.051     -0.526      0.599      -0.127       0.073
D              1.9728      0.073     27.116      0.000       1.830       2.116

----------------------------------------
Model 2: Y ~ D + X1 + X2 + X3 + X4
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0505      0.044     -1.151      0.250      -0.137       0.036
D              2.0428      0.063     32.623      0.000       1.920       2.166
X1             0.4530      0.032     14.152      0.000       0.390       0.516
X2            -0.3079      0.031     -9.807      0.000      -0.370      -0.246
X3             0.2125      0.032      6.686      0.000       0.150       0.275
X4

# **Comparison of the ATE Estimates**

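### Does the ATE change?

Yes, the ATE estimate changes slightly.
- **Model 1** (no controls): ATE = 1.9728
- **Model 2** (with controls): ATE = 2.0428

The difference is approximately **0.07**. The estimate rises when we control for the covariates. Both estimates lie within one standard error of the true value of 2.0 used in the simulation (Model 1 is off by about 0.03, Model 2 by about 0.04), so the change reflects sampling noise rather than a correction. It is driven mainly by the chance imbalance in $X_1$ seen in the balance check: a mean difference of -0.1350 multiplied by a coefficient of about 0.45 gives roughly -0.06.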
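
The gap between the two estimates follows an exact OLS identity: the unadjusted coefficient on `D` equals the adjusted one plus the sum, over covariates, of each covariate's coefficient times its treated-minus-control mean difference. The sketch below checks this numerically on re-simulated data; the helper `ols` and the seed are assumptions made for illustration, so the numbers differ from the tables above.

```python
import numpy as np


def ols(y, X):
    """OLS coefficients of y on X, with an intercept prepended."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


# Re-simulate data under the same data-generating process (assumed seed).
rng = np.random.default_rng(1)
n = 1000
X = rng.normal(size=(n, 4))
D = rng.binomial(1, 0.5, size=n)
Y = 2 * D + 0.5 * X[:, 0] - 0.3 * X[:, 1] + 0.2 * X[:, 2] + rng.normal(size=n)

# Coefficient on D without and with covariate adjustment.
tau_simple = ols(Y, D)[1]
coef_adjusted = ols(Y, np.column_stack([D, X]))
tau_adjusted, beta = coef_adjusted[1], coef_adjusted[2:]

# Treated-minus-control mean difference of each covariate.
mean_diff = X[D == 1].mean(axis=0) - X[D == 0].mean(axis=0)
implied_gap = beta @ mean_diff

print(f"simple - adjusted          = {tau_simple - tau_adjusted:+.6f}")
print(f"sum(beta_k * mean_diff_k)  = {implied_gap:+.6f}")
```

The two printed values agree up to floating-point error, which is why a covariate that both predicts `Y` and happens to be imbalanced moves the estimate.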

### What happens to the standard errors?

The standard errors decrease.
- **Model 1**: std err = 0.073
- **Model 2**: std err = 0.063

The standard error falls by approximately 14%. This means:

1. Higher precision: the ATE estimate is more precise when we include controls.
2. Narrower confidence intervals:
   - Model 1: [1.830, 2.116] (width 0.286)
   - Model 2: [1.920, 2.166] (width 0.246)
3. Lower residual variance: the controls explain part of the variation in Y that would otherwise remain in the error term.
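
The size of the reduction is consistent with a back-of-the-envelope calculation. It assumes, as in the simulation code, independent unit-variance covariates and a treated share $p$ close to 0.5; $\sigma^2_{\text{res}}$ is notation introduced here for the residual variance of each model:

$$
\operatorname{SE}(\hat\tau) \approx \sqrt{\frac{\sigma^2_{\text{res}}}{n\,p\,(1 - p)}}, \qquad n\,p\,(1 - p) = 1000 \times 0.5 \times 0.5 = 250
$$

$$
\text{Model 1: } \sigma^2_{\text{res}} = 0.5^2 + 0.3^2 + 0.2^2 + 1 = 1.38 \;\Rightarrow\; \sqrt{1.38 / 250} \approx 0.074
$$

$$
\text{Model 2: } \sigma^2_{\text{res}} = 1 \;\Rightarrow\; \sqrt{1 / 250} \approx 0.063
$$

The ratio $0.063 / 0.074 \approx 0.85$ corresponds to a reduction of roughly 15%, in line with the regression output.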

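### Conclusion

Including relevant covariates (especially X₁, X₂, X₃, which predict Y) improves the precision of the estimator. It does not correct a bias: under random assignment both estimators are unbiased for the ATE, and the small shift between them reflects chance covariate imbalance in this sample. Here the adjusted estimate (2.0428) is in fact slightly farther from the true effect of 2.0 than the unadjusted one (1.9728), although both confidence intervals contain 2.0.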
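
A small Monte Carlo experiment is one way to see how the two estimators behave across repeated randomizations rather than in a single sample. This is a sketch under the same data-generating process; the function name `estimate_ate`, the seed, and the number of replications (500) are arbitrary choices for illustration.

```python
import numpy as np


def estimate_ate(rng, n=1000):
    """Simulate one experiment and return (unadjusted, adjusted) ATE estimates."""
    X = rng.normal(size=(n, 4))
    D = rng.binomial(1, 0.5, size=n)
    Y = 2 * D + 0.5 * X[:, 0] - 0.3 * X[:, 1] + 0.2 * X[:, 2] + rng.normal(size=n)
    ones = np.ones(n)
    # Y ~ D
    simple, *_ = np.linalg.lstsq(np.column_stack([ones, D]), Y, rcond=None)
    # Y ~ D + X1 + X2 + X3 + X4
    adjusted, *_ = np.linalg.lstsq(np.column_stack([ones, D, X]), Y, rcond=None)
    return simple[1], adjusted[1]


rng = np.random.default_rng(2024)
draws = np.array([estimate_ate(rng) for _ in range(500)])

# Average and spread of each estimator across replications.
mean_simple, mean_adjusted = draws.mean(axis=0)
sd_simple, sd_adjusted = draws.std(axis=0, ddof=1)

print(f"Unadjusted: mean = {mean_simple:.3f}, sd = {sd_simple:.3f}")
print(f"Adjusted:   mean = {mean_adjusted:.3f}, sd = {sd_adjusted:.3f}")
```

Both averages should sit near 2, while the adjusted estimator should show the smaller spread, which is the precision gain discussed above.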