# Selection on Observables: Simulated Data

This notebook follows the class slides on simulated data, matching, reweighting, and regression-based estimators under selection on observables.

In [None]:
# Setup: rpy2 for R interop, plus core Python libraries
!pip -q install rpy2

%load_ext rpy2.ipython

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

## Simulated data

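We first simulate the data in R and then bring it into Python. In the baseline design, treatment $w$ is determined entirely by the observed covariate $x$, and the true treatment effect is 4 by construction. In the alternative design $(w_{alt}, y_{alt})$, treatment also depends on $z$, which enters the outcome directly.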

In [None]:
%%R
library(tibble)
library(dplyr)  # provides %>% and the dplyr verbs used in later cells

set.seed(123)
n <- 5000
select.dat <- tibble(
  x     = runif(n, 0, 1),
  z     = rnorm(n, 0, 1),
  w     = (x > 0.65),
  y     = -2.5 + 4*w + 1.5*x + rnorm(n, 0, 1),
  w_alt = (x + z > 0.35),
  y_alt = -2.5 + 4*w_alt + 1.5*x + 2.25*z + rnorm(n, 0, 1)
)

head(select.dat)
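For readers who want to work without the R bridge, the same data-generating process can be sketched directly in NumPy. This is only an analogue: the function name `simulate_selection_data` is made up for illustration, and NumPy's random stream differs from R's, so the draws (and any estimates) will not match the R output even with the same seed.

```python
import numpy as np

def simulate_selection_data(n=5000, seed=123):
    """NumPy analogue of the R simulation (hypothetical helper; draws differ from R)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0, 1, n)
    z = rng.normal(0, 1, n)
    # Baseline design: treatment depends only on the observed covariate x
    w = (x > 0.65).astype(int)
    y = -2.5 + 4 * w + 1.5 * x + rng.normal(0, 1, n)
    # Alternative design: treatment and outcome both depend on z
    w_alt = (x + z > 0.35).astype(int)
    y_alt = -2.5 + 4 * w_alt + 1.5 * x + 2.25 * z + rng.normal(0, 1, n)
    return {"x": x, "z": z, "w": w, "y": y, "w_alt": w_alt, "y_alt": y_alt}

sim = simulate_selection_data()
print({name: col[:3] for name, col in sim.items()})
```

Passing the returned dictionary to `pd.DataFrame(sim)` gives a frame with the same column names as `select_dat`.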

In [None]:
# Bring the R data.frame `select.dat` into Python as `select_dat`
from rpy2.robjects import r
from rpy2.robjects import pandas2ri

pandas2ri.activate()
select_dat = pandas2ri.rpy2py(r['select.dat'])
select_dat.head()

## Simulation: nearest neighbor matching with inverse-variance weighting (R)

We start with one-to-one nearest-neighbor matching (`M = 1`) on $x$ using the **Matching** package in R, scaling distances by the inverse variance of each covariate (`Weight = 1`) and targeting the ATE.

In [None]:
%%R
if (!requireNamespace("Matching", quietly = TRUE)) {
  install.packages("Matching")
}
library(Matching)

nn.est1 <- Matching::Match(
  Y        = select.dat$y,
  Tr       = select.dat$w,
  X        = select.dat$x,
  M        = 1,
  Weight   = 1,
  estimand = "ATE"
)

summary(nn.est1)
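To make the mechanics of nearest-neighbor matching concrete, here is a minimal NumPy sketch of one-to-one matching with replacement on a single covariate. It is not a reimplementation of `Matching::Match`: it ignores ties, computes no standard errors, and the helper name `nn_match_ate` is hypothetical. The example data are also an assumption: treatment is drawn at random with probability `0.2 + 0.6 * x` so that treated and control units overlap in $x$. With a scalar covariate, rescaling by the variance does not change which unit is nearest, so the sketch uses the raw absolute distance.

```python
import numpy as np

def nn_match_ate(y, w, x):
    """1-NN matching with replacement on a scalar covariate (illustrative sketch)."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w)
    x = np.asarray(x, dtype=float)
    treated = np.flatnonzero(w == 1)
    control = np.flatnonzero(w == 0)
    # Pairwise distances: rows are treated units, columns are control units
    d = np.abs(x[treated][:, None] - x[control][None, :])
    match_for_treated = control[d.argmin(axis=1)]  # nearest control per treated unit
    match_for_control = treated[d.argmin(axis=0)]  # nearest treated per control unit
    # Impute the missing potential outcome from the matched unit
    y1, y0 = y.copy(), y.copy()
    y0[treated] = y[match_for_treated]
    y1[control] = y[match_for_control]
    return float(np.mean(y1 - y0))

# Assumed example data with overlap (not the notebook's deterministic assignment)
rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(0, 1, n)
w = rng.binomial(1, 0.2 + 0.6 * x)
y = -2.5 + 4 * w + 1.5 * x + rng.normal(0, 1, n)

ate_hat = nn_match_ate(y, w, x)
print(ate_hat)
```

The full distance matrix is fine at this sample size; for much larger samples a sorted search or a tree-based neighbor lookup would be the more economical choice.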

## Simulation: nearest neighbor matching with Mahalanobis distance (R)

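Now we repeat nearest-neighbor matching using Mahalanobis distance (`Weight = 2`). With a single matching covariate, Mahalanobis distance and inverse-variance weighting both scale the squared difference in $x$ by $1/\mathrm{Var}(x)$, so the estimates should coincide with `nn.est1`; the two schemes differ only when matching on several correlated covariates.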

In [None]:
%%R
nn.est2 <- Matching::Match(
  Y        = select.dat$y,
  Tr       = select.dat$w,
  X        = select.dat$x,
  M        = 1,
  Weight   = 2,
  estimand = "ATE"
)

summary(nn.est2)

## Simulation: inverse probability weighting (IPW)

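We next construct an IPW estimator: we model the treatment probability $p(x) = \Pr(w = 1 \mid x)$ with a logit, then weight treated units by $1/p(x)$ and controls by $1/(1-p(x))$. The ATE estimate is the difference in weighted mean outcomes between the two groups. Because $w$ is a deterministic function of $x$ in this simulation, the logit is perfectly separated, and R is likely to warn that fitted probabilities are numerically 0 or 1.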

In [None]:
%%R
# Propensity score model in R
logit.model <- glm(w ~ x, family = binomial, data = select.dat)
ps <- fitted(logit.model)

select.dat <- select.dat %>%
  dplyr::mutate(ipw = dplyr::case_when(
    w == 1 ~ 1 / ps,
    w == 0 ~ 1 / (1 - ps),
    TRUE   ~ NA_real_
  ))

mean.w1 <- select.dat %>%
  dplyr::filter(w == 1) %>%
  dplyr::summarize(mean_y = stats::weighted.mean(y, ipw))

mean.w0 <- select.dat %>%
  dplyr::filter(w == 0) %>%
  dplyr::summarize(mean_y = stats::weighted.mean(y, ipw))

ipw_ate <- mean.w1$mean_y - mean.w0$mean_y
ipw_ate

In [None]:
# Pull updated `select.dat` (with ipw) back into Python
select_dat = pandas2ri.rpy2py(r['select.dat'])
select_dat.head()

In [None]:
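# Manual IPW in Python on the same data
select_dat = select_dat.copy()
# R logicals arrive as booleans; cast to 0/1 so patsy treats `w` as a numeric outcome
select_dat["w"] = select_dat["w"].astype(int)

# The logit is perfectly separated here (w is a function of x); depending on the
# statsmodels version, .fit() may warn or raise PerfectSeparationError
logit_res = smf.logit("w ~ x", data=select_dat).fit(disp=False)
select_dat["ps_py"] = logit_res.predict(select_dat)

# Clip scores away from 0 and 1 so the weights stay finite
eps = 1e-6
select_dat["ps_py"] = select_dat["ps_py"].clip(eps, 1 - eps)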

select_dat["ipw_py"] = np.where(
    select_dat["w"] == 1,
    1.0 / select_dat["ps_py"],
    1.0 / (1.0 - select_dat["ps_py"])
)

treated  = select_dat[select_dat["w"] == 1]
controls = select_dat[select_dat["w"] == 0]

mean_w1_py = np.average(treated["y"],  weights=treated["ipw_py"])
mean_w0_py = np.average(controls["y"], weights=controls["ipw_py"])
ate_ipw_py = mean_w1_py - mean_w0_py

ate_ipw_py
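The clipping step above matters because IPW relies on overlap: every value of $x$ must occur among both treated and control units. A rough diagnostic, sketched below under the assumption that a single covariate is all we condition on, is to compare the range of $x$ in the two groups. The helper `overlap_range` is hypothetical, and the data are re-simulated in NumPy rather than taken from `select_dat`.

```python
import numpy as np

def overlap_range(x, w):
    """Interval of x observed in both arms, or None if the arms do not overlap."""
    lo = max(x[w == 1].min(), x[w == 0].min())
    hi = min(x[w == 1].max(), x[w == 0].max())
    return (lo, hi) if lo <= hi else None

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 5000)

# Deterministic assignment, as in the baseline design: no common support
w_det = (x > 0.65).astype(int)
print(overlap_range(x, w_det))   # None

# Randomized assignment (assumed, for contrast): common support is nearly all of [0, 1]
w_rand = rng.binomial(1, 0.5, 5000)
print(overlap_range(x, w_rand))
```

Under the deterministic rule the estimated propensity scores pile up at 0 and 1, so the IPW weights are close to 1 for everyone and the estimator behaves much like a raw difference in means.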

## Simulation: regression adjustment

We now estimate separate outcome regressions by treatment status and construct an ATE by comparing predicted outcomes under treatment and control for every unit.

In [None]:
%%R
reg1.dat <- dplyr::filter(select.dat, w == 1)
reg1 <- lm(y ~ x, data = reg1.dat)

reg0.dat <- dplyr::filter(select.dat, w == 0)
reg0 <- lm(y ~ x, data = reg0.dat)

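# Predict both potential outcomes for every unit, treated or not
pred1 <- predict(reg1, newdata = select.dat)
pred0 <- predict(reg0, newdata = select.dat)
# Unit-level predicted effects; the mean reported by summary() is the ATE estimate
ATE_reg <- pred1 - pred0
summary(ATE_reg)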

In [None]:
# Regression adjustment in Python on the same data
reg1_data = select_dat[select_dat["w"] == 1]
reg0_data = select_dat[select_dat["w"] == 0]

reg1_py = smf.ols("y ~ x", data=reg1_data).fit()
reg0_py = smf.ols("y ~ x", data=reg0_data).fit()

pred1_py = reg1_py.predict(select_dat)
pred0_py = reg0_py.predict(select_dat)
ATE_indiv_py = pred1_py - pred0_py
ATE_hat_py = ATE_indiv_py.mean()
ATE_hat_py

## Simulation: regression with IPW

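Next, we run a weighted least squares regression of $y$ on $w$, using the IPW weights as regression weights. With a binary regressor and no other covariates, the coefficient on $w$ equals the difference in weighted means computed above.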

In [None]:
%%R
ipw.reg <- lm(y ~ w, data = select.dat, weights = ipw)
summary(ipw.reg)

In [None]:
# IPW regression in Python
ipw_mod = smf.wls("y ~ w", data=select_dat, weights=select_dat["ipw"]).fit()
ipw_mod.summary().tables[1]
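The link between the weighted regression and the weighted difference in means can be checked numerically. The sketch below uses made-up data and arbitrary positive weights standing in for the IPW weights; it solves the WLS problem by rescaling rows by the square root of the weights, which is one standard way to do it and not necessarily how `lm` or statsmodels proceed internally.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
w = rng.binomial(1, 0.4, n)
y = 1.0 + 2.0 * w + rng.normal(size=n)
wts = rng.uniform(0.5, 3.0, n)  # arbitrary positive weights (stand-in for IPW)

# WLS of y on [1, w]: ordinary least squares on sqrt(weight)-scaled rows
X = np.column_stack([np.ones(n), w])
sw = np.sqrt(wts)
beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)

# Difference in weighted means between treated and control units
diff_means = (np.average(y[w == 1], weights=wts[w == 1])
              - np.average(y[w == 0], weights=wts[w == 0]))

print(beta[1], diff_means)
```

The two printed numbers agree up to floating-point error. What the regression adds is a standard error, though the default one treats the weights as fixed rather than estimated.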

## Violation of selection on observables

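We now switch to the alternative treatment/outcome pair $(w_{alt}, y_{alt})$, where selection on observables fails: treatment depends on $z$ as well as $x$, and $z$ enters the outcome directly, yet we condition only on $x$. The true treatment effect is still 4, so the distance of the matching and regression estimates below from 4 measures the resulting bias.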

In [None]:
%%R
nn.est3 <- Matching::Match(
  Y        = select.dat$y_alt,
  Tr       = select.dat$w_alt,
  X        = select.dat$x,
  M        = 1,
  Weight   = 2,
  estimand = "ATE"
)
summary(nn.est3)

In [None]:
%%R
reg1.dat.alt <- dplyr::filter(select.dat, w_alt == 1)
reg1.alt <- lm(y_alt ~ x, data = reg1.dat.alt)

reg0.dat.alt <- dplyr::filter(select.dat, w_alt == 0)
reg0.alt <- lm(y_alt ~ x, data = reg0.dat.alt)

pred1_alt <- predict(reg1.alt, newdata = select.dat)
pred0_alt <- predict(reg0.alt, newdata = select.dat)
ATE_alt <- pred1_alt - pred0_alt
summary(ATE_alt)
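To see where the bias comes from, the sketch below reruns regression adjustment on a NumPy re-simulation of the alternative design, once conditioning on $x$ only and once on both $x$ and $z$. The second version assumes $z$ were observed, which is exactly what the analysis above lacks. The helper `reg_adjust_ate` is hypothetical, and the draws differ from the R data.

```python
import numpy as np

def reg_adjust_ate(y, w, covs):
    """Regression adjustment: separate OLS fits by arm, then average predicted differences."""
    X = np.column_stack([np.ones(len(y)), covs])
    b1, *_ = np.linalg.lstsq(X[w == 1], y[w == 1], rcond=None)
    b0, *_ = np.linalg.lstsq(X[w == 0], y[w == 0], rcond=None)
    return float(np.mean(X @ b1 - X @ b0))

# Re-simulate the alternative design (same formulas as the R code, different draws)
rng = np.random.default_rng(123)
n = 5000
x = rng.uniform(0, 1, n)
z = rng.normal(0, 1, n)
w_alt = (x + z > 0.35).astype(int)
y_alt = -2.5 + 4 * w_alt + 1.5 * x + 2.25 * z + rng.normal(0, 1, n)

ate_x_only = reg_adjust_ate(y_alt, w_alt, x)                          # omits z
ate_x_and_z = reg_adjust_ate(y_alt, w_alt, np.column_stack([x, z]))  # includes z

print("x only:  ", ate_x_only)
print("x and z: ", ate_x_and_z)
```

Conditioning on $x$ alone overstates the effect, because treated units have systematically higher $z$ and $z$ raises $y_{alt}$. Adding $z$ recovers an estimate near 4 here only because the outcome model is linear and correctly specified; treatment is still a deterministic function of $(x, z)$, so the estimate rests on extrapolation.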