# IV with Simulated Data

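This notebook is purely empirical: it works through instrumental variables (IV) estimation on simulated data, with executable **R** (via `%%R`) and **Python** cells. The `%%R` cell magic is provided by `rpy2`, so run `%load_ext rpy2.ipython` once before the first R cell.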


## 1. Setup (R)

Load required R packages.


In [None]:
%%R
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, fixest)


## 2. Setup (Python)

Load required Python packages.


In [None]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Optional (recommended) for 2SLS in Python:
# !pip install linearmodels


## Simulated data

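Generate a simple data-generating process (DGP) with a binary endogenous treatment `d` and an instrument `z`. The treatment is endogenous because the unobserved term `eps` enters both the equation for `d` and the equation for `y`; the instrument `z` shifts `d` but is drawn independently of `eps`. The true treatment effect is 5.25.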


In [None]:
%%R
set.seed(123)

n <- 5000
b.true <- 5.25

iv.dat <- tibble(
  z = rnorm(n,0,2),
  eps = rnorm(n,0,1),
  d = as.integer((z + 1.5*eps + rnorm(n,0,1)) > 0.25),
  y = 2.5 + b.true*d + eps + rnorm(n,0,0.5)
)

dplyr::glimpse(iv.dat)


In [None]:
# Python: simulate data
n = 5000
b_true = 5.25
rng = np.random.default_rng(123)

iv_dat = pd.DataFrame({
    "z":   rng.normal(0, 2, n),
    "eps": rng.normal(0, 1, n),
})

iv_dat["d"] = (iv_dat["z"] + 1.5*iv_dat["eps"] + rng.normal(0,1,n) > 0.25).astype(int)
iv_dat["y"] = 2.5 + b_true*iv_dat["d"] + iv_dat["eps"] + rng.normal(0,0.5,n)

iv_dat.head()
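As a quick check on the DGP, the sketch below re-simulates the same variables with NumPy only and looks at three sample correlations: `d` with `eps` (the source of endogeneity), `z` with `eps` (should be near zero), and `z` with `d` (relevance). The variable names `corr_d_eps`, `corr_z_eps`, and `corr_z_d` are illustrative and not part of the original notebook.

```python
import numpy as np

# Re-simulate the DGP described above (the seed is an arbitrary choice)
rng = np.random.default_rng(123)
n = 5000
z = rng.normal(0, 2, n)      # instrument
eps = rng.normal(0, 1, n)    # unobserved confounder
d = (z + 1.5 * eps + rng.normal(0, 1, n) > 0.25).astype(int)  # binary treatment

# Sample correlations that summarize the identifying structure
corr_d_eps = np.corrcoef(d, eps)[0, 1]  # endogeneity: treatment vs. confounder
corr_z_eps = np.corrcoef(z, eps)[0, 1]  # exogeneity: instrument vs. confounder
corr_z_d = np.corrcoef(z, d)[0, 1]      # relevance: instrument vs. treatment

print(f"corr(d, eps) = {corr_d_eps:.3f}")
print(f"corr(z, eps) = {corr_z_eps:.3f}")
print(f"corr(z, d)   = {corr_z_d:.3f}")
```

In real data `eps` is unobserved, so the first two correlations cannot be computed; the check is only possible here because the data are simulated.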


## OLS and IV Estimates

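Estimate the treatment effect using OLS and 2SLS. Because `eps` raises both the probability of treatment and the outcome, OLS should overstate the true effect of 5.25; 2SLS with `z` as the instrument should land close to it.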


In [None]:
%%R
# OLS
ols_r <- lm(y ~ d, data = iv.dat)
summary(ols_r)

# IV (2SLS)
iv_r <- feols(y ~ 1 | d ~ z, data = iv.dat)
summary(iv_r)


In [None]:
# Python: OLS
ols_py = smf.ols("y ~ d", data=iv_dat).fit()
print(ols_py.summary())


In [None]:
# Python: IV / 2SLS
try:
    from linearmodels.iv import IV2SLS
    iv_py = IV2SLS.from_formula("y ~ 1 + [d ~ z]", data=iv_dat).fit(cov_type="robust")
    print(iv_py.summary)
except ImportError:
    import statsmodels.api as sm
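    # Manual 2SLS fallback: the point estimate matches 2SLS, but the second-stage
    # OLS standard errors are not valid (residuals are computed from d_hat, not d)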
    fs = sm.OLS(iv_dat["d"], sm.add_constant(iv_dat["z"])).fit()
    iv_dat["d_hat"] = fs.fittedvalues
    ss = sm.OLS(iv_dat["y"], sm.add_constant(iv_dat["d_hat"])).fit()
    print("First stage (manual):")
    print(fs.summary())
    print("\nSecond stage (manual):")
    print(ss.summary())
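With one instrument and one endogenous regressor, both estimators reduce to ratios of sample moments: OLS is cov(d, y) / var(d) and IV is cov(z, y) / cov(z, d). The sketch below computes both directly with NumPy on a fresh simulation of the same DGP; the helper `sample_cov` and the names `b_ols` and `b_iv` are assumptions made for illustration.

```python
import numpy as np

# Re-simulate the DGP (the seed is an arbitrary choice)
rng = np.random.default_rng(123)
n, b_true = 5000, 5.25
z = rng.normal(0, 2, n)
eps = rng.normal(0, 1, n)
d = (z + 1.5 * eps + rng.normal(0, 1, n) > 0.25).astype(float)
y = 2.5 + b_true * d + eps + rng.normal(0, 0.5, n)

def sample_cov(a, b):
    """Sample covariance between two 1-D arrays (ddof=1)."""
    return np.cov(a, b)[0, 1]

# OLS slope: contaminated by cov(d, eps) != 0
b_ols = sample_cov(d, y) / np.var(d, ddof=1)

# IV slope with a single instrument: uses only variation in d driven by z
b_iv = sample_cov(z, y) / sample_cov(z, d)

print(f"true effect: {b_true}")
print(f"OLS slope:   {b_ols:.3f}")
print(f"IV slope:    {b_iv:.3f}")
```

The IV ratio should agree with the 2SLS point estimate from the packaged estimators up to numerical precision, since the model is just-identified.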


## Logical Diagnostic: "Reduced Form"

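Estimate the reduced form, the regression of the outcome `y` directly on the instrument `z`. Under the exclusion restriction `z` affects `y` only through `d`, so a reduced-form slope near zero would signal that the IV estimate has little to identify.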


In [None]:
%%R
rf_r <- lm(y ~ z, data = iv.dat)
summary(rf_r)


In [None]:
rf_py = smf.ols("y ~ z", data=iv_dat).fit()
print(rf_py.summary())
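One way to read the reduced form: in a just-identified model, dividing the reduced-form slope (of `y` on `z`) by the first-stage slope (of `d` on `z`) gives the IV estimate, sometimes called indirect least squares. The following is a minimal sketch on a fresh simulation of the same DGP; `rf_slope`, `fs_slope`, and `b_ils` are illustrative names.

```python
import numpy as np

# Re-simulate the DGP (the seed is an arbitrary choice)
rng = np.random.default_rng(123)
n, b_true = 5000, 5.25
z = rng.normal(0, 2, n)
eps = rng.normal(0, 1, n)
d = (z + 1.5 * eps + rng.normal(0, 1, n) > 0.25).astype(float)
y = 2.5 + b_true * d + eps + rng.normal(0, 0.5, n)

# Design matrix with an intercept and the instrument
Z = np.column_stack([np.ones(n), z])

# Reduced form: y on z; first stage: d on z (least-squares fits)
rf_coef = np.linalg.lstsq(Z, y, rcond=None)[0]
fs_coef = np.linalg.lstsq(Z, d, rcond=None)[0]
rf_slope, fs_slope = rf_coef[1], fs_coef[1]

# Indirect least squares: ratio of the two slopes
b_ils = rf_slope / fs_slope

print(f"reduced-form slope: {rf_slope:.3f}")
print(f"first-stage slope:  {fs_slope:.3f}")
print(f"ratio (IV):         {b_ils:.3f}")
```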


## Instrument Relevance: First Stage

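Estimate the first stage, the regression of the endogenous regressor `d` on the instrument `z`. Because `d` is binary, this is a linear probability model. The instrument is relevant when the coefficient on `z` is clearly nonzero; a common rule of thumb is a first-stage F-statistic above 10.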


In [None]:
%%R
fs_r <- lm(d ~ z, data = iv.dat)
summary(fs_r)


In [None]:
fs_py = smf.ols("d ~ z", data=iv_dat).fit()
print(fs_py.summary())
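To make the relevance check concrete, the sketch below computes the first-stage F-statistic by hand. With a single instrument it equals the squared t-statistic on `z`; the version here assumes homoskedastic errors, which is a simplification for a binary `d`. The names `se_slope` and `f_stat` are illustrative.

```python
import numpy as np

# Re-simulate the DGP (the seed is an arbitrary choice)
rng = np.random.default_rng(123)
n = 5000
z = rng.normal(0, 2, n)
eps = rng.normal(0, 1, n)
d = (z + 1.5 * eps + rng.normal(0, 1, n) > 0.25).astype(float)

# First stage: d = a + pi * z + error
Z = np.column_stack([np.ones(n), z])
fs_coef = np.linalg.lstsq(Z, d, rcond=None)[0]
resid = d - Z @ fs_coef

# Homoskedastic variance estimate and standard error of the slope
sigma2 = resid @ resid / (n - Z.shape[1])
se_slope = np.sqrt(sigma2 * np.linalg.inv(Z.T @ Z)[1, 1])

# With one instrument, the first-stage F equals the squared t-statistic
f_stat = (fs_coef[1] / se_slope) ** 2

print(f"first-stage slope: {fs_coef[1]:.4f} (SE {se_slope:.4f})")
print(f"first-stage F:     {f_stat:.1f}")
```

A heteroskedasticity-robust F would be the more careful choice for a linear probability model; the homoskedastic version is used here only to keep the arithmetic transparent.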


## Two-stage equivalence (manual 2SLS)

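Show the two-step procedure: predict `d` from `z`, then regress `y` on the predicted `d`. The coefficient on `d_hat` matches the 2SLS point estimate, but the standard errors reported by the second step are not valid, because its residuals are computed from `d_hat` rather than from the observed `d`.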


In [None]:
%%R
step1 <- lm(d ~ z, data=iv.dat)
iv.dat$d_hat <- predict(step1)
step2 <- lm(y ~ d_hat, data=iv.dat)

summary(step1)
summary(step2)


In [None]:
# Python: manual two-step
step1_py = smf.ols("d ~ z", data=iv_dat).fit()
iv_dat["d_hat"] = step1_py.fittedvalues
step2_py = smf.ols("y ~ d_hat", data=iv_dat).fit()

print(step1_py.summary())
print(step2_py.summary())
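The standard-error issue in the manual procedure can be illustrated directly. The sketch below runs the two steps with NumPy, then computes the slope's standard error twice: once from the second-step residuals `y - X_hat @ beta` (what a naive OLS summary reports) and once from the 2SLS residuals `y - X @ beta` built with the observed `d`. Both assume homoskedasticity, and the names `se_naive` and `se_2sls` are illustrative.

```python
import numpy as np

# Re-simulate the DGP (the seed is an arbitrary choice)
rng = np.random.default_rng(123)
n, b_true = 5000, 5.25
z = rng.normal(0, 2, n)
eps = rng.normal(0, 1, n)
d = (z + 1.5 * eps + rng.normal(0, 1, n) > 0.25).astype(float)
y = 2.5 + b_true * d + eps + rng.normal(0, 0.5, n)

X = np.column_stack([np.ones(n), d])  # regressors (intercept, treatment)
Z = np.column_stack([np.ones(n), z])  # instruments (intercept, z)

# Step 1: project the regressors on the instruments
X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]

# Step 2: regress y on the fitted regressors
beta = np.linalg.lstsq(X_hat, y, rcond=None)[0]
XtX_inv = np.linalg.inv(X_hat.T @ X_hat)

# Naive residuals use d_hat; 2SLS residuals use the observed d
resid_naive = y - X_hat @ beta
resid_2sls = y - X @ beta

dof = n - X.shape[1]
se_naive = np.sqrt(resid_naive @ resid_naive / dof * XtX_inv[1, 1])
se_2sls = np.sqrt(resid_2sls @ resid_2sls / dof * XtX_inv[1, 1])

print(f"2SLS slope:       {beta[1]:.3f}")
print(f"naive step-2 SE:  {se_naive:.4f}")
print(f"2SLS residual SE: {se_2sls:.4f}")
```

The two standard errors share the same `(X_hat'X_hat)^{-1}` term and differ only in the residual variance, which is why packaged estimators such as `feols` or `IV2SLS` are preferable for inference.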
