# IV with Simulated Data

This notebook implements basic instrumental variables (IV) estimation on simulated data, following the class slides.

## Simulated data

**R (reference)**
```r
library(tibble)  # provides tibble()
n <- 5000
b.true <- 5.25
iv.dat <- tibble(
  z = rnorm(n,0,2),
  eps = rnorm(n,0,1),
  d = (z + 1.5*eps + rnorm(n,0,1) >0.25),
  y = 2.5 + b.true*d + eps + rnorm(n,0,0.5)
)
```

- `eps` is the source of endogeneity: it affects both the treatment (`d`) and the outcome (`y`).
- `z` is an instrument: it affects the treatment but has no direct effect on the outcome.

In [None]:
# Python: simulate data
import numpy as np
import pandas as pd

n = 5000
b_true = 5.25
rng = np.random.default_rng(123)  # reproducibility

iv_dat = pd.DataFrame({
    "z":   rng.normal(0, 2, n),
    "eps": rng.normal(0, 1, n),
})

iv_dat["d"] = (
    iv_dat["z"] + 1.5 * iv_dat["eps"] + rng.normal(0, 1, n) > 0.25
).astype(int)

iv_dat["y"] = 2.5 + b_true * iv_dat["d"] + iv_dat["eps"] + rng.normal(0, 0.5, n)

iv_dat.head()
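
As a quick check on the two bullets above, the sketch below re-simulates the data with plain NumPy and looks at three sample correlations: `d` with `eps` (endogeneity), `z` with `eps` (instrument exogeneity), and `z` with `d` (relevance). It assumes the same seed and draw order as the simulation cell; the names `corr_d_eps`, `corr_z_eps`, and `corr_z_d` are introduced here for illustration only. Note that the check on `eps` is only possible because the data are simulated.

```python
import numpy as np

# Assumption: same design, seed, and draw order as the simulation cell.
n = 5000
b_true = 5.25
rng = np.random.default_rng(123)

z = rng.normal(0, 2, n)
eps = rng.normal(0, 1, n)
d = (z + 1.5 * eps + rng.normal(0, 1, n) > 0.25).astype(int)
y = 2.5 + b_true * d + eps + rng.normal(0, 0.5, n)

# Sample correlations (illustrative names, not part of the original notebook)
corr_d_eps = np.corrcoef(d, eps)[0, 1]  # treatment vs. unobserved confounder
corr_z_eps = np.corrcoef(z, eps)[0, 1]  # instrument vs. unobserved confounder
corr_z_d = np.corrcoef(z, d)[0, 1]      # instrument vs. treatment

print(f"corr(d, eps) = {corr_d_eps:.3f}  (clearly positive: d is endogenous)")
print(f"corr(z, eps) = {corr_z_eps:.3f}  (near zero: z is exogenous by construction)")
print(f"corr(z, d)   = {corr_z_d:.3f}  (clearly positive: z is relevant)")
```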


## OLS and IV Estimates

Recall that the **true** treatment effect is `b_true = 5.25`.

**R (reference)**
```r
library(fixest)  # provides feols()
lm(y~d, data=iv.dat)
feols(y ~ 1 | d ~ z, data=iv.dat)
```

In [None]:
import statsmodels.formula.api as smf

# OLS: y ~ d
ols = smf.ols("y ~ d", data=iv_dat).fit()
print(ols.summary())
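
To see why OLS misses `b_true` here, note that the OLS slope equals the true effect plus `cov(d, error) / var(d)`, where the error is everything in `y` other than the intercept and the treatment term. The sketch below verifies that decomposition numerically. It is a minimal illustration that re-simulates the data with NumPy (assuming the same seed and draw order as above); `err`, `ols_slope`, and `bias` are names introduced here. The error term can be computed only because we know the data-generating process.

```python
import numpy as np

# Assumption: same design, seed, and draw order as the simulation cell.
n = 5000
b_true = 5.25
rng = np.random.default_rng(123)

z = rng.normal(0, 2, n)
eps = rng.normal(0, 1, n)
d = (z + 1.5 * eps + rng.normal(0, 1, n) > 0.25).astype(int)
y = 2.5 + b_true * d + eps + rng.normal(0, 0.5, n)

# Structural error: observable only because the data are simulated
err = y - 2.5 - b_true * d

# OLS slope of y on d (with intercept) is cov(d, y) / var(d)
ols_slope = np.cov(d, y)[0, 1] / np.var(d, ddof=1)

# Bias term from the correlation between treatment and error
bias = np.cov(d, err)[0, 1] / np.var(d, ddof=1)

print(f"OLS slope:      {ols_slope:.3f}")
print(f"b_true + bias:  {b_true + bias:.3f}")
print(f"bias:           {bias:.3f}")
```

Because `eps` pushes both `d` and `y` upward, the bias term is positive and OLS overstates the treatment effect in this design.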


### IV / 2SLS

We use `linearmodels` if available; otherwise we fall back to a manual two-stage implementation, which reproduces the IV point estimate but not valid 2SLS standard errors.

**R (reference)**
```r
feols(y ~ 1 | d ~ z, data=iv.dat)
```

In [None]:
# IV/2SLS: y ~ 1 + [d ~ z]
try:
    from linearmodels.iv import IV2SLS

    iv = IV2SLS.from_formula("y ~ 1 + [d ~ z]", data=iv_dat).fit(cov_type="robust")
    print(iv.summary)
except ImportError:
    print("Package 'linearmodels' not found; using a manual 2SLS fallback via statsmodels.")
    import statsmodels.api as sm

    # First stage: d ~ z
    first = sm.OLS(iv_dat["d"], sm.add_constant(iv_dat["z"])).fit()
    iv_dat["d_hat"] = first.fittedvalues

    # Second stage: y ~ d_hat
    second = sm.OLS(iv_dat["y"], sm.add_constant(iv_dat["d_hat"])).fit()

    print("\nFirst stage (d ~ z):")
    print(first.summary())
    print("\nSecond stage (y ~ d_hat); these OLS standard errors are not valid for 2SLS:")
    print(second.summary())


## Logical Diagnostic: The "Reduced Form"

- "Reduced form" means the conditional relationship between the outcome and the instrument (in practice, replacing the endogenous variable with the instrument in a regression).
- While not a formal statistical test of instrument validity, researchers often inspect the reduced form as a coherence check.
  - A **zero** (or wrong-signed) reduced form casts doubt on either **relevance** or **exclusion**, but does not tell you which.
  - A **nonzero** and properly signed reduced form is necessary for a meaningful IV estimand, but does not validate exclusion/exogeneity.

**R (reference)**
```r
lm(y~z, data=iv.dat)
```

In [None]:
rf = smf.ols("y ~ z", data=iv_dat).fit()
print(rf.summary())
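
With one instrument and one endogenous regressor, the reduced form connects directly to the IV estimate: the IV coefficient equals the reduced-form slope (`y` on `z`) divided by the first-stage slope (`d` on `z`). The sketch below checks this ratio with NumPy only. It re-simulates the data (assuming the same seed and draw order as above), and `rf_slope`, `fs_slope`, and `iv_ratio` are illustrative names rather than part of the original notebook.

```python
import numpy as np

# Assumption: same design, seed, and draw order as the simulation cell.
n = 5000
b_true = 5.25
rng = np.random.default_rng(123)

z = rng.normal(0, 2, n)
eps = rng.normal(0, 1, n)
d = (z + 1.5 * eps + rng.normal(0, 1, n) > 0.25).astype(int)
y = 2.5 + b_true * d + eps + rng.normal(0, 0.5, n)

var_z = np.var(z, ddof=1)

# Reduced form: slope of y on z
rf_slope = np.cov(z, y)[0, 1] / var_z

# First stage: slope of d on z
fs_slope = np.cov(z, d)[0, 1] / var_z

# Ratio of the two slopes, equivalently cov(z, y) / cov(z, d)
iv_ratio = rf_slope / fs_slope

print(f"Reduced-form slope: {rf_slope:.3f}")
print(f"First-stage slope:  {fs_slope:.3f}")
print(f"Ratio (IV):         {iv_ratio:.3f}  vs. b_true = {b_true}")
```

This also makes the coherence check concrete: if the reduced-form slope were zero, the ratio would be zero regardless of the first stage.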


## Instrument Relevance: First Stage

**R (reference)**
```r
lm(d~z, data=iv.dat)
```

In [None]:
fs = smf.ols("d ~ z", data=iv_dat).fit()
print(fs.summary())
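
A common way to summarize relevance is the first-stage F-statistic on the instrument; with a single instrument it equals the squared t-statistic on `z`. The sketch below computes it by hand with NumPy under homoskedastic standard errors, which is an assumption made here for simplicity. It re-simulates the data (same seed and draw order assumed), and `t_stat` and `f_stat` are names introduced for illustration. An F-statistic above 10 is a frequently cited rule of thumb, not a formal guarantee.

```python
import numpy as np

# Assumption: same design, seed, and draw order as the simulation cell.
n = 5000
rng = np.random.default_rng(123)

z = rng.normal(0, 2, n)
eps = rng.normal(0, 1, n)
d = (z + 1.5 * eps + rng.normal(0, 1, n) > 0.25).astype(int)

# First stage by least squares: d ~ 1 + z
Z = np.column_stack([np.ones(n), z])
coef, *_ = np.linalg.lstsq(Z, d, rcond=None)

# Homoskedastic covariance matrix of the coefficients
resid = d - Z @ coef
sigma2 = resid @ resid / (n - Z.shape[1])
cov = sigma2 * np.linalg.inv(Z.T @ Z)

# With one instrument, F equals the squared t-statistic on z
t_stat = coef[1] / np.sqrt(cov[1, 1])
f_stat = t_stat**2

print(f"First-stage slope on z: {coef[1]:.4f}")
print(f"t-statistic: {t_stat:.1f}, F-statistic: {f_stat:.1f}")
```

If `statsmodels` is installed, the `fs.fvalue` attribute of the fitted first stage above should report the same quantity.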


## Two-stage equivalence

This reproduces the literal two-step procedure:

**R (reference)**
```r
step1 <- lm(d ~ z, data=iv.dat)
d.hat <- predict(step1)
step2 <- lm(y ~ d.hat, data=iv.dat)
```

Note: the second-stage coefficient on `d_hat` is numerically identical to the 2SLS point estimate, but the **standard errors from the second-stage OLS are not valid**: they are built from residuals that use `d_hat` in place of the observed `d` (see Wooldridge, *Econometric Analysis of Cross Section and Panel Data*, 2e, Ch. 5).

In [None]:
# First stage
step1 = smf.ols("d ~ z", data=iv_dat).fit()
iv_dat["d_hat"] = step1.fittedvalues

# Second stage (using fitted treatment)
step2 = smf.ols("y ~ d_hat", data=iv_dat).fit()

print("First stage (d ~ z):")
print(step1.summary())
print("\nSecond stage (y ~ d_hat):")
print(step2.summary())
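
The note on standard errors can be made concrete. The sketch below runs the two steps by hand with NumPy and computes the standard error of the treatment coefficient twice: once from the second-stage residuals (what the second-stage OLS summary reports) and once from residuals that use the observed `d`, which is the 2SLS formula under homoskedasticity. This is a minimal illustration, not a replacement for a dedicated IV routine. It assumes the same seed and draw order as above, uses an `n - 2` degrees-of-freedom correction as a choice made here, and introduces `se_naive` and `se_2sls` as illustrative names.

```python
import numpy as np

# Assumption: same design, seed, and draw order as the simulation cell.
n = 5000
b_true = 5.25
rng = np.random.default_rng(123)

z = rng.normal(0, 2, n)
eps = rng.normal(0, 1, n)
d = (z + 1.5 * eps + rng.normal(0, 1, n) > 0.25).astype(int)
y = 2.5 + b_true * d + eps + rng.normal(0, 0.5, n)

Z = np.column_stack([np.ones(n), z])  # instruments (with constant)
X = np.column_stack([np.ones(n), d])  # observed regressors (with constant)

# Step 1: fitted treatment from the first stage
pi_hat = np.linalg.lstsq(Z, d, rcond=None)[0]
d_hat = Z @ pi_hat
X_hat = np.column_stack([np.ones(n), d_hat])

# Step 2: regress y on the fitted treatment
beta = np.linalg.lstsq(X_hat, y, rcond=None)[0]
XtX_inv = np.linalg.inv(X_hat.T @ X_hat)
dof = n - X.shape[1]

# Naive: residuals computed with d_hat (second-stage OLS output)
resid_naive = y - X_hat @ beta
se_naive = np.sqrt(resid_naive @ resid_naive / dof * XtX_inv[1, 1])

# 2SLS: same coefficients, residuals computed with the observed d
resid_2sls = y - X @ beta
se_2sls = np.sqrt(resid_2sls @ resid_2sls / dof * XtX_inv[1, 1])

print(f"Coefficient on d: {beta[1]:.3f}")
print(f"Naive second-stage SE: {se_naive:.4f}")
print(f"2SLS (homoskedastic) SE: {se_2sls:.4f}")
```

In this simulation the naive standard error is noticeably larger than the 2SLS one; the direction of the discrepancy is specific to the design and should not be read as a general result.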
