# ECON5280 Lecture 9 Instrumental Variable

<font size="5">Junlong Feng</font>

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/junlong-feng/econ5280/main?filepath=Lecture9_IV.ipynb)

## Outline

* Motivation: What shall we do when treatment is not exogenous?
* Noncompliance and LATE: Average treatment effect for a subpopulation.
* Identification of LATE: When is LATE identified (in a linear model)?
* Instrumental Variable Regression, 2SLS and GMM: How to estimate a linear model with instruments?
* Applications: How to find an instrument and how to implement the methods in R

## 1. Noncompliance and LATE (Binary Treatment and Binary IV)

Sometimes we can guarantee that the treatment assignment is fully random, but not everyone complies.

- Suppose the government assigns free HPV vaccines to citizens by lottery. If Alice wins the lottery, she can get vaccinated for free.
- This lottery is fully random and independent of everything else.
- Suppose I'm a health economist and want to study the effect of receiving HPV vaccine on newborns' birthweight $Y$.
- Can I use this lottery as $Z$ and estimate $Y=\beta_{0}+\beta_{1}Z+\varepsilon$ by OLS?
- No because:
  - i) Winning the lottery DOES NOT imply the individual must go and get the vaccine; people who don't want to get vaccinated won't go anyways.
  - ii) Not winning the lottery DOES NOT imply the individual cannot get the vaccine; people who want to get vaccinated can pay for it.
  - So $\beta_{1}$ is the ATE of winning the lottery on birthweight, not the ATE of getting HPV vaccine on birthweight. 
- This ATE is called intent-to-treat (ITT).

So can we get ATE? No because whether people get vaccinated ($D$) is determined by a lot of factors, which are in the error $\varepsilon$. However, we can get the average effect of a sub-population.

- Suppose Alice is on the margin: She's quite indifferent whether to get the vaccine or not.
- This means the marginal cost of getting vaccination (money, time, etc) is equal to her marginal benefit.
- Now suppose Alice wins the lottery, marginal cost becomes lower, and she decides to get vaccinated.
- This decision $D=1$ is made solely based on winning the lottery, independent of everything else.
- So for Alice, $D$ is as good as random.
- Now suppose Bob doesn't want to get vaccinated anyway. He has already made up his mind based on his $\varepsilon$. Whether or not his $Z=1$, he won't do it. So his decision $D=0$ is correlated with $\varepsilon$, not randomly assigned.

The above example shows that it's useful to divide the population into subgroups more carefully. Define $D(z)$ as the potential treatment under $Z=z$, so that the observed treatment is $D=D(z)$ whenever $Z=z$.

Based on the pair $(D(1),D(0))$, we can divide the population into four groups:

| Always Taker  |  Never Taker  |  Complier   |   Defier    |
| :-----------: | :-----------: | :---------: | :---------: |
| $D(1)=D(0)=1$ | $D(1)=D(0)=0$ | $D(1)>D(0)$ | $D(1)<D(0)$ |

- We never know which group a given individual $i$ belongs to because we can never observe both of her potential treatments. This is the same logic as the *fundamental problem of causal inference*.
- Compliers and defiers are relative; you can switch their definitions according to applications. In the HPV example, it's more natural to define complier in the above way.
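
To make the grouping concrete, here is a minimal simulation sketch in base R. The population shares (20% always takers, 30% never takers, 50% compliers, no defiers) and all variable names are assumptions chosen for illustration. It builds both potential treatments, which is only possible in a simulation, and then tabulates the types inside each observed $(Z,D)$ cell.

```r
set.seed(1)
n <- 10000
# Hypothetical population shares: 20% always takers, 30% never takers, 50% compliers
type <- sample(c("always", "never", "complier"), n, replace = TRUE,
               prob = c(0.2, 0.3, 0.5))
D0 <- as.integer(type == "always")   # potential treatment D(0)
D1 <- as.integer(type != "never")    # potential treatment D(1)
Z  <- rbinom(n, 1, 0.5)              # randomized lottery
D  <- Z * D1 + (1 - Z) * D0          # observed treatment: only one of D(0), D(1) is seen

# Cross-tabulate the (unobservable) type against the observed (Z, D) cell
type_by_cell <- table(type, cell = paste0("Z=", Z, ",D=", D))
print(type_by_cell)
```

In this design the cells $(Z=0,D=0)$ and $(Z=1,D=1)$ each contain two types, so a complier cannot be singled out from the observed data alone.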

From the vaccination example, we learned that the randomly assigned $Z$, which is called an **instrumental variable, or instrument, or IV**, can represent $D$ for compliers. So we may conjecture that the ATE for compliers might be identified. Let's first define the ATE for this subgroup:

**Definition (LATE)**. The local average treatment effect (LATE) is defined as 
$$
LATE\equiv \mathbb{E}\left(Y_{i}(1)-Y_{i}(0)|D_{i}(1)>D_{i}(0)\right).
$$
where the conditioning part means that $i\in Compliers$.

* Without the conditioning part, the RHS is ATE.
* If everyone in the population is a complier, then $D_{i}(1)>D_{i}(0)$ holds with probability 1 so $LATE=ATE$.
* LATE cannot be **directly** estimated based on the above definition because we don't know $\mathbb{E}(\cdot|\cdot)$ and we don't even know who the compliers are.
* However, it is possible to identify LATE, i.e., express it using some directly identifiable quantities. Then we can estimate LATE from there.

## 2. Identification of LATE (Binary Treatment and Binary IV)

Again, let $Y=g(D,U)$. Also assume $D=h(Z,V)$ where $V$ is a vector of unknowns. We make the following assumptions:

- $Z\perp (U,V)$. Or equivalently, $Z\perp (Y(1),Y(0),D(1),D(0))$. 
  - This assumption says the instrument is completely randomized. 
  - We can extend it to conditionally randomized as in Lecture 5. Will do later.
- $\mathbb{E}(D|Z=1)\neq \mathbb{E}(D|Z=0)$. 
  - This assumption says not all individuals are always takers or never takers. Otherwise, $D$ would not respond to $Z$ at all, so $\mathbb{E}(D|Z=1)=\mathbb{E}(D|Z=0)$.
- $D(1)\geq D(0)$ with probability 1.
  - This means there are no defiers in the population.
  - Also called **monotonicity**.

**Theorem (LATE, Imbens and Angrist, 1994)**. Under the above assumptions, LATE is identified as
$$
LATE=\frac{\mathbb{E}(Y|Z=1)-\mathbb{E}(Y|Z=0)}{\mathbb{E}(D|Z=1)-\mathbb{E}(D|Z=0)}.
$$

- This result is called **identification** because the unknown parameter of interest LATE is uniquely linked to some population quantity that is directly estimable.

**Proof** (optional). Recall that $D=D(0)$ if $Z=0$ and $D=D(1)$ if $Z=1$, where $D(0),D(1)$ are the potential treatments. Equivalently, $D=ZD(1)+(1-Z)D(0)$. Meanwhile, $Y=DY(1)+(1-D)Y(0)$. Therefore,
$$
\begin{align*}
\mathbb{E}(Y|Z=1)-\mathbb{E}(Y|Z=0)=&\mathbb{E}\left[D(1)Y(1)+(1-D(1))\cdot Y(0)|Z=1\right]\\
&-\mathbb{E}\left[D(0)Y(1)+(1-D(0))\cdot Y(0)|Z=0\right]\\
\text{By independence:}=&\mathbb{E}\left[D(1)Y(1)+(1-D(1))\cdot Y(0)\right]\\
&-\mathbb{E}\left[D(0)Y(1)+(1-D(0))\cdot Y(0)\right]\\
=&\mathbb{E}\left[\left(D(1)-D(0)\right)\times \left(Y(1)- Y(0)\right)\right]\\
\text{By monotonicity: }=&\mathbb{E}\left[Y(1)- Y(0)|D(1)>D(0)\right]\times \Pr(D(1)-D(0)=1)\\
\text{By independence: }=&\mathbb{E}\left[Y(1)- Y(0)|D(1)>D(0)\right]\times \left[\mathbb{E}(D|Z=1)-\mathbb{E}(D|Z=0)\right].
\end{align*}
$$
Dividing both sides by $\mathbb{E}(D|Z=1)-\mathbb{E}(D|Z=0)$, which is nonzero by assumption, completes the proof. Q.E.D.
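
The theorem can be checked numerically. The sketch below is a base R simulation under assumed values (type shares, treatment effects of 2 for compliers and 0.5 for everyone else, and a baseline shift for always takers); none of these numbers come from the lecture. It compares the ratio in the theorem with the infeasible complier average computed from both potential outcomes.

```r
set.seed(42)
n <- 200000
type <- sample(c("always", "never", "complier"), n, replace = TRUE,
               prob = c(0.2, 0.3, 0.5))
D0 <- as.integer(type == "always")
D1 <- as.integer(type != "never")            # D(1) >= D(0): no defiers
# Heterogeneous effects (assumed values): compliers gain 2, everyone else gains 0.5
effect <- ifelse(type == "complier", 2, 0.5)
Y0 <- rnorm(n) + 1 * (type == "always")      # always takers differ in baseline outcome
Y1 <- Y0 + effect
Z  <- rbinom(n, 1, 0.5)                      # independent of (Y0, Y1, D0, D1)
D  <- Z * D1 + (1 - Z) * D0
Y  <- D * Y1 + (1 - D) * Y0

itt   <- mean(Y[Z == 1]) - mean(Y[Z == 0])   # numerator: intent-to-treat effect
first <- mean(D[Z == 1]) - mean(D[Z == 0])   # denominator: estimated share of compliers
wald  <- itt / first                         # sample analog of the LATE formula
late  <- mean((Y1 - Y0)[D1 > D0])            # infeasible: uses both potential outcomes
ate   <- mean(Y1 - Y0)
naive <- mean(Y[D == 1]) - mean(Y[D == 0])   # difference in means by treatment taken
round(c(wald = wald, late = late, ate = ate, naive = naive, itt = itt), 3)
```

With these assumed effects, LATE and ATE differ by construction, so the output shows which of the two the ratio is tracking.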

### 2.1 Linear Model

The identification result for LATE looks formidable, with a lot of expectations. As with ATE, it has a linear representation.

- $\gamma\equiv \mathbb{E}(D|Z=1)-\mathbb{E}(D|Z=0)$. By identity
  $$
  D=\mathbb{E}(D|Z=0)+[\mathbb{E}(D|Z=1)-\mathbb{E}(D|Z=0)] \times Z+(D-\mathbb{E}(D|Z)).
  $$
  We have
  $$
  D=\gamma_{0}+\gamma Z+\nu,
  $$
  where $cov(Z,\nu)=0$ is ALWAYS true without any assumption. (Can you prove it?)

  - By assumption $\gamma\neq 0$.

- $\delta\equiv \mathbb{E}(Y|Z=1)-\mathbb{E}(Y|Z=0)$. Similarly, we can write
  $$
  Y=\delta_{0}+\delta Z+\mu,
  $$
  where $\delta_{0}=\mathbb{E}(Y|Z=0)$ and $\mu=Y-\mathbb{E}(Y|Z)$ so $cov(Z,\mu)=0$ without any assumptions as well.

- Combining them, let $\beta=LATE=\delta/\gamma$ (this last equality is **by the LATE theorem**), we have
  $$
  \begin{align*}
  Y=&\delta_{0}+\delta\frac{D-\gamma_{0}-\nu}{\gamma}+\mu\\
  =&\left(\delta_{0}-\delta\frac{\gamma_{0}}{\gamma}\right)+\frac{\delta}{\gamma}D+\left(\mu-\frac{\delta}{\gamma}\nu\right)\\
  \eqqcolon&\beta_{0}+\beta D+\varepsilon.
  \end{align*}
  $$

- **Important**. $cov(D,\varepsilon)\neq 0$ because $D$ is related to $\nu$, but $cov(\varepsilon,Z)=0$.

**Theorem**. When $D$ and $Z$ are binary and under the assumptions for the LATE theorem, there exists a unique linear model $Y=\beta_{0}+\beta D+\varepsilon$ such that i) $\beta=LATE$ and ii) $cov(Z,\varepsilon)=0$.
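
The construction above can be replicated in a sample. The following base R sketch uses a hypothetical data generating process (a threshold rule for $D$ and a constant effect of 2; both are assumptions for illustration). It estimates $\gamma$ and $\delta$ by OLS, forms $\beta=\delta/\gamma$, and checks the covariances of the implied error.

```r
set.seed(7)
n <- 5000
Z <- rbinom(n, 1, 0.5)
V <- rnorm(n)
D <- as.integer(V + Z > 0.5)          # hypothetical first stage: Z raises take-up
Y <- 1 + 2 * D + V + rnorm(n)         # V enters both D and Y, so D is endogenous

first_stage  <- lm(D ~ Z)             # D = gamma_0 + gamma Z + nu
reduced_form <- lm(Y ~ Z)             # Y = delta_0 + delta Z + mu
gamma <- unname(coef(first_stage)["Z"])
delta <- unname(coef(reduced_form)["Z"])
beta  <- delta / gamma
beta0 <- unname(coef(reduced_form)[1] - beta * coef(first_stage)[1])
eps   <- Y - beta0 - beta * D         # implied structural error

# Ratio estimate versus the OLS slope of Y on D
c(beta = beta, ols = unname(coef(lm(Y ~ D))["D"]))
# Sample covariances: zero with Z by construction, nonzero with D
c(cov_Z_nu = cov(Z, resid(first_stage)), cov_Z_eps = cov(Z, eps), cov_D_eps = cov(D, eps))
```

The two covariances with $Z$ are zero up to floating point error in any sample, while the covariance between $D$ and the implied error is not.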

### 2.2 Conditional LATE (CLATE)

Similar to ATE and CATE, we can relax the assumptions for the LATE theorem to hold conditionally. Suppose we have a vector of control variables $W$ such that for $Y=g(D,W,U)$ and $D=h(Z,W,V)$, 

- $Z\perp (U,V)|W$. Or equivalently, $Z\perp (Y(1),Y(0),D(1),D(0))|W$. 
  - This assumption says the instrument is conditionally randomized. 
- $\mathbb{E}(D|Z=1,W)\neq \mathbb{E}(D|Z=0,W)$. 
- $D(1)\geq D(0)$ with probability 1 conditional on $W$.

Then defining the conditional LATE (CLATE) as 
$$
CLATE(W)\equiv \mathbb{E}\left(Y_{i}(1)-Y_{i}(0)|D_{i}(1)>D_{i}(0),W\right).
$$
We can show that 
$$
CLATE(W)=\frac{\mathbb{E}(Y|Z=1,W)-\mathbb{E}(Y|Z=0,W)}{\mathbb{E}(D|Z=1,W)-\mathbb{E}(D|Z=0,W)}.
$$
**Important**. Since $W$ can be continuous or discrete, CLATE no longer always has a linear-model representation, just like CATE. Similarly, even for LATE, when $D$ or $Z$ takes more than two values, linearity becomes an assumption.

## 3. Instrumental Variable Regression, GMM and 2SLS

Now we focus on linear models with an i.i.d. sample $\{(Y_{i},X_{i},Z_{i}):i=1,\ldots,n\}$ like the following:
$$
Y_{i}=X_{i}'\beta+\varepsilon_{i},\ \ \ \ \mathbb{E}(Z_{i}\varepsilon_{i})=0.
$$

- $X_{i}$ containing constant 1, is a $k\times 1$ vector.
- $Z_{i}$ containing constant 1, is an $l\times 1$ vector.
- We require $l\geq k$. Will see the reason later.
- The restriction $\mathbb{E}(Z_{i}\varepsilon_{i})=0$ is equivalent to $cov(Z_{i},\varepsilon_{i})=0$ because it is without loss of generality to set $\mathbb{E}(\varepsilon_{i})=0$ as $X_{i}$ contains a constant. (Why?)
- $X_{i}$ and $Z_{i}$ can have common components. For instance, suppose $X_{i}=(1,D_{i},W_{i})$ and $\mathbb{E}(W_{i}\varepsilon_{i})=0$, then $Z_{i}$ can also contain $W_{i}$. A special case is $Z=X$ and then we are back to the linear model in Lecture 6.
- Components in $X_{i}$ that make $\mathbb{E}(X_{i}\varepsilon_{i})\neq 0$ are called **endogenous variables**. For instance in the previous bullet point, $D_{i}$ is endogenous if $\mathbb{E}(D_{i}\varepsilon_{i})\neq 0$, or $cov(D_{i},\varepsilon_{i})\neq 0$.
- Solving the issue of **endogeneity** is central in econometrics and applied economics.
- We have already shown in Section 2.1 that when $X_{i}=(1,D_{i})'$ and both $D_{i}$ and the (non-constant) instrument are binary, the coefficient $\beta$ on $D_{i}$ is LATE. 
- In all other cases, $\beta$ has causal interpretation **only under** stronger assumptions, for instance the marginal effect is constant (when $D$ is continuous), or CATE is constant in $W$.
- This section focuses on consistently estimating $\beta$, no matter what $\beta$ means.

### 3.1 Moment Conditions and Over-, Just- and Underidentification

By our model and the restriction, we have the following moment conditions for $\beta$ ($k\times 1$):
$$
\begin{align*}
\mathbb{E}(Z_{i}(Y_{i}-X_{i}'\beta))=0\\
\implies \mathbb{E}(Z_{i}X_{i}')\beta=\mathbb{E}(Z_{i}Y_{i}).
\end{align*}
$$
This is a system of $l$ linear equations in the $k$ unknown components of $\beta$. 

- When $X=Z$, we can solve for $\beta$ and get $\beta=[\mathbb{E}(X_{i}X_{i}')]^{-1}\mathbb{E}(X_{i}Y_{i})$. This is Lecture 6. We needed the assumption $\mathbb{E}(X_{i}X_{i}')$ is invertible, or, full rank.

- When $X\neq Z$ but $l=k$, $\mathbb{E}(Z_{i}X_{i}')$ is a square matrix, so we can also assume invertibility, i.e., assume $\mathbb{E}(Z_{i}X_{i}')$ is full rank. Then $\beta=[\mathbb{E}(Z_{i}X_{i}')]^{-1}\mathbb{E}(Z_{i}Y_{i})$.

  - This case is called **just-identification**.

  - Example. Let $X=(1,D)'$ and $Z=(1,\tilde{Z})'$. This is the case we discussed in Section 2 ($\tilde{Z}$ here is the $Z$ there, sorry about the abuse of notation). Now $l=k=2$, and matrix $\mathbb{E}(Z_{i}X_{i}')$ is
    $$
    \mathbb{E}(Z_{i}X_{i}')=\begin{pmatrix}1&\mathbb{E}(D_{i})\\\mathbb{E}(\tilde{Z}_{i})&\mathbb{E}(\tilde{Z}_{i}D_{i})\end{pmatrix}.
    $$
    So $\mathbb{E}(Z_{i}X_{i}')$ is full rank if and only if $\mathbb{E}(\tilde{Z}_{i}D_{i})\neq \mathbb{E}(\tilde{Z}_{i})\mathbb{E}(D_{i})$, i.e., $cov(\tilde{Z}_{i},D_{i})\neq 0$.

  - To push it even further, recall in Section 2 $D$ and $\tilde{Z}$ are both binary. 
    $$
    \begin{align*}
    \mathbb{E}(\tilde{Z}_{i}D_{i})=&\mathbb{E}(\tilde{Z}_{i}D_{i}|\tilde{Z}_{i}=1)\Pr(\tilde{Z}_{i}=1)+0\\
    =&\mathbb{E}(D_{i}|\tilde{Z}_{i}=1)\Pr(\tilde{Z}_{i}=1),
    \end{align*}
    $$
    and
    $$
    \mathbb{E}(D_{i})=\mathbb{E}(D_{i}|\tilde{Z}_{i}=1)\Pr(\tilde{Z}_{i}=1)+\mathbb{E}(D_{i}|\tilde{Z}_{i}=0)[1-\Pr(\tilde{Z}_{i}=1)].
    $$
    Therefore, noting that $\mathbb{E}(\tilde{Z}_{i})=\Pr(\tilde{Z}_{i}=1)$ since $\tilde{Z}_{i}$ is binary, $\mathbb{E}(Z_{i}X_{i}')$ is full rank if and only if
    $$
    \begin{align*}
    &\left[\left(1-\Pr(\tilde{Z}_{i}=1)\right)\Pr(\tilde{Z}_{i}=1)\right]\cdot \mathbb{E}(D_{i}|\tilde{Z}_{i}=1)\\
    \neq &\left[\left(1-\Pr(\tilde{Z}_{i}=1)\right)\Pr(\tilde{Z}_{i}=1)\right]\cdot \mathbb{E}(D_{i}|\tilde{Z}_{i}=0).
    \end{align*}
    $$
    The term in the bracket is the variance of $\tilde{Z}_{i}$, which is strictly positive as long as $\tilde{Z}_{i}$ is not constant. Therefore, in the binary IV, binary $D$ case, $\mathbb{E}(Z_{i}X_{i}')$ is full rank if and only if $\mathbb{E}(D_{i}|\tilde{Z}_{i}=1)\neq \mathbb{E}(D_{i}|\tilde{Z}_{i}=0)$. This is **exactly the same as the second assumption for the LATE theorem!**

- When $l>k$, the $l\times k$ matrix $\mathbb{E}(Z_{i}X_{i}')$ is no longer square, so it is not invertible even if it has full column rank. We still need to assume full rank (rank $k$), but we have to use a *generalized inverse*.

  - This case is called **overidentification**.

  - Recall the rank of a matrix is no greater than $\min\{l,k\}=k$.

  - Let $W$ be an $l\times l$ p.d. matrix. Then $\mathbb{E}(X_{i}Z_{i}')W\mathbb{E}(Z_{i}X_{i}')$ is a $k\times k$ matrix of rank $k$. Left-multiply the moment condition by $\mathbb{E}(X_{i}Z_{i}')W$; we have
    $$
    \mathbb{E}(X_{i}Z_{i}')W\mathbb{E}(Z_{i}X_{i}')\beta=\mathbb{E}(X_{i}Z_{i}')W\mathbb{E}(Z_{i}Y_{i}).
    $$
    Therefore, $\beta=[\mathbb{E}(X_{i}Z_{i}')W\mathbb{E}(Z_{i}X_{i}')]^{-1}\mathbb{E}(X_{i}Z_{i}')W\mathbb{E}(Z_{i}Y_{i})$.

- When $l<k$, i.e., you don't have enough instruments for the endogenous variables, the rank of $\mathbb{E}(Z_{i}X_{i}')$ is at most $\min\{l,k\}=l$. It is not square, but you can take the same step as before and get
  $$
  \mathbb{E}(X_{i}Z_{i}')W\mathbb{E}(Z_{i}X_{i}')\beta=\mathbb{E}(X_{i}Z_{i}')W\mathbb{E}(Z_{i}Y_{i}).
  $$
  However, now the rank of the $k\times k$ matrix $\mathbb{E}(X_{i}Z_{i}')W\mathbb{E}(Z_{i}X_{i}')$ is at most $l$ because $rank(AB)\leq \min(rank(A),rank(B))$ and we have shown the rank of $\mathbb{E}(Z_{i}X_{i}')$ is at most $l$. Therefore, although $\mathbb{E}(X_{i}Z_{i}')W\mathbb{E}(Z_{i}X_{i}')$ is a square matrix, it's not invertible, leaving $\beta$ **unidentified**.

  - This case is called **underidentification**.
  - Underidentification often arises when the model contains nonlinear terms in the endogenous variable, for instance $D^{2}$, or interaction terms such as $DW$. A common strategy is then to square the IV (if it's not binary) and to interact the IV with $W$ as well. 
  - e.g. $Y=\beta_{0}+\beta_{1}D+\beta_{2}D^{2}+\beta_{3}DW+\beta_{4}W+\varepsilon$. You believe $\mathbb{E}(\varepsilon W)=0$ and you have an IV for $D$, called $\tilde{Z}$. Then your $Z$ can be $(1,\tilde{Z},\tilde{Z}^{2},W\tilde{Z},W)'$.
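
The rank argument can be checked in a sample. The sketch below (base R, with a hypothetical first stage and an illustrative helper named `rank_of`) uses the model with $D^{2}$ and $DW$ from the example. It compares the rank of the sample analog of $\mathbb{E}(Z_{i}X_{i}')$ when only $(1,\tilde{Z},W)$ is used with the rank under the expanded instrument vector.

```r
set.seed(3)
n  <- 2000
W  <- rnorm(n)                        # exogenous control
Zt <- rnorm(n)                        # continuous instrument (tilde Z in the text)
D  <- 0.8 * Zt + 0.5 * W + rnorm(n)   # hypothetical first stage
X  <- cbind(1, D, D^2, D * W, W)      # k = 5 regressors

Z_short <- cbind(1, Zt, W)                 # l = 3 < k: too few instruments
Z_full  <- cbind(1, Zt, Zt^2, W * Zt, W)   # l = 5 = k

# Rank of the sample analog of E(Z X'), computed from a QR decomposition
rank_of <- function(Z, X) qr(crossprod(Z, X) / nrow(X))$rank
ranks <- c(short = rank_of(Z_short, X), full = rank_of(Z_full, X), k = ncol(X))
print(ranks)
```

A rank below $k$ means the relevance condition fails no matter which weighting matrix is used.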


**Theorem**. Under $\mathbb{E}(Z_{i}\varepsilon_{i})=0$ and $rank(\mathbb{E}(X_{i}Z_{i}'))=k$, for any p.d. $W$, we have
$$
\beta=[\mathbb{E}(X_{i}Z_{i}')W\mathbb{E}(Z_{i}X_{i}')]^{-1}\mathbb{E}(X_{i}Z_{i}')W\mathbb{E}(Z_{i}Y_{i}).
$$

- Condition $\mathbb{E}(Z_{i}\varepsilon_{i})=0$ is sometimes called the **exclusion restriction**. It means $Z$ is excluded from $\varepsilon$, so is uncorrelated with the latter.

- Condition $rank(\mathbb{E}(X_{i}Z_{i}'))=k$ is called the **relevance condition**. It implies $X$ and $Z$ are correlated and $Z$ has rich variation.

  - This implicitly says $l\geq k$ because otherwise the rank is always smaller than $k$.

- When $l=k$, $W$ does not matter because for three invertible matrices $A,B,C$, $(ABC)^{-1}=C^{-1}B^{-1}A^{-1}$ and thus
  $$
  \begin{align*}
  \beta=&[\mathbb{E}(Z_{i}X_{i}')]^{-1}W^{-1}[\mathbb{E}(X_{i}Z_{i}')]^{-1}\mathbb{E}(X_{i}Z_{i}')W\mathbb{E}(Z_{i}Y_{i})\\
  =&[\mathbb{E}(Z_{i}X_{i}')]^{-1}\mathbb{E}(Z_{i}Y_{i}).
  \end{align*}
  $$
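
The cancellation of $W$ also holds for the sample analogs. The sketch below (base R, simulated data with assumed coefficients, and an illustrative helper `gmm`) evaluates the formula with two different p.d. weighting matrices in a just-identified design and compares both with $(Z'X)^{-1}Z'Y$.

```r
set.seed(11)
n  <- 1000
Zt <- rnorm(n); V <- rnorm(n)
D  <- 0.7 * Zt + V                          # hypothetical first stage
Y  <- 1 + 2 * D + V + rnorm(n)
X  <- cbind(1, D); Z <- cbind(1, Zt)        # l = k = 2

gmm <- function(W) {
  A <- crossprod(X, Z) %*% W                # X'Z W
  solve(A %*% crossprod(Z, X), A %*% crossprod(Z, Y))
}
W_random   <- crossprod(matrix(rnorm(4), 2)) + diag(2)   # an arbitrary p.d. matrix
b_identity <- gmm(diag(2))
b_random   <- gmm(W_random)
b_iv       <- solve(crossprod(Z, X), crossprod(Z, Y))    # (Z'X)^{-1} Z'Y
cbind(b_identity, b_random, b_iv)
```

The three columns agree up to floating point error, which is the sample version of the algebra above.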

### 3.2 Linear GMM, 2SLS and Their Properties

Estimation of $\beta$ is straightforward as we are now familiar with the MM estimation framework:

- From the moment condition we know the **true** $\beta$ satisfies $\beta=[\mathbb{E}(X_{i}Z_{i}')W\mathbb{E}(Z_{i}X_{i}')]^{-1}\mathbb{E}(X_{i}Z_{i}')W\mathbb{E}(Z_{i}Y_{i})$.
- We simply replace $\mathbb{E}$ by $\sum_{i}/n$ and $W$ by a consistent estimator $W_{n}\to_{p}W$.
- $\hat{\beta}^{GMM}=[(X'Z)W_{n}(Z'X)]^{-1}[X'ZW_{n}Z'Y]$.

Remark. When $l>k$ the moment conditions cannot be solved for $\beta$ directly, so we manipulated them a bit and obtained $\beta$ by taking a generalized inverse. We therefore call the corresponding estimator the **generalized method of moments** estimator, or simply GMM.
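
The estimator takes a few lines of matrix algebra. The sketch below is base R; the function name `linear_gmm`, the identity default for the weighting matrix, and the simulated design with true $\beta=(1,2)'$ are all assumptions for illustration. It applies the formula to an overidentified model and prints the OLS estimate for comparison.

```r
# Linear GMM: [(X'Z) W (Z'X)]^{-1} (X'Z) W (Z'Y); the function name is illustrative
linear_gmm <- function(Y, X, Z, W = diag(ncol(Z))) {
  XZ <- crossprod(X, Z)                     # X'Z, a k x l matrix
  drop(solve(XZ %*% W %*% t(XZ), XZ %*% W %*% crossprod(Z, Y)))
}

set.seed(5)
n  <- 5000
Z1 <- rnorm(n); Z2 <- rnorm(n); V <- rnorm(n)
D  <- 0.6 * Z1 + 0.4 * Z2 + V               # hypothetical first stage
Y  <- 1 + 2 * D + V + rnorm(n)              # true beta = (1, 2); D is endogenous via V
X  <- cbind(const = 1, D = D)               # k = 2
Z  <- cbind(1, Z1, Z2)                      # l = 3 > k: overidentified

b_gmm <- linear_gmm(Y, X, Z)                # identity weighting matrix
b_ols <- drop(solve(crossprod(X), crossprod(X, Y)))
est <- rbind(gmm = b_gmm, ols = b_ols)
print(est)
```

Passing a different p.d. matrix as `W` changes the estimate slightly in this overidentified design, but not its probability limit.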

#### 3.2.1 Consistency, Asymptotic Normality, and Inference

Similar to OLS, it's convenient to substitute $Y=X\beta+\varepsilon$ into $\hat{\beta}$ and obtain
$$
\begin{align*}
\hat{\beta}^{GMM}=&[(X'Z)W_{n}(Z'X)]^{-1}(X'ZW_{n}Z'X)\beta+[(X'Z)W_{n}(Z'X)]^{-1}(X'ZW_{n}Z'\varepsilon)\\
\implies \hat{\beta}^{GMM}-\beta=&[(X'Z)W_{n}(Z'X)]^{-1}(X'ZW_{n}Z'\varepsilon)\\
=&\left[\left(\frac{1}{n}\sum_{i}X_{i}Z_{i}'\right)W_{n}\left(\frac{1}{n}\sum_{i}Z_{i}X_{i}'\right)\right]^{-1}\left[\left(\frac{1}{n}\sum_{i}X_{i}Z_{i}'\right)W_{n}\left(\frac{1}{n}\sum_{i}Z_{i}\varepsilon_{i}\right)\right].
\end{align*}
$$
**Consistency**. 

- WLLN and CMT: 
  $$
  \begin{align*}\left[\left(\frac{1}{n}\sum_{i}X_{i}Z_{i}'\right)W_{n}\left(\frac{1}{n}\sum_{i}Z_{i}X_{i}'\right)\right]^{-1}\to_{p}&\left[\mathbb{E}(X_{i}Z_{i}')W\mathbb{E}(Z_{i}X_{i}')\right]^{-1},\\
  \left(\frac{1}{n}\sum_{i}X_{i}Z_{i}'\right)W_{n}\to_{p}&\mathbb{E}(X_{i}Z_{i}')W,\\
  \frac{1}{n}\sum_{i}Z_{i}\varepsilon_{i}\to_{p}&\mathbb{E}(Z_{i}\varepsilon_{i})=0.
  \end{align*}
  $$

- CMT: $\hat{\beta}^{GMM}-\beta\to_{p}0$.

**Asymptotic Normality**.

Multiplying both sides of $\hat{\beta}^{GMM}-\beta$ by $\sqrt{n}$, we get
$$
\sqrt{n}\left(\hat{\beta}^{GMM}-\beta\right)=\left[\left(\frac{1}{n}\sum_{i}X_{i}Z_{i}'\right)W_{n}\left(\frac{1}{n}\sum_{i}Z_{i}X_{i}'\right)\right]^{-1}\left[\left(\frac{1}{n}\sum_{i}X_{i}Z_{i}'\right)W_{n}\left(\frac{\sqrt{n}}{n}\sum_{i}Z_{i}\varepsilon_{i}\right)\right].
$$

- CLT and by $\mathbb{E}(Z_{i}\varepsilon_{i})=0$:
  $$
  \frac{\sqrt{n}}{n}\sum_{i}Z_{i}\varepsilon_{i}=\frac{\sqrt{n}}{n}\sum_{i}(Z_{i}\varepsilon_{i}-\mathbb{E}(Z_{i}\varepsilon_{i}))\to_{d}N(0,\mathbb{E}(\varepsilon_{i}^{2}Z_{i}Z_{i}')).
  $$

- WLLN, CMT and delta method:
  $$
  \begin{align*}
  &\sqrt{n}(\hat{\beta}^{GMM}-\beta)\to_{d}N(0,\Sigma),\\
  &\Sigma=\left[\mathbb{E}(X_{i}Z_{i}')W\mathbb{E}(Z_{i}X_{i}')\right]^{-1}\cdot \left[\mathbb{E}(X_{i}Z_{i}')W\right]\cdot \mathbb{E}(\varepsilon_{i}^{2}Z_{i}Z_{i}')\cdot\left[W\mathbb{E}(Z_{i}X_{i}')\right]\cdot \left[\mathbb{E}(X_{i}Z_{i}')W\mathbb{E}(Z_{i}X_{i}')\right]^{-1}.
  \end{align*}
  $$

- Looks very complicated, but still a sandwich formula.

- You can verify that this is exactly equal to the asymptotic variance formula for OLS if $Z=X$.
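
A plug-in estimator of $\Sigma$ replaces each expectation by a sample average and $\varepsilon_{i}$ by the residual. The sketch below is base R on simulated data; the heteroscedastic design and the identity weighting matrix are assumptions for illustration. It computes the sandwich and the implied standard errors.

```r
set.seed(5)
n  <- 5000
Z1 <- rnorm(n); Z2 <- rnorm(n); V <- rnorm(n)
D  <- 0.6 * Z1 + 0.4 * Z2 + V
Y  <- 1 + 2 * D + (V + rnorm(n)) * sqrt(0.5 + Z1^2)   # heteroscedastic error (assumed design)
X  <- cbind(1, D); Z <- cbind(1, Z1, Z2)
W  <- diag(ncol(Z))                                    # any p.d. weighting matrix

Sxz   <- crossprod(X, Z) / n                           # (1/n) sum X_i Z_i'
bread <- solve(Sxz %*% W %*% t(Sxz))
b     <- drop(bread %*% Sxz %*% W %*% crossprod(Z, Y) / n)   # GMM estimate
e     <- drop(Y - X %*% b)                             # residuals
Omega <- crossprod(Z * e) / n                          # (1/n) sum e_i^2 Z_i Z_i'
Sigma <- bread %*% Sxz %*% W %*% Omega %*% W %*% t(Sxz) %*% bread
se    <- sqrt(diag(Sigma) / n)                         # standard errors of b
cbind(estimate = b, se = se, t = b / se)
```

Here `Z * e` multiplies row $i$ of $Z$ by $e_{i}$, so `crossprod(Z * e)` is $\sum_{i}e_{i}^{2}Z_{i}Z_{i}'$.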

**Homoscedasticity, optimal weighting matrix, and 2SLS**.

An interesting result is that consistency is invariant to the choice of $W$: no matter what $W$ is, as long as it's p.d., we have consistency. However, $W$ DOES affect the variance because it enters $\Sigma$. So it's possible to choose a $W$ such that the variance is smallest (so that it's easier to get a significant estimate).

We'll not discuss the general optimal weighting matrix. We only focus on a special case: Homoscedasticity.

**Definition**. In IV's setup, homoscedasticity means that $\mathbb{E}(\varepsilon^{2}_{i}|Z_{i})=\mathbb{E}(\varepsilon^{2}_{i})\equiv \sigma^{2}$.

Under homoscedasticity, $\mathbb{E}(\varepsilon_{i}^{2}Z_{i}Z_{i}')$ in $\Sigma$ simplifies to $\mathbb{E}(\varepsilon_{i}^{2}Z_{i}Z_{i}')=\mathbb{E}[\mathbb{E}(\varepsilon_{i}^{2}Z_{i}Z_{i}'|Z_{i})]=\mathbb{E}[\mathbb{E}(\varepsilon_{i}^{2}|Z_{i})Z_{i}Z_{i}']=\sigma^{2}\mathbb{E}(Z_{i}Z_{i}')$.

Therefore, if we choose $W=[\mathbb{E}(Z_{i}Z_{i}')]^{-1}$, you can see that
$$
\Sigma^{homo}_{W=[\mathbb{E}(Z_{i}Z_{i}')]^{-1}}=\sigma^{2}\cdot\left[\mathbb{E}(X_{i}Z_{i}')[\mathbb{E}(Z_{i}Z_{i}')]^{-1}\mathbb{E}(Z_{i}X_{i}')\right]^{-1}.
$$
Much simpler, isn't it? In fact this is **the smallest possible** variance matrix under homoscedasticity, meaning that $\Sigma^{homo}_{W}-\Sigma^{homo}_{W=[\mathbb{E}(Z_{i}Z_{i}')]^{-1}}$ is always p.s.d.

A consistent estimator $W_{n}$ for this optimal weighting matrix under homoscedasticity is $\left(\sum_{i}Z_{i}Z_{i}'/n\right)^{-1}$, namely $(Z'Z/n)^{-1}$. Substituting it into the GMM estimator, we get what is called the **two-stage least squares (2SLS)** estimator:
$$
\hat{\beta}^{2SLS}=(X'Z(Z'Z)^{-1}Z'X)^{-1}(X'Z(Z'Z)^{-1}Z'Y).
$$
Therefore, 2SLS is just a **special case** of GMM with a specific weighting matrix.

* 2SLS is the optimal GMM under homo.
* However, homoscedasticity fails all too often in practice.
* Hence, in practice, when your 2SLS is not significant, try the optimal GMM instead. I'll give you R examples later.
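
The sketch below (base R, simulated heteroscedastic data, all coefficients assumed) computes 2SLS as GMM with $W_{n}=(Z'Z/n)^{-1}$ and checks that two literal OLS stages give the same point estimate. It then computes a two-step GMM estimate with $W_{n}=\left[\frac{1}{n}\sum_{i}e_{i}^{2}Z_{i}Z_{i}'\right]^{-1}$, where $e_{i}$ are 2SLS residuals. This weighting matrix is the standard efficient choice under heteroscedasticity; it is not derived in these notes, so treat it as an assumption here.

```r
set.seed(9)
n  <- 3000
Z1 <- rnorm(n); Z2 <- rnorm(n); V <- rnorm(n)
D  <- 0.6 * Z1 + 0.4 * Z2 + V
Y  <- 1 + 2 * D + (V + rnorm(n)) * sqrt(0.5 + Z1^2)   # heteroscedastic by construction
X  <- cbind(1, D); Z <- cbind(1, Z1, Z2)

gmm <- function(W) {
  A <- crossprod(X, Z) %*% W
  drop(solve(A %*% crossprod(Z, X), A %*% crossprod(Z, Y)))
}
# 2SLS: W_n = (Z'Z/n)^{-1}
b_2sls <- gmm(solve(crossprod(Z) / n))
# Same point estimate from two OLS stages (the second-stage standard errors would be wrong)
D_hat   <- fitted(lm(D ~ Z1 + Z2))
b_stage <- unname(coef(lm(Y ~ D_hat)))
# Two-step GMM with the weighting matrix built from 2SLS residuals
e     <- drop(Y - X %*% b_2sls)
b_eff <- gmm(solve(crossprod(Z * e) / n))
rbind(b_2sls, b_stage, b_eff)
```

The first two rows coincide, while the third differs slightly because the weighting matrix is different.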

**Just-identification and the IV estimator**.

When $l=k$, as we have shown in [Section 3.1](#3.1 Moment Conditions and Over-, Just- and Underidentification), $W$ does not matter in any sense; it will be cancelled out. So in this case, 2SLS is the same as GMM with any $W_{n}$, and this special form is called **the IV estimator**:
$$
\hat{\beta}^{IV}=(Z'X)^{-1}(Z'Y).
$$
Note that you can write $\hat{\beta}^{IV}$ as $[(Z'Z)^{-1}(Z'X)]^{-1}[(Z'Z)^{-1}(Z'Y)]$, which is, roughly speaking, the OLS estimator of $Y$ on $Z$ divided by the OLS of $X$ on $Z$, echoing the derivation in [Section 2.1](#2.1 Linear Model). 

**Inference**. 

By asymptotic normality, you can do t-tests and Wald tests for any component or function of $\beta$ in exactly the same way (and with the same code) as in OLS. The only difference is $\Sigma$: your software computes the standard errors from the formulas in this section.

**Summary**. 

- In the linear IV model we study, 2SLS and IV are both GMM estimators. 2SLS is a GMM with a specific weighting matrix. IV estimator is the GMM under just-identification. 
- You never really run *two stages* for 2SLS. The name totally misses the point.
- These estimators are all consistent and asymptotically normal under i.i.d. sampling, exclusion, and relevance. 
- When you have over-identification and your estimates are not significant by 2SLS, try using optimal GMM.

### 3.3 Testing the Two Key Assumptions

Exclusion and relevance, sometimes, can be tested.

#### 3.3.1 Exclusion: Over-identification Test

When you have over-identification, you can test whether all the instruments are exogenous, i.e. $\mathbb{E}(Z_{i}\varepsilon_{i})=0$. The idea is as follows:

- If $\mathbb{E}(Z_{i}\varepsilon_{i})=0$, then it must be the case that
  $$
  \mathbb{E}(Z_{i}(Y_{i}-X_{i}'\beta))=0.
  $$

- This is an $l\times 1$ vector. Using the idea of Wald test, we can transform it into the following weighted Euclidean distance squared ($W$ is the optimal weighting matrix):
  $$
  [\mathbb{E}(Z_{i}(Y_{i}-X_{i}'\beta))]'W[\mathbb{E}(Z_{i}(Y_{i}-X_{i}'\beta))]=0.
  $$

- So if we set up the null as $\mathbb{H}_{0}:\mathbb{E}(Z_{i}\varepsilon_{i})=0$, then under the null, the following statistic should be small:
  $$
  J_{n}\equiv n\cdot \left(\frac{1}{n}\sum Z_{i}\left(Y_{i}-X_{i}'\hat{\beta}^{GMM}\right)\right)'W_{n}\left(\frac{1}{n}\sum Z_{i}\left(Y_{i}-X_{i}'\hat{\beta}^{GMM}\right)\right).
  $$

- So we just obtain $\hat{\beta}^{GMM}$ with $W_{n}$ and calculate $J_{n}$ and see whether it is indeed small.

- To know how small is small, we need to know $J_{n}$'s asymptotic distribution. One can show (let me know if you're interested in the proof) that
  $$
  J_{n}\to_{d}\chi_{l-k}^{2}.
  $$

- So we can compare $J_{n}$ with the $(1-\alpha)$-th quantile of $\chi_{l-k}^{2}$ and reject the null if $J_{n}$ exceeds it.

The test only works under overidentification, i.e., $l>k$. Some intuition for the reason:

- You have $k$ unknowns to solve ($\beta$), so at least you need $k$ equations. Otherwise, your $\beta$ is not even unique.
- So when you have just-identification, you have to believe all the moment conditions are correct, i.e., $\mathbb{E}(Z_{i}\varepsilon_{i})=0$ since the moment conditions are derived from this equality.
- When you have over-identification, however, you have the freedom to believe $k$ equations, solve for $\beta$, and check whether this $\beta$ makes the other $l-k$ equations hold as well. If there's a contradiction, then some of the $l$ instruments are not valid, rejecting $\mathbb{E}(Z_{i}\varepsilon_{i})=0$.

Two drawbacks of the test:

- The test has notoriously low power. So when you cannot reject the null, it's still possible that some of the instruments are not valid.
- When you do reject the null, on the other hand, you only know there are some bad IVs, but you don't know which are bad and which are good.
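
The steps of the test can be sketched in base R as follows. The helper name `j_test`, the simulated design, and the two-step construction of $W_{n}$ (inverse of $\frac{1}{n}\sum_{i}e_{i}^{2}Z_{i}Z_{i}'$ with 2SLS residuals $e_{i}$, a standard estimate of the optimal weighting matrix) are assumptions for illustration. The second outcome violates exclusion on purpose by letting one instrument enter the outcome equation directly.

```r
set.seed(21)
n  <- 4000
Z1 <- rnorm(n); Z2 <- rnorm(n); V <- rnorm(n)
D  <- 0.6 * Z1 + 0.4 * Z2 + V
X  <- cbind(1, D); Z <- cbind(1, Z1, Z2)            # l - k = 1 overidentifying restriction

j_test <- function(Y, X, Z) {
  n <- nrow(X)
  gmm <- function(W) {
    A <- crossprod(X, Z) %*% W
    drop(solve(A %*% crossprod(Z, X), A %*% crossprod(Z, Y)))
  }
  e1 <- drop(Y - X %*% gmm(solve(crossprod(Z) / n)))  # first step: 2SLS residuals
  Wn <- solve(crossprod(Z * e1) / n)                  # estimated optimal weighting matrix
  e2 <- drop(Y - X %*% gmm(Wn))                       # second-step GMM residuals
  gbar <- crossprod(Z, e2) / n                        # sample moments (1/n) sum Z_i e_i
  J  <- n * drop(t(gbar) %*% Wn %*% gbar)
  df <- ncol(Z) - ncol(X)
  c(J = J, df = df, p_value = pchisq(J, df, lower.tail = FALSE))
}

Y_valid   <- 1 + 2 * D + V + rnorm(n)               # both instruments excluded from the error
Y_invalid <- 1 + 2 * D + V + rnorm(n) + 0.5 * Z2    # Z2 enters the outcome directly
res <- rbind(valid = j_test(Y_valid, X, Z), invalid = j_test(Y_invalid, X, Z))
print(res)
```

In the invalid design the statistic is far in the right tail of $\chi^{2}_{1}$, yet nothing in the output says that $Z_{2}$ rather than $Z_{1}$ is the bad instrument, which is the second drawback listed above.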

#### 3.3.2 Relevance: Weak Instrument and First Stage F-value

When $Z$ and $X$ are correlated, but only weakly, $Z$ are **weak instruments**. Weak IVs cause a lot of problems, e.g. bias, poor normal approximation of the distribution, standard errors that are too small, etc.

There are formal tests to run, but when you only have one endogenous variable, you can regress the endogenous variable on $Z$, and look at the F-value of that regression. 

- The F-value is asymptotically equivalent to the Wald statistic divided by the number of restrictions, under the null that all coefficients of $Z$ are 0.
- If the F-value is greater than 10, it suggests the IVs are not weak.
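
A first-stage F-value can be computed with `lm` and `anova` in base R. The sketch below uses simulated data with assumed coefficients and an illustrative helper `first_stage_F`. Because the first stage here also contains an exogenous control $W$, the F-value is computed for the excluded instruments only, by comparing the first stage with and without them; this is a refinement of the overall F-value mentioned above, stated here as an assumption about what one wants to test.

```r
set.seed(13)
n  <- 1000
Z1 <- rnorm(n); Z2 <- rnorm(n); W <- rnorm(n); V <- rnorm(n)
D_strong <- 0.5 * Z1 + 0.3 * Z2 + 0.4 * W + V     # instruments move D a lot
D_weak   <- 0.02 * Z1 + 0.01 * Z2 + 0.4 * W + V   # instruments barely move D

first_stage_F <- function(D) {
  unrestricted <- lm(D ~ Z1 + Z2 + W)     # first stage with all instruments
  restricted   <- lm(D ~ W)               # excluded instruments Z1, Z2 dropped
  anova(restricted, unrestricted)$F[2]    # F statistic for H0: coefficients on Z1, Z2 are 0
}
F_stats <- c(strong = first_stage_F(D_strong), weak = first_stage_F(D_weak))
print(F_stats)
```

Comparing the two values with the threshold of 10 shows how the rule of thumb separates the two designs.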

Weak IV is still an active research area.

## 4. Applications

Instrumental variable regression is a workhorse in economics. It has revolutionized the entire discipline, making causal inference possible for social sciences in many important areas where it's impossible to run experiments.

- Two decades ago you could get a PhD degree and a job in a nice econ department if you found a smart IV.
- IV is still one of the most popular methods in applied econ, and on top of that, some other equally popular methods, e.g., regression discontinuity design (RDD), are just special IVs in particular applications.

So, where can we find an IV?

### 4.1 IVs from RCTs

Treatment assignment in randomized controlled trials provides a perfect source of instruments. It is fully randomized (completely or conditionally), but sometimes people do not fully comply. 

Example: Lottery to Head Start.

- Head Start is an early childhood education (before primary school) program in the U.S.
- In the early 2000s, the U.S. government launched a lottery program providing access to Head Start to eligible families.
- If a family won the lottery, their kid could attend the program.
- It turns out there's no full compliance.
- One can use the lottery as an instrument to study the effect of early childhood education on people's later development.
- Kline and Walters (2016), "Evaluating public programs with close substitutes: the case of Head Start", *Quarterly Journal of Economics*.

### 4.2 Natural/Quasi-Experiment

Sometimes nature or policy runs the experiment for you.

- Natural experiment: You cannot control natural phenomena like earthquakes, typhoons, or even rainfall. They may affect the treatment but are uncorrelated with all other factors that affect the outcome.
- Quasi-experiment: Policies are usually made without considering people’s idiosyncratic heterogeneity ($U_{i}$), so can be viewed as exogenous.

Example of natural experiment: Rainfall, poverty, and crime.

- Hypothesis: Low income may incite violence, so we usually see that the regional income level is negatively correlated with the crime rate.
- Regressing crime rate on regional income is problematic. Maybe a high crime rate also causes low income. 
  - Reverse causality.
- Some economists then look at agriculturally-dependent regions, and use rainfall level as the instrument.
  - Miguel, Satyanath and Sergenti (2004), "Economic shocks and civil conflict: an instrumental variables approach", *Journal of Political Economy*.
- Exclusion: Rainfall must not affect the crime rate through channels other than income.
  - Counterexample: Sarsons (2015), "Rainfall and conflict: a cautionary tale", *Journal of Development Economics*, finds that crime rate is still highly correlated with rainfall level in regions whose income is not sensitive to rainfall (e.g., downstream of irrigation dams).

Example of quasi experiment: Maternal care and birthweight.

- Hypothesis: More maternal care (healthcare service for mother-to-be during pregnancy) may causally contribute to high birthweight of newborns. (Birthweight is an important indicator for babies' future cognitive and noncognitive development.)
- Regressing birthweight on maternal care is problematic because women who seek better maternal care may have healthier lifestyles, better economic status, etc. 
  - Self-selection.
- The U.S. government reduced cost/provided free access to maternal care service to the poor. In the late 80s and early 90s, they abruptly lowered the threshold in the definition of being “poor”.
- Exclusion: The policy has nothing to do with any other factors that affect birthweight.
  - Questionable: Different states chose different implementation dates, so some families moved to get the benefit. Then we have self-selection again.

### 4.3 More Exogenous Variables than the Treatment

When there’s no way to conduct RCT and no way to find a good natural/quasi-experiment (which is very common), this is almost the last option. The quality of such instruments is usually more questionable than that of the previous ones, but at least they might get closer to the truth than regressing on the treatment directly.

Example: Return to Education by Nobel Laureates.

- Hypothesis: more education causally leads to higher income.
- Regressing income on education is problematic because ability is omitted.
- Instrument 1: Quarter of birth.
  - Angrist and Krueger (1991), "Does compulsory school attendance affect schooling and earnings?", *Quarterly Journal of Economics*.
  - Exclusion: Quarter of birth may be uncorrelated with any cognitive/noncognitive ability of children, and may be also uncorrelated with family background.
  - Relevance: If you have to be 6 years old or above to enter elementary school, those who are born in the last quarter are on average 1 year older than others. They may be more likely to attain less education to earn money, support the family, etc. Could be weak.
- Instrument 2: Distance to a college when children are in middle or high school.
  - Card (1995), "Using geographic variation in college proximity to estimate the return to schooling", in *Aspects of Labour Market Behaviour: Essays in Honour of John Vanderkamp*.
  - Exclusion: Whether you live near a college may be uncorrelated with any cognitive/noncognitive ability of children. It may be correlated with family background, but these data are available and can be included as controls.
  - Relevance: Families who choose to live near a college may have a strong emphasis on education.

Some other common sources: 

- More aggregated level variables: Data are individual-level or household level but use neighborhood level variation as instrument.
  - Example: Household consumption of nutrients, instrumented by consumption of nutrients in reference groups.
  - Dubois, Griffith, and Nevo (2014), "Do prices and attributes explain international differences in food purchases?", *American Economic Review*.
- Hausman-type instrument.
  - Example: Endogenous price instrumented by prices of the same product in other markets (districts).
  - Nevo (2001), "Measuring market power in the ready-to-eat cereal industry", *Econometrica*.

### 4.4 Real Data Example

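```r
# Packages: ivreg (IV/2SLS), sandwich (robust covariances), lmtest (coeftest),
# nlWaldTest (Wald tests of nonlinear restrictions)
library(ivreg)
library(sandwich)
library(lmtest)
library(nlWaldTest)

# Schooling data shipped with the ivreg package
data(SchoolingReturns, package = "ivreg")
summary(SchoolingReturns[, 1:8])

# Benchmark: OLS of log wage on education, ignoring endogeneity
m_ols <- lm(log(wage) ~ education + poly(experience, 2) + ethnicity + smsa + south,
            data = SchoolingReturns)
summary(m_ols)

# 2SLS with ivreg's three-part formula: exogenous | endogenous | instruments
m_iv <- ivreg(log(wage) ~ ethnicity + smsa + south | education + poly(experience, 2) |
                nearcollege + poly(age, 2), data = SchoolingReturns)
summary(m_iv)

# Heteroscedasticity-robust (HC3) standard errors and t-tests
# (named vcov_hc3 to avoid masking the base function cov)
vcov_hc3 <- vcovHC(m_iv, type = "HC3")
sqrt(diag(vcov_hc3))
coeftest(m_iv, vcov. = vcov_hc3)
```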