# Instrumental Variables

by Jonas Peters, Niklas Pfister, Martin Emil Jakobsen and Rune Christiansen, 21.06.2019

This notebook aims to give you a basic understanding of the instrumental variable approach and when it can be used to infer causal relations.

In the following, let all variables have 
* zero mean,  
* finite second moments, and
* a joint distribution that is absolutely continuous with respect to the Lebesgue measure.

In [5]:
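# Optional: point R to a local library path (machine-specific, adjust as needed)
# .libPaths("C:/ProgramData/Anaconda3/Lib/R/library")
# Optional: install the AER package once if it is not yet available
# install.packages("AER", repos = "http://cran.us.r-project.org")
library(AER)  # provides the CollegeDistance data set used below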

## Instrumental Variable Model

The goal of this method is to estimate the causal effect of a predictor variable $X$ on a target variable $Y$ if the effect from $X$ to $Y$ is confounded. The idea of the instrumental variable approach is to account for this confounding by considering an additional variable $I$ called an instrument. Although there exist numerous extensions, here, we focus on the classical case. We provide two definitions.


First, assume the following SCM
\begin{align}
I &:= N_I\\
H &:= N_H\\ 
X &:= I \gamma  + H \delta_X + N_X\\
Y &:= X \beta + H \delta_Y + N_Y.\\
\end{align}
(All variables except $Y$ could be multi-dimensional, in which case, they should be written as row vectors: $1 \times d$.) If all variables are $1$-dimensional, the corresponding DAG looks as follows.
\begin{align}
    &\phantom{0}\\
    &\begin{array}{ccccc}
       & & &H                 & \\
       & &\phantom{abcdefgh}\overset{\delta_X}{\swarrow} &            & \overset{\delta_Y}{\searrow}\phantom{abcdefgh}\\
       & &                    &               & \\
       I &\overset{\gamma}{\longrightarrow} &X                  & \overset{\beta}{\longrightarrow} & Y\\
        \end{array}\\
      &\phantom{0}
\end{align}
Here, $I$ is called an instrumental variable for the causal effect from $X$ to $Y$. It is essential that $I$ affects $Y$ only via $X$ (and not directly).
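To see why the confounder $H$ is a problem, one can simulate from the SCM above. The following is only a sketch: the coefficient values, the sample size, the seed and the standard normal noise are illustrative assumptions that are not part of the model definition.

```r
# Minimal simulation of the SCM above (all variables one-dimensional).
# All numeric values below are illustrative assumptions.
set.seed(1)
n       <- 10000
gamma   <- 1.0   # effect of I on X
beta    <- 2.0   # causal effect of X on Y
delta_X <- 1.5   # effect of the hidden confounder H on X
delta_Y <- -1.0  # effect of the hidden confounder H on Y

I <- rnorm(n)                            # I := N_I
H <- rnorm(n)                            # H := N_H (unobserved in practice)
X <- I * gamma + H * delta_X + rnorm(n)  # X := I gamma + H delta_X + N_X
Y <- X * beta  + H * delta_Y + rnorm(n)  # Y := X beta + H delta_Y + N_Y

# OLS of Y on X without intercept (all variables have mean zero).
# In population this equals beta + delta_X * delta_Y / var(X)
# = 2 - 1.5 / 4.25, i.e. roughly 1.65 rather than 2.
beta_ols <- unname(coef(lm(Y ~ X - 1)))
print(beta_ols)
```

Under these assumptions, the OLS coefficient stays away from $\beta = 2$ no matter how large $n$ is, because $H$ enters both $X$ and $Y$.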



Second, it is possible to define instrumental variables without SCMs, too. Let us therefore write
\begin{equation}
Y =  X \beta + \epsilon_Y
\end{equation}
(this can always be done). Here, $\epsilon_Y$ is allowed to depend on $X$ (if there is a confounder $H$ between $X$ and $Y$, this is usually the case). We then call a variable $I$ an instrumental variable if it satisfies the following three conditions:
1. $\operatorname{cov}(X,I)$ is of full rank (relevance).
2. $\operatorname{cov}(\epsilon_Y,I)=0$ (exogeneity).
3. $\operatorname{cov}(I)$ is of full rank. 

Informally speaking, these conditions again mean that $I$ affects $Y$ "only through its effect on $X$".
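The three conditions can be inspected empirically on simulated data, where $\epsilon_Y$ is known. The sketch below assumes a one-dimensional linear model with a hidden confounder; all coefficients, the seed and the variable names are illustrative assumptions. On real data, $\epsilon_Y$ is not observed, so condition 2 cannot be checked in this way.

```r
# Simulated data in which I is a valid instrument (illustrative values only)
set.seed(2)
n    <- 10000
beta <- 2
I     <- rnorm(n)
H     <- rnorm(n)                 # hidden confounder
X     <- I + 1.5 * H + rnorm(n)
eps_Y <- -1.0 * H + rnorm(n)      # epsilon_Y = H delta_Y + N_Y
Y     <- X * beta + eps_Y

relevance   <- cov(X, I)      # condition 1: clearly non-zero (population value 1)
exogeneity  <- cov(eps_Y, I)  # condition 2: close to zero
var_I       <- var(I)         # condition 3: strictly positive
confounding <- cov(eps_Y, X)  # non-zero (population value -1.5): X is not exogenous

print(c(relevance = relevance, exogeneity = exogeneity,
        var_I = var_I, confounding = confounding))
```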

## Estimation

We now want to illustrate how the existence of an instrumental variable $I$ can be used to estimate the causal effect $\beta$ in the model above. Let us therefore assume that we have received data in matrix form
* $\mathbf{Y}$ - the target variable $n \times 1$ 
* $\mathbf{X}$ - the covariates $n \times d$
* $\mathbf{I}$ - the instruments $n \times m$

where $n > \max(m, d)$.

We now assume that $I$ is a valid instrument (we come back to this question in Exercise 2 below). To estimate the causal effect of $X$ on $Y$, there are several ways of writing down the same estimator.

OPTION 1: The following estimator is sometimes called the generalized method of moments (GMM) estimator:
$$
\hat{\beta}^{GMM}_n := (\mathbf{X}^t \mathbf{I} (\mathbf{I}^t \mathbf{I})^{-1} \mathbf{I}^t \mathbf{X})^{-1} \, \mathbf{X}^t \mathbf{I} (\mathbf{I}^t \mathbf{I})^{-1} \mathbf{I}^t \mathbf{Y}
$$
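The formula above can be transcribed almost literally into R. The following is a minimal sketch: the function name `gmm_iv`, the simulated example (with $d = 2$, $m = 3$) and all coefficient values are illustrative assumptions and not part of the original notebook.

```r
# Hypothetical helper: GMM estimator for data matrices X (n x d), I (n x m), Y (n x 1)
gmm_iv <- function(X, I, Y) {
  X <- as.matrix(X); I <- as.matrix(I); Y <- as.matrix(Y)
  # A = X^t I (I^t I)^{-1} I^t, using solve() instead of an explicit inverse
  A <- t(X) %*% I %*% solve(t(I) %*% I, t(I))
  # Solve (A X) b = A Y for b
  solve(A %*% X, A %*% Y)
}

# Illustrative example with a hidden confounder H
set.seed(3)
n     <- 20000
I     <- matrix(rnorm(n * 3), n, 3)
H     <- rnorm(n)
Gamma <- matrix(c(1, 0, 0, 1, 1, -1), nrow = 3, ncol = 2)  # m x d, full rank
X     <- I %*% Gamma + outer(H, c(1, -1)) + matrix(rnorm(n * 2), n, 2)
beta  <- c(2, -1)
Y     <- X %*% beta + 1.5 * H + rnorm(n)

beta_gmm <- gmm_iv(X, I, Y)
print(beta_gmm)
```

In this simulation the estimate should be close to the assumed value `c(2, -1)`, even though `H` confounds `X` and `Y`.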

OPTION 2: 
We can use a so-called 2-stage least squares (2SLS) procedure. Step 1: Regress $X$ on $I$ and compute the corresponding fitted values $\hat{X}$. Step 2: Regress $Y$ on $\hat{X}$. The estimate is given by the regression coefficients from step 2.
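The two steps can be sketched with two calls to `lm`. The simulated data (one covariate, two instruments, a hidden confounder `H`), the coefficient values and the seed are illustrative assumptions. The closed-form expression from OPTION 1 is evaluated alongside for a numerical comparison.

```r
# Illustrative data: d = 1 covariate, m = 2 instruments, true beta = 2
set.seed(4)
n <- 5000
I <- matrix(rnorm(n * 2), n, 2)
H <- rnorm(n)                                # hidden confounder
X <- drop(I %*% c(1, 0.5)) + H + rnorm(n)
Y <- 2 * X - H + rnorm(n)

# Step 1: regress X on I (no intercept, since all variables have mean zero)
X_hat <- fitted(lm(X ~ I - 1))
# Step 2: regress Y on the fitted values and keep the coefficient
beta_2sls <- unname(coef(lm(Y ~ X_hat - 1)))

# Closed-form estimator from OPTION 1 for comparison
A <- t(X) %*% I %*% solve(t(I) %*% I, t(I))
beta_gmm <- drop(solve(A %*% X, A %*% Y))

print(c(beta_2sls = beta_2sls, beta_gmm = beta_gmm))
```

Note that the standard errors reported by the second `lm` call are not valid for the 2SLS estimate, since they ignore that $\hat{X}$ is itself estimated; only the coefficient is used here.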

The following four exercises go over some of the details of the 2SLS and apply it to a real data set.

### Exercise 1
Assume that the data are i.i.d. from the following two structural assignments 
\begin{align*}
Y &:= X \cdot \beta + \epsilon_Y \\
X &:= I \cdot \gamma + \epsilon_X,
\end{align*}
where $X$ and $I$ are written as $1 \times d$ and $1 \times m$ vectors, respectively. Here, $\epsilon_X$ and $\epsilon_Y$ are not necessarily independent, but the instrument $I$ is assumed to satisfy the assumptions 1., 2., and 3. above. 

a) Write down conditions on $d$ and $m$ that guarantee that $\hat{\beta}^{GMM}_n$ is well-defined (with probability one). You may assume that the sample versions $\mathbf{I}^t \mathbf{X}$ and $\mathbf{I}^t \mathbf{I}$ of instrumental variable conditions 1. and 3. are of full rank.

<i>Hint: Prove that for a specific ordering of $d$ and $m$ (e.g. $d\leq m$, $d\geq m$ etc.) the matrix $\mathbf{X}^t \mathbf{I} (\mathbf{I}^t \mathbf{I})^{-1} \mathbf{I}^t \mathbf{X}$ inverted in the GMM estimator is positive definite, hence invertible.</i>

b) Prove that, under these conditions, the GMM estimator is consistent for the causal parameter, i.e., $\hat{\beta}^{GMM}_n \rightarrow \beta$ in probability.

<i>Hint: Use the model specification from above to write $\hat{\beta}^{GMM}_n$ as $\beta$ plus a remainder term that is a linear function of $\epsilon_Y$. Then use the continuous mapping theorem, which states that you may apply a mapping to both sides of a convergence in probability statement if the mapping is continuous at the limit (note that the inverse operator on matrices $M \stackrel{\text{Inv.}}{\mapsto} M^{-1}$ is continuous). Finally, use the fact that the product of two random sequences, each of which converges in probability to a constant, converges in probability to the product of these constants.</i>

c) Assume $d \leq  m$. Prove that the methods 2SLS and GMM provide the same estimate. 

### Solution 1

### End of Solution 1

For illustration, we use the <tt>CollegeDistance</tt> data set from [1] available in the R package <tt>AER</tt>.

In [6]:
# load CollegeDistance data set
data("CollegeDistance")
head(CollegeDistance)
# read out relevant variables
Y <- CollegeDistance$score
X <- CollegeDistance$education
I <- CollegeDistance$distance

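| gender | ethnicity | score | fcollege | mcollege | home | urban | unemp | wage | distance | tuition | education | income | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| male | other | 39.15 | yes | no | yes | yes | 6.2 | 8.09 | 0.2 | 0.88915 | 12 | high | other |
| female | other | 48.87 | no | no | yes | yes | 6.2 | 8.09 | 0.2 | 0.88915 | 12 | low | other |
| male | other | 48.74 | no | no | yes | yes | 6.2 | 8.09 | 0.2 | 0.88915 | 12 | low | other |
| male | afam | 40.4 | no | no | yes | yes | 6.2 | 8.09 | 0.2 | 0.88915 | 12 | low | other |
| female | other | 40.48 | no | no | no | yes | 5.6 | 8.09 | 0.4 | 0.88915 | 13 | low | other |
| male | other | 54.71 | no | no | yes | yes | 5.6 | 8.09 | 0.4 | 0.88915 | 12 | low | other |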


This data set consists of $4739$ observations on $14$ variables from a high school student survey conducted by the Department of Education in $1980$, with a follow-up in $1986$. In this notebook, we only consider the following variables:
* $Y$ - base year composite test score.  These are achievement tests given to high school seniors in the sample.
* $X$ - number of years of education.
* $I$ - distance from closest 4-year college (units are in 10 miles).


### Exercise 2

Argue whether the variable $I$ can be used as an instrumental variable to infer the causal effect of $X$ on $Y$. Are there arguments why it might not be a valid instrument? <i>Hint: You can perform a linear regression and use the p-value of the corresponding t-test for a non-zero regression coefficient. This test is identical to the t-test based on Pearson's correlation coefficient.</i>
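The equivalence mentioned in the hint can be illustrated on synthetic data (not on the <tt>CollegeDistance</tt> data, so the exercise itself stays open). The variable names with the suffix `_sim`, the effect size, the sample size and the seed are illustrative assumptions; the suffix is used to avoid overwriting `Y`, `X` and `I` from above.

```r
# Synthetic data with a weak linear dependence of X_sim on I_sim
set.seed(5)
n     <- 200
I_sim <- rnorm(n)
X_sim <- 0.3 * I_sim + rnorm(n)

# p-value of the t-test for the slope in a simple linear regression
p_lm  <- summary(lm(X_sim ~ I_sim))$coefficients["I_sim", "Pr(>|t|)"]
# p-value of the t-test based on Pearson's correlation coefficient
p_cor <- cor.test(X_sim, I_sim)$p.value

print(c(p_lm = p_lm, p_cor = p_cor))
```

Both tests use the same t-statistic with $n - 2$ degrees of freedom, so the two p-values agree up to numerical precision.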

### Solution 2

### End of Solution 2

### Exercise 3
Use 2SLS to estimate the causal effect of $X$ on $Y$ based on the instrument $I$. Compare your results with a standard OLS regression of $Y$ on $X$ (that includes an intercept). What happens to the correlation between $X$ and the residuals in both methods? Which method yields the smaller residual variance?

### Solution 3

### End of Solution 3

A slightly different approach to 2SLS is to use the formula

OPTION 3:
\begin{equation} \tag{1}
\hat{\beta}_n = (\mathbf{I}^t \mathbf{X})^{-1} \mathbf{I}^t \mathbf{Y}.
\end{equation}

If $d=m$, this formula yields the same estimator as OPTIONS 1 and 2.

    
<div class="alert alert-block alert-info">
<b>Statement:</b> <i>When $m=d$ we have that the GMM and 2SLS estimators have the reduced form</i>
 $$
 \hat{\beta}^{GMM}_n= \hat{\beta}^{2SLS}_n = (\mathbf{I}^t \mathbf{X})^{-1}\mathbf{I}^t \mathbf{Y}  
 $$
 <i>Proof:</i> Let $ P_I =\mathbf{I} (\mathbf{I}^t\mathbf{I})^{-1}\mathbf{I}^t$ be the orthogonal projection onto the column space of $\mathbf{I}$.  We then have 
 $$
 P_I \mathbf{X}= \mathbf{I} (\mathbf{I}^t\mathbf{I})^{-1}\mathbf{I}^t\mathbf{X} = \mathbf{I}\mathbf{Z},
 $$
 where $\mathbf{Z}:=  (\mathbf{I}^t\mathbf{I})^{-1}\mathbf{I}^t\mathbf{X}$ is invertible, as it is the product of two invertible matrices. Using that $(\mathbf{Z}^t \mathbf{I}^t \mathbf{X})^{-1}= (\mathbf{I}^t \mathbf{X})^{-1}(\mathbf{Z}^t)^{-1}$, we then get 
 \begin{align*}
 \hat{\beta}^{GMM}_n &= (\mathbf{X}^t \mathbf{I} (\mathbf{I}^t\mathbf{I})^{-1}\mathbf{I}^t\mathbf{X})^{-1} \mathbf{X}^t \mathbf{I}(\mathbf{I}^t\mathbf{I})^{-1} \mathbf{I}^t  \mathbf{Y}  \\
 &= ((P_I\mathbf{X})^t \mathbf{X})^{-1} (P_I\mathbf{X})^t \mathbf{Y} \\
 &= (\mathbf{Z}^t \mathbf{I}^t \mathbf{X})^{-1} \mathbf{Z}^t \mathbf{I}^t \mathbf{Y} \\
 &= (\mathbf{I}^t \mathbf{X})^{-1} (\mathbf{Z}^t)^{-1}  \mathbf{Z}^t \mathbf{I}^t \mathbf{Y} \\
  &= (\mathbf{I}^t \mathbf{X})^{-1}  \mathbf{I}^t \mathbf{Y}
 \end{align*}
</div>
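The statement can also be checked numerically. The sketch below simulates a just-identified setting with $m = d = 2$ and compares the GMM formula with formula (1); the variable names with the suffix `_sim` and all numeric values are illustrative assumptions.

```r
# Just-identified case: as many instruments as covariates (m = d = 2)
set.seed(6)
n     <- 1000
I_sim <- matrix(rnorm(n * 2), n, 2)
H_sim <- rnorm(n)                                  # hidden confounder
G_sim <- matrix(c(1, 0.5, -0.5, 1), 2, 2)          # invertible m x d matrix
X_sim <- I_sim %*% G_sim + outer(H_sim, c(1, 1)) + matrix(rnorm(n * 2), n, 2)
Y_sim <- X_sim %*% c(2, -1) + H_sim + rnorm(n)

# GMM estimator (OPTION 1)
A_sim    <- t(X_sim) %*% I_sim %*% solve(t(I_sim) %*% I_sim, t(I_sim))
beta_gmm <- solve(A_sim %*% X_sim, A_sim %*% Y_sim)

# Reduced form (1): (I^t X)^{-1} I^t Y
beta_simple <- solve(t(I_sim) %*% X_sim, t(I_sim) %*% Y_sim)

print(cbind(beta_gmm, beta_simple))
```

The two columns should coincide up to floating-point error, in line with the proof above.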

### Exercise 4
Apply the above estimator (1) to <tt>CollegeDistance</tt> data and compare your result with the one from Exercise 3. (If you have included intercepts in the 2SLS, you need to replace the product moments by sample covariances.)

### Solution 4

In [7]:
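# Estimator (1) with sample covariances in place of product moments,
# which accounts for the intercepts used in the 2SLS fit
beta <- cov(Y, I) / cov(X, I)
print(beta)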

[1] 3.548279


They do indeed coincide.
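The role of the intercept can be made explicit on synthetic data with non-zero means. In the sketch below, the variable names with the suffix `_sim`, the means, the coefficients and the seed are illustrative assumptions. It compares 2SLS with intercepts, the covariance ratio used above, and formula (1) with raw product moments.

```r
# Synthetic data with non-zero means and a hidden confounder
set.seed(7)
n     <- 2000
I_sim <- rnorm(n, mean = 3)
H_sim <- rnorm(n)
X_sim <- 10 + 0.8 * I_sim + H_sim + rnorm(n)
Y_sim <- 5 + 2 * X_sim - H_sim + rnorm(n)   # true effect of X_sim on Y_sim is 2

# 2SLS with intercepts in both stages
X_hat_sim <- fitted(lm(X_sim ~ I_sim))
beta_2sls <- unname(coef(lm(Y_sim ~ X_hat_sim))["X_hat_sim"])

# Formula (1) with sample covariances
beta_cov <- cov(Y_sim, I_sim) / cov(X_sim, I_sim)

# Formula (1) with raw product moments (ignores the non-zero means)
beta_raw <- sum(I_sim * Y_sim) / sum(I_sim * X_sim)

print(c(beta_2sls = beta_2sls, beta_cov = beta_cov, beta_raw = beta_raw))
```

In this simulation, the first two numbers agree up to floating-point error, while the raw product-moment version differs noticeably because the variables are not centered.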

### End of Solution 4

## References

[1] C. Kleiber, A. Zeileis (2008). Applied Econometrics with R. Springer-Verlag New York.

[2] M. Eaton, M. D. Perlman (1973). The nonsingularity of generalized sample covariance matrices. The Annals of Statistics, 1. doi: 10.1214/aos/1176342465.