# Conditions for IV Estimation (No Controls)
Let's consider the single endogenous regressor, single instrument IV case where $L=K=1$, i.e., that $X=(1,X_1)'$ and $Z=(1,Z_1)'$ are both $2\times 1$ vectors. Let our structural model be
$$
\begin{aligned}
Y &= X'\gamma + U \\
  &= \gamma_0 + \gamma_1X_1 + U
\end{aligned}
$$

and assume that $E[U]=0$ but $E[X_1U]\neq 0$, i.e., that $X_1$ is endogenous. Let the standard regression conditions hold, i.e., that $E[XX']$ and $E[ZZ']$ are invertible (equivalently, no perfect collinearity in $Z$ or $X$). In order for the IV estimand $\beta^{IV}$ to identify $\gamma$, we require the following conditions to hold:

## 1. Exogeneity 

$E[Z_1U] = 0$

## 2. Relevance 

$Cov[X_1,Z_1]\neq 0$

## 3. Exclusion 

The structural equation $Y = X'\gamma + U$ does not include $Z_1$ as a determinant of $Y$. This assumption is implicit when we write out the structural equation for $Y$ and don't include $Z_1$, but it is worth noting in practice to make sure that your structural model is specified correctly.

## Discussion

Remember our plot of useful variation versus actual variation? Conditions (1) and (2) essentially guarantee that $Z_1$ provides useful variation for identifying the *ceteris paribus* effect of an increase in $X_1$ on $Y$. That is, it provides useful variation for the identification of $\gamma_1$.

If Condition (1), Exogeneity, doesn't hold then the same problem we had with $X_1$ shows up in $Z_1$. Namely, we can't untangle movements in $Z_1$ from movements in $U$.

If Condition (2), Relevance, doesn't hold then even though movements in $Z_1$ may be unrelated to movements in $U$, they provide no information about movements in $X_1$. Thus, we can't link the exogenous variation in $Z_1$ to variation in $X_1$, and we therefore learn nothing about the *ceteris paribus* effect of changes in $X_1$ on changes in $Y$.
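
To see why both conditions are needed, it may help to write out the slope of the IV estimand in this $L=K=1$ case. The following is a sketch, assuming the estimand is defined as $\beta^{IV}=E[ZX']^{-1}E[ZY]$, whose slope component is a ratio of covariances. Substituting the structural equation for $Y$ gives
$$
\beta^{IV}_1 = \frac{Cov[Z_1,Y]}{Cov[Z_1,X_1]} = \frac{\gamma_1Cov[Z_1,X_1] + Cov[Z_1,U]}{Cov[Z_1,X_1]} = \gamma_1 + \frac{Cov[Z_1,U]}{Cov[Z_1,X_1]}.
$$
Relevance guarantees that the denominator is not zero. Exogeneity, together with $E[U]=0$ (so that $Cov[Z_1,U]=E[Z_1U]$), makes the last term vanish, leaving $\beta^{IV}_1=\gamma_1$.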

## Testing the conditions

Of the three conditions above, only Condition (2), Relevance, can be tested in the data. We usually test it by looking at the slope coefficient from the first stage, i.e., the regression of $X_1$ on $Z_1$. Call this parameter $\pi_1$. We can test the hypothesis $H_0:\pi_1=0$. If we reject decisively, then the first stage is strong and we have evidence that the Relevance condition holds. If we cannot reject, or the p-value is only marginally small, this is evidence that $Z_1$ is a *weak instrument*, and we should be worried that the variance of our IV estimate will be very large.
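
To get a feel for why a weak first stage is worrying, here is a small Monte Carlo sketch. Everything in it is made up for illustration and is not part of the labor example: the data generating process, the first-stage slopes `pi1 = 1` and `pi1 = 0.05`, and the helper names `iv_slope` and `simulate_iv`. It compares the spread of the IV slope across repeated samples under a strong and a weak first stage.

```r
set.seed(1)

# Slope of the just-identified IV estimator with a single instrument
iv_slope = function(y, x, z) cov(z, y) / cov(z, x)

# Draw `reps` samples of size n and return the IV slope from each one.
# pi1 is the first-stage slope; the true structural slope is 1.
simulate_iv = function(pi1, n = 500, reps = 500) {
  replicate(reps, {
    z = rnorm(n)
    u = rnorm(n)
    x = pi1*z + 0.8*u + rnorm(n)  # x is endogenous because it loads on u
    y = 1 + x + u
    iv_slope(y, x, z)
  })
}

strong = simulate_iv(pi1 = 1)
weak = simulate_iv(pi1 = 0.05)

# Compare the spread of the estimates. The IQR is used because a weak
# first stage tends to produce a few extreme draws.
c(strong = IQR(strong), weak = IQR(weak))
```

With a strong first stage the estimates should cluster tightly around the true slope of 1, while with the weak first stage they should be far more dispersed.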

## Example 1: Supply and Demand for Labor

Recall our labor supply and demand model for the farming industry. Labor supply and demand were written as
$$
g(H,V,U) = a_0 + a_1H + U + V
$$
$$
f(U,Q) = b_0 + b_1U + b_2Q,
$$
and equating the two we could solve for the equilibrium level of hours worked:
$$
H = \frac{b_0-a_0 + (b_1 - 1)U + b_2Q - V}{a_1}.
$$
Our structural equation of interest is the supply curve
$$
W = a_0 + a_1H + U + V,
$$

where we assume that $E[U]=E[V]=0$ and that soil quality $Q$ is unrelated to anything that shifts tastes for work, i.e., that $E[QU]=E[QV]=0$. Now let's see if we satisfy our assumptions for IV estimation.

1. **Exogeneity**. This condition holds. Why? We assumed that $E[QU]=E[QV]=0$ so that $E[Q(U+V)]=0$. We have good reason to believe this to be true, as soil quality is likely unrelated to human capital $U$ and taste shifters $V$.

2. **Relevance**. Since equilibrium hours worked are written as $H = \frac{b_0-a_0 + (b_1 - 1)U + b_2Q - V}{a_1}$, they are a function of $Q$, so provided $b_2\neq 0$, then $Cov[H,Q]\neq 0$.

3. **Exclusion**. We wrote out the supply curve so as not to be a function of soil quality $Q$, so this condition is satisfied.
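
The Relevance check, and the resulting identification of $a_1$, can be written out explicitly. This is a sketch that uses only the equilibrium expression for $H$, the demand equation for $W$, and the assumptions $E[U]=E[V]=0$ and $E[QU]=E[QV]=0$ (so that $Cov[U,Q]=Cov[V,Q]=0$):
$$
Cov[H,Q] = \frac{(b_1-1)Cov[U,Q] + b_2Var[Q] - Cov[V,Q]}{a_1} = \frac{b_2Var[Q]}{a_1},
\qquad
Cov[W,Q] = b_1Cov[U,Q] + b_2Var[Q] = b_2Var[Q].
$$
Provided $b_2\neq 0$ and $Var[Q]>0$, the ratio of the two is
$$
\frac{Cov[W,Q]}{Cov[H,Q]} = a_1,
$$
which is the slope of the supply curve.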

### Simulation

Let's go ahead and simulate some data from this model. We will let $b_0=1$, $b_1=1$, $b_2=1$, $a_0=0.5$ and $a_1=1$, and let $U\sim U[-0.75, 0.75]$, $V\sim U[-0.25, 0.25]$ and $Q\sim U[-0.25, 0.25]$. Because we know the precise way in which our data is generated in this simulation, we know that Conditions 1-3 hold, so that the IV estimator should converge to $a_1=1$.
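
For comparison, we can work out what OLS would target under this design. This is a sketch that assumes $U$, $V$ and $Q$ are drawn independently, as they are in the code below. Since $b_1=1$, equilibrium hours reduce to $H = 0.5 + Q - V$ and wages are $W = 1 + U + Q$. A $U[-c,c]$ variable has variance $c^2/3$, so $Var[Q]=Var[V]=0.25^2/3=1/48$, and
$$
\frac{Cov[H,W]}{Var[H]} = \frac{Var[Q]}{Var[Q]+Var[V]} = \frac{1/48}{2/48} = 0.5.
$$
So in large samples the OLS slope should settle near $0.5$ rather than $a_1=1$.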

For ease of notation, we may sometimes write $Y=W$, $X=(1,X_1)'=(1,H)'$ and $Z=(1,Z_1)'=(1,Q)'$. This will help us match our example up with the code below.

In [1]:
set.seed(210)
N = 1000
b0 = 1
b1 = 1
b2 = 1
a0 = 0.5
a1 = 1

# Simulating heterogeneity
U = runif(N, -0.75, 0.75)
V = runif(N, -0.25, 0.25)
Q = runif(N, -0.25, 0.25)

# Computing hours worked and wages
H = (b0 - a0)/a1 - V/a1 + b2*Q/a1 + (b1 - 1)*U/a1
W = b0 + b1*U + b2*Q

data = cbind(W, H, Q)
data[1:10,]

W,H,Q
0.8985105,0.5919283,-0.11950464
1.3762683,0.4942684,-0.12642235
1.7758442,0.7686322,0.18622874
1.0338774,0.4738535,0.1147872
0.5237263,0.1011106,-0.17372953
0.4748033,0.7146198,0.10289106
0.4924021,0.1956776,-0.17375258
0.557017,0.8807128,0.24910121
1.0115156,0.3457241,0.06642625
0.5443131,0.3373904,-0.00715636


In [2]:
Y = data[, 1]
X = cbind(rep(1,N), data[,2])
Z = cbind(rep(1,N), data[,3])

L = dim(Z)[2] - 1 # number of instruments
K = dim(X)[2] - 1 # number of regressors

reg = function(Y, X){solve(t(X)%*%X)%*%t(X)%*%Y}
iv = function(Y, X, Z){reg(Y, Z%*%reg(X, Z))}
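
For reference, the two helper functions above correspond to the following matrix formulas, where $X$, $Z$ and $Y$ now denote the stacked data matrices. The function `reg(Y, X)` computes the OLS coefficients, and `iv(Y, X, Z)` runs OLS of $Y$ on the first-stage fitted values $Z\hat{\pi}$:
$$
\hat{\pi} = (Z'Z)^{-1}Z'X, \qquad \hat{\beta}^{IV} = \big((Z\hat{\pi})'(Z\hat{\pi})\big)^{-1}(Z\hat{\pi})'Y.
$$
In the just-identified case considered here ($L=K$), $Z'X$ is square, and if it is invertible the expression simplifies to
$$
\hat{\beta}^{IV} = (Z'X)^{-1}Z'Y.
$$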

#### First Stage

In [3]:
# First stage
pi = reg(X, Z)
pi

0,1
1.0,0.5079352
-2.13371e-16,0.9341092


In [4]:
ZZ_inv = solve(t(Z)%*%Z)
Uf.hat = X[,2] - Z%*%pi[,2]
pi1 = pi[,2]
ZU = sweep(Z, MARGIN=1, Uf.hat, `*`)
ZUUZ = t(ZU)%*%ZU
Vf = N*ZZ_inv%*%ZUUZ%*%ZZ_inv # First stage estimate of variance
Vf

0,1
0.02114218,-0.01318858
-0.01318858,0.97229594


In [5]:
t.stat.f = pi[2,2]/sqrt(Vf[2,2]/N)
t.stat.f

In [6]:
p.val.f = 2*(1 - pnorm(abs(t.stat.f)))
p.val.f

You can also see this first stage result by running the regression with the `lm` function.

In [7]:
fit.f = lm(H ~ Q, data=data.frame(data))
summary(fit.f)


Call:
lm(formula = H ~ Q, data = data.frame(data))

Residuals:
      Min        1Q    Median        3Q       Max 
-0.261399 -0.121400 -0.000996  0.133237  0.251718 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.507935   0.004598  110.48   <2e-16 ***
Q           0.934109   0.031536   29.62   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1449 on 998 degrees of freedom
Multiple R-squared:  0.4678,	Adjusted R-squared:  0.4673 
F-statistic: 877.4 on 1 and 998 DF,  p-value: < 2.2e-16


So it appears that the first stage is quite strong. The slight differences in the t-stats arise because `lm` assumes homoskedasticity by default and also makes a degrees-of-freedom correction. These differences have minor effects here, as our t-stats are rather close.
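
Written out, the manual first-stage calculation corresponds to the usual heteroskedasticity-robust ("sandwich") variance estimator. Here $\hat{U}_{f,i}$ denotes the first-stage residual for observation $i$ (the code's `Uf.hat`) and $Z_i$ the $i$-th row of `Z` written as a column vector:
$$
\hat{V}_f = N(Z'Z)^{-1}\Big(\sum_{i=1}^N Z_iZ_i'\hat{U}_{f,i}^2\Big)(Z'Z)^{-1},
\qquad
t = \frac{\hat{\pi}_1}{\sqrt{[\hat{V}_f]_{22}/N}}.
$$
No homoskedasticity assumption and no degrees-of-freedom correction enter this formula, which is where the small discrepancy with `lm` comes from.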

#### Second Stage

In [8]:
b.iv = reg(Y, Z%*%pi)
b.iv

0
0.5574834
0.8958827


In [9]:
b.iv = iv(Y, X, Z)
b.iv

0
0.5574834
0.8958827


So we see that we are quite close to the truth ($a_0=0.5$, $a_1=1$). If we had done OLS...

In [10]:
b.ols = reg(Y, X)
b.ols

0
0.7977931
0.4324498


then we would have been way off the truth. What about t-statistics and p-values for the second stage?

In [11]:
piZZpi_inv = solve(t(pi)%*%t(Z)%*%Z%*%pi)
U.hat = Y - X%*%reg(Y, Z%*%pi)
ZU = sweep(Z, MARGIN=1, U.hat, `*`)
piZUUZpi = t(pi)%*%t(ZU)%*%ZU%*%pi
V = N*piZZpi_inv%*%piZUUZpi%*%piZZpi_inv # Second stage estimate of variance
V

0,1
3.151625,-5.702071
-5.702071,11.011021


In [12]:
t.stat = b.iv[2]/sqrt(V[2,2]/N)
t.stat

In [13]:
p.val = 2*(1 - pnorm(abs(t.stat)))
p.val

We could also have done this using the `ivreg` function from the `AER` package.

In [14]:
library(AER)
fit = ivreg(W ~ H|Q, data=data.frame(data))
summary(fit)



Call:
ivreg(formula = W ~ H | Q, data = data.frame(data))

Residuals:
    Min      1Q  Median      3Q     Max 
-0.9267 -0.3722  0.0170  0.3769  0.9079 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.55748    0.05572  10.005   <2e-16 ***
H            0.89588    0.10395   8.618   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4463 on 998 degrees of freedom
Multiple R-Squared: -0.005532,	Adjusted R-squared: -0.006539 
Wald test: 74.27 on 1 and 998 DF,  p-value: < 2.2e-16 


So again, the t-statistic associated with `H` is close, but not identical, to what we computed above, because of differences in how `ivreg` treats homoskedasticity and degrees of freedom. But we see it largely doesn't matter.
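
For reference, the manual second-stage variance calculation corresponds to the following robust formula, with $\hat{\pi}$ the first-stage coefficient matrix (the code's `pi`):
$$
\hat{V} = N(\hat{\pi}'Z'Z\hat{\pi})^{-1}\,\hat{\pi}'\Big(\sum_{i=1}^N Z_iZ_i'\hat{U}_i^2\Big)\hat{\pi}\,(\hat{\pi}'Z'Z\hat{\pi})^{-1},
\qquad
\hat{U}_i = Y_i - X_i'\hat{\beta}^{IV}.
$$
Note that the residuals $\hat{U}_i$ are built from the actual regressors $X_i$, not from the first-stage fitted values $Z\hat{\pi}$. The t-statistic is then $\hat{\beta}^{IV}_1/\sqrt{[\hat{V}]_{22}/N}$.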

# IV Conditions (with Controls)
If we recall from class, for a general model $Y=\gamma_0 + \gamma_1X_1 + \gamma_2X_2 + \epsilon$, where $\gamma_2$ may be 0, if we have an instrument $Z_1$, the IV estimand will identify $\gamma_1$ if the following conditions hold. Let $\tilde{Z}_1=Z_1 - BLP(Z_1|X_2)$.

## Exogeneity
$Cov[\tilde{Z}_1, \epsilon]=0$

## Relevance
$Cov[\tilde{Z}_1,X_1]\neq 0$

## Exclusion
$Z_1$ is not in the structural model for $Y$.

### Discussion
So regardless of whether $X_2$ is in the structural model ($\gamma_2\neq 0$) or not ($\gamma_2=0$), these conditions are the same. So even if we think $\gamma_2=0$, we may still include $X_2$ in our analysis if we think it can help isolate the exogenous variation in $Z_1$ needed to determine the effect of $X_1$ on $Y$.

### Example: Labor Supply and Demand with Controls
Now consider the labor supply and demand in the farming industry augmented with two additional variables $C$, for number of children, and $P$, for allergen or pollen content in the air at the farm:
$$
g(H,C,P,V,U) = a_0 + a_1H + a_2C + a_3P + V + U
$$
$$
f(U,Q) = b_0 + b_1Q + U.
$$
Let us consider an IV scenario where $X=(1,H,C,P)'$ and $Z=(1,Q,C,P)'$. That is, we are proposing using $Q$ as an instrument for $H$ and using $C$ and $P$ as controls. Suppose further that the pollen content is directly related to soil quality, so that $Q = c_0 + c_1P + \epsilon$. Assume some shifts in soil quality are directly connected with pollen content and therefore may be endogenous, but some shifts (due to $\epsilon$) are completely exogenous.

1. **Exogeneity**. We need that $Cov[\tilde{Q},V+U]=0$ where $\tilde{Q}=Q - BLP(Q|C,P)$. This should hold since after controlling for $P$, shifts in soil quality are assumed to be exogenous. Notice that if we don't control for $P$, then $Q$ would not be exogenous (we'll see the implications of not using $P$ as a control below). Why? Because if we omit $P$ from the analysis, $P$ goes into the error term, where it both determines $H$ and is correlated with $Q$.
2. **Relevance**. We need that $Cov[\tilde{Q},H]\neq 0$. This is clearly true by our assumptions, since even after controlling for $P$ and $C$ ($C$ is unrelated to $Q$), $Q$ is still correlated with $H$ through $\epsilon$.
3. **Exclusion**. By assumption of our model, $Q$ does not directly affect supply.

### Simulation

Let's go ahead again and simulate the data from this model. Solving for equilibrium hours worked we have
$$
H = \frac{b_0 + b_1Q - a_0 - a_2C - a_3P - V}{a_1}.
$$
For our simulation, let $P$ be distributed $U[0,1]$ and $U$ and $V$ distributed $U[-0.5,0.5]$. Construct $C$ as $0.4V + 0.5S$ where $S$ is some variable that captures the cost of having children with distribution $U[0,1]$. Furthermore, let $b_0=b_1=1$ and $a_0=0$, $a_1=1$, and $a_2=a_3=0.25$. Assume $c_0=0.5$, $c_1=1$ and $\epsilon\sim U[-0.5, 0.5]$.

In [22]:
set.seed(1234)
N = 10000
b0 = 1
b1 = 1
a0 = 0
a1 = 1
a2 = 0.25
a3 = 0.25
c0 = 0.5
c1 = 1

# Simulating heterogeneity
U = runif(N, -0.5, 0.5)
V = runif(N, -0.5, 0.5)
P = runif(N, 0, 1)
S = runif(N, 0, 1)
e = runif(N, -0.5, 0.5)

C = 0.4*V + 0.5*S
Q = c0 + c1*P + e

# Computing hours worked and wages
H = (b0 + b1*Q - a0 - a2*C - a3*P - V)/a1
W = b0 + b1*Q + U

data = cbind(W, H, C, P, Q)
data[1:10,]

W,H,C,P,Q
0.6940386,1.278966,0.3522792,0.0346164,0.08033519
2.0368684,1.594477,0.4668449,0.8765384,0.91456902
2.0137378,1.532908,0.2084239,0.7347251,0.90446308
1.8996361,1.093959,0.2705243,0.6218362,0.77625664
2.3171924,2.050227,0.2113488,0.697803,0.95627699
2.5822114,2.443002,0.2635888,0.5783307,1.44190076
1.6684968,2.043967,0.3563914,0.5560342,1.15900108
1.4235288,1.894837,0.3287923,0.5496697,0.69097834
2.2048536,1.950013,0.2644586,0.9274793,1.0387698
1.4659642,1.331461,0.3221735,0.274385,0.45171305


In [23]:
Y = data[, 1]
X = cbind(rep(1,N), data[, c(2, 3, 4)])
Z = cbind(rep(1,N), data[, c(5, 3, 4)])

#### First Stage

In [25]:
fit.f = lm(H ~ Q + C + P, data=data.frame(data))
summary(fit.f)


Call:
lm(formula = H ~ Q + C + P, data = data.frame(data))

Residuals:
     Min       1Q   Median       3Q      Max 
-0.53947 -0.16125 -0.00216  0.16038  0.54702 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.234520   0.006575  187.75   <2e-16 ***
Q            0.996574   0.007745  128.67   <2e-16 ***
C           -1.211266   0.011982 -101.09   <2e-16 ***
P           -0.244328   0.011062  -22.09   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2236 on 9996 degrees of freedom
Multiple R-squared:  0.7847,	Adjusted R-squared:  0.7846 
F-statistic: 1.214e+04 on 3 and 9996 DF,  p-value: < 2.2e-16


The p-value on $Q$ is really small, so we believe we have a strong instrument that satisfies relevance.

#### Second Stage

In [24]:
b.iv = iv(Y, X, Z)
b.iv

0,1
,-0.257661
H,1.012427
C,1.223791
P,0.246857


Our sample is very large so these estimates are really close to $\beta^{IV}$. We see that the coefficient on $H$ is very close to 1, so IV does a nice job of estimating $a_1=1$. We also see that the coefficient on $P$ is about $0.25$, so we also do a great job of estimating $a_3=0.25$. But what happened to $a_2$? The coefficient on $C$ is about $1.22$, nowhere near $a_2=0.25$. Because $C$ has an endogenous component (it is correlated with tastes for work $V$), we cannot identify $a_2$ with $\beta^{IV}_2$.

Thus, we see that the model is partially identified and that we were able to identify our object of interest, $a_1$, without identifying all of the vector of supply parameters $a=(a_0,a_1,a_2,a_3)'$.

We can confirm our result using the `ivreg` command.

In [26]:
fit = ivreg(W ~ H + C + P|Q + C + P, data=data.frame(data))
summary(fit)


Call:
ivreg(formula = W ~ H + C + P | Q + C + P, data = data.frame(data))

Residuals:
      Min        1Q    Median        3Q       Max 
-1.007260 -0.272592  0.002755  0.269767  0.968761 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.25766    0.02374  -10.85   <2e-16 ***
H            1.01243    0.01280   79.10   <2e-16 ***
C            1.22379    0.02503   48.90   <2e-16 ***
P            0.24686    0.01614   15.30   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3683 on 9996 degrees of freedom
Multiple R-Squared: 0.469,	Adjusted R-squared: 0.4689 
Wald test:  4250 on 3 and 9996 DF,  p-value: < 2.2e-16 
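
The role of $\tilde{Q}=Q - BLP(Q|C,P)$ in the conditions above can be checked numerically. The sketch below regenerates the data with the same parameter values as the simulation cell (hard-coded here so the block runs on its own), builds the sample analogue of $\tilde{Q}$ as the residual from a regression of $Q$ on $C$ and $P$, and compares the ratio $Cov[\tilde{Q},W]/Cov[\tilde{Q},H]$ with the coefficient on $H$ from the matrix IV formula. The names `Q.tilde` and `b.fwl` are introduced only for this illustration.

```r
set.seed(1234)
N = 10000

# Same design as the simulation above, with parameter values plugged in
U = runif(N, -0.5, 0.5)
V = runif(N, -0.5, 0.5)
P = runif(N, 0, 1)
S = runif(N, 0, 1)
e = runif(N, -0.5, 0.5)
C = 0.4*V + 0.5*S
Q = 0.5 + 1*P + e
H = 1 + Q - 0.25*C - 0.25*P - V
W = 1 + Q + U

# Sample analogue of Q.tilde = Q - BLP(Q | C, P)
Q.tilde = resid(lm(Q ~ C + P))

# Ratio of covariances using the residualized instrument
b.fwl = cov(Q.tilde, W) / cov(Q.tilde, H)

# Just-identified IV estimator (Z'X)^{-1} Z'Y with controls C and P
X = cbind(1, H, C, P)
Z = cbind(1, Q, C, P)
b.iv = solve(t(Z) %*% X, t(Z) %*% W)

c(b.fwl, b.iv[2])
```

The two numbers should agree up to floating-point error, which is the sense in which the instrument only "uses" the variation in $Q$ left over after controlling for $C$ and $P$.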


What if we forgot to control for $P$?

In [27]:
fit = ivreg(W ~ H + C|Q + C, data=data.frame(data))
summary(fit)


Call:
ivreg(formula = W ~ H + C | Q + C, data = data.frame(data))

Residuals:
     Min       1Q   Median       3Q      Max 
-1.15453 -0.28419  0.00215  0.28250  1.11728 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.43116    0.02389  -18.05   <2e-16 ***
H            1.15352    0.01090  105.83   <2e-16 ***
C            1.39483    0.02476   56.34   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3929 on 9997 degrees of freedom
Multiple R-Squared: 0.3955,	Adjusted R-squared: 0.3954 
Wald test:  5600 on 2 and 9997 DF,  p-value: < 2.2e-16 


We see that we now have bias in our estimate of $a_1$ (remember we use a very large sample). This is because $Q$ is only exogenous after controlling for $P$. Thus, we needed to include the control in order to identify $a_1$.
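
We can work out roughly where the estimate is heading when $P$ is omitted. This is a sketch under the simulation design, where $Q$ is independent of $C$, $U$ and $V$, so that residualizing $Q$ on $C$ only removes its mean. The error term now contains $a_3P$, and $Cov[Q,P]=c_1Var[P]$, so
$$
\frac{Cov[Q,W]}{Cov[Q,H]} = \frac{b_1Var[Q]}{\big(b_1Var[Q] - a_3c_1Var[P]\big)/a_1} = \frac{a_1b_1Var[Q]}{b_1Var[Q] - a_3c_1Var[P]}.
$$
With $Var[P]=Var[\epsilon]=1/12$ we have $Var[Q]=c_1^2Var[P]+Var[\epsilon]=1/6$, and plugging in the parameter values gives
$$
\frac{1/6}{1/6 - 0.25/12} = \frac{8}{7}\approx 1.14,
$$
which is in line with the estimate on `H` above and differs from $a_1=1$.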

## Example: Returns to Schooling
Let's do the example in class where we use a measure of distance to college as an instrument for schooling. Our measure is the number of colleges per square foot in each state in year 2017.

In [None]:
url = paste("https://raw.githubusercontent.com/jtorcasso/teaching/",
            "master/econ210_fall2017/data/project/psid_2011.csv", sep="")
df = read.csv(url)
df$wage =  ifelse(df$inc_labor==0 | df$hours==0, NaN, df$inc_labor/df$hours)
df$logwage = log(df$wage)

Now merge the college data to the PSID.

In [31]:
url2 = paste("https://raw.githubusercontent.com/jtorcasso/teaching/",
            "master/econ210_fall2017/data/project/col_info.csv", sep="")
col.info = read.csv(url2)
df = merge(df, col.info[,c("psid_id", "edu_dens", "count")], by.x="birthstate", by.y="psid_id")

In [33]:
cols = c("edu", "edu_dens", "logwage", "birthyear", "black", "m_edu", "age", "workexp")
df.m = na.omit(df[df$male==1 & df$birthyear >= 1970 & df$birthyear <= 1985, cols])
N = dim(df.m)[1]
N

First, let's try it without any controls.

In [34]:
fit = ivreg(logwage ~ edu|edu_dens, data=df.m)
summary(fit)$coef[2,]

Whoa, this is a huge estimate. Let's check the first stage.

In [35]:
fit.f = lm(edu ~ edu_dens, data=df.m)
summary(fit.f)


Call:
lm(formula = edu ~ edu_dens, data = df.m)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.3573 -2.2926 -0.2926  1.7056  2.7175 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  14.2762     0.1166  122.44   <2e-16 ***
edu_dens      8.6054    11.6343    0.74     0.46    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.055 on 580 degrees of freedom
Multiple R-squared:  0.0009424,	Adjusted R-squared:  -0.0007801 
F-statistic: 0.5471 on 1 and 580 DF,  p-value: 0.4598


Uh oh, looks like a weak instrument. But let's go ahead and try everything again, this time adding controls.

In [36]:
fit = ivreg(logwage ~ edu + factor(birthyear) + m_edu + black + age|
            edu_dens + factor(birthyear) + m_edu + black + age, data=df.m)
summary(fit)$coef[2,]

In [37]:
fit.f = ivreg(edu ~ edu_dens + factor(birthyear) + m_edu + black + age, data=df.m)
summary(fit.f)


Call:
ivreg(formula = edu ~ edu_dens + factor(birthyear) + m_edu + 
    black + age, data = df.m)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9283 -1.4903  0.2455  1.3548  6.6724 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)            7.13419    6.25758   1.140  0.25474    
edu_dens               0.78064   10.62761   0.073  0.94147    
factor(birthyear)1971 -1.60601    0.53344  -3.011  0.00272 ** 
factor(birthyear)1972 -1.10192    0.58237  -1.892  0.05899 .  
factor(birthyear)1973 -0.21952    0.68051  -0.323  0.74713    
factor(birthyear)1974 -0.50528    0.81518  -0.620  0.53562    
factor(birthyear)1975 -0.49579    0.92400  -0.537  0.59178    
factor(birthyear)1976 -0.58728    1.07649  -0.546  0.58559    
factor(birthyear)1977 -0.59873    1.18378  -0.506  0.61321    
factor(birthyear)1978 -0.39291    1.33023  -0.295  0.76782    
factor(birthyear)1979 -0.19785    1.45787  -0.136  0.89210    
factor(birthyear)1980 -0.05931    1.

Oh dear, the instrument is (unsurprisingly) really weak after adding controls. So what's the takeaway? Our instrument is likely weak for two reasons: (1) our measure of "distance to college" is the same for everybody in the same state, so there is a lot of measurement error in this variable, which should weaken the first stage; (2) we use the number of colleges in each state in 2017, which may not accurately reflect the number of colleges in each individual's birth state when they were deciding on schooling, so as a proxy for distance to college, this variable is pretty bad.
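
When controls are present, a common way to summarize first-stage strength is the F-statistic for the excluded instrument, obtained by comparing the first stage with and without it. A widely used rule of thumb treats values below about 10 as a warning sign. Below is a minimal sketch on simulated data. The helper `first_stage_F`, the data frame `sim` and its variables are hypothetical and are not taken from the PSID example.

```r
# Hypothetical helper (not from the original notes): F-statistic for the
# excluded instrument(s), from comparing nested first-stage regressions.
first_stage_F = function(full, restricted, data) {
  fit.full = lm(full, data = data)
  fit.restricted = lm(restricted, data = data)
  anova(fit.restricted, fit.full)$F[2]
}

# Simulated data: z is the instrument, w is a control
set.seed(42)
n = 600
sim = data.frame(z = rnorm(n), w = rnorm(n))
sim$x.strong = 0.5*sim$z + 0.3*sim$w + rnorm(n)   # z matters a lot
sim$x.weak   = 0.02*sim$z + 0.3*sim$w + rnorm(n)  # z barely matters

F.strong = first_stage_F(x.strong ~ z + w, x.strong ~ w, sim)
F.weak   = first_stage_F(x.weak ~ z + w, x.weak ~ w, sim)
c(strong = F.strong, weak = F.weak)
```

With a single instrument this F-statistic is just the square of the first-stage t-statistic (under the homoskedastic standard errors that `lm` reports), so the t value of $0.74$ on `edu_dens` in the no-controls first stage corresponds to the F-statistic of about $0.55$ shown in that output.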