# Instrumental Variables Regression

## Introduction

This analysis focused on the effect of education on income in the USA.

The data set was taken from Card (1995) and all subjects were male.

In [1]:
# loads packages

library(data.table)
library(AER)

Loading required package: car

Loading required package: carData

Loading required package: lmtest

Loading required package: zoo


Attaching package: ‘zoo’


The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric


Loading required package: sandwich

Loading required package: survival



In [2]:
# loads data set

dat_wage<-fread("Data/Data_Wage.csv")

## Models

### OLS

First, I looked into the following model:

$$
\log(w) = \beta_1 + \beta_2 \mathit{educ} + \beta_3 \mathit{exper} + \beta_4 \mathit{exper}^2 + \beta_5 \mathit{smsa} + \beta_6 \mathit{south} + \varepsilon
$$

where $\log(w)$ is log wage, $\mathit{educ}$ is education in years, $\mathit{exper}$ is working experience in years, $\mathit{smsa}$ is whether the subject lived in a metropolitan area (1 vs 0), and $\mathit{south}$ is whether the subject lived in the south of the USA (1 vs 0).

Results below show that all predictors are significant. The estimate of $\beta_2$ is 0.082, which implies that each additional year of education is associated with a wage that is roughly 8.5% higher, holding the other predictors constant.
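The step from a log-wage coefficient to a percentage change can be checked with a few lines of R. This is only a sketch of the arithmetic: `b_educ` is a name introduced here for illustration, and its value is copied from the `educ` row of the `summary(lm_wage)` output.

```r
# Convert a log-wage coefficient into a percentage change in wage.
b_educ <- 0.0815797                     # educ estimate from summary(lm_wage)

pct_exact  <- 100 * (exp(b_educ) - 1)   # exact change implied by a log-linear model
pct_approx <- 100 * b_educ              # first-order approximation, fine for small coefficients

round(c(exact = pct_exact, approx = pct_approx), 2)
```

The exact conversion gives about 8.5%, while reading the coefficient directly as a percentage gives about 8.2%; the two diverge further as coefficients grow.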

In [3]:
lm_wage<-lm(logw~educ+exper+I(exper^2)+smsa+south, data=dat_wage)

In [4]:
summary(lm_wage)


Call:
lm(formula = logw ~ educ + exper + I(exper^2) + smsa + south, 
    data = dat_wage)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.71487 -0.22987  0.02268  0.24898  1.38552 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.6110144  0.0678950  67.914  < 2e-16 ***
educ         0.0815797  0.0034990  23.315  < 2e-16 ***
exper        0.0838357  0.0067735  12.377  < 2e-16 ***
I(exper^2)  -0.0022021  0.0003238  -6.800 1.26e-11 ***
smsa         0.1508006  0.0158360   9.523  < 2e-16 ***
south       -0.1751761  0.0146486 -11.959  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3813 on 3004 degrees of freedom
Multiple R-squared:  0.2632,	Adjusted R-squared:  0.2619 
F-statistic: 214.6 on 5 and 3004 DF,  p-value: < 2.2e-16


### Endogeneity

However, it is possible that $\mathit{educ}$ and $\mathit{exper}$ are endogenous, with omitted variables being the likely cause. For instance, the number of years an individual stays in school might depend on the social class and wealth of the family, which also affect how early the individual enters the job market. Further, both variables are likely correlated with cognitive and possibly physical abilities, which are not in the model.

That means the OLS estimate may be biased and inconsistent, so an alternative model is more appropriate.
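The omitted-variable argument can be illustrated with a small simulation. The sketch below uses made-up data, not the Card (1995) sample: `ability` is a hypothetical unobserved confounder, and all coefficients (including the true return of 0.08) are assumptions chosen for illustration.

```r
# Simulated illustration of omitted-variable bias (all values are hypothetical).
set.seed(1)
n <- 5000
ability <- rnorm(n)                                   # confounder, unobserved in practice
educ    <- 12 + 2 * ability + rnorm(n)                # education depends on ability
logw    <- 1 + 0.08 * educ + 0.3 * ability + rnorm(n, sd = 0.3)

b_short <- unname(coef(lm(logw ~ educ))["educ"])            # ability omitted
b_long  <- unname(coef(lm(logw ~ educ + ability))["educ"])  # ability included

c(short = b_short, long = b_long)
```

When `ability` is left out, the coefficient on `educ` absorbs part of its effect and overstates the true return of 0.08; including it recovers a value close to 0.08. In real data the confounder is not observed, which is what motivates the instrumental variables approach.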

### Instrumental Variables

Here I propose to use age and a few other variables as instrumental variables for education and working experience, with age as an instrument for working experience being the focus here.

Age is clearly correlated with working experience for adults. Correlations are shown below.

Age is plausibly exogenous because subjects cannot change it, so it should not be correlated with the error term. In other words, age should affect wage only through working experience: growing older does not by itself raise income, but the working experience accumulated with age does.
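The two requirements argued for above can be written compactly. As a sketch in generic notation (the symbols $z$ for an instrument such as age and $x$ for an endogenous regressor such as working experience are introduced here, with $\varepsilon$ the error term of the wage equation):

$$
\mathrm{Cov}(z, x) \neq 0 \quad \text{(relevance)}, \qquad \mathrm{Cov}(z, \varepsilon) = 0 \quad \text{(exogeneity)}
$$

Relevance can be checked in the data through correlations and a first-stage regression. Exogeneity involves the unobserved error term, so it has to be argued on substantive grounds and can only be tested indirectly.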

In [5]:
dat_wage[, ":="(age2=age^2, exper2=exper^2)]

cor(dat_wage[, .(age, age2, exper, exper2)])

| | age | age2 | exper | exper2 |
|---|---|---|---|---|
| age | 1.0 | 0.9988219 | 0.7630736 | 0.7380634 |
| age2 | 0.9988219 | 1.0 | 0.7640276 | 0.7437181 |
| exper | 0.7630736 | 0.7640276 | 1.0 | 0.9672025 |
| exper2 | 0.7380634 | 0.7437181 | 0.9672025 | 1.0 |


I ran the following model as a first-stage regression:

$$
\mathit{educ} = \beta_1 + \beta_2 \mathit{age} + \beta_3 \mathit{age}^2 + \beta_4 \mathit{nearc} + \beta_5 \mathit{daded} + \beta_6 \mathit{momed} + \varepsilon
$$

where $\mathit{nearc}$ is whether the subject lived near a university (1 vs 0), $\mathit{daded}$ is the education of the subject's father in years, and $\mathit{momed}$ is the education of the subject's mother in years.

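Results show that all instruments are significant predictors of education, which supports their relevance and makes them appropriate candidates as instrumental variables. Significance in the first stage does not by itself establish that they are exogenous.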
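Beyond individual t-tests, instrument relevance is usually judged by a joint F-test on the instruments in the first stage. The sketch below shows one way to compute it in base R on simulated data; the variables `z1`, `z2`, `w`, and `x` and all coefficients are hypothetical and are not taken from the wage data.

```r
# Joint first-stage F-test for instrument relevance (simulated, hypothetical data).
set.seed(2)
n  <- 2000
z1 <- rnorm(n)                                       # hypothetical continuous instrument
z2 <- rbinom(n, 1, 0.5)                              # hypothetical binary instrument
w  <- rnorm(n)                                       # exogenous control
x  <- 1 + 0.5 * z1 + 0.4 * z2 + 0.3 * w + rnorm(n)   # endogenous regressor

fs_full       <- lm(x ~ z1 + z2 + w)   # first stage including the instruments
fs_restricted <- lm(x ~ w)             # same model with the instruments dropped

f_test <- anova(fs_restricted, fs_full)  # F-test of the two nested models
f_stat <- f_test$F[2]                    # joint F statistic for z1 and z2
f_stat
```

A commonly cited rule of thumb treats a first-stage F statistic above 10 as evidence against weak instruments, although this threshold is a heuristic rather than a formal test.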

In [6]:
lm_first_stage<-lm(educ~age+I(age^2)+nearc+daded+momed, data=dat_wage)

In [7]:
summary(lm_first_stage)


Call:
lm(formula = educ ~ age + I(age^2) + nearc + daded + momed, data = dat_wage)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.4573  -1.4968  -0.2734   1.6843   7.5636 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5.923273   4.010502  -1.477 0.139796    
age          0.992550   0.281060   3.531 0.000419 ***
I(age^2)    -0.017075   0.004878  -3.500 0.000472 ***
nearc        0.528751   0.092698   5.704 1.28e-08 ***
daded        0.202048   0.015665  12.898  < 2e-16 ***
momed        0.248379   0.017036  14.580  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.346 on 3004 degrees of freedom
Multiple R-squared:  0.233,	Adjusted R-squared:  0.2317 
F-statistic: 182.5 on 5 and 3004 DF,  p-value: < 2.2e-16


### IV Regression

Diagnostics show that the instruments passed the weak instrument tests, which is consistent with the first-stage regression above. The Sargan test is non-significant, which suggests that the instruments are valid and uncorrelated with the error term, while the Wu-Hausman test shows that $\mathit{educ}$ and $\mathit{exper}$ are indeed endogenous in the OLS regression.

In short, the specification of the IV regression is appropriate.

Results of the IV regression show that education and working experience both predict wage. The IV estimate of the coefficient on education (0.100) is larger than the OLS estimate (0.082). The model also shows that $\mathit{exper}^2$ is only *marginally* significant ($p = 0.0505$), that is, the effect of working experience on wage is arguably linear.
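The two-stage logic behind an IV estimate can be sketched in base R on simulated data. This is an illustration under assumed values, not the computation performed on the wage data: `z`, `u`, `x`, and `y` are hypothetical, and the true coefficient on `x` is set to 0.1.

```r
# Manual two-stage least squares on simulated data (all values are hypothetical).
set.seed(3)
n <- 5000
z <- rnorm(n)                                     # instrument: affects x, not y directly
u <- rnorm(n)                                     # unobserved confounder
x <- 0.8 * z + u + rnorm(n)                       # endogenous regressor
y <- 1 + 0.1 * x + 0.5 * u + rnorm(n, sd = 0.3)   # outcome, true effect of x is 0.1

b_ols <- unname(coef(lm(y ~ x))["x"])             # biased by the confounder

x_hat  <- fitted(lm(x ~ z))                       # stage 1: part of x explained by z
b_2sls <- unname(coef(lm(y ~ x_hat))["x_hat"])    # stage 2: regress y on fitted x

c(ols = b_ols, tsls = b_2sls)
```

The two-stage point estimate lands close to 0.1, while OLS is pulled upward by the confounder. The standard errors reported by the second `lm()` call are not valid, because they ignore that `x_hat` is estimated; a dedicated routine such as `ivreg()` from the AER package computes them correctly.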

In [8]:
iv_wage<-ivreg(logw~educ+exper+I(exper^2)+smsa+south | age+I(age^2)+nearc+daded+momed+smsa+south, data=dat_wage)

In [9]:
summary(iv_wage, diagnostics=TRUE, df=0)


Call:
ivreg(formula = logw ~ educ + exper + I(exper^2) + smsa + south | 
    age + I(age^2) + nearc + daded + momed + smsa + south, data = dat_wage)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7494 -0.2360  0.0266  0.2498  1.3468 

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  4.4169039  0.1154208  38.268  < 2e-16 ***
educ         0.0998429  0.0065738  15.188  < 2e-16 ***
exper        0.0728669  0.0167134   4.360 1.30e-05 ***
I(exper^2)  -0.0016393  0.0008381  -1.956   0.0505 .  
smsa         0.1349370  0.0167695   8.047 8.52e-16 ***
south       -0.1589869  0.0156854 -10.136  < 2e-16 ***

Diagnostic tests:
                               df1  df2 statistic p-value    
Weak instruments (educ)          5 3002   145.511 < 2e-16 ***
Weak instruments (exper)         5 3002  1257.258 < 2e-16 ***
Weak instruments (I(exper^2))    5 3002  1098.430 < 2e-16 ***
Wu-Hausman                       2 3002     5.709 0.00335 ** 
Sargan                     