# Lab 3 - Instrumental variable
- **Author:** Dimitris Papadimitriou ([dimitri@berkeley.edu](mailto:dimitri@berkeley.edu))
- **Date:** 12 February 2020
- **Course:** INFO 251: Applied machine learning

### Learning Objectives:
By the end of the lab, you will be able to:

* work with a real dataset to carry out an instrumental variable analysis

### Topics:
1. Instrumental variable

### References: 
 * [Instrumental Variables Estimation](https://github.com/natematias/research_in_python/blob/master/instrumental_variables_estimation/Instrumental-Variables%20Estimation.ipynb) by Nathan Matias
 * [What is education's impact on civic and social engagement?](http://www.oecd.org/edu/innovation-education/37425694.pdf) by David Campbell
 * Econometric Analysis, W. Greene, chapter 5.4

### Refresh (from slides)

Instrumental Variables ("two-stage least squares") is a two-step procedure for estimating X's effect on Y.

* You want to estimate: $Y_i = \alpha + \beta X_i + u_i$
* Suppose that you have a valid instrument $Z_i$
* Stage 1: $X_i = b_0 + b_1 Z_i + \nu_i$
    * Obtain predicted values $\hat{X}_i$
* Stage 2: $Y_i = \alpha + \beta \hat{X}_i + u_i$

Recall that for an instrumental variable Z to be valid it must satisfy two conditions: 
* Instrument relevance: $\mathrm{corr}(Z_i, X_i) \ne 0$
* Instrument exogeneity: $\mathrm{corr}(Z_i, u_i) = 0$
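
To make the two stages concrete, here is a minimal sketch on simulated data rather than the lab's dataset. The data-generating process, the helper names `ols_fit` and `two_stage_least_squares`, and all coefficient values are assumptions chosen for illustration. Only the standard library is used so that each step of the arithmetic stays visible.

```python
import random


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def ols_fit(x, y):
    """Least-squares fit of y on a single regressor x; returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    var_x = sum((xi - x_bar) ** 2 for xi in x)
    slope = cov_xy / var_x
    return y_bar - slope * x_bar, slope


def two_stage_least_squares(z, x, y):
    """2SLS with one instrument z, one endogenous regressor x, and outcome y."""
    b0, b1 = ols_fit(z, x)                  # Stage 1: regress X on Z
    x_hat = [b0 + b1 * zi for zi in z]      # predicted values of X
    return ols_fit(x_hat, y)                # Stage 2: regress Y on the predictions


# Simulated data (assumed for illustration, not the lab's dataset).
rng = random.Random(0)
n = 20_000
true_alpha, true_beta = 1.0, 2.0

z = [rng.gauss(0, 1) for _ in range(n)]            # instrument, independent of u
confounder = [rng.gauss(0, 1) for _ in range(n)]   # unobserved, ends up inside u
x = [0.8 * zi + ci + rng.gauss(0, 1) for zi, ci in zip(z, confounder)]
y = [true_alpha + true_beta * xi + 3.0 * ci + rng.gauss(0, 1)
     for xi, ci in zip(x, confounder)]

_, beta_ols = ols_fit(x, y)                        # biased: X is correlated with u
alpha_iv, beta_iv = two_stage_least_squares(z, x, y)

print(f"true beta  = {true_beta}")
print(f"OLS slope  = {beta_ols:.3f}")
print(f"2SLS slope = {beta_iv:.3f}")
```

In this setup the confounder pushes the OLS slope away from the true value, while the 2SLS slope should land close to it. Note that standard errors taken from a manually run second stage are not valid, because they ignore the fact that $\hat{X}_i$ was itself estimated.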


## Example: Does college attainment lead to more income?

In [1]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np
import seaborn as sns



In [2]:
college_distance = pd.read_csv('CollegeDistance.csv')
college_distance.head()

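| | id | gender | ethnicity | score | fcollege | mcollege | home | urban | unemp | wage | distance | tuition | education | income | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | male | other | 39.150002 | yes | no | yes | yes | 6.2 | 8.09 | 0.2 | 0.88915 | 12 | high | other |
| 1 | 2 | female | other | 48.869999 | no | no | yes | yes | 6.2 | 8.09 | 0.2 | 0.88915 | 12 | low | other |
| 2 | 3 | male | other | 48.740002 | no | no | yes | yes | 6.2 | 8.09 | 0.2 | 0.88915 | 12 | low | other |
| 3 | 4 | male | afam | 40.400002 | no | no | yes | yes | 6.2 | 8.09 | 0.2 | 0.88915 | 12 | low | other |
| 4 | 5 | female | other | 40.48 | no | no | no | yes | 5.6 | 8.09 | 0.4 | 0.88915 | 13 | low | other |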


In [3]:
college_distance.shape

(4739, 15)

In [4]:
college_distance[['wage', 'education', 'distance']].describe()

| | wage | education | distance |
|---|---|---|---|
| count | 4739.0 | 4739.0 | 4739.0 |
| mean | 9.500506 | 13.807765 | 1.80287 |
| std | 1.343067 | 1.789107 | 2.297128 |
| min | 6.59 | 12.0 | 0.0 |
| 25% | 8.85 | 12.0 | 0.4 |
| 50% | 9.68 | 13.0 | 1.0 |
| 75% | 10.15 | 16.0 | 2.5 |
| max | 12.96 | 18.0 | 20.0 |


In [5]:
college_distance[['wage', 'education', 'distance']].corr()

| | wage | education | distance |
|---|---|---|---|
| wage | 1.0 | 0.023858 | -0.00039 |
| education | 0.023858 | 1.0 | -0.093183 |
| distance | -0.00039 | -0.093183 | 1.0 |
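
The correlation between `distance` and `education` reported above speaks to instrument relevance. As a rough sketch, the first-stage slope and F-statistic of a single-instrument regression can be recovered from the printed summary statistics alone, using the identities slope $= r \cdot s_X / s_Z$ and $F = r^2 (n-2) / (1-r^2)$. The function and variable names here are illustrative. The threshold of 10 is a common rule of thumb for weak instruments, not something stated in this lab.

```python
# Values copied from the shape, describe() and corr() outputs above.
n_obs = 4739
std_education = 1.789107
std_distance = 2.297128
corr_distance_education = -0.093183
corr_wage_education = 0.023858


def slope_from_corr(r, std_outcome, std_regressor):
    """Slope of a one-regressor OLS fit: r * s_outcome / s_regressor."""
    return r * std_outcome / std_regressor


def f_stat_from_corr(r, n):
    """F-statistic of a one-regressor OLS fit: r^2 (n - 2) / (1 - r^2)."""
    return r ** 2 * (n - 2) / (1 - r ** 2)


# First stage: education ~ distance
first_stage_slope = slope_from_corr(
    corr_distance_education, std_education, std_distance
)
first_stage_f = f_stat_from_corr(corr_distance_education, n_obs)

# Same identity applied to wage ~ education. This should agree with the
# F-statistic in the OLS summary for that regression, up to rounding.
ols_f = f_stat_from_corr(corr_wage_education, n_obs)

print(f"first-stage slope (education on distance): {first_stage_slope:.4f}")
print(f"first-stage F-statistic: {first_stage_f:.2f}")
print(f"F-statistic for wage ~ education: {ols_f:.3f}")
```

A first-stage F-statistic well above 10 would usually be read as evidence that `distance` is a relevant instrument for `education`. Exogeneity cannot be checked this way with a single instrument and has to be argued from context.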


In [6]:
result = smf.ols(formula="wage ~ education", data=college_distance).fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:                   wage   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     2.698
Date:                Wed, 12 Sep 2018   Prob (F-statistic):              0.101
Time:                        11:19:36   Log-Likelihood:                -8120.3
No. Observations:                4739   AIC:                         1.624e+04
Df Residuals:                    4737   BIC:                         1.626e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      9.2532      0.152     60.949      0.0

### Exercise: implement instrumental variable estimation