
# Instrumental Variables Applications Example

### Summary of Contents:
1. [Introduction](#intro)
2. [NLSYM Dataset](#data)
3. [A Gentle Start: The Naive Approach](#naive)
4. [Using Instrumental Variables: 2SLS](#2sls)

# 1. Introduction <a class="anchor" id="intro"></a>

To measure true causal effects of a treatment $T$ on an outcome $Y$ from observational data, we need to record all features $X$ that might influence both $T$ and $Y$. These $X$'s are called confounders. 

When some confounders are not recorded in the data, we might get biased estimates of the treatment effect. Here is an example:
* Children of high-income parents might attain higher levels of education (e.g. college) since they can afford it
* Children of high-income parents might also obtain better paying jobs due to parents' connections and knowledge
* At first sight, it might appear as if education has an effect on income, when in fact this could be fully explained by family background

There are several reasons why confounders may go unrecorded, such as incomplete data or a confounder that is difficult to quantify (e.g. parental involvement). However, not all is lost! In such cases, we can use instrumental variables $Z$: features that influence the treatment but affect the outcome only through their effect on the treatment.
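To make the bias from an unrecorded confounder concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration: the variable `U` stands in for an unobserved confounder such as family background, the coefficients are arbitrary, and the true effect is fixed at 0.5. The instrument-based estimate uses the simple ratio of covariances $\mathrm{Cov}(Y, Z) / \mathrm{Cov}(T, Z)$, which applies to this toy setup with a single instrument and a constant effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_effect = 0.5  # assumed ground-truth effect of T on Y in this toy example

# U is an unobserved confounder: it drives both the treatment and the outcome.
U = rng.normal(size=n)
# Z is a binary instrument: it shifts T but has no direct path to Y.
Z = rng.binomial(1, 0.5, size=n).astype(float)

T = 1.0 * Z + 1.0 * U + rng.normal(size=n)
Y = true_effect * T + 2.0 * U + rng.normal(size=n)

# Naive estimate: slope of a regression of Y on T that ignores U.
naive_estimate = np.cov(Y, T)[0, 1] / np.var(T, ddof=1)

# Instrument-based estimate: ratio of covariances with Z.
iv_estimate = np.cov(Y, Z)[0, 1] / np.cov(T, Z)[0, 1]

print("naive:", naive_estimate)
print("IV:   ", iv_estimate)
```

In this setup the naive slope absorbs the effect of `U` and lands well above 0.5, while the instrument-based ratio stays close to it.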

In this notebook, we use a real-world problem to show how treatment effects can be extracted with the help of instrumental variables. 

# 2. NLSYM Dataset <a class="anchor" id="data"></a>

<img src="https://straubroland.files.wordpress.com/2010/12/education_technology-resized-600.png" width=400px/>

Describe the dataset briefly:

* who collected it
* method of collection 
* feature description 
* what we are interested in (average treatment effect as well as features of heterogeneity)

The world can then be modelled as:
$$
\begin{align}
Y & = \theta(X) \cdot T + f(W) + \epsilon\\
T & = g(Z) + \eta
\end{align}
$$
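where $Y$ is the outcome of interest, $T$ the treatment, $X$ the features of heterogeneity, $W$ the controls, $Z$ the instrument, and $\epsilon$, $\eta$ are noise terms.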
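As a worked special case of the model above (a simplification assumed here for illustration, not a claim about the dataset), suppose the effect is constant, $\theta(X) = \theta$, and the instrument $Z$ is uncorrelated with both $f(W)$ and $\epsilon$. Taking the covariance of the first equation with $Z$ gives:
$$
\begin{align}
\mathrm{Cov}(Y, Z) & = \theta \cdot \mathrm{Cov}(T, Z) + \mathrm{Cov}(f(W), Z) + \mathrm{Cov}(\epsilon, Z) = \theta \cdot \mathrm{Cov}(T, Z)\\
\theta & = \frac{\mathrm{Cov}(Y, Z)}{\mathrm{Cov}(T, Z)}
\end{align}
$$
The ratio is only defined when $\mathrm{Cov}(T, Z) \neq 0$, that is, when the instrument actually moves the treatment. When $Z$ is correlated with $W$, the same argument has to be made after adjusting for $W$.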

In [1]:
# Some imports
from econml.two_stage_least_squares import NonparametricTwoStageLeastSquares
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

In [53]:
# Data processing
df = pd.read_csv("data/card.csv", dtype=float)
# Filter outliers (interviewees with less than 6 years of education)
data_filter = df['educ'].values >= 6
T = df['educ'].values[data_filter]    # treatment: years of education
Z = df['nearc4'].values[data_filter]  # instrument: grew up near a 4-year college
Y = df['lwage'].values[data_filter]   # outcome: log wage

# Impute missing parental education with the column mean and add missing-value indicator columns
X_df = df[['exper', 'expersq']].copy()
X_df['fatheduc'] = df['fatheduc'].fillna(value=df['fatheduc'].mean())
X_df['fatheduc_nan'] = df['fatheduc'].isnull() * 1
X_df['motheduc'] = df['motheduc'].fillna(value=df['motheduc'].mean())
X_df['motheduc_nan'] = df['motheduc'].isnull() * 1
# Copy the remaining columns unchanged
other_columns = ['momdad14', 'sinmom14', 'reg661', 'reg662', 'reg663', 'reg664', 'reg665',
                 'reg666', 'reg667', 'reg668', 'reg669', 'south66',
                 'black', 'smsa', 'south', 'smsa66']
X_df[other_columns] = df[other_columns]
# Scale continuous variables
columns_to_scale = ['fatheduc', 'motheduc', 'exper', 'expersq']
scaler = StandardScaler()
X_df[columns_to_scale] = scaler.fit_transform(X_df[columns_to_scale])
X = X_df.values[data_filter]
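
The imputation step in the cell above can be seen in isolation on a toy column. This is a minimal sketch with made-up values (the `toy` frame is hypothetical and not part of the dataset): missing entries are replaced by the mean of the observed ones, and a separate 0/1 column records which rows were imputed so that a model can still distinguish them.

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen only to illustrate the pattern.
toy = pd.DataFrame({"fatheduc": [12.0, np.nan, 16.0, np.nan, 8.0]})

# Record which rows were missing before filling them in.
toy["fatheduc_nan"] = toy["fatheduc"].isnull() * 1
# Replace missing values with the mean of the observed values (NaNs are skipped by .mean()).
toy["fatheduc"] = toy["fatheduc"].fillna(value=toy["fatheduc"].mean())

print(toy)
```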

# 3. A Gentle Start: The Naive Approach <a class="anchor" id="naive"></a>

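Let's assume we know nothing about instrumental variables and we want to measure the treatment effect of schooling on wages. We can apply a method such as double machine learning (DML), which adjusts only for the recorded features $X$, and extract a treatment effect. If unrecorded confounders are present, this estimate can be biased.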
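
For intuition, the partialling-out idea behind DML can be sketched in a few lines. This is a rough illustration under stated assumptions, not the library's implementation: it uses synthetic data in which all confounders are observed, plain linear models in place of random forests, a hypothetical helper `out_of_fold_residuals`, and a constant true effect of 0.5. The outcome and the treatment are each predicted from the features on held-out folds, and the effect is estimated by regressing the outcome residuals on the treatment residuals.

```python
import numpy as np

def out_of_fold_residuals(features, target, n_folds=2, seed=0):
    """Residuals of `target` after out-of-fold linear prediction from `features`."""
    n = len(target)
    design = np.hstack([np.ones((n, 1)), features])
    fold_ids = np.random.default_rng(seed).permutation(n) % n_folds
    residuals = np.empty(n)
    for k in range(n_folds):
        held_out = fold_ids == k
        # Fit on the other folds, predict on the held-out fold.
        coef, *_ = np.linalg.lstsq(design[~held_out], target[~held_out], rcond=None)
        residuals[held_out] = target[held_out] - design[held_out] @ coef
    return residuals

rng = np.random.default_rng(0)
n = 5_000
true_effect = 0.5  # assumed ground truth for this toy example
features = rng.normal(size=(n, 3))
# The first feature confounds treatment and outcome, but it is observed.
treatment = features[:, 0] + rng.normal(size=n)
outcome = true_effect * treatment + features[:, 0] + features[:, 1] + rng.normal(size=n)

outcome_res = out_of_fold_residuals(features, outcome)
treatment_res = out_of_fold_residuals(features, treatment)

# Final stage: regress outcome residuals on treatment residuals.
estimate = (treatment_res @ outcome_res) / (treatment_res @ treatment_res)
print("residual-on-residual estimate:", estimate)
```

The estimate is close to the true effect here only because every confounder appears in `features`; a confounder missing from the data would remain in both sets of residuals.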

In [54]:
from econml.dml import DMLCateEstimator
from sklearn.ensemble import RandomForestRegressor

In [55]:
dml_est = DMLCateEstimator(
    model_y=RandomForestRegressor(n_estimators=100),
    model_t=RandomForestRegressor(n_estimators=100))
dml_est.fit(Y, T, X)

In [56]:
np.mean(dml_est.effect(X))

0.07118106872292643

In [57]:
dml_est._model_final.coef_

array([ 0.06450203, -0.02916734,  0.03304918,  0.00789729,  0.01770636,
       -0.00051149, -0.0255208 , -0.0143534 , -0.04458493,  0.0079625 ,
        0.02696753, -0.00425429, -0.01666799,  0.00619979,  0.00258502,
       -0.00454056,  0.02402749,  0.02222253,  0.00424424,  0.00192012,
        0.01053959,  0.02104726, -0.00477149])

In [58]:
X_df.columns

Index(['exper', 'expersq', 'fatheduc', 'fatheduc_nan', 'motheduc',
       'motheduc_nan', 'momdad14', 'sinmom14', 'reg661', 'reg662', 'reg663',
       'reg664', 'reg665', 'reg666', 'reg667', 'reg668', 'reg669', 'south66',
       'black', 'smsa', 'south', 'smsa66'],
      dtype='object')

# 4. Using Instrumental Variables: 2SLS <a class="anchor" id="2sls"></a>

In [59]:
from sklearn.preprocessing import PolynomialFeatures

In [60]:
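W = X                 # the features built earlier now serve as controls
Z = Z.reshape(-1, 1)  # reshape instrument and treatment to column vectors
T = T.reshape(-1, 1)
X = np.ones_like(Z)   # constant feature of heterogeneity, so a single effect is estimated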

In [61]:
two_sls_est = NonparametricTwoStageLeastSquares(
    t_featurizer=PolynomialFeatures(degree=1, include_bias=False),
    x_featurizer=PolynomialFeatures(degree=1, include_bias=False),
    z_featurizer=PolynomialFeatures(degree=1, include_bias=False),
    dt_featurizer=None) # dt_featurizer only matters for marginal_effect

In [62]:
two_sls_est.fit(Y, T, X, W, Z)

<econml.two_stage_least_squares.NonparametricTwoStageLeastSquares at 0x1ffab2b1ac8>

In [63]:
two_sls_est.effect(np.ones((1,1)))

array([0.13422248])
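
To show what the two stages do, here is a minimal sketch of textbook linear two-stage least squares written with NumPy. It is an illustration under stated assumptions, not the internals of `NonparametricTwoStageLeastSquares`: the function `two_stage_least_squares` and the synthetic data are hypothetical, the effect is constant, and `U` plays the role of an unobserved confounder. Stage 1 predicts the treatment from the instrument and the controls; stage 2 regresses the outcome on that prediction and the same controls.

```python
import numpy as np

def two_stage_least_squares(Y, T, Z, W):
    """Linear 2SLS for a single treatment; returns the coefficient on T."""
    n = len(Y)
    ones = np.ones((n, 1))
    # Stage 1: regress T on the instrument(s) Z and the controls W.
    stage1_features = np.hstack([ones, Z, W])
    stage1_coef, *_ = np.linalg.lstsq(stage1_features, T, rcond=None)
    T_hat = stage1_features @ stage1_coef
    # Stage 2: regress Y on the predicted treatment and the same controls.
    stage2_features = np.hstack([ones, T_hat.reshape(-1, 1), W])
    stage2_coef, *_ = np.linalg.lstsq(stage2_features, Y, rcond=None)
    return stage2_coef[1]

rng = np.random.default_rng(0)
n = 50_000
true_effect = 0.1  # assumed ground truth for this toy example
W_toy = rng.normal(size=(n, 2))                           # observed controls
U = rng.normal(size=n)                                    # unobserved confounder
Z_toy = rng.binomial(1, 0.5, size=(n, 1)).astype(float)   # binary instrument
T_toy = 2.0 * Z_toy[:, 0] + W_toy[:, 0] + U + rng.normal(size=n)
Y_toy = true_effect * T_toy + W_toy[:, 1] + U + rng.normal(size=n)

estimate = two_stage_least_squares(Y_toy, T_toy, Z_toy, W_toy)
print("2SLS estimate:", estimate)
```

Because the predicted treatment depends only on the instrument and the controls, the part of the treatment driven by `U` is left out of the second stage.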