# Using Machine Learning to Control for Covariates in a Regression
This notebook illustrates machine learning methods for controlling for a large set of covariates in a regression that estimates the effect of elite college attendance on later-life earnings. There are two basic approaches. The first is "Post-Double Selection Lasso" (Belloni, Chernozhukov, Hansen). The second is "Double-Debiased Machine Learning" (Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, Robins).

In [None]:
# Import necessary libraries
!pip install mglearn
!git clone https://github.com/brighamfrandsen/econ484.git
%cd econ484/utilities
from preamble import *
%cd content/econ484/data

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/brighamfrandsen/econ484/blob/master/examples/controls.ipynb)

## Load useful packages:
pandas, numpy, linear_model (from sklearn), and KFold (from sklearn.model_selection)

Try it yourself first

### Cheat

In [None]:
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import KFold

## Read in data and have a look at the head and shape
college.csv is in the "datasets" subfolder

Try it yourself

### Cheat

In [None]:
collegedata=pd.read_csv('./data/college.csv')
print(collegedata.head())
print("Shape: {}".format(str(collegedata.shape)))

## Define outcome, regressor of interest, covariate matrix and sampling weights
y = lowninc

d = matsat2

X = everything except y, d, inst, and instwt

sampling weights = instwt

Try it yourself:

### Cheat

In [None]:
y=collegedata.loc[:,'lowninc']
d=collegedata.loc[:,['matsat2']]
X=collegedata.drop(['lowninc','matsat2','inst','instwt'],axis=1)
instwt=collegedata.loc[:,'instwt']

## Simple Regression with no Controls
Regress y on d and print out the coefficient.

Try it yourself

### Cheat

In [None]:
lm=linear_model.LinearRegression()
lm.fit(d,y,instwt)
print("Simple regression effect of selective college: {:.3f}".format(lm.coef_[0]))

## Post Double Selection Lasso

### Step 1: Lasso the outcome on X
Try it yourself

#### Cheat

In [None]:
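from sklearn.preprocessing import StandardScaler  # normalize=True was removed from Lasso in scikit-learn 1.2
lassoy = linear_model.Lasso(alpha=0.001*np.sqrt(len(X)), max_iter=1000).fit(StandardScaler().fit_transform(X), y)  # alpha*sqrt(n) approximates the old normalize=True penalty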

### Step 2: Lasso the treatment on X
Try it yourself

#### Cheat

In [None]:
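from sklearn.preprocessing import StandardScaler  # normalize=True was removed from Lasso in scikit-learn 1.2
lassod = linear_model.Lasso(alpha=0.001*np.sqrt(len(X)), max_iter=1000).fit(StandardScaler().fit_transform(X), d)  # alpha*sqrt(n) approximates the old normalize=True penalty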

### Step 3: Form the union of controls
Try it yourself

#### Cheat

In [None]:
# keep a control if either Lasso gave it a nonzero coefficient (| is the elementwise OR)
Xunion=X.iloc[:,(lassod.coef_!=0) | (lassoy.coef_!=0)]
Xunion.head()
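
The union in Step 3 is a boolean OR of the two nonzero-coefficient masks. A minimal sketch with made-up coefficient vectors (the names `coef_y`, `coef_d`, and `toy_X` are illustrative and not part of the notebook's data):

```python
import numpy as np
import pandas as pd

# Hypothetical Lasso coefficient vectors for four candidate controls
coef_y = np.array([0.5, 0.0, 0.0, -1.2])   # outcome Lasso keeps x0 and x3
coef_d = np.array([0.0, 0.0, 0.3, 0.7])    # treatment Lasso keeps x2 and x3

# A control is kept if either Lasso selects it
keep = (coef_y != 0) | (coef_d != 0)

# Toy covariate matrix with one column per candidate control
toy_X = pd.DataFrame(np.arange(12).reshape(3, 4), columns=["x0", "x1", "x2", "x3"])
toy_union = toy_X.loc[:, keep]
print(list(toy_union.columns))
```

Only `x1`, which neither Lasso selects, is dropped. Keeping controls that predict the treatment, even when they do not predict the outcome, is what distinguishes double selection from a single Lasso of y on X.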

### Concatenate treatment with union of controls and regress y on that and print out estimate
Try it yourself

#### Cheat

In [None]:
rhs=pd.concat([d,Xunion],axis=1)
fullreg=linear_model.LinearRegression().fit(rhs,y,instwt)
print("PDS regression effect of selective college: {:.3f}".format(fullreg.coef_[0]))

## Double-Debiased Machine Learning
For simplicity, we will first do it without sample splitting.
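
As a sketch of what the three steps below compute, the partially linear model and the residual-on-residual estimator can be written as follows. The notation is illustrative and not from the notebook: $g$, $m$, and $\ell(X) = E[y \mid X]$ are unknown functions of the controls, $\hat\ell$ and $\hat m$ are their fitted versions, and $w_i$ is the sampling weight.

$$
y_i = \theta d_i + g(X_i) + \varepsilon_i, \qquad d_i = m(X_i) + v_i
$$

$$
\tilde y_i = y_i - \hat\ell(X_i), \qquad \tilde d_i = d_i - \hat m(X_i), \qquad
\hat\theta = \frac{\sum_i w_i \, \tilde d_i \, \tilde y_i}{\sum_i w_i \, \tilde d_i^{\,2}}
$$

The formula omits an intercept. The `LinearRegression` used in Step 3 does fit one, which should matter little when both sets of residuals are close to mean zero.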

### Step 1: Ridge outcome on Xs, get residuals
Try it yourself

#### Cheat

In [None]:
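from sklearn.pipeline import make_pipeline  # normalize=True was removed from Ridge in scikit-learn 1.2
from sklearn.preprocessing import StandardScaler
ridgey = make_pipeline(StandardScaler(), linear_model.Ridge(alpha=0.001*len(X), max_iter=1000)).fit(X, y)  # alpha*n approximates the old normalize=True penalty
yresid=y-ridgey.predict(X)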

### Step 2: Ridge treatment on Xs, get residuals
Try it yourself

#### Cheat

In [None]:
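from sklearn.pipeline import make_pipeline  # normalize=True was removed from Ridge in scikit-learn 1.2
from sklearn.preprocessing import StandardScaler
ridged = make_pipeline(StandardScaler(), linear_model.Ridge(alpha=0.001*len(X), max_iter=1000)).fit(X, d)  # alpha*n approximates the old normalize=True penalty
dresid=d-ridged.predict(X)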

### Step 3: Regress y resids on d resids and print out estimate
Try it yourself

#### Cheat

In [None]:
ddmlreg=linear_model.LinearRegression().fit(dresid,yresid,instwt)
print("DDML regression effect of selective college: {:.3f}".format(ddmlreg.coef_[0]))

### The real thing: with sample splitting

In [None]:
np.zeros(5)

In [None]:
# create our sample splitting "object"
kf = KFold(n_splits=5,shuffle=True,random_state=42)

# number of folds (get_n_splits only reports the count; the splitting itself happens in kf.split below)
n_folds = kf.get_n_splits(X)

# initialize array to hold each fold's regression coefficient
coeffs=np.zeros(n_folds)

# Now loop through each fold
ii=0
for train_index, test_index in kf.split(X):
  X_train, X_test = X.iloc[train_index,:], X.iloc[test_index,:]
  y_train, y_test = y.iloc[train_index], y.iloc[test_index]
  d_train, d_test = d.iloc[train_index,:], d.iloc[test_index,:]
  wt_train, wt_test = instwt.iloc[train_index], instwt.iloc[test_index]
  # Do DDML thing
  # Ridge y on training folds:
  ridgey.fit(X_train, y_train)

  # but get residuals in test set
  yresid=y_test-ridgey.predict(X_test)

  #Ridge d on training folds
  ridged.fit(X_train, d_train)

  #but get residuals in test set
  dresid=d_test-ridged.predict(X_test)

  # regress resids on resids
  ddmlreg=linear_model.LinearRegression().fit(dresid,yresid,wt_test)

  # save coefficient in a vector
  coeffs[ii]=ddmlreg.coef_[0]
  ii+=1

# Take average
print("Double-Debiased Machine Learning effect of selective college: {:.3f}".format(np.mean(coeffs)))
coeffs
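
The loop above averages one slope per fold. A common alternative is to collect the out-of-fold residuals for every observation and run a single pooled regression. The sketch below shows that variant on synthetic stand-in data, not the college dataset. The names `X_sim`, `d_sim`, `y_sim`, the true effect of 2.0, and the Ridge penalty `alpha=1.0` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in data (assumption: not the college dataset); true effect is 2.0
rng = np.random.default_rng(0)
n, p = 500, 10
X_sim = rng.normal(size=(n, p))
d_sim = X_sim[:, 0] + rng.normal(size=n)
y_sim = 2.0 * d_sim + X_sim[:, 0] + X_sim[:, 1] + rng.normal(size=n)

kf_sim = KFold(n_splits=5, shuffle=True, random_state=42)

# Out-of-fold predictions: each observation is predicted by a model
# trained only on the folds that do not contain it
y_hat = cross_val_predict(Ridge(alpha=1.0), X_sim, y_sim, cv=kf_sim)
d_hat = cross_val_predict(Ridge(alpha=1.0), X_sim, d_sim, cv=kf_sim)

y_res = y_sim - y_hat
d_res = d_sim - d_hat

# One pooled residual-on-residual regression instead of averaging per-fold slopes
pooled = LinearRegression().fit(d_res.reshape(-1, 1), y_res)
theta_pooled = pooled.coef_[0]
print("Pooled cross-fit estimate: {:.3f}".format(theta_pooled))
```

With the notebook's data, sampling weights could be passed to the final `fit` call through `sample_weight`, as in the loop above.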

In [None]:
list(kf.split(X))

## Now do DDML using Random Forest!
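
One way this could look is sketched below, with the two Ridge fits replaced by `RandomForestRegressor` inside the same cross-fitting loop. It runs on synthetic stand-in data rather than the college dataset. The names `X_rf`, `d_rf`, `y_rf`, the true effect of 2.0, and the forest settings (`n_estimators=100`, `min_samples_leaf=5`) are illustrative assumptions, not choices taken from the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption: not the college dataset); true effect is 2.0
rng = np.random.default_rng(0)
n, p = 500, 5
X_rf = rng.normal(size=(n, p))
d_rf = np.sin(X_rf[:, 0]) + rng.normal(size=n)
y_rf = 2.0 * d_rf + X_rf[:, 0] ** 2 + rng.normal(size=n)

kf_rf = KFold(n_splits=5, shuffle=True, random_state=42)
rf_coeffs = np.zeros(kf_rf.get_n_splits())

for fold, (train_index, test_index) in enumerate(kf_rf.split(X_rf)):
    # Fit both forests on the training folds only
    rf_y = RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=0)
    rf_d = RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=0)
    rf_y.fit(X_rf[train_index], y_rf[train_index])
    rf_d.fit(X_rf[train_index], d_rf[train_index])

    # Residualize outcome and treatment on the held-out fold
    y_res = y_rf[test_index] - rf_y.predict(X_rf[test_index])
    d_res = d_rf[test_index] - rf_d.predict(X_rf[test_index])

    # Regress outcome residuals on treatment residuals and store the slope
    fold_reg = LinearRegression().fit(d_res.reshape(-1, 1), y_res)
    rf_coeffs[fold] = fold_reg.coef_[0]

theta_rf = rf_coeffs.mean()
print("Random forest DDML estimate: {:.3f}".format(theta_rf))
```

To apply the same loop to the notebook's data, `X`, `y`, and `d` would take the place of the synthetic arrays (indexed with `.iloc` as in the Ridge loop), and `instwt` could be passed as `sample_weight` in the final regression.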