# Extend FWL for Machine Learning

So where does Machine Learning (ML) come into the picture?

There are three main cases where we might want to use other ML algorithms instead of OLS:
1. When significant non-linearities are present
2. When the number of confounders you're trying to control for is too large relative to the sample size (more confounders than observations)
3. When you're trying to estimate Heterogeneous Treatment Effects (HTE)

You might be tempted to ask: why should we even use FWL? Can't we just run the ML algorithm as is and treat this as a prediction task?

The answer is a resounding **NO**. Inference and prediction are two completely different tasks. Prediction cares about the accuracy of the predicted $y$, whereas inference cares about the accuracy of the estimated coefficient $\beta$.
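
To make the distinction concrete, here is a minimal sketch on simulated data. The variable names, the penalty `alpha=100.0`, and the data-generating process are illustrative assumptions, not part of the original analysis. It fits OLS and a heavily penalized Ridge to the same data, where the regressor of interest `d` is highly correlated with a control `x`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Simulated data (assumption): d is highly correlated with x,
# and the true coefficient on d is 2 while x has no direct effect.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
d = x + 0.3 * rng.normal(size=n)
y = 2.0 * d + 0.5 * rng.normal(size=n)
features = np.column_stack([d, x])

ols = LinearRegression().fit(features, y)
ridge = Ridge(alpha=100.0).fit(features, y)

# In-sample fit (prediction quality) of the two models
r2_ols = ols.score(features, y)
r2_ridge = ridge.score(features, y)

# Estimated coefficient on d (inference quality)
beta_ols = ols.coef_[0]
beta_ridge = ridge.coef_[0]

print(f"R^2   OLS: {r2_ols:.3f}  Ridge: {r2_ridge:.3f}")
print(f"beta  OLS: {beta_ols:.3f}  Ridge: {beta_ridge:.3f}")
```

In this setup the two models should predict $y$ about equally well, yet the penalty spreads the effect across the correlated columns and pulls the Ridge coefficient on `d` well below the true value of 2. Good prediction does not imply a good estimate of $\beta$.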

So how can we use ML for inference?

Let's go back to the FWL decomposition:
*   regress $x_1$ on $x_2$ to $x_n$ and retrieve the residuals $\tilde x_1$
*   regress $y$ on $x_2$ to $x_n$ and retrieve the residuals $\tilde y$
*   regress $\tilde y$ on $\tilde x_1$ and retrieve the coefficient on $\tilde x_1$
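
The three steps above can be checked numerically. The following is a minimal sketch using plain NumPy least squares on simulated data. The data-generating process and the helper `ols` are assumptions made for illustration only.

```python
import numpy as np

# Simulated data (assumption): the true coefficient on x1 is 2.
rng = np.random.default_rng(0)
n = 500
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = 0.5 * x2 - 0.3 * x3 + rng.normal(size=n)
y = 2.0 * x1 + 1.0 * x2 - 1.0 * x3 + rng.normal(size=n)


def ols(X, y):
    """Return the least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


# Controls, including a constant column
controls = np.column_stack([np.ones(n), x2, x3])

# Full regression: coefficient on x1 when all regressors are included
beta_full = ols(np.column_stack([x1, controls]), y)[0]

# Step 1: residualize x1 on the controls
x1_tilde = x1 - controls @ ols(controls, x1)
# Step 2: residualize y on the controls
y_tilde = y - controls @ ols(controls, y)
# Step 3: regress the y residuals on the x1 residuals
beta_fwl = ols(x1_tilde.reshape(-1, 1), y_tilde)[0]

print(f"full regression: {beta_full:.6f}")
print(f"FWL:             {beta_fwl:.6f}")
```

With OLS in every step, the two coefficients agree up to floating-point error, which is exactly what the FWL theorem states.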

The first two steps are actually prediction tasks, so it turns out we can use any ML algorithm for them. There are two things we need to watch out for, though.

1. **Overfitting**. Many ML algorithms tend to overfit the data. When we overfit, the fitted values absorb noise that is actually unrelated to the regressors, so too little variation is left in the residuals. In the worst case, overfitting drives the residuals to zero.
2. **Regularization bias**. Many ML algorithms perform variable selection or shrinkage implicitly through penalization, which biases the resulting estimates.
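
The overfitting problem can be illustrated with a small sketch. It assumes simulated data and a fully grown decision tree as the (deliberately overfit) learner. Neither comes from the original analysis. It compares residuals computed on the training data with residuals computed on held-out folds.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# Simulated data (assumption): treatment = one confounder + independent noise
rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 3))
d = X[:, 0] + rng.normal(size=n)

# A fully grown tree memorizes the training data
tree = DecisionTreeRegressor(random_state=0)

# In-sample residuals: fit and predict on the same observations
resid_in = d - tree.fit(X, d).predict(X)

# Out-of-fold residuals: each prediction comes from a model
# that never saw that observation during fitting
cv = KFold(n_splits=5, shuffle=True, random_state=0)
resid_out = d - cross_val_predict(tree, X, d, cv=cv)

print("max |in-sample residual|:", np.abs(resid_in).max())
print("std of out-of-fold residuals:", resid_out.std())
```

The in-sample residuals collapse to zero, leaving nothing to regress on in the final FWL step, while the out-of-fold residuals keep their variation. This is the motivation for fitting on one part of the sample and computing residuals on another.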



In [3]:
import numpy as np
import pandas as pd

from sklearn.model_selection import KFold

In [4]:
def generate_data(seed=1, N=300, K=5):

    np.random.seed(seed)

    # visit history
    visitHistory = np.round(np.random.normal(10,5,N),0)
    visitHistory[visitHistory<0] = 0

    # trx history
    trxHistory = np.round(.2*visitHistory * np.random.normal(5,2,N),0)
    trxHistory[trxHistory<0] = 0

    # SkuAvailability in terms of oos pct
    skuAvailability = np.random.normal(0.7,0.1,N)

    # number of sales visit
    salesVisit = np.round(visitHistory * 0.2 + trxHistory * 0.1 +
                          np.random.normal(1,1, N), 0)
    salesVisit[salesVisit<0] = 0

    # Sales
    sales = np.round(5000 * salesVisit + 20000 * visitHistory +
                     10000 * trxHistory + 200000 * skuAvailability +
                     np.random.normal(20000,5000,N),0)

    # Generate the dataframe
    df = pd.DataFrame({'visitHistory': visitHistory,
                       'trxHistory': trxHistory,
                       'skuAvailability': skuAvailability,
                       'salesVisit': salesVisit,
                       'sales': sales})

    return df
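
As a quick sanity check on data shaped like the output of `generate_data`, the sketch below compares a naive regression of the outcome on the treatment with one that also controls for the remaining columns. The helper name `naive_vs_adjusted` and the choice of `salesVisit` as treatment and `sales` as outcome are assumptions made for illustration.

```python
from sklearn.linear_model import LinearRegression


def naive_vs_adjusted(df, treatment="salesVisit", outcome="sales"):
    """Return (naive, adjusted) OLS coefficients on the treatment column.

    naive:    outcome regressed on the treatment only
    adjusted: outcome regressed on the treatment plus all other columns
    """
    controls = [c for c in df.columns if c not in (treatment, outcome)]

    naive = LinearRegression().fit(df[[treatment]], df[outcome]).coef_[0]
    adjusted = (
        LinearRegression()
        .fit(df[[treatment] + controls], df[outcome])
        .coef_[0]
    )
    return naive, adjusted
```

Usage would look like `naive, adjusted = naive_vs_adjusted(generate_data())`. The simulation sets the coefficient on `salesVisit` to 5000, and `visitHistory` and `trxHistory` drive both `salesVisit` and `sales`, so the naive coefficient should be confounded while the adjusted one should land near 5000.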

In [None]:
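# extra imports needed by this cell
from sklearn.linear_model import LinearRegression, Ridge

# Assumption: salesVisit is the treatment, sales is the outcome, and the
# remaining columns of generate_data() are the confounders.
df = generate_data()
X = df[['visitHistory', 'trxHistory', 'skuAvailability']]
y = df['sales']
d = df[['salesVisit']]

# nuisance models (Ridge with its default penalty, not tuned)
ridgey = Ridge()
ridged = Ridge()

# create our sample splitting "object"
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# initialize containers for the out-of-fold residuals
yresid = y * 0.0
dresid = d * 0.0

# Now loop through each fold
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    d_train, d_test = d.iloc[train_index, :], d.iloc[test_index, :]

    # Ridge y on the training folds ...
    ridgey.fit(X_train, y_train)
    # ... but get residuals on the held-out fold, so that overfitting
    # cannot shrink them towards zero
    yresid.iloc[test_index] = (y_test - ridgey.predict(X_test)).to_numpy()

    # Ridge d on the training folds ...
    ridged.fit(X_train, d_train)
    # ... but get residuals on the held-out fold
    dresid.iloc[test_index, :] = (d_test - ridged.predict(X_test)).to_numpy()

# Regress the outcome residuals on the treatment residuals
dmlreg = LinearRegression().fit(dresid, yresid)

print("DML estimate of the salesVisit effect on sales: {:.3f}".format(dmlreg.coef_[0]))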
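
The same cross-fitting recipe works with any learner. The sketch below is one possible compact version, not the original implementation. The function name `cross_fit_dml`, the random forest settings, and the simulated data are all assumptions. It uses `cross_val_predict` to obtain out-of-fold predictions and adds the usual heteroskedasticity-robust standard error for the final residual-on-residual regression.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict


def cross_fit_dml(X, y, d, model_y, model_d, n_splits=5, seed=42):
    """Cross-fitted partialling-out estimate of the coefficient on d.

    X, y, d are NumPy arrays (y and d one-dimensional).
    Returns (theta, standard_error).
    """
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)

    # Out-of-fold residuals for the outcome and the treatment
    y_res = y - cross_val_predict(model_y, X, y, cv=cv)
    d_res = d - cross_val_predict(model_d, X, d, cv=cv)

    # Final step: no-intercept regression of y residuals on d residuals
    theta = np.sum(d_res * y_res) / np.sum(d_res ** 2)

    # Heteroskedasticity-robust (sandwich) standard error
    eps = y_res - theta * d_res
    var = np.mean(d_res ** 2 * eps ** 2) / np.mean(d_res ** 2) ** 2 / len(y)
    return theta, np.sqrt(var)


# Simulated data (assumption): non-linear confounding, true effect of d is 2
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))
g = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2
d = g + rng.normal(size=n)
y = 2.0 * d + g + rng.normal(size=n)

# cross_val_predict clones the estimator, so one template can be reused
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=0)
theta, se = cross_fit_dml(X, y, d, rf, rf)

print(f"DML estimate: {theta:.3f} (se {se:.3f})")
```

Swapping `rf` for another scikit-learn regressor only changes the two nuisance models. The sample splitting and the final regression stay the same.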