# Extend FWL for Machine Learning

So where does Machine Learning (ML) come into the picture?

There are three main cases where we might want to use other ML algorithms instead of OLS:
1. When there are significant non-linearities present
2. When the number of confounders you're trying to control for is too large, with more confounders than observations (m > n)
3. When you're trying to estimate Heterogeneous Treatment Effects (HTE)

You might be tempted to ask: why should we even use FWL? Can't we just run the ML algorithm as is and treat this as a prediction task?

The answer is a resounding **NO**. Inference and prediction are two completely different tasks. Prediction cares about the accuracy of the predicted $y$, whereas inference cares about the accuracy of the coefficient $\beta$.
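To make the distinction concrete, here is a minimal sketch on simulated data (the data-generating process, the variable names, and the choice of `Lasso` with `alpha=0.3` are all assumptions made for illustration). A penalized model can fit $y$ almost as well as OLS while its coefficient on $x_1$ sits noticeably further from the true value.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data: x1 and x2 are strongly correlated,
# and both have a true coefficient of 1
x2 = rng.normal(size=n)
x1 = 0.9 * x2 + 0.3 * rng.normal(size=n)
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)
features = np.column_stack([x1, x2])

ols = LinearRegression().fit(features, y)
lasso = Lasso(alpha=0.3).fit(features, y)

# Predictive fit (in-sample R^2) of the two models
r2_ols, r2_lasso = ols.score(features, y), lasso.score(features, y)
# Estimated coefficient on x1 (true value is 1)
beta_ols, beta_lasso = ols.coef_[0], lasso.coef_[0]

print(f"R^2:  OLS={r2_ols:.3f}  Lasso={r2_lasso:.3f}")
print(f"beta: OLS={beta_ols:.3f}  Lasso={beta_lasso:.3f}")
```

Judged only by $R^2$ the two models look nearly interchangeable, yet they give different answers to the question "what is the effect of $x_1$?".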

So how can we use ML for inference?

Let's go back to the FWL decomposition:
*   regress $x_1$ on $x_2$ to $x_n$ and retrieve the residuals $\tilde x_1$
*   regress $y$ on $x_2$ to $x_n$ and retrieve the residuals $\tilde y$
*   regress $\tilde y$ on $\tilde x_1$ and retrieve the coefficient on $\tilde x_1$ <br>
<br>
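As a quick check of the three steps above with plain OLS, the sketch below compares the coefficient from the full regression with the one from the residual-on-residual regression. The simulated data and the helper `ols` are hypothetical and only serve the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical simulated data: x1 is correlated with the controls x2 and x3
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = 0.5 * x2 - 0.3 * x3 + rng.normal(size=n)
y = 2.0 * x1 + 1.0 * x2 + 3.0 * x3 + rng.normal(size=n)

def ols(design, target):
    """Least-squares coefficients of target on design (constant already included)."""
    return np.linalg.lstsq(design, target, rcond=None)[0]

const = np.ones(n)
controls = np.column_stack([const, x2, x3])

# Full regression: y on a constant, x1, x2 and x3; keep the coefficient on x1
beta_full = ols(np.column_stack([const, x1, x2, x3]), y)[1]

# FWL: residualize x1 and y on the controls, then regress residual on residual
x1_tilde = x1 - controls @ ols(controls, x1)
y_tilde = y - controls @ ols(controls, y)
beta_fwl = ols(x1_tilde.reshape(-1, 1), y_tilde)[0]

print(beta_full, beta_fwl)
```

The two estimates agree up to floating-point error, which is what allows the first two steps to be treated as separate prediction problems.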

The first 2 steps are actually prediction tasks, where it turns out we can use any ML algorithm, but there are 2 things we need to watch out for.

1. **Overfitting**. Many ML algorithms tend to overfit the data. When we overfit, the prediction absorbs noise that is actually unrelated to the regressors, so that variation goes missing from the residuals. In the worst case, overfitting drives the residuals to zero.
2. **Regularization Bias**. Many ML algorithms implicitly perform variable selection through penalization, which biases their estimates.

Both issues can be mitigated with a technique similar to cross-validation: fit the models on one part of the data and compute the residuals on the part that was held out.
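A minimal sketch of that idea, often called cross-fitting, is shown below. It uses scikit-learn's `cross_val_predict` to obtain out-of-fold predictions; the simulated data, the helper `cross_fit_residuals`, and the hyperparameters are assumptions made for illustration, not necessarily the approach used later in this notebook. For contrast, it also computes in-sample residuals from a fully grown decision tree, which memorizes the training data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data: W confounds both the treatment x and the outcome y,
# and the true effect of x on y is 2
W = rng.normal(size=(n, 2))
x = np.sin(W[:, 0]) + 0.5 * W[:, 1] + rng.normal(size=n)
y = 2.0 * x + W[:, 0] ** 2 + W[:, 1] + rng.normal(size=n)

# Overfitting in action: a fully grown tree reproduces x exactly in-sample,
# so there is no residual variation left to work with
overfit_tree = DecisionTreeRegressor(random_state=0).fit(W, x)
x_resid_in_sample = x - overfit_tree.predict(W)

def cross_fit_residuals(target, W, n_splits=5):
    """Residuals based on out-of-fold predictions: each observation is
    predicted by a model that was trained without it."""
    model = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=0)
    return target - cross_val_predict(model, W, target, cv=n_splits)

x_tilde = cross_fit_residuals(x, W)
y_tilde = cross_fit_residuals(y, W)

# Final stage: slope of the no-intercept regression of y_tilde on x_tilde
beta_hat = (x_tilde @ y_tilde) / (x_tilde @ x_tilde)

print("largest in-sample residual:", np.abs(x_resid_in_sample).max())
print("cross-fitted estimate of the effect:", beta_hat)
```

Because every residual comes from a model that never saw that observation, an overfit learner can no longer push the residuals toward zero.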

We'll refer to $x_2$ to $x_n$ as W and to $x_1$ as X in the examples below.


In [39]:
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

import statsmodels.api as sm
import statsmodels.formula.api as smf  # provides smf.ols, used in Step 3

In [2]:
def generate_data(seed=1, N=300, K=5):

    np.random.seed(seed)

    # visit history
    visitHistory = np.round(np.random.normal(10,5,N),0)
    visitHistory[visitHistory<0] = 0

    # trx history
    trxHistory = np.round(.2*visitHistory * np.random.normal(5,2,N),0)
    trxHistory[trxHistory<0] = 0

    # SkuAvailability in terms of oos pct
    skuAvailability = np.random.normal(0.7,0.1,N)

    # number of sales visit
    salesVisit = np.round(visitHistory * 0.2 + trxHistory * 0.1 +
                          np.random.normal(1,1, N), 0)
    salesVisit[salesVisit<0] = 0

    # Sales
    sales = np.round(5000 * salesVisit + 20000 * visitHistory +
                     10000 * trxHistory + 200000 * skuAvailability +
                     np.random.normal(20000,5000,N),0)

    # Generate the dataframe
    df = pd.DataFrame({'visitHistory': visitHistory,
                       'trxHistory': trxHistory,
                       'skuAvailability': skuAvailability,
                       'salesVisit': salesVisit,
                       'sales': sales})

    return df

In [55]:
# Generate the data and split it into outcome, confounders and treatment
df = generate_data(seed=6, N=200)
y = df['sales']                         # outcome
W = df[['visitHistory', 'trxHistory']]  # confounders
X = df['salesVisit']                    # treatment

Go through the FWL Steps <br>
Step 1: predict X using W with ML and compute residuals $\tilde x$

In [None]:
# Fit a shallow random forest of X on W (max_depth=4 limits overfitting)
rfx = RandomForestRegressor(max_depth=4, random_state=0).fit(W, X)
xResid = X - rfx.predict(W)  # in-sample residuals

Step 2: predict y using W with ML and compute residuals $\tilde y$

In [85]:
# Fit a shallow random forest of y on W (max_depth=4 limits overfitting)
rfy = RandomForestRegressor(max_depth=4, random_state=0).fit(W, y)
yResid = y - rfy.predict(W)  # in-sample residuals

Step 3: regress $\tilde y$ on $\tilde x$

In [88]:
dfRes = pd.DataFrame({'salesVisitResiduals':xResid,
                      'salesResiduals':yResid})
mod = smf.ols('salesResiduals ~ salesVisitResiduals', data=dfRes)
res = mod.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:         salesResiduals   R-squared:                       0.044
Model:                            OLS   Adj. R-squared:                  0.039
Method:                 Least Squares   F-statistic:                     9.083
Date:                Thu, 15 Jun 2023   Prob (F-statistic):            0.00292
Time:                        04:14:56   Log-Likelihood:                -2281.9
No. Observations:                 200   AIC:                             4568.
Df Residuals:                     198   BIC:                             4574.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept             466.1959   1

### Try to do it with any ML algo of your choosing below