## Stacking

#### Table of Contents

- [Preliminaries](#Preliminaries)
- [Base Learners](#Base-Learners)
    - [Ridge](#Ridge)
    - [KNN](#KNN)
    - [RF](#RF)
    - [Best Base Learner](#Best-Base-Learner)
- [Average](#Average)
- [Weighted Average](#Weighted-Average)
- [Model 1: OLS Average](#Model-1:-OLS-Average)
- [Model 2: RF Aggregation](#Model-2:-RF-Aggregation)
- [Comparison](#Comparison)

```
import numpy as np


def rmse(yhat, y):
    # Root mean squared error between predictions and targets
    return np.sqrt(np.mean((yhat - y)**2))


def acc(yhat, y):
    # Share of predictions that exactly match the targets
    return np.mean(yhat == y)


def r2(yhat, y):
    # Coefficient of determination: 1 - SS_res / SS_tot
    SSres = ((yhat - y)**2).sum()
    SStot = ((y - y.mean())**2).sum()
    return 1 - SSres/SStot


def stdz(vector):
    # Standardize a vector to mean 0 and standard deviation 1
    return (vector - np.mean(vector))/np.std(vector)
```

*************
# Preliminaries
[TOP](#Stacking)

We will predict `pct_d_rgdp` with the following base learners and then combine their predictions in an ensemble, using the aggregation techniques listed in the table of contents:

1. Ridge Regression
2. KNN
3. RF

In [None]:
%run metrics.py

In [None]:
# Utilities
import numpy as np
import pandas as pd
from tqdm import tqdm 

# Processing
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# algorithms
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

Loading in the data

In [None]:
df = pd.read_pickle('C:/Users/johnj/Documents/Data/aml in econ 02 spring 2021/class data/class_data.pkl')

We are going to exclude the fixed effect features for `year` to reduce the number of features.

In [None]:
df_prepped = df.drop(columns = ['urate_bin', 'year']).join([
    pd.get_dummies(df['urate_bin'], drop_first = True)
])

We are going to standardize all of our features, with each subset standardized using its own mean and standard deviation.

Remember, we need to obtain the data for

- `train1`
- `train2`
- `test`

In [None]:
y = df_prepped['pct_d_rgdp']
x = df_prepped.drop(columns = 'pct_d_rgdp')

x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                   train_size = 2/3,
                                                   random_state = 490)

x_train1, x_train2, y_train1, y_train2 = train_test_split(x_train, y_train,
                                    train_size = 1/2,
                                    random_state = 490)

x_train1 = x_train1.apply(stdz)
x_train2 = x_train2.apply(stdz)
x_test   = x_test.apply(stdz)
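As a rough check of the proportions above, the sketch below reproduces the same three-way split on synthetic data. It uses a NumPy permutation as a stand-in for `train_test_split` and a local `zscore` helper that mirrors `stdz`; the array sizes, the seed, and the helper name are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(490)

# Synthetic feature matrix: 90 observations, 4 features (illustrative sizes)
n = 90
X = rng.normal(loc=5.0, scale=2.0, size=(n, 4))

# Shuffle the row indices, then cut them into train1 / train2 / test
idx = rng.permutation(n)
n_train = 2 * n // 3        # 2/3 of the data goes to training
n_train1 = n_train // 2     # half of the training data goes to train1

train1_idx = idx[:n_train1]
train2_idx = idx[n_train1:n_train]
test_idx = idx[n_train:]


def zscore(a):
    # Column-wise standardization, analogous to DataFrame.apply(stdz)
    return (a - a.mean(axis=0)) / a.std(axis=0)


X_train1 = zscore(X[train1_idx])
X_train2 = zscore(X[train2_idx])
X_test = zscore(X[test_idx])

print(len(train1_idx), len(train2_idx), len(test_idx))
```

Splitting 2/3 and then 1/2 leaves each of `train1`, `train2`, and `test` with about one third of the observations.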

Removing what we do not need

In [None]:
%who

In [None]:
del df, df_prepped, x_train, y_train

In [None]:
%who

***********
# Base Learners
[TOP](#Stacking)

In this demonstration, we are only going to use 3 base learners.
However, there is nothing stopping you from using more.
In fact, you may find that the more learners you have, the better your model.

However, once you start to include a larger number of base learners, you may want to consider using regularization to aggregate their predictions.
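One possible way to regularize the aggregation step is a ridge penalty on the stacking weights. The sketch below uses the closed-form ridge solution in NumPy on synthetic data; the function name `ridge_stack_weights`, the noise levels, and the `alpha` values are assumptions for illustration, not part of this notebook.

```python
import numpy as np


def ridge_stack_weights(preds, y, alpha):
    """Closed-form ridge weights for combining base-learner predictions.

    preds has shape (n_samples, n_learners) and y has shape (n_samples,).
    The intercept is left unpenalized by centering both before solving.
    """
    preds_mean = preds.mean(axis=0)
    y_mean = y.mean()
    centered = preds - preds_mean
    k = centered.shape[1]
    # Solve (P'P + alpha * I) w = P'(y - mean(y))
    w = np.linalg.solve(centered.T @ centered + alpha * np.eye(k),
                        centered.T @ (y - y_mean))
    b0 = y_mean - preds_mean @ w
    return b0, w


rng = np.random.default_rng(0)
y = rng.normal(size=200)

# Hypothetical base learners: increasingly noisy copies of the target
preds = np.column_stack([y + rng.normal(scale=s, size=200)
                         for s in (0.5, 0.7, 0.9, 1.1, 1.3)])

b0_small, w_small = ridge_stack_weights(preds, y, alpha=0.01)
b0_large, w_large = ridge_stack_weights(preds, y, alpha=1000.0)

print(np.round(w_small, 3))
print(np.round(w_large, 3))
```

A larger `alpha` shrinks the weights toward zero, which helps when many correlated base learners make the unpenalized weights unstable.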

*************
## Ridge
[TOP](#Stacking)

We will be using a ridge regression function from `sklearn`, which means we do not need to append an intercept to the features.

In [None]:
reg_ridge = RidgeCV(alphas = 10.**np.linspace(-2, 5, num = 20),
                   cv = 5).fit(x_train1, y_train1)
reg_ridge.alpha_

In [None]:
r2_ridge = reg_ridge.score(x_test, y_test)
r2_ridge

***************
## KNN
[TOP](#Stacking)

Remember that KNN does little work when fitting, since it mostly stores the training data (or builds a search index), but it is relatively slow at predicting.
All the other models we have used so far are relatively fast at predicting.

**Why is KNN slow at predicting?** *Hint: it is in its name!*
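A minimal brute-force sketch may help with the question above. The class `BruteForceKNN` is hypothetical and is not how `KNeighborsRegressor` is implemented (scikit-learn can also use tree-based indexes); it only illustrates that the neighbor search happens at prediction time.

```python
import numpy as np


class BruteForceKNN:
    """Toy KNN regressor: fit stores the data, predict does the search."""

    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        # Nothing is estimated here; the training data are simply stored
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y, dtype=float)
        return self

    def predict(self, X_new):
        X_new = np.asarray(X_new, dtype=float)
        # Squared distance from every new point to every training point:
        # an (n_new, n_train) matrix, so the cost grows with the training size
        d2 = ((X_new[:, None, :] - self.X_[None, :, :]) ** 2).sum(axis=2)
        # Column indices of the k smallest distances in each row
        k = self.n_neighbors
        nearest = np.argpartition(d2, k - 1, axis=1)[:, :k]
        # Predict with the average target of the k nearest neighbors
        return self.y_[nearest].mean(axis=1)


X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 10.0])

knn = BruteForceKNN(n_neighbors=2).fit(X_train, y_train)
pred = knn.predict(np.array([[0.4], [9.0]]))
print(pred)
```

Every call to `predict` has to compare each new observation against the stored training observations, which is why prediction cost scales with the size of the training set.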

Let the CV begin! We are going to set a hard limit of 100 on the number of neighbors.

In [None]:
%%time
param_grid = {
    'n_neighbors': [5, 10, 25, 50, 75, 100]
}

knn_cv = KNeighborsRegressor()

grid_search = GridSearchCV(knn_cv, param_grid,
                          cv = 5,
                          scoring = 'neg_mean_squared_error',
                          n_jobs = 10,
                          verbose = 2).fit(x_train1, y_train1)
best_knn = grid_search.best_params_
best_knn

Now we refit the model on `train1` using the best number of neighbors.

In [None]:
reg_knn = KNeighborsRegressor(n_neighbors = best_knn['n_neighbors'])
reg_knn.fit(x_train1, y_train1)

r2_knn = reg_knn.score(x_test, y_test)
r2_knn

*************
## RF
[TOP](#Stacking)

In [None]:
%%time
reg_rf = RandomForestRegressor(n_estimators = 500,
                              max_features = 'sqrt',
                              random_state = 490,
                              n_jobs = 10).fit(x_train1, y_train1)
r2_rf = reg_rf.score(x_test, y_test)
r2_rf

**************
## Best Base Learner
[TOP](#Stacking)

We can print out the base learners' $R^2$ performance.

In [None]:
r2_base = {
    'r2_ridge': r2_ridge,
    'r2_knn': r2_knn,
    'r2_rf': r2_rf
}
print(r2_base, '\n')
best_base = max(r2_base, key = r2_base.get)

print(best_base, ':', r2_base[best_base])

**********
# Average
[TOP](#Stacking)

Remember that the coefficients (the weights) are predetermined for a simple average.
They are specifically set to the inverse of the number of base learners. 
To see this, let $j$ denote the base learner index.

$$
\begin{align*}
    \bar{f}(x) & = \frac{1}{3}\sum_{j=1}^3 f_j(x)\\
    & = \frac{1}{3}f_1(x) + \frac{1}{3}f_2(x) + \frac{1}{3}f_3(x)\\
    & = w_1 f_1(x) + w_2 f_2(x) + w_3 f_3(x), \qquad w_j = \frac{1}{3}
\end{align*}
$$
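The identity above can be checked numerically. The sketch below uses a small made-up prediction matrix (the values are purely illustrative) and confirms that the row-wise mean equals the weighted sum with $w_j = 1/3$.

```python
import numpy as np

# Hypothetical predictions from three base learners for four observations
preds = np.array([[1.0, 2.0, 3.0],
                  [0.0, 0.0, 3.0],
                  [2.0, 4.0, 6.0],
                  [1.5, 1.5, 1.5]])

# Predetermined weights: the inverse of the number of base learners
w = np.full(3, 1 / 3)

avg_direct = preds.mean(axis=1)   # simple average across learners
avg_weighted = preds @ w          # w_1 f_1 + w_2 f_2 + w_3 f_3

print(avg_direct)
print(avg_weighted)
```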



In [None]:
df_test_yhat = pd.DataFrame({
    'ridge': reg_ridge.predict(x_test),
    'knn': reg_knn.predict(x_test),
    'rf': reg_rf.predict(x_test)}, 
index = y_test.index)
df_test_yhat.head(1)

In [None]:
r2_avg = r2(df_test_yhat.mean(axis = 1), y_test)
r2_avg

************
# Weighted Average
[TOP](#Stacking)

To estimate a weighted average, we first create a grid of candidate weight vectors and keep only those whose weights sum to one.

In [None]:
step_size = 0.1
wts = np.arange(0, 1 + step_size, step = step_size)
wts_grid = np.array([(w1, w2, w3) for w1 in wts for w2 in wts for w3 in wts])

# Compare with a tolerance: floating-point sums may not equal 1 exactly
keep = np.isclose(wts_grid.sum(axis = 1), 1)
wts_grid = wts_grid[keep]

wts_grid.shape
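Because the weights are floating-point numbers, testing a sum for exact equality with 1 can miss valid combinations. One way to sidestep this, sketched below, is to enumerate the combinations as integer counts and divide by the number of steps at the end. The helper `simplex_grid` is hypothetical and is offered as an alternative construction, not as the notebook's method.

```python
import numpy as np


def simplex_grid(n_learners, n_steps):
    """All weight vectors with entries in {0, 1/n_steps, ..., 1} that sum to 1."""

    def compositions(total, parts):
        # Yield every way to write `total` as an ordered sum of
        # `parts` non-negative integers
        if parts == 1:
            yield (total,)
            return
        for first in range(total + 1):
            for rest in compositions(total - first, parts - 1):
                yield (first,) + rest

    counts = np.array(list(compositions(n_steps, n_learners)))
    return counts / n_steps


# Three learners and a step size of 0.1 correspond to n_steps = 10
grid = simplex_grid(n_learners=3, n_steps=10)
print(grid.shape)
```

With three learners and ten steps there are 66 weight vectors, compared with 1,331 rows in the full Cartesian grid before filtering.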

We are going to be using the predicted values on `train2` to identify the optimal weights.

It is computationally efficient to compute these predictions only once, so we store them in a data frame.

In [None]:
df_train2_yhat = pd.DataFrame({
    'ridge': reg_ridge.predict(x_train2),
    'knn': reg_knn.predict(x_train2),
    'rf': reg_rf.predict(x_train2)}, 
index = y_train2.index)

Now to identify the optimal weights

In [None]:
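r2_grid = {}

# Score each weight vector on train2: the predictions and the targets
# must come from the same sample
for i, w in enumerate(tqdm(wts_grid)):
    yhat = df_train2_yhat @ w
    r2_grid[i] = r2(yhat, y_train2)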

In [None]:
best_indx = max(r2_grid, key = r2_grid.get)
best_wts = wts_grid[best_indx]
best_wts
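The weight search can be sketched end to end on synthetic data. The important detail is that the weighted predictions and the targets they are scored against come from the same holdout sample. All names below (`preds_holdout`, `y_holdout`, `r2_score_np`) and the noise levels are assumptions for illustration.

```python
import numpy as np


def r2_score_np(yhat, y):
    # Coefficient of determination: 1 - SS_res / SS_tot
    ss_res = ((yhat - y) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot


rng = np.random.default_rng(490)

# Synthetic holdout targets and three hypothetical base-learner predictions
y_holdout = rng.normal(size=300)
preds_holdout = np.column_stack([y_holdout + rng.normal(scale=s, size=300)
                                 for s in (0.4, 0.8, 1.2)])

# Weight grid with step 0.1, built from integer counts so every row sums to 1
steps = np.arange(11)
grid = np.array([(i, j, 10 - i - j)
                 for i in steps for j in steps if i + j <= 10]) / 10

# Score every weight vector against the targets of the same holdout sample
scores = np.array([r2_score_np(preds_holdout @ w, y_holdout) for w in grid])
best_w = grid[scores.argmax()]

# For reference: the score of each base learner on its own
single_scores = np.array([r2_score_np(preds_holdout[:, j], y_holdout)
                          for j in range(3)])

print(best_w, scores.max())
```

Since the grid contains the vectors that put all weight on a single learner, the best weighted average can never score below the best individual learner on the sample used for the search.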

Now we apply the optimal weights to the test-set predictions and save the $R^2$.

In [None]:
yhat = df_test_yhat @ best_wts.T

r2_wtd_avg = r2(yhat, y_test)
r2_wtd_avg

**************
# Model 1: OLS Average
[TOP](#Stacking)



**How is OLS an average?**

With a slight abuse of notation, recall that in this case OLS takes the form

$$\hat{y} = \beta_0 + \hat{y}_1 \beta_1 + \hat{y}_2 \beta_2 + \hat{y}_3 \beta_3 $$

Here, $\beta_1$, $\beta_2$, and $\beta_3$ act as weights that are not constrained to sum to 1, and
$\beta_0$ is a *bias* term.
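To make the connection concrete, the sketch below fits this regression with `np.linalg.lstsq` on synthetic predictions and compares its in-sample squared error with that of the simple average. The data and the noise levels are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic targets and three hypothetical base-learner predictions
y = rng.normal(loc=2.0, size=250)
preds = np.column_stack([y + rng.normal(scale=s, size=250)
                         for s in (0.5, 1.0, 1.5)])

# Design matrix with a leading column of ones for the bias term beta_0
design = np.column_stack([np.ones(len(y)), preds])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
beta0, betas = coef[0], coef[1:]

# In-sample sum of squared errors: OLS stack versus the simple average
sse_ols = ((design @ coef - y) ** 2).sum()
sse_avg = ((preds.mean(axis=1) - y) ** 2).sum()

print(beta0, betas, betas.sum())
print(sse_ols, sse_avg)
```

The simple average is the special case $\beta_0 = 0$ and $\beta_j = 1/3$, so on the sample used for fitting OLS can do no worse than the average; nothing forces `betas.sum()` to equal 1.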

In [None]:
stack_ols = LinearRegression().fit(df_train2_yhat, y_train2)
print(stack_ols.intercept_, stack_ols.coef_)

In [None]:
stack_ols.coef_.sum()

In [None]:
r2_stack_ols = stack_ols.score(df_test_yhat, y_test)
r2_stack_ols

**************
# Model 2: RF Aggregation
[TOP](#Stacking)

We can also use different models as stackers.

Here we will use a random forest. 
We will use the usual `max_features = 'sqrt'`, but we will also set `max_depth = 2` because we have so few features.
We will also reduce `n_estimators` by an order of magnitude for the same reason.

In [None]:
stack_rf = RandomForestRegressor(n_estimators = 50,
                                max_features = 'sqrt',
                                max_depth = 2,
                                random_state = 490,
                                n_jobs = 10)
stack_rf.fit(df_train2_yhat, y_train2)

r2_stack_rf = stack_rf.score(df_test_yhat, y_test)
r2_stack_rf

****************
# Comparison
[TOP](#Stacking)

In [None]:
%whos
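One simple way to line the results up is to collect the saved $R^2$ values in a dictionary and sort them. The sketch below uses a hypothetical helper `rank_scores` and placeholder numbers that are not results from this notebook; in practice the dictionary would hold `r2_base[best_base]`, `r2_avg`, `r2_wtd_avg`, `r2_stack_ols`, and `r2_stack_rf`.

```python
def rank_scores(scores):
    """Return (name, score) pairs sorted from best to worst R^2."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder values for illustration only, NOT results from this notebook
example_scores = {
    'best_base': 0.3,
    'average': 0.1,
    'weighted_average': 0.5,
    'stack_ols': 0.2,
    'stack_rf': 0.4,
}

ranking = rank_scores(example_scores)
for name, score in ranking:
    print(f'{name:<18}{score:.4f}')
```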