# Ensembling: Blending and Stacking

In this notebook, we implement stacking of machine learning models. Stacking several weakly correlated models often generalizes better than any individual model. Stacking mainly requires a sound cross-validation strategy between levels of prediction. In particular, we will demonstrate that maintaining the same cross-validation folds between levels reduces overfitting.

In [12]:
import pandas as pd
import numpy as np
from scipy.optimize import minimize, fmin
from xgboost import XGBClassifier

from sklearn import model_selection, linear_model, metrics, decomposition, ensemble
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin, ClassifierMixin

from functools import partial
from typing import List
import warnings
import random


warnings.simplefilter(action='ignore')
NUM_FOLDS = 5

## Dataset

The choice of dataset is not important here; this one is clean and needs no special handling. It consists of text data in the form of movie reviews, each with a sentiment label. We will build a **sentiment classifier** using an ensemble of three models.

In [13]:
df = pd.read_csv('../input/kumarmanoj-bag-of-words-meets-bags-of-popcorn/labeledTrainData.tsv', sep='\t')
df.head()

| | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... |


### Train and test split

In [14]:
df_train, df_test = model_selection.train_test_split(df, test_size=0.20)
print(df_train.shape, df_test.shape)

(20000, 3) (5000, 3)


### Cross-validation folds

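Here we create cross-validation folds. These are essential both for evaluating models and for creating level 1 features that are not overfitted. Note that the code passes `n_splits=NUM_FOLDS+1`, so six folds are created, as the counts below confirm.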

In [15]:
df_train.loc[:, 'kfold'] = -1 
df_train = df_train.sample(frac=1.0).reset_index(drop=True)
y = df_train['sentiment'].values

skf = model_selection.StratifiedKFold(n_splits=NUM_FOLDS+1)
for f, (t_, v_) in enumerate(skf.split(X=df_train, y=y)):
    df_train.loc[v_, "kfold"] = f

In [16]:
df_train.kfold.value_counts()

0    3334
1    3334
2    3333
3    3333
4    3333
5    3333
Name: kfold, dtype: int64

## Stacking

We define a class that automates training and prediction of stacked models. Several models are trained on the training set, and their predicted probabilities are used as features for a further metamodel called a **stacker**. This process can be iterated over several more levels. To avoid creating metafeatures that are overfitted to the train set, the metafeatures are generated by out-of-fold (OOF) training and prediction of the models on the features of the previous level. This requires defining cross-validation folds. The same cross-validation folds will be used to generate metafeatures at deeper levels; this will be justified later.

After generating metafeatures, the models are retrained on the whole training set (not just on the train folds). This typically improves predictions on the test set. Finally, prediction on the test set mimics the conditions under which the metafeatures were generated: the test set essentially acts like an extra validation fold.
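The OOF procedure described above can be sketched compactly with scikit-learn's `cross_val_predict`. This is only an illustration under stated assumptions: the data is synthetic (`make_classification` stands in for the review features), and a single logistic regression plays the role of a base model; it is not the notebook's `StackingClassifier`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the training data (assumption for illustration only).
X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# Fix the folds once so that every stacking level could reuse the same splits.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each row's prediction comes from a model fitted on the other folds only.
base_model = LogisticRegression(max_iter=1000)
oof_pred = cross_val_predict(base_model, X, y, cv=skf, method="predict_proba")[:, 1]

# Refit on the full training set; this fitted model is the one used at inference.
base_model.fit(X, y)
```

Here `oof_pred` would serve as one level 1 metafeature column, while the refitted `base_model` produces the corresponding column for unseen data.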


:::{note}
Alternatively, we could make predictions on the test dataset using each base model immediately after it is fitted on each fold. This would generate one set of test-set predictions per fold for each base model. We would then average these predictions per model to generate the level 1 metafeatures for the test set.

One benefit is that this is less time-consuming than the first approach, since we do not have to retrain each model on the full training dataset. It also helps that the train and test metafeatures should follow a similar distribution. However, the test metafeatures are likely more accurate in the first approach, since each base model is trained on the full training dataset rather than on the training folds only.
:::
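The fold-averaging alternative from the note can be sketched as follows. Everything here is an assumption made for illustration: the data is synthetic, the split sizes are arbitrary, and a plain logistic regression stands in for a base model.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in for the train/test split (assumption for illustration only).
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
oof_pred = np.zeros(len(X_tr))   # train metafeature, filled fold by fold
test_preds = []                  # one test prediction vector per fold

for trn_idx, val_idx in skf.split(X_tr, y_tr):
    model = clone(LogisticRegression(max_iter=1000))
    model.fit(X_tr[trn_idx], y_tr[trn_idx])
    oof_pred[val_idx] = model.predict_proba(X_tr[val_idx])[:, 1]
    # Predict on the test set right away, with the fold-level model.
    test_preds.append(model.predict_proba(X_te)[:, 1])

# Test metafeature: average of the fold-level predictions; no full refit needed.
test_meta = np.mean(test_preds, axis=0)
```

The design trade-off is the one stated in the note: no model is ever refitted on the full training set, at the cost of each fold-level model seeing less data.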



### Implementation

In [17]:
class StackingClassifier:
    """Implements model stacking for classification."""
    
    def __init__(self, model_dict_list):
        """Initialize by passing `model_dict` which is a list of dictionaries 
        of name-model pairs for each level."""
        
        self.model_dict_list = model_dict_list
        self.cv_scores_ = {}
        self.metafeatures_ = None
        
    def fit(self, df):
        """Fit classifier. This assumes `df` is a DataFrame with "id", "kfold", 
        "sentiment" (target) columns, followed by features columns."""
        
        df = df.copy()
        
        # Iterating over all stacking levels
        metafeatures = []
        for m in range(len(self.model_dict_list)):
            
            # Get models in current layer
            model_dict = self.model_dict_list[m]
            level = m + 1
            
            # Identify feature columns, i.e. preds of prev. layer
            if m == 0:
                feature_cols = ['review']
            else:
                prev_level_names = self.model_dict_list[m-1].keys()
                feature_cols = [f'{name}_{level-1}' for name in prev_level_names]
            
            # Iterate over models in the current layer
            for model_name in model_dict.keys():
                print(f'\nLevel {level} preds: {model_name}')
                self.cv_scores_[f'{model_name}_{level}'] = []
                model = model_dict[model_name]
                
                # Generate feature for next layer models from OOF preds
                oof_preds = []
                for j in range(df.kfold.nunique()):
                    oof_pred, oof_auc = self._oof_pred(df, feature_cols, model, 
                                                        model_name, fold=j, level=level)
                    oof_preds.append(oof_pred)
                    self.cv_scores_[f'{model_name}_{level}'].append(oof_auc)
                
                pred = pd.concat(oof_preds)
                df = df.merge(pred[['id', f'{model_name}_{level}']], on='id', how='left')   
                metafeatures.append(f'{model_name}_{level}')
        
                # Train models on entire feature columns for inference
                model.fit(df[feature_cols], df.sentiment.values)
        
        self.metafeatures_ = df[metafeatures]
        return self
        
    def predict_proba(self, test_df):
        """Return classification probabilities."""
        
        test_df = test_df.copy()
        
        # Iterate over layers to make predictions
        for m in range(len(self.model_dict_list)):
            
            # Get models for current layer
            model_dict = self.model_dict_list[m]
            level = m + 1
            
            # Get feature columns to use for prediction
            if m == 0:
                feature_cols = ['review']
            else:
                prev_names = self.model_dict_list[m-1].keys()
                feature_cols = [f"{model_name}_{level-1}" for model_name in prev_names]

            # Append predictions to test DataFrame
            for model_name in model_dict.keys():
                model = model_dict[model_name]
                pred = model.predict_proba(test_df[feature_cols])[:, 1] 
                test_df.loc[:, f"{model_name}_{level}"] = pred
                    
        # Return last predictions
        return np.c_[1 - pred, pred]
        
    def _oof_pred(self, df, feature_cols, model, model_name, fold, level):
        "Train on K-1 folds, predict on fold K. Return OOF predictions with IDs."

        # Get folds; include ID and target cols, and feature cols
        df_trn = df[df.kfold != fold][['id', 'sentiment']+feature_cols]
        df_oof = df[df.kfold == fold][['id', 'sentiment']+feature_cols]
        
        # Fit model. 
        model.fit(df_trn[feature_cols], df_trn.sentiment.values)
        oof_pred = model.predict_proba(df_oof[feature_cols])[:, 1] 
        auc = metrics.roc_auc_score(df_oof.sentiment.values, oof_pred)
        print(f"fold={fold}, auc={auc}")

        # Return OOF predictions with ids
        df_oof.loc[:, f"{model_name}_{level}"] = oof_pred
        return df_oof[["id", f"{model_name}_{level}"]], auc

### Blending

Let's start with a simple stacked model where we simply perform a weighted average of the prediction probabilities. This method is called **blending**. We will use three base models to generate probabilities. Hopefully these are uncorrelated:
1. Logistic Regression + TF-IDF
2. Logistic Regression + Count Vectorizer
3. Random Forest + TF-IDF + SVD

In [18]:
class ReviewColumnExtractor(BaseEstimator, TransformerMixin):
    """Extract text column, e.g. letting X = df_train[['review']]
    as train dataset for TfidfVectorizer and CountVectorizer does
    not work as expected."""
    
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return X.review

Initialize base models:

In [19]:
# logistic regression + tfidf
lr = make_pipeline(
    ReviewColumnExtractor(),
    TfidfVectorizer(max_features=1000),
    linear_model.LogisticRegression()
)

# logistic regression + count vectorizer
lr_cnt = make_pipeline(
    ReviewColumnExtractor(),
    CountVectorizer(),
    linear_model.LogisticRegression(solver='liblinear')
)

# random forest + decomposed (svd) tfidf features
rf_svd = make_pipeline(
    ReviewColumnExtractor(),
    TfidfVectorizer(max_features=None),
    decomposition.TruncatedSVD(n_components=120),
    ensemble.RandomForestClassifier(n_estimators=100, n_jobs=-1)
)

Run training:

In [20]:
basemodels = {'lr': lr, 'lr_cnt': lr_cnt, 'rf_svd': rf_svd}
stack = StackingClassifier([basemodels])
stack.fit(df_train)


Level 1 preds: lr
fold=0, auc=0.9384901442591425
fold=1, auc=0.9335381634668255
fold=2, auc=0.9280035949592794
fold=3, auc=0.9304714786729175
fold=4, auc=0.9414464824536768
fold=5, auc=0.937182146174952

Level 1 preds: lr_cnt
fold=0, auc=0.950827093427299
fold=1, auc=0.9397791490696997
fold=2, auc=0.9401545861295928
fold=3, auc=0.9398243567308318
fold=4, auc=0.946280093042683
fold=5, auc=0.9494958987764743

Level 1 preds: rf_svd
fold=0, auc=0.8836239531783937
fold=1, auc=0.8793645155904087
fold=2, auc=0.8715397976827204
fold=3, auc=0.8733627512404492
fold=4, auc=0.8865565205493264
fold=5, auc=0.8709063379926689


<__main__.StackingClassifier at 0x7f81e4aace10>

Check if basemodels are uncorrelated:

In [21]:
stack.metafeatures_.corr()

| | lr_1 | lr_cnt_1 | rf_svd_1 |
|---|---|---|---|
| lr_1 | 1.0 | 0.886559 | 0.828966 |
| lr_cnt_1 | 0.886559 | 1.0 | 0.72276 |
| rf_svd_1 | 0.828966 | 0.72276 | 1.0 |


In [22]:
stack.metafeatures_.head()

| | lr_1 | lr_cnt_1 | rf_svd_1 |
|---|---|---|---|
| 0 | 0.007027 | 3.3e-05 | 0.24 |
| 1 | 0.884595 | 0.999684 | 0.71 |
| 2 | 0.063354 | 0.001449 | 0.44 |
| 3 | 0.420351 | 0.421524 | 0.63 |
| 4 | 0.231211 | 0.16132 | 0.56 |


We can also check scores of the base models on each validation fold. This informs us of the stability of the folds and the cross-validation performance of the base models. 

In [23]:
pd.DataFrame(stack.cv_scores_).describe().loc[['mean', 'std']]

| | lr_1 | lr_cnt_1 | rf_svd_1 |
|---|---|---|---|
| mean | 0.934855 | 0.944394 | 0.877559 |
| std | 0.005098 | 0.005121 | 0.00662 |


Let us try to blend the probabilities using some hand-designed coefficients.

In [24]:
target = df_train.sentiment.values

# ROC AUC is invariant to positive rescaling, so we don't bother dividing by the total weight
avg_preds = (stack.metafeatures_ * [1, 1, 1]).sum(axis=1)
wtd_preds = (stack.metafeatures_ * [1, 3, 1]).sum(axis=1)
rank_avg_preds = (stack.metafeatures_.rank() * [1, 1, 1]).sum(axis=1)
rank_wtd_preds = (stack.metafeatures_.rank() * [1, 3, 1]).sum(axis=1)

# Calculate AUC over combined OOF preds
print(f"Train OOF-AUC (averaged):     ", metrics.roc_auc_score(target, avg_preds))
print(f"Train OOF-AUC (wtd. avg):     ", metrics.roc_auc_score(target, wtd_preds))
print(f"Train OOF-AUC (rank avg):     ", metrics.roc_auc_score(target, rank_avg_preds)) 
print(f"Train OOF-AUC (wtd. rank avg):", metrics.roc_auc_score(target, rank_wtd_preds))

Train OOF-AUC (averaged):      0.9481721046043312
Train OOF-AUC (wtd. avg):      0.949190945103563
Train OOF-AUC (rank avg):      0.9432180771768579
Train OOF-AUC (wtd. rank avg): 0.9492301051227515


Since these coefficients are hand-designed, we may want a strategy for automatically finding the optimal blending coefficients. This is accomplished by the following class.

In [25]:
class Blender(BaseEstimator, ClassifierMixin):
    """Implement blending that maximizes AUC score."""
    
    def __init__(self, rank=False):
        self.coef_ = None
        self.rank = rank

    def fit(self, X, y):
        """Find optimal blending coefficients."""
        
        if self.rank:
            X = X.rank()

        self.coef_ = self._optimize_auc(X, y)
        return self

    def predict_proba(self, X):
        """Return blended probabilities for class 0 and class 1."""
        
        if self.rank:
            X = X.rank()
            
        pred = np.sum(X * self.coef_, axis=1)
        return np.c_[1 - pred, pred]

    def _auc(self, coef, X, y):
        """Calculate AUC of blended predict probas."""

        auc = metrics.roc_auc_score(y, np.sum(X * coef, axis=1))
        return -1.0 * auc # min -auc = max auc
    
    def _optimize_auc(self, X, y):
        """Maximize AUC as a bound-constrained optimization problem using Nelder-Mead 
        method with Dirichlet init. 
        
        Reference: 
        https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html
        """
        partial_loss = partial(self._auc, X=X, y=y) 
        init_coef = np.random.dirichlet(np.ones(X.shape[1]))
        # One (0, 1) bound per input column, so any number of models can be blended
        return minimize(partial_loss, init_coef, 
                        method='Nelder-Mead', 
                        bounds=[(0, 1)] * X.shape[1])['x']

This implementation uses `partial` from `functools` and `minimize` from `scipy.optimize` to minimize the negative AUC over coefficients bounded between $0$ and $1.$ The initial values of the coefficients are drawn from a Dirichlet distribution $\operatorname{Dir}(\boldsymbol{\alpha})$ with $\boldsymbol{\alpha} = [1, 1, 1].$
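Two properties of this setup are easy to check numerically. A Dirichlet draw is nonnegative and sums to one, so it is a valid set of starting weights. AUC depends only on the ordering of the blended scores, so rescaling all coefficients by a positive constant leaves it unchanged. The objective is therefore piecewise constant in the coefficients, which is why a derivative-free method such as Nelder-Mead is a natural fit. The sketch below uses synthetic scores (an assumption for illustration), not the notebook's metafeatures.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Dirichlet(1, 1, 1) draw: nonnegative weights that sum to one.
init_coef = rng.dirichlet(np.ones(3))

# Synthetic "metafeatures": three noisy scores that are higher for positives.
y = rng.integers(0, 2, size=300)
X = 0.5 * rng.random((300, 3)) + 0.3 * y[:, None]

# AUC of the blend, and AUC after scaling every coefficient by 10.
auc = roc_auc_score(y, X @ init_coef)
auc_scaled = roc_auc_score(y, X @ (10 * init_coef))
```

Because `auc` and `auc_scaled` agree, only the direction of the coefficient vector matters, not its length.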

In [26]:
target = df_train.sentiment.values

# Blended predictions
blender = Blender()
blender.fit(stack.metafeatures_, target)
combined_oof_preds = (stack.metafeatures_ * blender.coef_).sum(axis=1)

# Blended ranked predictions
blender_rk = Blender(rank=True)
blender_rk.fit(stack.metafeatures_, target)
combined_oof_rk_preds = (stack.metafeatures_.rank() * blender_rk.coef_).sum(axis=1)

print(f"Train OOF-AUC (Blended):    ", metrics.roc_auc_score(target, combined_oof_preds))
print(f"Train OOF-AUC (Blended rk.):", metrics.roc_auc_score(target, combined_oof_rk_preds))

Train OOF-AUC (Blended):     0.949660465333628
Train OOF-AUC (Blended rk.): 0.9501244955610028


Note that this is not the same as the train AUC; it is a better approximation of the test AUC. Calculating the AUC over the combined out-of-fold predictions amounts to summing the confusion matrices of all folds at each threshold and computing a single AUC from the result. The mean cross-validation score, on the other hand, tracks each fold's confusion matrix separately and then averages the individual AUCs. The two should be similar if the error is well-distributed between folds and we are blending probabilities. [^ref]

[^ref]: For some reason OOF-AUC is bad when blending ranking models, e.g. linear regression, and usual classifiers, even after transforming predict probabilities to rank.
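The difference between the pooled OOF-AUC and the mean of the per-fold AUCs can be illustrated on synthetic scores (an assumption; these are not the notebook's predictions). When scores are comparable across folds, the two numbers are close. When one fold's scores are shifted by a constant, the per-fold AUCs do not change, but the pooled AUC drops because scores from different folds are now ranked against each other on different scales.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, k = 600, 5
y = rng.integers(0, 2, size=n)
scores = y + rng.normal(scale=1.0, size=n)   # noisy scores, higher for positives
folds = np.arange(n) % k                      # hypothetical fold assignment

def pooled_and_mean_auc(y, s, folds):
    """Return (AUC over all rows at once, mean of the per-fold AUCs)."""
    pooled = roc_auc_score(y, s)
    per_fold = [roc_auc_score(y[folds == j], s[folds == j]) for j in range(k)]
    return pooled, float(np.mean(per_fold))

pooled_auc, mean_fold_auc = pooled_and_mean_auc(y, scores, folds)

# Shift fold 0 by a constant: rankings inside each fold are untouched.
shifted = scores + 3.0 * (folds == 0)
pooled_shifted, mean_fold_shifted = pooled_and_mean_auc(y, shifted, folds)
```

This is one reason the pooled OOF-AUC is sensitive to how well-calibrated the predictions are across folds, while the mean CV score is not.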

In [27]:
# Inference
test_target = df_test.sentiment.values
test_features = []
for model_name in basemodels.keys():
    test_features.append(basemodels[model_name].predict_proba(df_test)[:, 1])

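# Note: blender_rk.coef_ was fitted on ranked metafeatures but is applied here to raw test probabilities
test_pred = (pd.DataFrame(np.c_[test_features].T) * blender.coef_).sum(axis=1)
test_rk_pred = (pd.DataFrame(np.c_[test_features].T) * blender_rk.coef_).sum(axis=1)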
print('Test AUC (Blended):    ', metrics.roc_auc_score(test_target, test_pred))
print('Test AUC (Blended rk.):', metrics.roc_auc_score(test_target, test_rk_pred))

Test AUC (Blended):     0.9466536617647081
Test AUC (Blended rk.): 0.9466128614448339


:::{tip}
Using blended **rank probabilities** is a good trick when optimizing the AUC score. Here individual probabilities are replaced by their rank index. Recall that AUC only depends on the probability that a randomly chosen negative example is assigned a lower score than a randomly chosen positive example, i.e. on the ordering of the predictions. Note that this only makes a difference for ensembles; for a single model, replacing probabilities with their ranks does not affect the AUC score.
:::
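The claim in the tip can be checked on synthetic predictions. The two "models" below are hypothetical: one outputs spread-out probabilities, the other outputs probabilities pushed towards 0 and 1. Ranking is a monotone transform, so the AUC of a single model is unchanged, but the ranked blend weighs both models on a common scale and can therefore differ from the raw blend.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two hypothetical models with very differently scaled outputs.
preds = pd.DataFrame({
    "m1": sigmoid((2 * y - 1) + rng.normal(size=n)),        # moderate confidence
    "m2": sigmoid(10 * ((2 * y - 1) + rng.normal(size=n))),  # saturated near 0 or 1
})
ranks = preds.rank()

# Single model: ranks preserve the ordering, so AUC is identical.
auc_raw = roc_auc_score(y, preds["m1"])
auc_rank = roc_auc_score(y, ranks["m1"])

# Ensemble: the raw sum and the rank sum order the rows differently in general.
blend_raw = roc_auc_score(y, preds.sum(axis=1))
blend_rank = roc_auc_score(y, ranks.sum(axis=1))
```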

### XGB Metamodel

Blending can be easily generalized by replacing the weighted average with a machine learning model that learns from and predicts with the metafeatures. For example, we can use `XGBClassifier`.

In [28]:
basemodels = {'lr': lr, 'lr_cnt': lr_cnt, 'rf_svd': rf_svd}
metamodel = {'xgb': XGBClassifier(eval_metric="logloss", use_label_encoder=False)}
stack = StackingClassifier([basemodels, metamodel])
stack.fit(df_train)


Level 1 preds: lr
fold=0, auc=0.9384901442591425
fold=1, auc=0.9335381634668255
fold=2, auc=0.9280035949592794
fold=3, auc=0.9304714786729175
fold=4, auc=0.9414464824536768
fold=5, auc=0.937182146174952

Level 1 preds: lr_cnt
fold=0, auc=0.950827093427299
fold=1, auc=0.9397791490696997
fold=2, auc=0.9401545861295928
fold=3, auc=0.9398243567308318
fold=4, auc=0.946280093042683
fold=5, auc=0.9494958987764743

Level 1 preds: rf_svd
fold=0, auc=0.8802085582434412
fold=1, auc=0.8797558591782036
fold=2, auc=0.8708704237543847
fold=3, auc=0.8721752327867435
fold=4, auc=0.8833828432749296
fold=5, auc=0.8728789941020156

Level 2 preds: xgb
fold=0, auc=0.9485932862353577
fold=1, auc=0.9412457429014772
fold=2, auc=0.9417221597697266
fold=3, auc=0.9407211168002535
fold=4, auc=0.9483645516019616
fold=5, auc=0.9488598310540757


<__main__.StackingClassifier at 0x7f81e466cbd0>

In [29]:
y_train = df_train.sentiment.values
y_test = df_test.sentiment.values

print(f"Train AUC (XGB stack):", metrics.roc_auc_score(y_train, stack.predict_proba(df_train)[:, 1]))
print(f"Test AUC  (XGB stack):", metrics.roc_auc_score(y_test, stack.predict_proba(df_test)[:, 1]))

Train AUC (XGB stack): 0.9990695195440646
Test AUC  (XGB stack): 0.9437543590341748


In [30]:
pd.DataFrame(stack.cv_scores_).describe().loc[['mean', 'std']]

| | lr_1 | lr_cnt_1 | rf_svd_1 | xgb_2 |
|---|---|---|---|---|
| mean | 0.934855 | 0.944394 | 0.876545 | 0.944918 |
| std | 0.005098 | 0.005121 | 0.0052 | 0.004056 |


Observe that the cross-validated AUC scores are indicative of test performance, whereas the train AUC (0.999) is not. A better estimate is the mean cross-validation AUC score. If we assume that each fold has the same error distribution, then this should approximate the test AUC, since predicting on the test set can be thought of as predicting on one additional fold. Indeed, the above results support this.

### Conclusion

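The above examples of building ensembles with **blending** or **stacking** (e.g. with XGBoost) show that stacked models can outperform single models: the blended OOF-AUC (about 0.950) exceeds the mean CV AUC of the best base model (about 0.944), although the gain from the XGBoost stacker (mean CV AUC about 0.945) is marginal.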

## Experiment: CV Folds

Consider stacking three levels of models where level three is just a single classifier. Our current implementation keeps the same cross-validation folds when training the level 2 models [^ref2]. It is unclear whether using the same folds between levels 1 and 2 affects generalization error. To check this empirically, we compute CV scores with folds unshuffled (the usual) and shuffled (new) for training models in level 2. We also calculate the test set AUC. If the CV scores and the test AUC decrease while the overall train AUC increases, this indicates that using different validation folds between levels results in overfitting.

Here the level 2 features come from three metamodels. We blend the predicted probabilities of the metamodels, optimizing AUC as before.

[^ref2]: GM Abhishek Thakur recommends keeping the same folds in [AAAMLP](https://github.com/abhishekkrthakur/approachingalmost/blob/master/AAAMLP.pdf).

In [37]:
class LinearRegressionClassifier(BaseEstimator, ClassifierMixin):
    """Linear regression for model-based AUC optimization.
    Note that we transform probabilities to rank probabilities!"""
    
    def __init__(self): 
        self.lr = linear_model.LinearRegression()
        
    def fit(self, X, y):
        self.lr.fit(pd.DataFrame(X).rank(), y)
        return self
        
    def predict_proba(self, X):
        return np.c_[[0]*len(X), self.lr.predict(pd.DataFrame(X).rank())]

Define models for stacking.

In [38]:
# basemodels
level1 = {
    'lr': make_pipeline(
        ReviewColumnExtractor(),
        TfidfVectorizer(max_features=1000),
        linear_model.LogisticRegression()
    ), 
    
    'lr_cnt': make_pipeline(
        ReviewColumnExtractor(),
        CountVectorizer(),
        linear_model.LogisticRegression(solver='liblinear')
    ), 
    
    'rf_svd': make_pipeline(
        ReviewColumnExtractor(),
        TfidfVectorizer(max_features=None),
        decomposition.TruncatedSVD(n_components=120),
        ensemble.RandomForestClassifier(n_estimators=100, n_jobs=-1)
    )
}

# metamodels
level2 = {
    'lr': linear_model.LogisticRegression(),
    'linreg': make_pipeline(
        StandardScaler(), 
        LinearRegressionClassifier()
    ),
    'xgb': XGBClassifier(eval_metric="logloss", use_label_encoder=False)
}

# blender head
level3 = {'blender': Blender(rank=True)}

### Same Folds

In [40]:
# Run training of stack models
stack = StackingClassifier([level1, level2, level3])
stack.fit(df_train)


Level 1 preds: lr
fold=0, auc=0.9384901442591425
fold=1, auc=0.9335381634668255
fold=2, auc=0.9280035949592794
fold=3, auc=0.9304714786729175
fold=4, auc=0.9414464824536768
fold=5, auc=0.937182146174952

Level 1 preds: lr_cnt
fold=0, auc=0.950827093427299
fold=1, auc=0.9397791490696997
fold=2, auc=0.9401545861295928
fold=3, auc=0.9398243567308318
fold=4, auc=0.946280093042683
fold=5, auc=0.9494958987764743

Level 1 preds: rf_svd
fold=0, auc=0.8799762710839731
fold=1, auc=0.876132287447353
fold=2, auc=0.8681146123716432
fold=3, auc=0.8732959578283319
fold=4, auc=0.8908581243113617
fold=5, auc=0.8755053614765845

Level 2 preds: lr
fold=0, auc=0.9532762745385923
fold=1, auc=0.9452961040531321
fold=2, auc=0.9431921538861496
fold=3, auc=0.9449924744888774
fold=4, auc=0.9543280690762705
fold=5, auc=0.9518860587205911

Level 2 preds: linreg
fold=0, auc=0.9553720768883092
fold=1, auc=0.9462853486718429
fold=2, auc=0.9446216398977105
fold=3, auc=0.9457900346389555
fold=4, auc=0.954513506312067

<__main__.StackingClassifier at 0x7f81cc37dfd0>

In [41]:
same_train_auc = metrics.roc_auc_score(y_train, stack.predict_proba(df_train)[:, 1])
same_test_auc = metrics.roc_auc_score(y_test, stack.predict_proba(df_test)[:, 1])

print(f"Train AUC (same):", same_train_auc)
print(f"Test AUC  (same):", same_test_auc)

pd.DataFrame(stack.cv_scores_).describe().loc[['mean', 'std']]

Train AUC (same): 0.9960898580840304
Test AUC  (same): 0.9472895067497329


| | lr_1 | lr_cnt_1 | rf_svd_1 | lr_2 | linreg_2 | xgb_2 | blender_3 |
|---|---|---|---|---|---|---|---|
| mean | 0.934855 | 0.944394 | 0.877314 | 0.948829 | 0.950037 | 0.94554 | 0.950326 |
| std | 0.005098 | 0.005121 | 0.007694 | 0.004865 | 0.004958 | 0.004365 | 0.004888 |


In [42]:
experiment_results = {
    'same': {'train': same_train_auc, 'test': same_test_auc}
}

### Different Folds

Now we train the same stack, except that the fold assignments are shuffled after training the level 1 models. This simulates using different cross-validation folds when training higher-level models.

In [43]:
class StackingClassifierShuffledCV:
    """Implements model stacking for classification, reshuffling the
    fold assignments at every level above the first."""
    
    def __init__(self, model_dict_list):
        """Initialize by passing `model_dict` which is a list of dictionaries 
        of name-model pairs for each level."""
        
        self.model_dict_list = model_dict_list
        self.cv_scores_ = {}
        self.metafeatures_ = None
        
    def fit(self, df):
        """Fit classifier. This assumes `df` is a DataFrame with "id", "kfold", 
        "sentiment" (target) columns, followed by features columns."""
        
        df = df.copy()
        
        # Iterating over all stacking levels
        metafeatures = []
        for m in range(len(self.model_dict_list)):
            
            # Get models in current layer
            model_dict = self.model_dict_list[m]
            level = m + 1
            
            # Identify feature columns, i.e. preds of prev. layer
            if m == 0:
                feature_cols = ['review']
            else:
                prev_level_names = self.model_dict_list[m-1].keys()
                feature_cols = [f'{name}_{level-1}' for name in prev_level_names]
                
                # Shuffle folds for level 2 models and up <---------- <!> SHUFFLE FOLDS HERE <!>
                df['kfold'] = random.sample(df.kfold.tolist(), len(df))
            
            # Iterate over models in the current layer
            for model_name in model_dict.keys():
                print(f'\nLevel {level} preds: {model_name}')
                self.cv_scores_[f'{model_name}_{level}'] = []
                model = model_dict[model_name]
                
                # Generate feature for next layer models from OOF preds
                oof_preds = []
                for j in range(df.kfold.nunique()):
                    oof_pred, oof_auc = self._oof_pred(df, feature_cols, model, 
                                                        model_name, fold=j, level=level)
                    oof_preds.append(oof_pred)
                    self.cv_scores_[f'{model_name}_{level}'].append(oof_auc)
                
                pred = pd.concat(oof_preds)
                df = df.merge(pred[['id', f'{model_name}_{level}']], on='id', how='left')   
                metafeatures.append(f'{model_name}_{level}')
        
                # Train models on entire feature columns for inference
                model.fit(df[feature_cols], df.sentiment.values)
        
        self.metafeatures_ = df[metafeatures]
        return self
        
    def predict_proba(self, test_df):
        """Return classification probabilities."""
        
        test_df = test_df.copy()
        
        # Iterate over layers to make predictions
        for m in range(len(self.model_dict_list)):
            
            # Get models for current layer
            model_dict = self.model_dict_list[m]
            level = m + 1
            
            # Get feature columns to use for prediction
            if m == 0:
                feature_cols = ['review']
            else:
                prev_names = self.model_dict_list[m-1].keys()
                feature_cols = [f"{model_name}_{level-1}" for model_name in prev_names]

            # Append predictions to test DataFrame
            for model_name in model_dict.keys():
                model = model_dict[model_name]
                pred = model.predict_proba(test_df[feature_cols])[:, 1] 
                test_df.loc[:, f"{model_name}_{level}"] = pred
                    
        # Return last predictions
        return np.c_[1 - pred, pred]
        
    def _oof_pred(self, df, feature_cols, model, model_name, fold, level):
        "Train on K-1 folds, predict on fold K. Return OOF predictions with IDs."

        # Get folds; include ID and target cols, and feature cols
        df_trn = df[df.kfold != fold][['id', 'sentiment']+feature_cols]
        df_oof = df[df.kfold == fold][['id', 'sentiment']+feature_cols]
        
        # Fit model. 
        model.fit(df_trn[feature_cols], df_trn.sentiment.values)
        oof_pred = model.predict_proba(df_oof[feature_cols])[:, 1] 
        auc = metrics.roc_auc_score(df_oof.sentiment.values, oof_pred)
        print(f"fold={fold}, auc={auc}")

        # Return OOF predictions with ids
        df_oof.loc[:, f"{model_name}_{level}"] = oof_pred
        return df_oof[["id", f"{model_name}_{level}"]], auc

:::{danger}
The implementation of `StackingClassifier` has a side effect: the models inside the dictionaries are trained in place. Hence, if we train another stack using the same model dictionaries (as we do here for the shuffled version of the stacker), then the models inside the dictionaries will be retrained. This means calling `stack.predict_proba(df_test)` will yield **different results** before and after training `stack_shuffled`! As usual, the stateful approach is error-prone. We can modify the `StackingClassifier` to instead save a list of model dictionaries that are *clones* of the stacked models. This allows all state to be localized inside the stacked model. In fact, [sklearn does exactly this](https://github.com/scikit-learn/scikit-learn/blob/2beed5584/sklearn/model_selection/_search.py#L765).
:::
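The cloning idea can be sketched with `sklearn.base.clone`, which returns an unfitted copy of an estimator with the same parameters. The helper `clone_model_dicts` below is hypothetical and not part of the notebook's `StackingClassifier`; the data is synthetic and a single logistic regression stands in for a whole level.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Shared "template" dictionary, as passed to both stackers in the experiment.
template = {"lr": LogisticRegression(max_iter=1000)}

def clone_model_dicts(model_dict_list):
    """Hypothetical helper: unfitted, independent copies of every model per level."""
    return [
        {name: clone(model) for name, model in level.items()}
        for level in model_dict_list
    ]

# Each stack gets its own copies, so fitting one cannot disturb the other.
levels_a = clone_model_dicts([template])
levels_b = clone_model_dicts([template])

levels_a[0]["lr"].fit(X, y)
coef_before = levels_a[0]["lr"].coef_.copy()

# Training the second stack on different data leaves the first one untouched.
levels_b[0]["lr"].fit(X[:100], y[:100])
coef_after = levels_a[0]["lr"].coef_
```

A stacker's `__init__` could call such a helper once and store the result, so that the caller's dictionaries are never fitted.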

In [44]:
# Run training of stack models 
stack_shuffled = StackingClassifierShuffledCV([level1, level2, level3])
stack_shuffled.fit(df_train)


Level 1 preds: lr
fold=0, auc=0.9384901442591425
fold=1, auc=0.9335381634668255
fold=2, auc=0.9280035949592794
fold=3, auc=0.9304714786729175
fold=4, auc=0.9414464824536768
fold=5, auc=0.937182146174952

Level 1 preds: lr_cnt
fold=0, auc=0.950827093427299
fold=1, auc=0.9397791490696997
fold=2, auc=0.9401545861295928
fold=3, auc=0.9398243567308318
fold=4, auc=0.946280093042683
fold=5, auc=0.9494958987764743

Level 1 preds: rf_svd
fold=0, auc=0.8818237726745374
fold=1, auc=0.8812812894942149
fold=2, auc=0.8704255547449934
fold=3, auc=0.8757854977279438
fold=4, auc=0.8840824637587228
fold=5, auc=0.8688171984934575

Level 2 preds: lr
fold=0, auc=0.9455989128995296
fold=1, auc=0.9517838944259236
fold=2, auc=0.9461596144163118
fold=3, auc=0.9439228206431949
fold=4, auc=0.9574097745579195
fold=5, auc=0.9484029784852765

Level 2 preds: linreg
fold=0, auc=0.947067189488291
fold=1, auc=0.9530462807823629
fold=2, auc=0.9461995932959381
fold=3, auc=0.9452739131123667
fold=4, auc=0.959042708329582

<__main__.StackingClassifierShuffledCV at 0x7f81cc37d890>

In [45]:
shuffled_train_auc = metrics.roc_auc_score(y_train, stack_shuffled.predict_proba(df_train)[:, 1])
shuffled_test_auc = metrics.roc_auc_score(y_test, stack_shuffled.predict_proba(df_test)[:, 1])

print(f"Train AUC (shuffled):", shuffled_train_auc)
print(f"Test AUC  (shuffled):", shuffled_test_auc)

experiment_results['shuffled'] = {'train': shuffled_train_auc, 'test': shuffled_test_auc}

Train AUC (shuffled): 0.9962149081453051
Test AUC  (shuffled): 0.9468861435873658


In [46]:
pd.DataFrame(experiment_results)

| | same | shuffled |
|---|---|---|
| train | 0.99609 | 0.996215 |
| test | 0.94729 | 0.946886 |


Observe that the train AUC increased slightly (0.99609 to 0.99622) while the test AUC decreased (0.94729 to 0.94689) when using different folds between layers. This is the pattern expected from overfitting. Moreover, the CV scores in the two tables below show a decrease for `xgb_2` and `blender_3` ($\sim 2 \times 10^{-4}$), while `lr_2` and `linreg_2` are essentially unchanged. These differences are small compared with the fold-to-fold standard deviation ($\sim 5 \times 10^{-3}$). The standard deviation is also generally higher with shuffling, which indicates worse fold stability.

In [47]:
pd.DataFrame(stack_shuffled.cv_scores_).describe().loc[['mean', 'std']]

| | lr_1 | lr_cnt_1 | rf_svd_1 | lr_2 | linreg_2 | xgb_2 | blender_3 |
|---|---|---|---|---|---|---|---|
| mean | 0.934855 | 0.944394 | 0.877036 | 0.94888 | 0.950089 | 0.945332 | 0.95013 |
| std | 0.005098 | 0.005121 | 0.006378 | 0.004983 | 0.005223 | 0.005424 | 0.004803 |


In [48]:
pd.DataFrame(stack.cv_scores_).describe().loc[['mean', 'std']]

| | lr_1 | lr_cnt_1 | rf_svd_1 | lr_2 | linreg_2 | xgb_2 | blender_3 |
|---|---|---|---|---|---|---|---|
| mean | 0.934855 | 0.944394 | 0.877314 | 0.948829 | 0.950037 | 0.94554 | 0.950326 |
| std | 0.005098 | 0.005121 | 0.007694 | 0.004865 | 0.004958 | 0.004365 | 0.004888 |


### Conclusion

The empirical results above suggest that we should use the **same folds** across levels of stacking. Indeed, the same is recommended by GM Abhishek Thakur in his book [AAAMLP](https://github.com/abhishekkrthakur/approachingalmost/blob/master/AAAMLP.pdf). The following theoretical example shows how overfitting can arise when using different folds: the second-stage model takes advantage of a relationship between the ground truth and the first-stage predictions that does not generalize to the test set.

Consider a dataset $\{(x_1, t_1), (x_2, t_2), \ldots, (x_{10}, t_{10})\}$ with five folds such that the first fold is $F_1 = (x_1, x_2)$. Let $x_1 \mapsto y_1$ and $x_2 \mapsto y_2$ where the mapping is trained on $F_{\neg 1} = (x_3, \ldots, x_{10}).$ We can think of modelling on $F_{\neg 1}$ as defining some rule or distribution that the points in $F_1$ are compared against. Suppose we reshuffle folds in the next level such that the first fold is $G_1 = (y_1, y_{10}).$ Then, the model trained on $G_{\neg 1} = (y_2, \ldots, y_9)$ overfits slightly since $y_2$ is modelled using the ground truths $(x_3, t_3), \ldots, (x_{9}, t_9).$ This doesn't happen if we keep the same cross-validation folds. Overfitting can be observed at the fold level by noticing that validation scores with shuffled folds are generally lower.

Theoretically (as mentioned in the [Kaggle Guide to Model Stacking](https://datasciblog.github.io/2016/12/27/a-kagglers-guide-to-model-stacking-in-practice/)), there is always some leakage if you train a second level model on the same training set, which you used to derive the first stage predictions. This is because you used the ground truth to get those first stage predictions, and now you take those predictions as input, and try to predict the same ground truth. However, this leakage doesn't seem to be significant in practice.

## TODO

* Parallelize stacker training [using joblib](https://www.youtube.com/watch?v=Ny3O4VpACkc&ab_channel=AbhishekThakur).
* Modify `StackingClassifier` to keep a dictionary of model clones. 