# Ensembling: Blending and Stacking

In this notebook, we demonstrate how blending and stacking machine learning models can improve on the scores of the individual models, so that the whole is greater than the sum of its parts. Recall that the models should be as uncorrelated as possible for this to work well. We show that stacking mainly requires a good cross-validation strategy between levels of prediction. In particular, we show that maintaining the same cross-validation folds between levels helps limit overfitting.

In [156]:
import pandas as pd
import numpy as np
from sklearn import model_selection, linear_model, metrics, decomposition, ensemble
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from scipy.optimize import fmin
from functools import partial
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin

## Getting the Dataset

We do not care too much about the dataset itself; the one used here is clean and needs no special handling. Each row is a movie review (text) along with its sentiment label. We will build a **sentiment classifier** using an ensemble of three models.

In [157]:
df = pd.read_csv('../input/kumarmanoj-bag-of-words-meets-bags-of-popcorn/labeledTrainData.tsv', 
                 sep='\t', encoding='ISO-8859-1')
df.head()

|  | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... |


In [158]:
len(df)

25000

## Creating Cross-Validation Folds

Here we create cross-validation folds. These are very important both for evaluating models and for creating Level 1 features that are not overfitted. The sixth fold (`kfold == 5`) is set aside as a held-out test set.

In [159]:
df.loc[:, 'kfold'] = -1 
df = df.sample(frac=1.0).reset_index(drop=True)
y = df['sentiment'].values

skf = model_selection.StratifiedKFold(n_splits=6)
for f, (t_, v_) in enumerate(skf.split(X=df, y=y)):
    df.loc[v_, "kfold"] = f

In [160]:
df.kfold.value_counts()

0    4167
1    4167
2    4167
3    4167
4    4166
5    4166
Name: kfold, dtype: int64
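The fold assignment above is not seeded, so the exact folds change between runs. A minimal sketch of a reproducible variant on a toy frame follows; the `seed` value and the toy data are illustrative assumptions and not part of the original notebook. It also checks that stratification keeps the class ratio equal across folds.

```python
import numpy as np
import pandas as pd
from sklearn import model_selection

seed = 42  # hypothetical seed, chosen only for reproducibility
toy = pd.DataFrame({"sentiment": np.array([0, 1] * 30)})  # 60 rows, balanced

# Shuffle rows reproducibly, then assign stratified folds as in the notebook
toy = toy.sample(frac=1.0, random_state=seed).reset_index(drop=True)
toy["kfold"] = -1
skf = model_selection.StratifiedKFold(n_splits=6)
for f, (_, v_) in enumerate(skf.split(X=toy, y=toy["sentiment"].values)):
    toy.loc[v_, "kfold"] = f

# Size and positive rate of each fold
fold_stats = toy.groupby("kfold")["sentiment"].agg(["size", "mean"])
print(fold_stats)
```

Passing `random_state` to `sample` would be enough to make the notebook's own folds repeatable, at the cost of changing the exact scores reported here.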

In [161]:
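# Hold out fold 5 as the test set. Copy so that columns added later
# are written to independent frames and not to slices of the original.
df_test = df[df.kfold == 5].copy()
df = df[df.kfold < 5].copy()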

## Training Base Models

We train three models to make Level 1 predictions. The resulting feature set consists of three columns, one per base model, holding the predicted probability of the positive class. First, let us define some helper functions and custom transformers so that we can easily create pipelines that take in the whole train and test dataframes without having to worry about the correct format.

In [162]:
class TfidfVectorizerPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self, max_features):
        self.tfv = TfidfVectorizer(max_features=max_features)
        
    def fit(self, X, y=None):
        self.tfv.fit(X.review)
        return self
    
    def transform(self, X):
        return self.tfv.transform(X.review)
    
    
class CountVectorizerPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.cvec = CountVectorizer()
        
    def fit(self, X, y=None):
        self.cvec.fit(X.review)
        return self
    
    def transform(self, X):
        return self.cvec.transform(X.review)
    
    
class TruncatedSVDPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self, n_components):
        self.svd = decomposition.TruncatedSVD(n_components=n_components)
        
    def fit(self, X, y=None):
        self.svd.fit(X)
        return self
    
    def transform(self, X):
        return self.svd.transform(X)
    

class StandardScalerPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self, cols):
        self.sc = StandardScaler()
        self.cols = cols
        
    def fit(self, X, y=None):
        self.sc.fit(X[self.cols])
        return self
    
    def transform(self, X):
        return self.sc.transform(X[self.cols])
        
        
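class LinearRegressionModel(BaseEstimator, TransformerMixin):
    """Linear regression exposed through `predict_proba` for AUC scoring."""

    def __init__(self):
        self.lr = linear_model.LinearRegression()
    
    def fit(self, X, y=None):
        self.lr.fit(X, y)
        return self
        
    def predict_proba(self, X):
        # The regression output is a score, not a probability. It is duplicated
        # in both columns only so that `[:, 1]` works as for a classifier.
        pred = self.lr.predict(X)
        return np.c_[pred, pred]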

We also define `stack_oof_preds`, which appends one column of out-of-fold (OOF) predictions for a given model to the dataframe. The predictions for each fold are computed by `oof_predictions`.

In [None]:
def oof_predictions(model_pipe, model_name, fold):
    "Train on K-1 folds, predict on fold K. Return predictions."
    
    # Get folds
    df_train = df[df.kfold != fold].reset_index(drop=True)
    df_valid = df[df.kfold == fold].reset_index(drop=True)
    
    # Fit model
    model_pipe.fit(df_train, df_train.sentiment.values)
    
    # Predict and evaluate model on `fold`
    pred = model_pipe.predict_proba(df_valid)[:, 1]    
    auc = metrics.roc_auc_score(df_valid.sentiment.values, pred)
    
    print(f"fold={fold}, auc={auc}")
    
    # Return OOF predictions with ids
    df_valid.loc[:, f"{model_name}_pred"] = pred
    return df_valid[["id", f"{model_name}_pred"]]


def stack_oof_preds(df, model_pipe, model_name):
    "Append OOF `model_pipe` predictions as new column on dataframe `df`."
    
    # Make OOF predictions for each fold
    dfs = []
    print(f'{model_name}')
    for j in range(5):
        temp_df = oof_predictions(model_pipe, model_name, fold=j)
        dfs.append(temp_df)
    m = pd.concat(dfs)
    
    # Merge OOF predictions to `df`. Replace if existing (for fast dev cycles in kernel)
    if f'{model_name}_pred' in df.columns:
        df.drop(f'{model_name}_pred', axis=1, inplace=True)
    
    df = df.merge(m[['id', f'{model_name}_pred']], on='id', how='left')
    return df
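For comparison, scikit-learn's `cross_val_predict` produces the same kind of out-of-fold predictions in one call. The sketch below is not the notebook's method: it uses toy numeric data from `make_classification` in place of the review dataframe, and a plain logistic regression in place of the pipelines defined above.

```python
import numpy as np
from sklearn import linear_model, metrics, model_selection
from sklearn.datasets import make_classification

# Toy data standing in for the vectorized reviews (assumption)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

skf = model_selection.StratifiedKFold(n_splits=5)
model = linear_model.LogisticRegression(max_iter=1000)

# Each row is predicted by a model that never saw it during training
oof_pred = model_selection.cross_val_predict(
    model, X, y, cv=skf, method="predict_proba"
)[:, 1]

print("OOF AUC:", metrics.roc_auc_score(y, oof_pred))
```

The hand-written helpers have one advantage for this notebook: they read the fold assignment from the `kfold` column, which makes it easy to reuse exactly the same folds at the next level.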

### Base model 1: LogReg + TF-IDF

In [163]:
lr_pipe = make_pipeline(
    TfidfVectorizerPreprocessor(max_features=1000),
    linear_model.LogisticRegression()
)

df = stack_oof_preds(df, lr_pipe, "lr")

lr
fold=0, auc=0.9338189695763991
fold=1, auc=0.9352750950708735
fold=2, auc=0.934032976946177
fold=3, auc=0.9285153186889942
fold=4, auc=0.9389834586687974


### Base model 2: LogReg + CountVectorizer

In [164]:
lr_cnt_pipe = make_pipeline(
    CountVectorizerPreprocessor(),
    linear_model.LogisticRegression(solver='liblinear')
)

df = stack_oof_preds(df, lr_cnt_pipe, "lr_cnt")

lr_cnt
fold=0, auc=0.9408010464015892
fold=1, auc=0.941410587306253
fold=2, auc=0.9454018593070861
fold=3, auc=0.9435273943255105
fold=4, auc=0.9490850077058897


### Base model 3: RF + SVD

In [165]:
rf_svd_pipe = make_pipeline(
    TfidfVectorizerPreprocessor(max_features=None),
    TruncatedSVDPreprocessor(n_components=120),
    ensemble.RandomForestClassifier(n_estimators=100, n_jobs=-1)
)

df = stack_oof_preds(df, rf_svd_pipe, "rf_svd")

rf_svd
fold=0, auc=0.8799584517016005
fold=1, auc=0.8788462583955851
fold=2, auc=0.8775649785347613
fold=3, auc=0.8730771587561496
fold=4, auc=0.8778902848171503


Check correlations:

In [166]:
level1_cols = ['lr_pred', 'lr_cnt_pred', 'rf_svd_pred']
pd.DataFrame(df[level1_cols]).corr()

|  | lr_pred | lr_cnt_pred | rf_svd_pred |
|---|---|---|---|
| lr_pred | 1.0 | 0.888219 | 0.828686 |
| lr_cnt_pred | 0.888219 | 1.0 | 0.723146 |
| rf_svd_pred | 0.828686 | 0.723146 | 1.0 |
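To see why low correlation matters, here is a back-of-the-envelope calculation under a simplifying assumption that is not from the notebook: two models whose prediction errors $e_1, e_2$ have zero mean, equal variance $\sigma^2$, and correlation $\rho$. The error of their simple average then has variance

$$
\operatorname{Var}\left(\frac{e_1 + e_2}{2}\right) = \frac{\sigma^2 + \sigma^2 + 2\rho\sigma^2}{4} = \frac{(1 + \rho)\,\sigma^2}{2}.
$$

With $\rho = 1$ averaging gains nothing, while $\rho = 0$ halves the variance. As a rough guide only (the table reports correlations of predictions, not of errors), $\rho \approx 0.89$ gives a factor of about $0.94$ and $\rho \approx 0.72$ a factor of about $0.86$, so we should expect modest gains from this ensemble.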


## Blending

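In AAAMLP, Abhishek uses `glob` at this point to collect the prediction files, since he saved each model's predictions to disk as a separate DataFrame. In this notebook the predictions are already columns of `df`, so the snippet is shown only for reference.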

```python
import glob

files = glob.glob("../model_preds/*.csv")
df = None
for f in files:
    if df is None:
        df = pd.read_csv(f)
    else:
        temp_df = pd.read_csv(f)
        df = df.merge(temp_df, on="id", how="left")
```

In [167]:
target = df.sentiment.values

# ROC AUC depends only on the ordering of scores, so we don't bother dividing by the total weight
avg_preds = (df[level1_cols] * [1, 1, 1]).sum(axis=1)
wtd_preds = (df[level1_cols] * [1, 3, 1]).sum(axis=1)
rank_avg_preds = (df[level1_cols].rank() * [1, 1, 1]).sum(axis=1)
rank_wtd_preds = (df[level1_cols].rank() * [1, 3, 1]).sum(axis=1)

print(f"auc (averaged):\t\t", metrics.roc_auc_score(target, avg_preds))
print(f"auc (wtd. avg):\t\t", metrics.roc_auc_score(target, wtd_preds))
print(f"auc (rank avg):\t\t", metrics.roc_auc_score(target, rank_avg_preds)) 
print(f"auc (wtd. rank avg):\t", metrics.roc_auc_score(target, rank_wtd_preds))

auc (averaged):		 0.9474166574197703
auc (wtd. avg):		 0.9486698241918139
auc (rank avg):		 0.9429065112577433
auc (wtd. rank avg):	 0.9488866904401518


### Optimize AUC

We want to find the optimal coefficients for blending.

In [168]:
class OptimizeAUC:
    """Implement blending that maximizes AUC score. 
    Observe that this looks like an sklearn model."""
    
    def __init__(self):
        self.coef_ = None
        
    def fit(self, X, y):
        """Find weights of probability columns."""
        
        # Think of: partial_loss(coef) = _auc(coef, X, y)
        partial_loss = partial(self._auc, X=X, y=y) 
        
        # Initialize coefficients for the search (random point on the simplex)
        init_coef = np.random.dirichlet(np.ones(X.shape[1]))
        
        # Compute best coefficients for blending (fmin uses Nelder-Mead)
        self.coef_ = fmin(partial_loss, init_coef, disp=True) 
        
    def predict_proba(self, X):
        """Return blended probabilities for class 0 and class 1."""
        
        x_coef = X * self.coef_
        predictions = np.sum(x_coef, axis=1)
        return np.c_[1-predictions, predictions]
    
    def _auc(self, coef, X, y):
        """Compute AUC of blended positive predict probas.
        X: probability features columns
        y: targets
        coef: weights of columns of X."""
        
        x_coef = X * coef
        predictions = np.sum(x_coef, axis=1)
        auc_score = metrics.roc_auc_score(y, predictions)
        
        # Negative AUC since we use minimizer
        return -1.0 * auc_score 

In [169]:
# Example: usage of fmin and partial
# Prints the minimum value 3.000 and returns the minimizer, i.e. x = 0
def f(x, y):
    return x**2 + y

print(fmin(partial(f, y=3), 100, disp=True))

Optimization terminated successfully.
         Current function value: 3.000000
         Iterations: 24
         Function evaluations: 48
[0.]


In [170]:
def find_best_coef(df, cols, fold, opt):
    """Helper function for optimizer class. Here opt needs only 
    to implement two methods and an attribute. Basically, to look
    like an sklearn model:
        - fit           (method: find best coef)
        - predict_proba (method: predicts using best coef)
        - coef_         (attr: best coef)
        
    Return best coef obtained on the ~`fold` subset. The evaluation on
    `fold` using the best coef found is printed.
    """
    
    # Get train and valid folds of level predictions
    P_train = df[df.kfold != fold][cols]
    y_train = df[df.kfold != fold].sentiment.values
    P_valid = df[df.kfold == fold][cols]
    y_valid = df[df.kfold == fold].sentiment.values
    
    # Find best coef using opt on xtrain subset
    opt.fit(P_train, y_train)
    
    # Make prediction on xvalid subset using optimal coefs
    pred = opt.predict_proba(P_valid)[:, 1]
    auc = metrics.roc_auc_score(y_valid, pred)
    
    print(f"fold={fold} auc={auc}")
    print(opt.coef_)
    print()
    
    return opt.coef_

In [171]:
# Average best coef for each fold. Use average as final blending coefficients
coefs = []
for j in range(5):
    opt = OptimizeAUC()
    coefs.append(find_best_coef(df, level1_cols, fold=j, opt=opt))

best_coefs = sum(coefs)/5 

Optimization terminated successfully.
         Current function value: -0.949795
         Iterations: 39
         Function evaluations: 80
fold=0 auc=0.9459351500078785
[0.19805461 0.45966138 0.07145728]

Optimization terminated successfully.
         Current function value: -0.949316
         Iterations: 45
         Function evaluations: 98
fold=1 auc=0.9479026817035446
[0.29371511 0.63384135 0.10039952]

Optimization terminated successfully.
         Current function value: -0.948747
         Iterations: 56
         Function evaluations: 113
fold=2 auc=0.9502855581653142
[0.15863027 0.32254164 0.04235994]

Optimization terminated successfully.
         Current function value: -0.949593
         Iterations: 54
         Function evaluations: 108
fold=3 auc=0.9468054620025194
[0.24773703 0.4791501  0.09503643]

Optimization terminated successfully.
         Current function value: -0.947791
         Iterations: 70
         Function evaluations: 134
fold=4 auc=0.9543175683913555
[0.03706

In [172]:
# Final result! Overall train AUC for blended level 1 predictions.
# We simply average best coeffs found on each fold. Might be suboptimal.

blended_preds = (df[level1_cols] * best_coefs).sum(axis=1)
print("Train blended AUC:", metrics.roc_auc_score(df.sentiment.values, blended_preds))

Train blended AUC: 0.9490374176894534
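One reason plain averaging might be suboptimal: AUC depends only on the ordering of the blended scores, so multiplying a coefficient vector by a positive constant leaves AUC unchanged, and the optimizer can return vectors at different scales in each fold (compare folds 1 and 2 above). A possible refinement, sketched here as an assumption and not as what the notebook does, is to rescale each fold's vector to sum to 1 before averaging. The coefficients are copied from the output above for folds 0 to 3; the fold 4 output is truncated, so it is left out. The scale-invariance check uses synthetic scores.

```python
import numpy as np
from sklearn import metrics

# Coefficients printed above for folds 0-3 (fold 4 is truncated, so omitted)
fold_coefs = np.array([
    [0.19805461, 0.45966138, 0.07145728],
    [0.29371511, 0.63384135, 0.10039952],
    [0.15863027, 0.32254164, 0.04235994],
    [0.24773703, 0.4791501, 0.09503643],
])

# Rescale each fold's vector to sum to 1 so that all folds share one scale
normalized = fold_coefs / fold_coefs.sum(axis=1, keepdims=True)
avg_coef = normalized.mean(axis=0)
print(normalized.round(3))
print(avg_coef.round(3))

# Sanity check on synthetic scores: a positive rescaling does not change AUC
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
P = 0.3 * y[:, None] + 0.7 * rng.random((200, 3))  # toy score columns
auc_a = metrics.roc_auc_score(y, P @ fold_coefs[0])
auc_b = metrics.roc_auc_score(y, P @ (2 * fold_coefs[0]))
print(auc_a, auc_b)
```

After normalization the four vectors look much more alike than the raw output suggests, which makes their average easier to interpret as a set of blending weights.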


In [173]:
# Checking if ranking improves predictions

blended_preds = (df[level1_cols].rank() * best_coefs).sum(axis=1)
print("Train blended rank AUC:", metrics.roc_auc_score(df.sentiment.values, blended_preds))

Train blended rank AUC: 0.9496022301808758


Observe that the blended model has a better train (out-of-fold) AUC than any of the individual models! Using ranks is even better here. Each probability is replaced by its rank index within its column, which is a good trick for AUC, since AUC only measures the probability that a negative example is ranked lower than a positive one. Note that for a single model, ranking does not affect the score; it only matters for ensembles, where it puts the columns on a common scale. (See below.)
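The claim about ranks can be checked on synthetic data. The sketch below uses two made-up score columns on very different scales (an assumption for illustration; none of this data comes from the notebook). Ranking leaves the AUC of a single column unchanged, but changes the AUC of the blend because it puts both columns on the same scale.

```python
import numpy as np
import pandas as pd
from sklearn import metrics

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)

# Two toy "models" with very different score scales (assumption)
preds = pd.DataFrame({
    "a": 0.3 * y + rng.random(1000),            # spread over [0, 1.3)
    "b": 0.001 * (0.3 * y + rng.random(1000)),  # squeezed near zero
})

# Single model: ranking is a monotone map, so AUC is unchanged
auc_raw = metrics.roc_auc_score(y, preds["a"])
auc_rank = metrics.roc_auc_score(y, preds["a"].rank())
print(auc_raw, auc_rank)

# Blend: the raw sum is dominated by "a"; the rank sum weights both equally
auc_blend_raw = metrics.roc_auc_score(y, preds.sum(axis=1))
auc_blend_rank = metrics.roc_auc_score(y, preds.rank().sum(axis=1))
print(auc_blend_raw, auc_blend_rank)
```

Ranking is not a guaranteed win: in the results above, the equal-weight rank average scored lower than the plain average (0.9429 vs 0.9474), while the weighted and optimized rank blends scored slightly higher.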

### Blending Inference

In [174]:
# Refit models on whole train set
lr_pipe.fit(df, df.sentiment.values)
lr_cnt_pipe.fit(df, df.sentiment.values)
rf_svd_pipe.fit(df, df.sentiment.values)



Pipeline(steps=[('tfidfvectorizerpreprocessor',
                 TfidfVectorizerPreprocessor(max_features=None)),
                ('truncatedsvdpreprocessor',
                 TruncatedSVDPreprocessor(n_components=None)),
                ('randomforestclassifier', RandomForestClassifier(n_jobs=-1))])

In [175]:
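# Look up each fitted pipeline by name instead of using eval
base_pipes = {'lr': lr_pipe, 'lr_cnt': lr_cnt_pipe, 'rf_svd': rf_svd_pipe}
for model_name, pipe in base_pipes.items():
    df_test[model_name+'_pred'] = pipe.predict_proba(df_test)[:, 1]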

blended_test_preds = (df_test[level1_cols] * best_coefs).sum(axis=1)
print("Test blended AUC:", metrics.roc_auc_score(df_test.sentiment.values, blended_test_preds))

Test blended AUC: 0.9510487592561137


In comparison, let's see test performance of individual models.

In [176]:
for model_name in ['lr', 'lr_cnt', 'rf_svd']:
    print(f"Test {model_name} AUC:", metrics.roc_auc_score(df_test.sentiment.values, df_test[model_name+'_pred']))

Test lr AUC: 0.935395673869509
Test lr_cnt AUC: 0.945592062852956
Test rf_svd AUC: 0.876999964737517


Awesome!!! This is our current best test score.

## Stacking

Instead of fixed blending coefficients, we now learn the weights with a model, e.g. logistic regression, which passes the weighted sum of the Level 1 predictions through a sigmoid. This is **stacking**, since we train a Level 2 model on Level 1 predictions. Note that we use the **same folds** to evaluate the Level 2 model; this is the recommended way to do Level 2 stacking.

In [177]:
def oof_score(fold, model):
    """Get out-of-fold AUC score of model."""
    
    # Get train and valid folds of Level 1 predictions
    X_train = df[df.kfold != fold][level1_cols]
    y_train = df[df.kfold != fold].sentiment.values
    
    X_valid = df[df.kfold == fold][level1_cols]
    y_valid = df[df.kfold == fold].sentiment.values
    
    # Fit the meta model on the training folds
    model.fit(X_train, y_train)
    
    # Predict on the validation fold
    pred = model.predict_proba(X_valid)[:, 1]
    auc = metrics.roc_auc_score(y_valid, pred)
    
    print(f"fold={fold} auc={auc}")
    
    return auc

### Meta model 1: Logistic Regression

In [178]:
# cross validate logistic regression model
cv_scores = []
for j in range(5):
    model = linear_model.LogisticRegression()
    cv_scores.append(oof_score(fold=j, model=model))
    
print(sum(cv_scores) / 5)

fold=0 auc=0.945384812433713
fold=1 auc=0.9473373705243895
fold=2 auc=0.9491648414226123
fold=3 auc=0.9446545612365157
fold=4 auc=0.9534996170678716
0.9480082405370205


### Meta model 2: Linear Regression

In [179]:
# cross validate linear regression model on standardized features
# (OLS is solved directly, so scaling is a convenience here, not a convergence requirement)
lin_reg_pipe = make_pipeline(
    StandardScalerPreprocessor(level1_cols),
    LinearRegressionModel()
)
cv_scores = []
for j in range(5):
    model = lin_reg_pipe
    cv_scores.append(oof_score(fold=j, model=model))
    
print(sum(cv_scores) / 5)

fold=0 auc=0.9459627935863211
fold=1 auc=0.9479386183555204
fold=2 auc=0.950232113913658
fold=3 auc=0.9465230367760953
fold=4 auc=0.9543042009141052
0.9489921527091401


### Meta model 3: XGBClassifier

In [180]:
# cross validate XGBoost classifier 
cv_scores = []
for j in range(5):
    model = XGBClassifier(eval_metric="logloss", use_label_encoder=False)
    cv_scores.append(oof_score(fold=j, model=model))
    
print(sum(cv_scores) / 5)

fold=0 auc=0.9418400994063082
fold=1 auc=0.9447695124502069
fold=2 auc=0.9454844444976839
fold=3 auc=0.9407664919285357
fold=4 auc=0.9502164033235234
0.9446153903212517


### Stacking Inference

Recall that the base models have been refitted on the whole train set and have already made predictions on the test set. We now check the test scores of the metamodels, which should be close to their cross-validated scores.

In [181]:
df_test.head()

|  | id | sentiment | review | kfold | lr_pred | lr_cnt_pred | rf_svd_pred |
|---|---|---|---|---|---|---|---|
| 20794 | 8390_8 | 1 | My first Fassbinder was a wonderful experience... | 5 | 0.977814 | 0.998684 | 0.65 |
| 20797 | 7316_10 | 1 | This film is one of the best of all time, cert... | 5 | 0.996353 | 1.0 | 0.7 |
| 20800 | 5784_8 | 1 | I've noticed that a lot of people who post on ... | 5 | 0.988561 | 0.999921 | 0.6 |
| 20801 | 10546_9 | 1 | I chanced upon this movie because I had a free... | 5 | 0.757632 | 0.930755 | 0.57 |
| 20803 | 10796_9 | 1 | If you're a a fan of either or both Chuck Norr... | 5 | 0.842108 | 0.961403 | 0.6 |


In [182]:
# Define metamodels
logreg = linear_model.LogisticRegression()
linreg = make_pipeline(
    StandardScalerPreprocessor(level1_cols),
    LinearRegressionModel()
)
xgbclf = XGBClassifier(eval_metric="logloss", use_label_encoder=False)

# Fit on level 1 features
logreg.fit(df[level1_cols], df.sentiment.values)
linreg.fit(df[level1_cols], df.sentiment.values)
xgbclf.fit(df[level1_cols], df.sentiment.values)

# Score inference
metamodels = {
    'logreg': logreg,
    'linreg': linreg,
    'xgbclf': xgbclf
}

for model_name in metamodels.keys():
    stacker_test_preds = metamodels[model_name].predict_proba(df_test[level1_cols])[:, 1]
    print(f"Test {model_name} AUC:", metrics.roc_auc_score(df_test.sentiment.values, stacker_test_preds))

Test logreg AUC: 0.94972607042955
Test linreg AUC: 0.9509284519608591
Test xgbclf AUC: 0.9483125749471814


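Linear regression is the best metamodel (used as a ranking model), with a test AUC of 0.9509. This is on par with, though marginally below, the blended test AUC of 0.9510. The test scores are also close to the cross-validation scores.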
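For reference, scikit-learn ships `StackingClassifier`, which follows the same recipe: the base models are refitted on the full training set for inference, while the final estimator is trained on cross-validated predictions of the base models. The sketch below is an illustration on toy numeric data, not a drop-in replacement for the notebook's pipelines; the choice of base models and their settings is an assumption.

```python
from sklearn import ensemble, linear_model, metrics, model_selection
from sklearn.datasets import make_classification

# Toy numeric data standing in for the review features (assumption)
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

stack = ensemble.StackingClassifier(
    estimators=[
        ("lr", linear_model.LogisticRegression(max_iter=1000)),
        ("rf", ensemble.RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=linear_model.LogisticRegression(),
    cv=model_selection.StratifiedKFold(n_splits=5),  # folds for OOF meta features
    stack_method="predict_proba",
)
stack.fit(X_tr, y_tr)

test_pred = stack.predict_proba(X_te)[:, 1]
print("Test AUC:", metrics.roc_auc_score(y_te, test_pred))
```

The manual version in this notebook is more verbose, but it exposes the Level 1 features and the fold column, which we need for the fold experiment later on.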

:::{note}
Alternatively, we could make predictions on the test dataset using each base model immediately after it is fitted on each fold. In our case, this would generate five sets of test-set predictions for each base model. We would then average the predictions per model to generate the Level 1 meta features for the test set.

One benefit is that this is less time-consuming than the first approach, since we don't have to retrain each model on the full training dataset. It also helps that the train and test meta features should follow a similar distribution. However, the test meta features are likely more accurate in the first approach, since each base model is trained on the full training dataset, as opposed to five models each trained on 80% of it in the second approach.
:::
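The alternative described in the note can be sketched as follows. This is a minimal illustration on toy data with a single logistic regression base model; the data and variable names are assumptions, and the notebook itself uses the first approach.

```python
import numpy as np
from sklearn import linear_model, metrics, model_selection
from sklearn.datasets import make_classification

# Toy data standing in for the train and test frames (assumption)
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, y_train = X[:500], y[:500]
X_test, y_test = X[500:], y[500:]

skf = model_selection.StratifiedKFold(n_splits=5)
oof_pred = np.zeros(len(X_train))
test_pred = np.zeros(len(X_test))

for tr_idx, va_idx in skf.split(X_train, y_train):
    model = linear_model.LogisticRegression(max_iter=1000)
    model.fit(X_train[tr_idx], y_train[tr_idx])
    # Train meta feature: out-of-fold prediction
    oof_pred[va_idx] = model.predict_proba(X_train[va_idx])[:, 1]
    # Test meta feature: accumulate the average over the five fold models
    test_pred += model.predict_proba(X_test)[:, 1] / skf.get_n_splits()

print("OOF AUC: ", metrics.roc_auc_score(y_train, oof_pred))
print("Test AUC:", metrics.roc_auc_score(y_test, test_pred))
```

No model is ever refitted on the full training set here, which is where the time saving comes from.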

## Experiment: Using Same CV Folds for Level 2 Stacking

It is still not clear to me whether using the same folds between levels 1 and 2 affects generalization error. Abhishek recommends using the same folds in AAAMLP. To test whether different folds result in overfitting, we compare CV scores and actual test performance with the original folds and with shuffled folds.

Here we build Level 2 features from the three metamodels, then blend the predicted probabilities of the metamodels.

### Same Folds

In [183]:
class LinearRegressionLevel2(BaseEstimator, TransformerMixin):
    def __init__(self, cols):
        self.cols = cols
        self.lr = linear_model.LinearRegression()
        self.sc = StandardScaler()
    
    def fit(self, X, y):
        self.lr.fit(self.sc.fit_transform(X[self.cols]), y)
        
    def predict_proba(self, X):
        return np.c_[self.lr.predict(self.sc.transform(X[self.cols])), 
                     self.lr.predict(self.sc.transform(X[self.cols]))]
    

class LogisticRegressionLevel2(BaseEstimator, TransformerMixin):
    def __init__(self, cols):
        self.cols = cols
        self.logreg = linear_model.LogisticRegression()
    
    def fit(self, X, y):
        self.logreg.fit(X[self.cols], y)
        
    def predict_proba(self, X):
        return self.logreg.predict_proba(X[self.cols])

    
class XGBClassifierLevel2(BaseEstimator, TransformerMixin):
    def __init__(self, cols):
        self.cols = cols
        self.xgb = XGBClassifier(eval_metric="logloss", use_label_encoder=False)
    
    def fit(self, X, y):
        self.xgb.fit(X[self.cols], y)
        
    def predict_proba(self, X):
        return self.xgb.predict_proba(X[self.cols])

In [184]:
# Define metamodels
linreg = LinearRegressionLevel2(level1_cols)
logreg = LogisticRegressionLevel2(level1_cols)
xgbclf = XGBClassifierLevel2(level1_cols)

# Create level 2 features
metamodels = {
    'logreg': logreg,
    'linreg': linreg,
    'xgbclf': xgbclf
}
level2_cols = [model_name + '_pred' for model_name in metamodels.keys()]
for model_name in metamodels.keys():
    df = stack_oof_preds(df, metamodels[model_name], model_name)

logreg
fold=0, auc=0.945384812433713
fold=1, auc=0.9473373705243895
fold=2, auc=0.9491648414226123
fold=3, auc=0.9446545612365157
fold=4, auc=0.9534996170678716
linreg
fold=0, auc=0.9459627935863211
fold=1, auc=0.9479386183555204
fold=2, auc=0.950232113913658
fold=3, auc=0.9465230367760953
fold=4, auc=0.9543042009141052
xgbclf
fold=0, auc=0.9418400994063082
fold=1, auc=0.9447695124502069
fold=2, auc=0.9454844444976839
fold=3, auc=0.9407664919285357
fold=4, auc=0.9502164033235234


In [185]:
df.head()

|  | id | sentiment | review | kfold | lr_pred | lr_cnt_pred | rf_svd_pred | logreg_pred | linreg_pred | xgbclf_pred |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4336_4 | 0 | I watched this movie and the original Carlitos... | 0 | 0.194985 | 0.01108888 | 0.35 | 0.053552 | 0.063751 | 0.048794 |
| 1 | 6718_2 | 0 | I wanted to see the movie because of an articl... | 0 | 0.274892 | 0.2456256 | 0.6 | 0.209615 | 0.278658 | 0.339733 |
| 2 | 1962_2 | 0 | The only scary thing about this movie is the t... | 0 | 0.005616 | 1.768062e-07 | 0.23 | 0.023239 | -0.024696 | 0.002535 |
| 3 | 2773_1 | 0 | Up to this point, Gentle Rain was the movie I ... | 0 | 0.016342 | 2.387402e-05 | 0.23 | 0.023968 | -0.021198 | 0.003605 |
| 4 | 9713_10 | 1 | This series premiered on the cable TV station ... | 0 | 0.948322 | 0.999805 | 0.69 | 0.962137 | 0.986046 | 0.998475 |


Blend Level 2 predictions.

In [186]:
# Average best coef for each fold. Use average as final blending coefficients
coefs = []
for j in range(5):
    opt = OptimizeAUC()
    coefs.append(find_best_coef(df, level2_cols, fold=j, opt=opt))

best_coefs = sum(coefs)/5 

Optimization terminated successfully.
         Current function value: -0.949651
         Iterations: 87
         Function evaluations: 169
fold=0 auc=0.9460222272799732
[0.01642233 0.85312636 0.1141321 ]

Optimization terminated successfully.
         Current function value: -0.949126
         Iterations: 49
         Function evaluations: 111
fold=1 auc=0.9481351181256181
[0.0098476  0.86634474 0.06720761]

Optimization terminated successfully.
         Current function value: -0.948633
         Iterations: 60
         Function evaluations: 116
fold=2 auc=0.9503219555435971
[0.06304008 0.78625821 0.089259  ]

Optimization terminated successfully.
         Current function value: -0.949586
         Iterations: 39
         Function evaluations: 87
fold=3 auc=0.9461788742244825
[0.24736973 0.67406233 0.13163703]

Optimization terminated successfully.
         Current function value: -0.947411
         Iterations: 46
         Function evaluations: 100
fold=4 auc=0.9541767489327337
[0.3347

In [187]:
# Refit models on whole train set
for model_name in metamodels.keys():
    metamodels[model_name].fit(df, df.sentiment.values)
    df_test[model_name+'_pred'] = metamodels[model_name].predict_proba(df_test)[:, 1]

blended_test_preds = (df_test[level2_cols] * best_coefs).sum(axis=1)
print("Test blended AUC:", metrics.roc_auc_score(df_test.sentiment.values, blended_test_preds))

Test blended AUC: 0.9510374660425746


In [188]:
blended_train_preds = (df[level2_cols] * best_coefs).sum(axis=1)
print("Train blended AUC:", metrics.roc_auc_score(df.sentiment.values, blended_train_preds))

Train blended AUC: 0.9489021078214236


In [189]:
df.head()

|  | id | sentiment | review | kfold | lr_pred | lr_cnt_pred | rf_svd_pred | logreg_pred | linreg_pred | xgbclf_pred |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4336_4 | 0 | I watched this movie and the original Carlitos... | 0 | 0.194985 | 0.01108888 | 0.35 | 0.053552 | 0.063751 | 0.048794 |
| 1 | 6718_2 | 0 | I wanted to see the movie because of an articl... | 0 | 0.274892 | 0.2456256 | 0.6 | 0.209615 | 0.278658 | 0.339733 |
| 2 | 1962_2 | 0 | The only scary thing about this movie is the t... | 0 | 0.005616 | 1.768062e-07 | 0.23 | 0.023239 | -0.024696 | 0.002535 |
| 3 | 2773_1 | 0 | Up to this point, Gentle Rain was the movie I ... | 0 | 0.016342 | 2.387402e-05 | 0.23 | 0.023968 | -0.021198 | 0.003605 |
| 4 | 9713_10 | 1 | This series premiered on the cable TV station ... | 0 | 0.948322 | 0.999805 | 0.69 | 0.962137 | 0.986046 | 0.998475 |


### Different Folds (Shuffled)

We shuffle the `kfold` column and see whether there is a significant drop in test score.

In [190]:
import random
df['kfold'] = random.sample(df.kfold.tolist(), len(df)) 

# Define metamodels
linreg = LinearRegressionLevel2(level1_cols)
logreg = LogisticRegressionLevel2(level1_cols)
xgbclf = XGBClassifierLevel2(level1_cols)

# Create level 2 features
metamodels = {
    'logreg': logreg,
    'linreg': linreg,
    'xgbclf': xgbclf
}
level2_cols = [model_name + '_pred' for model_name in metamodels.keys()]
for model_name in metamodels.keys():
    df = stack_oof_preds(df, metamodels[model_name], model_name)
    
# Get blending coefficients
coefs = []
for j in range(5):
    opt = OptimizeAUC()
    coefs.append(find_best_coef(df, level2_cols, fold=j, opt=opt))

best_coefs = sum(coefs)/5 

# Refit models on whole train set
for model_name in metamodels.keys():
    metamodels[model_name].fit(df, df.sentiment.values)
    df_test[model_name+'_pred'] = metamodels[model_name].predict_proba(df_test)[:, 1]

blended_test_preds = (df_test[level2_cols] * best_coefs).sum(axis=1)
print("Test blended AUC:", metrics.roc_auc_score(df_test.sentiment.values, blended_test_preds))


logreg
fold=0, auc=0.9501040818131891
fold=1, auc=0.9499241183771747
fold=2, auc=0.9499064438524856
fold=3, auc=0.9451329726898312
fold=4, auc=0.9447659104945992
linreg
fold=0, auc=0.9508122435965958
fold=1, auc=0.9511657757755636
fold=2, auc=0.9514574941700232
fold=3, auc=0.9455855446264058
fold=4, auc=0.9457376475847319
xgbclf
fold=0, auc=0.9476719354149082
fold=1, auc=0.9470740654397218
fold=2, auc=0.9472979325474002
fold=3, auc=0.9410262800661775
fold=4, auc=0.9426348899032251
Optimization terminated successfully.
         Current function value: -0.948559
         Iterations: 38
         Function evaluations: 81
fold=0 auc=0.9510200386286818
[0.03254031 0.3080599  0.05749097]

Optimization terminated successfully.
         Current function value: -0.948474
         Iterations: 38
         Function evaluations: 87
fold=1 auc=0.9513122867413105
[0.09500983 0.57026417 0.10258291]

Optimization terminated successfully.
         Current function value: -0.948445
         Iterations: 59

The test score actually dropped, while the train score increased. I repeated this experiment three times with different shuffles and observed the same behavior. Very curious!

In [191]:
blended_train_preds = (df[level2_cols] * best_coefs).sum(axis=1)
print("Train blended AUC:", metrics.roc_auc_score(df.sentiment.values, blended_train_preds))

Train blended AUC: 0.9490388184318046


In [192]:
df.head()

|  | id | sentiment | review | kfold | lr_pred | lr_cnt_pred | rf_svd_pred | logreg_pred | linreg_pred | xgbclf_pred |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4336_4 | 0 | I watched this movie and the original Carlitos... | 3 | 0.194985 | 0.01108888 | 0.35 | 0.055443 | 0.06607 | 0.062589 |
| 1 | 6718_2 | 0 | I wanted to see the movie because of an articl... | 1 | 0.274892 | 0.2456256 | 0.6 | 0.215703 | 0.281514 | 0.559257 |
| 2 | 1962_2 | 0 | The only scary thing about this movie is the t... | 2 | 0.005616 | 1.768062e-07 | 0.23 | 0.023499 | -0.025485 | 0.001189 |
| 3 | 2773_1 | 0 | Up to this point, Gentle Rain was the movie I ... | 3 | 0.016342 | 2.387402e-05 | 0.23 | 0.025219 | -0.017726 | 0.001061 |
| 4 | 9713_10 | 1 | This series premiered on the cable TV station ... | 1 | 0.948322 | 0.999805 | 0.69 | 0.961982 | 0.987177 | 0.99287 |


### Conclusion

The empirical results above indicate that we should use the **same folds** across levels of stacking. Indeed, the same is recommended by GM Abhishek in his book AAAMLP. Searching around Kaggle, I found [this comment](https://www.kaggle.com/general/18793#424642) by Kaggle Master [Trian](https://www.kaggle.com/trian2018) that supports this result:

> The following theoretical example shows that overfitting can happen [when using different folds], due to the second stage model taking advantage of a certain relationship between ground truth and first stage predictions, without this structure having any real meaning, which could generalise well to the test set.

Trian goes on to give an example that I don't understand (even after two bottles of milk chocolate). So instead, let's come up with an explanation in the same spirit, but more abstract.

Suppose we have $x_1, x_2, \ldots, x_{10}$ split into five folds such that $x_1$ and $x_2$ are in the same fold $F_1$. Let $x_1, x_2 \mapsto y_1, y_2$ be the Level 1 predictions of a model trained on $F_{\neg 1} = (x_3, \ldots, x_{10}).$ We can think of $F_{\neg 1}$ as defining the distribution that the points $x_1$ and $x_2$ are compared against. Suppose we reshuffle the folds in the next level such that the first fold is $G_1 = (y_1, y_{10}).$ These points are compared against the distribution of $G_{\neg 1} = (y_2, \ldots, y_9).$ Then the model trained on $G_{\neg 1}$ can overfit slightly, since $y_2$ is too adapted to the rest of the points $y_3, \ldots, y_9$ in $G_{\neg 1}$: $y_2$ was produced by a model trained on $F_{\neg 1}$, which contains $x_3, \ldots, x_9.$


Trian further notes:
> Theoretically (as mentioned in https://datasciblog.github.io/2016/12/27/a-kagglers-guide-to-model-stacking-in-practice/), you always have leakage if you train a second level model on the same training set, which you used to derive the first stage predictions. This is because you used the ground truth to get those first stage predictions, and now you take those predictions as input, and try to predict the same ground truth. However, it seems to me, that this kind of leakage is more theoretical, and does not happen in practice, as long as you keep the same folds, as described above.

**Todo**
* Record CV scores in a DataFrame. This would allow us to calculate standard deviations, which measure how stable the scores are across folds. It would also be an easy reference for seeing what is happening with each component of the ensemble.
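
A minimal sketch of what this could look like, using the Level 2 fold AUCs from the stacking section above, rounded to four decimals. The frame layout and the names `cv_log` and `summary` are hypothetical.

```python
import pandas as pd

# Fold AUCs copied from the metamodel cross-validation output, rounded
cv_log = pd.DataFrame({
    "logreg": [0.9454, 0.9473, 0.9492, 0.9447, 0.9535],
    "linreg": [0.9460, 0.9479, 0.9502, 0.9465, 0.9543],
    "xgbclf": [0.9418, 0.9448, 0.9455, 0.9408, 0.9502],
}, index=pd.Index(range(5), name="fold"))

# One row per model: mean and standard deviation across folds
summary = cv_log.agg(["mean", "std"]).T
print(summary)
```

In the notebook itself, `oof_score` already returns the AUC per fold, so the frame could be filled directly from those return values.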