# Ensembling: Blending and Stacking

In this notebook, we implement stacking of machine learning models. An ensemble of several weakly correlated models often generalizes better than any of its individual models. Stacking mainly requires a good cross-validation strategy between levels of prediction. In particular, we will check empirically whether maintaining the same cross-validation folds between levels reduces overfitting.

In [1]:
import pandas as pd
import numpy as np
import time
import joblib
import random
from joblib import Parallel, delayed
from scipy.optimize import minimize
from xgboost import XGBClassifier

from sklearn import model_selection, linear_model, metrics, decomposition, ensemble
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin, ClassifierMixin, clone

from tqdm.notebook import tqdm as tqdm
from functools import partial, reduce
from typing import List
import warnings
warnings.simplefilter(action='ignore')


NUM_FOLDS = 5

## Dataset

The choice of dataset is not important here; this one is clean and needs no preprocessing. Each row contains the text of a movie review along with its sentiment label. We will build a **sentiment classifier** using an ensemble of three models.

In [2]:
df = pd.read_csv('../input/kumarmanoj-bag-of-words-meets-bags-of-popcorn/labeledTrainData.tsv', sep='\t')
df.head()

|   | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | `\The Classic War of the Worlds\" by Timothy Hi...` |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... |


### Train and test split

In [3]:
df_train, df_test = model_selection.train_test_split(df, test_size=0.20)
print(df_train.shape, df_test.shape)

(20000, 3) (5000, 3)


### Cross-validation folds

Here we create cross-validation folds. These are essential both for evaluating models and for creating Level 1 features that are not overfitted.

In [4]:
df_train.loc[:, 'kfold'] = -1 
df_train = df_train.sample(frac=1.0).reset_index(drop=True)
y = df_train['sentiment'].values

skf = model_selection.StratifiedKFold(n_splits=NUM_FOLDS)
for f, (t_, v_) in enumerate(skf.split(X=df_train, y=y)):
    df_train.loc[v_, "kfold"] = f

In [5]:
df_train.kfold.value_counts()

0    4000
1    4000
2    4000
3    4000
4    4000
Name: kfold, dtype: int64

## Stacking and Blending

We define a class that automates training and prediction of stacked models. Several models can be trained on the training set, and their predicted probabilities can be used as features for a further metamodel called a **stacker**. Observe that this process can be iterated to several more levels. To avoid creating metafeatures that are overfitted to the train set, the metafeatures are generated by out-of-fold (OOF) training and prediction of the models on the features of the previous level. This requires defining cross-validation folds. The same cross-validation folds will be used to generate metafeatures at deeper levels. This will be justified later.

After generating metafeatures, the models are retrained on the whole training set (not just on the train folds), which typically improves predictions on the test set. Finally, prediction on the test set mimics the conditions under which the metafeatures were generated; essentially, the test set acts like an extra validation fold.
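
The out-of-fold procedure described above can be sketched with scikit-learn's `cross_val_predict`. This is only an illustration on synthetic data, not the implementation used in this notebook; the dataset and the variable names (`folds`, `oof`, `final_model`) are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the real features (assumption for illustration)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Fix the fold assignment once so it can be reused at every level
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
folds = list(skf.split(X, y))

# Each row's metafeature comes from a model that never saw that row
oof = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=folds, method="predict_proba"
)[:, 1]

# Refit on the whole training set; this model is used for test-time inference
final_model = LogisticRegression(max_iter=1000).fit(X, y)
```

Here `oof` plays the role of one metafeature column, while `final_model` is what would produce the corresponding column for the test set.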


:::{note}
Alternatively, we could make predictions on the test dataset using each base model immediately after it gets fitted on each fold. In our case, this would generate test-set predictions from five fitted copies of each base model. Then, we would average the predictions per model to generate our level 1 meta features.

One benefit to this is that it’s less time consuming than the first approach (since we don’t have to retrain each model on the full training dataset). It also helps that our train meta features and test meta features should follow a similar distribution. However, the test meta features are likely more accurate in the first approach since each base model was trained on the full training dataset (as opposed to 80% of the training dataset, five times in the 2nd approach).
:::
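
The alternative in the note can be sketched as follows. This is a minimal example on synthetic data; `fold_averaged_metafeatures` is a hypothetical helper written for this illustration and is not part of the notebook's `StackingClassifier`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic data standing in for the review features (assumption)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)


def fold_averaged_metafeatures(model, X_train, y_train, X_test, n_splits=5):
    """Return train OOF predictions and test predictions averaged over fold models."""
    skf = StratifiedKFold(n_splits=n_splits)
    oof = np.zeros(len(X_train))
    test_preds = np.zeros((n_splits, len(X_test)))
    for k, (trn_idx, val_idx) in enumerate(skf.split(X_train, y_train)):
        # Fit a fresh copy on the train folds only
        fold_model = clone(model).fit(X_train[trn_idx], y_train[trn_idx])
        oof[val_idx] = fold_model.predict_proba(X_train[val_idx])[:, 1]
        # Predict the test set right away with this fold's model
        test_preds[k] = fold_model.predict_proba(X_test)[:, 1]
    # Average the per-fold test predictions; no refit on the full training set
    return oof, test_preds.mean(axis=0)


oof_meta, test_meta = fold_averaged_metafeatures(
    LogisticRegression(max_iter=1000), X_train, y_train, X_test
)
```

No model is refitted on the full training set here, which is where the time saving mentioned in the note comes from.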



### Implementation

In [6]:
class StackingClassifier:
    """Implements model stacking for classification."""
    
    def __init__(self, model_dict_list):
        """Initialize by passing a list of dictionaries of name-model pairs 
        for each level."""
        
        self.model_dict_list = model_dict_list
        self.cv_scores_ = {}
        self.metafeatures_ = None
        
    def fit(self, df):
        """Fit classifier. This assumes `df` is a DataFrame with "id", "kfold", 
        "sentiment" (target) columns, followed by features columns."""
        
        df = df.copy()
        
        # Iterating over all stacking levels
        metafeatures = []
        for m in range(len(self.model_dict_list)):
            
            # Get models in current layer
            model_dict = self.model_dict_list[m]
            level = m + 1
            
            # Identify feature columns, i.e. preds of prev. layer
            if m == 0:
                feature_cols = ['review']
            else:
                prev_level_names = self.model_dict_list[m-1].keys()
                feature_cols = [f'{name}_{level-1}' for name in prev_level_names]
            
            # Iterate over models in the current layer
            for model_name in model_dict.keys():
                print(f'\nLevel {level} preds: {model_name}')
                self.cv_scores_[f'{model_name}_{level}'] = []
                model = model_dict[model_name]
                
                # Generate feature for next layer models from OOF preds
                oof_preds = []
                for j in range(df.kfold.nunique()):
                    oof_pred, oof_auc = self._oof_pred(df, feature_cols, model, 
                                                        model_name, fold=j, level=level)
                    oof_preds.append(oof_pred)
                    self.cv_scores_[f'{model_name}_{level}'].append(oof_auc)
                
                pred = pd.concat(oof_preds)
                df = df.merge(pred[['id', f'{model_name}_{level}']], on='id', how='left')   
                metafeatures.append(f'{model_name}_{level}')
        
                # Train models on entire feature columns for inference
                model.fit(df[feature_cols], df.sentiment.values)
        
        self.metafeatures_ = df[metafeatures]
        return self
        
    def predict_proba(self, test_df):
        """Return classification probabilities."""
        
        test_df = test_df.copy()
        
        # Iterate over layers to make predictions
        for m in range(len(self.model_dict_list)):
            
            # Get models for current layer
            model_dict = self.model_dict_list[m]
            level = m + 1
            
            # Get feature columns to use for prediction
            if m == 0:
                feature_cols = ['review']
            else:
                prev_names = self.model_dict_list[m-1].keys()
                feature_cols = [f"{model_name}_{level-1}" for model_name in prev_names]

            # Append predictions to test DataFrame
            for model_name in model_dict.keys():
                model = model_dict[model_name]
                pred = model.predict_proba(test_df[feature_cols])[:, 1] 
                test_df.loc[:, f"{model_name}_{level}"] = pred
                    
        # Return last predictions
        return np.c_[1 - pred, pred]
        
    def _oof_pred(self, df, feature_cols, model, model_name, fold, level):
        "Train on K-1 folds, predict on fold K. Return OOF predictions with IDs."

        # Get folds; include ID and target cols, and feature cols
        df_trn = df[df.kfold != fold][['id', 'sentiment']+feature_cols]
        df_oof = df[df.kfold == fold][['id', 'sentiment']+feature_cols]
        
        # Fit model.
        model.fit(df_trn[feature_cols], df_trn.sentiment.values)
        oof_pred = model.predict_proba(df_oof[feature_cols])[:, 1] 
        auc = metrics.roc_auc_score(df_oof.sentiment.values, oof_pred)
        print(f"fold={fold}, auc={auc}")

        # Return OOF predictions with ids
        df_oof.loc[:, f"{model_name}_{level}"] = oof_pred
        return df_oof[["id", f"{model_name}_{level}"]], auc

### Blending

Let's start with a simple stacked model that takes a weighted average of the prediction probabilities. This method is called **blending**. We will use three base models to generate probabilities. Hopefully these are uncorrelated:
1. Logistic Regression + TF-IDF
2. Logistic Regression + Count Vectorizer
3. Random Forest + TF-IDF + SVD

In [7]:
class ReviewColumnExtractor(BaseEstimator, TransformerMixin):
    """Extract the text column as a Series. Passing X = df_train[['review']]
    (a DataFrame) directly to TfidfVectorizer or CountVectorizer does
    not work as expected."""
    
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return X.review

Initialize base models:

In [8]:
lr = make_pipeline(
    ReviewColumnExtractor(),
    TfidfVectorizer(max_features=1000),
    linear_model.LogisticRegression()
)

lr_cnt = make_pipeline(
    ReviewColumnExtractor(),
    CountVectorizer(),
    linear_model.LogisticRegression(solver='liblinear')
)

rf_svd = make_pipeline(
    ReviewColumnExtractor(),
    TfidfVectorizer(max_features=None),
    decomposition.TruncatedSVD(n_components=120),
    ensemble.RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
)

Run training:

In [9]:
basemodels = {'lr': lr, 'lr_cnt': lr_cnt, 'rf_svd': rf_svd}
stack = StackingClassifier([basemodels])
stack.fit(df_train)


Level 1 preds: lr
fold=0, auc=0.9359574839893711
fold=1, auc=0.9374319843579961
fold=2, auc=0.9327744831936209
fold=3, auc=0.936746484186621
fold=4, auc=0.930962930962931

Level 1 preds: lr_cnt
fold=0, auc=0.944461736115434
fold=1, auc=0.9481647370411844
fold=2, auc=0.9413639853409963
fold=3, auc=0.9453942363485591
fold=4, auc=0.9425671925671927

Level 1 preds: rf_svd
fold=0, auc=0.8823409705852426
fold=1, auc=0.8770538442634611
fold=2, auc=0.8744843436210858
fold=3, auc=0.8865643466410866
fold=4, auc=0.8681016181016181


<__main__.StackingClassifier at 0x7f318f360150>

Check if basemodels are uncorrelated:

In [10]:
stack.metafeatures_.corr()

|   | lr_1 | lr_cnt_1 | rf_svd_1 |
|---|---|---|---|
| lr_1 | 1.0 | 0.888349 | 0.830946 |
| lr_cnt_1 | 0.888349 | 1.0 | 0.727274 |
| rf_svd_1 | 0.830946 | 0.727274 | 1.0 |


The model saves learned probabilistic features:

In [11]:
stack.metafeatures_.head()

|   | lr_1 | lr_cnt_1 | rf_svd_1 |
|---|---|---|---|
| 0 | 0.414118 | 0.761635 | 0.28 |
| 1 | 0.982611 | 0.999355 | 0.86 |
| 2 | 0.102193 | 0.000345 | 0.4 |
| 3 | 0.820037 | 0.999539 | 0.67 |
| 4 | 0.84269 | 0.965732 | 0.69 |


We can also check scores of the base models on each validation fold. This informs us of the stability of the folds and the cross-validation performance of the base models. 

In [12]:
pd.DataFrame(stack.cv_scores_).describe().loc[['mean', 'std']]

|   | lr_1 | lr_cnt_1 | rf_svd_1 |
|---|---|---|---|
| mean | 0.934775 | 0.94439 | 0.877709 |
| std | 0.002778 | 0.002634 | 0.007124 |


Let's try to blend the probabilities using some hand-designed coefficients.

In [13]:
target = df_train.sentiment.values

# AUC is invariant to rescaling, so we don't bother dividing by the total weight
avg_preds = (stack.metafeatures_ * [1, 1, 1]).sum(axis=1)
wtd_preds = (stack.metafeatures_ * [1, 3, 1]).sum(axis=1)
rank_avg_preds = (stack.metafeatures_.rank() * [1, 1, 1]).sum(axis=1)
rank_wtd_preds = (stack.metafeatures_.rank() * [1, 3, 1]).sum(axis=1)

# Calculate AUC over combined OOF preds
print(f"Train OOF-AUC (averaged):     ", metrics.roc_auc_score(target, avg_preds))
print(f"Train OOF-AUC (wtd. avg):     ", metrics.roc_auc_score(target, wtd_preds))
print(f"Train OOF-AUC (rank avg):     ", metrics.roc_auc_score(target, rank_avg_preds)) 
print(f"Train OOF-AUC (wtd. rank avg):", metrics.roc_auc_score(target, rank_wtd_preds))

Train OOF-AUC (averaged):      0.9476556811560451
Train OOF-AUC (wtd. avg):      0.9490704516653627
Train OOF-AUC (rank avg):      0.9429212994516678
Train OOF-AUC (wtd. rank avg): 0.9489835016340605


Since these coefficients are hand-designed, we may want to devise a strategy for automatically finding the optimal coefficients for blending. This is accomplished by the following class.

In [14]:
class Blender(BaseEstimator, ClassifierMixin):
    """Implement blending that maximizes AUC score."""
    
    def __init__(self, rank=False, random_state=42):
        self.coef_ = None
        self.rank = rank
        self.random_state = random_state

    def fit(self, X, y):
        """Find optimal blending coefficients."""
        
        if self.rank:
            X = X.rank()

        self.coef_ = self._optimize_auc(X, y)
        return self

    def predict_proba(self, X):
        """Return blended probabilities for class 0 and class 1."""
        
        if self.rank:
            X = X.rank()
            
        pred = np.sum(X * self.coef_, axis=1)
        return np.c_[1 - pred, pred]

    def _auc(self, coef, X, y):
        """Calculate AUC of blended predict probas."""

        auc = metrics.roc_auc_score(y, np.sum(X * coef, axis=1))
        return -1.0 * auc # min -auc = max auc
    
    def _optimize_auc(self, X, y):
        """Maximize AUC as a bound-constrained optimization problem using Nelder-Mead 
        method with Dirichlet init. 
        
        Reference: 
        https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html
        """
        partial_loss = partial(self._auc, X=X, y=y) 
        rng = np.random.RandomState(self.random_state)
        init_coef = rng.dirichlet(np.ones(X.shape[1]))
        return minimize(partial_loss, init_coef, 
                        method='Nelder-Mead', 
                        bounds=[(0, 1)]*X.shape[1])['x']

This implementation uses `partial` from `functools` and `minimize` from `scipy.optimize` to minimize the negative AUC over coefficients constrained to $[0, 1].$ The initial values of the coefficients are drawn from a Dirichlet distribution $\operatorname{Dir}(\boldsymbol{\alpha})$ with $\boldsymbol{\alpha} = [1, 1, 1]$ (one entry per base model).

In [15]:
target = df_train.sentiment.values

# Blended predictions
blender = Blender()
blender.fit(stack.metafeatures_, target)
combined_oof_preds = (stack.metafeatures_ * blender.coef_).sum(axis=1)

# Blended ranked predictions
blender_rk = Blender(rank=True)
blender_rk.fit(stack.metafeatures_, target)
combined_oof_rk_preds = (stack.metafeatures_.rank() * blender_rk.coef_).sum(axis=1)

print(f"Train OOF-AUC (Blended):    ", metrics.roc_auc_score(target, combined_oof_preds))
print(f"Train OOF-AUC (Blended rk.):", metrics.roc_auc_score(target, combined_oof_rk_preds))

Train OOF-AUC (Blended):     0.9494805318129915
Train OOF-AUC (Blended rk.): 0.9499305419749952


Note that train OOF-AUC is not the same as train AUC; it should be a better approximation of the test AUC. It is also not the same as the cross-validation score. Computing the AUC on the pooled out-of-fold predictions amounts to summing the confusion matrices of all folds at each threshold and computing a single AUC from the result. The cross-validation score instead tracks the confusion matrix of each fold separately, then averages the per-fold AUCs. The two should be similar if error is well-distributed between folds and we are blending probabilities. [^ref]

[^ref]: For some reason OOF-AUC is bad when blending ranking models, e.g. linear regression, with usual classifiers, even after transforming predict probabilities to rank.
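
The difference between the pooled OOF-AUC and the mean of per-fold AUCs can be illustrated with a small synthetic sketch. The scores, fold assignment, and the helper `pooled_and_mean_auc` are assumptions made for this example; they are not taken from the notebook's models.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
n, n_folds = 1000, 5
y = rng.randint(0, 2, size=n)
fold = np.arange(n) % n_folds

# Noisy scores that follow the same distribution in every fold
scores = 0.5 * y + rng.normal(size=n)


def pooled_and_mean_auc(y, scores, fold):
    """Return (AUC over all pooled predictions, mean of per-fold AUCs)."""
    pooled = roc_auc_score(y, scores)
    per_fold = [
        roc_auc_score(y[fold == k], scores[fold == k]) for k in np.unique(fold)
    ]
    return pooled, float(np.mean(per_fold))


pooled, mean_cv = pooled_and_mean_auc(y, scores, fold)

# Shift each fold's scores by a different constant: the ordering within a fold
# is unchanged, but scores are no longer comparable across folds
shifted = scores + 2.0 * fold
pooled_shifted, mean_cv_shifted = pooled_and_mean_auc(y, shifted, fold)

print(pooled, mean_cv)
print(pooled_shifted, mean_cv_shifted)
```

When the scores behave the same way in every fold, the two numbers are close. Once the folds are scored on different scales, the per-fold mean is unaffected while the pooled value drops, since pooling compares predictions across folds.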

In [16]:
# Inference
test_target = df_test.sentiment.values
test_features = []
for model_name in basemodels.keys():
    test_features.append(basemodels[model_name].predict_proba(df_test)[:, 1])

test_pred = (pd.DataFrame(np.c_[test_features].T) * blender.coef_).sum(axis=1)
# Note: blender_rk.coef_ was fitted on ranks but is applied to raw probabilities here
test_rk_pred = (pd.DataFrame(np.c_[test_features].T) * blender_rk.coef_).sum(axis=1)
print('Test AUC (Blended):    ', metrics.roc_auc_score(test_target, test_pred))
print('Test AUC (Blended rk.):', metrics.roc_auc_score(test_target, test_rk_pred))

Test AUC (Blended):     0.9442348787929018
Test AUC (Blended rk.): 0.9442668789772228


:::{tip}
Using blended **rank probabilities** is a good trick when optimizing AUC score. Here individual probabilities are replaced by their rank index. Recall that AUC depends only on the ordering of the predictions: it is the probability that a randomly chosen negative example is assigned a lower score than a randomly chosen positive example. Note that this only matters for ensembles; for a single model, replacing probabilities by ranks does not affect the AUC score.
:::
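
A quick check of the claim in the tip, using synthetic scores from two hypothetical models (`p_a` and `p_b` are assumptions for illustration, not outputs of the base models above):

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=500)

# Two hypothetical models whose probabilities live on different effective scales
p_a = 1 / (1 + np.exp(-(y + rng.normal(size=500))))            # spread over (0, 1)
p_b = 1 / (1 + np.exp(-5 * (y - 0.5 + rng.normal(size=500))))  # pushed towards 0 or 1

# Single model: ranking is a monotone transform, so the AUC is unchanged
auc_raw = roc_auc_score(y, p_a)
auc_rank = roc_auc_score(y, rankdata(p_a))

# Ensemble: summing ranks and summing probabilities can order examples differently
auc_blend_proba = roc_auc_score(y, p_a + p_b)
auc_blend_rank = roc_auc_score(y, rankdata(p_a) + rankdata(p_b))

print(auc_raw, auc_rank)
print(auc_blend_proba, auc_blend_rank)
```

The first pair of numbers is identical, while the second pair will generally differ because ranking puts both models on a common scale before they are combined.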

### XGB Metamodel

Blending generalizes easily to a metamodel that learns from and predicts with the metafeatures using a more complex algorithm. For example, we can use `XGBClassifier`.

In [17]:
basemodels = {'lr': lr, 'lr_cnt': lr_cnt, 'rf_svd': rf_svd}
metamodel = {'xgb': XGBClassifier(eval_metric="logloss", use_label_encoder=False)}
stack = StackingClassifier([basemodels, metamodel])
stack.fit(df_train)


Level 1 preds: lr
fold=0, auc=0.9359574839893711
fold=1, auc=0.9374319843579961
fold=2, auc=0.9327744831936209
fold=3, auc=0.936746484186621
fold=4, auc=0.930962930962931

Level 1 preds: lr_cnt
fold=0, auc=0.944461736115434
fold=1, auc=0.9481647370411844
fold=2, auc=0.9413639853409963
fold=3, auc=0.9453942363485591
fold=4, auc=0.9425671925671927

Level 1 preds: rf_svd
fold=0, auc=0.8860449715112428
fold=1, auc=0.8800172200043049
fold=2, auc=0.8816349704087426
fold=3, auc=0.8886420971605242
fold=4, auc=0.8707864957864959

Level 2 preds: xgb
fold=0, auc=0.9477989869497467
fold=1, auc=0.9475459868864966
fold=2, auc=0.9424769856192465
fold=3, auc=0.9470486117621529
fold=4, auc=0.9410519410519409


<__main__.StackingClassifier at 0x7f317c296c90>

In [18]:
y_train = df_train.sentiment.values
y_test = df_test.sentiment.values

print(f"Train AUC (XGB stack):", metrics.roc_auc_score(y_train, stack.predict_proba(df_train)[:, 1]))
print(f"Test AUC  (XGB stack):", metrics.roc_auc_score(y_test, stack.predict_proba(df_test)[:, 1]))

Train AUC (XGB stack): 0.9998812899572643
Test AUC  (XGB stack): 0.940547417553125


In [19]:
pd.DataFrame(stack.cv_scores_).describe().loc[['mean', 'std']]

|   | lr_1 | lr_cnt_1 | rf_svd_1 | xgb_2 |
|---|---|---|---|---|
| mean | 0.934775 | 0.94439 | 0.881425 | 0.945185 |
| std | 0.002778 | 0.002634 | 0.006867 | 0.003174 |


Observe that the cross-validated AUC scores are indicative of test performance, whereas the train AUC (close to 1) is not informative. A better estimate is the mean cross-validation AUC score. If we assume that each fold has the same error distribution, then this should approximate the test AUC, which can be thought of as predicting on another fold. Indeed, the above results support this.

### Conclusion

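The above examples of building ensembles with **blending** or **stacking** (e.g. with XGBoost) show that ensembles can improve on the best single model: the blended OOF-AUC is about 0.949 to 0.950 and the mean CV AUC of the XGBoost stacker is 0.945, compared with a mean CV AUC of 0.944 for the best base model (`lr_cnt`).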

## Experiment: CV Folds

Consider stacking three levels of models. Our current implementation does this by keeping the same cross-validation folds when training the level 2 models [^ref2]. It is unclear whether using the same folds between levels 1 and 2 affects generalization error. To check this empirically, we compute CV scores with fold indices shuffled after training each level. This simulates having a different CV fold assignment for each training example. We calculate train and test set AUCs, as well as CV scores. If CV scores and test AUC decrease while overall train AUC increases, then this indicates that using different validation folds between levels results in overfitting.

[^ref2]: GM Abhishek Thakur recommends keeping the same folds in [AAAMLP](https://github.com/abhishekkrthakur/approachingalmost/blob/master/AAAMLP.pdf).

In [20]:
class LinearRegressionClassifier(BaseEstimator, ClassifierMixin):
    """Linear regression for model-based AUC optimization.
    Note that we transform probabilities to rank probabilities!"""
    
    def __init__(self): 
        self.lr = linear_model.LinearRegression()
        
    def fit(self, X, y):
        self.lr.fit(pd.DataFrame(X).rank(), y)
        return self
        
    def predict_proba(self, X):
        return np.c_[[0]*len(X), self.lr.predict(pd.DataFrame(X).rank())]

Define models for stacking.

In [21]:
# Base models
level1 = {
    'lr': make_pipeline(
        ReviewColumnExtractor(),
        TfidfVectorizer(max_features=1000),
        linear_model.LogisticRegression()
    ), 
    
    'lr_cnt': make_pipeline(
        ReviewColumnExtractor(),
        CountVectorizer(),
        linear_model.LogisticRegression(solver='liblinear')
    ), 
    
    'rf_svd': make_pipeline(
        ReviewColumnExtractor(),
        TfidfVectorizer(max_features=None),
        decomposition.TruncatedSVD(n_components=120),
        ensemble.RandomForestClassifier(n_estimators=100, n_jobs=-1)
    )
}

# Meta models
level2 = {
    'lr': linear_model.LogisticRegression(),
    'linreg': make_pipeline(
        StandardScaler(), 
        LinearRegressionClassifier()
    ),
    'xgb': XGBClassifier(eval_metric="logloss", use_label_encoder=False)
}

# Blender head: rank true for linear regression
level3 = {'blender': Blender(rank=True)}

### Same Folds

In [22]:
# Run training of stack models
stack = StackingClassifier([level1, level2, level3])
stack.fit(df_train)


Level 1 preds: lr
fold=0, auc=0.9359574839893711
fold=1, auc=0.9374319843579961
fold=2, auc=0.9327744831936209
fold=3, auc=0.936746484186621
fold=4, auc=0.930962930962931

Level 1 preds: lr_cnt
fold=0, auc=0.944461736115434
fold=1, auc=0.9481647370411844
fold=2, auc=0.9413639853409963
fold=3, auc=0.9453942363485591
fold=4, auc=0.9425671925671927

Level 1 preds: rf_svd
fold=0, auc=0.8814884703721176
fold=1, auc=0.8815248453812113
fold=2, auc=0.875981843995461
fold=3, auc=0.8860663465165866
fold=4, auc=0.8695517445517444

Level 2 preds: lr
fold=0, auc=0.9509604877401219
fold=1, auc=0.951691987922997
fold=2, auc=0.9457739864434966
fold=3, auc=0.9504322376080596
fold=4, auc=0.9435239435239435

Level 2 preds: linreg
fold=0, auc=0.9511337377834344
fold=1, auc=0.9533894883473722
fold=2, auc=0.9472134868033718
fold=3, auc=0.9516157379039345
fold=4, auc=0.9464746964746964

Level 2 preds: xgb
fold=0, auc=0.9442651110662778
fold=1, auc=0.9492149873037468
fold=2, auc=0.9422511105627777
fold=3, au

<__main__.StackingClassifier at 0x7f318c54eb10>

In [23]:
same_train_auc = metrics.roc_auc_score(y_train, stack.predict_proba(df_train)[:, 1])
same_test_auc = metrics.roc_auc_score(y_test, stack.predict_proba(df_test)[:, 1])

print(f"Train AUC (same):", same_train_auc)
print(f"Test AUC  (same):", same_test_auc)

pd.DataFrame(stack.cv_scores_).describe().loc[['mean', 'std']]

Train AUC (same): 0.9950402682144966
Test AUC  (same): 0.9452795248100628


|   | lr_1 | lr_cnt_1 | rf_svd_1 | lr_2 | linreg_2 | xgb_2 | blender_3 |
|---|---|---|---|---|---|---|---|
| mean | 0.934775 | 0.94439 | 0.878923 | 0.948477 | 0.949965 | 0.944636 | 0.950131 |
| std | 0.002778 | 0.002634 | 0.006341 | 0.003611 | 0.002982 | 0.003294 | 0.003127 |


In [24]:
experiment_results = {
    'same': {'train': same_train_auc, 'test': same_test_auc}
}

### Different Folds

Now we train the same model, except that the folds are shuffled after training the level 1 models. This simulates the use of different cross-validation folds when training higher-level models.

In [25]:
class StackingClassifierShuffledCV:
    """Implements model stacking for classification, with the CV folds 
    reshuffled before training each level above the first."""
    
    def __init__(self, model_dict_list):
        """Initialize by passing `model_dict` which is a list of dictionaries 
        of name-model pairs for each level."""
        
        self.model_dict_list = model_dict_list
        self.cv_scores_ = {}
        self.metafeatures_ = None
        
    def fit(self, df):
        """Fit classifier. This assumes `df` is a DataFrame with "id", "kfold", 
        "sentiment" (target) columns, followed by features columns."""
        
        df = df.copy()
        
        # Iterating over all stacking levels
        metafeatures = []
        for m in range(len(self.model_dict_list)):
            
            # Get models in current layer
            model_dict = self.model_dict_list[m]
            level = m + 1
            
            # Identify feature columns, i.e. preds of prev. layer
            if m == 0:
                feature_cols = ['review']
            else:
                prev_level_names = self.model_dict_list[m-1].keys()
                feature_cols = [f'{name}_{level-1}' for name in prev_level_names]
                
                # Shuffle folds for level 2 models and up <----------- SHUFFLE FOLDS HERE (!)
                df['kfold'] = random.sample(df.kfold.tolist(), len(df))
            
            # Iterate over models in the current layer
            for model_name in model_dict.keys():
                print(f'\nLevel {level} preds: {model_name}')
                self.cv_scores_[f'{model_name}_{level}'] = []
                model = model_dict[model_name]
                
                # Generate feature for next layer models from OOF preds
                oof_preds = []
                for j in range(df.kfold.nunique()):
                    oof_pred, oof_auc = self._oof_pred(df, feature_cols, model, 
                                                        model_name, fold=j, level=level)
                    oof_preds.append(oof_pred)
                    self.cv_scores_[f'{model_name}_{level}'].append(oof_auc)
                
                pred = pd.concat(oof_preds)
                df = df.merge(pred[['id', f'{model_name}_{level}']], on='id', how='left')   
                metafeatures.append(f'{model_name}_{level}')
        
                # Train models on entire feature columns for inference
                model.fit(df[feature_cols], df.sentiment.values)
        
        self.metafeatures_ = df[metafeatures]
        return self
        
    def predict_proba(self, test_df):
        """Return classification probabilities."""
        
        test_df = test_df.copy()
        
        # Iterate over layers to make predictions
        for m in range(len(self.model_dict_list)):
            
            # Get models for current layer
            model_dict = self.model_dict_list[m]
            level = m + 1
            
            # Get feature columns to use for prediction
            if m == 0:
                feature_cols = ['review']
            else:
                prev_names = self.model_dict_list[m-1].keys()
                feature_cols = [f"{model_name}_{level-1}" for model_name in prev_names]

            # Append predictions to test DataFrame
            for model_name in model_dict.keys():
                model = model_dict[model_name]
                pred = model.predict_proba(test_df[feature_cols])[:, 1] 
                test_df.loc[:, f"{model_name}_{level}"] = pred
                    
        # Return last predictions
        return np.c_[1 - pred, pred]
        
    def _oof_pred(self, df, feature_cols, model, model_name, fold, level):
        "Train on K-1 folds, predict on fold K. Return OOF predictions with IDs."

        # Get folds; include ID and target cols, and feature cols
        df_trn = df[df.kfold != fold][['id', 'sentiment']+feature_cols]
        df_oof = df[df.kfold == fold][['id', 'sentiment']+feature_cols]
        
        # Fit model. 
        model.fit(df_trn[feature_cols], df_trn.sentiment.values)
        oof_pred = model.predict_proba(df_oof[feature_cols])[:, 1] 
        auc = metrics.roc_auc_score(df_oof.sentiment.values, oof_pred)
        print(f"fold={fold}, auc={auc}")

        # Return OOF predictions with ids
        df_oof.loc[:, f"{model_name}_{level}"] = oof_pred
        return df_oof[["id", f"{model_name}_{level}"]], auc

:::{danger}
The implementation of `StackingClassifier` has a side effect: it fits the model objects inside the dictionaries in place. Hence, if we train another stacked model using the same model dictionaries (as we do here in defining `stack_shuffled`), then the models inside the dictionaries are refitted under a different fold assignment. This means calling `stack.predict_proba(df_test)` will yield **different results** before and after training `stack_shuffled`! As usual, the stateful approach is error prone. We can modify `StackingClassifier` to instead save a list of model dictionaries that are *clones* of the models. This localizes all state within the stacked model.
:::
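
One possible way to localize state, sketched with `sklearn.base.clone`. The helper `clone_model_dicts` and the toy model dictionaries are hypothetical and only illustrate the idea; the classes in this notebook do not do this.

```python
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression


def clone_model_dicts(model_dict_list):
    """Return unfitted copies of every estimator, keeping names and level order."""
    return [
        {name: clone(model) for name, model in level.items()}
        for level in model_dict_list
    ]


# Toy model dictionaries standing in for level1, level2, ... (assumption)
toy_level1 = {"lr": LogisticRegression(), "rf": RandomForestClassifier(n_estimators=10)}
toy_level2 = {"meta": LogisticRegression()}

copies = clone_model_dicts([toy_level1, toy_level2])

# The copies share hyperparameters with the originals but are distinct objects,
# so fitting them leaves the originals untouched
print(copies[0]["lr"] is toy_level1["lr"])
```

With such a helper, the constructor could store `clone_model_dicts(model_dict_list)` instead of the list it receives, so that two stacked models built from the same dictionaries no longer share fitted estimators.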

Start training the stacked model with shuffling of folds:

In [26]:
stack_shuffled = StackingClassifierShuffledCV([level1, level2, level3])
stack_shuffled.fit(df_train)


Level 1 preds: lr
fold=0, auc=0.9359574839893711
fold=1, auc=0.9374319843579961
fold=2, auc=0.9327744831936209
fold=3, auc=0.936746484186621
fold=4, auc=0.930962930962931

Level 1 preds: lr_cnt
fold=0, auc=0.944461736115434
fold=1, auc=0.9481647370411844
fold=2, auc=0.9413639853409963
fold=3, auc=0.9453942363485591
fold=4, auc=0.9425671925671927

Level 1 preds: rf_svd
fold=0, auc=0.8810043452510863
fold=1, auc=0.8794494698623676
fold=2, auc=0.8751435937858985
fold=3, auc=0.8883572220893055
fold=4, auc=0.8696557446557447

Level 2 preds: lr
fold=0, auc=0.9494481405126081
fold=1, auc=0.946384953621176
fold=2, auc=0.9494623583315892
fold=3, auc=0.9512430457905252
fold=4, auc=0.9458207364551843

Level 2 preds: linreg
fold=0, auc=0.9516165171734631
fold=1, auc=0.9483091588013435
fold=2, auc=0.9500184919426893
fold=3, auc=0.9522944794852852
fold=4, auc=0.9478199869549966

Level 2 preds: xgb
fold=0, auc=0.9466299510558551
fold=1, auc=0.9435907154785806
fold=2, auc=0.9446388244775809
fold=3, a

<__main__.StackingClassifierShuffledCV at 0x7f318c5ea290>

In [27]:
shuffled_train_auc = metrics.roc_auc_score(y_train, stack_shuffled.predict_proba(df_train)[:, 1])
shuffled_test_auc = metrics.roc_auc_score(y_test, stack_shuffled.predict_proba(df_test)[:, 1])

print(f"Train AUC (shuffled):", shuffled_train_auc)
print(f"Test AUC  (shuffled):", shuffled_test_auc)

experiment_results['shuffled'] = {'train': shuffled_train_auc, 'test': shuffled_test_auc}

Train AUC (shuffled): 0.9960226985681715
Test AUC  (shuffled): 0.9452160044441854


In [28]:
pd.DataFrame(experiment_results)

|   | same | shuffled |
|---|---|---|
| train | 0.99504 | 0.996023 |
| test | 0.94528 | 0.945216 |


Observe that the train AUC increased (0.99504 to 0.99602) while the test AUC decreased slightly (0.94528 to 0.94522) when using different folds between layers, which is consistent with overfitting, although the test difference is below $10^{-4}$. The CV scores in the two tables below are mixed: the mean for `blender_3` drops by about $3 \times 10^{-4}$ with shuffling, while `lr_2`, `linreg_2`, and `xgb_2` change by at most about $2 \times 10^{-4}$ in either direction. The standard deviation across folds is lower with shuffling for the level 2 and level 3 models.

In [29]:
# Shuffled CV scores
pd.DataFrame(stack_shuffled.cv_scores_).describe().loc[['mean', 'std']]

|   | lr_1 | lr_cnt_1 | rf_svd_1 | lr_2 | linreg_2 | xgb_2 | blender_3 |
|---|---|---|---|---|---|---|---|
| mean | 0.934775 | 0.94439 | 0.878722 | 0.948472 | 0.950012 | 0.944789 | 0.949821 |
| std | 0.002778 | 0.002634 | 0.006957 | 0.002291 | 0.001968 | 0.002743 | 0.002589 |


In [30]:
# Same CV scores
pd.DataFrame(stack.cv_scores_).describe().loc[['mean', 'std']]

|   | lr_1 | lr_cnt_1 | rf_svd_1 | lr_2 | linreg_2 | xgb_2 | blender_3 |
|---|---|---|---|---|---|---|---|
| mean | 0.934775 | 0.94439 | 0.878923 | 0.948477 | 0.949965 | 0.944636 | 0.950131 |
| std | 0.002778 | 0.002634 | 0.006341 | 0.003611 | 0.002982 | 0.003294 | 0.003127 |


### Conclusion

The empirical results above are consistent with the recommendation to use the **same folds** across levels of stacking, although the measured differences are small. The following theoretical example shows that, when using different folds, overfitting can happen due to the second stage model taking advantage of a certain relationship between ground truth and first stage predictions, without this structure generalizing well to the test set.

Consider a dataset $\{(x_1, t_1), (x_2, t_2), \ldots, (x_{10}, t_{10})\}$ with five folds such that the first fold is $F_1 = \{x_1, x_2\}$. Let $x_1 \mapsto y_1$ and $x_2 \mapsto y_2$ where the mapping is trained on $F_{\neg 1} = \{x_3, \ldots, x_{10}\}$. We can think of modelling on $F_{\neg 1}$ as defining some rule or distribution that the points in $F_1$ are compared against. Suppose we reshuffle folds in the next level such that the first fold is $G_1 = \{y_1, y_{10}\}$. Then the model trained on $G_{\neg 1} = \{y_2, \ldots, y_9\}$ overfits slightly, since $y_2$ was produced by a model fitted on the ground truths $(x_3, t_3), \ldots, (x_9, t_9)$, that is, on the targets of every other point in $G_{\neg 1}$. This asymmetry does not apply to the values in $G_{\neg 1}$ whose first-level fold lies entirely inside $G_{\neg 1}$, because their first-level model never saw the target of their fold partner. Keeping the same cross-validation folds across levels makes all instances in the training folds equivalent in this respect and prevents any such asymmetry from occurring.
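The asymmetry in this example can be counted explicitly. The following is a minimal sketch, assuming the five first-level folds are the consecutive pairs $\{x_1, x_2\}, \{x_3, x_4\}, \ldots, \{x_9, x_{10}\}$ and one particular reshuffle with $G_1 = \{y_1, y_{10}\}$ (both are assumptions, as is the helper name `seen_train_counts`). For each point in the second-level training set, it counts how many other training-set points had their targets used by the first-level model that produced that point's out-of-fold prediction.

```python
import numpy as np

def seen_train_counts(first_folds, second_folds, valid_fold):
    """Map each second-level training point (1-based index) to the number of
    other training points whose targets its first-level model was fitted on."""
    first_folds = np.asarray(first_folds)
    second_folds = np.asarray(second_folds)
    in_train = second_folds != valid_fold
    counts = {}
    for i in np.flatnonzero(in_train):
        # The first-level model for point i was fitted on every point outside i's fold.
        seen = first_folds != first_folds[i]
        counts[int(i) + 1] = int(np.sum(seen & in_train))
    return counts

# First-level folds for x_1, ..., x_10: consecutive pairs (assumed).
first = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]

# Same folds at the second level: validation fold is {x_1, x_2}.
same = seen_train_counts(first, first, valid_fold=0)

# Reshuffled second-level folds with G_1 = {y_1, y_10} (one possible assignment).
shuffled_second = [0, 1, 1, 2, 2, 3, 3, 4, 4, 0]
shuffled = seen_train_counts(first, shuffled_second, valid_fold=0)

print(same)      # every training point has the same count
print(shuffled)  # points whose fold partner left the training set have a larger count
```

Under these assumptions every training point has a count of 6 with the same folds, while with the reshuffled folds $y_2$ and $y_9$ have a count of 7 and the rest have 6. The points with the higher count are exactly those whose first-level fold partner ended up in the second-level validation fold.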

Theoretically (as mentioned in [A Kaggler's Guide to Model Stacking in Practice](https://datasciblog.github.io/2016/12/27/a-kagglers-guide-to-model-stacking-in-practice/)), there is always some leakage if you train a second-level model on the same training set that you used to derive the first-stage predictions. This is because you used the ground truth to get those first-stage predictions, and now you take those predictions as input and try to predict the same ground truth. However, this leakage doesn't seem to be significant in practice.

## Parallelizing Model Training


Generating features requires training one model for each fold, which is very slow. Each training process is independent of the others (they only use static features from the previous level), so in principle they can be easily parallelized. For this task, we parallelize only the training on cross-validation folds. During inference, parallelizing results in worse times, likely due to overhead.

We parallelize training on CV folds using `joblib.Parallel`. Some remarks:

* Setting `backend='loky'` is important. On my MacBook Air 2015 laptop running macOS Mojave 10.14.6, setting `backend='multiprocessing'` with an `XGBClassifier` model causes training to hang. In a Kaggle kernel, `multiprocessing` doesn't seem to work at all, even without `XGBClassifier`. The `loky` backend seems to work consistently across platforms.

+++

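* Setting `nthread=1` for `XGBClassifier` decreases train time from ~246s to ~100s with backend `loky` and `n_jobs=-1`. Note that the former time is much worse than sequential, presumably because every worker process starts its own pool of XGBoost threads and the cores become oversubscribed.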

+++

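* With process-based backends such as `loky`, joblib pickles every object passed to the workers, so it is best to pass stateless objects. Be careful about relying on shared memory: workers operate on copies, so state mutated inside a worker is not visible in the parent process.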

+++

* Setting `n_jobs=1` turns off parallel computing, which is useful for debugging.
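The remarks above can be illustrated on a small scale. The following is a minimal sketch, not the stacking class itself: it fits one cloned model per fold with `joblib.Parallel` and the `loky` backend, and checks that `n_jobs=1` (sequential, convenient for debugging) and `n_jobs=2` give the same fold scores. The synthetic dataset and the helper names `fit_and_score` and `cv_scores` are assumptions made for the example.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def fit_and_score(model, X, y, train_idx, valid_idx):
    """Fit the given (fresh) model on one training split; return the held-out AUC."""
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict_proba(X[valid_idx])[:, 1]
    return roc_auc_score(y[valid_idx], pred)

def cv_scores(model, X, y, n_jobs=1, backend="loky"):
    """Run one independent fit per fold; each task receives its own clone of the model."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    folds = list(skf.split(X, y))
    return Parallel(n_jobs=n_jobs, backend=backend)(
        delayed(fit_and_score)(clone(model), X, y, train_idx, valid_idx)
        for train_idx, valid_idx in folds
    )

# Synthetic data standing in for the features of one stacking level.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

sequential = cv_scores(model, X, y, n_jobs=1)  # no worker processes, easy to debug
parallel = cv_scores(model, X, y, n_jobs=2)    # two loky worker processes

print(np.allclose(sequential, parallel))
```

Because each task gets a clone and returns only its score, nothing depends on state shared between workers, which is the property the pickling remark above asks for.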

Note that we now **clone** the models in `model_dict_list` inside the `StackingClassifierParallel` object to avoid leaking state outside the object instance. Results below show a noticeable speed-up with parallelization using the `loky` backend. Consider this the current stable version of our stacking implementation.

In [31]:
class StackingClassifierParallel(BaseEstimator, ClassifierMixin):
    """Implements model stacking for classification."""
    
    def __init__(self, model_dict_list, n_jobs=1, backend='loky'):
        """Initialize by passing `model_dict_list`, a list of dictionaries 
        of name-model pairs, one dictionary per level. Model names must be
        unique within each level."""
        
        self.model_dict_list = [
            { name: clone(model_dict[name]) for name in model_dict } 
                                            for model_dict in model_dict_list]
        self.cv_scores_ = {}
        self.metafeatures_ = None
        self.n_jobs = n_jobs
        self.backend = backend
    
    def fit(self, df):
        """Fit classifier. Assumes `df` is a DataFrame with 'id', 'kfold', and 
        'sentiment' (target) columns, followed by features columns."""
        
        # Iterating over all stacking levels
        df = df.copy()
        metafeatures = []
        for m in tqdm(range(len(self.model_dict_list)), leave=False):
            
            # Get models in current layer
            model_dict = self.model_dict_list[m]
            level = m + 1
            
            # Identify feature columns, i.e. preds of prev. layer
            if m == 0:
                feature_cols = ['review']
            else:
                prev_level_names = self.model_dict_list[m-1].keys()
                feature_cols = [f'{name}_{level-1}' for name in prev_level_names]
            
            # Parallel context manager. Prevents discarding of workers for each model
            with Parallel(n_jobs=self.n_jobs, backend=self.backend, verbose=1) as parallel:
                
                # Iterate over models in the current layer
                for model_name in tqdm(model_dict.keys(), leave=False):
                    
                    # Generate feature for next layer models from OOF preds
                    # Cloning the model here releases the weights from prev. fit.
                    model = model_dict[model_name]
                    out = parallel(delayed(self._predict_fold)(
                            df, feature_cols, fold,
                            model_name, clone(model),
                            level
                        ) for fold in df.kfold.unique()
                    )

                    # Load all OOF predictions and AUCs
                    fold_preds, cv_scores = list(zip(*out))
                    
                    # Assign cv scores for model and append predictions to df
                    self.cv_scores_[f'{model_name}_{level}'] = cv_scores
                    pred_df = pd.concat(fold_preds)
                    df = df.merge(pred_df, how='left', on='id')
                    metafeatures.append(f'{model_name}_{level}')
                    
                    # Refit model on entire feature columns for inference
                    model.fit(df[feature_cols], df.sentiment)
                    
        # Save learned metafeatures
        self.metafeatures_ = df[metafeatures]
        return self
    
    def predict_proba(self, df):
        """Return classification probabilities."""
        
        # Iterate over layers to make predictions
        df = df.copy()
        for m in range(len(self.model_dict_list)):
            
            # Get models for current layer
            model_dict = self.model_dict_list[m]
            level = m + 1
            
            # Get feature columns to use for prediction
            if m == 0:
                feature_cols = ['review']
            else:
                prev_names = self.model_dict_list[m-1].keys()
                feature_cols = [f"{model_name}_{level-1}" for model_name in prev_names]

            # Append predictions to test DataFrame
            for model_name in model_dict.keys():
                model = model_dict[model_name]
                pred = model.predict_proba(df[feature_cols])[:, 1] 
                df.loc[:, f"{model_name}_{level}"] = pred
                    
        # Return last predictions
        return np.c_[1 - pred, pred]

    def _predict_fold(self, df, feature_cols, fold, model_name, model, level):
        "Make out-of-fold predictions. Return predict probas and AUC."
        
        X_train = df[df.kfold != fold][feature_cols]
        y_train = df[df.kfold != fold].sentiment.values
        
        X_valid = df[df.kfold == fold][feature_cols] 
        y_valid = df[df.kfold == fold].sentiment.values
        pred_id = df[df.kfold == fold].id

        # Fit model
        model.fit(X_train, y_train)
        
        # Return fold predictions along with fold AUC
        pred = model.predict_proba(X_valid)[:, 1] 
        auc = metrics.roc_auc_score(y_valid, pred)
        return pd.DataFrame({"id": pred_id, f"{model_name}_{level}": pred}), auc
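The `fit` method above expects a `kfold` column alongside `id`, `sentiment`, and `review`. If a DataFrame does not carry fold assignments yet, they could be added along the following lines. This is a minimal sketch using stratified folds; the helper name `add_kfold_column`, the `seed` parameter, and the toy DataFrame are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

def add_kfold_column(df, target="sentiment", n_splits=5, seed=42):
    """Return a copy of `df` with a `kfold` column of stratified fold labels."""
    df = df.reset_index(drop=True).copy()
    df["kfold"] = -1
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fold, (_, valid_idx) in enumerate(skf.split(df, df[target])):
        # After reset_index the positional indices coincide with the row labels.
        df.loc[valid_idx, "kfold"] = fold
    return df

# Toy frame with the columns the stacker expects.
toy = pd.DataFrame({
    "id": [f"r{i}" for i in range(20)],
    "sentiment": [0, 1] * 10,
    "review": ["some review text"] * 20,
})
toy = add_kfold_column(toy)
print(toy.kfold.value_counts().sort_index())
```

Since the fold labels live in the DataFrame, every level of the stacker reads the same assignment, which is what keeps the folds identical across levels.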

Define the models that we will use at each level.

In [32]:
# Base models
level1 = {
    'lr': make_pipeline(
        ReviewColumnExtractor(),
        TfidfVectorizer(max_features=1000),
        linear_model.LogisticRegression(random_state=42)
    ), 
    
    'lr_cnt': make_pipeline(
        ReviewColumnExtractor(),
        CountVectorizer(), 
        linear_model.LogisticRegression(solver='liblinear', random_state=42)
    ), 
}

# Meta models
level2 = {
    'lr': linear_model.LogisticRegression(),
    'linreg': make_pipeline(StandardScaler(), LinearRegressionClassifier()),
    'xgb': XGBClassifier(eval_metric="logloss", use_label_encoder=False, nthread=1, random_state=42)
}

# Level 3 meta models
level3 = {
    'linreg': make_pipeline(StandardScaler(), LinearRegressionClassifier()),
    'xgb': XGBClassifier(eval_metric="logloss", use_label_encoder=False, nthread=1)
}

# Blender head: rank true for linear reg.
level4 = {'blender': Blender(rank=True, random_state=42)}

:::{caution}
Setting `nthread=1` for `XGBClassifier` decreases train time for the parallel stacker from ~250s to ~100s. That is, the parallel stacker goes from slower to faster than the sequential one. See also https://github.com/dmlc/xgboost/issues/2163.
:::

Start with timing experiments.

In [33]:
model_dict_list = [level1, level2, level3, level4]

In [34]:
times = []
for i in range(3):
    start_time = time.time()
    stack_parallel = StackingClassifierParallel(model_dict_list, n_jobs=-1)
    stack_parallel.fit(df_train)
    times.append(time.time() - start_time)
    
times = np.array(times)
times.mean(), times.std()

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   14.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   24.3s finished


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    3.5s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    3.5s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    5.0s finished


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    3.7s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    5.7s finished


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    4.0s finished


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   10.6s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   24.2s finished


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    3.5s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    3.6s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    5.0s finished


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    3.5s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    5.8s finished


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    4.0s finished


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   10.8s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   23.2s finished


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    3.4s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    3.6s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    5.0s finished


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    3.5s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    5.8s finished


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    4.0s finished


(80.29500238100688, 2.2682731989272282)

In [35]:
times = []
for i in range(3):
    start_time = time.time()
    stack = StackingClassifier(model_dict_list)
    stack.fit(df_train)
    times.append(time.time() - start_time)
    
times = np.array(times)
times.mean(), times.std()


Level 1 preds: lr
fold=0, auc=0.9359574839893711
fold=1, auc=0.9374319843579961
fold=2, auc=0.9327744831936209
fold=3, auc=0.936746484186621
fold=4, auc=0.930962930962931

Level 1 preds: lr_cnt
fold=0, auc=0.944461736115434
fold=1, auc=0.9481647370411844
fold=2, auc=0.9413639853409963
fold=3, auc=0.9453942363485591
fold=4, auc=0.9425671925671927

Level 2 preds: lr
fold=0, auc=0.9508922377230595
fold=1, auc=0.9515819878954969
fold=2, auc=0.9454064863516216
fold=3, auc=0.9496467374116845
fold=4, auc=0.9443194443194443

Level 2 preds: linreg
fold=0, auc=0.950668987667247
fold=1, auc=0.9531489882872469
fold=2, auc=0.9467019866754967
fold=3, auc=0.950918487729622
fold=4, auc=0.9466674466674467

Level 2 preds: xgb
fold=0, auc=0.9467834866958718
fold=1, auc=0.9498002374500593
fold=2, auc=0.9415947353986839
fold=3, auc=0.9475146118786529
fold=4, auc=0.9414463164463163

Level 3 preds: linreg
fold=0, auc=0.9515402378850595
fold=1, auc=0.9531754882938721
fold=2, auc=0.9466702366675591
fold=3, au

(97.44590123494466, 0.8955251835462358)

Notice that the parallelized version fits in about 80s on average versus about 97s for the sequential version, a ~20% speed-up (roughly 1.2x) with 4 workers!
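For reference, the speed-up figure follows directly from the two mean fit times reported above. The variable names are only illustrative.

```python
# Mean wall-clock fit times in seconds, copied from the two timing cells above.
parallel_mean = 80.29500238100688
sequential_mean = 97.44590123494466

speedup = sequential_mean / parallel_mean         # how many times faster
time_saved = 1 - parallel_mean / sequential_mean  # fraction of wall-clock time saved

print(f"speed-up: {speedup:.2f}x, time saved: {time_saved:.1%}")
# speed-up: 1.21x, time saved: 17.6%
```

Depending on which of the two ratios is quoted, the improvement is about 21% (throughput) or about 18% (time saved), both of which round to the ~20% stated above.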

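Inference is not parallelized in `StackingClassifierParallel` (its `predict_proba` runs sequentially), so we expect similar prediction times for the two stackers. We time `predict_proba` for each and then test whether the predictions agree: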

In [36]:
times = []
for i in range(3):
    start_time = time.time()
    parallel_pred = stack_parallel.predict_proba(df_test)[:, 1]
    times.append(time.time() - start_time)
    
times = np.array(times)
times.mean(), times.std()

(1.74370272954305, 0.05158894135175818)

In [37]:
times = []
for i in range(3):
    start_time = time.time()
    usual_pred = stack.predict_proba(df_test)[:, 1]
    times.append(time.time() - start_time)
    
times = np.array(times)
times.mean(), times.std()

(1.7447847525278728, 0.009084941568507832)

In [48]:
print('parallel AUC:', metrics.roc_auc_score(df_test.sentiment, parallel_pred))
print('usual AUC:   ', metrics.roc_auc_score(df_test.sentiment, usual_pred))

parallel AUC: 0.9443420794103774
usual AUC:    0.9443182392730581
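Matching AUCs do not by themselves show that the two prediction vectors agree row by row. A more direct check could look like the sketch below, which reports the largest absolute difference and the Spearman rank correlation. The toy arrays stand in for `parallel_pred` and `usual_pred`, and the helper name `compare_predictions` is an assumption.

```python
import numpy as np
from scipy.stats import spearmanr

def compare_predictions(a, b):
    """Summarize row-wise agreement between two prediction vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    rho, _ = spearmanr(a, b)
    return {
        "max_abs_diff": float(np.max(np.abs(a - b))),
        "spearman": float(rho),
    }

# Toy vectors: the second is a slightly perturbed copy of the first.
rng = np.random.default_rng(0)
p = rng.uniform(size=1000)
q = p + rng.normal(scale=0.01, size=1000)

print(compare_predictions(p, q))
print(compare_predictions(p, p))  # identical vectors: zero difference, correlation 1
```

Since AUC depends only on the ordering of the scores, the rank correlation is probably the more informative of the two numbers here, especially when the final outputs are not on a probability scale.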


Testing if the results agree at the fold level:

In [40]:
# parallel
pd.DataFrame(stack_parallel.cv_scores_).describe().loc[['mean', 'std']]

Unnamed: 0,lr_1,lr_cnt_1,lr_2,linreg_2,xgb_2,linreg_3,xgb_3,blender_4
mean,0.934775,0.944388,0.948371,0.949621,0.945454,0.949733,0.946076,0.949811
std,0.002778,0.002632,0.003298,0.002848,0.003488,0.003085,0.002404,0.002969


In [41]:
# sequential
pd.DataFrame(stack.cv_scores_).describe().loc[['mean', 'std']]

Unnamed: 0,lr_1,lr_cnt_1,lr_2,linreg_2,xgb_2,linreg_3,xgb_3,blender_4
mean,0.934775,0.94439,0.948369,0.949621,0.945428,0.949739,0.94617,0.949782
std,0.002778,0.002634,0.003298,0.002849,0.003737,0.003139,0.002788,0.003069


Checking that the two stackers learned different model weights. (Previously I forgot to clone the models in `model_dict_list`, so both stackers shared the same model objects and hence ended up with the same weights.)

In [51]:
print(stack_parallel.model_dict_list[3]['blender'].coef_)
print(stack.model_dict_list[3]['blender'].coef_)

[0.25063917 0.04577294]
[0.25660091 0.0306614 ]
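The effect of forgetting to clone can be reproduced in isolation. This is a minimal sketch with a plain `LogisticRegression` and synthetic data rather than the stackers above: two holders that share one estimator object always show the weights of the most recent fit, while clones keep their own.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y_a = (X[:, 0] > 0).astype(int)  # target driven by the first feature
y_b = (X[:, 1] > 0).astype(int)  # target driven by the second feature

shared = {"lr": LogisticRegression()}

# Without cloning: both holders reference the same estimator object.
no_clone_a, no_clone_b = shared, shared
no_clone_a["lr"].fit(X, y_a)
no_clone_b["lr"].fit(X, y_b)  # overwrites the weights seen through no_clone_a
print(np.allclose(no_clone_a["lr"].coef_, no_clone_b["lr"].coef_))

# With cloning: each holder gets its own unfitted copy with the same parameters.
clone_a = {name: clone(m) for name, m in shared.items()}
clone_b = {name: clone(m) for name, m in shared.items()}
clone_a["lr"].fit(X, y_a)
clone_b["lr"].fit(X, y_b)
print(np.allclose(clone_a["lr"].coef_, clone_b["lr"].coef_))
```

The first comparison is trivially true because it compares an object with itself; the second compares two independently fitted models.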


In [49]:
stack_parallel.metafeatures_.head()

Unnamed: 0,lr_1,lr_cnt_1,lr_2,linreg_2,xgb_2,linreg_3,xgb_3,blender_4
0,0.414118,0.761353,0.594294,-0.022445,0.613355,-0.006122,0.79925,670.670587
1,0.982611,0.999355,0.967132,0.114824,0.996484,0.139497,0.999487,1203.771676
2,0.102193,0.000346,0.04116,-0.139775,0.017038,-0.124591,0.036391,228.81862
3,0.820037,0.999538,0.938649,0.083109,0.983069,0.091374,0.967702,1008.00981
4,0.84269,0.965703,0.938076,0.047133,0.92496,0.064387,0.965395,916.861444


In [50]:
stack.metafeatures_.head()

Unnamed: 0,lr_1,lr_cnt_1,lr_2,linreg_2,xgb_2,linreg_3,xgb_3,blender_4
0,0.414118,0.761635,0.594502,-0.022393,0.576681,-0.006564,0.599075,607.936054
1,0.982611,0.999355,0.967132,0.114876,0.996751,0.139951,0.99955,1124.234802
2,0.102193,0.000345,0.04116,-0.139774,0.016802,-0.124136,0.012842,204.159542
3,0.820037,0.999539,0.93865,0.083109,0.986579,0.092108,0.988499,954.836527
4,0.84269,0.965732,0.938082,0.047134,0.92232,0.064557,0.919907,842.07884
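Note that the `blender_4` column is on a scale of hundreds rather than $[0, 1]$, presumably because the blender is constructed with `rank=True`. This does not affect the AUC comparisons, since ROC AUC depends only on the ordering of the scores, but the values should not be read as probabilities. A minimal sketch with synthetic scores (an assumption, not the notebook's data) shows the invariance:

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)

# Synthetic probability-like scores that are loosely correlated with the label.
logits = y_true + rng.normal(size=500)
scores = 1.0 / (1.0 + np.exp(-logits))

# Replace each score by its rank (1 = smallest); the ordering is unchanged.
ranks = rankdata(scores)

auc_scores = roc_auc_score(y_true, scores)
auc_ranks = roc_auc_score(y_true, ranks)
print(np.isclose(auc_scores, auc_ranks))
```

The same argument applies to any strictly increasing transformation of the scores, including positively weighted sums of ranks when all inputs agree on the ordering.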
