# **Tutorial: Stacking**
### By Kostas Hatalis

**Prerequisite Notebooks:** *Decision Trees, Ensemble Learning*

___
## **Stacking**

Stacking (also called stacked generalization or super learning), introduced in 1992 by David Wolpert [1], involves training a **meta-model** on **meta-features**, which are the predictions of several base learning algorithms, with the aim of reducing the generalization error. In other words, the basic idea is to train several base models and feed their predictions into another model that learns to weigh and combine them to get (ideally) better predictions.

For classification, the meta-classifier can be trained on either the predicted class labels or the predicted probabilities from the ensemble. Stacking can also be used for regression, where it is known as stacked regression, introduced by Leo Breiman in 1996 [2]. The meta-regressor uses the numeric predictions from the individual base regressors as inputs to make a final prediction.
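To make the two kinds of meta-features concrete, here is a minimal sketch of how each would be assembled. The choice of base models, the Iris data, and the 70/30 split are assumptions made for illustration only; the point is the shape of the resulting matrices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Fit on one part of the data and predict on rows the models have not seen.
X_fit, X_new, y_fit, y_new = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Illustrative base models (an assumption, any classifiers would do).
base_models = [KNeighborsClassifier(n_neighbors=5), GaussianNB()]
for model in base_models:
    model.fit(X_fit, y_fit)

# Option 1: one column per base model, holding its predicted class label.
label_features = np.column_stack([m.predict(X_new) for m in base_models])

# Option 2: one column per (base model, class) pair, holding the
# predicted class probability.
proba_features = np.hstack([m.predict_proba(X_new) for m in base_models])

print(label_features.shape)  # (45, 2): 45 rows, 2 base models
print(proba_features.shape)  # (45, 6): 45 rows, 2 base models x 3 classes
```

Probabilities give the meta-model more information per base model (how confident each prediction is), at the cost of a wider input matrix.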

The naive stacking procedure is to fit the base (first-level) models to the whole training set, then use their predictions on that same training set as the inputs for the meta (second-level) model. This type of stacking is prone to **overfitting due to information leakage, and should be avoided**: the base models have already seen the labels of the rows they are predicting, so their training-set predictions look more accurate than they will be on new data. Thus, it is advised to combine stacking with k-fold cross-validation (CV).

For classification and regression, CV-based stacking works as follows:
1. Split the data set into training and testing sets.
2. Split the training set into $k$ folds.
    - In each round, $k-1$ folds are used for training and 1 fold is used for validation.
3. Fit the base models on the $k-1$ training folds.
4. Apply the base models to predict the validation fold.
5. Stack the resulting predictions as input data for the meta-model.
6. Repeat steps 3 to 5, using each fold as the validation fold once, until the whole training set has been cycled through and every training row has an out-of-fold prediction.
7. Train the meta-model on the stacked predictions.
8. After the meta-model has been trained, retrain the base models on the entire training set.
    - At this point, evaluate the full stacked model on the testing set.
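The steps above can be sketched by hand with plain scikit-learn. This is a minimal illustration rather than a reference implementation: the base models, the logistic regression meta-model, the Iris data, and $k=5$ are all assumptions chosen to keep the example small. `cross_val_predict` handles the fold loop (steps 2 to 6) by returning, for every training row, a prediction from a model that did not see that row.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_predict,
    train_test_split,
)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Step 1: split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Illustrative base models (an assumption, not a recommendation).
base_models = [KNeighborsClassifier(n_neighbors=5), GaussianNB()]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Steps 2 to 6: out-of-fold class probabilities for every training row,
# stacked side by side as the meta-features.
oof = np.hstack([
    cross_val_predict(clone(m), X_train, y_train, cv=cv, method="predict_proba")
    for m in base_models
])

# Step 7: train the meta-model on the stacked out-of-fold predictions.
meta_model = LogisticRegression(max_iter=1000).fit(oof, y_train)

# Step 8: refit the base models on the entire training set, then
# evaluate the full stack on the testing set.
fitted = [clone(m).fit(X_train, y_train) for m in base_models]
test_features = np.hstack([m.predict_proba(X_test) for m in fitted])
test_accuracy = meta_model.score(test_features, y_test)

print(f"Stacked test accuracy: {test_accuracy:.2f}")
```

Because the meta-model only ever sees out-of-fold predictions during training, its inputs at training time resemble what the refitted base models will produce on unseen data.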

This process is illustrated in the figure below (from [3]) for classification and regression:

<img src="images/stacking.PNG" width="800">

**Stacking with CV often yields better performance than any single one of the trained base models**, although this is not guaranteed.

### **Diversity**

It is important to try a diverse set of base models and meta-models!

In practice, a logistic regression model is often used as the meta-model. However, any algorithm could be used. Nonlinear meta-models, such as gradient boosting machines (GBMs) and artificial neural networks (ANNs), can give surprisingly large gains on multiclass problems.

Among the base models, the same type of model can also be used multiple times with different training algorithms, different hyperparameters, and different feature subsets. For instance, you could have 20 neural networks, 20 support vector machines, and 20 random forests as the base models. There is no hard limit on how many models you can use, but performance typically plateaus beyond a certain number of models.
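As a rough sketch of what such a pool might look like, the snippet below reuses two algorithms with different hyperparameters and feature subsets, then measures how often each pair of models agrees on out-of-fold predictions. The helper `on_columns`, the model names, and the agreement matrix are illustrative assumptions, not part of any library; low agreement between two reasonably accurate models is one informal sign that they contribute different information.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)


def on_columns(columns, estimator):
    """Hypothetical helper: train `estimator` on a subset of feature columns."""
    selector = ColumnTransformer([("select", "passthrough", columns)])
    return make_pipeline(selector, estimator)


pool = {}
# Same algorithm, different hyperparameters.
for k in (1, 5, 15):
    pool[f"knn_k{k}"] = KNeighborsClassifier(n_neighbors=k)
for depth in (2, None):
    pool[f"rf_depth{depth}"] = RandomForestClassifier(
        n_estimators=50, max_depth=depth, random_state=1
    )
# Same algorithm, different feature subsets (sepal vs. petal columns).
pool["knn_sepal"] = on_columns([0, 1], KNeighborsClassifier(n_neighbors=5))
pool["knn_petal"] = on_columns([2, 3], KNeighborsClassifier(n_neighbors=5))

# Out-of-fold predicted labels, one column per model in the pool.
oof_labels = np.column_stack(
    [cross_val_predict(model, X, y, cv=5) for model in pool.values()]
)

# agreement[i, j] = fraction of rows on which models i and j predict
# the same label.
agreement = (oof_labels[:, :, None] == oof_labels[:, None, :]).mean(axis=0)

names = list(pool)
print(names)
print(np.round(agreement, 2))
```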

### **Blending**

The top performers in the Netflix Prize competition (launched in 2006) introduced a form of stacking called blending. With blending, instead of creating out-of-fold predictions for the train set, you create a small holdout set of, say, 10% of the train set. The base models are fit on the remaining data, and the stacker model then trains on their predictions for this holdout set only. Blending is simpler than stacking, and it guards against information leakage because the base models and the stacker use different data. However, you use less data overall and the final model may overfit to the holdout set, whereas stacking is more robust thanks to CV. As for performance, both techniques are able to give similar results.
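A minimal blending sketch, under a few stated assumptions: the base models, the meta-model, and the Iris data are illustrative, and the holdout fraction is raised from 10% to 30% only because 10% of such a small training set would leave about ten rows for the stacker. The helper `meta_features` is hypothetical and exists just to avoid repeating the stacking line.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Carve a holdout set out of the training data. The 30% fraction is an
# assumption for this tiny data set; larger data sets can use far less.
X_base, X_hold, y_base, y_hold = train_test_split(
    X_train, y_train, test_size=0.3, stratify=y_train, random_state=1
)

# Base models only ever see the non-holdout part of the training set.
base_models = [KNeighborsClassifier(n_neighbors=5), GaussianNB()]
for model in base_models:
    model.fit(X_base, y_base)


def meta_features(models, data):
    """Hypothetical helper: stack each model's class probabilities."""
    return np.hstack([m.predict_proba(data) for m in models])


# The stacker (blender) trains on holdout predictions only.
blender = LogisticRegression(max_iter=1000)
blender.fit(meta_features(base_models, X_hold), y_hold)

blend_accuracy = blender.score(meta_features(base_models, X_test), y_test)
print(f"Blended test accuracy: {blend_accuracy:.2f}")
```

Note that no fold loop is needed, which is where the simplicity comes from; the price is that the blender learns from far fewer rows than a CV-based stacker would.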

### **Multi-Layered Stacking**

Stacking is not restricted to just two layers; in theory, you can add as many layers as you like. Each layer feeds its predictions as features into the next layer of models. K-fold CV is again applied to each layer with the data available from the layer below. While not as common as 2-layer stacking or other ensemble methods, due to its complexity, multi-layered stacking can be fairly powerful and has been used in the winning approaches of several Kaggle and KDD Cup competitions.
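One way to sketch a three-layer stack, assuming scikit-learn 0.22 or later is available, is to nest one `StackingClassifier` inside another as its `final_estimator`. The particular models in each layer are arbitrary choices for illustration, and this is only one possible construction, not the approach used by any specific competition entry.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Layers 2 and 3: two models that consume layer-1 predictions, combined
# by a logistic regression on top.
layer_two = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=1)),
        ("nb2", GaussianNB()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

# Layer 1: base models on the raw features. Their out-of-fold
# predictions (5-fold CV) become the inputs of `layer_two`.
multi_layer = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=1)),
        ("nb", GaussianNB()),
    ],
    final_estimator=layer_two,
    cv=5,
)

scores = cross_val_score(multi_layer, X, y, cv=3, scoring="accuracy")
print(f"Accuracy: {scores.mean():.2f} (+/- {scores.std():.2f}) [3-layer stack]")
```

Each added layer multiplies the number of model fits, since every layer runs its own internal CV, which is one reason deep stacks are less common in practice.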

### **Code Tutorial CV Stacking**

Older versions of sklearn have no support for stacking (`StackingClassifier` and `StackingRegressor` were only added in version 0.22). This tutorial uses another library, **mlxtend** [3], which extends sklearn.

mlxtend implements CV stacking with `StackingCVClassifier` and `StackingCVRegressor`.

Both of these classes also support grid search and training base models on subsets of features [4].


In [1]:
from sklearn import datasets
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB 
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingCVClassifier
import numpy as np

# Set random seed for all methods
RANDOM_SEED = 1

# Load in dataset
iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target

# Instantiate classifiers
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=RANDOM_SEED)
clf3 = GaussianNB()
lr = LogisticRegression()

# The StackingCVClassifier uses scikit-learn's check_cv
# internally, which doesn't support a random seed. Thus
# NumPy's random seed needs to be specified explicitly for
# deterministic behavior
np.random.seed(RANDOM_SEED)
sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3], 
                            use_probas=True,
                            meta_classifier=lr,
                            cv=5)

print('3-fold cross validation:\n')

for clf, label in zip([clf1, clf2, clf3, sclf], 
                      ['KNN', 
                       'Random Forest', 
                       'Naive Bayes',
                       'Stacking']):

    scores = model_selection.cross_val_score(clf, X, y, cv=3, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))


3-fold cross validation:

Accuracy: 0.91 (+/- 0.01) [KNN]
Accuracy: 0.91 (+/- 0.06) [Random Forest]
Accuracy: 0.92 (+/- 0.03) [Naive Bayes]
Accuracy: 0.95 (+/- 0.04) [Stacking]


### **Code Tutorial Multi-Layered Stacking**

One of the few (clean) libraries I found that implements multi-layer stacking is **ml-ensemble** [5].
    

In [2]:
# TODO: make multi-level stacking example

___
## **References**

[1] Wolpert, David H. "Stacked generalization." Neural networks 5.2 (1992): 241-259.

[2] Breiman, Leo. "Stacked regressions." Machine learning 24.1 (1996): 49-64.

[3] https://rasbt.github.io/mlxtend/

[4] https://rasbt.github.io/mlxtend/user_guide/classifier/StackingCVClassifier/#methods
    
[5] http://ml-ensemble.com/info/start/ensembles.html