# **Tutorial: Ensemble Learning**
### By Kostas Hatalis

___
Ensemble learning is the application of multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. In this tutorial we'll discuss ensemble learning in the context of classification. The four most widely used ensemble approaches, which I cover here, are:

1. Voting 
2. Bagging 
3. Stacking
4. Boosting

In the literature, most of these approaches use decision trees, but almost any method(s) can be used with each. Empirically, ensembles tend to yield better results when there is a significant diversity among the models. Many ensemble methods, therefore, seek to promote diversity among the models they combine. 

Other more advanced, but lesser known, ensemble methods include:

1. Bayesian Model Averaging (BMA)
2. Bayesian Model Combination (BMC)
3. Fuzzy Fusion
4. Mixture of Experts
5. Meta Learning

___
## **Bias, Variance, and Noise**

In supervised learning our goal is to learn the function $y = f(x)$. Since the actual $f$ is unknown, we approximate it as $\hat{y} = \hat{f}(x)$ where $\hat{f} \approx f$. Our function $\hat{f}$ can be any model, such as logistic regression or a neural network; ideally it discards as much noise as possible and achieves a low predictive error on unseen data. **Overfitting** occurs when $\hat{f}(x)$ fits the noise in the training set, and **underfitting** occurs when $\hat{f}(x)$ is not flexible enough to approximate $f$.

To avoid overfitting and underfitting, it is important to understand the two sources of prediction error: bias and variance. There is always a tradeoff between a model's ability to minimize bias and its ability to minimize variance.

More formally, assuming the observations contain zero-mean noise with variance $\sigma^{2}_{e}$, we define **the generalization error** of $\hat{f}$ on unseen data as the sum of three errors:

$$
Error(x) = \left( E \left[ \hat{f}(x) \right]  - f(x) \right)^2 + E \left[ \left( \hat{f}(x)- E \left[ \hat{f}(x) \right] \right)^2 \right] + \sigma^{2}_{e}
$$

$$
Error(x) = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} 
$$

The **bias–variance tradeoff** problem is the conflict in trying to simultaneously minimize these two sources of error that prevent supervised learning algorithms from generalizing beyond their training set.

**Note:** the generalization error cannot be calculated directly; it is a theoretical concept. We can, however, approximate it with the error from cross-validation (CV).
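
The decomposition can be checked numerically on a toy problem where the true function is known. The sketch below is not part of the original tutorial: it assumes a target $f(x) = \sin(2\pi x)$, Gaussian noise with standard deviation 0.3, and polynomial models of degree 1 and 8 (all illustrative choices). It refits each model on many freshly drawn training sets and estimates $\text{Bias}^2$ and $\text{Variance}$ from the spread of the resulting predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """True target function (an assumption made for this toy example)."""
    return np.sin(2 * np.pi * x)

sigma = 0.3                      # standard deviation of the irreducible noise
n_train, n_repeats = 30, 500     # training-set size, number of resampled training sets
x_test = np.linspace(0.05, 0.95, 50)

def bias_variance(degree):
    """Estimate Bias^2 and Variance of a polynomial fit of the given degree."""
    preds = np.empty((n_repeats, x_test.size))
    for r in range(n_repeats):
        # draw a fresh noisy training set and refit the model
        x = rng.uniform(0, 1, n_train)
        y = f(x) + rng.normal(0, sigma, n_train)
        preds[r] = np.polyval(np.polyfit(x, y, degree), x_test)
    mean_pred = preds.mean(axis=0)                   # estimate of E[f_hat(x)]
    bias2 = np.mean((mean_pred - f(x_test)) ** 2)    # averaged over x_test
    variance = np.mean(preds.var(axis=0))            # averaged over x_test
    return bias2, variance

results = {degree: bias_variance(degree) for degree in (1, 8)}
for degree, (bias2, variance) in results.items():
    total = bias2 + variance + sigma ** 2
    print("degree=%d  bias^2=%.3f  variance=%.3f  expected error=%.3f"
          % (degree, bias2, variance, total))
```

Under these assumptions the straight line should show the larger bias term and the degree-8 polynomial the larger variance term.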

<img src="images/bias.png" width="400">
<img src="images/fit.png" width="800">

**Bias Error:**

Bias error is the difference between the average prediction of our model and the correct value which we are trying to predict. A model with high bias pays very little attention to the training data: it misses the relevant relations between features and target outputs and underfits. This leads to high error on both training and test data.

If $\hat{f}$ suffers from high bias, where the CV error of $\hat{f} \approx$ the training set error of $\hat{f}$ and the error is much higher than the desired error, then $\hat{f}$ is said to underfit the training set. To remedy underfitting:
- increase model complexity
- gather more relevant features

**Variance Error:**

Variance error tells you how inconsistent $\hat{f}$ is over different training sets. It quantifies the variability of a model's prediction for a given data point, that is, how spread out the predictions are when the model is trained on different samples of the data. It measures the sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data (overfitting), so that it does not generalize to data it hasn't seen before. As a result, such models perform very well on training data but have high error rates on test data.

If $\hat{f}$ suffers from high variance, where CV error of $\hat{f}$ is greater than the training set error of $\hat{f}$, then $\hat{f}$ is said to overfit the training set. To remedy overfitting:
- decrease model complexity
- gather more relevant data

**Irreducible Error:**

Irreducible error is the noise in the problem itself that cannot be predicted. It therefore forms a lower bound on the expected error on unseen samples.

**Model Complexity:**

Model complexity sets the flexibility of $\hat{f}$. The more complex the model $\hat{f}(x)$ is, the more data points it will capture, and the lower the bias will be. However, complexity will make the model "move" more to capture the data points, and hence its variance will be larger. Typically, as you increase the complexity of your model, you will see a reduction in error due to lower bias. However, this only happens up to a particular point. As you continue to make your model more complex, you end up overfitting, and your model will start suffering from high variance.
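
To see this effect on real data, one can compare training error with CV error while varying a complexity knob. The following is a minimal sketch, not from the original notebook: it assumes the breast cancer dataset used later in this tutorial and uses the `max_depth` of a decision tree as the complexity parameter (the depth grid and the name `gaps` are illustrative).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

gaps = {}
for depth in (1, 3, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1)
    cv = cross_validate(tree, X, y, cv=5, scoring='accuracy',
                        return_train_score=True)
    train_err = 1 - cv['train_score'].mean()
    cv_err = 1 - cv['test_score'].mean()
    gaps[depth] = (train_err, cv_err)
    print("max_depth=%s  train error=%.3f  CV error=%.3f"
          % (depth, train_err, cv_err))
```

A training error close to the CV error (both high) points to underfitting, while a training error far below the CV error points to overfitting.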

<img src="images/complx.png" width="400">


___
## **Voting**

Voting methods, also known as averaging methods, combine two or more different machine learning models and use a majority vote (hard vote) or the average predicted probabilities (soft vote) to predict the class labels. Such a classifier can be useful for a set of equally well performing models in order to balance out their individual weaknesses. On average, the combined estimator is usually better than any single base estimator **because its variance is reduced.**

**Hard Voting:** the predicted class label for a particular sample is the mode (the label that appears most often) of the class labels predicted by the individual classifiers.

**Soft Voting:** the predicted class label for a particular sample is the class with the highest average predicted probability across all classifiers (the argmax of the averaged probabilities).
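
The two rules can disagree. The short sketch below uses made-up probabilities for a single sample (they are an assumption for illustration, not output of any model in this tutorial): two classifiers weakly prefer class 0 while one strongly prefers class 1.

```python
import numpy as np

# Rows: classifiers; columns: P(class 0), P(class 1) for ONE sample.
# The numbers are invented for illustration.
probas = np.array([
    [0.55, 0.45],   # weakly prefers class 0
    [0.60, 0.40],   # weakly prefers class 0
    [0.05, 0.95],   # strongly prefers class 1
])

# Hard voting: each classifier casts one vote for its most likely class.
votes = probas.argmax(axis=1)                         # [0, 0, 1]
hard = int(np.bincount(votes, minlength=2).argmax())  # class 0 wins 2 to 1

# Soft voting: average the probabilities, then take the argmax.
soft = int(probas.mean(axis=0).argmax())              # mean is [0.4, 0.6]

print(hard, soft)  # prints "0 1"
```

Soft voting lets a confident classifier outweigh two hesitant ones, which is why it requires base models that output reasonably calibrated probabilities.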

<img src="images/voting.PNG" width="800">


In [23]:
import warnings
warnings.simplefilter("ignore", category=DeprecationWarning)

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier

# load data
data = load_breast_cancer()
X, y = data.data, data.target

# create classifiers
# multi_class is omitted: the task is binary and the parameter is deprecated in recent scikit-learn
clf1 = LogisticRegression(solver='lbfgs', random_state=1)
clf2 = RandomForestClassifier(n_estimators=5, random_state=1)
clf3 = GaussianNB()
eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='hard')

# run CV on each classifier
for clf, label in zip([clf1, clf2, clf3, eclf], ['Logistic Regression', 'Random Forest', 'naive Bayes', 'Voting Ensemble']):
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

# Using the VotingClassifier with GridSearch
params = {'lr__C': [1.0, 10.0], 'rf__n_estimators': [10, 50, 100]}
grid = GridSearchCV(estimator=eclf, param_grid=params, cv=5)
grid = grid.fit(X, y)

print()
print("Best parameters set found in grid search CV for voting classifier:")
print(grid.best_params_)
print("Accuracy: %0.2f" % grid.best_score_ )

Accuracy: 0.94 (+/- 0.02) [Logistic Regression]
Accuracy: 0.95 (+/- 0.02) [Random Forest]
Accuracy: 0.94 (+/- 0.02) [naive Bayes]
Accuracy: 0.95 (+/- 0.01) [Voting Ensemble]

Best parameters set found in grid search CV for voting classifier:
{'lr__C': 1.0, 'rf__n_estimators': 100}
Accuracy: 0.96


___
## **Bagging**

Bagging, also known as bootstrap aggregation, is an ensemble method that trains the same algorithm many times on different subsets sampled with replacement (known as bootstrap sampling) from the training data. It then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used as a way to **reduce the variance** of a base estimator (e.g., a CART or ANN), by introducing randomization into its construction procedure and then making an ensemble out of it.

*As they provide a way to reduce overfitting, **bagging methods work best with strong and complex models** (e.g., fully developed decision trees), in contrast with boosting methods which usually work best with weak models (e.g., shallow decision trees).*

*When bagging with decision trees, we are less concerned about individual trees overfitting the training data. For this reason and for efficiency, the individual decision trees are grown deep (e.g. few training samples at each leaf-node of the tree) and the trees are not pruned. These trees will have both high variance and low bias. These are important characteristics of sub-models when combining predictions using bagging. The only parameter when bagging decision trees is the number of bootstrap samples, and hence the number of trees to include. This can be chosen by increasing the number of trees run after run until the accuracy stops showing improvement.*
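
The procedure described above can be written out by hand in a few lines. This is a minimal sketch under stated assumptions, not the implementation used by any library: it assumes the breast cancer dataset, 100 unpruned trees, and a single 70/30 train/test split (the names `trees`, `all_preds`, and `bagged_pred` are illustrative).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

rng = np.random.default_rng(1)
n_trees = 100
trees = []
for i in range(n_trees):
    # bootstrap sample: draw len(X_tr) row indices WITH replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # deep, unpruned tree (scikit-learn's default) fitted on the resampled rows
    trees.append(DecisionTreeClassifier(random_state=i).fit(X_tr[idx], y_tr[idx]))

# aggregate: majority vote over the trees (labels are 0/1, so mean >= 0.5)
all_preds = np.array([t.predict(X_te) for t in trees])   # shape (n_trees, n_test)
bagged_pred = (all_preds.mean(axis=0) >= 0.5).astype(int)

mean_tree_acc = (all_preds == y_te).mean()
bagged_acc = (bagged_pred == y_te).mean()
print("Mean accuracy of individual trees: %0.3f" % mean_tree_acc)
print("Accuracy of the bagged vote:       %0.3f" % bagged_acc)
```

Comparing the two printed numbers gives a rough picture of how much the vote smooths out the variance of the individual trees on this split.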

Bagging methods come in many flavours but mostly differ from each other by the way they draw random subsets of the training set:
- **Bagging**: samples are drawn with replacement.
- **Pasting**: samples are drawn without replacement (need big training set).
- **Random Subspaces**: base estimators are built on random subsets of the features.
- **Random Patches**: base estimators are built on random subsets of both samples and features.
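
In scikit-learn these four flavours map onto the `bootstrap`, `max_samples`, and `max_features` parameters of `BaggingClassifier`. The sketch below is one possible configuration, assuming the breast cancer dataset; the sampling fractions (0.8 and 0.5) and the number of estimators are arbitrary illustrative values.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
tree = DecisionTreeClassifier(random_state=1)

# The base estimator is passed positionally so the code does not depend on
# the keyword name (base_estimator in older releases, estimator in newer ones).
variants = {
    'Bagging': BaggingClassifier(
        tree, n_estimators=50, max_samples=0.8, bootstrap=True, random_state=1),
    'Pasting': BaggingClassifier(
        tree, n_estimators=50, max_samples=0.8, bootstrap=False, random_state=1),
    'Random Subspaces': BaggingClassifier(
        tree, n_estimators=50, max_samples=1.0, bootstrap=False,
        max_features=0.5, random_state=1),
    'Random Patches': BaggingClassifier(
        tree, n_estimators=50, max_samples=0.8, bootstrap=True,
        max_features=0.5, random_state=1),
}

scores = {}
for name, model in variants.items():
    cv_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    scores[name] = cv_scores.mean()
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (cv_scores.mean(), cv_scores.std(), name))
```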

**Advantages:**
- Reduces over-fitting of the model.
- Handles higher dimensionality data very well.
- Maintains accuracy for missing data.

**Disadvantages:**
- Since the final prediction is based on the mean of the predictions from the individual trees, it won't give precise values for classification and regression models.

**Random Forest** is an algorithm (explained in another notebook) that makes a small tweak to bagging and results in a very powerful prediction method.

<img src="images/bagging.PNG" width="800">


In [34]:
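from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier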

# load data
data = load_breast_cancer()
X, y = data.data, data.target

# Instantiate a classification-tree 'dt'
dt = DecisionTreeClassifier(random_state=1)

# Instantiate a BaggingClassifier 'bc'
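# (the tree is passed positionally because the keyword was renamed from
# base_estimator to estimator in scikit-learn 1.2)
bc = BaggingClassifier(dt, n_estimators=300)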

# Get accuracy score using CV for dt
scores = cross_val_score(dt, X, y, cv=5, scoring='accuracy')
print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), 'Decision Tree'))

# Get accuracy score using CV for bc
scores = cross_val_score(bc, X, y, cv=5, scoring='accuracy')
print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), 'Bagging'))

Accuracy: 0.92 (+/- 0.02) [Decision Tree]
Accuracy: 0.96 (+/- 0.03) [Bagging]


___
## **Stacking**

Stacking (also called stacked generalization or super learning), introduced in 1992 [5], involves training a **meta-model** on **meta-features**, which are the predictions of several base learning algorithms, with the aim of reducing the generalization error. In other words, the basic idea is to train several base models and feed their predictions into another model that learns to weigh and combine the base predictions to get (ideally) better predictions.

For classification, the meta-classifier can be trained either on the predicted class labels or on the predicted probabilities from the ensemble. Stacking can also be used for regression, where it is known as stacked regression, introduced in 1996 [6]. The meta-regressor uses the numeric predictions from the individual base regressors as inputs to make a final prediction.

The standard stacking procedure is to fit the base (first-level) models to the whole training set and then use their predictions on that same training set to prepare the inputs for the meta (second-level) model. This type of stacking is prone to **overfitting due to information leakage, and should be avoided.** Thus, it is advised to combine stacking with cross-validation (CV).

For classification and regression, CV based stacking works as follows:
1. Split data set into training and testing sets.
2. Take the training set and split into k folds.
    - $k-1$ folds are used for training and 1 fold used for validation.
3. Fit the base models on the $k-1$ training folds.
4. Apply the base learners to predict the validation fold.
5. Stack the resulting predictions as input data to the meta-model.
6. Repeat steps 3 to 5, using a different fold for validation each time, until the whole training set has been cycled through to create a full stack of predictions as input to the meta-model.
7. Train the meta-model on the stacked predictions.
8. After the meta-model has been trained, retrain the base models on the entire training set.
    - At this point validate your model on the testing set.
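
The steps above can be sketched by hand with scikit-learn's `cross_val_predict`, which returns out-of-fold predictions and so covers steps 2 to 6 in one call. This is an illustrative sketch, not the implementation used by mlxtend: the dataset (breast cancer), the choice of base models, and the use of the class-1 probability as the only meta-feature per model are all assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Step 1: split into training and testing sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

base_models = [
    KNeighborsClassifier(n_neighbors=5),
    RandomForestClassifier(n_estimators=50, random_state=1),
    GaussianNB(),
]

# Steps 2-6: every training row is predicted by a model that never saw it;
# one meta-feature (P(class 1)) per base model
meta_train = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method='predict_proba')[:, 1]
    for m in base_models
])

# Step 7: train the meta-model on the stacked out-of-fold predictions
meta_model = LogisticRegression().fit(meta_train, y_tr)

# Step 8: refit the base models on the whole training set, then evaluate
for m in base_models:
    m.fit(X_tr, y_tr)
meta_test = np.column_stack([m.predict_proba(X_te)[:, 1] for m in base_models])
stack_acc = meta_model.score(meta_test, y_te)
print("Stacking test accuracy: %0.3f" % stack_acc)
```

Because the meta-model only ever sees out-of-fold predictions during training, it cannot exploit base models that have memorized their own training rows.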

This process is illustrated in the figure below (from [7]) for classification and regression:

<img src="images/stacking.PNG" width="800">

**Stacking with CV typically yields performance better than any single one of the trained base models.** 

### **Diversity**

It is important to try diverse types of base and meta models!

In practice, a logistic regression model is often used as the meta-model. However, any algorithm could be used as the meta-model. Stacking with nonlinear meta-models, such as GBMs and ANNs, can give surprisingly large gains on multiclass problems.

In the base models, the same algorithm could also be used multiple times with different training algorithms, different hyperparameters, and different feature subsets. For instance, you could have 20 neural networks, 20 support vector machines, and 20 random forests as the base models. There's no limit on how many models you can use, but beyond a certain number of models performance will reach a plateau.

### **Blending**

The top performers in the Netflix Prize competition (launched in 2006) introduced a form of stacking called blending. With blending, instead of creating out-of-fold predictions for the train set, you set aside a small holdout set of, say, 10% of the train set. The stacker model then trains on this holdout set only. It is simpler than stacking, and it guards against information leakage because the base models and the stacker use different data. However, you use less data overall and the final model may overfit to the holdout set, whereas stacking with CV is more robust. As for performance, both techniques are able to give similar results.
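
A minimal sketch of blending, under the same kind of assumptions as the other examples in this tutorial (breast cancer data, three arbitrary base models, a logistic regression stacker, and a 10% holdout; the helper `meta_features` is a hypothetical name introduced here):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Set aside 10% of the train set as the holdout for the stacker
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_tr, y_tr, test_size=0.1, random_state=1, stratify=y_tr)

base_models = [
    KNeighborsClassifier(n_neighbors=5),
    RandomForestClassifier(n_estimators=50, random_state=1),
    GaussianNB(),
]
for m in base_models:
    m.fit(X_fit, y_fit)          # base models never see the holdout rows

def meta_features(X_part):
    """Stack each base model's P(class 1) as one column."""
    return np.column_stack([m.predict_proba(X_part)[:, 1] for m in base_models])

# The stacker is trained on the holdout predictions only
stacker = LogisticRegression().fit(meta_features(X_hold), y_hold)
blend_acc = stacker.score(meta_features(X_te), y_te)
print("Blending test accuracy: %0.3f" % blend_acc)
```

Note how small the stacker's training set is here (10% of the train set), which is the trade-off against CV-based stacking mentioned above.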

### **Multi-Layered Stacking**

Stacking is not restricted to just two layers; in theory you can add as many layers as you like. One layer feeds its predictions as features into the next layer of models. K-fold CV is again applied to each layer with the data available from the layer below. While not as common as 2-layer stacking or other ensemble methods, due to complexity issues, multi-layered stacking can be fairly powerful and has been used as the winning approach in several Kaggle and KDD Cup competitions.
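
One way to sketch a multi-layer stack without an extra library is to nest scikit-learn's own `StackingClassifier`, passing a second `StackingClassifier` as the `final_estimator` of the first. This assumes scikit-learn 0.22 or later; the dataset (breast cancer) and the choice of models in each layer are illustrative only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Layer 1: base models trained on the raw features
layer_1 = [
    ('knn', KNeighborsClassifier(n_neighbors=5)),
    ('rf', RandomForestClassifier(n_estimators=50, random_state=1)),
    ('gnb', GaussianNB()),
]

# Layer 2: models trained on the out-of-fold predictions of layer 1
layer_2 = [
    ('dt', DecisionTreeClassifier(max_depth=3, random_state=1)),
    ('lr', LogisticRegression(max_iter=1000)),
]

# Layers 2 and 3 form a stack of their own, used as the final estimator
upper = StackingClassifier(estimators=layer_2,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)
multi = StackingClassifier(estimators=layer_1, final_estimator=upper, cv=5)

scores = cross_val_score(multi, X, y, cv=3, scoring='accuracy')
print("Accuracy: %0.2f (+/- %0.2f) [3-layer stack]" % (scores.mean(), scores.std()))
```

Each `StackingClassifier` applies its own internal K-fold CV, so every layer is trained on out-of-fold predictions from the layer below.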

In [7]:
'''
TUTORIAL STACKING

When this tutorial was written, sklearn had no support for stacking (StackingClassifier and
StackingRegressor were added in scikit-learn 0.22). Another library, mlxtend, extends sklearn and provides it.

mlxtend implements CV stacking with StackingCVClassifier and StackingCVRegressor

StackingCVClassifier also supports Grid Search and training base models on subsets of features.

https://rasbt.github.io/mlxtend/user_guide/classifier/StackingCVClassifier/#methods
'''
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target

from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB 
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingCVClassifier
import numpy as np

RANDOM_SEED = 1

clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=RANDOM_SEED)
clf3 = GaussianNB()
lr = LogisticRegression()

# The StackingCVClassifier uses scikit-learn's check_cv
# internally, which doesn't support a random seed. Thus
# NumPy's random seed needs to be specified explicitly for
# deterministic behavior
np.random.seed(RANDOM_SEED)
sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3], 
                            use_probas=True,
                            meta_classifier=lr,
                            cv=5)

print('3-fold cross validation:\n')

for clf, label in zip([clf1, clf2, clf3, sclf], 
                      ['KNN', 
                       'Random Forest', 
                       'Naive Bayes',
                       'StackingClassifier']):

    scores = model_selection.cross_val_score(clf, X, y, cv=3, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))


3-fold cross validation:

Accuracy: 0.91 (+/- 0.01) [KNN]
Accuracy: 0.91 (+/- 0.06) [Random Forest]
Accuracy: 0.92 (+/- 0.03) [Naive Bayes]
Accuracy: 0.95 (+/- 0.04) [StackingClassifier]


In [9]:
'''
Multi-Layer Stacking

One of the few (clean) libraries I found that implements multi-layer stacking is ml-ensemble:

http://ml-ensemble.com/info/start/ensembles.html
''';

# TODO: make multi-level stacking example

___
## **Boosting**

Boosting refers to an ensemble method in which several models are trained sequentially with each model learning from the errors of its predecessors. 
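
To make "learning from the errors of its predecessors" concrete, here is a from-scratch sketch of discrete AdaBoost, one classic boosting algorithm. It is an illustration under stated assumptions rather than a reference implementation: it assumes the breast cancer dataset, decision stumps (`max_depth=1`) as weak learners, 50 boosting rounds, and labels recoded to $\{-1, +1\}$.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
y_pm = np.where(y == 1, 1, -1)          # recode labels to {-1, +1}
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y_pm, test_size=0.3, random_state=1, stratify=y_pm)

n_rounds = 50
w = np.full(len(y_tr), 1.0 / len(y_tr))   # start with uniform sample weights
stumps, alphas = [], []
for _ in range(n_rounds):
    # fit a weak learner that pays more attention to heavily weighted samples
    stump = DecisionTreeClassifier(max_depth=1, random_state=1)
    stump.fit(X_tr, y_tr, sample_weight=w)
    pred = stump.predict(X_tr)

    # weighted error of this round (weights sum to 1), clipped for stability
    err = np.clip(np.sum(w * (pred != y_tr)), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # say of this stump in the final vote

    # increase weights of misclassified samples, decrease the others
    w = w * np.exp(-alpha * y_tr * pred)
    w /= w.sum()

    stumps.append(stump)
    alphas.append(alpha)

def boosted_predict(X_part):
    """Sign of the alpha-weighted sum of the stump predictions."""
    score = sum(a * s.predict(X_part) for a, s in zip(alphas, stumps))
    return np.where(score >= 0, 1, -1)

first_stump_train_acc = np.mean(stumps[0].predict(X_tr) == y_tr)
boosted_train_acc = np.mean(boosted_predict(X_tr) == y_tr)
boosted_test_acc = np.mean(boosted_predict(X_te) == y_te)
print("First stump, train accuracy:   %0.3f" % first_stump_train_acc)
print("Boosted model, train accuracy: %0.3f" % boosted_train_acc)
print("Boosted model, test accuracy:  %0.3f" % boosted_test_acc)
```

The reweighting step is where each round "learns from" the previous ones: samples that earlier stumps got wrong dominate the fit of the next stump.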

**Advantages:**
- Supports different loss functions (for example, XGBoost's `binary:logistic` objective for binary classification).
- Works well with interactions.

**Disadvantages:**
- Prone to over-fitting.
- Requires careful tuning of different hyper-parameters.


Popular implementations of Gradient Boosting Machines / Stochastic Gradient Boosting include:

- xgboost
- lightgbm
- catboost
- H2O
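
None of the libraries listed above is assumed to be installed for the following sketch. As a stand-in it uses scikit-learn's own `GradientBoostingClassifier`, where setting `subsample` below 1.0 gives stochastic gradient boosting. The dataset (breast cancer) and all hyperparameter values are illustrative and untuned.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

gbm = GradientBoostingClassifier(
    n_estimators=100,     # number of sequential trees
    learning_rate=0.1,    # shrinks the contribution of each tree
    max_depth=2,          # shallow (weak) trees
    subsample=0.8,        # < 1.0: each tree sees a random 80% of the rows
    random_state=1,
)

scores = cross_val_score(gbm, X, y, cv=5, scoring='accuracy')
print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), 'Gradient Boosting'))
```

The number of trees, the learning rate, and the tree depth interact strongly, which is one reason boosting needs the careful tuning mentioned above.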

___
## **Which is the best:** Voting, Bagging, Boosting, or Stacking?

There’s not an outright winner; it depends on the data, the simulation and the circumstances.
Bagging and Boosting decrease the variance of your single estimate as they combine several estimates from different models. So the result may be a model with higher stability.

If the problem is that the single model performs poorly (high bias), Bagging will rarely improve the bias. However, Boosting could generate a combined model with lower errors, as it builds on the strengths and reduces the pitfalls of the single model.

By contrast, if the difficulty of the single model is over-fitting, then Bagging is the best option. Boosting for its part doesn’t help to avoid over-fitting; in fact, this technique is faced with this problem itself. For this reason, Bagging is effective more often than Boosting.


___
## **References**

[1] https://scikit-learn.org/stable/modules/ensemble.html#voting-classifier

[2] https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229

[3] https://machinelearningmastery.com/bagging-and-random-forest-ensemble-algorithms-for-machine-learning/

[4] https://machinelearningmastery.com/boosting-and-adaboost-for-machine-learning/

[5] Wolpert, David H. "Stacked generalization." Neural Networks 5.2 (1992): 241-259.

[6] Breiman, Leo. "Stacked regressions." Machine learning 24.1 (1996): 49-64.

[7] https://rasbt.github.io/mlxtend/