# **Tutorial: Ensemble Learning**
### By Kostas Hatalis

___
Ensemble learning is the application of multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. In this tutorial we'll discuss ensemble learning in the context of classification. The four most widely used ensemble approaches, which I cover here, are:

1. Voting 
2. Bagging 
3. Boosting
4. Stacking

In the literature, most of these approaches use decision trees, but almost any learning method can be used with each. Empirically, ensembles tend to yield better results when there is significant diversity among the models, so many ensemble methods seek to promote diversity among the models they combine.

Other more advanced, but lesser known, ensemble methods include:

1. Bayesian Model Averaging (BMA)
2. Bayesian Model Combination (BMC)
3. Fuzzy Fusion
4. Mixture of Experts
5. Meta Learning

___
## **Bias, Variance, and Noise**

In supervised learning our goal is to learn the function $ y = f(x) $. Since the actual $f$ is unknown, we approximate it as $\hat{y} = \hat{f}(x)$ where $\hat{f} \approx f$. Our function $\hat{f}$ can be any model, such as logistic regression or a neural network; ideally it discards as much noise as possible and achieves a low predictive error on unseen data. **Overfitting** occurs when $\hat{f}(x)$ fits the noise in the training set, and **underfitting** occurs when $\hat{f}(x)$ is not flexible enough to approximate $f$.

To avoid both overfitting and underfitting, it is important to understand the two reducible sources of prediction error: bias and variance. There is always a tradeoff between a model's ability to minimize bias and its ability to minimize variance.

More formally, we define the **generalization error** of $\hat{f}$ on unseen data as the sum of three error terms, where the expectation is taken over training sets and $\sigma^{2}_{e}$ is the variance of the noise:

$$
Error(x) = \left( E \left[ \hat{f}(x) \right]  - f(x) \right)^2 + E \left[ \left( \hat{f}(x)- E \left[ \hat{f}(x) \right] \right)^2 \right] + \sigma^{2}_{e}
$$

$$
Error(x) = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} 
$$

The **bias–variance tradeoff** problem is the conflict in trying to simultaneously minimize these two sources of error that prevent supervised learning algorithms from generalizing beyond their training set.
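The decomposition can be checked numerically when $f$ is known. The following is a minimal sketch under stated assumptions, not part of the original tutorial: the true function $f(x) = \sin(2\pi x)$, the fixed input grid, the noise level, and the helper name `bias_variance` are all chosen only for illustration. It refits a polynomial on many simulated training sets and estimates the squared bias and the variance of the predictions.

```python
import numpy as np
from numpy.polynomial import Polynomial


def bias_variance(degree, n_repeats=200, n_points=30, noise_sd=0.3, seed=0):
    """Estimate bias^2, variance, and noise for polynomial fits to a known f."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_points)   # fixed inputs (an assumption)
    f_true = np.sin(2 * np.pi * x)        # the "unknown" f, known only in this simulation

    # Fit one model per simulated training set and store its predictions.
    preds = np.empty((n_repeats, n_points))
    for i in range(n_repeats):
        y = f_true + rng.normal(0.0, noise_sd, size=n_points)
        preds[i] = Polynomial.fit(x, y, degree)(x)

    mean_pred = preds.mean(axis=0)                # estimate of E[f_hat(x)]
    bias_sq = np.mean((mean_pred - f_true) ** 2)  # squared bias, averaged over x
    variance = np.mean(preds.var(axis=0))         # variance, averaged over x
    return bias_sq, variance, noise_sd ** 2


for degree in (1, 7):
    b, v, n = bias_variance(degree)
    print(f"degree={degree}: bias^2={b:.3f}  variance={v:.3f}  noise={n:.3f}")
```

In this setup the straight line (degree 1) should show the larger squared bias and the degree-7 polynomial the larger variance, while the noise term is the same for both.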

**Note:** the generalization error cannot be calculated directly because it is a theoretical quantity, but we can approximate it with the cross-validation (CV) error.

<img src="images/bias.png" width="400">
<img src="images/fit.png" width="800">

**Bias Error:**

Bias error is the difference between the average prediction of our model and the correct value we are trying to predict. A model with high bias pays very little attention to the training data: it misses the relevant relations between features and target outputs and underfits. This leads to high error on both training and test data.

If $\hat{f}$ suffers from high bias, where the CV error of $\hat{f} \approx $ training set error of $\hat{f}$ (with the test error much higher than the desired error), then $\hat{f}$ is said to underfit the training set. To remedy underfitting:
- increase model complexity
- gather more relevant features

**Variance Error:**

Variance error tells you how inconsistent $\hat{f}$ is across different training sets. It quantifies the variability of a model's prediction for a given data point, that is, how spread out the predictions are when the model is trained on different samples. It measures the sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data (overfitting), so it does not generalize to data it hasn’t seen before. As a result, such models perform very well on training data but have high error rates on test data.

If $\hat{f}$ suffers from high variance, where CV error of $\hat{f}$ is greater than the training set error of $\hat{f}$, then $\hat{f}$ is said to overfit the training set. To remedy overfitting:
- decrease model complexity
- gather more relevant data

**Irreducible Error:**

Irreducible error is the noise in the problem itself that cannot be predicted. It therefore forms a lower bound on the expected error on unseen samples.

**Model Complexity:**

Model complexity sets the flexibility of $\hat{f}$. The more complex the model $\hat{f}(x)$ is, the more data points it will capture, and the lower the bias will be. However, complexity will make the model "move" more to capture the data points, and hence its variance will be larger. Typically, as you increase the complexity of your model, you will see a reduction in error due to lower bias. However, this only happens up to a point. As you continue to make your model more complex, you end up overfitting, and the model starts to suffer from high variance.
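One way to see this effect in practice is to sweep a complexity parameter and compare training accuracy with CV accuracy. The sketch below is an illustration rather than a prescribed recipe: the choice of a decision tree, the `max_depth` values, and the breast cancer dataset are assumptions made for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# max_depth controls model complexity; None lets the tree grow until its leaves are pure.
depths = [1, 2, 4, 8, None]
results = []
for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1)
    cv = cross_validate(tree, X, y, cv=5, scoring="accuracy",
                        return_train_score=True)
    # Average the per-fold training and validation accuracies.
    results.append((depth, cv["train_score"].mean(), cv["test_score"].mean()))

for depth, train_acc, cv_acc in results:
    print(f"max_depth={depth}: train={train_acc:.3f}  cv={cv_acc:.3f}")
```

Low accuracy on both columns points towards high bias, while a training accuracy well above the CV accuracy points towards high variance.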

<img src="images/complex.png" width="400">


___
## **Voting**

Voting combines two or more different machine learning models and uses a majority vote (hard vote) or the average of the predicted probabilities (soft vote) to predict the class labels. Such a classifier can be useful for a set of equally well-performing models, since it balances out their individual weaknesses.

**Hard Voting:** the predicted class label for a particular sample is the mode (the label that appears most often) of the class labels predicted by the individual classifiers.

**Soft Voting:** the predicted class label for a particular sample is the class with the highest predicted probability after averaging the probabilities from all classifiers.
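The two rules can give different answers for the same sample. The following small sketch uses made-up probabilities for three hypothetical classifiers and two classes; the numbers are not from any real model and serve only to show the arithmetic.

```python
import numpy as np

# Hypothetical predicted probabilities for ONE sample and two classes (0 and 1).
# Each row is one classifier; the values are invented for illustration.
probas = np.array([
    [0.45, 0.55],   # classifier 1 leans slightly towards class 1
    [0.45, 0.55],   # classifier 2 leans slightly towards class 1
    [0.90, 0.10],   # classifier 3 is confident in class 0
])

# Hard voting: each classifier casts one vote for its most probable class,
# and the most frequent label (the mode) wins.
votes = probas.argmax(axis=1)
hard_label = int(np.bincount(votes, minlength=probas.shape[1]).argmax())

# Soft voting: average the probabilities per class, then pick the largest.
mean_proba = probas.mean(axis=0)
soft_label = int(mean_proba.argmax())

print("hard vote:", hard_label)
print("soft vote:", soft_label, "average probabilities:", mean_proba)
```

Here two of the three classifiers vote for class 1, so the hard vote is class 1, but the confident third classifier pulls the averaged probabilities to 0.6 versus 0.4, so the soft vote is class 0.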

<img src="images/voting.PNG" width="800">

import warnings
warnings.simplefilter("ignore", category=DeprecationWarning)

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier

# load data
data = load_breast_cancer()
X, y = data.data, data.target

# create classifiers
clf1 = LogisticRegression(solver='lbfgs', multi_class='multinomial', random_state=1)
clf2 = RandomForestClassifier(n_estimators=5, random_state=1)
clf3 = GaussianNB()
eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='hard')

# run CV on each classifier
for clf, label in zip([clf1, clf2, clf3, eclf], ['Logistic Regression', 'Random Forest', 'naive Bayes', 'Ensemble']):
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

# Using the VotingClassifier with GridSearch
params = {'lr__C': [1.0, 10.0], 'rf__n_estimators': [10, 50, 100]}
grid = GridSearchCV(estimator=eclf, param_grid=params, cv=5)
grid = grid.fit(X, y)

print()
print("Best parameters set found in grid search:")
print(grid.best_params_)
print("Accuracy: %0.2f" % grid.best_score_)

Accuracy: 0.94 (+/- 0.02) [Logistic Regression]
Accuracy: 0.95 (+/- 0.02) [Random Forest]
Accuracy: 0.94 (+/- 0.02) [naive Bayes]
Accuracy: 0.95 (+/- 0.01) [Ensemble]

Best parameters set found in grid search:
{'lr__C': 1.0, 'rf__n_estimators': 100}
Accuracy: 0.96


___
## **Bagging**

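Bagging, also known as bootstrap aggregation, trains each base model on a bootstrap sample of the training set (a sample drawn with replacement) and combines the models' predictions by voting or averaging.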
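As a minimal sketch, scikit-learn's `BaggingClassifier` can wrap a base model so that every copy is fit on its own bootstrap sample. The decision tree base model, the 50 estimators, and the reuse of the breast cancer data from the voting example are assumptions made for illustration, not settings recommended by the tutorial.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A single decision tree versus 50 trees, each fit on its own bootstrap sample.
tree = DecisionTreeClassifier(random_state=1)
bagged = BaggingClassifier(DecisionTreeClassifier(random_state=1),
                           n_estimators=50, bootstrap=True, random_state=1)

# Run 5-fold CV on both models and report mean accuracy and its spread.
scores = {}
for label, clf in [("Single tree", tree), ("Bagged trees", bagged)]:
    scores[label] = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    print("Accuracy: %0.2f (+/- %0.2f) [%s]"
          % (scores[label].mean(), scores[label].std(), label))
```

Fully grown trees are a common base model here because they tend to have low bias and high variance, which is the kind of error that averaging over bootstrap samples is meant to reduce.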

___
## **Boosting**


Gradient Boosting Machines / Stochastic Gradient Boosting

- xgboost
- lightgbm
- catboost
- H2O

___
## **Stacking**



___
## **References**

[1] https://scikit-learn.org/stable/modules/ensemble.html#voting-classifier

[2] https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229

