# Ensemble Methods Intro

An ensemble method aggregates the predictions of a group of predictors (classifiers or regressors).

Examples of ensemble methods:

* Majority Voting
* Bagging
* Random Forests
* Boosting
* Stacking

**Regular Classification**

<div>
<img src="attachment:image.png" width="500"/>
</div>

## Voting Classifiers

### Hard Voting

To build a stronger classifier, aggregate the predictions of several classifiers and predict the class that gets the most votes.

> This majority-vote classifier is called a hard-voting classifier.

> The best results are achieved when the ensemble has a sufficient number of weak learners and they are sufficiently diverse. This is because the errors one classifier makes tend to be outvoted by the other classifiers that do not make the same error.

<div>
<img src="attachment:image.png" width="500"/>
</div>

🔥 One of the reasons for their power:

> Ensemble methods work best when the predictors are as independent of one another as possible. One way to get there is to train the classifiers with very different algorithms 👉 This increases the chance that they **will make very different types of errors**, improving the ensemble's accuracy.
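As a rough illustration of hard voting, the sketch below combines three different algorithms with scikit-learn's `VotingClassifier`. The moons dataset, the choice of classifiers and all hyperparameters are assumptions made for this example, not part of the original notes.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy dataset, assumed here purely for illustration
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Three different algorithms, so they are likely to make different errors
hard_voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC(random_state=42)),
    ],
    voting="hard",  # predict the class with the most votes
)
hard_voting_clf.fit(X_train, y_train)

# Compare each individual classifier with the ensemble
for name, clf in hard_voting_clf.named_estimators_.items():
    print(name, clf.score(X_test, y_test))
print("voting", hard_voting_clf.score(X_test, y_test))
```

On a dataset like this the ensemble usually scores at least as well as most of its members, though that is not guaranteed on every split.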

### Soft Voting

If all classifiers can estimate class probabilities (i.e. they have a **predict_proba** method), you can tell sklearn's `VotingClassifier` (with `voting="soft"`) to predict the class with the highest class probability, averaged over all the individual classifiers.

> It often achieves higher performance than hard voting because it gives more weight to highly confident votes.
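A minimal sketch of soft voting, assuming the same kind of toy dataset and three classifiers that all expose `predict_proba`. The specific classifiers are picked for illustration only.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Toy dataset, assumed here purely for illustration
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

soft_voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
        ("nb", GaussianNB()),
    ],
    voting="soft",  # average the class probabilities instead of counting votes
)
soft_voting_clf.fit(X_train, y_train)

# Averaged class probabilities for the first three test instances
proba = soft_voting_clf.predict_proba(X_test[:3])
print(proba)

# The predicted class is the one with the highest averaged probability
pred = soft_voting_clf.predict(X_test[:3])
print(pred, soft_voting_clf.classes_[np.argmax(proba, axis=1)])
```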

## Bagging & Pasting

Bagging is short for Bootstrap Aggregating. In both bagging and pasting, the same training algorithm is used for every predictor, but each predictor is trained on a different random subset of the training set. Both allow a training instance to be sampled several times across multiple predictors.

> Bagging 👉 when sampling is performed **with** replacement (it allows training instances to be sampled several times for the same predictor/model).

> Pasting 👉 when sampling is performed **without** replacement.

<div>
<img src="attachment:image.png" width="500"/>
</div>

The ensemble predicts the most frequent class among its predictors, just like a hard-voting classifier (or the average of the predictions for regression).

> 🔥 The training can be performed in parallel, which means Bagging & Pasting **scale very well**.

> Compared with a single model trained on the full training set, the ensemble has a comparable bias but a smaller variance. **It makes roughly the same number of errors on the training set, but the decision boundary is less irregular**.

<div>
<img src="attachment:image.png" width="500"/>
</div>
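The difference between bagging and pasting comes down to a single flag in scikit-learn's `BaggingClassifier`. The sketch below is one way to try both. The dataset, the helper `make_ensemble` and the hyperparameter values are assumptions made for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy dataset, assumed here purely for illustration
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)


def make_ensemble(with_replacement):
    """Hypothetical helper: 200 trees, each trained on 100 sampled instances."""
    return BaggingClassifier(
        DecisionTreeClassifier(random_state=42),
        n_estimators=200,
        max_samples=100,
        bootstrap=with_replacement,  # True -> bagging, False -> pasting
        n_jobs=1,  # set to -1 to train the predictors in parallel on all cores
        random_state=42,
    )


single_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
bag_clf = make_ensemble(with_replacement=True).fit(X_train, y_train)
paste_clf = make_ensemble(with_replacement=False).fit(X_train, y_train)

print("single tree", single_tree.score(X_test, y_test))
print("bagging    ", bag_clf.score(X_test, y_test))
print("pasting    ", paste_clf.score(X_test, y_test))
```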

## Random Forests

Example of a Random Forest:
* Train a group of Decision Tree Classifiers, each on a different subset of the training set
* To make predictions, you obtain the predictions of all the individual trees
* Then predict the class that gets the most votes

A Random Forest introduces extra randomness when growing trees: instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features.

> It trades a higher bias (more mistakes on the training set) for a lower variance (better generalisation), generally yielding an overall better model.
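A minimal sketch of a Random Forest with scikit-learn's `RandomForestClassifier`. The dataset and hyperparameter values are assumptions; `max_features` is the parameter that controls the size of the random feature subset considered at each split.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy dataset, assumed here purely for illustration
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rnd_clf = RandomForestClassifier(
    n_estimators=200,     # number of trees in the forest
    max_leaf_nodes=16,    # limit the size of each tree
    max_features="sqrt",  # size of the random feature subset tried per split
    random_state=42,
)
rnd_clf.fit(X_train, y_train)

# Each tree votes; the forest aggregates the individual trees' predictions
print(len(rnd_clf.estimators_), "trees")
print("accuracy", rnd_clf.score(X_test, y_test))
```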

### Feature importance

🔥 A great quality of Random Forests is that they make it easy to measure the relative importance of each feature.
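In scikit-learn, a fitted forest exposes these scores through the `feature_importances_` attribute. The sketch below uses the iris dataset as an assumed example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Iris is used here only as a small, familiar example dataset
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rnd_clf.fit(iris.data, iris.target)

# The scores are normalised so that they sum to 1
importances = rnd_clf.feature_importances_
for name, score in zip(iris.feature_names, importances):
    print(f"{name}: {score:.3f}")
```

On iris, the two petal measurements typically account for most of the importance.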

## Boosting

Boosting refers to any ensemble method that can combine several weak learners into a strong learner. The idea is to train models sequentially, each trying to correct its predecessor. Two very popular boosting methods are:

1. AdaBoost
2. Gradient Boosting

### AdaBoost

```python
from sklearn.ensemble import AdaBoostClassifier
```

It focuses on fixing its predecessors' mistakes.

Initially, each **instance weight is set to 1/m** (where m is the number of training instances).

1. The algorithm trains a base classifier and uses it to make predictions on the training set.
2. It computes a weighted error rate for the predictor (or classifier).
3. **It increases the relative weight of misclassified training instances**. Then all instance weights are normalised.
4. It then trains a second classifier using the updated weights (**it boosts these instances**), makes predictions on the training set again, and updates the weights of the misclassified instances again.
5. It keeps repeating the process until the desired number of classifiers / predictors has been trained (or a perfect predictor is found).

> 🔥 To make predictions, AdaBoost computes the predictions of all the predictors (individual models) and weighs them using the predictor weights. The predicted class is the one that receives the majority of weighted votes.
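The training loop and the weighted vote can be sketched by hand with NumPy and decision stumps. The notes do not give formulas, so this sketch assumes the common formulation: predictor weight `alpha = learning_rate * log((1 - error) / error)`, and misclassified instance weights multiplied by `exp(alpha)`. The dataset, `learning_rate` and the number of predictors are also assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# Toy binary dataset with labels 0 and 1, assumed for illustration
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
m = len(X)
learning_rate = 1.0  # assumed hyperparameter
n_predictors = 5     # assumed ensemble size

weights = np.full(m, 1 / m)  # every instance starts with weight 1/m
predictors, alphas = [], []

for _ in range(n_predictors):
    # Weak learner: a decision stump trained with the current weights
    stump = DecisionTreeClassifier(max_depth=1, random_state=42)
    stump.fit(X, y, sample_weight=weights)
    wrong = stump.predict(X) != y

    # Weighted error rate, clipped as a numerical guard (an assumption)
    error = weights[wrong].sum() / weights.sum()
    error = np.clip(error, 1e-10, 1 - 1e-10)
    alpha = learning_rate * np.log((1 - error) / error)  # predictor weight
    predictors.append(stump)
    alphas.append(alpha)

    if not wrong.any():  # perfect predictor found, stop early
        break

    # Boost the misclassified instances, then normalise all weights
    weights[wrong] *= np.exp(alpha)
    weights /= weights.sum()

# Prediction: each predictor votes for a class with weight alpha
votes = np.zeros((m, 2))
for stump, alpha in zip(predictors, alphas):
    votes[np.arange(m), stump.predict(X)] += alpha
y_pred = votes.argmax(axis=1)
print("training accuracy", (y_pred == y).mean())
```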


<div>
<img src="attachment:image.png" width="400"/>
</div>

> 👎 A downside of Boosting is that it **does not** scale as well as bagging or pasting: training cannot be parallelised, because each predictor can only be trained after its predecessor has been trained and evaluated.
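In practice the sequential training is handled by `AdaBoostClassifier` itself. A minimal usage sketch follows, with an assumed toy dataset and hyperparameters.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy dataset, assumed here purely for illustration
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # weak learner: a decision stump
    n_estimators=200,   # upper bound on the number of sequential predictors
    learning_rate=0.5,  # shrinks the contribution of each predictor
    random_state=42,
)
ada_clf.fit(X_train, y_train)
print("accuracy", ada_clf.score(X_test, y_test))
```

If the ensemble overfits, reducing `n_estimators` or regularising the base estimator more strongly are the usual first things to try.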

### Gradient Boosting

```python
from sklearn.ensemble import GradientBoostingRegressor
```

Just like AdaBoost, Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However, instead of tweaking the instance weights at every iteration like AdaBoost does, 

> this method tries to **fit the new predictor to the residual errors made by the previous predictor**
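Fitting each new predictor to the residuals can be sketched by hand with three small regression trees. The noisy quadratic dataset, the tree depth and the helper `ensemble_predict` are assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy quadratic data, assumed here purely for illustration
rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(200, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.standard_normal(200)

# First tree fits the targets directly
tree1 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)

# Second tree fits the residual errors left by the first tree
residuals1 = y - tree1.predict(X)
tree2 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, residuals1)

# Third tree fits the residual errors left by the second tree
residuals2 = residuals1 - tree2.predict(X)
tree3 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, residuals2)


def ensemble_predict(X_new, trees):
    """Hypothetical helper: the ensemble prediction is the sum over all trees."""
    return sum(tree.predict(X_new) for tree in trees)


mse_one = np.mean((y - ensemble_predict(X, [tree1])) ** 2)
mse_three = np.mean((y - ensemble_predict(X, [tree1, tree2, tree3])) ** 2)
print("training MSE with 1 tree :", mse_one)
print("training MSE with 3 trees:", mse_three)
```

`GradientBoostingRegressor` automates this kind of loop. Its `n_estimators` parameter sets the number of trees and `learning_rate` scales the contribution of each one.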