# Ensemble Learning
Ensemble learning is a machine learning technique that combines the predictions of multiple models to obtain better predictive performance than a single model. The idea is that errors made by one model can be offset by the others, so the combined model is often more accurate and more robust than any of its members. Ensemble learning may be particularly suitable for situations such as:
- **When the dataset is limited:** If the dataset is limited, it may be difficult to build a single model that captures all the variations in the data. In this case, an ensemble of models can help to capture more of the nuances in the data.
- **When the individual models have different strengths and weaknesses:** Different models may be better at capturing different aspects of the data. Combining these models can lead to a more accurate and robust final model.
- **When the problem is complex:** Ensemble learning can be particularly useful in complex problems, where it may be difficult to identify a single best model. In these cases, an ensemble of models can provide a more nuanced and accurate solution.

Best practices for choosing ensemble methods in machine learning:
- **Diversity:** The models used in the ensemble should be diverse in terms of their underlying algorithms, data preprocessing, and feature selection methods. This can help to capture different aspects of the data and reduce the risk of overfitting.
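- **Quality of base models:** Each base model should perform at least better than random guessing. Boosting, for example, builds a strong ensemble out of weak learners, but combining models that are no better than chance, or that all make the same mistakes, will not improve the overall performance.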
- **Ensemble size:** The size of the ensemble should be carefully chosen. A larger ensemble may be more accurate, but may also be more complex and difficult to manage.
- **Combining predictions:** There are several ways to combine the predictions of the individual models in the ensemble, including averaging, voting, and stacking. The method used should be chosen based on the specific problem and the characteristics of the individual models.
- **Cross-validation:** Ensemble models should be evaluated using cross-validation to ensure that they generalize well to new data. Cross-validation can also be used to select the best combination of models for the ensemble.
- **Monitoring performance:** The performance of the ensemble model should be monitored over time to ensure that it continues to perform well as new data becomes available.
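
The cross-validation advice above can be sketched as follows. This is a minimal illustration rather than a tuned benchmark: it assumes scikit-learn's bundled copy of the iris data (`load_iris`) in place of the CSV file used later, and the names `candidates`, `cv`, and `cv_scores` are chosen only for this example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the bundled iris data stands in for ./data/iris.csv.
X_demo, y_demo = load_iris(return_X_y=True)

# Three diverse base models (linear, tree-based, kernel-based).
candidates = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "svc": make_pipeline(StandardScaler(), SVC()),
}
# The ensemble is built from the three base models defined above.
candidates["voting"] = VotingClassifier(
    estimators=list(candidates.items()), voting="hard"
)

# Use the same stratified folds for every candidate so scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv)
    for name, model in candidates.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Comparing the mean and spread of the fold scores, rather than a single train/test split, gives a steadier basis for deciding whether the ensemble is worth its extra complexity.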

In [47]:
import pandas as pd

import xgboost
from sklearn.ensemble import RandomForestClassifier, VotingClassifier, BaggingClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder

In [25]:
iris = pd.read_csv('./data/iris.csv')
iris.head()

| | sepal.length | sepal.width | petal.length | petal.width | variety |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Setosa |


In [26]:
X = iris.iloc[:, 0:4]
y = iris['variety']

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, shuffle=True, random_state=27)

### Voting Classifier

A voting classifier is an ensemble learning technique that combines several machine learning models by aggregating their predictions. There are two main types: **hard voting** and **soft voting**. Hard voting predicts the class that receives the most votes from the individual models, while soft voting averages the class probabilities predicted by each model and chooses the class with the highest average probability. Soft voting often performs better when the models produce reliable probabilities, because confident predictions carry more weight. Voting can improve accuracy and robustness by leveraging the strengths of multiple models and reducing the impact of any individual model's weaknesses. For regression problems, the analogous approach averages the models' numeric predictions (scikit-learn provides `VotingRegressor` for this).
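
The difference between the two voting rules is easiest to see on a single sample. The sketch below uses made-up class probabilities (the `probas` array is hypothetical, not the output of any model in this notebook) and applies both rules by hand with NumPy.

```python
import numpy as np

# Hypothetical class probabilities from three classifiers for ONE sample
# (rows: classifiers, columns: classes 0, 1, 2).
probas = np.array([
    [0.45, 0.55, 0.00],  # classifier A narrowly prefers class 1
    [0.40, 0.60, 0.00],  # classifier B narrowly prefers class 1
    [0.95, 0.05, 0.00],  # classifier C is very confident in class 0
])

# Hard voting: each classifier casts one vote for its top class.
votes = probas.argmax(axis=1)
hard_winner = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities, then pick the top class.
mean_proba = probas.mean(axis=0)
soft_winner = mean_proba.argmax()

print("hard vote:", hard_winner)  # hard vote: 1
print("soft vote:", soft_winner)  # soft vote: 0
```

Two lukewarm votes beat one confident vote under hard voting, while soft voting lets the confident classifier tip the result.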

In [12]:
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard'
)

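To use soft voting instead, set `voting='soft'`. Every estimator in the ensemble must then be able to predict class probabilities, that is, expose a `predict_proba` method. `SVC` does not do this by default; construct it with `probability=True`, which enables probability estimates at the cost of slower training.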
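
A minimal sketch of the soft-voting variant is shown below. It assumes scikit-learn's bundled iris data (`load_iris`) instead of the CSV file, and the names `soft_voting_clf`, `X_tr`, `X_te`, `y_tr`, and `y_te` are introduced only for this example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumption: the bundled iris data stands in for ./data/iris.csv.
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=27
)

soft_voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(random_state=0)),
        # probability=True is what makes predict_proba available on SVC.
        ("svc", SVC(probability=True, random_state=0)),
    ],
    voting="soft",
)
soft_voting_clf.fit(X_tr, y_tr)

# With soft voting the ensemble itself can report averaged probabilities.
proba = soft_voting_clf.predict_proba(X_te)
print(proba.shape)
print(soft_voting_clf.score(X_te, y_te))
```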

In [14]:
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.9210526315789473
RandomForestClassifier 0.9210526315789473
SVC 0.9473684210526315
VotingClassifier 0.9473684210526315


### Bagging and Pasting

Bagging (Bootstrap Aggregating) is a technique where multiple subsets of the training data are created by randomly selecting samples with replacement. A base model is then trained on each subset, and the results are combined using a voting or averaging method. The main idea behind bagging is to reduce the variance of the model by introducing randomness into the training data.

Pasting is similar to bagging but samples without replacement. Within a single subset, each training instance appears at most once, although the same instance can still be drawn for the subsets of several different predictors. Because bootstrapping introduces more diversity into the subsets, bagged predictors tend to be slightly less correlated with one another than pasted ones, at the cost of a slightly higher bias per predictor.

Predictors can all be trained in parallel, via different CPU cores or even different servers. Similarly, predictions can be made in parallel. This is one of the reasons bagging and pasting are such popular methods: they scale very well. Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation reduces both bias and variance. Generally, the net result is that the ensemble has a similar bias but a lower variance than a single predictor trained on the original training set.
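
The two sampling schemes can be compared directly with NumPy. This sketch uses index arrays only (no model is trained), and the sizes `n_samples` and `subset_size` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, subset_size = 1000, 600

# Bagging: draw n_samples indices WITH replacement, so duplicates are expected.
bag_idx = rng.choice(n_samples, size=n_samples, replace=True)

# Pasting: draw indices WITHOUT replacement, so every index is distinct.
paste_idx = rng.choice(n_samples, size=subset_size, replace=False)

# Fraction of distinct training instances that made it into the bootstrap sample.
unique_frac = np.unique(bag_idx).size / n_samples

print("distinct instances in bootstrap sample:", unique_frac)
print("left out of the bootstrap sample:", 1 - unique_frac)
print("distinct instances in pasting sample:", np.unique(paste_idx).size)
```

When a bootstrap sample is as large as the training set, roughly 63% of the instances appear in it on average (the limit is 1 - 1/e), and the remainder are left out for that predictor.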

In [15]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=500,
    max_samples=100,
    bootstrap=True,
    oob_score=True,
    n_jobs=-1
)
bag_clf.fit(X_train, y_train)

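If you want to use pasting instead, set `bootstrap=False`. With bagging, the training instances that a given predictor never sees are called **out-of-bag** instances. Setting `oob_score=True` uses them to estimate accuracy without a separate validation set; this option is only available when `bootstrap=True`.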

In [17]:
y_pred = bag_clf.predict(X_test)
print('Test Accuracy', accuracy_score(y_test, y_pred))
print('Out-of-bag score:', bag_clf.oob_score_)

Test Accuracy 0.9210526315789473
Out-of-bag score: 0.9464285714285714
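
The out-of-bag score can also be reproduced by hand from the per-instance out-of-bag probabilities. The sketch below assumes scikit-learn's bundled iris data and a separate classifier named `oob_clf`, both introduced only for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled iris data stands in for ./data/iris.csv.
X_demo, y_demo = load_iris(return_X_y=True)

oob_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=200,
    bootstrap=True,   # out-of-bag evaluation requires bootstrap sampling
    oob_score=True,
    random_state=0,
)
oob_clf.fit(X_demo, y_demo)

# For each training instance: class probabilities averaged over only those
# predictors that did NOT see the instance during training.
oob_proba = oob_clf.oob_decision_function_
oob_pred = oob_clf.classes_[oob_proba.argmax(axis=1)]
manual_oob_accuracy = accuracy_score(y_demo, oob_pred)

print("oob_score_:          ", oob_clf.oob_score_)
print("recomputed by hand:  ", manual_oob_accuracy)
```

Because every prediction comes from trees that never trained on the instance, the out-of-bag score behaves like a built-in validation estimate.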


### Random Patches, Random Subspaces, Random Forests

**Random patches:** This technique involves selecting random subsets of both the training instances and the input features for each base model. In other words, each base model is trained on a randomly selected subset of the training data and a randomly selected subset of the input features. **This can help to reduce overfitting** and improve the performance of the ensemble model.

**Random subspaces:** This technique involves selecting random subsets of the input features for each base model, but using the entire training dataset for each model. In other words, each base model is trained on the full training dataset, but with only a randomly selected subset of the input features. **This can help to reduce the impact of irrelevant or noisy features** and improve the performance of the ensemble model.
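
In scikit-learn, both schemes can be expressed through `BaggingClassifier` by controlling instance sampling (`max_samples`, `bootstrap`) and feature sampling (`max_features`, `bootstrap_features`). The following sketch assumes the bundled iris data; the specific fractions are arbitrary illustration values.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled iris data stands in for ./data/iris.csv.
X_demo, y_demo = load_iris(return_X_y=True)

# Random patches: sample both training instances and input features.
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.5, bootstrap=True,
    max_features=0.5, bootstrap_features=True,
    random_state=0,
)

# Random subspaces: keep every training instance, sample only the features.
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=1.0, bootstrap=False,
    max_features=0.5, bootstrap_features=False,
    random_state=0,
)

patches_clf.fit(X_demo, y_demo)
subspaces_clf.fit(X_demo, y_demo)

# Each base model records which feature columns it was trained on.
for features in subspaces_clf.estimators_features_[:3]:
    print("feature subset:", sorted(features))
```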

**Random forests:** A Random Forest is an ensemble of Decision Trees, generally trained via the bagging method. The Random Forest algorithm introduces extra randomness when growing trees; instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features. The algorithm results in greater tree diversity, which trades a higher bias for a lower variance, generally yielding an overall better model.

In [25]:
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)
accuracy_score(y_test, y_pred_rf)

0.9210526315789473

The feature importance score in a Random Forest model is based on the decrease in node impurity in the decision trees. The impurity of a node is typically measured using either the Gini impurity or entropy. When a tree in the forest is grown, each split considers only a random subset of the input features. For each tree, a feature's importance is the sum of the impurity decreases over all nodes that split on that feature, with each node weighted by the fraction of training samples that reach it. These per-tree scores are then averaged over all the trees in the forest, and scikit-learn scales the result so that the importances sum to 1.

In [26]:
for name, score in zip(X.columns, rnd_clf.feature_importances_):
    print(name, score)

sepal.length 0.1111024464441931
sepal.width 0.025268048224199237
petal.length 0.41767699041763284
petal.width 0.44595251491397486
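
The averaging step can be checked directly. The sketch below assumes scikit-learn's bundled iris data and a separate forest named `forest`; it recomputes the forest-level importances from the individual trees and compares them with `feature_importances_`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Assumption: the bundled iris data stands in for ./data/iris.csv.
data = load_iris()
forest = RandomForestClassifier(
    n_estimators=100, max_leaf_nodes=16, random_state=0
)
forest.fit(data.data, data.target)

# One row of importances per tree, one column per feature.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Average over the trees, then rescale so the scores sum to 1.
averaged = per_tree.mean(axis=0)
averaged = averaged / averaged.sum()

for name, score in zip(data.feature_names, averaged):
    print(name, round(float(score), 4))
```

The spread of each column of `per_tree` also shows how much individual trees disagree about a feature, which the single averaged number hides.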


### AdaBoost
The basic idea behind AdaBoost is to iteratively train a sequence of weak learners on modified versions of the training data, with each iteration giving more weight to the misclassified samples from the previous iteration. In other words, AdaBoost adapts the weights of the training samples based on their classification error, such that the next weak learner focuses more on the misclassified samples in the previous iteration. The final classifier is then a weighted sum of the individual weak classifiers, with the weights determined by their accuracy.

`Scikit-Learn` uses a multiclass version of AdaBoost called `SAMME` (which stands for Stagewise Additive Modeling using a Multiclass Exponential loss function). When there are just two classes, SAMME is equivalent to AdaBoost. If the predictors can estimate class probabilities, `Scikit-Learn` can use a variant of SAMME called `SAMME.R`, which relies on class probabilities rather than predictions and generally performs better. Note that recent scikit-learn releases deprecate `SAMME.R`, so the `algorithm="SAMME.R"` argument below may raise a warning or an error depending on the installed version.
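
One round of the reweighting step can be worked through by hand. The sketch below follows the discrete SAMME update with a hypothetical weak learner whose predictions (`y_stump`) are made up for illustration; no model is trained.

```python
import numpy as np

y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0])
y_stump = np.array([1, 1, 0, 0, 0, 0, 1, 1])  # hypothetical weak-learner output
n_classes = 2
learning_rate = 1.0

# Start with uniform sample weights.
weights = np.full(y_true.size, 1 / y_true.size)
missed = y_stump != y_true

# Weighted error rate of the weak learner (2 of 8 samples are wrong: 0.25).
err = weights[missed].sum() / weights.sum()

# Weight of this learner in the final vote; the log(n_classes - 1) term is the
# SAMME multiclass correction and vanishes for two classes.
alpha = learning_rate * (np.log((1 - err) / err) + np.log(n_classes - 1))

# Boost the weights of the misclassified samples, then renormalize.
new_weights = weights * np.exp(alpha * missed)
new_weights /= new_weights.sum()

print("learner weight:", alpha)
print("updated sample weights:", new_weights)
```

After the update, the two misclassified samples together carry half of the total weight, so the next weak learner cannot do well without getting them right.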

In [18]:
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200, algorithm="SAMME.R", learning_rate=0.5)
ada_clf.fit(X_train, y_train)

y_pred_ad = ada_clf.predict(X_test)
accuracy_score(y_test, y_pred_ad)

0.868421052631579

### Gradient Boosting
Just like AdaBoost, Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However, instead of tweaking the instance weights at every iteration like AdaBoost does, this method tries to fit the new predictor to the residual errors made by the previous predictor.
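
The residual-fitting idea is easiest to see on a small regression problem. The sketch below is an illustration with squared error and a learning rate of 1, using synthetic data (`X_reg`, `y_reg`) generated only for this example; it is not how `GradientBoostingClassifier` is implemented internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy quadratic data, generated only for this illustration.
rng = np.random.default_rng(42)
X_reg = rng.uniform(-0.5, 0.5, size=(200, 1))
y_reg = 3 * X_reg[:, 0] ** 2 + 0.05 * rng.normal(size=200)

trees = []
train_mse = []
residual = y_reg.copy()
for _ in range(3):
    # Each new tree is trained on what the ensemble so far still gets wrong.
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_reg, residual)
    residual = residual - tree.predict(X_reg)
    trees.append(tree)
    train_mse.append(float(np.mean(residual ** 2)))


def boosted_predict(X_new):
    """Ensemble prediction: the sum of the predictions of all trees."""
    return sum(tree.predict(X_new) for tree in trees)


print("training MSE after each tree:", train_mse)
print("prediction at x=0.4:", boosted_predict(np.array([[0.4]])))
```

Each added tree lowers the training error. In practice a learning rate below 1 shrinks each tree's contribution, which usually requires more trees but generalizes better.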

In [19]:
gbrt = GradientBoostingClassifier(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X_train, y_train)

y_pred_gbrt = gbrt.predict(X_test)
accuracy_score(y_test, y_pred_gbrt)

### XGBoost
XGBoost (short for Extreme Gradient Boosting) is a popular and highly scalable gradient boosting algorithm that is widely used in machine learning competitions and real-world applications. The basic idea behind XGBoost is to iteratively add new decision trees to the ensemble while correcting the mistakes of the previous trees. The algorithm also includes several advanced features that can help to improve the accuracy and efficiency of the model, including:

- **Regularization:** XGBoost includes L1 and L2 regularization terms to prevent overfitting and improve the generalization performance of the model.
- **Tree pruning:** XGBoost grows each tree up to a maximum depth and then prunes back splits whose gain does not exceed a minimum threshold (the `gamma` parameter), which limits tree complexity.
- **Cross-validation:** The library ships a built-in cross-validation routine (`xgboost.cv`) that can be used to select the number of boosting rounds and other hyperparameters.
- **Second-order optimization:** XGBoost fits each new tree using both the gradient and the second derivative (Hessian) of the loss function, which lets it compute leaf weights and split gains in closed form.
- **Handling missing values:** XGBoost includes a built-in mechanism for handling missing values in the input data, which can improve the accuracy and robustness of the model.

XGBoost performs well on a wide range of machine learning tasks, including classification, regression, and ranking, and is especially strong on tabular data. It has been shown to outperform many other algorithms on benchmark datasets and is often used as a baseline in machine learning competitions. XGBoost is also highly scalable and can handle large datasets with millions of examples and thousands of features.
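
The role of gradients, Hessians, and L2 regularization can be illustrated with the closed-form leaf weight and split gain from the XGBoost objective. This is a NumPy sketch of those formulas on a toy binary problem, not a call into the `xgboost` library; the helper names `leaf_weight` and `split_gain` and the parameters `reg_lambda` and `gamma` are defined here for illustration.

```python
import numpy as np


def leaf_weight(grad, hess, reg_lambda=1.0):
    """Optimal leaf value: -G / (H + lambda)."""
    return -grad.sum() / (hess.sum() + reg_lambda)


def split_gain(grad, hess, left_mask, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from splitting a node into left and right children."""
    def score(g, h):
        return g.sum() ** 2 / (h.sum() + reg_lambda)

    left = score(grad[left_mask], hess[left_mask])
    right = score(grad[~left_mask], hess[~left_mask])
    parent = score(grad, hess)
    return 0.5 * (left + right - parent) - gamma


# Toy binary labels; the initial raw score is 0, so every probability is 0.5.
y_bin = np.array([0, 0, 0, 1, 1, 1])
p = np.full(y_bin.size, 0.5)

# First and second derivatives of the logistic loss w.r.t. the raw score.
grad = p - y_bin
hess = p * (1 - p)

# Candidate split that separates the two classes perfectly.
left_mask = np.array([True, True, True, False, False, False])

gain = split_gain(grad, hess, left_mask)
w_left = leaf_weight(grad[left_mask], hess[left_mask])
w_right = leaf_weight(grad[~left_mask], hess[~left_mask])

print("gain:", gain)
print("left leaf weight:", w_left)
print("right leaf weight:", w_right)
```

Raising `reg_lambda` shrinks both the leaf weights and the gain, and a positive `gamma` makes a split worthwhile only if its gain exceeds that threshold, which is how regularization and pruning enter the tree-building step.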

In [51]:
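scaler = StandardScaler()
enc = LabelEncoder()

# Fit the transformers on the training split only to avoid leaking test data.
scaler.fit(X_train)
enc.fit(y_train.values)

# Scaling is optional for tree-based models; recent XGBoost versions expect integer-encoded class labels.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
y_train_enc = enc.transform(y_train.values)
y_test_enc = enc.transform(y_test.values)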

In [52]:
xgb_clf = xgboost.XGBClassifier()
xgb_clf.fit(X_train_scaled, y_train_enc)

In [53]:
y_pred = xgb_clf.predict(X_test_scaled)
accuracy_score(y_test_enc, y_pred)

0.9210526315789473