# **Ensemble Modeling**

Ensemble modeling is a process where multiple diverse base models are used to predict an outcome. The motivation for using ensemble models is to reduce the generalization error of the prediction. As long as the base models are diverse and independent, the prediction error decreases when the ensemble approach is used. The approach seeks the wisdom of crowds in making a prediction. Even though the ensemble model has multiple base models within the model, it acts and performs as a single model. Most of the practical data science applications utilize ensemble modeling techniques.

> Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone…. Wikipedia

The fundamental principle of the ensemble model is that a group of weak learners come together to form a strong learner, which increases the accuracy of the model. When we try to predict the target variable with any machine learning technique, the main causes of the difference between the actual and predicted values are noise, variance and bias. An ensemble reduces variance and bias; noise is an irreducible error and cannot be removed.
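To make the variance claim concrete, here is a small simulation. It is a sketch under strong assumptions, not a model of any real learner: each "model" is simulated as the true value plus independent unit-variance noise, and the names `n_models`, `n_trials` and `true_value` are made up for the illustration. Under these assumptions, averaging 25 models should shrink the variance of the prediction by roughly a factor of 25.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

true_value = 10.0   # hypothetical quantity every model tries to predict
n_models = 25       # size of the ensemble (assumption)
n_trials = 10_000   # number of repeated experiments

# Each column is one simulated model: truth + independent noise (variance 1).
preds = true_value + rng.normal(size=(n_trials, n_models))

single_var = preds[:, 0].var()            # variance of one model's prediction
ensemble_var = preds.mean(axis=1).var()   # variance of the averaged prediction

print("single model variance:", round(single_var, 3))
print("ensemble variance:    ", round(ensemble_var, 3))
```

If the simulated models were correlated rather than independent, the reduction would be smaller, which is why diversity among base models matters.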


## **Simple Ensemble Techniques**

In this section, we will look at a few simple but powerful techniques, namely:

 + Max Voting
 + Averaging
 + Weighted Averaging

### **Max Voting**

The max voting method is generally used for classification problems. In this technique, multiple models are used to make predictions for each data point. The predictions by each model are considered as a ‘vote’. The predictions which we get from the majority of the models are used as the final prediction.
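As a minimal sketch of the idea, the snippet below takes hard class labels from three models and keeps the most common label per data point. The predictions are made-up values and `max_vote` is a hypothetical helper written for this illustration.

```python
from collections import Counter

# Hypothetical class labels predicted by three models for five data points.
pred_model_a = [1, 0, 1, 1, 0]
pred_model_b = [1, 1, 0, 1, 0]
pred_model_c = [0, 0, 1, 1, 1]


def max_vote(*predictions):
    """Return, for each data point, the label predicted by the most models."""
    final = []
    for votes in zip(*predictions):
        # most_common(1) returns [(label, count)] for the most frequent label.
        label, _count = Counter(votes).most_common(1)[0]
        final.append(label)
    return final


final_pred = max_vote(pred_model_a, pred_model_b, pred_model_c)
print(final_pred)
```

With an odd number of models and two classes there can be no ties. In other settings a tie-breaking rule is needed; this sketch simply keeps whichever tied label appears first.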



### **Averaging**

Similar to the max voting technique, multiple predictions are made for each data point in averaging. In this method, we take an average of predictions from all the models and use it to make the final prediction. Averaging can be used for making predictions in regression problems or while calculating probabilities for classification problems.
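A short sketch of averaging predicted probabilities is shown below. The probabilities are made-up numbers, and the 0.5 threshold used to turn the averaged probability into a class label is an assumption rather than part of the method.

```python
import numpy as np

# Hypothetical predicted probabilities of the positive class.
# Rows are models, columns are data points.
probs = np.array([
    [0.9, 0.4, 0.2],
    [0.7, 0.6, 0.1],
    [0.8, 0.2, 0.3],
])

avg_prob = probs.mean(axis=0)                 # simple average across models
final_class = (avg_prob >= 0.5).astype(int)   # assumed 0.5 cut-off

print(avg_prob)
print(final_class)
```

For a regression problem the same `mean(axis=0)` call would be applied directly to the predicted values, with no threshold step.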

### **Weighted Averaging**

This is an extension of the averaging method. Each model is assigned a weight that reflects its importance for the prediction. For instance, if two of your colleagues are critics while the others have no prior experience in the field, the answers of those two colleagues are given more importance than the answers of the others.
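The sketch below extends plain averaging with per-model weights. Both the probabilities and the weights are made-up values; in practice the weights might come from validation performance, but that choice is an assumption and not something prescribed here.

```python
import numpy as np

# Hypothetical predicted probabilities (rows = models, columns = data points).
probs = np.array([
    [0.9, 0.4, 0.2],
    [0.7, 0.6, 0.1],
    [0.8, 0.2, 0.3],
])

# Hypothetical importance of each model; the weights sum to 1.
weights = np.array([0.5, 0.3, 0.2])

# np.average applies one weight per row when axis=0.
weighted_prob = np.average(probs, axis=0, weights=weights)
print(weighted_prob)
```

Setting all weights equal reduces this to the simple averaging method.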

## **Advanced Ensemble techniques**

Below are some advanced ensemble techniques:

 + Bagging
 + Boosting
 + Stacking

### **Bagging**

Bagging, short for bootstrap aggregating, is an ensemble method. First, we create random samples of the training data set with replacement (subsets of the training data set). Then, we build a model (a classifier or decision tree) for each sample. Finally, the results of these models are combined by averaging or majority voting.

> Because each model is exposed to a different subset of the data and we use their collective output at the end, we guard against overfitting by not clinging too closely to our training data set. Thus, bagging helps us reduce the variance error.


*Combining multiple models decreases variance, especially in the case of unstable models, and may produce a more reliable prediction than a single model.*

Bagging is a simple and very powerful ensemble method. It is a general procedure that can be used to reduce a model’s variance. High variance is a sign that a model is overfitting. Certain algorithms, such as decision trees, usually suffer from high variance. In other words, decision trees are extremely sensitive to the data on which they have been trained. If the underlying data is changed even a little, the resulting decision tree can be very different, and as a result the model’s predictions can change drastically. Bagging offers a solution to the problem of high variance. It can systematically reduce overfitting by taking an average of several decision trees. Bagging uses bootstrap sampling and then aggregates the individual models by averaging to get the final predictions. **Bootstrap sampling simply means sampling rows at random from the training dataset with replacement.**
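Bootstrap sampling itself takes only a couple of lines. The sketch below draws row indices with replacement from a toy data set of ten rows; the row count and seed are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

n_rows = 10
row_ids = np.arange(n_rows)     # stand-in for the rows of a training set

# Draw n_rows indices with replacement: some rows repeat, some never appear.
boot_idx = rng.choice(row_ids, size=n_rows, replace=True)

# Rows that were never drawn (often called "out-of-bag" rows).
oob_idx = np.setdiff1d(row_ids, boot_idx)

print("bootstrap sample:", boot_idx)
print("left out:        ", oob_idx)
```

Indexing a feature matrix and label vector with `boot_idx` gives one bootstrapped training set; repeating the draw gives the different but similar data sets that bagging relies on.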

![](https://miro.medium.com/max/850/0*lfH2wc6V2osrRBXk.png)

With bagging, it is therefore possible that you draw a single training example more than once. This results in a modified version of the training set where some rows are represented multiple times and some are absent. This also lets you create new data, which is similar to the data you started with. By doing this, you can fit many different but similar models.
Specifically, the way bagging works is as follows:

Step 1: Draw a bootstrap sample of B rows with replacement from the original data set, where B is less than or equal to n, the total number of rows in the training set.

![](https://miro.medium.com/max/1012/1*7NAo4D12sROHXDt1IovWfw.png)


Step 2: Train a decision tree on the newly created bootstrapped sample. Repeat Step 1 and Step 2 as many times as you like. Generally, the higher the number of trees, the better the model, although the gains flatten out after a point while training and prediction time keep growing. Note that adding more trees to a bagged ensemble does not by itself make the model overfit, because the trees are averaged together; the main cost of using too many trees is computation.

![](https://miro.medium.com/max/1086/1*l16JAxJR5MJea12jut-FLQ.png)

To generate a prediction using the bagged trees approach, you generate a prediction from each of the decision trees and then combine them into a final prediction. Your bagged trees model works much like a council. Usually, when a council needs to make a decision, it takes a majority vote. The option that gets more votes (say, option A got 100 votes and option B got 90 votes) is the council’s final decision. Similarly, in bagging, when you are solving a classification problem, you take a majority vote of all your decision trees. For regression, you take the average of all the decision tree predictions. The collective knowledge of a diverse set of decision trees typically beats the knowledge of any individual tree. Bagged trees therefore usually offer better predictive performance.
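Putting Step 1, Step 2 and the majority vote together, a hand-rolled version of bagged trees might look like the sketch below. It uses synthetic data from `make_classification` as a stand-in for a real training set, and the number of trees (25) is an arbitrary odd number chosen so that votes cannot tie. It is meant to show the mechanics, not to replace scikit-learn's `BaggingClassifier`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data stands in for a real training set (assumption).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # arbitrary, odd so that a 0/1 majority vote cannot tie
trees = []
for _ in range(n_trees):
    # Step 1: bootstrap sample of the training rows (with replacement).
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Step 2: fit one decision tree on that sample.
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx])
    trees.append(tree)

# One row of predicted labels per tree: shape (n_trees, n_test_rows).
all_votes = np.stack([t.predict(X_te) for t in trees])

# Majority vote for 0/1 labels: positive if more than half the trees say 1.
bagged_pred = (all_votes.mean(axis=0) > 0.5).astype(int)

print("bagged accuracy:", (bagged_pred == y_te).mean())
```

For a regression problem the last step would be `all_votes.mean(axis=0)` on the predicted values, with regression trees as base models.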

**Algorithms :**

 1. Extra Trees
 2. Random Forest
 3. Bagging Classifier / Bagging Regressor

**Random Forests**

Random forest differs from vanilla bagging in just one way. It uses a modified tree learning algorithm that inspects, at each split in the learning process, a random subset of the features. We do so to reduce the correlation between the trees. Suppose that we have one very strong predictor in the data set along with a number of other moderately strong predictors. Then, in a collection of bagged trees, most or all of the decision trees will use the very strong predictor for the first split, so all the bagged trees will look similar and their predictions will be highly correlated. Averaging highly correlated predictions does little to improve accuracy. By taking a random subset of features, Random Forests systematically reduces this correlation and improves the model’s performance. The example below illustrates how the Random Forest algorithm works.

![](https://miro.medium.com/max/884/1*5vlUF8FRR6flPPWK4wt-Kw.png)

Let’s look at a classification problem. As the image above shows, our training data has four features: Feature 1, Feature 2, Feature 3 and Feature 4. In this simplified illustration, each bootstrapped sample is used to train a tree on a particular subset of features. For example, Decision Tree 1 (DT1) is trained on features 1 and 4, DT2 on features 2 and 4, and DT3 on features 3 and 4. (In the algorithm as described above, the random subset is redrawn at every split rather than fixed once per tree.) We therefore have three different models, each trained on a different subset of features. We then feed new test data into each of these models and get a prediction from each. The prediction that gets the most votes is the final decision of the random forest. For example, if DT1 and DT3 predict the positive class for a particular test instance while DT2 predicts the negative class, the positive class has the majority of votes (2), so the random forest classifies the instance as positive. The key point is that the Random Forest algorithm uses random subsets of features to train several models, so each model sees only part of the information in the dataset.
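The three-tree example can be imitated in a few lines. The sketch below is an approximation under stated assumptions: the data is synthetic with four features, each tree is fitted on a bootstrap sample, and scikit-learn's `max_features=2` makes a tree consider two randomly chosen features at each split (rather than a fixed pair per tree, as in the picture).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with four features, mirroring the illustration (assumption).
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

rng = np.random.default_rng(0)
forest = []
for seed in range(3):  # three trees, as in the example above
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap the rows
    # Consider 2 randomly chosen features at every split.
    tree = DecisionTreeClassifier(max_features=2, random_state=seed)
    forest.append(tree.fit(X[idx], y[idx]))

new_rows = X[:5]  # pretend these are unseen instances
votes = np.stack([t.predict(new_rows) for t in forest])  # shape (3, 5)

# An instance is classified as positive when at least 2 of the 3 trees say so.
forest_pred = (votes.sum(axis=0) >= 2).astype(int)

print(votes)
print(forest_pred)
```

A production model would normally use `RandomForestClassifier`, which handles the bootstrapping, feature sampling and voting internally.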

Random forest is one of the most widely used ensemble learning algorithms. Why is it so effective? The reason is that by using multiple samples of the original dataset, we reduce the variance of the final model. Remember that low variance means less overfitting. Overfitting happens when our model tries to explain small variations in the dataset, because our dataset is just a small sample of the population of all possible examples of the phenomenon we are trying to model. If we were unlucky with how our training set was sampled, it could contain some undesirable (but unavoidable) artifacts: noise, outliers and over- or underrepresented examples. By creating multiple random samples with replacement from our training set, we reduce the effect of these artifacts.



### **Boosting**

Boosting methods work in the same spirit as bagging methods: we build a family of models that are aggregated to obtain a strong learner that performs better. However, unlike bagging, which mainly aims at reducing variance, boosting fits multiple weak learners sequentially in a very adaptive way: each model in the sequence is fitted giving more importance to the observations that were badly handled by the previous models. Intuitively, each new model focuses its efforts on the observations that have been hardest to fit so far, so that at the end of the process we obtain a strong learner with lower bias (although boosting can also reduce variance). Boosting, like bagging, can be used for regression as well as for classification problems.
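To illustrate what "giving more importance to badly handled observations" can mean, here is one round of the classic discrete AdaBoost weight update on made-up labels. The labels, the weak learner's predictions and the starting weights are all hypothetical; the update rule is the standard AdaBoost formula for labels coded as -1 and +1.

```python
import numpy as np

# Hypothetical true labels and one weak learner's predictions (one mistake).
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])

# Start with equal weight on every observation.
w = np.full(len(y_true), 1 / len(y_true))

miss = y_pred != y_true                    # which observations were misclassified
err = w[miss].sum()                        # weighted error rate of the learner
alpha = 0.5 * np.log((1 - err) / err)      # the learner's say in the final vote

# Increase the weight of mistakes, decrease the weight of correct predictions.
w = w * np.exp(np.where(miss, alpha, -alpha))
w = w / w.sum()                            # renormalise so the weights sum to 1

print("error:", err, "alpha:", alpha)
print("new weights:", w)
```

After the update the single misclassified observation carries half of the total weight, so the next weak learner is pushed to get it right.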

Because boosting mainly focuses on reducing bias, the base models usually chosen for it have low variance but high bias. For example, if we want to use trees as our base models, we will most of the time choose shallow decision trees with a depth of only a few levels. Another important reason for using low-variance, high-bias models as weak learners is that they are in general less computationally expensive to fit (few degrees of freedom when parametrised). Indeed, because the different models cannot be fitted in parallel (unlike in bagging), fitting several complex models sequentially could become too expensive.

Once the weak learners have been chosen, we still need to define how they will be fitted sequentially (what information from previous models do we take into account when fitting the current model?) and how they will be aggregated (how do we combine the current model with the previous ones?). Two important boosting algorithms answer these questions in different ways: AdaBoost and gradient boosting.

In a nutshell, these two meta-algorithms differ in how they create and aggregate the weak learners during the sequential process. Adaptive boosting (AdaBoost) updates the weights attached to each observation in the training dataset, whereas gradient boosting updates the target values of these observations, so that each new learner is fitted to the errors left by the current ensemble. This main difference comes from the way the two methods try to solve the optimisation problem of finding the best model that can be written as a weighted sum of weak learners.
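The gradient boosting side of this comparison can be sketched for regression with squared-error loss, where the negative gradient is simply the residual. Everything specific below is an assumption made for the illustration: the noisy sine data, the learning rate of 0.1, the 50 rounds and the depth-2 trees. Real libraries add many refinements on top of this loop.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.linspace(0, 6, 200).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)  # toy regression target

learning_rate = 0.1   # shrinks each learner's contribution (assumed value)
n_rounds = 50         # number of weak learners (assumed value)

pred = np.full_like(y, y.mean())          # start from a constant prediction
mse_history = [np.mean((y - pred) ** 2)]
learners = []
for _ in range(n_rounds):
    residuals = y - pred                  # what the ensemble still gets wrong
    # Fit a shallow tree to the residuals instead of the original targets.
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred = pred + learning_rate * tree.predict(X)   # add it to the ensemble
    learners.append(tree)
    mse_history.append(np.mean((y - pred) ** 2))

print("training MSE before boosting:", mse_history[0])
print("training MSE after boosting: ", mse_history[-1])
```

The final model is the initial constant plus the weighted sum of all the shallow trees, which is the "weighted sum of weak learners" form mentioned above.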

![](https://miro.medium.com/max/1400/0*yDz8euzLbQvucBwx.png)

**Similarities and differences between Bagging and Boosting**

**Similarities**

+ Both use voting
+ Both combine models of the same type

**Differences**

+ Individual models are built separately in bagging, whereas in boosting each new model is influenced by the previous weak learners.
+ Each model is given equal weight in bagging, whereas boosting weights each model’s contribution by its performance.

**Algorithms :**

1. AdaBoost
2. Gradient boosting (GBM)
3. Extreme gradient boosting (XGBoost)
4. CatBoost
5. Light gradient boosting (LightGBM)





### **Stacking**

Stacking is an ensemble learning technique that uses predictions from multiple models (for example a decision tree, KNN or SVM) to build a new model. This model is used for making predictions on the test set. Below is a step-wise explanation for a simple stacked ensemble:

 1. A model (for example, logistic regression) is fitted and tuned on the given dataset using standard ML approaches (cross-validation, feature selection, hyper-parameter tuning). After successful tuning, predictions are generated on the training data using this final model. These predictions are used as a feature in the following steps.
 2. The above process is repeated for another model (for example, a decision tree). We can select some of the best-performing models.
 3. Another ML model is then trained on the predicted values of the above models.
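The three steps can be sketched by hand as follows. One deliberate change from the description above, flagged here as an assumption: instead of predicting on the same rows a base model was fitted on, the sketch uses out-of-fold predictions from `cross_val_predict`, a common way to keep the second-level model from learning on leaked labels. The synthetic data and the particular base models are illustrative choices only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data stands in for a real dataset (assumption).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

base_models = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(max_depth=3, random_state=0),
    KNeighborsClassifier(),
]

# Steps 1 and 2: each base model contributes one feature, its out-of-fold
# predicted probability of the positive class on the training rows.
meta_train = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Refit each base model on all training rows to build features for new data.
meta_test = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base_models
])

# Step 3: a second-level model learns how to combine the base predictions.
meta_model = LogisticRegression().fit(meta_train, y_tr)
stacked_pred = meta_model.predict(meta_test)

print("stacked accuracy:", (stacked_pred == y_te).mean())
```

scikit-learn's `StackingClassifier` packages the same idea, including the cross-validated base predictions, behind a single estimator.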

## **Advantages/Benefits of ensemble methods**

Ensemble methods are used in almost all the ML hackathons to enhance the prediction abilities of the models. Let’s take a look at the advantages of using ensemble methods:

1. **More accurate prediction results:** We can compare the working of ensemble methods to the diversification of a financial portfolio. It is advised to keep a mixed portfolio across debt and equity to reduce variability and hence minimize risk. Similarly, an ensemble of models will in most cases perform better on unseen data than the individual models.

2. **Stable and more robust model:** The aggregate result of multiple models is typically less noisy than the result of any individual model. This leads to model stability and robustness.

3. Ensemble models can be used to capture linear as well as non-linear relationships in the data. This can be accomplished by using two different models and forming an ensemble of the two.

**Disadvantages of ensemble methods**

1. **Reduction in model interpretability:** Using ensemble methods reduces model interpretability due to increased complexity, which makes it very difficult to draw crucial business insights at the end.
2. **Computation and design time is high:** This makes ensembles a poor fit for many real-time applications.
3. The selection of models for creating an ensemble is an art which is really hard to master.


## **Example: Titanic Survival prediction**

### **Max Voting**

In [None]:
# Setting the path
import os
os.chdir("/content/drive/My Drive/Introduction to Data Science - Python edition/dataset/titanic")

In [None]:
# Loading libraries
import numpy as np
import pandas as pd
import sklearn.model_selection as ms
import sklearn.metrics as sklm
import numpy.random as nr
import matplotlib.pyplot as plt
import math

%matplotlib inline

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

In [None]:
# load npy file
X_train = np.load('X_train.npy')
X_validation = np.load('X_validation.npy')
X_test = np.load('X_test.npy')

y_train = np.load('y_train.npy')
y_validation = np.load('y_validation.npy')
y_test = np.load('y_test.npy')

In [None]:
# Max voting: fit three different classifiers and score each one on its own
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics as sklm

model1 = LogisticRegression()
model2 = KNeighborsClassifier()
model3 = DecisionTreeClassifier()

# Fit every base model on the same training data
model1.fit(X_train, y_train)
model2.fit(X_train, y_train)
model3.fit(X_train, y_train)

# Predict class labels for the validation set
pred1 = model1.predict(X_validation)
pred2 = model2.predict(X_validation)
pred3 = model3.predict(X_validation)

# F1 score of each individual model (the baseline for the ensemble)
f1_score_1 = sklm.f1_score(y_validation, pred1)
f1_score_2 = sklm.f1_score(y_validation, pred2)
f1_score_3 = sklm.f1_score(y_validation, pred3)

print(f1_score_1)
print(f1_score_2)
print(f1_score_3)

0.7961165048543688
0.7474747474747475
0.7184466019417477


In [None]:
import statistics

# Majority vote per validation row: the most common label among the three models
final_pred = np.array([
    statistics.mode([p1, p2, p3]) for p1, p2, p3 in zip(pred1, pred2, pred3)
])

f1_score_4 = sklm.f1_score(y_validation, final_pred)
f1_score_4

0.8118811881188118

In [None]:
from sklearn.ensemble import VotingClassifier

model = VotingClassifier(estimators=[('lr', model1), ('KNN', model2), ('dt', model3)], voting='hard')
model.fit(X_train,y_train)
final_pred = model.predict(X_validation)
f1_score_4 = sklm.f1_score(y_validation, final_pred)
f1_score_4

0.8118811881188118

### **Bagging**

#### **Bagging classifier**

In [None]:
from sklearn.ensemble import BaggingClassifier

BC = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100)
BC.fit(X_train, y_train)
y_pred_BC = BC.predict(X_validation)

f1_score_BC = sklm.f1_score(y_validation, y_pred_BC)
f1_score_BC

0.7722772277227722

#### **Extra Trees classifier**

In [None]:
from sklearn.ensemble import ExtraTreesClassifier

ET = ExtraTreesClassifier(n_estimators = 100)
ET.fit(X_train, y_train)
y_pred_ET = ET.predict(X_validation)

f1_score_ET = sklm.f1_score(y_validation, y_pred_ET)
f1_score_ET

0.7524752475247524

#### **Random forest**

In [None]:
from sklearn.ensemble import RandomForestClassifier

RF = RandomForestClassifier(random_state = 123, n_estimators = 500)
RF.fit(X_train, y_train)
y_pred_RF = RF.predict(X_validation)

f1_score_RF = sklm.f1_score(y_validation, y_pred_RF)
f1_score_RF

0.7843137254901961

### **Boosting**

#### **Adaboost**

In [None]:
from sklearn.ensemble import AdaBoostClassifier

AB = AdaBoostClassifier(random_state = 123, n_estimators = 500)
AB.fit(X_train, y_train)
y_pred_AB = AB.predict(X_validation)

f1_score_AB = sklm.f1_score(y_validation, y_pred_AB)
f1_score_AB

0.8269230769230769

#### **Gradient boosting**

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

GB = GradientBoostingClassifier(random_state = 123, n_estimators = 100)
GB.fit(X_train, y_train)
y_pred_GB = GB.predict(X_validation)

f1_score_GB = sklm.f1_score(y_validation, y_pred_GB)
f1_score_GB

0.836734693877551

#### **Extreme Gradient boosting**

In [None]:
from xgboost import XGBClassifier

XGB = XGBClassifier(random_state = 123, n_estimators = 500)
print(XGB)
XGB.fit(X_train, y_train)
y_pred_XGB = XGB.predict(X_validation)

f1_score_XGB = sklm.f1_score(y_validation, y_pred_XGB)
print(sklm.confusion_matrix(y_validation, y_pred_XGB))
print(sklm.accuracy_score(y_validation, y_pred_XGB))
print(sklm.classification_report(y_validation, y_pred_XGB))
print(f1_score_XGB)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=500, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=123,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)
[[74  8]
 [ 9 43]]
0.8731343283582089
              precision    recall  f1-score   support

           0       0.89      0.90      0.90        82
           1       0.84      0.83      0.83        52

    accuracy                           0.87       134
   macro avg       0.87      0.86      0.87       134
weighted avg       0.87      0.87      0.87       134

0.8349514563106797


### **Stacking**

In [None]:
from sklearn.ensemble import StackingClassifier

estimators = [
              ('XGB', XGBClassifier(random_state = 123, n_estimators = 500)),
              ('GB', GradientBoostingClassifier(random_state = 123, n_estimators = 100)),
              ('AB', AdaBoostClassifier(random_state = 123, n_estimators = 500))]

SC = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
SC.fit(X_train, y_train)
y_pred_SC = SC.predict(X_validation)

f1_score_SC = sklm.f1_score(y_validation, y_pred_SC)
f1_score_SC

0.8282828282828283