# 1. Ensemble Learning: Joining Forces
*(partially retrieved from https://towardsdatascience.com/ensemble-learning-bagging-and-boosting-23f9336d3cb0 and https://coding-blocks.github.io/DS-NOTES/7.%20Ensemble.html)*

Ensemble learning is a machine learning technique combining multiple individual models to create a stronger, more accurate predictive model. By leveraging the diverse strengths of different models, ensemble learning aims to mitigate errors, enhance performance, and increase the overall robustness of predictions, leading to improved results across various tasks in machine learning and data analysis.

There are **two** types of learners in Ensemble Learning:

- **Weak Learners** (single models): Individual models are known as weak learners. We call them weak learners because they either have a high bias or high variance.
- **Strong Learners** (ensemble models): Strong learners are the combination of various different weak learners that allow for a more accurate final model.

### Wisdom Of The Crowd

Suppose you pose a complex question to thousands of random people, then aggregate their answers. In many cases you will find that this aggregated answer is better than an expert's answer. This is called the *Wisdom of the Crowd*. Similarly, if you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will often get better predictions than with the best individual predictor. A group of predictors is called an *ensemble*; thus, this technique is called *Ensemble Learning*.  
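To make the crowd effect concrete, here is a minimal sketch that computes how often a majority vote is right. It assumes every voter is correct with the same probability (60% here, a made-up figure) and that voters err independently of each other, which real models rarely do; `majority_accuracy` is a helper name introduced only for this illustration.

```python
from math import comb

def majority_accuracy(n_voters: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent voters is right."""
    needed = n_voters // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )

# Accuracy of the majority for growing (odd) crowd sizes.
for n in (1, 11, 101):
    print(n, round(majority_accuracy(n, 0.6), 3))
```

Under these assumptions the majority becomes more reliable as the crowd grows, even though no single voter improves.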
 
Suppose you have a dataset with 10 instances and you have to perform a classification task. Based on what you know so far, **which model would you use?**  

Most probably you'll pick the one that gives the highest accuracy, right?

### It's a bird? It's a plane? No. It's Batman!

Consider the fable of the blind men and the elephant depicted in the image below. The blind men are each describing an elephant from their own point of view. Their descriptions are all correct but incomplete. Their understanding of the elephant would be more accurate and realistic if they came together to discuss and combined their descriptions.

<img src="https://pluralsight2.imgix.net/guides/27d18692-27b8-42aa-8aa5-9668e625c41c_1.jpg" width="400">

### Types of Ensemble Learning

1. Bagging (Bootstrap Aggregation)
    * Voting
2. Boosting

The main idea behind **Ensemble Learning** is to use multiple algorithms and models together for the same task, whereas **single models** use only one algorithm (learner) to create a prediction model. **Bagging** and **boosting** methods combine several learners to achieve better predictions with higher consistency than individual learners.

<img src="https://pluralsight2.imgix.net/guides/81232a78-2e99-4ccc-ba8e-8cd873625fdf_2.jpg" width="500">

<!--<img src="https://miro.medium.com/v2/resize:fit:828/format:webp/1*8vznmPx1HXTvfW5QzbnsCg.png" width="600">-->

### Monitoring Ensemble Learning Models
Ensemble learning improves a model’s performance in mainly two ways:

- By reducing the variance of weak learners
- By reducing the bias of weak learners

So,

- **Bagging** can be used to **reduce the variance** of weak learners. 

- **Boosting** can be used to **reduce the bias** of weak learners.

So when should we use it? Clearly, when we see **overfitting** or **underfitting** of our models.

One must choose the ensemble method that most fits the problem.
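The variance-reduction claim can be checked with a small simulation. This is a sketch under strong assumptions: each "model" is just the true value plus independent Gaussian noise, so the models are unbiased and uncorrelated. Bagged models share training data and are correlated, so the reduction in practice is smaller. The constant `TRUE_VALUE` and both function names are made up for the example.

```python
import random
import statistics

random.seed(0)
TRUE_VALUE = 5.0  # hypothetical quantity every model tries to predict

def noisy_prediction() -> float:
    """One unbiased but high-variance predictor: truth plus Gaussian noise."""
    return TRUE_VALUE + random.gauss(0.0, 2.0)

def ensemble_prediction(n_models: int) -> float:
    """Average of n independent noisy predictors."""
    return statistics.fmean([noisy_prediction() for _ in range(n_models)])

single = [noisy_prediction() for _ in range(5000)]
averaged = [ensemble_prediction(25) for _ in range(5000)]

var_single = statistics.variance(single)
var_averaged = statistics.variance(averaged)
print(f"variance of one model:     {var_single:.3f}")
print(f"variance of 25-model mean: {var_averaged:.3f}")
```

Averaging leaves the bias unchanged (both stay centered on the true value) but shrinks the spread of the predictions.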

### Bias-Variance Trade-off

The next chart might be familiar to you; it illustrates the relationship and the trade-off between bias and variance with respect to the test error rate.

<img src="https://miro.medium.com/v2/resize:fit:828/format:webp/1*Za83whww3ucVQ7bDJq93AQ.png" width="500">

The relationship between the variance and bias of a model is such that a **reduction in variance** typically results in an **increase in bias**, and **vice versa**. To **achieve optimal performance**, the model must be positioned at an equilibrium point, where the test error rate is **minimized**, and the variance and bias are appropriately **balanced**.
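For squared-error loss, this trade-off can be written down explicitly. The decomposition below is a standard result, stated here under the assumption that the data follow $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$. The symbols $f$, $\hat{f}$ and $\sigma^2$ are introduced only for this illustration, and the expectation is taken over training sets and noise:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
$$

Making a model more flexible usually lowers the first term and raises the second, while the noise term cannot be reduced by any model.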

### High-bias and High-variance Models

The individual models that we combine can **often** (**NOT ALWAYS**) either have a **high bias** or **high variance**. Because they either have high bias or variance, weak learners cannot learn efficiently and can perform poorly.

<img src="https://i.stack.imgur.com/UpdVr.jpg" width="400">

- A high-bias model results from **not learning the data well enough** (underfitting). It fails to capture the underlying pattern of the data, so its predictions are systematically off, both on the training data and on new data.
- A high-variance model results from learning the training data **too well** (overfitting). It follows every individual data point, noise included, so it changes markedly from one training sample to the next and predicts new points poorly.

Both high bias and high variance models thus cannot generalize properly. Thus, weak learners will either make incorrect generalizations or fail to generalize altogether. Because of this, the predictions of weak learners cannot be relied on by themselves.
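The two failure modes can be seen side by side in a toy regression. This is only a sketch: the data (noisy samples of a sine curve) and both models are made up for the illustration. The "mean" model ignores the input entirely (high bias), while the 1-nearest-neighbour model memorises the training set (high variance).

```python
import math
import random

random.seed(1)

def make_data(n):
    """Noisy samples of y = sin(x) on [0, 6]."""
    xs = [random.uniform(0.0, 6.0) for _ in range(n)]
    ys = [math.sin(x) + random.gauss(0.0, 0.3) for x in xs]
    return xs, ys

def mse(predict, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

x_train, y_train = make_data(200)
x_test, y_test = make_data(200)

# High bias: ignores x and always predicts the training mean.
train_mean = sum(y_train) / len(y_train)
def mean_model(x):
    return train_mean

# High variance: returns the target of the closest training point.
def nearest_neighbour_model(x):
    i = min(range(len(x_train)), key=lambda j: abs(x_train[j] - x))
    return y_train[i]

for name, model in [("mean", mean_model), ("1-NN", nearest_neighbour_model)]:
    print(name,
          "train MSE:", round(mse(model, x_train, y_train), 3),
          "test MSE:", round(mse(model, x_test, y_test), 3))
```

The mean model is about equally wrong on both sets, whereas the 1-NN model is perfect on the training set and noticeably worse on unseen data.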

As we know from the bias-variance trade-off, an **underfit model** has **high bias and low variance**, whereas an **overfit** model has **high variance and low bias**. In either case, there is no balance between bias and variance. For there to be a balance, both the bias and variance need to be low. **Ensemble learning** tries to balance this bias-variance trade-off by reducing either the bias or the variance.

# 2. Bagging (Bootstrap Aggregation) - Reduce Variance

One way to get a diverse set of classifiers is to use very different training algorithms (the voting approach covered in section 2.1). Another approach is to use the same training algorithm for every predictor and train each one on a different random subset of the training set. When sampling is performed with replacement, this method is called ***Bagging*** (short for **Bootstrap Aggregation**); when it is performed without replacement, it is called ***Pasting***.  
  
In other words, bagging allows **training instances** to be sampled several times for the same predictor.  

<br>
<img src="https://coding-blocks.github.io/DS-NOTES/_images/ensem2.png" width="400">
<br>

Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors. The aggregation function is typically the *statistical mode* (the most frequent prediction) for classification, or the *statistical mean* (average) for regression.  
  
You can train these different models in **parallel systems**, via different CPU Cores or even different servers. Similarly, predictions can be made in parallel as well. This is one of the reasons why bagging is a very popular method: It scales very well.

<img src="https://miro.medium.com/v2/resize:fit:828/format:webp/1*9B7b8kDs9IrgfdcbWC0DLA.png" width="600">

The **Random Forest Classifier** is a **bagging** method: it is an ensemble of **Decision Tree Classifiers**, each trained on a bootstrap sample of the training set.

Bagging consists of **two steps**:

## Step 1: Bootstrapping
Involves resampling subsets of data with replacement from an initial dataset. In other words, subsets of data are taken from the initial dataset. These subsets of data are called bootstrapped datasets or, simply, bootstraps. Resampled ‘with replacement’ means an individual data point can be sampled multiple times. Each bootstrap dataset is used to train a weak learner.
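A bootstrap can be drawn with the standard library alone. In this minimal sketch the list `dataset` stands in for 1000 training instances, and `bootstrap_sample` is a helper name chosen for the example.

```python
import random

random.seed(42)
dataset = list(range(1000))  # stand-in for 1000 training instances

def bootstrap_sample(data):
    """Draw len(data) items uniformly at random, with replacement."""
    return random.choices(data, k=len(data))

sample = bootstrap_sample(dataset)
unique_fraction = len(set(sample)) / len(dataset)

print("bootstrap size:", len(sample))
print(f"distinct instances in the bootstrap: {unique_fraction:.1%}")
print(f"left out of this bootstrap:          {1 - unique_fraction:.1%}")
```

Because sampling is done with replacement, a bootstrap of the same size as the dataset contains on average only about 63% of the distinct instances; the rest are duplicates.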

## Step 2: Aggregating
The individual weak learners are trained independently from each other. Each learner makes independent predictions. The results of those predictions are aggregated at the end to get the overall prediction. The predictions are aggregated using either max voting or averaging.

### Max Voting (Classification)
Max voting is commonly used for classification problems and consists of taking the mode of the predictions (the most frequent prediction). It is called voting because, as in an election, the premise is that ‘the majority rules’. Each model makes a prediction, and each prediction counts as a single ‘vote’. The most frequent ‘vote’ is chosen as the prediction of the combined model.

### Averaging (Regression)
It is generally used for regression problems. It involves taking the average of the predictions. The resulting average is used as the overall prediction for the combined model.
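Both aggregation rules fit in a few lines of plain Python. The sketch below uses made-up predictions from five classifiers and four regressors; `max_vote` and `average` are names chosen for the example.

```python
from collections import Counter
from statistics import fmean

def max_vote(predictions):
    """Return the most frequent label among the predictions."""
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    """Return the arithmetic mean of numeric predictions."""
    return fmean(predictions)

class_votes = ["cat", "dog", "dog", "cat", "dog"]   # five classifiers
price_estimates = [10.0, 12.0, 11.0, 13.0]          # four regressors

print(max_vote(class_votes))     # dog
print(average(price_estimates))  # 11.5
```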

Because each sub-model is trained on a different random subset of the data, averaging their results smooths out individual errors and reduces the risk of overfitting. All models are trained in parallel and then aggregated afterward.

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*oTtwV5r6Qg9bjepZtHSuGQ.png" width=600>

Now let's have a look at the scikit-learn implementation of **Bagging**.

We'll use a very common dataset known as Moon Dataset : This is a toy dataset for binary classification in which the data points are shaped as two interleaving half circles.   
  
To know more about Moon Dataset, visit: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_moons.html

#### Generate Dataset

In [None]:
from sklearn.datasets import make_moons
X_bagging, y_bagging = make_moons(n_samples=500, noise=0.3, random_state=1)

#### Visualize the "Moons"

In [None]:
from matplotlib import pyplot as plt

#Plotting the Data
plt.scatter(X_bagging[:,0], X_bagging[:,1], c = y_bagging)
plt.show()

#### Train & Test split

In [None]:
#Splitting Training & Validation Dataset
from sklearn.model_selection import train_test_split
X_train_bagging, X_test_bagging, y_train_bagging, y_test_bagging = train_test_split(X_bagging, y_bagging, test_size=0.2, stratify=y_bagging, random_state=1)

#Shape of data
print(f"Training => X:{X_train_bagging.shape}, y:{y_train_bagging.shape}")
print(f"Testing => X:{X_test_bagging.shape}, y:{y_test_bagging.shape}")

#### Bagging Classifier

In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

#Individual SVM Classifier
svm_bagging = SVC(gamma='auto', kernel="rbf", C=1)
svm_bagging_model = svm_bagging.fit(X_train_bagging,y_train_bagging)
y_pred_bagging = svm_bagging_model.predict(X_test_bagging)
print("Individual SVM Model:", accuracy_score(y_test_bagging,y_pred_bagging))

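#Ensemble Classification model with SVM as base estimator
#BaggingClassifier clones the estimator for every bag, so an unfitted SVC is enough;
#random_state makes the sampling reproducible (the `estimator` keyword needs scikit-learn >= 1.2)
bag_clf = BaggingClassifier(estimator=SVC(gamma='auto', kernel="rbf", C=1), n_estimators=20, max_samples=0.8, bootstrap=True, random_state=1)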
bag_clf.fit(X_train_bagging, y_train_bagging)
y_pred_bagging_2 = bag_clf.predict(X_test_bagging)
print("Ensemble SVM Model:",accuracy_score(y_test_bagging, y_pred_bagging_2))

Let's configure our **Bagging** classifier

- ***estimator***: The underlying algorithm that is trained on each random subset in the bagging procedure. This could be, for example, Logistic Regression, Support Vector Classification, Decision Trees, or many others.
- ***n_estimators***: The number of estimators, i.e. the number of bags to create. The default value is 10.
- ***max_samples***: How many samples are drawn from X to train each base estimator. The default is 1.0, which means each bag contains as many samples as the training set. Setting it to 0.8 draws only 80% of that number.

Once configured, this model object works like many other models and can be trained using ``fit()`` with the X and y data from the training set. Predictions on the test data are obtained using ``predict()``.

Comparing the two printed accuracies shows whether bagging helped on this split. An improvement here illustrates the effect of bagging, although a single split does not prove it in general. You can build a *pasting* model instead by setting the ``bootstrap`` parameter of ``BaggingClassifier`` to ``False``, which means instances are sampled without replacement.

Bootstrapping introduces a bit more diversity in the subsets that each predictor is trained on, so bagging ends up with slightly higher bias than pasting; but the extra diversity also means that the predictors end up being less correlated, which reduces the ensemble's variance. Overall, bagging often results in better models, which explains why it is generally preferred over pasting.

**Note:** The `BaggingClassifier` class supports sampling the features as well. This is controlled by two hyperparameters: `max_features` and `bootstrap_features`. They allow random sampling of features with or without repetition, so each predictor can be trained on a random subset of the input features. Sampling features results in even more predictor diversity, which lowers the correlation between predictors and makes the ensemble more effective.
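The idea of feature sampling can be sketched without scikit-learn. This illustrates the concept only and is not the library's implementation. The feature names are hypothetical, and the `max_features` and `with_replacement` arguments of the helper merely play the roles of the two hyperparameters named above.

```python
import random

random.seed(7)
# Hypothetical feature names, used only for this illustration.
feature_names = ["age", "bmi", "glucose", "heart_rate", "systolic_bp", "diastolic_bp"]

def sample_features(names, max_features, with_replacement):
    """Pick the features one predictor is allowed to see.

    max_features is the fraction of features to draw; with replacement,
    the same feature may be drawn more than once.
    """
    k = max(1, int(max_features * len(names)))
    if with_replacement:
        return random.choices(names, k=k)
    return random.sample(names, k=k)

# Each predictor in the ensemble gets its own random view of the columns.
for i in range(3):
    print(f"predictor {i}:", sample_features(feature_names, 0.5, with_replacement=False))
```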

## 2.1 Voting

<img src="https://coding-blocks.github.io/DS-NOTES/_images/ensem1.png" width="400">

**Voting** and **Bagging** ensemble methods are similar in that they decide on the final result by combining multiple algorithms, but they differ in how the data is sampled. The diagram below illustrates the difference.

<img src="https://miro.medium.com/v2/resize:fit:828/format:webp/1*E_O_FSHK6SHL6LTNeYaDXw.png" width="500">

As can be seen, the algorithms in the voting method are trained on the same dataset, whereas the algorithms in the bagging method are trained on different bootstrapped datasets, i.e. datasets randomly sampled from the original dataset with replacement. Furthermore, bagging combines identical algorithms, while voting combines different algorithms. 

The voting method comes in two variants: **hard voting** and **soft voting**. Hard voting is equivalent to a majority vote over the predicted classes. Soft voting averages the class probabilities predicted by the individual algorithms and picks the class with the highest average. Soft voting is often preferred when the classifiers produce reliable probabilities, because it gives more weight to confident predictions. The diagram below shows the mechanism of soft voting.
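Hard and soft voting can disagree on the same instance. The sketch below uses made-up class-1 probabilities from three hypothetical classifiers in a binary problem; `hard_vote` and `soft_vote` are names chosen for the example.

```python
from collections import Counter

# Hypothetical predicted probabilities of class 1 from three classifiers
# for a single instance.
proba_class_1 = [0.45, 0.40, 0.95]

def hard_vote(probas, threshold=0.5):
    """Each classifier casts one vote for its own predicted class."""
    votes = [int(p >= threshold) for p in probas]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(probas, threshold=0.5):
    """Average the probabilities first, then pick the class."""
    mean_proba = sum(probas) / len(probas)
    return int(mean_proba >= threshold)

print("hard voting ->", hard_vote(proba_class_1))  # 0: two of three vote for class 0
print("soft voting ->", soft_vote(proba_class_1))  # 1: the mean probability is about 0.60
```

Two classifiers lean weakly towards class 0 while one is very confident about class 1, so the majority picks class 0 but the averaged probability picks class 1.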

<img src="https://miro.medium.com/v2/resize:fit:640/format:webp/1*NZeYtOSBCzwigkC_QBYhgw.jpeg" width="300">



The above ensemble model is also called a **Voting Classifier**, since the task is classification and the prediction is made by majority vote. Somewhat surprisingly, a Voting Classifier **often** (**NOT ALWAYS**) achieves a higher accuracy than the best classifier in the ensemble.

### Example: Image Classification

The key idea can be illustrated with an image classification example. Suppose we have a collection of images, each labeled with the kind of animal it shows, for training a model. In a traditional modeling approach, we would try several techniques and compare their accuracy to choose one over the others. Imagine we used logistic regression, a decision tree, and a support vector machine, which perform differently on the given dataset.

<img src="https://miro.medium.com/v2/resize:fit:828/format:webp/1*RS7jOPKFRPSvg5yW9daWTw.png" width="600">

In the above example, it was observed that a specific record was predicted as a dog by the logistic regression and decision tree models, while a support vector machine identified it as a cat. As various models have their distinct advantages and disadvantages for particular records, it is the key idea of ensemble learning to combine all three models instead of selecting only one approach that showed the highest accuracy.

The procedure is called **aggregation** or **voting** and combines the predictions of all underlying models, to come up with one prediction that is assumed to be more precise than any sub-model that would stay alone.

Now let's try out ``VotingClassifier`` through Scikit-Learn library.

We'll use a dataset about Maternal Health Risk. It contains data collected from hospitals, community clinics, and maternal health care centers in rural areas of Bangladesh through an IoT-based risk monitoring system.
  
To know more about this dataset, visit: https://archive.ics.uci.edu/dataset/863/maternal+health+risk

#### Dataset import

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

dataset = pd.read_csv("maternal_risk.csv")
print(dataset.shape)
dataset.head()

#### Encode target variable

In [None]:
le = LabelEncoder()
dataset.RiskLevel = le.fit_transform(dataset.RiskLevel)
dataset.tail()

#### Normalization

In [None]:
X_maternal_risk = dataset.drop('RiskLevel', axis=1)
y_maternal_risk = dataset.RiskLevel

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_maternal_risk = scaler.fit_transform(X_maternal_risk)

#### Train & Test sets

In [None]:
#Train / Test split
X_train_voting, X_test_voting, y_train_voting, y_test_voting = train_test_split(X_maternal_risk, y_maternal_risk, test_size=0.2, stratify=y_maternal_risk, random_state=1)

#Shape of data
print(f"Training => X:{X_train_voting.shape}, y:{y_train_voting.shape}")
print(f"Testing => X:{X_test_voting.shape}, y:{y_test_voting.shape}")

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier

knn = KNeighborsClassifier(n_neighbors=20) # K-Nearest-Neighbours
lr = LogisticRegression(random_state=1) # Logistic Regression
svm = SVC(gamma='auto', kernel='rbf', C=1, probability=True) # SVM; probability=True is only required for voting='soft'
#dt = DecisionTreeClassifier(random_state=1) # Decision Tree

voting_clf = VotingClassifier(estimators=[('KNN', knn), ('LR', lr), ('SVM', svm)], voting='hard')
voting_clf.fit(X_train_voting,y_train_voting)

In [None]:
from sklearn.metrics import accuracy_score

for classifier in (knn, lr, svm, voting_clf):
    classifier.fit(X_train_voting,y_train_voting)
    y_pred_voting = classifier.predict(X_test_voting)
    print(classifier.__class__.__name__, "=>", accuracy_score(y_test_voting, y_pred_voting))

Here, the ``VotingClassifier`` achieves a **slight improvement in accuracy** compared with the individual models.

This is the benefit of **Ensemble Learning**. The gain grows as the errors of the individual models become less *correlated*.
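The role of error correlation can be checked with a small simulation. This is a sketch under stated assumptions: three classifiers, each correct 70% of the time (a made-up figure), compared in two extreme cases, fully independent errors and perfectly shared errors.

```python
import random

random.seed(3)
P_CORRECT = 0.7   # assumed accuracy of every individual classifier
N_TRIALS = 20000

def majority_is_correct(votes):
    """votes is a list of booleans: True means that classifier was right."""
    return sum(votes) > len(votes) / 2

independent_hits = 0
correlated_hits = 0
for _ in range(N_TRIALS):
    # Independent errors: each classifier is right or wrong on its own.
    independent = [random.random() < P_CORRECT for _ in range(3)]
    # Perfectly correlated errors: all three succeed or fail together.
    shared = random.random() < P_CORRECT
    correlated = [shared, shared, shared]
    independent_hits += majority_is_correct(independent)
    correlated_hits += majority_is_correct(correlated)

print("independent errors:", independent_hits / N_TRIALS)
print("correlated errors: ", correlated_hits / N_TRIALS)
```

With perfectly correlated errors the vote is no better than a single classifier, while independent errors lift the majority above the individual accuracy.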

# 3. Boosting - Reduce Bias

Boosting differs from bagging in that it trains its models sequentially instead of in parallel. While bagging aims to reduce the variance of the model, boosting aims to reduce the bias to avoid underfitting the data. With that idea in mind, boosting starts by training an average-performing (weak) model on a random subset of the data.

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*dcSp9TmyJGHzlUPM45hgAQ.png" width=600>
  
It then combines the misclassified entries of the weak model with some other random data to train a new model. The models are therefore not chosen at random but are mainly shaped by the entries the previous model classified wrongly. The steps of this technique are the following:

1. **Train initial (weak) model**: You create a subset of the data and train a weak learning model which is assumed to be the final ensemble model at this stage. You then analyze the results on the given training data set and can identify those entries that were misclassified.
2. **Update weights and train a new model**: You create a new random subset of the original training data but weight those misclassified entries higher. This dataset is then used to train a new model.
3. **Aggregate the new model with the ensemble model**: The next model should perform better on the more difficult entries and will be combined (aggregated) with the previous one into the new final ensemble model.

There are many boosting methods available, but by far the most popular are **AdaBoost** (short for Adaptive Boosting) and **Gradient Boosting**. Let's start with AdaBoost. 

## 3.1 AdaBoost
One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor got wrong. This results in new predictors focusing more and more on the hard cases. This is the technique used by AdaBoost.  

For example, when training an AdaBoost classifier, the algorithm first trains a base classifier (such as a Decision Tree) and uses it to make predictions on the training set. The algorithm then increases the relative weight of the misclassified training instances. Then it trains a second classifier using the updated weights, again makes predictions on the training set, updates the instance weights, and so on.
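One round of this weight update can be sketched as follows. The formulas are the textbook update for binary (discrete) AdaBoost; other formulations, including scikit-learn's SAMME, use slightly different constants. The ten instances and the two "misclassified" indices are made up, and `adaboost_round` is a name chosen for the example.

```python
import math

def adaboost_round(weights, is_misclassified):
    """One AdaBoost weight update for binary classification.

    weights: current instance weights (assumed to sum to 1).
    is_misclassified: booleans, True where the current predictor was wrong.
    Assumes the weighted error lies strictly between 0 and 0.5.
    Returns (predictor_weight, new_instance_weights).
    """
    error = sum(w for w, wrong in zip(weights, is_misclassified) if wrong)
    alpha = 0.5 * math.log((1 - error) / error)  # predictor weight
    updated = [
        w * math.exp(alpha if wrong else -alpha)  # boost wrong, shrink right
        for w, wrong in zip(weights, is_misclassified)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]   # renormalise to sum to 1

n = 10
weights = [1 / n] * n                     # start with uniform weights
wrong = [i in (2, 5) for i in range(n)]   # pretend instances 2 and 5 were misclassified

alpha, new_weights = adaboost_round(weights, wrong)
print("predictor weight:", round(alpha, 3))
print("weight of a misclassified instance:", round(new_weights[2], 4))
print("weight of a correct instance:      ", round(new_weights[0], 4))
```

After the update, the two misclassified instances together carry half of the total weight, so the next classifier cannot ignore them.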

Consider the error equation :  

$\large{e = \sum_{i = 1}^n (y_i - \hat{y}_i)}$  

  
> For n = 100 :-  
> 
>> $e = (y_1 - \hat{y}_1) + (y_2 - \hat{y}_2) + \hspace{1mm}...\hspace{1mm} + \mathbf{6}(y_{26} - \hat{y}_{26}) + \hspace{1mm}...\hspace{1mm} + \mathbf{10} (y_{56} - \hat{y}_{56}) + \hspace{1mm}...\hspace{1mm}+(y_{100} - \hat{y}_{100})$  

Suppose the model misclassified instances no. **26** and **56**. We can multiply the error terms of these instances by constants, so that the error equation for the next model puts extra weight on these points and the model tries harder to classify them correctly. This is the basis of Adaptive Boosting.  

Consider a Classification Dataset:

<img src="https://coding-blocks.github.io/DS-NOTES/_images/ensem3.png" width=400>

Now if we train an individual Decision Tree model, its decision boundary will look something like this, and as we can see it misclassifies a lot of points.

<img src="https://coding-blocks.github.io/DS-NOTES/_images/ensem4.png" width=400>

So, what we do in Adaptive Boosting is put extra weight on the misclassified points, so that the next model emphasizes them more, which somewhat improves the decision boundary for our dataset.

So, we multiply the terms of these points by constants, which can differ from point to point, and repeat this as we add more models. The algorithm stops when the desired number of predictors is reached, or when a perfect predictor is found. To make predictions, AdaBoost computes the predictions of all the predictors and weighs them using the predictor weights. The predicted class is the one that receives the majority of weighted votes.
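The weighted vote used at prediction time can be sketched in a few lines. The labels and predictor weights below are made up, and `weighted_vote` is a name chosen for the example.

```python
from collections import defaultdict

def weighted_vote(predictions, predictor_weights):
    """Return the class with the largest total predictor weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, predictor_weights):
        totals[label] += weight
    return max(totals, key=totals.get)

# Hypothetical ensemble of three predictors voting on one instance.
predictions = ["A", "B", "B"]
predictor_weights = [1.2, 0.4, 0.5]

print(weighted_vote(predictions, predictor_weights))  # A (1.2 beats 0.4 + 0.5)
```

Unlike a plain majority vote, one accurate predictor with a large weight can outvote several weaker ones.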

<img src="https://coding-blocks.github.io/DS-NOTES/_images/ensem5.png" width=400>

Now, let's have a look at the scikit-learn implementation of AdaBoost. Scikit-Learn uses a multiclass version of AdaBoost called ***SAMME*** [**S**tagewise **A**dditive **M**odeling using a **M**ulticlass **E**xponential loss function]. When there are just two classes (binary classification), SAMME is equivalent to AdaBoost. So let's compare the accuracies of an individual Decision Tree and an AdaBoost ensemble of Decision Trees.

The code below trains an AdaBoost Classifier based on **200 Decision Trees** using Scikit-Learn's `AdaBoostClassifier` class.

To check the performance of AdaBoost, we'll use the Heart Failure Prediction dataset (https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction).


#### Import dataset

In [None]:
df_heart_failure = pd.read_csv('heart_failure.csv')
df_heart_failure.head()

#### Encode all categorical values

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

df_heart_failure['Sex']=le.fit_transform(df_heart_failure['Sex'])
df_heart_failure['RestingECG']=le.fit_transform(df_heart_failure['RestingECG'])
df_heart_failure['ChestPainType']=le.fit_transform(df_heart_failure['ChestPainType'])
df_heart_failure['ExerciseAngina']=le.fit_transform(df_heart_failure['ExerciseAngina'])
df_heart_failure['ST_Slope']=le.fit_transform(df_heart_failure['ST_Slope'])

df_heart_failure.head()

#### Create and run the ``AdaBoostClassifier``

In [None]:
from sklearn.ensemble import AdaBoostClassifier

X_ada = df_heart_failure.drop('HeartDisease', axis=1)
y_ada = df_heart_failure.HeartDisease

#Train / Test split
X_train_ada, X_test_ada, y_train_ada, y_test_ada = train_test_split(X_ada, y_ada, test_size=0.2, stratify=y_ada, random_state=1)

ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=10, random_state=1), n_estimators=200, random_state=1)
ada_clf.fit(X_train_ada, y_train_ada)

dt = DecisionTreeClassifier(random_state=1)
dt.fit(X_train_ada, y_train_ada)

y_pred_ensemble = ada_clf.predict(X_test_ada)
y_pred_dt = dt.predict(X_test_ada)

# accuracy_score expects the true labels first, then the predictions
print("Individual Decision Tree Accuracy", accuracy_score(y_test_ada, y_pred_dt))
print("AdaBoost Ensemble Accuracy", accuracy_score(y_test_ada, y_pred_ensemble))

As you can see, in this run the accuracy increased from **~76.1%** to **~83.7%**. Here we performed a classification task using `AdaBoostClassifier`; regression tasks can be performed using `AdaBoostRegressor`.

## 3.2 Gradient Boosting

Another very popular boosting algorithm is **Gradient Boosting**. Just like AdaBoost, Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However, instead of tweaking the instance weights at every iteration like AdaBoost does, this method fits the new predictor to the *residual errors* made by the previous predictor. The concept is easiest to understand through an example.  
  
Consider the case of Linear Regression.  

> $e = y - \hat{y}$  
  
Let's take our 1$^{st}$ model and apply this equation.  

> $e_1 = y - \hat{y}_1$  
  
Suppose it returned some error $e_1$, so now we'll train our next model on this error like:  
  
> $e_2 = e_1 - \hat{e}_1\hspace{1.5cm}$ or we can say  
>   
> $e_2 = y - \hat{y}_1 - \hat{e}_1$  
  
Comparing the above equation with $e = y - \hat{y}$, we can say  
  
> $\hat{y} = \hat{y}_1 + \hat{e}_1$  
  
For multiple models:  

> $\hat{y} = \hat{y}_1 + \hat{e}_1 + \hat{e}_2 + \hat{e}_3 + \hat{e}_4 + ...$  
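Before turning to a real dataset, the recursion above can be sketched in plain Python. This is a minimal illustration, not a full gradient boosting implementation: the base model is a one-split regression tree (a "stump") written from scratch, the data are a made-up 1-D grid with $y = x^2$, and all function names are chosen for the example.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression tree on 1-D inputs (least squares)."""
    best = None
    for threshold in sorted(set(xs))[1:]:  # skipping the minimum keeps both sides non-empty
        left = [t for x, t in zip(xs, targets) if x < threshold]
        right = [t for x, t in zip(xs, targets) if x >= threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean

def mse(values):
    """Mean of the squared values (here: squared residuals)."""
    return sum(v * v for v in values) / len(values)

# Toy 1-D regression data (hypothetical): y = x^2 on a small grid.
xs = [i / 10 for i in range(-10, 11)]
ys = [x * x for x in xs]

models = []
residuals = ys[:]                     # before any model is fit, nothing is explained
training_mse = [mse(residuals)]
for _ in range(3):
    stump = fit_stump(xs, residuals)  # first pass gives y_hat_1, later passes e_hat_1, e_hat_2
    models.append(stump)
    residuals = [r - stump(x) for x, r in zip(xs, residuals)]  # new error = old error - its fit
    training_mse.append(mse(residuals))

def predict(x):
    """Ensemble prediction: the sum of all fitted models."""
    return sum(m(x) for m in models)

print([round(m, 4) for m in training_mse])
```

Each added stump is fitted to what the previous ones left unexplained, so the training error printed at the end never goes up from one round to the next.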

**Writing this programmatically:**

We'll use the BMW Price prediction dataset - https://www.kaggle.com/datasets/danielkyrka/bmw-pricing-challenge

#### Import dataset

In [None]:
df_bmw = pd.read_csv('bmw_pricing_challenge.csv')
df_bmw.head()

#### Feature selection

In [None]:
df_bmw = df_bmw.drop(['maker_key', 'sold_at'], axis=1, errors='ignore')
df_bmw.head()

#### Encode all categorical values

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

df_bmw['model_key']=le.fit_transform(df_bmw['model_key'])
df_bmw['fuel']=le.fit_transform(df_bmw['fuel'])
df_bmw['paint_color']=le.fit_transform(df_bmw['paint_color'])
df_bmw['car_type']=le.fit_transform(df_bmw['car_type'])

# Convert registration_date to integer
df_bmw['registration_date'] = df_bmw['registration_date'].str.replace("-","").astype(int)

df_bmw.head()

#### Manual Gradient Boost

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

X_gb = df_bmw.drop('price', axis=1, errors='ignore')
y_gb = df_bmw.price

#Train / Test split
X_train_gb, X_test_gb, y_train_gb, y_test_gb = train_test_split(X_gb, y_gb, test_size=0.2, random_state=1)

tree_reg1 = DecisionTreeRegressor(max_depth=20, random_state=1)
tree_reg1.fit(X_train_gb, y_train_gb)

e1 = y_train_gb - tree_reg1.predict(X_train_gb)
tree_reg2 = DecisionTreeRegressor(max_depth=20, random_state=1)
tree_reg2.fit(X_train_gb, e1)

e2 = e1 - tree_reg2.predict(X_train_gb)
tree_reg3 = DecisionTreeRegressor(max_depth=20, random_state=1)
tree_reg3.fit(X_train_gb, e2)

y_pred_gb_manual = sum(tree.predict(X_test_gb) for tree in (tree_reg1, tree_reg2, tree_reg3))

print('Mean Absolute Error:', mean_absolute_error(y_test_gb, y_pred_gb_manual))
print('Mean Squared Error:', mean_squared_error(y_test_gb, y_pred_gb_manual))
print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test_gb, y_pred_gb_manual)))

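The same idea is available, with more tuning options, in scikit-learn's ``GradientBoostingRegressor``. Note that the configuration below uses a different loss (``absolute_error``) and a learning rate of 0.6, so its results will not match the manual version exactly.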

#### Sklearn ``GradientBoostingRegressor`` implementation

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

gbr = GradientBoostingRegressor(
    max_depth=20, 
    n_estimators=3, 
    loss='absolute_error', 
    learning_rate=0.6,
    random_state=1
)
gbr.fit(X_train_gb, y_train_gb)

y_pred_gb_sklearn = gbr.predict(X_test_gb)

print('Mean Absolute Error:', mean_absolute_error(y_test_gb, y_pred_gb_sklearn))
print('Mean Squared Error:', mean_squared_error(y_test_gb, y_pred_gb_sklearn))
print('Root Mean Squared Error:', np.sqrt(mean_squared_error(y_test_gb, y_pred_gb_sklearn)))

Now let’s try to understand the graphical representation of the error also (Refer to the figure below):

<img src="https://coding-blocks.github.io/DS-NOTES/_images/ensem6.png" width=600>

* The figure represents the predictions of these three trees in the left column, and the ensemble’s prediction in the right column. 
* In the first row, the ensemble has just one tree, so its predictions are exactly the same as the first tree’s predictions. 
* In the second row, a new tree is trained on the residual errors of the first tree. 
* On the right you can see that the ensemble’s predictions are equal to the sum of the predictions of the first two trees. 
* Similarly, in the third row another tree is trained on the residual errors of the second tree. You can see that the ensemble’s predictions gradually get better as trees are added to the ensemble.

In the same way, Gradient Boosting can also be used for classification tasks, via ``GradientBoostingClassifier`` in the ``ensemble`` module of scikit-learn.

# 4. Conclusion: Bagging vs. Boosting 

Bagging and boosting are important for ensuring the accuracy of models. They can help prevent undesirable consequences caused by inaccurate models. Below are some of the key takeaways from the article:

* **Ensemble learning** combines multiple machine learning models into a single model, with the aim of improving performance.

* **Bagging** aims to **decrease variance**, **boosting** aims to **decrease bias**.
* **Bagging** and **boosting** combine **homogeneous weak learners**.
* **Bagging** trains models in **parallel** and **boosting** trains the models **sequentially**.

The table below shows the similarities and the differences between the ensemble methods.

<img src="https://pluralsight2.imgix.net/guides/56a97436-3f9d-47b0-9a09-2dfadef5253d_8.jpg" width="600">