#### Boosting


Bagging is an ensemble method that creates different training subsets by sampling the training data with replacement. The same algorithm, with the same set of hyperparameters, is then trained on each of these subsets.

Because each copy of the algorithm sees a different subset of the training data, the individual models differ slightly from one another. Their predictions are combined by averaging for a regression problem or by taking a majority vote for a classification problem. Random forest is an example of the bagging method.

Bagging works well when the algorithm used to build the model has high variance, meaning that the fitted model changes a lot even with slight changes in the data. Such algorithms overfit easily if not controlled. Recall that decision trees are prone to overfitting when their hyperparameters are not tuned well, which is why bagging works very well for them.
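The two ingredients of bagging described above, sampling with replacement and combining predictions, can be sketched in a few lines. This is only an illustration: the helper names `bootstrap_sample` and `majority_vote` and the toy data are assumptions, not part of any library.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in range(len(data))]

def majority_vote(predictions):
    """Return the most common label among the individual predictions."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)            # fixed seed so the run is repeatable
data = list(range(10))            # hypothetical training observations

# Each model in the ensemble would be trained on one of these subsets.
subsets = [bootstrap_sample(data, rng) for _ in range(3)]
for subset in subsets:
    print(sorted(subset))

# Classification: combine the individual predictions by majority vote.
vote = majority_vote(["spam", "ham", "spam"])
# Regression: combine the individual predictions by averaging.
average = sum([2.0, 3.0, 4.0]) / 3
print(vote, average)
```

Each subset has the same size as the original data, but some observations typically appear more than once and others not at all.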


**Boosting** is another popular approach to ensembling. This technique combines individual models into a strong learner by creating sequential models such that the final model has a higher accuracy than the individual models.

These individual models are connected in such a way that the subsequent models are dependent on errors of the previous model and each subsequent model tries to correct the errors of the previous models. 

#### Weak Learner

A weak learner is a simple model that performs at least slightly better than random guessing (its error rate should be less than 0.5). It primarily identifies only the prominent pattern(s) present in the data and is therefore unlikely to overfit. In boosting, such weak learners are used to build the ensemble.

A decision stump, i.e., a shallow decision tree with a depth of only 1, is one such weak learner.
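To make the idea concrete, here is a minimal sketch of a decision stump on a single numeric feature. It assumes labels coded as +1 and -1, and the function name `fit_stump` and the toy data are hypothetical.

```python
def fit_stump(xs, ys):
    """Return (error_rate, threshold, polarity) of the best depth-1 split.

    The stump predicts `polarity` when x > threshold and `-polarity` otherwise.
    Labels are assumed to be +1 or -1.
    """
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            mistakes = sum(
                1 for x, y in zip(xs, ys)
                if (polarity if x > threshold else -polarity) != y
            )
            error_rate = mistakes / len(xs)
            # Keep the first split that achieves the lowest error rate.
            if best is None or error_rate < best[0]:
                best = (error_rate, threshold, polarity)
    return best

# Toy data that no single threshold can separate perfectly.
xs = [1, 2, 3, 4, 5, 6]
ys = [-1, -1, 1, -1, 1, 1]
error_rate, threshold, polarity = fit_stump(xs, ys)
print(error_rate, threshold, polarity)
```

The stump misclassifies one of the six points, so its error rate is below 0.5: it captures the main pattern without fitting every observation.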

To summarize: Weak learners are combined sequentially such that each subsequent model corrects the mistakes of the previous model, resulting in a strong overall model that gives good predictions.

#### AdaBoost

An overview of the steps that need to be taken in this boosting algorithm:

1. AdaBoost starts with a uniform distribution of weights over training examples, i.e., it gives equal weights to all its observations. These weights tell the importance of each datapoint being considered.
2. We start with a single weak learner to make the initial predictions.
3. Once the initial predictions are made, the next weak learner takes care of the patterns that the previous weak learner did not capture, by giving more weight to the misclassified datapoints.
4. Apart from giving a weight to each observation, the model also gives a weight to each weak learner. The higher the error of a weak learner, the lower the weight given to it. These weights are used when the ensemble makes its final predictions.
5. After getting the two weights for the observations and the individual weak learners, the next weak learner in the sequence trains on the resampled data (data sampled according to the weights) to make the next prediction.
6. The model will iteratively continue the steps mentioned above for a pre-specified number of weak learners. 
7. In the end, you need to take a weighted sum of the predictions from all these weak learners to get an overall strong learner.

A strong learner is formed by combining multiple weak learners which are trained on the mistakes of the previous model.
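The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not a reference implementation: labels are coded as +1 and -1, the weak learner is a stump on one numeric feature, the learner weight uses the standard AdaBoost formula α = 0.5 · ln((1 − error) / error), and, unlike step 5, the observation weights are passed directly to the stump instead of resampling the data (a common variant). All function names and the toy data are hypothetical.

```python
import math

def fit_weighted_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump.

    The stump predicts `polarity` when x > threshold, else `-polarity`.
    """
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if (polarity if x > threshold else -polarity) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, n_learners):
    n = len(xs)
    weights = [1.0 / n] * n                    # step 1: uniform weights
    ensemble = []
    for _ in range(n_learners):                # step 6: fixed number of learners
        error, threshold, polarity = fit_weighted_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # step 4: learner weight
        ensemble.append((alpha, threshold, polarity))
        # Step 3: raise the weights of misclassified points, lower the rest.
        scaled = []
        for x, y, w in zip(xs, ys, weights):
            pred = polarity if x > threshold else -polarity
            scaled.append(w * math.exp(alpha if pred != y else -alpha))
        total = sum(scaled)
        weights = [w / total for w in scaled]  # normalize to sum to 1
    return ensemble

def predict(ensemble, x):
    # Step 7: sign of the weighted sum of the weak learners' votes.
    score = sum(
        alpha * (polarity if x > threshold else -polarity)
        for alpha, threshold, polarity in ensemble
    )
    return 1 if score >= 0 else -1

xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, n_learners=3)
predictions = [predict(ensemble, x) for x in xs]
print(predictions == ys)
```

No single stump can classify this toy dataset correctly, but three stumps combined by their weights classify every training point.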



In AdaBoost, we start with a base model with equal weights given to every observation. In the next step, the observations which are incorrectly classified will be given a higher weight so that when a new weak learner is trained, it will give more attention to these misclassified observations.


In the end, you get a series of models that have a different say according to the predictions each weak model has made. If the model performs poorly and makes many incorrect predictions, it is given less importance, whereas if the model performs well and makes correct predictions most of the time, it is given more importance in the overall model.


The say/importance that each weak learner (in our case, a decision stump) has in the final classification depends on the total error it made.
![image.png](attachment:image.png)
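The formula itself is not written out in the text; assuming the plot shows the standard AdaBoost relation $\alpha = \frac{1}{2}\ln\left(\frac{1 - \text{error}}{\text{error}}\right)$, the curve can be reproduced numerically with a short sketch. The function name `learner_say` is hypothetical.

```python
import math

def learner_say(error):
    """Say (alpha) of a weak learner whose total error is strictly between 0 and 1.

    Assumes the standard AdaBoost formula: 0.5 * ln((1 - error) / error).
    """
    return 0.5 * math.log((1 - error) / error)

# Low error gives a positive alpha, 0.5 gives zero, high error gives a negative alpha.
for error in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"error={error:.1f}  alpha={learner_say(error):+.3f}")
```

The curve is symmetric around an error of 0.5: a learner with error 0.9 gets the same magnitude of say as one with error 0.1, but with the opposite sign.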

The value of the error rate lies between 0 and 1. So, let’s see how alpha and the error are related.

1. When the weak learner makes few errors overall, α is a large positive value, as you can see in the plot above, so the weak learner has a high say in the final model.
2. If the error is 0.5, the weak learner is no better than a random guess, so α = 0 and it has no say or significance in the final model.
3. If the weak learner makes large errors (i.e., close to 1), α is a large negative value. Its predictions are incorrect most of the time, so the negative α reverses its vote in the final model.

After calculating the say/importance of each weak learner, you must determine the new weight of each observation in the training data set. Use the following formulas to compute the new weight for each observation:

 

New sample weight for an incorrectly classified observation = original sample weight × $e^{\alpha}$

New sample weight for a correctly classified observation = original sample weight × $e^{-\alpha}$

After calculating the new weights, we normalize them so that they sum to 1, using the following formula:

Normalized weight: $\dfrac{p(x_{i})}{\sum \limits_{j=1}^{n} p(x_{j})}$
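As a worked example of the update and normalization, the sketch below takes 10 observations with equal weights, of which a weak learner misclassifies 3 (error 0.3). The value of α assumes the standard AdaBoost formula 0.5 · ln((1 − error) / error); the function name `update_weights` and the numbers are illustrative.

```python
import math

def update_weights(weights, misclassified, alpha):
    """Scale each weight by e^alpha if misclassified, e^-alpha otherwise, then normalize."""
    scaled = [
        w * math.exp(alpha if wrong else -alpha)
        for w, wrong in zip(weights, misclassified)
    ]
    total = sum(scaled)
    return [w / total for w in scaled]

weights = [0.1] * 10                       # equal starting weights, 1/n
misclassified = [False] * 7 + [True] * 3   # 3 of 10 wrong, so error = 0.3
alpha = 0.5 * math.log((1 - 0.3) / 0.3)    # assumed standard formula for the say

new_weights = update_weights(weights, misclassified, alpha)
print([round(w, 4) for w in new_weights])
```

After the update, each misclassified observation weighs more than 0.1 and each correctly classified one weighs less, while the weights still sum to 1.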

The samples which the previous stump incorrectly classified will be given higher weights and the ones which the previous stump classified correctly will be given lower weights.

For the next weak learner, we create a new, empty dataset of the same size as the original one and fill it by sampling observations from the original dataset with replacement, using the updated weights from the previous model as the sampling distribution.
Because of these weights, the new dataset will tend to contain multiple copies of the observation(s) that were misclassified by the previous tree and may not contain all the observations that were correctly classified.

After resampling, every observation in the new dataset is again given an initial weight of 1/n, so we can repeat the same process as before to build the next weak learner.
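A minimal sketch of this resampling step, using the standard library's `random.choices` to draw with replacement according to the weights. The weights here are illustrative values (7 correctly classified observations and 3 misclassified ones), and the function name `resample` is hypothetical.

```python
import random

def resample(observations, weights, rng):
    """Draw len(observations) items with replacement, proportionally to weights."""
    return rng.choices(observations, weights=weights, k=len(observations))

rng = random.Random(42)                 # fixed seed so the run is repeatable
observations = list(range(10))          # hypothetical observation ids
weights = [1 / 14] * 7 + [1 / 6] * 3    # ids 7, 8, 9 were misclassified

new_data = resample(observations, weights, rng)
new_weights = [1 / len(new_data)] * len(new_data)   # reset to 1/n
print(sorted(new_data))
```

Over many draws, the heavier observations (7, 8 and 9 here) show up more often than the others, which is what pushes the next weak learner to focus on them.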

 

This will help the next weak learner give more importance to the incorrectly classified sample so that it can correct the mistake and correctly classify it now. This process will be repeated till a pre-specified number of trees are built, i.e., the ensemble is built. 

 

The AdaBoost model makes predictions by having each tree in the ensemble classify the sample. Then, the trees are split into groups according to their decisions, and for each group the say of every tree inside the group is added up. The final prediction of the ensemble is the decision of the group with the larger total, which, when the two classes are coded as +1 and -1, is the same as taking the sign of the weighted sum.
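The grouping view and the weighted-sum view can be compared in a short sketch. It assumes classes coded as +1 and -1 and non-negative says; the function name `ensemble_predict` and the three (decision, say) pairs are made-up values for illustration.

```python
def ensemble_predict(votes):
    """votes: list of (label, alpha) pairs, one per tree, with labels +1 or -1.

    Returns the total say per group and the sign of the weighted sum.
    """
    say_for = {1: 0.0, -1: 0.0}
    for label, alpha in votes:
        say_for[label] += alpha            # add up the say inside each group
    weighted_sum = sum(label * alpha for label, alpha in votes)
    prediction = 1 if weighted_sum >= 0 else -1
    return say_for, prediction

# Hypothetical decisions and says of three trees for one sample.
votes = [(-1, 0.42), (1, 0.65), (1, 0.75)]
say_for, prediction = ensemble_predict(votes)
print(say_for, prediction)
```

The group voting +1 has the larger total say, and the sign of the weighted sum gives the same answer.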

The final model is a strong learner made by the weighted sum of all the individual weak learners.

In [14]:
import numpy as np
x=np.log(4/3)

In [11]:
0.5*x

0.2027325540540821

In [15]:
np.exp(x)/(1+np.exp(x))

0.5714285714285715

In [9]:
2*.1*2

0.4