# Ensemble Techniques (Boosting & Bagging)

Ensemble techniques fall into two main types:
- Bagging (Bootstrap Aggregation)
    - Random Forest (uses multiple `decision trees`)
- Boosting
    - AdaBoost
    - Gradient Boosting
    - XGBoost
    
![ensemble](images/ensemble.png)

![ensemble2](images/ensemble-2.png)

## Bagging (bootstrap aggregation)

![bagging](images/bagging.png)

So if the training data is $ D $ and the sub-samples are $ D_1, D_2,..., D_{t-1}, D_t $, then each sub-sample is drawn from $ D $ and is smaller than it:
$ |D_1|, |D_2|,..., |D_{t-1}|, |D_t| < |D| $
- In this way, a group of models is created, each trained on its corresponding sub-sample
- When we pass test data for prediction, every sub-model gives an independent output
    - for `classification`, the final output is determined by the majority vote over the predictions of the sub-models.
    - for `regression`, the final output is determined by taking the `mean/median` of the predictions of the sub-models.
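The aggregation step described above can be sketched in a few lines of plain Python. This is only an illustration: the function names and the sample predictions are made up for this note, and a real library would do this internally.

```python
from collections import Counter
from statistics import mean, median


def aggregate_classification(predictions):
    """Return the label predicted by the most sub-models (majority vote)."""
    # most_common(1) returns [(label, count)] for the most frequent label.
    # On a tie, the label that appeared first in the list wins.
    return Counter(predictions).most_common(1)[0][0]


def aggregate_regression(predictions, how="mean"):
    """Combine numeric predictions using the mean or the median."""
    return mean(predictions) if how == "mean" else median(predictions)


# Hypothetical outputs of five sub-models for a single test row.
class_votes = ["yes", "no", "yes", "yes", "no"]
reg_outputs = [10.0, 12.0, 11.0, 50.0, 12.0]

print(aggregate_classification(class_votes))        # yes
print(aggregate_regression(reg_outputs, "mean"))    # 19.0
print(aggregate_regression(reg_outputs, "median"))  # 12.0
```

Note how the median is less affected by the single outlying prediction (`50.0`) than the mean.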

![bagging2](images/bagging-2.webp)

**Jargon Alert**
- `Row sampling with replacement`: Drawing rows from the training data to build the sub-sample for each sub-model in bagging; the same row may be picked more than once.
- `Bootstrap`: The bootstrap method is a resampling technique used to estimate statistics on a population by sampling a dataset with replacement.
- `Aggregation`: The aggregate is the combined output of ensemble learning; "In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone."
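To make "sampling with replacement" concrete, here is a minimal sketch using only the standard library. The helper name `bootstrap_sample`, the seed and the dataset size are assumptions chosen for illustration.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bootstrap sample)."""
    return [rng.choice(rows) for _ in rows]


rng = random.Random(0)        # fixed seed so the run is repeatable
data = list(range(1000))      # stand-in for the row indices of a dataset
sample = bootstrap_sample(data, rng)

# Because rows can repeat, only part of the original rows shows up.
unique_fraction = len(set(sample)) / len(data)
print(len(sample))                 # same number of rows as the original
print(round(unique_fraction, 2))   # typically close to 0.63
```

The rows that never get picked are often called "out-of-bag" rows; with a sample as large as the original data, roughly a third of the rows end up out-of-bag.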

## Random Forest

Random forest is one of the most popular algorithms in `sklearn` because it often performs well on both classification and regression problems with its default settings.

From the above discussion, we get sub-models by bootstrapping sub-samples. In `Random Forest` these sub-models are `decision trees`.
- It uses both `RS & FS` (Row Sampling & Feature Sampling). So, if the dataset $D$ has $m$ columns and $n$ rows, and a sub-model is trained on data $d'$ with $m'$ columns and $n'$ rows, then
    - $d' < D$
    - $m' < m$
    - $n' < n$
- Some advantages:
    - It builds many trees, each on a subset of the data, and combines the outputs of all the trees. This reduces the overfitting problem of a single decision tree, lowers the `variance`, and therefore usually improves accuracy.
    - Some implementations can handle missing values automatically (support depends on the library and version)
    - No feature scaling (standardization or normalization) is required
    - Random Forest is very stable. Even if a new data point is introduced in the dataset, the overall algorithm is not affected much, since the new data may impact one tree but is very unlikely to impact all the trees.
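The row sampling and feature sampling (`RS & FS`) described above can be sketched as follows. This is a simplified illustration with a hypothetical helper and made-up toy data: it samples the columns once per sub-model, as in the description above, whereas scikit-learn's random forest draws a random subset of features at each split.

```python
import random


def sample_rows_and_features(rows, n_features, n_rows_sub, n_features_sub, rng):
    """Return a sub-sample: rows drawn with replacement, columns without."""
    # Feature sampling: choose m' distinct column indices out of m.
    cols = sorted(rng.sample(range(n_features), n_features_sub))
    # Row sampling with replacement: choose n' rows out of n.
    picked = [rng.choice(rows) for _ in range(n_rows_sub)]
    return [[row[c] for c in cols] for row in picked], cols


rng = random.Random(42)
# Toy dataset D with n = 6 rows and m = 4 columns (values are made up).
D = [[r * 10 + c for c in range(4)] for r in range(6)]

sub, cols = sample_rows_and_features(
    D, n_features=4, n_rows_sub=4, n_features_sub=2, rng=rng
)
print(cols)  # the m' = 2 sampled column indices
print(sub)   # n' = 4 rows restricted to those columns
```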
    
`sklearn` provides it as two estimators:

- [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
- [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)

**The differences:**
- criterion: `gini/entropy` (classifier), `squared_error/absolute_error` (regressor; named `mse/mae` in older `sklearn` versions)

**gini vs entropy**

- These are the functions to measure the quality of a split
- `gini` for Gini Impurity
- `entropy` for Information Gain

$ P_{+} = $ the probability of yes

$ P_{-} = $ the probability of no

`gini formula:` $ 1 - (P_{+})^2 - (P_{-})^2 $

`entropy formula:` $ -P_{+} \log_{2}(P_{+}) - P_{-} \log_{2}(P_{-}) $
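As a quick sketch, the two formulas above translate directly into Python for the binary (yes/no) case. The function names are chosen for this note, and the convention that $0 \cdot \log_2(0)$ counts as 0 is an assumption commonly made for entropy.

```python
from math import log2


def gini(p_pos):
    """Gini impurity for a binary node with P(yes) = p_pos."""
    p_neg = 1 - p_pos
    return 1 - p_pos ** 2 - p_neg ** 2


def entropy(p_pos):
    """Entropy in bits for a binary node; 0 * log2(0) is treated as 0."""
    total = 0.0
    for p in (p_pos, 1 - p_pos):
        if p > 0:
            total -= p * log2(p)
    return total


# Pure nodes (p = 0 or p = 1) give 0 for both measures;
# a 50/50 node gives the maximum: 0.5 for gini and 1.0 for entropy.
for p in (0.0, 0.5, 1.0):
    print(p, gini(p), entropy(p))
```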

#### Entropy

Entropy is a measure of disorder or uncertainty and the goal of machine learning models and Data Scientists in general is to reduce uncertainty.
- https://towardsdatascience.com/entropy-how-decision-trees-make-decisions-2946b9c18c8

#### Why is `gini impurity` often preferred over `entropy`?

- `gini impurity` takes less time to compute
- `entropy` involves a logarithm, which is more expensive to compute

### Decision Tree

##### How is the root node of a decision tree selected?

Ans: Based on information gain. The split with the highest information gain becomes the root node.

$IG = H(S) - \frac{|A|}{|S|}H(A) - \frac{|B|}{|S|}H(B)$

where,

$IG =$ Information Gain

$H(S), H(A), H(B) =$ impurity of the whole sample set, subset A and subset B respectively, where $A \cup B = S$

$|S|, |A|, |B| =$ number of samples in the whole set, subset A and subset B respectively
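The information gain formula above can be written as a small function. This is a minimal sketch for a binary target, assuming each node is described by a hypothetical `(positives, negatives)` count pair and that gini impurity is used as $H$.

```python
def gini_from_counts(pos, neg):
    """Gini impurity of a node holding `pos` positive and `neg` negative samples."""
    total = pos + neg
    if total == 0:
        return 0.0
    p = pos / total
    return 1 - p ** 2 - (1 - p) ** 2


def information_gain(parent, left, right, impurity=gini_from_counts):
    """IG = H(S) - |A|/|S| * H(A) - |B|/|S| * H(B), with nodes as (pos, neg)."""
    n_parent = sum(parent)
    n_left, n_right = sum(left), sum(right)
    return (
        impurity(*parent)
        - n_left / n_parent * impurity(*left)
        - n_right / n_parent * impurity(*right)
    )


# Toy node: 4 positives and 4 negatives, separated perfectly by the split.
print(information_gain((4, 4), (4, 0), (0, 4)))  # 0.5
# A split that leaves both children as mixed as the parent gains nothing.
print(information_gain((4, 4), (2, 2), (2, 2)))  # 0.0
```

Passing a different `impurity` function (for example one based on entropy) gives the same calculation with another impurity measure.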

So, if our dataset (titanic) has a total of `887` samples, in which `342` passengers survived and the other `545` passengers died, then with $p$ as the fraction of passengers who survived, the gini impurity will be:

$Gini = 2*p*(1-p)$

$Gini = 2*\frac{342}{887}*\frac{545}{887}$

$Gini = 0.4738$

##### gini formula:

$1 - p^2 - (1-p)^2$

$= (1+p)(1-p)-(1-p)^2$

$= (1-p)\left((1+p)-(1-p)\right)$

$= (1-p)(1+p-1+p)$

$= 2p(1-p)$
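As a sanity check, the long and the simplified form of the gini formula can be compared numerically. This small sketch uses function names made up for this note and also evaluates the simplified form for the titanic numbers used in this section.

```python
from math import isclose


def gini_long(p):
    """Gini impurity written as 1 - p^2 - (1-p)^2."""
    return 1 - p ** 2 - (1 - p) ** 2


def gini_short(p):
    """Gini impurity written in the simplified form 2p(1-p)."""
    return 2 * p * (1 - p)


# Both forms agree (up to floating point rounding) for p = 0.0, 0.1, ..., 1.0.
for i in range(11):
    p = i / 10
    assert isclose(gini_long(p), gini_short(p), abs_tol=1e-12)

# Whole titanic dataset: 342 survivors out of 887 passengers.
print(round(gini_short(342 / 887), 4))  # 0.4738
```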

Now if we split our data based on $Age \le 30$, we'll have:
- 525 passengers on the left side ($survived=197$, $died=328$)
- the other 362 passengers on the right side ($survived=145$, $died=217$)

1. Gini for left side:

$Gini = 2*\frac{197}{525}*\frac{328}{525}$

$Gini = 0.4689$

2. Gini for right side:

$Gini = 2*\frac{145}{362}*\frac{217}{362}$

$Gini = 0.4802$

So the information gain,

$IG = H(S) - \frac{|A|}{|S|}H(A) - \frac{|B|}{|S|}H(B)$

$IG = 0.4738 - \frac{525}{887}*0.4689 - \frac{362}{887}*0.4802$

$IG = 0.0003$

Now if we split our data based on $Sex$, we'll have:
- 314 passengers on the left side ($survived=233$, $died=81$)
- the other 573 passengers on the right side ($survived=109$, $died=464$)

1. Gini for left side:

$Gini = 2*\frac{233}{314}*\frac{81}{314}$

$Gini = 0.3828$

2. Gini for right side:

$Gini = 2*\frac{109}{573}*\frac{464}{573}$

$Gini = 0.3081$

So the information gain,

$IG = H(S) - \frac{|A|}{|S|}H(A) - \frac{|B|}{|S|}H(B)$

$IG = 0.4738 - \frac{314}{887}*0.3828 - \frac{573}{887}*0.3081$

$IG = 0.1393$

As the information gain for the 2nd split (`Sex`) is much higher than for the 1st split (`Age`), our root node will be `Sex`.
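The whole worked example can be reproduced with a short script. This is a sketch that uses the survivor and death counts quoted above; the helper names and the `splits` dictionary are assumptions made for illustration, and only these two candidate splits are compared.

```python
def gini(survived, died):
    """Gini impurity 2p(1-p) of a node, with p the fraction that survived."""
    p = survived / (survived + died)
    return 2 * p * (1 - p)


def information_gain(parent, left, right):
    """IG = H(S) - |A|/|S| * H(A) - |B|/|S| * H(B), nodes as (survived, died)."""
    n = sum(parent)
    return (
        gini(*parent)
        - sum(left) / n * gini(*left)
        - sum(right) / n * gini(*right)
    )


parent = (342, 545)  # whole dataset: 887 passengers
splits = {
    "Age <= 30": ((197, 328), (145, 217)),
    "Sex": ((233, 81), (109, 464)),
}

gains = {
    name: information_gain(parent, left, right)
    for name, (left, right) in splits.items()
}
for name, gain in gains.items():
    print(f"{name}: IG = {gain:.4f}")

# The candidate with the highest information gain becomes the root node.
root = max(gains, key=gains.get)
print("root:", root)  # root: Sex
```

A real decision tree repeats this comparison over every feature and every candidate threshold, and then recurses into each child node.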