### Ensembles

* Meaning → Group of things.
* In the context of machine learning, it is simply the combination of multiple models to build a more powerful model.
* Simple models → $[m_1, m_2, m_3, \dots, m_k]$ → also called base models. We combine these base models to construct a powerful model.

**4 types of ensembles**

* Bagging (Bootstrapped aggregation)
* Boosting
* Stacking
* Cascading

(Using these techniques, we can build a very high-performing model)

> The more different the base models are, the better we can combine them.

### Bagging

<p><span style="color:red">B</span>ootstrapped <span style="color:red">agg</span>regation → <span style="color:red">Bagging</span></p>

* RandomForest algorithm extensively uses bagging techniques.

* Draw `k` bootstrap samples of size `m` from the training data ($D_n$), sampling rows with replacement.

* Train `k` models, one per sample.
    - Each model $m_i$ is built on its own sample $D_n^i$.
    - Each model $m_i$ sees a different sample of data.

* Aggregate (combine) the models into a powerful model $M$. This is called the aggregation stage.

![ensemble-1](https://user-images.githubusercontent.com/63333753/125157308-e2647100-e187-11eb-99e0-a9bf6fa4f755.png)

> If we are doing a classification task, the aggregation would be the majority vote technique. <br> If we are doing a regression task, the aggregation would be the mean or median.

* For prediction, we send the query point to each model $m_i$ and collect the predictions. For classification, we apply majority voting over that set of predictions to get the final prediction.
    - For regression, we simply compute the mean or median of the predictions from each model to get the final prediction.
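The bagging steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the 1-nearest-neighbour base learner (`fit_1nn`), the toy data, and all function names are assumptions made for the example.

```python
import random
from collections import Counter
from statistics import mean

def bootstrap_sample(rows, m, rng):
    """Draw m rows with replacement (one bootstrap sample)."""
    return [rng.choice(rows) for _ in range(m)]

def fit_1nn(sample):
    """Hypothetical base learner: 1-nearest-neighbour on a 1-D feature."""
    def predict(x):
        nearest = min(sample, key=lambda row: abs(row[0] - x))
        return nearest[1]
    return predict

def bagging_fit(rows, k, m, seed=0):
    """Train k base models, each on its own bootstrap sample of size m."""
    rng = random.Random(seed)
    return [fit_1nn(bootstrap_sample(rows, m, rng)) for _ in range(k)]

def predict_classification(models, x):
    """Aggregation stage for classification: majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

def predict_regression(models, x):
    """Aggregation stage for regression: mean of the base predictions."""
    return mean(model(x) for model in models)

# Toy data (assumed): (feature, label) pairs in two well-separated groups.
rows = [(0.0, 0), (0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0),
        (8.0, 1), (8.5, 1), (9.0, 1), (9.5, 1), (10.0, 1)]
models = bagging_fit(rows, k=25, m=len(rows))
label_near_zero = predict_classification(models, 0.7)
label_near_ten = predict_classification(models, 9.2)
```

Each model sees a different bootstrap sample, so individual models can disagree, but the vote across all `k` of them is far more stable than any single one.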

**Note**

* Variance → signifies how much a model changes for a slight change in the training data. If the model changes a lot, it has high variance.

* Because each base model sees only a sample and the predictions are aggregated, the overall result doesn't change much even when part of the data changes. This is one advantage of bagging.

> Bagging can reduce variance in the model without impacting the bias.

$$\text{Model Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$

### Construction of RandomForest (RF)

* The name comes from decision trees: a bunch of decision trees (the base models), each trained on randomly sampled data, is called a RandomForest model. RF uses the bagging technique.

$$\text{RF} = \text{DT} + \text{Bagging} + \text{Column Sampling}$$

* Column sampling is done without replacement, whereas row sampling (bagging) is done with replacement. Thus, the models are going to be different since each one sees different features.

![ensemble-2](https://user-images.githubusercontent.com/63333753/125159903-d4b6e780-e197-11eb-8895-0619d3a48469.png)

All the models (DTs) are trained to full depth, so each one has high variance and low bias. The points left out of a tree's bootstrap sample are called `Out of Bag` (OOB) points, and they are used to validate the model without a separate hold-out set. This is built into RF.
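To make the sampling concrete, here is a small sketch of what one tree in the forest would see. The function name `rf_sample` and the sizes are assumptions chosen for illustration. Rows are drawn with replacement, columns without replacement, and the rows that were never drawn are that tree's OOB points.

```python
import random

def rf_sample(n_rows, n_cols, n_cols_per_tree, rng):
    """Pick the row and column indices one tree would be trained on."""
    # Row sampling (bagging): with replacement, so duplicates are expected.
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Column sampling: without replacement, so every column is distinct.
    columns = sorted(rng.sample(range(n_cols), n_cols_per_tree))
    # Rows that were never drawn are this tree's out-of-bag points.
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, columns, out_of_bag

rng = random.Random(42)
in_bag, columns, out_of_bag = rf_sample(
    n_rows=1000, n_cols=10, n_cols_per_tree=3, rng=rng
)
oob_fraction = len(out_of_bag) / 1000
print(f"columns={columns}, OOB fraction={oob_fraction:.3f}")
```

When a sample of size $n$ is drawn with replacement from $n$ rows, the expected OOB fraction is $(1 - 1/n)^n \approx e^{-1} \approx 0.37$, so roughly a third of the data is available to validate each tree.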

### Bias-Variance Tradeoff

* RF → low bias because the base models start off with low bias.
    - $\text{Bias}(M) \equiv \text{Bias}(m_i)$

$$M = \text{Agg}(m_1, m_2, m_3, \dots, m_k)$$

* As $k$ ↑; Variance ↓
* As $k$ ↓; Variance ↑
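Why the variance falls as $k$ grows can be seen from a standard identity. As a sketch, assume (this is not stated in the notes) that the base predictions $m_i(x)$ all have variance $\sigma^2$ and pairwise correlation $\rho$, and that the aggregation is a simple average:

$$\text{Var}\bigg(\frac{1}{k}\sum_{i=1}^k m_i(x)\bigg) = \rho \sigma^2 + \frac{1 - \rho}{k} \sigma^2$$

The second term shrinks as $k$ increases. The first term does not depend on $k$, which is why making the base models more different from each other (row sampling plus column sampling) also helps.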

### Bagging - Training & Runtime Complexity

* RF with `k` base-learners (base-models) $\implies$ (DTs)
    - **Training** → $O(n\log(n) * k)$
        - Since each tree is trained on its own sample, the trees can be trained in parallel (this is known as Trivially Parallelizable).
    - **Runtime** → $O(\text{depth} * k)$
    - **Space** → $O(\text{DT} * k)$

### RandomForest Cases

* DT doesn't work well when the data has categorical features with a large number of categories.
* Wherever DT fails, RF also fails.
* In the case of DT, bias depends on the depth of the tree, which is not the case for RF. In RF, the bias depends on the `k` models, and in each model the depth is reasonable.
* In DT, feature importance is the weighted sum of the reductions in entropy (i.e., the information gain, IG) produced by a feature at the various levels of the tree.
    - If a feature reduces the entropy (yields IG) more often and at more nodes, then that feature is considered to be important.
    - The same concept is used for RF in determining the feature importance but, of course, across `k` models.
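The importance calculation described above can be sketched as follows. The tree here is hypothetical (two hand-written splits). Weighting each split by the fraction of points reaching the node and normalising the totals to sum to 1 are both assumptions about the exact weighting scheme.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

def feature_importance(splits, n_total):
    """splits: list of (feature, parent_labels, left_labels, right_labels)."""
    importance = Counter()
    for feature, parent, left, right in splits:
        # Weight each split by the fraction of training points reaching the node.
        weight = len(parent) / n_total
        importance[feature] += weight * information_gain(parent, left, right)
    total = sum(importance.values())
    return {feature: value / total for feature, value in importance.items()}

# Hypothetical tree: the root splits on "f1", then its left child splits on "f2".
splits = [
    ("f1", [0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 0, 1], [0, 1, 1, 1]),
    ("f2", [0, 0, 0, 1], [0, 0, 0], [1]),
]
importance = feature_importance(splits, n_total=8)
```

For a forest, the same per-tree scores would be averaged over the `k` trees.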

### Boosting

$$\text{Boosting} = \text{Base learners (high bias and low variance)} + \text{Additive Combining}$$

* There is no row sampling, column sampling, or aggregation. Instead, we have additive combination.

* The ultimate purpose is to reduce the bias while keeping the variance low.

* (High bias indicates a tree which is very shallow or not deep).

**Process Steps**

1. **0th stage**
    - Train the model $m_0$ on the whole training data $\{x_i, y_i\}$.
    - $y = h_0(X) \rightarrow$ base model
    - Compute simple difference error $\implies e_i^{\text{stage 0}} = y_i - h_0(x_i)$
    - For each point in the training data we have $\{x_i, e_i^{\text{stage 0}}, y_i\}_{i=1}^n$

2. **1st stage**
    - Train the model $m_1$ on the whole training data. Instead of predicting $y_i$, predict the $e_i^{\text{stage 0}}$.
    - $e_i^{\text{stage 0}} = y_i - h_0(x_i)$
    - $h_1(X) \rightarrow$ base model
    - Main model at the end of the first stage is $F_1(X) = \alpha_0 h_0(X) + \alpha_1 h_1(X)$. We shall find the values of $\alpha_0$ and $\alpha_1$.
    - $e_i^{\text{stage 1}} = y_i - F_1(x_i)$

3. **2nd stage**
    - Train the model $m_2$ on the whole training data. Instead of predicting $y_i$, predict the $e_i^{\text{stage 1}}$.
    - $e_i^{\text{stage 1}} = y_i - F_1(x_i)$
    - $h_2(X) \rightarrow$ base model
    - Main model at the end of the second stage is $F_2(X) = \alpha_0 h_0(X) + \alpha_1 h_1(X) + \alpha_2 h_2(X)$. We shall find the values of $\alpha_0, \alpha_1$, and $\alpha_2$.
    - $e_i^{\text{stage 2}} = y_i - F_2(x_i)$

4. **...**
    - Main model at the end of `kth` stage is $F_k(X) = \sum_{i=0}^k \alpha_i h_i(X) \implies$ additive combination.
    - $e_i^{\text{stage k - 1}} = y_i - F_{k - 1}(x_i)$

Thus, $F_k(X)$ ends up having a low residual error which simply means low bias.
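The stage-by-stage process can be sketched with a very shallow base learner. This is an illustration under stated assumptions: the base learner is a one-split stump on a single feature (`fit_stump` is a hypothetical helper), and every $\alpha_i$ is fixed to 1 rather than solved for.

```python
def fit_stump(xs, targets):
    """Shallow (high-bias) base learner: one threshold, one mean per side."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [t for x, t in zip(xs, targets) if x < threshold]
        right = [t for x, t in zip(xs, targets) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean

def boost(xs, ys, n_stages):
    """Fit h_0 to y, then each later h_j to the residuals of the running sum."""
    learners, residuals = [], list(ys)
    for _ in range(n_stages):
        h = fit_stump(xs, residuals)
        learners.append(h)
        # e_i = y_i - F(x_i); with alpha = 1 this is the old residual minus h(x_i).
        residuals = [r - h(x) for x, r in zip(xs, residuals)]
    # Additive combination of all base learners (all alphas assumed to be 1).
    return lambda x: sum(h(x) for h in learners)

def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data (assumed): y = x^2, which a single stump cannot fit well.
xs = [float(i) for i in range(10)]
ys = [x ** 2 for x in xs]
mse_1 = mse(boost(xs, ys, 1), xs, ys)
mse_20 = mse(boost(xs, ys, 20), xs, ys)
print(f"training MSE after 1 stage: {mse_1:.2f}, after 20 stages: {mse_20:.4f}")
```

Each stump on its own is a poor fit, but the training error of the additive model keeps shrinking as stages are added.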

### Residuals, Loss Functions, and Gradients

$$F_k(x) = \sum_{i=0}^k \alpha_i h_i(x)$$

$$\text{Residuals at the end of the stage k} \implies e_i = y_i - F_k(x_i)$$

* In linear regression we mainly try to minimize the squared loss. If we think from the perspective of linear regression, then -

$$L[y_i, F_k(x_i)] = [y_i - F_k(x_i)]^2 \rightarrow (1)$$

* If we differentiate $(1)$ w.r.t $F_k(x_i)$, we get

$$\frac{\partial L}{\partial F_k(x_i)} = -2 [y_i - F_k(x_i)]$$

$$-\frac{\partial L}{\partial F_k(x_i)} = 2 [y_i - F_k(x_i)]$$

$$\text{Negative Derivative} = 2 (\text{Residual})$$

* Negative gradient or derivative can be thought of as a pseudo residual or proxy residual.

* Now, instead of taking the residual at every stage, we can directly use the proxy residual value.
    - Because of this, we can use any loss function for boosting as long as it is differentiable.
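A quick numerical check of the relation above: the snippet estimates $-\partial L / \partial F$ by central finite differences and compares it with the residual. The log-cosh loss is included only as an assumed example of another differentiable loss; it does not appear in these notes.

```python
import math

def squared_loss(y, f):
    return (y - f) ** 2

def log_cosh_loss(y, f):
    # Assumed example of another smooth loss; its negative gradient is tanh(y - f).
    return math.log(math.cosh(y - f))

def negative_gradient(loss, y, f, eps=1e-6):
    """Central finite-difference estimate of -dL/dF at F = f."""
    return -(loss(y, f + eps) - loss(y, f - eps)) / (2 * eps)

y_true, prediction = 3.0, 2.5
residual = y_true - prediction

pseudo_residual_sq = negative_gradient(squared_loss, y_true, prediction)
pseudo_residual_lc = negative_gradient(log_cosh_loss, y_true, prediction)
print(residual, pseudo_residual_sq, pseudo_residual_lc)
```

For squared loss the pseudo-residual is twice the residual. For another loss it is a different function of the residual, but it still points in the direction that reduces the loss.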

### Gradient Boosting

* Wiki article → https://en.wikipedia.org/wiki/Gradient_boosting

<!-- ![gradient_boosting_algo](https://user-images.githubusercontent.com/63333753/125189670-d98ea080-e256-11eb-9d72-f311df31af48.PNG)

* Helpful article → https://explained.ai/gradient-boosting/index.html

**Credits** - Image from Wiki -->

Input: training set $\{(x_i, y_i)\}_{i=1}^n$, a differentiable loss function $L\big(y, F(x)\big)$, number of iterations $M$.

Algorithm:

1. Initialize model with a constant value:

$$F_0(x) = \text{argmin}_{\gamma} \sum_{i=1}^n L(y_i, \gamma).$$

2. For $m = 1$ to $M$:
    - Compute so-called pseudo-residuals:
$$r_{im} = - \bigg[\frac{\partial L \big(y_i, F(x_i)\big)}{\partial F(x_i)}\bigg]_{F(x)=F_{m-1}(x)} \ \text{for} \ i= 1, \dots, n$$

    - Fit a base learner (or weak learner, e.g., a tree) $h_m(x)$ to the pseudo-residuals, i.e., train it using the training set $\{(x_i, r_{im})\}_{i=1}^n$.
    
    - Compute the multiplier $\gamma_m$ by solving the following one-dimensional optimization problem (see the Wiki article):
$$\gamma_m = \text{argmin}_{\gamma} \sum_{i=1}^n L\big(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)\big).$$

    - Update the model:
$$F_m(x) = F_{m-1}(x) + \gamma_mh_m(x).$$

3. Output $F_M(x)$.
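The algorithm can be sketched end to end for a loss other than squared loss. The choices below are assumptions made for illustration: absolute loss (whose pseudo-residual is the sign of the residual), a one-split stump as the weak learner, and a fixed grid of candidate values in place of an exact one-dimensional solver for $\gamma_m$.

```python
from statistics import median

def fit_stump(xs, targets):
    """Hypothetical weak learner: one threshold, one mean per side."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [t for x, t in zip(xs, targets) if x < threshold]
        right = [t for x, t in zip(xs, targets) if x >= threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    _, threshold, lm, rm = best
    return lambda x: lm if x < threshold else rm

def pseudo_residual(y, f):
    """Negative gradient of |y - f| with respect to f: the sign of (y - f)."""
    return (y > f) - (y < f)

def total_loss(ys, preds):
    return sum(abs(y - p) for y, p in zip(ys, preds))

def gradient_boost(xs, ys, n_iterations, gammas):
    # Step 1: the constant that minimises absolute loss is the median.
    f0 = median(ys)
    preds = [f0] * len(xs)
    stages = []
    for _ in range(n_iterations):
        # Step 2a: pseudo-residuals evaluated at the current model F_{m-1}.
        r = [pseudo_residual(y, p) for y, p in zip(ys, preds)]
        # Step 2b: fit the base learner to (x_i, r_im).
        h = fit_stump(xs, r)
        # Step 2c: grid search stands in for the one-dimensional optimisation.
        gamma = min(gammas, key=lambda g: total_loss(
            ys, [p + g * h(x) for x, p in zip(xs, preds)]))
        # Step 2d: update the model.
        preds = [p + gamma * h(x) for x, p in zip(xs, preds)]
        stages.append((gamma, h))
    # Step 3: the final additive model F_M.
    return lambda x: f0 + sum(g * h(x) for g, h in stages)

xs = [float(i) for i in range(10)]
ys = list(xs)
gammas = [0.25 * i for i in range(21)]  # candidate multipliers, 0.0 to 5.0
model = gradient_boost(xs, ys, n_iterations=10, gammas=gammas)
initial_loss = total_loss(ys, [median(ys)] * len(xs))
final_loss = total_loss(ys, [model(x) for x in xs])
```

Because the grid includes $\gamma = 0$, an iteration can never increase the training loss in this sketch.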

### Shrinkage

* The final model is → $F_M(x) = F_0(x) + \sum_{m=1}^M \gamma_m h_m(x)$

* $M$ is the total number of base models. It is a hyperparameter, and the best value should be found by cross-validation.

* Shrinkage adds one extra parameter, which is basically a learning rate ($v$). If we apply shrinkage to the update at stage $m$, we have -

$$F_m(x) = F_{m-1}(x) + v * \gamma_m h_m(x), \quad 0 < v \leq 1$$

* $M$ and $v$ are hyperparameters.
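One way to choose $M$ and $v$ together is a small cross-validated search. The sketch below uses scikit-learn's `GradientBoostingRegressor`, where `n_estimators` plays the role of $M$ and `learning_rate` the role of $v$. The synthetic dataset and the candidate values are assumptions picked only to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data (assumed, for illustration only).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

scores = {}
for n_estimators in (50, 200):        # candidate values of M
    for learning_rate in (0.05, 0.5):  # candidate values of v
        model = GradientBoostingRegressor(
            n_estimators=n_estimators,
            learning_rate=learning_rate,
            random_state=0,
        )
        # Mean R^2 over 3 folds for this (M, v) pair.
        scores[(n_estimators, learning_rate)] = cross_val_score(
            model, X, y, cv=3
        ).mean()

best_m, best_v = max(scores, key=scores.get)
print(f"best M={best_m}, best v={best_v}")
```

A smaller $v$ usually needs a larger $M$ to reach the same fit, which is why the two are tuned jointly.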

### Boosting - Training & Runtime Complexity

* **Training**
    - $O[n \log(n) * M]$
    - $M \rightarrow$ total number of base learners
    - GBDT is not trivially parallelizable
    - Takes more time to train than RandomForest algo.
* **Runtime**
    - $O(\text{depth} * M)$
    - Depth is smaller in GBDT than RandomForest algo.
* **Space**
    - $O(\text{DT} * M + \gamma_m)$, i.e., the $M$ trees plus their multipliers $\gamma_m$

GBDT algos are extensively used at internet companies.

### AdaBoost Algorithm

* It is a popular boosting algo very similar to GBDT.

* It is typically used in image processing and computer vision applications.
    - Especially in the case of face detection applications.

### Stacking Models

Helpful link → http://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/

* The base learners are first trained on the whole training data.

* We then generate $X'$, which is the output (predictions) obtained from the base learners.

* In the last step, we train the ensemble (meta) model on $X'$ and $y$ to get the final predictions.

![ensemble-3](https://user-images.githubusercontent.com/63333753/125252910-83813200-e316-11eb-9c98-314d4fe0b180.png)

**Stacking**

* Get the predictions of a number of models on a hold-out set, then train a different (meta) model on those predictions.

* Helpful video → https://www.youtube.com/watch?v=enEerl0feRo&ab_channel=HasanShaukat

In [1]:
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB 
from sklearn.ensemble import RandomForestClassifier

import numpy as np
import pandas as pd

**Code Stacking Classifier - Scratch**

In [2]:
data = datasets.load_iris()

In [3]:
X, y = data.data, data.target

In [4]:
data = {'col{}'.format(i) : X[:, i] for i in range(len(X[0]))}
data['y'] = y
df = pd.DataFrame(data)

In [5]:
df.head()

| | col0 | col1 | col2 | col3 | y |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [6]:
class StackingClassifier():
    def __init__(self, df, label, base_models, meta_model):
        train_df, test_df = self.splitter(dframe=df)
        train_df, cv_df = self.splitter(dframe=train_df)
        
        self.X_train, self.y_train = self.split_features_targets(df=train_df, label=label)
        self.X_train = self.X_train.values
        self.y_train = self.y_train.values
        
        self.X_cv, self.y_cv = self.split_features_targets(df=cv_df, label=label)
        self.X_cv = self.X_cv.values
        self.y_cv = self.y_cv.values
        
        self.X_test, self.y_test = self.split_features_targets(df=test_df, label=label)
        self.X_test = self.X_test.values
        self.y_test = self.y_test.values
        
        self.base_models = base_models
        self.meta_model = meta_model
        self.cv_preds, self.test_preds = self.fit(base_models=self.base_models)
    
    def split_features_targets(self, df, label):
        X = df.drop(columns=[label])
        y = df[label]
        return X, y
    
    def splitter(self, dframe, percentage=0.8, random_state=True):
        if random_state:
            dframe = dframe.sample(frac=1)

        thresh = round(len(dframe) * percentage)
        train_df = dframe.iloc[:thresh]
        test_df = dframe.iloc[thresh:]

        return train_df, test_df
    
    def fit(self, base_models):
        cv_preds = {}
        test_preds = {}
        
        for i in range(len(base_models)):
            clf = base_models[i]
            clf.fit(self.X_train, self.y_train)
            cv_preds['cv_preds{}'.format(i)] = clf.predict(self.X_cv)
            test_preds['test_preds{}'.format(i)] = clf.predict(self.X_test)
        
        return cv_preds, test_preds
    
    def predict(self):
        stacked_cv_preds = np.column_stack(tup=list(self.cv_preds.values()))
        stacked_test_preds = np.column_stack(tup=list(self.test_preds.values()))
        
        # Use the meta model passed to the constructor, not the global variable.
        self.meta_model.fit(stacked_cv_preds, self.y_cv)
        preds = self.meta_model.predict(stacked_test_preds)
        
        return preds

In [7]:
clf1 = KNeighborsClassifier(n_neighbors=3)
clf2 = RandomForestClassifier()
clf3 = GaussianNB()

base_models = [clf1, clf2, clf3]
meta_model = LogisticRegression()

In [8]:
sm = StackingClassifier(
    df=df, 
    label='y', 
    base_models=base_models, 
    meta_model=meta_model
)

In [9]:
preds = sm.predict()

In [10]:
preds

array([2, 0, 0, 0, 1, 2, 1, 1, 0, 0, 0, 0, 2, 0, 1, 0, 1, 1, 0, 1, 2, 0,
       2, 2, 1, 0, 0, 2, 1, 0])
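For comparison with the from-scratch class above, scikit-learn ships its own `StackingClassifier`. The sketch below uses the same three base models and the same meta model. It is imported under the alias `SklearnStackingClassifier` (an assumed name) so that it does not shadow the class defined earlier. Note that the library version builds the meta features from cross-validated predictions (`cv=5`) instead of a single hold-out split.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import StackingClassifier as SklearnStackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

stack = SklearnStackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=3)),
        ("rf", RandomForestClassifier(random_state=0)),
        ("nb", GaussianNB()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # the meta model
    cv=5,  # meta features come from out-of-fold predictions
)
stack.fit(X_train, y_train)
library_preds = stack.predict(X_test)
accuracy = stack.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```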

### Cascading Classifiers

* They are typically used when the cost of making a mistake is high.
    - Credit card transactions
    - Medical domain

**Process**

* In the first stage, train a model on the whole training data.
    - Approve those points which are perfectly predicted.
    - Remove the approved points from the original training data.

* In the second stage, train a new model on the modified (remaining) training data.
    - Approve those points which are perfectly predicted.
    - Remove the approved points from the modified training data.

![ensemble-4](https://user-images.githubusercontent.com/63333753/125274793-ecbf7000-e32b-11eb-8da1-c286e04b8111.png)

* Continue this process.
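The cascade can be sketched as follows. Several things here are assumptions made for illustration: "perfectly predicted" is read as "predicted correctly with confidence at or above a threshold", the base model is a nearest-class-mean classifier on one feature (`fit_centroids` is hypothetical), and the toy data stands in for transaction amounts.

```python
def fit_centroids(rows):
    """Hypothetical base model: nearest class mean on a 1-D feature.

    Returns a function x -> (label, confidence in [0.5, 1.0]).
    """
    by_label = {}
    for x, label in rows:
        by_label.setdefault(label, []).append(x)
    centroids = {label: sum(v) / len(v) for label, v in by_label.items()}

    def predict(x):
        dists = sorted((abs(x - c), label) for label, c in centroids.items())
        if len(dists) == 1:
            return dists[0][1], 1.0
        near, far = dists[0][0], dists[1][0]
        confidence = far / (near + far) if near + far else 0.5
        return dists[0][1], confidence

    return predict

def train_cascade(rows, n_stages, threshold):
    """Train stage by stage, removing the points each stage approves."""
    stages, remaining = [], list(rows)
    for _ in range(n_stages):
        if not remaining:
            break
        model = fit_centroids(remaining)
        stages.append(model)
        # Keep only the points this stage could NOT approve.
        remaining = [
            (x, y) for x, y in remaining
            if not (model(x)[0] == y and model(x)[1] >= threshold)
        ]
    return stages, remaining

def cascade_predict(stages, x, threshold):
    """Pass x down the stages until one is confident enough to decide."""
    for index, model in enumerate(stages):
        label, confidence = model(x)
        if confidence >= threshold:
            return label, index
    return None, None  # no stage was confident enough

rows = [(0.0, "legit"), (1.0, "legit"), (2.0, "legit"), (4.5, "legit"),
        (5.5, "fraud"), (8.0, "fraud"), (9.0, "fraud"), (10.0, "fraud")]
stages, leftover = train_cascade(rows, n_stages=3, threshold=0.8)
easy_case = cascade_predict(stages, 1.0, threshold=0.8)
hard_case = cascade_predict(stages, 4.6, threshold=0.8)
undecided = cascade_predict(stages, 5.0, threshold=0.8)
```

Easy points are decided by the first stage, harder ones fall through to later stages, and a point no stage is sure about is returned as undecided instead of being guessed.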