# Ensemble Learning

**Ensemble Learning** is a general meta-approach to machine learning that seeks better predictive performance by combining the predictions from multiple models. <br><br>

An ensemble model combines multiple other models in the prediction process. These individual, often weak, models are referred to as **base estimators**.<br>
Ensembling is a way to overcome the following technical challenges of building a single estimator:

- High variance: The model is very sensitive to small changes in the training inputs, so the learned features do not generalize well
- Low accuracy: One model or one algorithm fitted to the entire training data might not be good enough to meet expectations
- Feature noise and bias: The model relies heavily on one or a few features while making a prediction

### Ensemble Algorithm

Machine Learning algorithms have their limitations and producing a model with high accuracy and generalization is challenging. <br>
A single algorithm may not make the perfect prediction for a given dataset, but if we build and **combine the learning of multiple models**, the overall accuracy and generalization can improve. <br>
The combination is implemented by aggregating the output of each model with two objectives: reducing the model error and maintaining its generalization.<br>
These implementation techniques are often referred to as <u>meta-algorithms</u>. <br>

![ensembleLearning.png](images/ensembleLearning.png)

## Aggregation Techniques

When we combine multiple models in the prediction process, we need a method to aggregate their outputs. Three main techniques can be used:

### Max Voting
The max voting method is generally used for classification problems. In this technique, multiple models are used to make predictions for each data point. The predictions by each model are considered as a ‘vote’. The predictions which we get from the majority of the models are used as the final prediction. <br>
For example:

<table>
    <thead>
        <tr>
            <th> </th>
            <th scope="col">Judge 1 </th>
            <th scope="col">Judge 2 </th>
            <th scope="col">Judge 3 </th>
            <th scope="col">Judge 4 </th>
            <th scope="col">Judge 5 </th>
            <th scope="col">Final score </th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <th scope="row">score </th>
            <td>5 </td>
            <td>4 </td>
            <td>4 </td>
            <td>4 </td>
            <td>5 </td>
            <th scope="row">4 </th>
        </tr>
    </tbody>   
</table>
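
As a minimal sketch of max voting, the judge scores from the table above can be aggregated by picking the most frequent value. The helper name `max_vote` is hypothetical and only used for illustration; ties are resolved in favour of the value seen first.

```python
from collections import Counter


def max_vote(predictions):
    """Return the most frequent prediction (ties go to the value seen first)."""
    return Counter(predictions).most_common(1)[0][0]


# Scores given by the five judges in the table above
judge_scores = [5, 4, 4, 4, 5]
final_score = max_vote(judge_scores)
print(final_score)  # 4 appears three times, 5 only twice
```

The same function works for class labels, for example `max_vote(['cat', 'dog', 'cat'])`.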

### Averaging
Averaging is typically used for regression problems, where the base estimators' predictions are averaged to make the final prediction.


<table>
    <thead>
        <tr>
            <th> </th>
            <th scope="col">Judge 1 </th>
            <th scope="col">Judge 2 </th>
            <th scope="col">Judge 3 </th>
            <th scope="col">Judge 4 </th>
            <th scope="col">Judge 5 </th>
            <th scope="col">Final score </th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <th scope="row">score </th>
            <td>5 </td>
            <td>4.2 </td>
            <td>4.6 </td>
            <td>4 </td>
            <td>5 </td>
            <th scope="row">4.56 </th>
        </tr>
    </tbody>   
</table>

### Weighted Average
This is an extension of the averaging method. Each model is assigned a weight that defines its importance for the prediction, and the final result is the sum of each prediction multiplied by its weight (with the weights summing to 1). <br>
For example, to ensemble scores given by professional judges and sports bloggers, we can assign the two groups different weights:

<table>
    <thead>
        <tr>
            <th> </th>
            <th scope="col">Judge 1 </th>
            <th scope="col">Judge 2 </th>
            <th scope="col">Judge 3 </th>
            <th scope="col">Judge 4 </th>
            <th scope="col">Judge 5 </th>
            <th scope="col">Final score </th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <th scope="row">weight </th>
            <td>0.23 </td>
            <td>0.23 </td>
            <td>0.18 </td>
            <td>0.18 </td>
            <td>0.18 </td>
            <td></td>
        </tr>
        <tr>
            <th scope="row">score </th>
            <td>5 </td>
            <td>4.2 </td>
            <td>4.6 </td>
            <td>4 </td>
            <td>5 </td>
            <th scope="row">4.564 </th>
        </tr>
    </tbody>   
</table>
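
The weighted average in the table above can be reproduced with a few lines of plain Python. This is only a sketch: the helper name `weighted_average` is hypothetical, and the weights are normalised inside the function so they do not have to sum to exactly 1.

```python
def weighted_average(scores, weights):
    """Weighted mean of scores; weights are normalised by their sum."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total_weight


# Scores and weights from the table above
scores = [5, 4.2, 4.6, 4, 5]
weights = [0.23, 0.23, 0.18, 0.18, 0.18]

final_score = weighted_average(scores, weights)
print(round(final_score, 3))  # 4.564
```

With equal weights the same function reduces to plain averaging.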

## Ensemble Techniques

### Bagging

The idea behind Bagging is to combine the results of multiple models (for instance, Decision Trees) trained on varying sub-datasets of the same dataset to get a generalized result. <br>
**Bagging** is a combination of **Bootstrapping** and **Aggregation** <br>
**Bootstrapping** --> **Random sampling with replacement**<br>
**Aggregation** --> **Combining results of multiple models**<br><br>

In other words, Bagging trains a number of individual models in parallel, where each model is trained on a random subset of the data drawn with replacement (each subset is of the same size), and combines the outputs of all these (weak) models via **voting for classification** or **averaging for regression**.

![bagging.png](images/bagging.png)

- Multiple subsets are created from the original dataset, selecting observations with replacement
- A base model (weak model) is created on each of these subsets
- The models run in parallel and are independent of each other
- The final predictions are determined by combining the predictions from all the models
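
The steps above can be sketched by hand with NumPy and scikit-learn decision trees. This is an illustrative sketch, not how scikit-learn implements bagging internally: the synthetic dataset, the number of models and the tree depth are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data (an assumption, for illustration only)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_models = 25  # odd number, so a majority vote can never tie
models = []
for _ in range(n_models):
    # Bootstrapping: draw row indices with replacement, same size as the training set
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    models.append(tree.fit(X_train[idx], y_train[idx]))

# Aggregation: majority vote over the 0/1 predictions of all models
all_preds = np.array([m.predict(X_test) for m in models])  # shape (n_models, n_test)
y_pred = (all_preds.mean(axis=0) >= 0.5).astype(int)

bagging_accuracy = (y_pred == y_test).mean()
print('Hand-rolled bagging accuracy :', bagging_accuracy)
```

Each tree is independent of the others, so the loop could be run in parallel; for regression, the vote would be replaced by the mean of the predictions.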

### Boosting

**Boosting** trains a number of individual models in a **sequential** way. Each individual model learns from the mistakes made by the previous model, so each succeeding model depends on the ones before it.

![boosting.png](images/boosting.png)

- A subset is created from the original dataset. Initially, all data points are given equal weights
- A base model is created on this subset. This model is used to make predictions on the whole dataset
- Errors are calculated using the actual values and predicted values
- The observations which are incorrectly predicted are given higher weights
- Another model is created and predictions are made on the dataset. This model tries to correct the errors from the previous model
- Similarly, multiple models are created, each correcting the errors of the previous model
- The final model (strong learner) is the weighted mean of all the models (weak learners)

### Stacking

Stacking is similar to boosting in that both produce more robust predictors. Stacking is the process of learning how to build such a stronger model from the predictions of all the weak learners.<br>
In stacking, an algorithm takes the outputs of sub-models as input and attempts to learn how best to combine these input predictions into a better output prediction. This works as the following series: <br>
**Original data** --> **Base models** --> **Level 0 predictions**, which act as features for a new model --> **Meta model** --> **Level 1 prediction**, and so on

![stacking.png](images/stacking.png)
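
A minimal sketch of this pipeline using scikit-learn's `StackingClassifier` is shown below. The choice of base models (a random forest and k-nearest neighbours), the logistic-regression meta model and the synthetic data are assumptions made for the example, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (an assumption, for illustration only)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base models produce the level 0 predictions
base_models = [
    ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
    ('knn', KNeighborsClassifier()),
]

# The meta model learns how to combine the level 0 predictions;
# cv=5 means those predictions are generated out-of-fold
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(),
                           cv=5)
stack.fit(X_train, y_train)

stacking_accuracy = stack.score(X_test, y_test)
print('Stacking accuracy :', stacking_accuracy)
```

Generating the level 0 predictions out-of-fold keeps the meta model from being trained on predictions the base models made for rows they had already seen.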

### Blending

Blending follows the same approach as stacking but uses only a holdout (validation) set from the train set to make predictions. In other words, unlike stacking, the predictions are made on the holdout set only. The holdout set and the predictions are used to build a model which is run on the test set.
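
Blending can be sketched by hand as follows. The base models, the 70/30 holdout split and the synthetic data are assumptions for illustration; as a simplification, the meta model here is trained on the base models' holdout predictions only, without the original holdout features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption, for illustration only)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, random_state=0)

# Carve a holdout (validation) set out of the training set
X_train, X_hold, y_train, y_hold = train_test_split(
    X_train_full, y_train_full, test_size=0.3, random_state=0)

# Base models are fitted on the reduced training set only
base_models = [DecisionTreeClassifier(max_depth=3, random_state=0),
               KNeighborsClassifier()]
for base_model in base_models:
    base_model.fit(X_train, y_train)


def base_predictions(X_part):
    """Stack each base model's class-1 probability into one column per model."""
    return np.column_stack([m.predict_proba(X_part)[:, 1] for m in base_models])


# The meta model is trained on holdout predictions, then run on the test set
meta_model = LogisticRegression().fit(base_predictions(X_hold), y_hold)
blending_accuracy = meta_model.score(base_predictions(X_test), y_test)
print('Blending accuracy :', blending_accuracy)
```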

## Algorithms based on Bagging and Boosting 

Bagging and Boosting are two of the most commonly used techniques in Machine Learning. These techniques inspired the following widely used algorithms:

1. Bagging algorithms:
   - Bagging meta-estimator
   - Random forest
2. Boosting algorithms:
   - AdaBoost
   - GBM
   - XGBoost
   - LightGBM
   - CatBoost

The following code snippets show how to train ensemble models using scikit-learn. <br><br>
**NOTE:** The snippets use `X_train`, `y_train`, `X_test` and `y_test`, which are the training and test sets respectively, i.e. they skip over importing libraries and data, Feature Engineering, Exploratory Data Analysis, Data Analysis and Visualization, and jump directly to model building and evaluation.
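
To try the snippets without a real dataset, one possible setup is to generate synthetic data. The sketch below is an assumption for experimentation only (sample count, feature count and split ratio are arbitrary); for the regression snippets, `make_regression` from the same module can be used in place of `make_classification`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic classification data standing in for a real, preprocessed dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hold out 20% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)  # (800, 20) (200, 20)
```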

## 1. Bagging Algorithms using scikit-learn

### 1.1 Bagging meta-estimator
Bagging meta-estimator is an ensembling algorithm that can be used for both classification (BaggingClassifier) and regression (BaggingRegressor) problems. It follows the typical bagging technique to make predictions. <br>
Following are the steps for the bagging meta-estimator algorithm:
1. Random subsets are created from the original dataset (Bootstrapping)
2. The subset of the dataset includes all features
3. A user-specified base estimator is fitted on each of these smaller sets
4. Predictions from each model are combined to get the final result
<br>

Parameters for this algorithm in scikit-learn:
- `base_estimator`: It defines the base estimator to fit on random subsets of the dataset. The default base estimator is a Decision Tree. This parameter was renamed to `estimator` in scikit-learn 1.2
- `n_estimators`: It is the number of base estimators to be created
- `max_samples`: It is the number (or fraction) of samples drawn to train each base estimator
- `max_features`: It is the number (or fraction) of features drawn to train each base estimator
- `n_jobs`: The number of jobs to run in parallel. Set `n_jobs=-1` to use all the cores in your system
- `random_state`: It seeds the random resampling. Fixing it makes results reproducible, which is useful when you want to compare different models

In [None]:
# for Classification

from sklearn.ensemble import BaggingClassifier
from sklearn import tree
model = BaggingClassifier(tree.DecisionTreeClassifier(random_state=1))
model.fit(X_train, y_train)
score = model.score(X_test,y_test)
print('Bagging meta-estimator accuracy :', score)

In [None]:
# for Regression

from sklearn.ensemble import BaggingRegressor
from sklearn import tree
model = BaggingRegressor(tree.DecisionTreeRegressor(random_state=1))
model.fit(X_train, y_train)
# for regressors, score() returns the R^2 coefficient of determination, not accuracy
score = model.score(X_test,y_test)
print('Bagging meta-estimator R^2 score :', score)

### 1.2 Random Forest
Random Forest uses random sampling with replacement for Bagging and a Decision Tree as the base estimator. At each node of a Decision Tree, Random Forest selects a random set of features from which the best split is decided. It does so by measuring and comparing the quality of the split produced by each candidate feature, using a criterion such as the Gini index or entropy/log loss (Information Gain).<br>
Following are the steps for Random Forest algorithm:
1. Random subsets are created from the original dataset (bootstrapping)
2. At each node in the decision tree, only a random set of features are considered to decide the best split using Information Gain or Gini Index
3. A decision tree model is fitted on each of the subsets
4. The final prediction is calculated by aggregating the predictions from all decision trees

Parameters for this algorithm in scikit-learn:
- `n_estimators`: It defines the number of decision trees to be created in a random forest
- `criterion`: It defines the function that is to be used for splitting. The function measures the quality of a split for each feature and chooses the best split
- `max_features`: It defines the maximum number of features allowed for the split in each decision tree
- `max_depth`: Random forest has multiple decision trees. This parameter defines the maximum depth of the trees
- `min_samples_split`: Used to define the minimum number of samples required in an internal node before a split is attempted. If the number of samples is less than the required number, the node is not split
- `min_samples_leaf`: This defines the minimum number of samples required to be at a leaf node
- `max_leaf_nodes`: This parameter specifies the maximum number of leaf nodes for each tree. The tree stops splitting when the number of leaf nodes becomes equal to `max_leaf_nodes`
- `n_jobs`: This indicates the number of jobs to run in parallel. Set the value to -1 if you want it to run on all cores in the system
- `random_state`: This parameter seeds the random selection of samples and features. Fixing it makes results reproducible when comparing various models



In [None]:
# for Classification

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)
# score the fitted random forest itself (rfc)
score = rfc.score(X_test,y_test)
print('Random Forest accuracy :', score)

In [None]:
# for Regression

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=1000, random_state=42)
rf.fit(X_train, y_train)
# for regressors, score() returns the R^2 coefficient of determination, not accuracy
score = rf.score(X_test,y_test)
print('Random Forest R^2 score :', score)

## 2. Libraries and Frameworks for Boosting Algorithms 

### 2.1 AdaBoost : scikit-learn
Adaptive boosting or AdaBoost is one of the simplest boosting algorithms. Multiple sequential models are created, each correcting the errors of the last model. Usually, decision trees are used for modelling. AdaBoost increases the weights of the observations (data points) that were predicted incorrectly, so that the subsequent model focuses on predicting these observations correctly. <br>

Following are the steps for AdaBoost algorithm:

1. Initially, all observations in the dataset are given equal weights
2. A model is built on a subset of data, usually a Decision Tree
3. Using this model, predictions are made on the whole dataset
4. Errors are calculated by comparing the predictions and actual values
5. While creating the next model, higher weights are given to the data points which were predicted incorrectly. Weights can be determined using the error value: the higher the error, the larger the weight assigned to the observation
6. This process is repeated until the error function does not change, or the maximum limit of the number of estimators is reached
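
The reweighting loop can be sketched by hand for binary classification. This is an illustrative sketch of the classic discrete AdaBoost update, not scikit-learn's implementation: labels are recoded to -1/+1, decision stumps are used as weak learners, each stump is fitted on the full dataset with sample weights rather than on a subset, and the dataset and number of rounds are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption); relabel {0, 1} as {-1, +1}
X, y01 = make_classification(n_samples=400, n_features=10, random_state=1)
y = 2 * y01 - 1

n_rounds = 20
weights = np.full(len(X), 1 / len(X))  # step 1: equal weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # steps 2-3: fit a weak learner on the weighted data and predict
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)

    # step 4: weighted error rate (weights always sum to 1)
    err = weights[pred != y].sum()
    if err == 0 or err >= 0.5:
        break  # perfect or no better than chance: stop adding models

    # step 5: misclassified points (y * pred == -1) get larger weights
    alpha = 0.5 * np.log((1 - err) / err)
    weights *= np.exp(-alpha * y * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final model: sign of the alpha-weighted sum of the weak learners' votes
ensemble_score = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
train_accuracy = (np.sign(ensemble_score) == y).mean()
print('Hand-rolled AdaBoost training accuracy :', train_accuracy)
```

The value `alpha` plays two roles: it sets how strongly the weights are shifted towards misclassified points, and it is the weight of that stump's vote in the final model.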

Parameters for this algorithm in scikit-learn:

- `base_estimator`: specifies the machine learning algorithm to be used as the base learner (renamed to `estimator` in scikit-learn 1.2)
- `n_estimators`: defines the number of base estimators. The default value is 50
- `learning_rate`: controls the contribution of each estimator to the final combination. There is a trade-off between `learning_rate` and `n_estimators`
- `max_depth`: defines the maximum depth of the individual estimator. It is not a parameter of AdaBoost itself; set it on the base learner, e.g. `DecisionTreeClassifier(max_depth=1)`
- `n_jobs`: specifies the number of processors a model is allowed to use (-1 for the maximum allowed). Note that scikit-learn's AdaBoost estimators do not accept it, since their models are built sequentially
- `random_state`: an integer seed for the random number generation. A fixed value of `random_state` will always produce the same results if given the same parameters and training data


In [None]:
# for Classification

from sklearn.ensemble import AdaBoostClassifier
adaBoost = AdaBoostClassifier(random_state=1)
adaBoost.fit(X_train, y_train)
score = adaBoost.score(X_test,y_test)
print('Adaboost model accuracy :', score)

In [None]:
# for Regression

from sklearn.ensemble import AdaBoostRegressor
adaBoost = AdaBoostRegressor()
adaBoost.fit(X_train, y_train)
# for regressors, score() returns the R^2 coefficient of determination, not accuracy
score = adaBoost.score(X_test,y_test)
print('Adaboost model R^2 score :', score)

### 2.2 Gradient Boosting (GBM) : scikit-learn
Gradient Boosting trains regression trees iteratively on the residuals of the current ensemble, combining a number of weak learners to form a strong learner. Each subsequent tree in the series is fitted to the errors left by the trees before it, and its scaled output is added to the running prediction, i.e. **Prediction(m) = Prediction(m-1) + learning rate × Tree m(X)**, where tree m is trained on **Error(m-1) = Actual − Prediction(m-1)** <br>

Following are the steps for Gradient Boosting algorithm:
1. Train a weak learner (model 1) on the given data and use it to predict the required output (prediction 1)
2. Calculate the error (residual) of that prediction: $\text{error}_1 = y - \text{prediction}_1$
3. Create a new model (model 2) using the calculated error as the target variable. The objective is to find the best splits to minimize the error
4. Combine the predictions made by the previous model (model 1) with the predictions made by this model (model 2), scaled by a learning rate $\alpha$, to get the new prediction: $\text{prediction}_2 = \text{prediction}_1 + \alpha \cdot \text{model}_2(x)$
5. Calculate new errors using this prediction: $\text{error}_2 = y - \text{prediction}_2$
6. Steps 3 to 5 are repeated until the maximum number of iterations is reached (or the error function does not change)
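
The residual-fitting loop described above can be sketched by hand for a squared-error regression problem. This is a simplified illustration, not scikit-learn's implementation: the starting model is assumed to be a constant (the mean of the targets), and the dataset, tree depth, learning rate `alpha` and number of trees are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption, for illustration only)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

alpha = 0.1    # learning rate
n_trees = 100  # maximum number of boosting iterations

# Start from a constant prediction: the mean of the targets
prediction = np.full(len(y), y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_trees):
    residual = y - prediction  # errors of the ensemble so far
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residual)      # the new model targets the residuals
    prediction += alpha * tree.predict(X)  # scaled correction
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print('Training MSE before boosting :', initial_mse)
print('Training MSE after boosting  :', final_mse)
```

To predict for new data, the same sum is replayed: start from `y.mean()` and add `alpha * tree.predict(X_new)` for every stored tree.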

Parameters for this algorithm in scikit-learn:

- `min_samples_split`: defines the minimum number of samples (or observations) which are required in a node to be considered for splitting. It is used to control over-fitting
- `min_samples_leaf`: defines the minimum samples required in a terminal or leaf node
- `min_weight_fraction_leaf`: similar to min_samples_leaf but defined as a fraction of the total weight of all observations (equivalently, of the number of observations when no sample weights are given) instead of an integer
- `max_depth`: maximum depth of a tree. Used to control over-fitting. Should be tuned using cross-validation (CV)
- `max_leaf_nodes`: maximum number of terminal nodes or leaves in a tree. Older scikit-learn releases documented that GBM ignores max_depth when this is defined; check the documentation of your installed version
- `max_features`: number of features to consider while searching for the best split. These will be randomly selected. As a rule of thumb, the square root of the total number of features works well, but values up to 30-40% of the total number of features are worth checking

In [None]:
# for Classification

from sklearn.ensemble import GradientBoostingClassifier
gradBoost = GradientBoostingClassifier(learning_rate=0.01,random_state=1)
gradBoost.fit(X_train, y_train)
score = gradBoost.score(X_test,y_test)
print('Gradient Boosting Model accuracy :', score)

In [None]:
# for Regression

from sklearn.ensemble import GradientBoostingRegressor
gradBoost = GradientBoostingRegressor()
gradBoost.fit(X_train, y_train)
# for regressors, score() returns the R^2 coefficient of determination, not accuracy
score = gradBoost.score(X_test,y_test)
print('Gradient Boosting Model R^2 score :', score)

### 2.3 XGBoost (XGB)
XGBoost stands for **eXtreme Gradient Boosting** and it’s an open-source implementation of the gradient boosted trees algorithm. <br>
XGBoost algorithm uses Gradient Boosting framework and uses **Gradient Descent** to optimize the loss function.<br>
Its implementation was specifically engineered for optimal performance and speed. <br><br>

In prediction problems involving unstructured data (images, text, etc.) artificial neural networks tend to outperform all other algorithms or frameworks. However, when it comes to small-to-medium structured/tabular data, decision tree based algorithms are considered best-in-class right now. <br><br>

#### Hyperparameters
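- `gamma`: helps with controlling overfitting. It specifies the minimum reduction in the loss required to make a further partition on a leaf node of the tree
- `booster`: the type of model to run at each iteration: `gbtree` (tree-based, the default), `gblinear` or `dart`
- `reg_alpha` and `reg_lambda`: L1 and L2 regularization terms on the weights
- `max_depth`: maximum depth of a tree
- `subsample`: fraction of the training rows sampled for each tree
- `n_estimators`: number of boosting rounds (trees) in the scikit-learn wrapper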



In order for XGBoost's native API to use our data, we need to transform it into a specific format that XGBoost can handle. That format is called DMatrix. Transforming a NumPy array of data to the DMatrix format is a simple one-liner:

In [None]:
import xgboost as xgb

D_train = xgb.DMatrix(X_train, label=y_train)
D_test = xgb.DMatrix(X_test, label=y_test)

Setting the optimal hyperparameters of any ML model can be a challenge, so why not let scikit-learn do it for you? We can combine scikit-learn's grid search with an XGBoost classifier quite easily:

In [None]:
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

clf = xgb.XGBClassifier()
# candidate values for each hyperparameter; every combination is tried
# (learning_rate is the scikit-learn wrapper's name for eta)
parameters = {
     "learning_rate"    : [0.05, 0.10, 0.15, 0.20, 0.25, 0.30 ] ,
     "max_depth"        : [ 3, 4, 5, 6, 8, 10, 12, 15],
     "min_child_weight" : [ 1, 3, 5, 7 ],
     "gamma"            : [ 0.0, 0.1, 0.2 , 0.3, 0.4 ],
     "colsample_bytree" : [ 0.3, 0.4, 0.5 , 0.7 ]
     }

grid = GridSearchCV(clf,
                    parameters, n_jobs=4,
                    scoring="neg_log_loss",
                    cv=3)

grid.fit(X_train, y_train)

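Only do that on a big dataset if you have time to kill: doing a grid search is essentially training an ensemble of decision trees many times over! The grid above alone has 6 × 8 × 4 × 5 × 4 = 3840 combinations, each fitted 3 times for cross-validation.<br><br>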

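Once your XGBoost model is trained, you can dump a human-readable description of it into a text file:

In [None]:
# get_booster() returns the underlying Booster of the best model found by the grid search
grid.best_estimator_.get_booster().dump_model('dump.raw.txt')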

### 2.4 LightGBM
LightGBM, short for Light Gradient Boosting Machine, is a free and open-source distributed gradient boosting framework for machine learning originally developed by Microsoft. <br>
LightGBM is particularly strong **when the dataset is extremely large**: compared to the other algorithms, it typically takes less time to run on a huge dataset. <br>
LightGBM is a gradient boosting framework that uses tree-based algorithms and grows trees leaf-wise, while other algorithms grow them level-wise. The image below illustrates the difference.

![lightGBM.png](images/lightGBM.png)

Leaf-wise growth may cause over-fitting on smaller datasets but that can be avoided by using the ‘max_depth’ parameter for learning. <br>

Parameters for this algorithm in LightGBM:

- `num_iterations`: defines the number of boosting iterations to be performed
- `num_leaves`: sets the number of leaves to be formed in a tree. In the case of LightGBM, since splitting takes place leaf-wise rather than depth-wise, `num_leaves` must be smaller than 2^(max_depth), otherwise it may lead to overfitting
- `min_data_in_leaf`: minimum number of data/sample/count per leaf (default is 20). A very small value may cause overfitting
- `max_depth`: specifies the maximum depth or level up to which a tree can grow
- `bagging_fraction`: specifies the fraction of data to be used for each iteration
- `max_bin`: defines the maximum number of bins that feature values will be bucketed in

In [None]:
# for Classification

import lightgbm as lgb
train_data = lgb.Dataset(X_train, label=y_train)

# define parameters; the 'binary' objective (assuming a two-class problem)
# makes predict() return the probability of class 1
params = {'objective':'binary', 'learning_rate':0.001}
model = lgb.train(params, train_data, 100)
y_prob = model.predict(X_test)
# convert probabilities to 0/1 labels using a 0.5 threshold
y_pred = (y_prob >= 0.5).astype(int)

In [None]:
# for Regression

import lightgbm as lgb
train_data=lgb.Dataset(X_train, label=y_train)
params = {'learning_rate':0.001}
model= lgb.train(params, train_data, 100)
y_pred = model.predict(X_test)

from sklearn.metrics import mean_squared_error
# root of the mean squared error between actual and predicted values
rmse = mean_squared_error(y_test, y_pred)**0.5
print('Root Mean Square Error :', rmse)

### 2.5 CatBoost
CatBoost is an open-source software library developed by Yandex. It provides a gradient boosting framework which, among other features, handles categorical features using a permutation-driven alternative to the classical algorithm.<br>
CatBoost can automatically deal with categorical variables and does not require extensive data preprocessing like other machine learning algorithms. This reduces the required computation drastically when categorical variables have many labels, as performing one-hot encoding on them greatly increases the dimensionality of the dataset.<br><br>
Because CatBoost handles categorical variables itself, you should not perform one-hot encoding on them. Just load the files, impute missing values, and you're good to go.

In [None]:
# for Classification

import numpy as np
from catboost import CatBoostClassifier
catBoost = CatBoostClassifier()
# assumes X_train is a pandas DataFrame; every non-float column is treated as categorical
categorical_features_indices = np.where(X_train.dtypes != float)[0]
catBoost.fit(X_train, y_train, cat_features=categorical_features_indices, eval_set=(X_test, y_test))
score = catBoost.score(X_test,y_test)
print('Catboost Model accuracy :', score)

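In [None]:
# for Regression

import numpy as np
from catboost import CatBoostRegressor
catBoost = CatBoostRegressor()
# assumes X_train is a pandas DataFrame; every non-float column is treated as categorical
categorical_features_indices = np.where(X_train.dtypes != float)[0]
catBoost.fit(X_train, y_train, cat_features=categorical_features_indices, eval_set=(X_test, y_test))
# for regressors, score() returns the R^2 coefficient of determination, not accuracy
score = catBoost.score(X_test,y_test)
print('Catboost Model R^2 score :', score)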

## Analogy for evolution of tree-based algorithms

1. **Decision Tree:** Every hiring manager has a set of criteria such as education level, number of years of experience, interview performance. A decision tree is analogous to a hiring manager interviewing candidates based on his or her own criteria
2. **Bagging:** Now imagine that instead of a single interviewer there is an interview panel where each interviewer has a vote. Bagging or bootstrap aggregating involves combining inputs from all interviewers for the final decision through a democratic voting process
3. **Random Forest:** It is a bagging-based algorithm with a key difference wherein only a subset of features is selected at random. In other words, every interviewer will only test the interviewee on certain randomly selected qualifications (e.g. a technical interview for testing programming skills and a behavioral interview for evaluating non-technical skills)
4. **Boosting:** This is an alternative approach where each interviewer alters the evaluation criteria based on feedback from the previous interviewer. This ‘boosts’ the efficiency of the interview process by deploying a more dynamic evaluation process
5. **Gradient Boosting:** A special case of boosting where errors are minimized by gradient descent algorithm e.g. the strategy consulting firms leverage by using case interviews to weed out less qualified candidates
6. **XGBoost:** Think of XGBoost as gradient boosting on ‘steroids’ (well it is called ‘Extreme Gradient Boosting’ for a reason!). It is a perfect combination of software and hardware optimization techniques to yield superior results using less computing resources in the shortest amount of time


## Careful Considerations

- **Noise, Bias, and Variance:** The combination of decisions from multiple models can help improve the overall performance. Hence, one of the key reasons to use ensemble models is to overcome these issues: noise, bias, and variance. If the ensemble does not improve accuracy in such a situation, its use should be carefully reconsidered

- **Simplicity and Explainability:** Simpler machine learning models are preferred over complicated ones. The ability to explain the final model's decision is reduced with an ensemble

- **Generalization:** Ensemble models trained without care can quickly overfit

- **Inference Time:** Inference time, i.e. the time a trained model needs to generate a prediction for new/live data points, is critical in a Machine Learning environment. When deploying ensemble models into production, each prediction has to pass through multiple models, which can increase the inference time

## Summary

Ensemble models are an excellent method for machine learning. They offer a variety of techniques for classification and regression problems and are able to reduce bias, at the cost of possible overfitting and a less generalised model. <br>
It is recommended to **create various Machine Learning models using different Ensemble techniques and compare each model's performance** (using a tabular comparison or a heatmap) to finally select the best-fitting model.
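
One possible way to build such a tabular comparison is to cross-validate several scikit-learn ensembles on the same data. The sketch below is an assumption-laden illustration: the synthetic dataset, the default hyperparameters and the 5-fold accuracy metric would all need to be adapted to the problem at hand.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption, for illustration only)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    'Bagging': BaggingClassifier(random_state=0),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=0),
    'AdaBoost': AdaBoostClassifier(random_state=0),
    'Gradient Boosting': GradientBoostingClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate
results = {name: cross_val_score(model, X, y, cv=5).mean()
           for name, model in candidates.items()}

# Print a small table, best model first
for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name:<20}{score:.3f}')
```

Cross-validation is used here instead of a single train/test split so that the ranking depends less on one particular split.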

### Tree based models vs Neural Networks 

Neural Networks (NNs) tend to work tremendously well with unstructured data such as image, text or audio files, but tree-based models (XGBoost & random forests) tend to outperform them on tabular datasets. <br>
Tree-based models offer more accurate predictions with less computation cost when the dataset size is small/medium (~10k data samples). <br>
Also, in such a setting, they perform well even when there are:
- irregular patterns in the target function
- uninformative features
- non-rotationally invariant data <br>

However, this might not hold when additional regularization techniques such as Data Augmentation are added to the random search or when the dataset size is massive. <br>
#### So should we use just XGBoost all the time? 
We must test all possible algorithms for data at hand to identify the champion algorithm. Besides, picking the right algorithm is not enough. We must also choose the right configuration of the algorithm for a dataset by tuning the hyper-parameters. Furthermore, there are several other considerations for choosing the winning algorithm such as computational complexity, explainability, and ease of implementation. This is exactly the point where Machine Learning starts drifting away from science towards art, but honestly, that’s where the magic happens!

### References

- <a href="https://towardsdatascience.com/ensemble-models-5a62d4f4cb0c">Ensemble Models - towardsdatascience </a>
- <a href="https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-for-ensemble-models/">A Comprehensive Guide to Ensemble Learning - analyticsvidhya</a>
- <a href="https://www.javatpoint.com/stacking-in-machine-learning#:~:text=Stacking%20is%20one%20of%20the,new%20model%20with%20improved%20performance.">Stacking in Machine Learning - javaTpoint </a>
- <a href="https://hal.archives-ouvertes.fr/hal-03723551">Why do tree-based models still outperform deep learning on tabular data? </a>
- <a href="https://github.com/manikpurivibhu/Python-Machine-Learning-notes/blob/master/Machine%20Learning%20-%20Random%20Forest/Random%20Forest.ipynb">Random Forest - manikpurivibhu </a>
- <a href="https://en.wikipedia.org/wiki/Gradient_boosting#Algorithm"> Gradient Boosting - Wikipedia  </a>
- <a href="https://towardsdatascience.com/xgboost-theory-and-practice-fb8912930ad6">  XGBoost: theory and practice - towardsdatascience   </a>
- <a href="https://towardsdatascience.com/https-medium-com-vishalmorde-xgboost-algorithm-long-she-may-rein-edd9f99be63d">XGBoost Algorithm: Long May She Reign! - towardsdatascience   </a>