# Introduction

When you want to purchase a new car, will you walk up to the first car shop and purchase one based on the advice of the dealer? It’s highly unlikely.

You would likely browse a few web portals where people have posted their reviews and compare different car models, checking their features and prices. You would also probably ask your friends and colleagues for their opinions. In short, you wouldn’t jump straight to a conclusion, but would instead make a decision that takes other people's opinions into account.

Ensemble models in machine learning operate on a similar idea. They combine the decisions from multiple models to improve the overall performance. This can be achieved in various ways, which you will discover in this tutorial.


## Content
1. [Data Preprocessing](#1)                          
1. [Modeling](#2)
    * [Bagging meta-estimator](#3)
    * [Random Forest](#4)
    * [AdaBoost](#5)
    * [Gradient Boosting](#6)
    * [XGBoost](#7)
    * [Stacking](#8)

<a id="1"></a>
# 1) Data Preprocessing

The dataset you are going to be using for this case study is popularly known as the Wisconsin Breast Cancer dataset. The task related to it is Classification.

The dataset contains 699 instances, each described by 10 attributes (a sample ID plus nine features scored from 1 to 10) and labeled as either `benign` or `malignant`. Sixteen feature values are missing. All values in the dataset are numeric.

You will implement the ensembles using the scikit-learn library, plus the `xgboost` package for XGBoost.

Let's first import all the Python dependencies you will be needing for this case study.

In [1]:
import pandas as pd
import xgboost as xgb
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, \
    GradientBoostingClassifier, StackingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Let's load the dataset in a DataFrame object.

In [2]:
data = pd.read_csv('breast_cancer_wisconsin.csv')
data.head()

Unnamed: 0,Sample code number,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2


The column `Sample code number` is just an indicator and it's of no use in the modeling. So, let's drop it:

In [3]:
data.drop(['Sample code number'], axis=1, inplace=True)
data.head()

Unnamed: 0,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bare Nuclei,Bland Chromatin,Normal Nucleoli,Mitoses,Class
0,5,1,1,1,2,1,3,1,1,2
1,5,4,4,5,7,10,3,2,1,2
2,3,1,1,1,2,2,3,1,1,2
3,6,8,8,1,3,4,3,7,1,2
4,4,1,1,3,2,1,3,1,1,2


You can see that the column is dropped now. Let's get some statistics about the data using pandas' `describe()` and `info()` functions:

In [4]:
data.describe()

Unnamed: 0,Clump Thickness,Uniformity of Cell Size,Uniformity of Cell Shape,Marginal Adhesion,Single Epithelial Cell Size,Bland Chromatin,Normal Nucleoli,Mitoses,Class
count,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0,699.0
mean,4.41774,3.134478,3.207439,2.806867,3.216023,3.437768,2.866953,1.589413,2.689557
std,2.815741,3.051459,2.971913,2.855379,2.2143,2.438364,3.053634,1.715078,0.951273
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0
25%,2.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,2.0
50%,4.0,1.0,1.0,1.0,2.0,3.0,1.0,1.0,2.0
75%,6.0,5.0,5.0,4.0,4.0,5.0,4.0,1.0,4.0
max,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,4.0


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Clump Thickness              699 non-null    int64 
 1   Uniformity of Cell Size      699 non-null    int64 
 2   Uniformity of Cell Shape     699 non-null    int64 
 3   Marginal Adhesion            699 non-null    int64 
 4   Single Epithelial Cell Size  699 non-null    int64 
 5   Bare Nuclei                  699 non-null    object
 6   Bland Chromatin              699 non-null    int64 
 7   Normal Nucleoli              699 non-null    int64 
 8   Mitoses                      699 non-null    int64 
 9   Class                        699 non-null    int64 
dtypes: int64(9), object(1)
memory usage: 54.7+ KB


As mentioned earlier, the dataset contains missing values. The column named `Bare Nuclei` contains them. Let's verify.

In [6]:
data['Bare Nuclei'][:30]

0      1
1     10
2      2
3      4
4      1
5     10
6     10
7      1
8      1
9      1
10     1
11     1
12     3
13     3
14     9
15     1
16     1
17     1
18    10
19     1
20    10
21     7
22     1
23     ?
24     1
25     7
26     1
27     1
28     1
29     1
Name: Bare Nuclei, dtype: object

You can spot some `?` in it, right? Well, these are your missing values, and you will be imputing them with Mean Imputation. But first, you will replace those `?` with `0`.

In [7]:
data.replace('?', 0, inplace=True)
data['Bare Nuclei'][:30]

0      1
1     10
2      2
3      4
4      1
5     10
6     10
7      1
8      1
9      1
10     1
11     1
12     3
13     3
14     9
15     1
16     1
17     1
18    10
19     1
20    10
21     7
22     1
23     0
24     1
25     7
26     1
27     1
28     1
29     1
Name: Bare Nuclei, dtype: object

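The `?` are replaced with `0` now. Since no valid value in this dataset is `0`, you can tell the imputer to treat `0` as the missing-value marker and replace it with the column mean.

In [8]:
# By default SimpleImputer only fills NaN, so point it at the 0 placeholder explicitly
imputer = SimpleImputer(missing_values=0, strategy='mean')
imputed_data = imputer.fit_transform(data)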
imputed_data

array([[ 5.,  1.,  1., ...,  1.,  1.,  2.],
       [ 5.,  4.,  4., ...,  2.,  1.,  2.],
       [ 3.,  1.,  1., ...,  1.,  1.,  2.],
       ...,
       [ 5., 10., 10., ..., 10.,  2.,  4.],
       [ 4.,  8.,  6., ...,  6.,  1.,  4.],
       [ 4.,  8.,  8., ...,  4.,  1.,  4.]])
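
To make the effect of mean imputation concrete, here is a minimal sketch on a tiny made-up column (the values are illustrative, not taken from the dataset). It assumes the missing entries are marked as `NaN` rather than with a `0` placeholder, and the name `toy` is hypothetical.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column: '?' marks a missing value, as in `Bare Nuclei`
toy = pd.DataFrame({'Bare Nuclei': ['1', '10', '?', '4']})

# Convert to numbers; anything that is not a number (the '?') becomes NaN
toy['Bare Nuclei'] = pd.to_numeric(toy['Bare Nuclei'], errors='coerce')

# Mean imputation: NaN is replaced by the mean of the observed values
toy_imputer = SimpleImputer(strategy='mean')
filled = toy_imputer.fit_transform(toy)

print(filled.ravel())  # the gap is filled with (1 + 10 + 4) / 3 = 5.0
```

Note that the mean is computed only from the observed values, so the placeholder itself never influences the result.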

Many models, particularly distance-based ones such as KNN and SVM, are sensitive to the scale of their inputs, so it is good practice to bring all features to a common range. Here you will normalize every column to the range 0 to 1. Note that the scaler is applied to the whole array, so the `Class` column is rescaled too: `2` (benign) becomes `0` and `4` (malignant) becomes `1`.

In [9]:
scaler = MinMaxScaler(feature_range=(0, 1))
normalized_data = scaler.fit_transform(imputed_data)
normalized_data

array([[0.44444444, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.44444444, 0.33333333, 0.33333333, ..., 0.11111111, 0.        ,
        0.        ],
       [0.22222222, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.44444444, 1.        , 1.        , ..., 1.        , 0.11111111,
        1.        ],
       [0.33333333, 0.77777778, 0.55555556, ..., 0.55555556, 0.        ,
        1.        ],
       [0.33333333, 0.77777778, 0.77777778, ..., 0.33333333, 0.        ,
        1.        ]])
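
Min-max scaling maps each value `x` in a column to `(x - min) / (max - min)`. The short sketch below checks this formula against `MinMaxScaler` on a hypothetical column that uses the dataset's 1 to 10 scale. A value of 5 becomes 4/9 ≈ 0.444, which is the first entry of the output above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature column on a 1-10 scale (illustrative values)
column = np.array([[1.0], [5.0], [3.0], [10.0]])

# Manual min-max scaling: (x - min) / (max - min)
manual = (column - column.min()) / (column.max() - column.min())

# The same transformation via scikit-learn
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(column)

print(manual.ravel())
print(np.allclose(manual, scaled))
```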

Wonderful!

You have performed all the preprocessing that was required in order to perform your Ensembling experiments.

<a id="2"></a>
# 2) Modeling

Separate the data into training and test sets.

In [10]:
# Segregate the features from the labels
X = normalized_data[:, 0:9]
Y = normalized_data[:, 9]

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, shuffle=True, random_state=42)

<a id="3"></a>
## Bagging meta-estimator

Bagging meta-estimator is an ensembling algorithm that can be used for both classification (BaggingClassifier) and regression (BaggingRegressor) problems. It follows the typical bagging technique to make predictions. Following are the steps for the bagging meta-estimator algorithm:

1. Random subsets are created from the original dataset (Bootstrapping).
2. The subset of the dataset includes all features.
3. A user-specified base estimator is fitted on each of these smaller sets.
4. Predictions from each model are combined to get the final result.
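
The four steps above can be sketched by hand. This is only an illustration under stated assumptions: it uses synthetic data from `make_classification` instead of the breast cancer dataset, the names `X_demo`, `y_demo` and `trees` are made up, and the predictions are combined by a simple majority vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the breast cancer dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)

rng = np.random.default_rng(0)
n_models = 10
trees = []

for _ in range(n_models):
    # Steps 1-2: bootstrap sample of the rows (drawn with replacement), all features kept
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # Step 3: fit the base estimator on the bootstrap sample
    tree = DecisionTreeClassifier(random_state=0).fit(X_demo[idx], y_demo[idx])
    trees.append(tree)

# Step 4: combine the predictions by majority vote (labels are 0/1)
votes = np.stack([tree.predict(X_demo) for tree in trees])  # shape (n_models, n_samples)
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)

print("Training accuracy of the bagged trees:", (bagged_pred == y_demo).mean())
```

`BaggingClassifier` automates the same loop and adds options such as feature subsampling and parallel fitting.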

In [11]:
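# The base estimator is passed positionally because its keyword was renamed
# from `base_estimator` to `estimator` in scikit-learn 1.2
model = BaggingClassifier(DecisionTreeClassifier(class_weight='balanced', random_state=2),
                          random_state=2)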
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
# Check accuracy
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("\n")
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("\n")
print("Classification report:\n", classification_report(y_test, y_pred))

Accuracy:  0.9714285714285714


Confusion matrix:
 [[139   4]
 [  2  65]]


Classification report:
               precision    recall  f1-score   support

         0.0       0.99      0.97      0.98       143
         1.0       0.94      0.97      0.96        67

    accuracy                           0.97       210
   macro avg       0.96      0.97      0.97       210
weighted avg       0.97      0.97      0.97       210



For the regression problem, replace `BaggingClassifier` with `BaggingRegressor`.

Parameters used in the algorithm:

* <b>base_estimator</b> (renamed to <b>estimator</b> in scikit-learn 1.2):
    * It defines the base estimator to fit on random subsets of the dataset.
    * When nothing is specified, the base estimator is a decision tree.
* <b>n_estimators</b>:
    * It is the number of base estimators to be created.
    * The number of estimators should be carefully tuned as a large number would take a very long time to run, while a very small number might not provide the best results.
* <b>max_samples</b>:
    * This parameter controls the size of the subsets.
    * It is the maximum number of samples to train each base estimator.
* <b>max_features</b>:
    * Controls the number of features to draw from the whole dataset.
    * It defines the maximum number of features required to train each base estimator.
* <b>n_jobs</b>:
    * The number of jobs to run in parallel.
    * Set this value equal to the cores in your system.
    * If -1, the number of jobs is set to the number of cores.
* <b>random_state</b>:
    * It sets the seed for the random number generator. When the random state value is the same for two models, the random selection is the same for both models.
    * This parameter is useful when you want to compare different models.
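
As a hedged illustration of how these parameters fit together, the sketch below sets several of them explicitly on synthetic data. The specific values are arbitrary choices for demonstration, not tuned recommendations, and `X_demo`, `y_demo` and `bagging` are made-up names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the breast cancer dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)

bagging = BaggingClassifier(
    DecisionTreeClassifier(random_state=2),  # base estimator, passed positionally
    n_estimators=25,    # number of base estimators
    max_samples=0.8,    # each estimator is trained on 80% of the rows
    max_features=0.5,   # each estimator is trained on half of the columns
    random_state=2,     # seed for reproducible sampling
)
bagging.fit(X_demo, y_demo)

print(len(bagging.estimators_))          # number of fitted trees
print(bagging.estimators_features_[0])   # column indices used by the first tree
```

With 9 features and `max_features=0.5`, each tree is given 4 columns, because the fraction is truncated to a whole number.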

<a id="4"></a>
## Random Forest

Random Forest is another ensemble machine learning algorithm that follows the bagging technique. It is an extension of the bagging estimator algorithm. The base estimators in random forest are decision trees. Unlike bagging meta estimator, random forest randomly selects a set of features which are used to decide the best split at each node of the decision tree.

Looking at it step-by-step, this is what a random forest model does:

1. Random subsets are created from the original dataset (bootstrapping).
2. At each node in the decision tree, only a random subset of features is considered to decide the best split.
3. A decision tree model is fitted on each of the subsets.
4. The final prediction is calculated by averaging the predictions from all decision trees (for classification, scikit-learn averages the trees' predicted class probabilities).

<i>Note: The decision trees in random forest can be built on a subset of data and features. In particular, the sklearn implementation of random forest makes all features available to each decision tree, and a random subset of features is selected for splitting at each node.</i>

To sum up, Random forest <b>randomly</b> selects data points and features, and builds <b>multiple trees (Forest)</b>.

In [12]:
model = RandomForestClassifier(class_weight='balanced', random_state=2)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
# Check accuracy
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("\n")
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("\n")
print("Classification report:\n", classification_report(y_test, y_pred))

Accuracy:  0.9761904761904762


Confusion matrix:
 [[140   3]
 [  2  65]]


Classification report:
               precision    recall  f1-score   support

         0.0       0.99      0.98      0.98       143
         1.0       0.96      0.97      0.96        67

    accuracy                           0.98       210
   macro avg       0.97      0.97      0.97       210
weighted avg       0.98      0.98      0.98       210



Parameters:

* <b>n_estimators</b>:
    * It defines the number of decision trees to be created in a random forest.
    * Generally, a higher number makes the predictions stronger and more stable, but a very large number can result in higher training time.
* <b>criterion</b>:
    * It defines the function that is to be used for splitting.
    * The function measures the quality of a split for each feature and chooses the best split.
* <b>max_features</b>:
    * It defines the maximum number of features considered for each split in a decision tree.
    * Increasing max_features usually improves performance, but a very high number can decrease the diversity among the trees.
* <b>max_depth</b>:
    * Random forest has multiple decision trees. This parameter defines the maximum depth of the trees.
* <b>min_samples_split</b>:
    * Used to define the minimum number of samples required in an internal node before a split is attempted.
    * If the number of samples is less than the required number, the node is not split.
* <b>min_samples_leaf</b>:
    * This defines the minimum number of samples required to be at a leaf node.
    * Smaller leaf size makes the model more prone to capturing noise in train data.
* <b>max_leaf_nodes</b>:
    * This parameter specifies the maximum number of leaf nodes for each tree.
    * The tree stops splitting when the number of leaf nodes becomes equal to the max leaf node.
* <b>n_jobs</b>:
    * This indicates the number of jobs to run in parallel.
    * Set value to -1 if you want it to run on all cores in the system.
* <b>random_state</b>:
    * This parameter is used to define the random selection.
    * It is used for comparison between various models.
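
The following sketch shows these parameters in use on synthetic data. The values are arbitrary, for illustration only. It also turns on `oob_score`, which is not in the list above: it estimates accuracy from the rows each tree did not see in its bootstrap sample.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption, not the breast cancer dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,       # number of trees
    criterion='gini',       # split quality measure
    max_features='sqrt',    # features considered at each split
    max_depth=5,            # cap on tree depth
    min_samples_split=4,    # a node needs at least 4 samples to be split
    min_samples_leaf=2,     # every leaf keeps at least 2 samples
    oob_score=True,         # extra: out-of-bag accuracy estimate
    random_state=2,
)
forest.fit(X_demo, y_demo)

deepest = max(tree.get_depth() for tree in forest.estimators_)
print("Trees:", len(forest.estimators_), "| deepest tree:", deepest)
print("Out-of-bag accuracy:", forest.oob_score_)
```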

<a id="5"></a>
## AdaBoost

Adaptive boosting or AdaBoost is one of the simplest boosting algorithms. Usually, decision trees are used for modelling. Multiple sequential models are created, each correcting the errors from the last model. AdaBoost assigns weights to the observations which are incorrectly predicted and the subsequent model works to predict these values correctly.

Below are the steps for performing the AdaBoost algorithm:

1. Initially, all observations in the dataset are given equal weights.
2. A model is built on a subset of data.
3. Using this model, predictions are made on the whole dataset.
4. Errors are calculated by comparing the predictions and actual values.
5. While creating the next model, higher weights are given to the data points which were predicted incorrectly.
6. Weights can be determined using the error value. For instance, the higher the error, the greater the weight assigned to the observation.
7. This process is repeated until the error function does not change, or the maximum limit of the number of estimators is reached.
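
The reweighting loop described above can be sketched with decision stumps. This is a simplified version of discrete AdaBoost for two classes, written under a few assumptions: the data is synthetic, labels are recoded to -1 and +1, and each stump is trained on the full dataset with sample weights rather than on a subset. It is not scikit-learn's exact implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption); recode labels to {-1, +1}
X_demo, y01 = make_classification(n_samples=300, n_features=9, random_state=0)
y_demo = np.where(y01 == 1, 1, -1)

n_rounds = 20
weights = np.full(len(X_demo), 1 / len(X_demo))  # step 1: equal weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # Steps 2-3: fit a weak learner with the current weights, predict on all data
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X_demo, y_demo, sample_weight=weights)
    pred = stump.predict(X_demo)

    # Step 4: weighted error (clipped to avoid division by zero)
    err = np.clip(np.sum(weights[pred != y_demo]), 1e-10, 1 - 1e-10)

    # Step 6: the learner's say in the final vote grows as its error shrinks
    alpha = 0.5 * np.log((1 - err) / err)

    # Step 5: raise the weights of misclassified points, lower the others
    weights *= np.exp(-alpha * y_demo * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted vote of all stumps
final = np.sign(sum(a * s.predict(X_demo) for a, s in zip(alphas, stumps)))
print("Training accuracy:", (final == y_demo).mean())
```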

In [13]:
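# The base estimator is passed positionally because its keyword was renamed
# from `base_estimator` to `estimator` in scikit-learn 1.2
model = AdaBoostClassifier(DecisionTreeClassifier(class_weight='balanced', random_state=2),
                           random_state=2)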
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
# Check accuracy
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("\n")
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("\n")
print("Classification report:\n", classification_report(y_test, y_pred))

Accuracy:  0.9333333333333333


Confusion matrix:
 [[138   5]
 [  9  58]]


Classification report:
               precision    recall  f1-score   support

         0.0       0.94      0.97      0.95       143
         1.0       0.92      0.87      0.89        67

    accuracy                           0.93       210
   macro avg       0.93      0.92      0.92       210
weighted avg       0.93      0.93      0.93       210



For the regression problem, replace `AdaBoostClassifier` with `AdaBoostRegressor`.

Parameters:

* <b>base_estimator</b> (renamed to <b>estimator</b> in scikit-learn 1.2):
    * It helps to specify the type of base estimator, that is, the machine learning algorithm to be used as base learner.
* <b>n_estimators</b>:
    * It defines the maximum number of base estimators.
    * The default value is 50. A higher value can improve performance at the cost of longer training.
* <b>learning_rate</b>:
    * This parameter controls the contribution of the estimators in the final combination.
    * There is a trade-off between learning_rate and n_estimators.
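* <b>max_depth</b>:
    * Defines the maximum depth of the individual estimator.
    * This is a parameter of the base estimator (for example `DecisionTreeClassifier(max_depth=1)`), not of AdaBoost itself. Tune it for best performance.
* <b>n_jobs</b>:
    * `AdaBoostClassifier` itself has no `n_jobs` parameter, because the estimators are trained sequentially. If the base estimator supports `n_jobs`, set it there.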
* <b>random_state</b>:
    * An integer value to specify the random data split.
    * A definite value of random_state will always produce same results if given with same parameters and training data.

<a id="6"></a>
## Gradient Boosting

Gradient Boosting or GBM is another ensemble machine learning algorithm that works for both regression and classification problems. GBM uses the boosting technique, combining a number of weak learners to form a strong learner. Regression trees are used as the base learners, and each subsequent tree in the series is built on the errors calculated by the previous tree.

We will use a simple example to understand the GBM algorithm. We have to predict the age of a group of people using the below data:

| ID | Married | Gender | City | Monthly Income | Age (target) |
| :- | :- | :- | :- | :- | :- |
| 1 | Y	| F	| Hanoi	| 51.000 | 35 |
| 2 | N	| M	| HCM | 25.000 | 24 |
| 3	| Y	| F	| Hanoi | 70.000 | 38 |
| 4	| Y	| M	| HCM | 53.000 | 30 |
| 5	| N	| M	| Hanoi	| 47.000 | 33 |

1. Train the first model on the above dataset.
2. Calculate the error as the difference between the actual value and the predicted value.

| ID | Married | Gender | City | Monthly Income | Age (target) | Age (prediction 1) | Error 1 |
| :- | :- | :- | :- | :- | :- | :- | :- |
| 1 | Y	| F	| Hanoi | 51.000 | 35 | 32 | 3 |
| 2	| N	| M	| HCM | 25.000 | 24 | 32 | -8 |
| 3	| Y	| F	| Hanoi | 70.000 | 38 | 32 | 6 |
| 4	| Y	| M	| HCM | 53.000 | 30 | 32 | -2 |
| 5	| N	| M	| Hanoi | 47.000 | 33 | 32 | 1 |

3. A second model is created, using the same input features as the previous model, but the target is `Error 1`.
4. The predicted value of the second model is added to the predicted value of the first model.

| ID | Age (target) | Age (prediction 1) | Error 1 (new target) | Prediction 2 | Combine (Pred1+Pred2) |
| :- | :- | :- | :- | :- | :- |
| 1	| 35 | 32 | 3 | 3 | 35 |
| 2	| 24 | 32 | -8 | -5 | 27 |
| 3	| 38 | 32 | 6 | 3 | 35 |
| 4	| 30 | 32 | -2 | -5 | 27 |
| 5	| 33 | 32 | 1 | 3 | 35 |

5. The combined value from step 4 is considered the new predicted value. We calculate the new error (`Error 2`) as the difference between the actual value and this combined value.

| ID | Age (target) | Age (prediction 1) | Error 1 (new target) | Prediction 2 | Combine (Pred1+Pred2) | Error 2 |
| :- | :- | :- | :- | :- | :- | :- |
| 1	| 35 | 32 | 3 | 3 | 35 | 0 |
| 2	| 24 | 32 | -8 | -5 | 27 | -3 |
| 3	| 38 | 32 | 6 | 3 | 35 | 3 |
| 4	| 30 | 32 | -2 | -5 | 27 | 3 |
| 5	| 33 | 32 | 1 | 3 | 35 | -2 |

6. Steps 2 to 5 are repeated till the maximum number of iterations is reached (or error function does not change).
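
The first two boosting rounds from the tables can be reproduced in a few lines. This sketch makes two assumptions: the first model simply predicts the mean age, and the second model is a depth-1 regression tree that uses only `City` as a feature (encoded as Hanoi = 1, HCM = 0). The tables appear to round the second model's output to whole numbers, so the values printed here differ slightly from them.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data from the tables above; City encoded as Hanoi = 1, HCM = 0
city = np.array([[1], [0], [1], [0], [1]])
age = np.array([35, 24, 38, 30, 33], dtype=float)

# Model 1: predict the mean age for everyone
pred1 = np.full_like(age, age.mean())   # 32 for every row
error1 = age - pred1                    # residuals: 3, -8, 6, -2, 1

# Model 2: a shallow regression tree fitted on the residuals of model 1
residual_tree = DecisionTreeRegressor(max_depth=1, random_state=0).fit(city, error1)
pred2 = residual_tree.predict(city)     # mean residual per city

# Combine the two models and compute the remaining error
combined = pred1 + pred2
error2 = age - combined

print("Prediction 2:", pred2)
print("Combined:    ", combined)
print("Error 2:     ", error2)
```

A real GBM additionally scales each tree's contribution by the learning rate before adding it, which is why small learning rates need more trees.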

In [14]:
model = GradientBoostingClassifier(learning_rate=0.01, random_state=2)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
# Check accuracy
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("\n")
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("\n")
print("Classification report:\n", classification_report(y_test, y_pred))

Accuracy:  0.9571428571428572


Confusion matrix:
 [[139   4]
 [  5  62]]


Classification report:
               precision    recall  f1-score   support

         0.0       0.97      0.97      0.97       143
         1.0       0.94      0.93      0.93        67

    accuracy                           0.96       210
   macro avg       0.95      0.95      0.95       210
weighted avg       0.96      0.96      0.96       210



For the regression problem, replace `GradientBoostingClassifier` with `GradientBoostingRegressor`.

Parameters:

* <b>min_samples_split</b>:
    * Defines the minimum number of samples (or observations) which are required in a node to be considered for splitting.
    * Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
* <b>min_samples_leaf</b>:
    * Defines the minimum samples required in a terminal or leaf node.
    * Generally, lower values should be chosen for imbalanced class problems because the regions in which the minority class will be in the majority will be very small.
* <b>min_weight_fraction_leaf</b>:
    * Similar to min_samples_leaf but defined as a fraction of the total number of observations instead of an integer.
* <b>max_depth</b>:
    * The maximum depth of a tree.
    * Used to control over-fitting as higher depth will allow the model to learn relations very specific to a particular sample.
    * Should be tuned using CV.
* <b>max_leaf_nodes</b>:
    * The maximum number of terminal nodes or leaves in a tree.
    * Can be defined in place of max_depth. Since binary trees are created, a depth of ‘n’ would produce a maximum of 2^n leaves.
    * If this is defined, GBM will ignore max_depth.
* <b>max_features</b>:
    * The number of features to consider while searching for the best split. These will be randomly selected.
    * As a rule of thumb, the square root of the total number of features works well, but values up to 30-40% of the total number of features are worth checking.
    * Higher values can lead to over-fitting but it generally depends on a case to case scenario.
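
As a sketch of how these parameters might be set, the example below fits a GBM on synthetic data and then uses `staged_predict` to see how test accuracy evolves as trees are added. The parameter values and the names `gbm` and `stage_acc` are illustrative assumptions, not tuned settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the breast cancer dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

gbm = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    min_samples_split=4,
    min_samples_leaf=2,
    max_features='sqrt',
    random_state=2,
)
gbm.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after each boosting stage
stage_acc = [accuracy_score(y_te, pred) for pred in gbm.staged_predict(X_te)]
best_n = int(np.argmax(stage_acc)) + 1

print("Best number of trees on this split:", best_n)
print("Accuracy with all trees:", stage_acc[-1])
```

In practice the number of trees should be chosen with cross-validation or a separate validation set rather than the test set.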

<a id="7"></a>
## XGBoost

XGBoost (extreme Gradient Boosting) is an advanced implementation of the gradient boosting algorithm. XGBoost has proved to be a highly effective ML algorithm, extensively used in machine learning competitions and hackathons. XGBoost has high predictive power and is often reported to be up to about 10 times faster than other gradient boosting implementations. It also includes several forms of regularization, which reduce overfitting and improve overall performance. Hence it is also known as a <b>regularized boosting</b> technique.

Let us see how XGBoost is comparatively better than other techniques:

* <b>Regularization</b>:
    * Standard GBM implementation has no regularisation like XGBoost.
    * Thus XGBoost also helps to reduce overfitting.
* <b>Parallel Processing</b>:
    * XGBoost implements parallel processing and is faster than GBM.
    * XGBoost also supports implementation on Hadoop.
* <b>High Flexibility</b>:
    * XGBoost allows users to define custom optimization objectives and evaluation criteria adding a whole new dimension to the model.
* <b>Handling Missing Values</b>:
    * XGBoost has an in-built routine to handle missing values.
* <b>Tree Pruning</b>:
    * XGBoost makes splits up to the max_depth specified and then starts pruning the tree backwards and removes splits beyond which there is no positive gain.
* <b>Built-in Cross-Validation</b>:
    * XGBoost allows a user to run a cross-validation at each iteration of the boosting process and thus it is easy to get the exact optimum number of boosting iterations in a single run.
    
Since XGBoost takes care of the missing values itself, you do not have to impute the missing values. You can skip the step for missing value imputation from the code mentioned above. Follow the remaining steps as always and then apply xgboost as below.

In [15]:
model = xgb.XGBClassifier(random_state=2, eta=0.01)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
# Check accuracy
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("\n")
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("\n")
print("Classification report:\n", classification_report(y_test, y_pred))

Accuracy:  0.9571428571428572


Confusion matrix:
 [[138   5]
 [  4  63]]


Classification report:
               precision    recall  f1-score   support

         0.0       0.97      0.97      0.97       143
         1.0       0.93      0.94      0.93        67

    accuracy                           0.96       210
   macro avg       0.95      0.95      0.95       210
weighted avg       0.96      0.96      0.96       210



For the regression problem, replace `XGBClassifier` with `XGBRegressor`.

Parameters:

* <b>nthread</b>:
    * This is used for parallel processing and sets the number of threads XGBoost may use. In the scikit-learn wrapper (`XGBClassifier`), the equivalent parameter is `n_jobs`.
    * If you wish to run on all cores, do not input this value. The algorithm will detect it automatically.
* <b>eta</b>:
    * Analogous to learning rate in GBM.
    * Makes the model more robust by shrinking the weights on each step.
* <b>min_child_weight</b>:
    * Defines the minimum sum of weights of all observations required in a child.
    * Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
* <b>max_depth</b>:
    * It is used to define the maximum depth.
    * Higher depth will allow the model to learn relations very specific to a particular sample.
* <b>max_leaves</b>:
    * The maximum number of terminal nodes or leaves in a tree (the counterpart of max_leaf_nodes in GBM).
    * Can be defined in place of max_depth. Since binary trees are created, a depth of ‘n’ would produce a maximum of 2^n leaves.
    * If this is defined, GBM will ignore max_depth.
* <b>gamma</b>:
    * A node is split only when the resulting split gives a positive reduction in the loss function. Gamma specifies the minimum loss reduction required to make a split.
    * Makes the algorithm conservative. The values can vary depending on the loss function and should be tuned.
* <b>subsample</b>:
    * Same as the subsample of GBM. Denotes the fraction of observations to be randomly sampled for each tree.
    * Lower values make the algorithm more conservative and prevent overfitting but values that are too small might lead to under-fitting.
* <b>colsample_bytree</b>:
    * It is similar to max_features in GBM.
    * Denotes the fraction of columns to be randomly sampled for each tree.

<a id="8"></a>
## Stacking

Stacking is an ensemble learning technique that uses predictions from multiple models (for example decision tree, knn or svm) to build a new model. This model is used for making predictions on the test set. Below is a step-wise explanation for a simple stacked ensemble:

1. The train set is split into 10 parts.
<img src="stack1.png" width=200px>
2. A base model (suppose a decision tree) is fitted on 9 parts and predictions are made for the 10th part. This is done for each part of the train set.
<img src="stack2.png" width=250px>
3. The base model (in this case, decision tree) is then fitted on the whole train dataset.
4. Using this model, predictions are made on the test set.
<img src="stack3.png" width=250px>
5. Steps 2 to 4 are repeated for another base model (say knn) resulting in another set of predictions for the train set and test set.
<img src="stack4.png" width=280px>
6. The predictions from the train set are used as features to build a new model.
<img src="stack5.png" width=200px>
7. This model is used to make final predictions on the test prediction set.
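
The steps above can be written out by hand with `cross_val_predict`, which returns out-of-fold predictions for the train set. This is a sketch under a few assumptions: it uses synthetic data, two base models (a decision tree and KNN), and class probabilities rather than hard labels as the meta-features. It is not the internal code of `StackingClassifier`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the breast cancer dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

base_models = [DecisionTreeClassifier(random_state=2), KNeighborsClassifier()]
train_meta, test_meta = [], []

for base in base_models:
    # Steps 1-2: out-of-fold predictions for every row of the train set (10 folds)
    oof = cross_val_predict(base, X_tr, y_tr, cv=10, method='predict_proba')[:, 1]
    train_meta.append(oof)
    # Steps 3-4: refit on the whole train set and predict the test set
    base.fit(X_tr, y_tr)
    test_meta.append(base.predict_proba(X_te)[:, 1])

# Step 5 is the loop itself: one column of predictions per base model
train_meta = np.column_stack(train_meta)
test_meta = np.column_stack(test_meta)

# Steps 6-7: the meta-model learns from the train predictions, then predicts the test set
meta_model = LogisticRegression().fit(train_meta, y_tr)
final_pred = meta_model.predict(test_meta)

print("Stacked test accuracy:", (final_pred == y_te).mean())
```

Using out-of-fold predictions matters here: if the base models predicted the same rows they were trained on, the meta-model would learn from overly optimistic inputs.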

In [16]:
estimators = [
    ('lr', LogisticRegression(class_weight='balanced', random_state=2)),
    ('knn', KNeighborsClassifier()),
    ('dt', DecisionTreeClassifier(class_weight='balanced', random_state=2)),
    ('svm', SVC(class_weight='balanced', random_state=2))
]
final_estimator = LogisticRegression(class_weight='balanced', random_state=2)
model = StackingClassifier(estimators=estimators, final_estimator=final_estimator, cv=5)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
# Check accuracy
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("\n")
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("\n")
print("Classification report:\n", classification_report(y_test, y_pred))

Accuracy:  0.9809523809523809


Confusion matrix:
 [[141   2]
 [  2  65]]


Classification report:
               precision    recall  f1-score   support

         0.0       0.99      0.99      0.99       143
         1.0       0.97      0.97      0.97        67

    accuracy                           0.98       210
   macro avg       0.98      0.98      0.98       210
weighted avg       0.98      0.98      0.98       210



For the regression problem, replace `StackingClassifier` with `StackingRegressor`.

Parameters:

* <b>estimators</b>:
    * Base estimators which will be stacked together.
    * Each element of the list is defined as a tuple of string (i.e. name) and an estimator instance.
* <b>final_estimator</b>:
    * A classifier which will be used to combine the base estimators.
    * The default classifier is LogisticRegression.
* <b>cv</b>:
    * Determines the cross-validation splitting strategy used in `cross_val_predict` to train `final_estimator`.
    * Possible inputs for cv are:
        * None, to use the default 5-fold cross validation,
        * integer, to specify the number of folds in a (Stratified) KFold,
        * An object to be used as a cross-validation generator,
        * An iterable yielding train, test splits.
* <b>stack_method</b>:
    * Methods called for each base estimator.
* <b>n_jobs</b>:
    * The number of jobs to run in parallel when fitting all `estimators`.
* <b>passthrough</b>:
    * When False, only the predictions of estimators will be used as training data for `final_estimator`.
    * When True, the `final_estimator` is trained on the predictions as well as the original training data.