![Ensemble_Learning](https://raw.githubusercontent.com/satishgunjal/images/master/Ensemble_Learning.png)

# Index
* [Introduction](#1)
* [Bagging](#2)
* [Boosting](#3)
* [Stacking](#4)
* [Python Example](#5)
  - [Import Libraries](#6)
  - [Load Data](#7)
  - [Train and Test Data](#8)
  - [Modeling](#9)
    - [Linear Regression](#10)
    - [Lasso Regression](#11)
    - [ElasticNet Regression](#12)
    - [KernelRidge Regression](#13)
  - [Ensemble Modeling](#14)
    - [Bagging](#15)
    - [Boosting](#16)
      - [GradientBoostingRegressor](#17)
      - [XGBRegressor](#18)
      - [LGBMRegressor](#19)
    - [Stacking](#20)
    
# Introduction <a id= "1"></a>

Whenever we make an important decision, we first discuss it with friends, family or an expert. Nowadays we also check reviews on social media or watch a YouTube video. Considering other people's opinions makes the final decision more informed and helps avoid surprises, because we are combining multiple opinions about the same thing.

Ensemble modeling in machine learning operates on the same principle: we combine the predictions from multiple models to produce a final model that provides better overall performance. Ensemble modeling helps the model generalize from the training data so that it can make accurate predictions on unseen data.

Modeling is one of the most important steps in the machine learning pipeline. The main motivation behind ensemble learning is to combine weak models correctly to get a more accurate and robust model with a better bias-variance trade-off. For example, the Random Forest algorithm is an ensemble of decision trees, and because it combines multiple decision tree models it usually performs better than a single decision tree.

Depending on how we combine the base models, ensemble learning can be classified into three types: Bagging, Boosting and Stacking.

* **Bagging**: Several base models are built independently and their predictions are averaged to produce the final prediction.
* **Boosting**: Models are built sequentially, and each one tries to reduce the bias of the combined prediction.
* **Stacking**: The predictions of each individual model are stacked together and used as input to a final estimator to compute the prediction.

The ensemble learning approach makes the model more robust and helps to achieve better performance.

# Bagging <a id= "2"></a>

![Ensemble_Learning_Bagging](https://raw.githubusercontent.com/satishgunjal/images/master/Ensemble_Learning_Bagging.png)

* In bagging we build independent estimators on different samples of the original data set and average or vote across all the predictions.
* Bagging is short for **B**ootstrap **Agg**regat**ing**. It is an ensemble learning approach used to improve the stability and accuracy of machine learning algorithms.
* Since multiple model predictions are averaged together to form the final predictions, bagging reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method.
* Bagging is a special case of the model averaging approach: for a regression problem we take the mean of the outputs, and for classification we take the majority vote.
* Bagging is most helpful when the base models overfit (have high variance).
* We can build independent estimators of the same type on each subset. Because these estimators are independent, they can be trained in parallel, which speeds up training.
* The most popular bagging-based estimator is the Random Forest, which builds on bagged decision trees ('Bagging Trees') and additionally considers only a random subset of features at each split.

**Bootstrapping**
* Bootstrapping is a resampling technique in which a large number of smaller samples of the same size are repeatedly drawn, with replacement, from a single original sample.
* This technique lets us produce as many subsamples as we need from the original training data.
* The definition is simple, but the word "replacement" may be confusing. Here it means that the same observation may appear more than once in a given sample, which is why this technique is also known as **sampling with replacement**.
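
A minimal sketch of sampling with replacement, assuming NumPy and ten placeholder observations named X1 to X10 (the seed and variable names are illustrative, not part of the original notebook):

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible

observations = np.array([f"X{i}" for i in range(1, 11)])  # X1 ... X10
n = len(observations)

# Draw n indices uniformly at random WITH replacement:
# the same index can be drawn more than once.
idx = rng.integers(0, n, size=n)
bootstrap_sample = observations[idx]

# Observations that were never drawn are "out-of-bag" for this sample.
out_of_bag = np.setdiff1d(observations, bootstrap_sample)

print("Bootstrap sample:", bootstrap_sample)
print("Out-of-bag      :", out_of_bag)
```

Running it with different seeds gives different samples, each the same size as the original data and usually containing some repeated observations.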

![Bootstrap_Sampling_ML](https://raw.githubusercontent.com/satishgunjal/images/master/Bootstrap_Sampling_ML.png)

* As you can see in the above image, we have training data with observations from X1 to X10. In the first bootstrap training sample X6, X10 and X2 are repeated, whereas in the second training sample X3, X4, X7 and X9 are repeated.
* Bootstrap sampling helps us generate a random sample from the given training data for each model, in order to generalize the final estimation.

So in bagging we create multiple bootstrap samples from the given data to train our base models. Each sample differs from the others and may contain duplicate observations; the observations left out of a given sample can be used as test data for the model trained on it.
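
The whole bagging procedure can be sketched from scratch in a few lines. This is only an illustration under stated assumptions: the noisy sine data, the degree-5 polynomial used as a high-variance base model, and all names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: a noisy sine curve (assumed for illustration only).
x = np.sort(rng.uniform(0, 6, size=80))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
x_new = np.linspace(0.5, 5.5, 50)
true_curve = np.sin(x_new)

def fit_predict(x_train, y_train, x_eval, degree=5):
    """Base model: a polynomial least-squares fit."""
    coefs = np.polyfit(x_train, y_train, degree)
    return np.polyval(coefs, x_eval)

n_models = 25
all_preds = []
for _ in range(n_models):
    idx = rng.integers(0, x.size, size=x.size)      # bootstrap sample of row indices
    all_preds.append(fit_predict(x[idx], y[idx], x_new))

all_preds = np.array(all_preds)        # shape (n_models, 50)
bagged_pred = all_preds.mean(axis=0)   # aggregate by averaging

individual_mse = ((all_preds - true_curve) ** 2).mean(axis=1)
bagged_mse = ((bagged_pred - true_curve) ** 2).mean()

print(f"Mean MSE of individual models: {individual_mse.mean():.4f}")
print(f"MSE of the bagged prediction : {bagged_mse:.4f}")
```

For squared error, the error of the averaged prediction can never exceed the average error of the individual models, which is the variance reduction that bagging relies on.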

# Boosting <a id= "3"></a>
* In boosting, machine learning models are trained one after the other, and each new model focuses on the errors made by the models before it. The final prediction combines the predictions of all the models in the sequence.
* So boosting enables each subsequent model to boost the performance of the previous ones by overcoming or reducing their errors.
* Unlike bagging, in boosting the base learners are trained in sequence on a weighted version of the data. Boosting is more helpful if we have biased (underfitting) base models.
* Boosting can be used to solve regression and classification problems.
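
The sequential idea can be sketched with a tiny gradient-boosting loop for squared error, in which each weak learner (a one-split "stump") is fitted to the residuals of the ensemble built so far. This is one flavor of boosting chosen for brevity, not the weighted-data variant mentioned above; the data, the `fit_stump` helper and the learning rate are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 6, size=100))
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

def fit_stump(x, target):
    """Weak learner: the single threshold split with the lowest squared error."""
    best = None
    for threshold in np.unique(x)[:-1]:   # skip the max so both sides are non-empty
        left, right = target[x <= threshold], target[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return lambda q: np.where(q <= threshold, left_value, right_value)

learning_rate = 0.3
prediction = np.full_like(y, y.mean())            # stage 0: a constant model
train_mse = [float(np.mean((y - prediction) ** 2))]

for _ in range(50):
    residual = y - prediction                     # errors of the ensemble so far
    stump = fit_stump(x, residual)                # next learner targets those errors
    prediction = prediction + learning_rate * stump(x)
    train_mse.append(float(np.mean((y - prediction) ** 2)))

print(f"Training MSE at stage 0 : {train_mse[0]:.4f}")
print(f"Training MSE at stage 50: {train_mse[-1]:.4f}")
```

Each stage on its own is a very biased model, but the sum of many small corrections drives the training error down step by step.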

![Ensemble_Learning_Boosting](https://raw.githubusercontent.com/satishgunjal/images/master/Ensemble_Learning_Boosting.png)

Different types of boosting algorithms:
* Gradient Boosting Machine (GBM)
* Extreme Gradient Boosting (XGBoost)
* LightGBM
* CatBoost

# Stacking <a id= "4"></a>
Model stacking is a method for combining models to reduce their biases. The predictions of each individual model are stacked together and used as input to a final estimator that computes the prediction. This final estimator is trained on cross-validated predictions of the base models.

Note that in stacking we use heterogeneous weak learners (different learning algorithms), whereas in bagging and boosting we mainly use homogeneous weak learners.
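
A hand-rolled sketch of this idea, assuming scikit-learn and a synthetic dataset from `make_regression`. The choice of base models (`Ridge`, `DecisionTreeRegressor`) and of `LinearRegression` as the final estimator is purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Heterogeneous base models (two different learning algorithms).
base_models = [Ridge(alpha=1.0), DecisionTreeRegressor(max_depth=4, random_state=0)]

# Level 0: out-of-fold predictions, so the final estimator never sees a
# prediction made by a model that was trained on that same row.
oof = np.column_stack([cross_val_predict(m, X_tr, y_tr, cv=5) for m in base_models])

# Level 1: the final estimator learns how to combine the base predictions.
meta = LinearRegression().fit(oof, y_tr)

# At prediction time the base models are refit on all the training data.
test_features = np.column_stack([m.fit(X_tr, y_tr).predict(X_te) for m in base_models])
stacked_pred = meta.predict(test_features)

print("Weights learned by the final estimator:", meta.coef_)
```

scikit-learn's `StackingRegressor` wraps these steps in a single estimator.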

![Ensemble_Learning_Stacking](https://raw.githubusercontent.com/satishgunjal/images/master/Ensemble_Learning_Stacking.png)

# When to use Ensemble Learning?
Since ensemble learning results in better accuracy and higher consistency, and also helps manage the bias-variance trade-off, shouldn't we use it everywhere? The short answer is that it depends on the problem at hand. If our model is not performing well with the available training data, shows signs of overfitting or underfitting, and additional compute power is not an issue, then ensemble learning is a good option. However, one shouldn't skip the first steps of improving the input data and trying different hyperparameters before going for an ensemble approach.

# Python Example <a id= "5"></a>

We are going to use [House Prices: Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) competition data. Our objective is to predict the final price of each house based on the 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa. We will try all the ensemble learning approaches and compare their results.

![House_Prices_Advanced_Regression_Techniques](https://raw.githubusercontent.com/satishgunjal/images/master/House_Prices_Advanced_Regression_Techniques.png)

## Import Libraries <a id= "6"></a>

In [1]:
import os
import pandas as pd
import numpy as np
import warnings

from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LinearRegression, Lasso, ElasticNet
from sklearn.kernel_ridge import KernelRidge

from sklearn.ensemble import BaggingRegressor

from sklearn.ensemble import GradientBoostingRegressor
import xgboost as xgb
import lightgbm as lgb
from sklearn.ensemble import StackingRegressor

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import mean_squared_error

# Global settings

warnings.filterwarnings("ignore") # To ignore warnings
n_jobs = -1 # Controls parallel processing. -1 means using all processors.
random_state = 42 # Seed for the random number generators. Using a fixed int value gives the same results every time this code is run.

## Load Data <a id= "7"></a>
Since the objective of this article is to test the different ensemble techniques, I have excluded the data preprocessing and EDA steps. I am going to use a model-ready dataset so that we can start modeling and ensembling straight away. **X.csv** contains the features of the training data and **y.csv** contains the label values. In this case "SalePrice" is the label/target variable: the property's sale price in dollars, which we are trying to predict.

Below is the list of features:
* MSSubClass: The building class
* MSZoning: The general zoning classification
* LotFrontage: Linear feet of street connected to property
* LotArea: Lot size in square feet
* Street: Type of road access
* Alley: Type of alley access
* LotShape: General shape of property
* LandContour: Flatness of the property
* Utilities: Type of utilities available
* LotConfig: Lot configuration
* LandSlope: Slope of property
* Neighborhood: Physical locations within Ames city limits
* Condition1: Proximity to main road or railroad
* Condition2: Proximity to main road or railroad (if a second is present)
* BldgType: Type of dwelling
* HouseStyle: Style of dwelling
* OverallQual: Overall material and finish quality
* OverallCond: Overall condition rating
* YearBuilt: Original construction date
* YearRemodAdd: Remodel date
* RoofStyle: Type of roof
* RoofMatl: Roof material
* Exterior1st: Exterior covering on house
* Exterior2nd: Exterior covering on house (if more than one material)
* MasVnrType: Masonry veneer type
* MasVnrArea: Masonry veneer area in square feet
* ExterQual: Exterior material quality
* ExterCond: Present condition of the material on the exterior
* Foundation: Type of foundation
* BsmtQual: Height of the basement
* BsmtCond: General condition of the basement
* BsmtExposure: Walkout or garden level basement walls
* BsmtFinType1: Quality of basement finished area
* BsmtFinSF1: Type 1 finished square feet
* BsmtFinType2: Quality of second finished area (if present)
* BsmtFinSF2: Type 2 finished square feet
* BsmtUnfSF: Unfinished square feet of basement area
* TotalBsmtSF: Total square feet of basement area
* Heating: Type of heating
* HeatingQC: Heating quality and condition
* CentralAir: Central air conditioning
* Electrical: Electrical system
* 1stFlrSF: First Floor square feet
* 2ndFlrSF: Second floor square feet
* LowQualFinSF: Low quality finished square feet (all floors)
* GrLivArea: Above grade (ground) living area square feet
* BsmtFullBath: Basement full bathrooms
* BsmtHalfBath: Basement half bathrooms
* FullBath: Full bathrooms above grade
* HalfBath: Half baths above grade
* Bedroom: Number of bedrooms above basement level
* Kitchen: Number of kitchens
* KitchenQual: Kitchen quality
* TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
* Functional: Home functionality rating
* Fireplaces: Number of fireplaces
* FireplaceQu: Fireplace quality
* GarageType: Garage location
* GarageYrBlt: Year garage was built
* GarageFinish: Interior finish of the garage
* GarageCars: Size of garage in car capacity
* GarageArea: Size of garage in square feet
* GarageQual: Garage quality
* GarageCond: Garage condition
* PavedDrive: Paved driveway
* WoodDeckSF: Wood deck area in square feet
* OpenPorchSF: Open porch area in square feet
* EnclosedPorch: Enclosed porch area in square feet
* 3SsnPorch: Three season porch area in square feet
* ScreenPorch: Screen porch area in square feet
* PoolArea: Pool area in square feet
* PoolQC: Pool quality
* Fence: Fence quality
* MiscFeature: Miscellaneous feature not covered in other categories
* MiscVal: $Value of miscellaneous feature
* MoSold: Month Sold
* YrSold: Year Sold
* SaleType: Type of sale
* SaleCondition: Condition of sale

In [2]:
X = pd.read_csv('/kaggle/input/modelling-ready-data/X.csv')
print(f'Shape of X= {X.shape}')
X.head()

Shape of X= (1458, 220)


| | MSSubClass | LotFrontage | LotArea | Street | Alley | LotShape | LandSlope | OverallQual | OverallCond | YearBuilt | ... | SaleType_ConLw | SaleType_New | SaleType_Oth | SaleType_WD | SaleCondition_Abnorml | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.885846 | 5.831328 | 19.212182 | 0.730463 | 0.730463 | 1.540963 | 0.0 | 2.440268 | 1.820334 | 14.187527 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 2.055642 | 6.221214 | 19.712205 | 0.730463 | 0.730463 | 1.540963 | 0.0 | 2.259674 | 2.440268 | 14.145138 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 2.885846 | 5.91494 | 20.347241 | 0.730463 | 0.730463 | 0.0 | 0.0 | 2.440268 | 1.820334 | 14.184404 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 3.01134 | 5.684507 | 19.691553 | 0.730463 | 0.730463 | 0.0 | 0.0 | 2.440268 | 1.820334 | 14.047529 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2.885846 | 6.314735 | 21.32516 | 0.730463 | 0.730463 | 0.0 | 0.0 | 2.602594 | 1.820334 | 14.182841 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |


In [3]:
y = pd.read_csv('/kaggle/input/modelling-ready-data/y.csv')
print(f'Shape of y= {y.shape}')
y.head()

Shape of y= (1458, 1)


| | 0 |
|---|---|
| 0 | 12.247699 |
| 1 | 12.109016 |
| 2 | 12.317171 |
| 3 | 11.849405 |
| 4 | 12.42922 |


##  Train and Test Data <a id= "8"></a>
We will use train_test_split() method to create training and test sets.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.33, random_state = random_state)

print(f'Training set--> X_train shape= {X_train.shape}, y_train shape= {y_train.shape}')
print(f'Holdout set--> X_test shape= {X_test.shape}, y_test shape= {y_test.shape}')

Training set--> X_train shape= (976, 220), y_train shape= (976, 1)
Holdout set--> X_test shape= (482, 220), y_test shape= (482, 1)
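
The shapes above can be checked by hand. As far as I understand scikit-learn's behavior, a fractional `test_size` is multiplied by the number of rows and rounded up, and the remaining rows form the training set. A small sketch of that arithmetic:

```python
import math

n_samples = 1458
test_size = 0.33

n_test = math.ceil(test_size * n_samples)   # 0.33 * 1458 = 481.14, rounded up
n_train = n_samples - n_test                # the remainder goes to training

print(n_train, n_test)  # prints: 976 482
```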


## Modeling <a id= "9"></a>
* I am also skipping hyperparameter tuning and using already tuned hyperparameters here.
* We will use multiple models individually as well as in ensemble mode and compare their final predictions.
* We are going to use the **Root Mean Square Error (RMSE)** metric to compare the scores. To avoid repeating the fit, predict and score steps for every model, we will create a helper function for it.

Note: RMSE expresses the loss in the same unit as the label. The label values shown above (around 12) indicate that the target is not in raw dollars; it appears to have been transformed (most likely log-transformed) during preprocessing, so the scores below are on that transformed scale. For example, an RMSE of 0.11 can be loosely read as 'predictions are typically off by about 0.11 on the scale of the label'.
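
For reference, RMSE is the square root of the mean of the squared differences between the true and predicted values. A minimal NumPy sketch (the `rmse_manual` name is hypothetical and used only here):

```python
import numpy as np

def rmse_manual(y_true, y_pred):
    """Root Mean Square Error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(rmse_manual([12.0, 11.5, 12.5], [12.0, 11.5, 12.5]))  # perfect predictions give 0.0
print(rmse_manual([3.0, 5.0], [1.0, 5.0]))                  # sqrt((4 + 0) / 2) = sqrt(2)
```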

In [5]:
models_scores = [] # To store model scores

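def rmse(model):
    """Fit the model on the training set and return its RMSE on the holdout set."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Taking np.sqrt of the MSE works across scikit-learn versions (the squared=False argument is deprecated in newer releases)
    return np.sqrt(mean_squared_error(y_test, y_pred))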

### Linear Regression <a id= "10"></a>

* Let's test using ordinary least squares Linear Regression.
* LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation.

In [6]:
linear_regression = make_pipeline(LinearRegression())
score = rmse(linear_regression)

models_scores.append(['LinearRegression', score])
print(f'LinearRegression Score= {score}')

LinearRegression Score= 0.13179065013062716
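
To make "minimize the residual sum of squares" concrete, here is a small sketch that solves the same least-squares problem directly with NumPy on synthetic data. The data and the coefficient values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_w + 4.0 + rng.normal(scale=0.1, size=200)

# Add a column of ones so the intercept is estimated together with the weights.
X_design = np.column_stack([np.ones(len(X_demo)), X_demo])

# lstsq returns the coefficients that minimize ||X_design @ coef - y_demo||^2.
coef, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)
intercept, w = coef[0], coef[1:]

print("Estimated intercept:", round(float(intercept), 3))
print("Estimated weights  :", np.round(w, 3))
```

The estimates land close to the values used to generate the data, which is what ordinary least squares is expected to recover when the noise is small.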


### Lasso Regression <a id= "11"></a>
* The Lasso is a linear model that estimates sparse coefficients. It is useful in some contexts due to its tendency to prefer solutions with fewer non-zero coefficients, effectively reducing the number of features upon which the given solution is dependent. 
* This model may be very sensitive to outliers, so we need to make it more robust to them. For that we use sklearn's RobustScaler in the pipeline.

In [7]:
lasso = make_pipeline(RobustScaler(), Lasso(alpha=0.0005, random_state= random_state))

score = rmse(lasso)
models_scores.append(['Lasso', score])
print(f'Lasso Score= {score}')

Lasso Score= 0.1110532696244382
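
The sparsity effect can be seen on a synthetic problem where only a few features matter. This is a sketch under assumptions: the data, the `alpha=0.1` value and the `*_demo` names are illustrative and unrelated to the tuned values used above.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 20))
# Only the first 3 of the 20 features actually influence the target.
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + X_demo[:, 2] + rng.normal(scale=0.5, size=200)

ols_demo = LinearRegression().fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=0.1).fit(X_demo, y_demo)

n_ols = int(np.sum(ols_demo.coef_ != 0))
n_lasso = int(np.sum(lasso_demo.coef_ != 0))

print("Non-zero OLS coefficients  :", n_ols)
print("Non-zero Lasso coefficients:", n_lasso)
```

Ordinary least squares assigns some weight to every feature, while the L1 penalty pushes the coefficients of uninformative features to exactly zero.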


### ElasticNet Regression <a id= "12"></a>
* Elastic-net is useful when there are multiple features which are correlated with one another. Lasso is likely to pick one of these at random, while elastic-net is likely to pick both.
* A practical advantage of trading-off between Lasso and Ridge is that it allows Elastic-Net to inherit some of Ridge’s stability under rotation.
* This model may be very sensitive to outliers, so we need to make it more robust to them. For that we use sklearn's RobustScaler in the pipeline.

In [8]:
elastic_net = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio= .9, random_state= random_state))

score = rmse(elastic_net)
models_scores.append(['ElasticNet', score])
print(f'ElasticNet Score= {score}')

ElasticNet Score= 0.11107756118615623
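
To see why robust scaling helps with outliers, the sketch below reproduces the idea by hand: center on the median and scale by the interquartile range, which I believe matches RobustScaler's default settings. The six values are made up for illustration.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0])  # one extreme outlier

# Robust scaling: center on the median, scale by the interquartile range.
median = np.median(values)
q1, q3 = np.percentile(values, [25, 75])
robust_scaled = (values - median) / (q3 - q1)

# Standard scaling for comparison: the mean and the standard deviation
# are both dragged towards the outlier.
standard_scaled = (values - values.mean()) / values.std()

print("Robust  :", np.round(robust_scaled, 3))
print("Standard:", np.round(standard_scaled, 3))
```

With standard scaling the five ordinary values are squeezed into almost the same number, whereas robust scaling keeps them clearly separated.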


### KernelRidge Regression <a id= "13"></a>
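* Ridge regression addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of the coefficients.
* KernelRidge combines ridge regression with the kernel trick. Here we use a polynomial kernel of degree 2.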

In [9]:
kernel_ridge= KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)
score = rmse(kernel_ridge)
models_scores.append(['KernelRidge', score])
print(f'KernelRidge Score= {score}')

KernelRidge Score= 0.13639161839448324
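
A from-scratch sketch of kernel ridge regression with a polynomial kernel, assuming NumPy only. The `gamma = 1 / n_features` default is my understanding of how scikit-learn's polynomial kernel behaves when `gamma` is not set, and the noiseless quadratic target and the tiny `alpha` are chosen only to make the effect easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(P):
    """A quadratic function of two inputs (illustrative only)."""
    return 1.0 + 2.0 * P[:, 0] - P[:, 1] ** 2 + 0.5 * P[:, 0] * P[:, 1]

def polynomial_kernel(A, B, degree=2, coef0=2.5, gamma=None):
    """k(a, b) = (gamma * <a, b> + coef0) ** degree."""
    if gamma is None:
        gamma = 1.0 / A.shape[1]   # assumed default: 1 / n_features
    return (gamma * A @ B.T + coef0) ** degree

X_fit = rng.uniform(-1, 1, size=(60, 2))
y_fit = target(X_fit)

alpha = 1e-6   # tiny ridge penalty, so the fit is nearly exact on this noiseless toy
K = polynomial_kernel(X_fit, X_fit)
dual_coef = np.linalg.solve(K + alpha * np.eye(len(X_fit)), y_fit)

# Predictions are kernel-weighted combinations of the training points.
X_new = rng.uniform(-1, 1, size=(5, 2))
y_new_pred = polynomial_kernel(X_new, X_fit) @ dual_coef

print("Largest absolute error:", float(np.max(np.abs(y_new_pred - target(X_new)))))
```

A degree-2 polynomial kernel lets a linear method fit curved relationships such as this quadratic; a larger `alpha` (0.6 in the cell above) trades some of that flexibility for stability.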


In [10]:
# Ranking the scores of each model
pd.DataFrame(models_scores).sort_values(by=[1], ascending=True)

| | 0 | 1 |
|---|---|---|
| 1 | Lasso | 0.111053 |
| 2 | ElasticNet | 0.111078 |
| 0 | LinearRegression | 0.131791 |
| 3 | KernelRidge | 0.136392 |


## Ensemble Modeling <a id= "14"></a>

### Bagging <a id= "15"></a>
* We are going to use sklearn's "**BaggingRegressor**" to fit the base regressors (LinearRegression, Lasso, ElasticNet, KernelRidge).
* A Bagging regressor is an ensemble meta-estimator that fits base regressors each on random subsets of the original dataset and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction.
* In particular, **max_samples** and **max_features** control the size of the subsets (in terms of samples and features), while **bootstrap** and **bootstrap_features** control whether samples and features are drawn with or without replacement.
* We are using 10 base estimators in each ensemble.
* The bagging_predictions() function prints the score of the bagged ensemble for each base estimator against the test data and returns the test predictions.
* Using **column_stack** we store the predictions for each base estimator in a separate column and then take the average of all the predictions for the final RMSE calculation.

In [11]:
def bagging_predictions(estimator):
    """
    I/P
    estimator: The base estimator from which the ensemble is grown.
    O/P
    br_y_pred: Predictions on test data for the base estimator.
    
    """
    # The base estimator is passed positionally because its keyword was renamed
    # from `base_estimator` to `estimator` in scikit-learn 1.2.
    regr = BaggingRegressor(estimator,
                            n_estimators=10,
                            max_samples=1.0,
                            bootstrap=True, # Samples are drawn with replacement
                            n_jobs= n_jobs,
                            random_state=random_state).fit(X_train, y_train)

    br_y_pred = regr.predict(X_test)

    rmse_val = np.sqrt(mean_squared_error(y_test, br_y_pred)) # Root Mean Square Error

    print(f'RMSE for base estimator {estimator} = {rmse_val}\n')
    return br_y_pred


predictions = np.column_stack((bagging_predictions(linear_regression),
                              bagging_predictions(lasso),
                              bagging_predictions(elastic_net),
                              bagging_predictions(kernel_ridge)))
print(f"Bagged predictions shape: {predictions.shape}")
       
y_pred = np.mean(predictions, axis=1)
print("Aggregated predictions (y_pred) shape", y_pred.shape)

rmse_val = np.sqrt(mean_squared_error(y_test, y_pred)) # Root Mean Square Error
models_scores.append(['Bagging', rmse_val])

print(f'\nBagging RMSE= {rmse_val}')

RMSE for base estimator Pipeline(steps=[('linearregression', LinearRegression())]) = 0.12130723991069779

RMSE for base estimator Pipeline(steps=[('robustscaler', RobustScaler()),
                ('lasso', Lasso(alpha=0.0005, random_state=42))]) = 0.10947971499614259

RMSE for base estimator Pipeline(steps=[('robustscaler', RobustScaler()),
                ('elasticnet',
                 ElasticNet(alpha=0.0005, l1_ratio=0.9, random_state=42))]) = 0.10955384344423397

RMSE for base estimator KernelRidge(alpha=0.6, coef0=2.5, degree=2, kernel='polynomial') = 0.14435769550845381

Bagged predictions shape: (482, 4)
Aggregated predictions (y_pred) shape (482,)

Bagging RMSE= 0.1116490577989883


In [12]:
# Ranking the scores of each model
pd.DataFrame(models_scores).sort_values(by=[1], ascending=True)

| | 0 | 1 |
|---|---|---|
| 1 | Lasso | 0.111053 |
| 2 | ElasticNet | 0.111078 |
| 4 | Bagging | 0.111649 |
| 0 | LinearRegression | 0.131791 |
| 3 | KernelRidge | 0.136392 |


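As you can see from the above results, the bagged "KernelRidge" estimator has the highest RMSE (0.1444), and it pulls the averaged predictions down: the overall bagging RMSE (0.1116) is slightly higher, i.e. worse, than that of "Lasso" and "ElasticNet".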
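
The effect of one weak member on an equal-weight average can be reproduced on synthetic data. In this sketch the targets and the three "models" (two accurate, one noisy) are simulated, so the numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(loc=12.0, scale=0.4, size=1000)   # synthetic targets

# Three hypothetical models: two accurate ones and one noticeably noisier.
preds = np.column_stack((
    y_true + rng.normal(scale=0.10, size=y_true.size),
    y_true + rng.normal(scale=0.10, size=y_true.size),
    y_true + rng.normal(scale=0.50, size=y_true.size),
))                                       # shape (1000, 3): one column per model

def rmse_of(pred):
    return float(np.sqrt(np.mean((y_true - pred) ** 2)))

individual = [rmse_of(preds[:, j]) for j in range(preds.shape[1])]
averaged = rmse_of(preds.mean(axis=1))   # equal-weight average across models

print("Individual RMSEs:", np.round(individual, 3))
print("Averaged RMSE   :", round(averaged, 3))
```

The average beats the weak model but is worse than either of the accurate ones, which is why dropping or down-weighting a poor base estimator can improve the ensemble.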

### Boosting <a id= "16"></a>

We are going to use GradientBoostingRegressor, XGBRegressor, LGBMRegressor algorithms.

#### GradientBoostingRegressor <a id= "17"></a>

* Gradient Boosting for regression.
* GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage a regression tree is fit on the negative gradient of the given loss function.

In [13]:
gradient_boosting_regressor= GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=10, 
                                   loss='huber', random_state = random_state)

score = rmse(gradient_boosting_regressor)
models_scores.append(['GradientBoostingRegressor', score])
print(f'GradientBoostingRegressor Score= {score}')

GradientBoostingRegressor Score= 0.12087469712016406
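
The stage-wise behavior can be inspected with scikit-learn's `staged_predict`, which yields the ensemble's prediction after each boosting stage. This sketch uses a synthetic dataset and smaller, illustrative hyperparameters than the cell above so that it runs quickly.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_regression(n_samples=400, n_features=10, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=0)

gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05,
                                max_depth=3, random_state=0).fit(X_tr, y_tr)

# Holdout RMSE after 1, 2, ..., 200 trees.
test_rmse = [float(np.sqrt(np.mean((y_te - stage_pred) ** 2)))
             for stage_pred in gbr.staged_predict(X_te)]

best_stage = int(np.argmin(test_rmse)) + 1
print(f"RMSE after 1 stage: {test_rmse[0]:.2f}")
print(f"RMSE after {len(test_rmse)} stages: {test_rmse[-1]:.2f}")
print(f"Lowest holdout RMSE at stage {best_stage}")
```

Plotting `test_rmse` against the stage number is a simple way to judge whether `n_estimators` is too small or unnecessarily large for a given `learning_rate`.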


#### XGBRegressor <a id= "18"></a>
* XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable.
* It implements machine learning algorithms under the Gradient Boosting framework.
* XGBoost provides parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way.

In [14]:
xgb_regressor= xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, 
                             learning_rate=0.05, max_depth=3, 
                             min_child_weight=1.7817, n_estimators=2200,
                             reg_alpha=0.4640, reg_lambda=0.8571,
                             subsample=0.5213,verbosity=0, nthread = -1, random_state = random_state)
score = rmse(xgb_regressor)
models_scores.append(['XGBRegressor', score])
print(f'XGBRegressor Score= {score}')

XGBRegressor Score= 0.11566132041864456


#### LGBMRegressor <a id= "19"></a>
LightGBM is a gradient boosting framework that uses tree-based learning algorithms.

In [15]:
lgbm_regressor= lgb.LGBMRegressor(objective='regression',num_leaves=5,
                              learning_rate=0.05, n_estimators=720,
                              max_bin = 55, bagging_fraction = 0.8,
                              bagging_freq = 5, feature_fraction = 0.2319,
                              feature_fraction_seed=9, bagging_seed=9,
                              min_data_in_leaf =6, min_sum_hessian_in_leaf = 11,random_state = random_state)

score = rmse(lgbm_regressor)
models_scores.append(['LGBMRegressor', score])
print(f'LGBMRegressor Score= {score}')

LGBMRegressor Score= 0.118848879537224


### Stacking <a id= "20"></a>

* We can use sklearn's **StackingClassifier** and **StackingRegressor** for classification and regression problems respectively.
* Since "lasso" is our best performing model, we will use it as our meta learner and a subset of the remaining models (ElasticNet, KernelRidge and XGBRegressor) as base estimators.

In [16]:
estimators = [ ('elastic_net', elastic_net), ('kernel_ridge', kernel_ridge),('xgb_regressor', xgb_regressor) ]

stack = StackingRegressor(estimators=estimators, final_estimator= lasso, cv= 5, n_jobs= n_jobs, passthrough = True)

stack.fit(X_train, y_train)

pred = stack.predict(X_test)

rmse_val = np.sqrt(mean_squared_error(y_test, pred)) # Root Mean Square Error
models_scores.append(['Stacking', rmse_val])
print(f'rmse= {rmse_val}')

rmse= 0.10994668918977511


In [17]:
# Ranking the scores of each model
pd.DataFrame(models_scores).sort_values(by=[1], ascending=True)

| | 0 | 1 |
|---|---|---|
| 8 | Stacking | 0.109947 |
| 1 | Lasso | 0.111053 |
| 2 | ElasticNet | 0.111078 |
| 4 | Bagging | 0.111649 |
| 6 | XGBRegressor | 0.115661 |
| 7 | LGBMRegressor | 0.118849 |
| 5 | GradientBoostingRegressor | 0.120875 |
| 0 | LinearRegression | 0.131791 |
| 3 | KernelRidge | 0.136392 |


As you can see from the above list, 'Stacking' achieved the best score among the models we tried. We can try different combinations of models until we get the best possible results. Apart from the Bagging, Boosting and Stacking ensembling methods, we can also try a model blending approach, where we assign a weight to each model and combine the results until we get the best possible score.
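
A minimal sketch of the blending idea for two models, assuming their predictions on a validation set are already available. Here both prediction vectors are simulated, and the names and the weight grid are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
y_val = rng.normal(loc=12.0, scale=0.4, size=500)           # stand-in validation targets
pred_a = y_val + rng.normal(scale=0.10, size=y_val.size)    # hypothetical model A
pred_b = y_val + rng.normal(scale=0.15, size=y_val.size)    # hypothetical model B

def rmse_of(pred):
    return float(np.sqrt(np.mean((y_val - pred) ** 2)))

# Try weights w for model A (and 1 - w for model B) on a coarse grid.
weights = np.linspace(0.0, 1.0, 21)
scores = [rmse_of(w * pred_a + (1 - w) * pred_b) for w in weights]
best_w = float(weights[int(np.argmin(scores))])

print(f"RMSE model A: {rmse_of(pred_a):.4f}")
print(f"RMSE model B: {rmse_of(pred_b):.4f}")
print(f"Best blend  : w={best_w:.2f}, RMSE={min(scores):.4f}")
```

The weights should be chosen on validation data rather than on the final holdout set, otherwise the reported score becomes optimistic.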