# Histogram Based Gradient Boosting / LightGBM

Noah Rubin

Self Study - July 2021

Useful Resources:
1. https://medium.com/@mqofori/a-first-look-at-sklearns-histgradientboostingclassifier-9f5bea611c6a
2. https://machinelearningmastery.com/histogram-based-gradient-boosting-ensembles/

---

Main ideas:

* A random forest scales well to larger datasets because its trees are built independently and in parallel, exploiting multiple CPU cores. The small downside is that it might not predict out-of-sample data as accurately as a gradient boosting model, which builds its trees sequentially, each one correcting the errors made by the trees before it.
* However, gradient boosting can quickly become computationally expensive as the sample size increases.
* With [Histogram-Based Gradient Boosting](https://scikit-learn.org/stable/modules/ensemble.html#histogram-based-gradient-boosting), we can create a model that, according to the Sklearn documentation, can be "orders of magnitude faster" than `GradientBoostingRegressor` when the number of samples is larger than tens of thousands.
* It does this through in-built [discretisation](https://www.javatpoint.com/discretization-in-data-mining) of continuous variables into a fixed number of distinct buckets, allowing us to obtain the benefits of boosting models while remaining efficient in both training speed and memory usage.
* Ultimately, "these fast estimators first bin the input samples $X$ into integer-valued bins (typically 256 bins) which tremendously reduces the number of splitting points to consider, and allows the algorithm to leverage integer-based data structures (histograms) instead of relying on sorted continuous values when building the trees." - [Histogram-Based Gradient Boosting Scikit-Learn documentation](https://scikit-learn.org/stable/modules/ensemble.html#histogram-based-gradient-boosting)
---
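
The binning idea quoted above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not scikit-learn's actual implementation: the quantile-based bin edges, the synthetic feature `x` and the stand-in `gradients` array are all assumptions made for the example. The value 255 mirrors the default of the estimator's `max_bins` parameter (one further bin is reserved for missing values).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)           # one synthetic continuous feature (assumption)

max_bins = 255                        # bins available for non-missing values

# Interior bin edges placed at evenly spaced quantiles of the feature
quantiles = np.linspace(0, 1, max_bins + 1)[1:-1]
edges = np.quantile(x, quantiles)

# Map every value to an integer bin index in [0, max_bins - 1]
x_binned = np.searchsorted(edges, x, side="right").astype(np.uint8)

# Candidate split points: between distinct raw values vs. between distinct bins
n_exact_candidates = np.unique(x).size - 1
n_binned_candidates = np.unique(x_binned).size - 1

# A "histogram": per-bin sum of (stand-in) gradients, built in one pass
gradients = rng.normal(size=x.size)
grad_hist = np.bincount(x_binned, weights=gradients, minlength=max_bins)

print(f"Candidate splits on raw values:    {n_exact_candidates}")
print(f"Candidate splits on binned values: {n_binned_candidates}")
```

Because the binned feature fits in a single byte per value and split finding only has to scan at most `max_bins - 1` boundaries, both memory use and the cost of evaluating splits drop sharply compared with working on sorted raw values.
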
* Although binning approximates the input values in the dataset, Histogram-Based Gradient Boosting will often closely match the performance of regular gradient boosting.
* At times it might even result in a slight improvement. In cases where it can't outperform gradient boosting, the difference is often negligible and well worth it given the speed at which this algorithm operates compared to regular gradient boosting.
* Another huge advantage of Histogram-Based Gradient Boosting is that it has in-built techniques to handle missing values, so imputers like a `KNNImputer()` (seen previously) are not needed.
* Ultimately, Histogram-Based Gradient Boosting employs in-built discretisation of continuous values "which tremendously reduces the number of splitting points to consider, and allows the algorithm to leverage integer-based data structures (histograms) instead of relying on sorted continuous values when building the trees." - Sklearn documentation.
* As per the documentation, this Sklearn implementation has been inspired by the [LightGBM](https://lightgbm.readthedocs.io/en/latest/) gradient boosting framework developed by computer scientists at Microsoft

In [1]:
# Python files
import data_prep
import helper_funcs

import joblib
import pandas as pd
import numpy as np

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import HistGradientBoostingRegressor

In [2]:
train = pd.read_csv('../datasets/train_updated.csv')
test = pd.read_csv('../datasets/test_updated.csv')

# Split data
to_drop = ['Country', 'HDI', 'Life_exp']

X_train = train.drop(to_drop, axis='columns')
X_test = test.drop(to_drop, axis='columns')

y_train = train['Life_exp']
y_test = test['Life_exp']

## Hyperparameter Tuning

In [3]:
model_pipeline = data_prep.create_pipeline(HistGradientBoostingRegressor())

param_grid = {
    'model__max_iter': np.arange(100, 300, 5),  # Number of boosting iterations, i.e. trees
    'model__learning_rate': np.linspace(0.05, 1, 10),
    'model__min_samples_leaf': np.arange(3, 30, 1),
    'model__max_depth': np.arange(3, 8),
    'model__l2_regularization': np.linspace(0, 1, 10),
    'model__loss': ['squared_error', 'absolute_error'],
}

# Get the best hyperparameters for each model and use that in the final model
final_model, best_params = data_prep.randomised_search_wrapper(X_train,
                                                               y_train,
                                                               model_pipeline, 
                                                               param_grid, 
                                                               cv=10,
                                                               n_iter=50)


Best Parameters were...

model__min_samples_leaf had optimal value as: 16
model__max_iter had optimal value as: 265
model__max_depth had optimal value as: 7
model__loss had optimal value as: squared_error
model__learning_rate had optimal value as: 0.15555555555555556
model__l2_regularization had optimal value as: 0.5555555555555556

The fitted model just initialised and fit now has all these parameters set up


## Evaluation Metrics

In [4]:
r2, mse, rmse, mae = helper_funcs.display_regression_metrics(y_test, final_model.predict(X_test))

All metrics are in terms of the unseen test set

R^2 = 0.9887922404061699
Mean Squared Error = 0.9080079140680164
Root Mean Squared Error = 0.9528944926213061
Mean Absolute Error = 0.6768143605417666
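
`helper_funcs.display_regression_metrics` is also not shown here. The sketch below shows one way the four reported numbers could be computed with `sklearn.metrics`; the function name `regression_metrics_sketch` and the three-point toy example are hypothetical. Note that the root mean squared error is simply the square root of the mean squared error, which is consistent with the output above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_metrics_sketch(y_true, y_pred):
    """Hypothetical stand-in for helper_funcs.display_regression_metrics."""
    r2 = r2_score(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)                      # RMSE is the square root of MSE
    mae = mean_absolute_error(y_true, y_pred)

    print(f"R^2 = {r2}")
    print(f"Mean Squared Error = {mse}")
    print(f"Root Mean Squared Error = {rmse}")
    print(f"Mean Absolute Error = {mae}")
    return r2, mse, rmse, mae


# Toy example (assumption): residuals are 1, 0 and -2
y_true_demo = np.array([3.0, 5.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 9.0])
r2_demo, mse_demo, rmse_demo, mae_demo = regression_metrics_sketch(y_true_demo, y_pred_demo)
```
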


## Save Model

In [5]:
joblib.dump(final_model, './saved_models/Histogram-Based Gradient Boosting.joblib')

['./saved_models/Histogram-Based Gradient Boosting.joblib']