# XGBoost, LightGBM, CatBoost and Scikit-Learn Gradient Boosting Performance Comparison
In this notebook we'll compare the speed and accuracy of several gradient boosting implementations from Scikit-Learn, XGBoost, LightGBM and CatBoost.

With so many implementations available, it is often unclear which one to choose for a given machine learning problem. In this notebook we therefore train a classifier with each library, using its default settings, and compare speed and accuracy to see which one comes out ahead.

> ## Import Required Modules



In [4]:
# from IPython.core.debugger import set_trace
# %load_ext nb_black
import numpy as np
import pandas as pd
from time import time
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold

> # Gradient Boosting Machine | GBM
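Gradient boosting is an ensemble algorithm that builds decision trees sequentially: each new tree is fit to the negative gradient of the loss at the current predictions, so every added tree corrects part of the error left by the trees before it.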
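To make the idea concrete, here is a minimal sketch of gradient boosting for squared-error regression, where the negative gradient is just the residual. It is a toy illustration, not how any of the libraries compared below are implemented. The function names, the demo data (`X_demo`, `y_demo`) and the hyperparameter values are assumptions chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_gradient_boosting(X, y, n_trees=50, learning_rate=0.1, max_depth=3):
    """Fit a toy gradient-boosted ensemble for squared-error regression."""
    # Start from a constant prediction: the mean minimizes squared error.
    base = float(np.mean(y))
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        # For squared error, the negative gradient is simply the residual.
        residual = y - pred
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)
        # Move the predictions a small step in the direction the tree suggests.
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees


def predict_gradient_boosting(X, base, trees, learning_rate=0.1):
    """Add the shrunken correction of every tree to the constant baseline."""
    pred = np.full(X.shape[0], base)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred


# Hypothetical demo data: a noiseless sine curve.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0])

base, trees = fit_gradient_boosting(X_demo, y_demo)
mse_baseline = np.mean((y_demo - base) ** 2)
mse_boosted = np.mean((y_demo - predict_gradient_boosting(X_demo, base, trees)) ** 2)
print(f"baseline MSE: {mse_baseline:.4f}, boosted MSE: {mse_boosted:.4f}")
```

The classifiers in this notebook follow the same pattern but optimize a classification loss, and the libraries add their own optimizations on top, such as histogram-based split finding.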

In [11]:
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=0)

In [12]:
X

array([[ 5.28003572,  0.07866376, -0.06964284, ..., -3.46287848,
        -0.03862101,  3.39993039],
       [ 2.16926235, -2.33555961, -0.39400421, ..., -0.86901332,
        -0.82929365, -5.54782464],
       [ 0.50841459, -2.64924693,  2.11503792, ..., -0.7373016 ,
        -2.11204173,  1.53660878],
       ...,
       [-2.0093232 , -1.21052136, -0.89558391, ...,  5.14633699,
        -0.04689061, -2.59038402],
       [-5.48082318,  1.69027971,  1.55338889, ..., -0.0579665 ,
         0.98007555, -1.9660304 ],
       [ 2.57435489, -0.13578773,  2.6852563 , ..., -0.66826083,
        -1.22760575,  0.39196471]])

In [13]:
X.shape

(10000, 20)

> Define empty dictionaries to record each model's accuracy and run time

In [14]:
accuracy = {}
speed = {}
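Every model in this notebook is evaluated with the same steps: build the cross-validation splitter, time `cross_val_score`, and store the rounded results. As a sketch, those steps could be wrapped in a single helper. The name `benchmark` and its signature are hypothetical, and the smoke test at the end uses a small dataset and a decision tree purely to show the call.

```python
from time import perf_counter

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier


def benchmark(name, model, X, y, accuracy, speed, n_jobs=-1):
    """Cross-validate `model` and store mean accuracy and wall time under `name`."""
    # 5 folds repeated twice gives 10 fits per model.
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
    start = perf_counter()
    scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv, n_jobs=n_jobs)
    speed[name] = round(perf_counter() - start, 3)
    accuracy[name] = round(float(np.mean(scores)), 3)
    return scores


# Smoke test on a small dataset with a cheap model, using separate dictionaries
# so the notebook's own `accuracy` and `speed` stay untouched.
X_small, y_small = make_classification(n_samples=300, n_features=10, random_state=0)
demo_accuracy, demo_speed = {}, {}
demo_scores = benchmark(
    "DecisionTree",
    DecisionTreeClassifier(random_state=0),
    X_small,
    y_small,
    demo_accuracy,
    demo_speed,
    n_jobs=1,
)
print(demo_accuracy, demo_speed)
```

With such a helper, a call like `benchmark("GradientBoosting", GradientBoostingClassifier(), X, y, accuracy, speed)` would stand in for the repeated timing code. The cells that follow keep the explicit version so each one can be read on its own.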

> # 1. Scikit-Learn - `GradientBoostingClassifier`

In [9]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold

In [15]:
model = GradientBoostingClassifier()

start = time()
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
score = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

speed['GradientBoosting'] = np.round(time() - start, 3)
accuracy['GradientBoosting'] = np.mean(score).round(3)

print(f"Mean Accuracy: {accuracy['GradientBoosting']}\nStd: {np.std(score):.3f}\nRun time: {speed['GradientBoosting']}s")

Mean Accuracy: 0.878
Std: 0.007
Run time: 45.31s


> # 2. Scikit-Learn Alternative - `HistGradientBoostingClassifier`

In [16]:
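# Only needed on scikit-learn < 1.0, where this estimator was still experimental;
# on newer versions this line can be dropped.
from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401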
from sklearn.ensemble import HistGradientBoostingClassifier

In [17]:
model = HistGradientBoostingClassifier()

start = time()
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
score = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

speed['HistGradientBoosting'] = np.round(time() - start, 3)
accuracy['HistGradientBoosting'] = np.mean(score).round(3)

print(f"Mean Accuracy: {accuracy['HistGradientBoosting']}\nStd: {np.std(score):.3f}\nRun time: {speed['HistGradientBoosting']}s")

Mean Accuracy: 0.948
Std: 0.005
Run time: 6.808s


> # 3. XGBoost - `XGBClassifier`

In [22]:
from xgboost import XGBClassifier

In [25]:
model = XGBClassifier()

start = time()
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
score = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

speed['XGB'] = np.round(time() - start, 3)
accuracy['XGB'] = np.mean(score).round(3)

print(f"Mean Accuracy: {accuracy['XGB']}\nStd: {np.std(score):.3f}\nRun time: {speed['XGB']}s")

Mean Accuracy: 0.873
Std: 0.009
Run time: 10.409s


> # 4. LightGBM - `LGBMClassifier`

In [26]:
from lightgbm import LGBMClassifier

In [27]:
model = LGBMClassifier()

start = time()
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
score = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

speed['LGBM'] = np.round(time() - start, 3)
accuracy['LGBM'] = np.mean(score).round(3)

print(f"Mean Accuracy: {accuracy['LGBM']}\nStd: {np.std(score):.3f}\nRun time: {speed['LGBM']}s")

Mean Accuracy: 0.949
Std: 0.006
Run time: 5.627s


> # 5. CatBoost - `CatBoostClassifier`

In [None]:
!pip install catboost

In [30]:
from catboost import CatBoostClassifier

In [31]:
model = CatBoostClassifier()

start = time()
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
score = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

speed['CatBoost'] = np.round(time() - start, 3)
accuracy['CatBoost'] = np.mean(score).round(3)

print(f"Mean Accuracy: {accuracy['CatBoost']}\nStd: {np.std(score):.3f}\nRun time: {speed['CatBoost']}s")

Mean Accuracy: 0.964
Std: 0.004
Run time: 126.18s


> ## CatBoost gives the best accuracy so far (0.964), but also the longest run time (126.18s)

In [42]:
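print("Accuracy:")
# Print one model per line: a dict shown as cell output may be re-sorted by key,
# which hides the ranking.
for name, acc in sorted(accuracy.items(), key=lambda i: i[1], reverse=True):
    print(f"{name}: {acc}")

Accuracy:
CatBoost: 0.964
LGBM: 0.949
HistGradientBoosting: 0.948
GradientBoosting: 0.878
XGB: 0.873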

In [40]:
print("Speed:")
# Fastest first; print line by line so the order is not lost in the cell output.
for name, seconds in sorted(speed.items(), key=lambda i: i[1]):
    print(f"{name}: {seconds}s")

Speed:
LGBM: 5.627s
HistGradientBoosting: 6.808s
XGB: 10.409s
GradientBoosting: 45.31s
CatBoost: 126.18s
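To see the trade-off at a glance, the two dictionaries can be combined into one table. The sketch below copies the values recorded in the outputs above into `recorded_accuracy` and `recorded_speed` (names chosen for this example) so that it runs on its own. In a live session you would pass `accuracy` and `speed` directly, and the timings will differ from machine to machine.

```python
import pandas as pd

# Values copied from the cell outputs above; rerunning the notebook changes the timings.
recorded_accuracy = {
    "GradientBoosting": 0.878,
    "HistGradientBoosting": 0.948,
    "XGB": 0.873,
    "LGBM": 0.949,
    "CatBoost": 0.964,
}
recorded_speed = {
    "GradientBoosting": 45.31,
    "HistGradientBoosting": 6.808,
    "XGB": 10.409,
    "LGBM": 5.627,
    "CatBoost": 126.18,
}

# One row per model, one column per metric, best accuracy first.
summary = pd.DataFrame({"accuracy": recorded_accuracy, "run_time_s": recorded_speed})
summary = summary.sort_values("accuracy", ascending=False)
print(summary)
```

In this run CatBoost is the most accurate but by far the slowest, while LightGBM and `HistGradientBoostingClassifier` come within about 0.015 of its accuracy in a fraction of the time. All models use their library defaults, which differ in number and size of trees, so the ranking reflects default behavior rather than tuned performance.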