# Gradient Boosting Decision Trees with LightGBM and XGBoost

In [1]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import sklearn
import lightgbm as lgb
import xgboost as xgb
import json

from utils import (Timer, load_airline, convert_related_cols_categorical_to_numeric, 
                  convert_cols_categorical_to_numeric, binarize_prediction,
                  classification_metrics_binary, classification_metrics_binary_prob)

print(f"Numpy version:{np.__version__}")
print(f"Pandas version:{pd.__version__}")
print(f"Sklearn version:{sklearn.__version__}")
print(f"LightGBM version:{lgb.__version__}")
print(f"XGBoost version:{xgb.__version__}")

Numpy version:1.22.3
Pandas version:1.4.1
Sklearn version:1.0.2
LightGBM version:3.3.2
XGBoost version:1.5.2


## Gradient tree boosting

Gradient boosting is a machine learning technique that produces a prediction model in the form of an ensemble of weak learners. One of the most popular variants is gradient boosted decision trees (GBDT), in which the ensemble is made up of weak decision trees.

Two of the most popular GBDT frameworks are XGBoost and LightGBM. Both provide efficient and scalable implementations of gradient boosting that can be used for classification and regression.

Boosting is a technique that builds strong classifiers by sequentially ensembling weak ones. First, a model is trained on the data. Then a second model tries to correct the errors made by the first model. This process is repeated until the error falls below a limit or a maximum number of models has been added.

In AdaBoost, the data points that the current model predicts poorly are given more weight when training the next model. In gradient boosting, each new model is instead fit to the residual, which is the difference between the true labels $y$ and the labels predicted by the model so far, $\hat y$.

* For the first estimator, the residual is $y - f_1(x)$.
* For the second estimator, the residual is $y - f_1(x) - f_2(x)$.
* For the $n^{th}$ estimator, the residual is $y - \sum_{i=1}^n f_i(x)$.
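
The residual-fitting loop above can be sketched in a few lines of pure Python. This is only a toy illustration for squared-error loss: the weak learners are one-split "stumps", and the function names, the toy data and the `learning_rate` shrinkage factor are assumptions introduced here, not part of XGBoost or LightGBM.

```python
# Toy gradient boosting for squared-error loss with one-split "stumps"
# as weak learners. All names and data here are illustrative.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    # Skip the largest value so the right side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, n_estimators=20, learning_rate=0.5):
    """Fit each new stump to the residuals left by the previous ones."""
    stumps = []
    predictions = [0.0] * len(xs)
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrink each correction by the (assumed) learning rate.
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return stumps, predictions


def mse(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]

_, preds_1 = boost(xs, ys, n_estimators=1)
_, preds_20 = boost(xs, ys, n_estimators=20)
print(f"MSE after 1 stump:   {mse(ys, preds_1):.4f}")
print(f"MSE after 20 stumps: {mse(ys, preds_20):.4f}")
```

Each round only sees what the previous rounds failed to explain, which is why the training error keeps shrinking as estimators are added.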

To train the GBDT, we minimize a loss that measures the difference between the true and predicted labels, plus a regularization term $\Omega$ that penalizes the complexity of the model. Here $n$ is the number of training examples and $K$ is the number of trees.

$$ \mathcal {L} = \sum_{i=1}^n l(y_i, \hat y_i) + \sum_{k=1}^K \Omega (f_k)$$


The loss term $\sum_{i=1}^n l(y_i, \hat y_i)$ is not easy to optimize directly; one way to make it tractable is a Taylor expansion. A Taylor expansion approximates a complex function using its first derivative (gradient or Jacobian) and its second derivative (Hessian). The Hessian in particular is not easy to compute, and this is one of the reasons finding the right split in gradient boosted trees can be difficult.
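
For reference, the second-order approximation can be written roughly as in the XGBoost paper (Chen & Guestrin, 2016). The notation below is introduced here for illustration: $f_t$ is the tree added at round $t$, and $\hat y_i^{(t-1)}$ is the prediction for example $i$ before that tree is added.

$$ \mathcal{L}^{(t)} \approx \sum_{i=1}^n \left[ l(y_i, \hat y_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t) $$

$$ g_i = \frac{\partial\, l(y_i, \hat y_i^{(t-1)})}{\partial\, \hat y_i^{(t-1)}}, \qquad h_i = \frac{\partial^2 l(y_i, \hat y_i^{(t-1)})}{\partial\, (\hat y_i^{(t-1)})^2} $$

As a worked case, for the logistic loss with $p_i = 1 / (1 + e^{-\hat y_i^{(t-1)}})$, the two derivatives reduce to $g_i = p_i - y_i$ and $h_i = p_i (1 - p_i)$, so each example contributes only two numbers to the split search.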

There are two different strategies to compute the trees: level-wise and leaf-wise. 

<img src="img/DecisionTrees_3_thumb.png" />

The level-wise strategy grows the tree level by level. In this strategy, each node splits the data prioritizing the nodes closer to the tree root. The leaf-wise strategy grows the tree by splitting the data at the nodes with the highest loss change. 

Level-wise growth is usually better for smaller datasets, where leaf-wise growth tends to overfit. Leaf-wise growth tends to excel on larger datasets, where it is considerably faster than level-wise growth.
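
To make the difference concrete, here is a minimal sketch that compares the order in which nodes are split under each strategy. The `GAINS` table is a made-up set of loss reductions for a small binary tree, and using a fixed budget of splits for both strategies is a simplification made here for comparison; real libraries compute gains from the data and use other stopping rules.

```python
import heapq
from collections import deque

# Hypothetical loss reduction ("gain") for splitting each node of a small
# binary tree. Node 0 is the root; children of node i are 2*i+1 and 2*i+2.
GAINS = {0: 10.0, 1: 2.0, 2: 8.0, 3: 1.0, 4: 0.5, 5: 6.0, 6: 3.0}


def children(node):
    """Children of `node` that can still be split in this toy tree."""
    return [c for c in (2 * node + 1, 2 * node + 2) if c in GAINS]


def level_wise(max_splits):
    """Split nodes breadth-first: nodes closer to the root come first."""
    order, queue = [], deque([0])
    while queue and len(order) < max_splits:
        node = queue.popleft()
        order.append(node)
        queue.extend(children(node))
    return order


def leaf_wise(max_splits):
    """Always split the current leaf with the largest gain (best-first)."""
    order, heap = [], [(-GAINS[0], 0)]
    while heap and len(order) < max_splits:
        _, node = heapq.heappop(heap)
        order.append(node)
        for child in children(node):
            heapq.heappush(heap, (-GAINS[child], child))
    return order


print(level_wise(4))  # [0, 1, 2, 3]
print(leaf_wise(4))   # [0, 2, 5, 6]
```

With the same budget of four splits, the leaf-wise order collects a larger total gain (27 versus 21 in this toy table) but produces a deeper, less balanced tree, which is consistent with its tendency to overfit on small datasets.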

A key challenge in training boosted decision trees is the computational cost of finding the best split for each leaf. Conventional techniques find the exact split for each leaf, and require scanning through all the data in each iteration. 

A different approach approximates the split by building histograms of the features. That way, the algorithm doesn't need to evaluate every single value of the features to compute the split, but only the bins of the histogram, which are bounded. This approach turns out to be much more efficient for large datasets, without adversely affecting accuracy.
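
The following sketch contrasts the two approaches on a single feature. It is a simplified illustration under several assumptions: it uses squared-error loss on raw targets rather than gradient statistics, equal-width bins, and function names invented for this example. XGBoost and LightGBM use more elaborate binning and gain formulas.

```python
# Toy comparison of exact and histogram-based split search on one feature,
# for squared-error loss. All names and the bin count are illustrative.

def split_gain(left_sum, left_n, right_sum, right_n):
    """Score of a split; maximising it minimises the squared error."""
    return left_sum ** 2 / left_n + right_sum ** 2 / right_n


def exact_best_split(values, targets):
    """Scan every distinct feature value as a candidate threshold."""
    pairs = sorted(zip(values, targets))
    total, n = sum(targets), len(targets)
    best_gain, best_threshold = float("-inf"), None
    left_sum = 0.0
    for i, (value, target) in enumerate(pairs[:-1], start=1):
        left_sum += target
        if value == pairs[i][0]:
            continue  # cannot split between two equal values
        gain = split_gain(left_sum, i, total - left_sum, n - i)
        if gain > best_gain:
            best_gain, best_threshold = gain, value
    return best_threshold


def histogram_best_split(values, targets, n_bins=4):
    """Bucket values into equal-width bins, then scan only bin boundaries."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for value, target in zip(values, targets):
        b = min(int((value - low) / width), n_bins - 1)
        sums[b] += target
        counts[b] += 1
    total, n = sum(sums), sum(counts)
    best_gain, best_threshold = float("-inf"), None
    left_sum, left_n = 0.0, 0
    for b in range(n_bins - 1):
        left_sum += sums[b]
        left_n += counts[b]
        if left_n == 0 or left_n == n:
            continue  # one side would be empty
        gain = split_gain(left_sum, left_n, total - left_sum, n - left_n)
        if gain > best_gain:
            best_gain, best_threshold = gain, low + (b + 1) * width
    return best_threshold


values = [0.5, 1.0, 1.5, 2.0, 6.0, 6.5, 7.0, 8.5]
targets = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
print(exact_best_split(values, targets))      # 2.0
print(histogram_best_split(values, targets))  # 2.5
```

The two thresholds differ, but they separate the data into the same two groups. The histogram version evaluates at most `n_bins - 1` candidates regardless of how many rows there are, which is where the speedup on large datasets comes from.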

## XGBoost vs LightGBM

XGBoost started in 2014 and became popular due to its use in many winning Kaggle competition entries. Originally XGBoost was based on a level-wise growth algorithm, but an option for leaf-wise growth that implements split approximation using histograms was added later. We refer to this version as XGBoost hist.

LightGBM is a more recent arrival, started in March 2016 and open-sourced in August 2016. It is based on a leaf-wise algorithm and histogram approximation, and has attracted a lot of attention due to its speed. 

Apart from multithreaded CPU implementations, GPU acceleration is now available on both XGBoost and LightGBM too.

## Airline dataset

The Airline dataset contains flight arrival and departure details for all commercial flights within the USA, from October 1987 to April 2008. Its size is around 116 million records and 5.76 GB in memory. It has 13 features plus the target. The target attribute is Arrival Delay, a positive or negative value measured in minutes.

To download the dataset:

```bash
cd data
wget http://kt.ijs.si/elena_ikonomovska/datasets/airline/airline_14col.data.bz2
bzip2 -dk airline_14col.data.bz2
```

In this notebook, we are going to set up a classification problem where the goal is to **classify whether a flight has arrived delayed or not.**

In [2]:
N_ROWS = 10000000

In [3]:
%%time
df_plane = load_airline(nrows=N_ROWS)
print(df_plane.shape)

(10000000, 14)
CPU times: user 7.3 s, sys: 3.85 s, total: 11.2 s
Wall time: 11.1 s


In [4]:
df_plane.head()

Unnamed: 0,Year,Month,DayofMonth,DayofWeek,CRSDepTime,CRSArrTime,UniqueCarrier,FlightNum,ActualElapsedTime,Origin,Dest,Distance,Diverted,ArrDelay
0,1987,10,1,4,1,556,AA,190,247,SFO,ORD,1846,0,27
1,1987,10,1,4,5,114,EA,57,74,LAX,SFO,337,0,5
2,1987,10,1,4,5,35,HP,351,167,ICT,LAS,987,0,17
3,1987,10,1,4,5,40,DL,251,35,MCO,PBI,142,0,-2
4,1987,10,1,4,8,517,UA,500,208,LAS,ORD,1515,0,17


The first step is to convert the categorical features to numeric features.

In [5]:
%%time
df_plane_numeric = convert_related_cols_categorical_to_numeric(df_plane, col_list=['Origin','Dest'])
del df_plane

CPU times: user 8.84 s, sys: 1.48 s, total: 10.3 s
Wall time: 10.3 s


In [6]:
df_plane_numeric.head()

Unnamed: 0,Year,Month,DayofMonth,DayofWeek,CRSDepTime,CRSArrTime,UniqueCarrier,FlightNum,ActualElapsedTime,Origin,Dest,Distance,Diverted,ArrDelay
0,1987,10,1,4,1,556,AA,190,247,0,33,1846,0,27
1,1987,10,1,4,5,114,EA,57,74,1,0,337,0,5
2,1987,10,1,4,5,35,HP,351,167,2,4,987,0,17
3,1987,10,1,4,5,40,DL,251,35,3,41,142,0,-2
4,1987,10,1,4,8,517,UA,500,208,4,33,1515,0,17
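
In the output above, SFO is encoded as 0 both as an origin (row 0) and as a destination (row 1), which suggests that `Origin` and `Dest` share a single mapping. The implementation of `convert_related_cols_categorical_to_numeric` in `utils` is not shown in this notebook, so the sketch below is only a hypothetical stand-in that illustrates the idea on plain dictionaries; `encode_related_columns` is a name invented for this example.

```python
# Hypothetical stand-in for a shared label encoding across related columns.
# The real helper lives in `utils` and may be implemented differently.

def encode_related_columns(rows, columns):
    """Map categories to integer codes, sharing one mapping across `columns`."""
    mapping = {}
    # Assign codes column by column, in order of first appearance.
    for column in columns:
        for row in rows:
            mapping.setdefault(row[column], len(mapping))
    encoded = [
        {key: mapping[value] if key in columns else value
         for key, value in row.items()}
        for row in rows
    ]
    return encoded, mapping


# First three flights from the table above, reduced to three columns.
flights = [
    {"Origin": "SFO", "Dest": "ORD", "Distance": 1846},
    {"Origin": "LAX", "Dest": "SFO", "Distance": 337},
    {"Origin": "ICT", "Dest": "LAS", "Distance": 987},
]
encoded, mapping = encode_related_columns(flights, ["Origin", "Dest"])
print(mapping)     # {'SFO': 0, 'LAX': 1, 'ICT': 2, 'ORD': 3, 'LAS': 4}
print(encoded[1])  # {'Origin': 1, 'Dest': 0, 'Distance': 337}
```

The codes for ORD and LAS differ from the notebook output because the real mapping is built from all 10 million rows, not from three.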


In [7]:
%%time
df_plane_numeric = convert_cols_categorical_to_numeric(df_plane_numeric, col_list='UniqueCarrier')

CPU times: user 5.53 s, sys: 853 ms, total: 6.38 s
Wall time: 6.36 s


To simplify the pipeline, we treat this as a classification problem where the goal is to classify whether a flight has arrived delayed or not. For that we need to binarize the variable `ArrDelay`.

If you want to extend this experiment, you can set up a regression problem and try to predict the number of minutes of delay a flight has. Both XGBoost and LightGBM have regression classes.

In [8]:
df_plane_numeric = df_plane_numeric.apply(lambda x: x.astype('int16'))

In [9]:
%%time
df_plane_numeric['ArrDelayBinary'] = 1*(df_plane_numeric['ArrDelay'] > 0)

CPU times: user 67.8 ms, sys: 106 ms, total: 174 ms
Wall time: 51.9 ms


In [10]:
df_plane_numeric.head()

Unnamed: 0,Year,Month,DayofMonth,DayofWeek,CRSDepTime,CRSArrTime,UniqueCarrier,FlightNum,ActualElapsedTime,Origin,Dest,Distance,Diverted,ArrDelay,ArrDelayBinary
0,1987,10,1,4,1,556,0,190,247,0,33,1846,0,27,1
1,1987,10,1,4,5,114,1,57,74,1,0,337,0,5,1
2,1987,10,1,4,5,35,2,351,167,2,4,987,0,17,1
3,1987,10,1,4,5,40,3,251,35,3,41,142,0,-2,0
4,1987,10,1,4,8,517,4,500,208,4,33,1515,0,17,1


Once the features are prepared, let's split the dataset into train and test sets. We won't use a validation set in this example (however, you can try adding one).

In [11]:
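def split_train_val_test_df(df, val_size=0.2, test_size=0.2):
    """Shuffle `df` and split it into train, validation and test frames."""
    train, validate, test = np.split(
        df.sample(frac=1),  # frac=1 returns all rows in random order
        [int((1 - val_size - test_size) * len(df)), int((1 - test_size) * len(df))],
    )
    return train, validate, test

def generate_feables(df):
    """Return the features X (all non-target columns) and the binary label y."""
    X = df[df.columns.difference(['ArrDelay', 'ArrDelayBinary'])]
    y = df['ArrDelayBinary']
    return X, y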

In [12]:
%%time
train, validate, test = split_train_val_test_df(df_plane_numeric, val_size=0, test_size=0.2)
print(train.shape)
print(validate.shape)
print(test.shape)

(8000000, 15)
(0, 15)
(2000000, 15)
CPU times: user 3.34 s, sys: 682 ms, total: 4.02 s
Wall time: 4 s


In [13]:
%%time
X_train, y_train = generate_feables(train)
X_val, y_val = generate_feables(validate)
X_test, y_test = generate_feables(test)

CPU times: user 225 ms, sys: 21.6 ms, total: 247 ms
Wall time: 237 ms


In [14]:
del train, validate, test

## Training

Now we are going to create two pipelines, one for XGBoost and one for LightGBM. The two libraries grow trees differently, so it is difficult to compare them under exactly the same model settings. XGBoost grows trees depth-wise and controls model complexity with `max_depth`, whereas LightGBM uses a leaf-wise algorithm and controls model complexity with `num_leaves`. As a tradeoff, we use XGBoost with `max_depth=8`, which allows at most $2^8 = 256$ leaves per tree, and compare it with LightGBM with `num_leaves=255`.

In [15]:
results_dict = dict()
num_rounds = 200

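Let's start with the XGBoost model. Note that the cell below uses `XGBRegressor` on the binary label rather than `XGBClassifier`; its predictions are clipped to the range (0, 1) at evaluation time and treated as scores.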

In [16]:
xgb_clf_pipeline = xgb.XGBRegressor(max_depth=8,
                                    n_estimators=num_rounds,
                                    min_child_weight=30,
                                    learning_rate=0.1,
                                    scale_pos_weight=2,
                                    gamma=0.1,
                                    reg_lambda=1,
                                    subsample=1,
                                    n_jobs=-1,
                                    random_state=77)

In [17]:
with Timer() as t:
    xgb_clf_pipeline.fit(X_train, y_train)

In [18]:
results_dict['xgb']={ 'train_time': t.interval }
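
The `Timer` helper comes from `utils` and its source is not shown in this notebook. As a rough idea of how such a context manager could expose an `interval` attribute, here is a minimal sketch; `SimpleTimer` is a hypothetical name and this is not necessarily how `utils.Timer` is written.

```python
import time


class SimpleTimer:
    """Hypothetical minimal timer: records elapsed seconds in `interval`."""

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc_info):
        self.interval = time.perf_counter() - self._start
        return False  # do not swallow exceptions raised inside the block


with SimpleTimer() as timer:
    total = sum(range(1_000_000))

print(f"Elapsed: {timer.interval:.4f} s")
```

Because `interval` is set in `__exit__`, it is only available after the `with` block has finished, which matches how `t.interval` is read in the cells of this notebook.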

Training the XGBoost model with leaf-wise growth.

In [19]:
xgb_hist_clf_pipeline = xgb.XGBRegressor(max_depth=0,
                                        max_leaves=255,
                                        n_estimators=num_rounds,
                                        min_child_weight=30,
                                        learning_rate=0.1,
                                        scale_pos_weight=2,
                                        gamma=0.1,
                                        reg_lambda=1,
                                        subsample=1,
                                        grow_policy='lossguide',
                                        tree_method='hist',
                                        n_jobs=-1,
                                        random_state=77)

In [20]:
with Timer() as t:
    xgb_hist_clf_pipeline.fit(X_train, y_train)

In [21]:
results_dict['xgb_hist']={ 'train_time': t.interval }

Training the LightGBM model.

In [22]:
lgbm_clf_pipeline = lgb.LGBMRegressor(num_leaves=255,
                                      n_estimators=num_rounds,
                                      min_child_weight=30,
                                      learning_rate=0.1,
                                      scale_pos_weight=2,
                                      min_split_gain=0.1,
                                      reg_lambda=1,
                                      subsample=1,
                                      n_jobs=-1,
                                      seed=77)

In [23]:
with Timer() as t:
    lgbm_clf_pipeline.fit(X_train, y_train)

In [24]:
results_dict['lgbm']={ 'train_time': t.interval }

## Evaluation

Now let's evaluate the models on the test set.

In [25]:
with Timer() as t:
    y_prob_xgb = np.clip(xgb_clf_pipeline.predict(X_test), 0.0001, 0.9999)

In [26]:
results_dict['xgb']['test_time'] = t.interval

In [27]:
with Timer() as t:
    y_prob_xgb_hist = np.clip(xgb_hist_clf_pipeline.predict(X_test), 0.0001, 0.9999)

In [28]:
results_dict['xgb_hist']['test_time'] = t.interval

In [29]:
with Timer() as t:
    y_prob_lgbm = np.clip(lgbm_clf_pipeline.predict(X_test), 0.0001, 0.9999)

In [30]:
results_dict['lgbm']['test_time'] = t.interval

## Metrics

We are going to obtain some metrics to evaluate the performance of each of the models.

In [31]:
y_pred_xgb = binarize_prediction(y_prob_xgb)
y_pred_xgb_hist = binarize_prediction(y_prob_xgb_hist)
y_pred_lgbm = binarize_prediction(y_prob_lgbm)

In [32]:
report_xgb = classification_metrics_binary(y_test, y_pred_xgb)
report2_xgb = classification_metrics_binary_prob(y_test, y_prob_xgb)
report_xgb.update(report2_xgb)

In [33]:
results_dict['xgb']['performance'] = report_xgb

In [34]:
report_xgb_hist = classification_metrics_binary(y_test, y_pred_xgb_hist)
report2_xgb_hist = classification_metrics_binary_prob(y_test, y_prob_xgb_hist)
report_xgb_hist.update(report2_xgb_hist)

In [35]:
results_dict['xgb_hist']['performance'] = report_xgb_hist

In [36]:
report_lgbm = classification_metrics_binary(y_test, y_pred_lgbm)
report2_lgbm = classification_metrics_binary_prob(y_test, y_prob_lgbm)
report_lgbm.update(report2_lgbm)

In [37]:
results_dict['lgbm']['performance'] = report_lgbm
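
The metric helpers used above (`binarize_prediction`, `classification_metrics_binary` and `classification_metrics_binary_prob`) are imported from `utils` and not shown here. The sketch below is a hypothetical pure-Python approximation of what such helpers typically compute; the function names, the 0.5 threshold and the toy labels are assumptions made for illustration.

```python
# Hypothetical stand-ins for the metric helpers from `utils`.

def binarize(scores, threshold=0.5):
    """Turn scores into 0/1 labels (the 0.5 threshold is an assumption)."""
    return [1 if s > threshold else 0 for s in scores]


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from the confusion counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "Accuracy": (tp + tn) / len(y_true),
        "Precision": precision,
        "Recall": recall,
        "F1": f1,
    }


def auc(y_true, scores):
    """Chance that a random positive is scored above a random negative."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    # Ties count as half a win. This O(P*N) loop is fine for a toy example.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.6, 0.7, 0.4, 0.2, 0.1]

y_pred = binarize(y_prob)
report = binary_metrics(y_true, y_pred)
report["AUC"] = auc(y_true, y_prob)
print(y_pred)  # [1, 1, 1, 0, 0, 0]
print(report)
```

Accuracy, precision, recall and F1 depend on the chosen threshold, whereas AUC is computed from the raw scores, which is why the notebook keeps both the binarized predictions and the clipped scores.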

## Results

In [38]:
print(json.dumps(results_dict, indent=4, sort_keys=True))

{
    "lgbm": {
        "performance": {
            "AUC": 0.8867442907165578,
            "Accuracy": 0.8034565,
            "F1": 0.8199721636892316,
            "Precision": 0.8311130298346029,
            "Recall": 0.8091260279073803
        },
        "test_time": 13.667672777002736,
        "train_time": 100.18000323400338
    },
    "xgb": {
        "performance": {
            "AUC": 0.8630098847547171,
            "Accuracy": 0.7289615,
            "F1": 0.7904801844440107,
            "Precision": 0.6905300362424293,
            "Recall": 0.9242615968921901
        },
        "test_time": 3.1140623839964974,
        "train_time": 2579.341359557002
    },
    "xgb_hist": {
        "performance": {
            "AUC": 0.6845214380533686,
            "Accuracy": 0.553596,
            "F1": 0.7124852669341697,
            "Precision": 0.5534212556393912,
            "Recall": 0.9998662296836333
        },
        "test_time": 0.4432450289968983,
        "train_time": 25.775370984

In [39]:
del xgb_clf_pipeline, xgb_hist_clf_pipeline, lgbm_clf_pipeline, X_train, X_test, X_val

## Summary

The experiments were conducted on an Intel i7 @ 1.30GHz with 32 GB of RAM.

| Airline subsample size | Lib | Training time (s) | Test time (s) | AUC | F1 |
|:-----------------------|:----|:-----------------:|:-------------:|:---:|:--:|
| 10,000     | xgb      |    2.7050 |   0.0146 | 0.8346 | 0.8010 |
| 10,000     | xgb_hist |    0.3963 |   0.0075 | 0.7009 | 0.7478 |
| 10,000     | lgb      |    0.2183 |   0.0161 | 0.8264 | 0.7971 |
| 100,000    | xgb      |    5.6744 |   0.0432 | 0.9015 | 0.8357 |
| 100,000    | xgb_hist |    0.4495 |   0.0265 | 0.7059 | 0.7351 |
| 100,000    | lgb      |    0.8666 |   0.1032 | 0.9090 | 0.8564 |
| 1,000,000  | xgb      |  106.1552 |   0.2917 | 0.9107 | 0.8401 |
| 1,000,000  | xgb_hist |    2.9730 |   0.0463 | 0.7061 | 0.7545 |
| 1,000,000  | lgb      |    9.7474 |   1.1421 | 0.9262 | 0.8750 |
| 10,000,000 | xgb      | 2579.3413 |   3.1140 | 0.8630 | 0.7904 |
| 10,000,000 | xgb_hist |   25.7753 |   0.4432 | 0.6845 | 0.7124 |
| 10,000,000 | lgb      |  100.1800 |  13.6676 | 0.8867 | 0.8199 |

## References
* Lessons Learned From Benchmarking Fast Machine Learning Algorithms. https://docs.microsoft.com/en-us/archive/blogs/machinelearning/lessons-learned-benchmarking-fast-machine-learning-algorithms
* Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. *KDD*. https://github.com/dmlc/xgboost
* Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T.-Y. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. *NIPS*. https://github.com/Microsoft/LightGBM
* Gradient Boosted Decision Trees-Explained. https://towardsdatascience.com/gradient-boosted-decision-trees-explained-9259bd8205af