<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# LightGBM: A Highly Efficient Gradient Boosting Decision Tree
This notebook shows how to train a LightGBM-based model to estimate click-through rates on an e-commerce advertisement, using a [publicly available dataset from Criteo](https://www.kaggle.com/c/criteo-display-ad-challenge).

[LightGBM](https://github.com/Microsoft/LightGBM) is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient with the following advantages:
* Fast training speed and high efficiency.
* Low memory usage.
* Great accuracy.
* Support for parallel and GPU learning.
* Capable of handling large-scale data.

## Global Settings and Imports

In [1]:
# Install these packages for this notebook in Azure Notebooks
!pip install papermill==0.19.1 "category_encoders>=1.3.0"



In [2]:
import sys, os
sys.path.append("../../")
import numpy as np
import lightgbm as lgb
import papermill as pm
import pandas as pd
import category_encoders as ce
from tempfile import TemporaryDirectory
from sklearn.metrics import roc_auc_score, log_loss

import reco_utils.recommender.lightgbm.lightgbm_utils as lgb_utils
import reco_utils.dataset.criteo as criteo

print("System version: {}".format(sys.version))
print("LightGBM version: {}".format(lgb.__version__))

System version: 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 17:14:51) 
[GCC 7.2.0]
LightGBM version: 2.2.1


### Parameter Setting
Let's set the main parameters for LightGBM. The task is binary classification (predicting click or no click), so the objective is set to binary logloss. We use AUC as the evaluation metric because it is less affected by class imbalance in the dataset.

Generally, we can tune the number of leaves (MAX_LEAF), the minimum number of data points in each leaf (MIN_DATA), the maximum number of trees (NUM_OF_TREES), the learning rate (TREE_LEARNING_RATE), and the early-stopping patience (EARLY_STOPPING_ROUNDS, which helps avoid overfitting) to get better performance.

We can also adjust other parameters to optimize the results. The [LightGBM parameter list](https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst) documents all available parameters, and the [parameter tuning guide](https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters-Tuning.rst) gives advice on how to tune them.
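
As a small illustration of why AUC is chosen over a threshold metric such as accuracy, the sketch below scores a deliberately useless model on imbalanced toy data. The helper functions `auc_score` and `accuracy` and the toy labels are made up for this example; they are not part of LightGBM or of this notebook's utilities.

```python
def auc_score(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples classified correctly at a fixed threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Hypothetical imbalanced data: 95 non-clicks and 5 clicks.
labels = [0] * 95 + [1] * 5
# A useless model that gives every sample the same low score.
constant_scores = [0.05] * 100

constant_accuracy = accuracy(labels, constant_scores)
constant_auc = auc_score(labels, constant_scores)
print(constant_accuracy, constant_auc)
```

The constant model looks strong on accuracy only because most samples are negative, while its AUC stays at the chance level of 0.5.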

In [3]:
MAX_LEAF = 64
MIN_DATA = 20
NUM_OF_TREES = 100
TREE_LEARNING_RATE = 0.15
EARLY_STOPPING_ROUNDS = 20
METRIC = "auc"
SIZE = "sample"

In [4]:
params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'num_class': 1,
    'objective': "binary",
    'metric': METRIC,
    'num_leaves': MAX_LEAF,
    'min_data': MIN_DATA,
    'boost_from_average': True,
    # Set this according to the number of CPU cores on your machine.
    'num_threads': 20,
    'feature_fraction': 0.8,
    'learning_rate': TREE_LEARNING_RATE,
}
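
The `num_threads` value above is hard-coded. A minimal sketch of deriving it from the machine instead is shown below, assuming the standard-library `os.cpu_count()` is an acceptable proxy. Note that `os.cpu_count()` reports logical cores, whereas the LightGBM documentation recommends the number of physical cores, so the result may need to be reduced on machines with hyper-threading. The name `params_override` is hypothetical.

```python
import os

# os.cpu_count() reports logical cores and may return None on some platforms.
logical_cores = os.cpu_count() or 1

# Hypothetical override that could be merged into `params` before training.
params_override = {"num_threads": logical_cores}
print(params_override)
```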

## Data Preparation
Here we use CSV format as the example data input. Our example data is a sample (about 100 thousand rows) from the [Criteo dataset that was used in a previous Kaggle competition](https://www.kaggle.com/c/criteo-display-ad-challenge). The Criteo dataset is a well-known industry benchmarking dataset for developing CTR prediction models, and it's frequently adopted as an evaluation dataset by research papers. The original dataset is too large for a lightweight demo, so we sample a small portion from it as a demo dataset.

Specifically, there are 39 columns of features in Criteo, where 13 columns are numerical features (I1-I13) and the other 26 columns are categorical features (C1-C26).

In [6]:
nume_cols = ["I" + str(i) for i in range(1, 14)]
cate_cols = ["C" + str(i) for i in range(1, 27)]
label_col = "Label"

header = [label_col] + nume_cols + cate_cols
with TemporaryDirectory() as tmp:
    all_data = criteo.load_pandas_df(size=SIZE, local_cache_path=tmp, header=header)
display(all_data.head())

8.79MB [00:04, 1.77MB/s]                            


Unnamed: 0,Label,I1,I2,I3,I4,I5,I6,I7,I8,I9,...,C17,C18,C19,C20,C21,C22,C23,C24,C25,C26
0,0,1.0,1,5.0,0.0,1382.0,4.0,15.0,2.0,181.0,...,e5ba7672,f54016b9,21ddcdc9,b1252a9d,07b5194c,,3a171ecb,c5c50484,e8b83407,9727dd16
1,0,2.0,0,44.0,1.0,102.0,8.0,2.0,2.0,4.0,...,07c540c4,b04e4670,21ddcdc9,5840adea,60f6221e,,3a171ecb,43f13e8b,e8b83407,731c3655
2,0,2.0,0,1.0,14.0,767.0,89.0,4.0,2.0,245.0,...,8efede7f,3412118d,,,e587c466,ad3062eb,3a171ecb,3b183c5c,,
3,0,,893,,,4392.0,,0.0,0.0,0.0,...,1e88c74f,74ef3502,,,6b3a5ca6,,3a171ecb,9117a34a,,
4,0,3.0,-1,,0.0,2.0,0.0,3.0,0.0,0.0,...,1e88c74f,26b3c7a7,,,21c9516a,,32c7478e,b34f3128,,


First, we create three datasets that we will use throughout estimation and evaluation:

- `train_data` (first 80%): used to estimate the model
- `valid_data` (middle 10%): used to validate during training
- `test_data` (last 10%): used to evaluate after training

Note that the dataset is a time series, which is also very common in recommendation scenarios, so we perform a chronological split.

In [7]:
# Split the data chronologically into three sets (80% / 10% / 10%)
length = len(all_data)
train_end = int(0.8 * length)
valid_end = int(0.9 * length)
train_data = all_data.iloc[:train_end]
valid_data = all_data.iloc[train_end:valid_end]
test_data = all_data.iloc[valid_end:]
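
The chronological split can also be expressed as a small reusable helper. The sketch below is an illustration on a plain Python list rather than on the Criteo DataFrame; the function name `chronological_bounds` and its default fractions are assumptions made for this example.

```python
def chronological_bounds(n_rows, train_frac=0.8, valid_frac=0.1):
    """Return (train_end, valid_end) row positions for data already sorted by time."""
    # round() guards against floating-point products such as 8.999999999.
    train_end = round(n_rows * train_frac)
    valid_end = round(n_rows * (train_frac + valid_frac))
    return train_end, valid_end


# Ten rows in time order stand in for the real dataset.
rows = list(range(10))
train_end, valid_end = chronological_bounds(len(rows))
train = rows[:train_end]
valid = rows[train_end:valid_end]
test = rows[valid_end:]
print(train, valid, test)
```

Because the slices are contiguous and ordered, every validation row is later than every training row, and every test row is later than every validation row, so no future information leaks into training.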

## Basic Usage

### Ordinal Encoding

LightGBM can handle low-frequency features and missing values, so for basic usage we only encode the string-like categorical features with an ordinal encoder. We use the standard [ordinal encoder](http://contrib.scikit-learn.org/categorical-encoding/ordinal.html) from the `category_encoders` module.

In [8]:
ord_encoder = ce.ordinal.OrdinalEncoder(cols=cate_cols)

def encode_csv(df, encoder, label_col, typ='fit'):
    if typ == 'fit':
        df = encoder.fit_transform(df)
    else:
        df = encoder.transform(df)
    y = df[label_col].values
    del df[label_col]
    return df, y

train_x, train_y = encode_csv(train_data, ord_encoder, label_col)
valid_x, valid_y = encode_csv(valid_data, ord_encoder, label_col, 'transform')
test_x, test_y = encode_csv(test_data, ord_encoder, label_col, 'transform')

print('Train Data Shape: X: {trn_x_shape}; Y: {trn_y_shape}.\nValid Data Shape: X: {vld_x_shape}; Y: {vld_y_shape}.\nTest Data Shape: X: {tst_x_shape}; Y: {tst_y_shape}.\n'
      .format(trn_x_shape=train_x.shape,
              trn_y_shape=train_y.shape,
              vld_x_shape=valid_x.shape,
              vld_y_shape=valid_y.shape,
              tst_x_shape=test_x.shape,
              tst_y_shape=test_y.shape,))
train_x.head()

Train Data Shape: X: (80000, 39); Y: (80000,).
Valid Data Shape: X: (10000, 39); Y: (10000,).
Test Data Shape: X: (10000, 39); Y: (10000,).



Unnamed: 0,I1,I2,I3,I4,I5,I6,I7,I8,I9,I10,...,C17,C18,C19,C20,C21,C22,C23,C24,C25,C26
0,1.0,1,5.0,0.0,1382.0,4.0,15.0,2.0,181.0,1.0,...,1,1,1,1,1,1,1,1,1,1
1,2.0,0,44.0,1.0,102.0,8.0,2.0,2.0,4.0,1.0,...,2,2,1,2,2,1,1,2,1,2
2,2.0,0,1.0,14.0,767.0,89.0,4.0,2.0,245.0,1.0,...,3,3,2,3,3,2,1,3,2,3
3,,893,,,4392.0,,0.0,0.0,0.0,,...,4,4,2,3,4,1,1,4,2,3
4,3.0,-1,,0.0,2.0,0.0,3.0,0.0,0.0,1.0,...,4,5,2,3,5,1,2,5,2,3
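
To make the encoded output above easier to read, here is a minimal sketch of the idea behind ordinal encoding: each distinct string gets the next integer in order of first appearance. This is not the `category_encoders` implementation. The functions `fit_ordinal` and `transform_ordinal` are hypothetical, and the `-1` sentinel for unseen values is an assumption (the value the library uses depends on its version and settings).

```python
def fit_ordinal(values):
    """Map each distinct value to 1, 2, 3, ... in order of first appearance."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping) + 1
    return mapping


def transform_ordinal(values, mapping, unknown=-1):
    """Apply a fitted mapping; unseen values get the `unknown` sentinel (an assumption)."""
    return [mapping.get(v, unknown) for v in values]


# First five values of column C17 from the data preview earlier in the notebook.
train_col = ["e5ba7672", "07c540c4", "8efede7f", "1e88c74f", "1e88c74f"]
mapping = fit_ordinal(train_col)
encoded_train = transform_ordinal(train_col, mapping)
# "ffffffff" is a made-up value that never appeared during fitting.
encoded_new = transform_ordinal(["07c540c4", "ffffffff"], mapping)
print(encoded_train, encoded_new)
```

The mapping is learned on the training rows only and then reused unchanged for validation and test rows, which mirrors the `'fit'` versus `'transform'` distinction in `encode_csv`.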


### Create model
When both hyper-parameters and data are ready, we can create a model:

In [9]:
lgb_train = lgb.Dataset(train_x, train_y.reshape(-1), params=params, categorical_feature=cate_cols)
lgb_valid = lgb.Dataset(valid_x, valid_y.reshape(-1), reference=lgb_train, categorical_feature=cate_cols)
lgb_test = lgb.Dataset(test_x, test_y.reshape(-1), reference=lgb_train, categorical_feature=cate_cols)
lgb_model = lgb.train(params,
                      lgb_train,
                      num_boost_round=NUM_OF_TREES,
                      early_stopping_rounds=EARLY_STOPPING_ROUNDS,
                      valid_sets=lgb_valid,
                      categorical_feature=cate_cols)

[1]	valid_0's auc: 0.728695
Training until validation scores don't improve for 20 rounds.
[2]	valid_0's auc: 0.742373
[3]	valid_0's auc: 0.747298
[4]	valid_0's auc: 0.747969
[5]	valid_0's auc: 0.751102
[6]	valid_0's auc: 0.753734
[7]	valid_0's auc: 0.755335
[8]	valid_0's auc: 0.75658
[9]	valid_0's auc: 0.757071
[10]	valid_0's auc: 0.758572
[11]	valid_0's auc: 0.759742
[12]	valid_0's auc: 0.760415
[13]	valid_0's auc: 0.760602
[14]	valid_0's auc: 0.761192
[15]	valid_0's auc: 0.7616
[16]	valid_0's auc: 0.761697
[17]	valid_0's auc: 0.762255
[18]	valid_0's auc: 0.76253
[19]	valid_0's auc: 0.763092
[20]	valid_0's auc: 0.762172
[21]	valid_0's auc: 0.762066
[22]	valid_0's auc: 0.761866
[23]	valid_0's auc: 0.761433
[24]	valid_0's auc: 0.761588
[25]	valid_0's auc: 0.761017
[26]	valid_0's auc: 0.761086
[27]	valid_0's auc: 0.761177
[28]	valid_0's auc: 0.760893
[29]	valid_0's auc: 0.760635
[30]	valid_0's auc: 0.760104
[31]	valid_0's auc: 0.759298
[32]	valid_0's auc: 0.759176
[33]	valid_0's auc: 0.7
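
In the log above, the validation AUC peaks at round 19 and does not improve in the rounds that follow. The sketch below shows the patience rule that early stopping relies on, using a short made-up score sequence. It is a simplified stand-in rather than LightGBM's own implementation, and `early_stopping_round` is a hypothetical helper.

```python
def early_stopping_round(scores, patience):
    """Return (best_round, stop_round), both 1-based, for a higher-is-better metric.

    Training stops at the first round where the best score has not improved
    for `patience` consecutive rounds; stop_round is None if that never happens.
    """
    best_round, best_score = 0, float("-inf")
    for rnd, score in enumerate(scores, start=1):
        if score > best_score:
            best_round, best_score = rnd, score
        elif rnd - best_round >= patience:
            return best_round, rnd
    return best_round, None


# Made-up validation AUC values: the best score is reached in round 3.
toy_auc = [0.70, 0.72, 0.74, 0.73, 0.735, 0.738]
best_round, stop_round = early_stopping_round(toy_auc, patience=3)
print(best_round, stop_round)
```

With a patience of 3, the toy run stops at round 6 and the model from round 3 is the one to keep.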

Now let's see how the model performs on the test set:

In [10]:
test_preds = lgb_model.predict(test_x)
auc = roc_auc_score(np.asarray(test_y.reshape(-1)), np.asarray(test_preds))
logloss = log_loss(np.asarray(test_y.reshape(-1)), np.asarray(test_preds), eps=1e-12)
res_basic = {"auc": auc, "logloss": logloss}
print(res_basic)
pm.record("res_basic", res_basic)

{'auc': 0.7674356153037237, 'logloss': 0.466876775528735}
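
For readers unfamiliar with the logloss figure, the sketch below computes binary log loss by hand on toy predictions and shows why an `eps` clipping value is passed. It assumes the usual definition (mean negative log-likelihood with probabilities clipped to `[eps, 1 - eps]`); `binary_log_loss` is a hypothetical helper, not the scikit-learn function.

```python
import math


def binary_log_loss(labels, probs, eps=1e-12):
    """Mean negative log-likelihood, with probabilities clipped to [eps, 1 - eps]."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


confident_right = binary_log_loss([1, 0], [0.9, 0.1])
confident_wrong = binary_log_loss([1, 0], [0.1, 0.9])
# A predicted probability of exactly 0 for a positive is finite only because of clipping.
clipped = binary_log_loss([1], [0.0])
print(confident_right, confident_wrong, clipped)
```

Lower is better: confident correct predictions give a small loss, while confident wrong predictions are penalized heavily.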




<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=default"></script>
## Optimized Usage

### Label-encoding and Binary-encoding

Next, we will iterate on this model to see if we can make it better.

Specifically we will do some feature engineering to treat the categorical variables differently. We will convert all the categorical features in the original dataset into numeric features by label-encoding [3] and binary-encoding [4]. 

Due to the time-series nature of the Criteo dataset, the label-encoding is applied sequentially: we encode the samples (i.e. rows) in order, and we incorporate information from the previous samples into the features for the current row (sequential label-encoding and sequential count-encoding). We also do a few additional cleanup tasks. See [lgb_utils.NumEncoder](https://github.com/microsoft/recommenders/blob/master/reco_utils/recommender/lightgbm/lightgbm_utils.py) for details.

Specifically, in `lgb_utils.NumEncoder`, the main steps are as follows:

* First, we convert the low-frequency categorical features to `"LESS"` and the missing categorical features to `"UNK"`. 
* Second, we convert the missing numerical features into the mean of corresponding columns. 
* Third, the string-like categorical features are ordinal encoded like the example above. 
* Fourth, we label-encode the categorical features one sample at a time, in sample order. For each sample, we use the labels and counts of its previous samples to produce new features. For each categorical variable, we create a new label-encoded feature ($LF$): for the current sample $x_i$ with category value $c$, we calculate the average click-through rate over all previous samples ($j=1..(i-1)$) that have the same category value $c$. Formally: 
$$LF = \frac{\sum\nolimits_{j=1}^{i-1} I(x_j=c) \cdot y_j}{\sum\nolimits_{j=1}^{i-1} I(x_j=c)}$$
where $x_i$ is the $i$th sample, $y_j$ is the label of the $j$th sample, $c$ is the observed category for $x_i$, and $I(\cdot)$ is the indicator function that determines whether a *former* sample contains $c$ or not.
* Fifth, we also add the count frequency of $c$ as a new count feature ($CF$). This formally evaluates to:
$$CF = \frac{\sum\nolimits_{j=1}^{i-1} I(x_j=c)}{i-1}$$ 
* Finally, based on the results of ordinal encoding, we add the binary encoding results as new columns into the data.

Note that the statistics used in the above process are only updated when fitting the training set and remain static when transforming the testing set, because the labels of the test data should be considered unknown.
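
The $LF$ and $CF$ definitions above can be sketched in a few lines of plain Python. This is an illustration of the formulas only, not the code in `lgb_utils.NumEncoder`. The function `sequential_encode` is hypothetical, and returning `0.0` when a sample has no history is an assumption, since the formulas are undefined in that case.

```python
from collections import defaultdict


def sequential_encode(categories, labels):
    """Return per-row (LF, CF) lists using only the rows before the current one."""
    clicks = defaultdict(int)  # sum of labels seen so far, per category
    counts = defaultdict(int)  # number of rows seen so far, per category
    lf, cf = [], []
    for i, (c, y) in enumerate(zip(categories, labels)):
        seen = counts[c]
        # i is the number of previous rows, i.e. (i - 1) in the 1-based formulas.
        # Assumption: fall back to 0.0 when there is no history yet.
        lf.append(clicks[c] / seen if seen else 0.0)
        cf.append(seen / i if i else 0.0)
        # Update the running statistics only after the current row is encoded.
        clicks[c] += y
        counts[c] += 1
    return lf, cf


cats = ["a", "b", "a", "a", "b"]
ys = [1, 0, 0, 1, 1]
lf, cf = sequential_encode(cats, ys)
print(lf)
print(cf)
```

Updating the counters after encoding each row is what keeps the current label out of its own features.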

In [11]:
label_col = 'Label'
num_encoder = lgb_utils.NumEncoder(cate_cols, nume_cols, label_col)
train_x, train_y = num_encoder.fit_transform(train_data)
valid_x, valid_y = num_encoder.transform(valid_data)
test_x, test_y = num_encoder.transform(test_data)
del num_encoder
print('Train Data Shape: X: {trn_x_shape}; Y: {trn_y_shape}.\nValid Data Shape: X: {vld_x_shape}; Y: {vld_y_shape}.\nTest Data Shape: X: {tst_x_shape}; Y: {tst_y_shape}.\n'
      .format(trn_x_shape=train_x.shape,
              trn_y_shape=train_y.shape,
              vld_x_shape=valid_x.shape,
              vld_y_shape=valid_y.shape,
              tst_x_shape=test_x.shape,
              tst_y_shape=test_y.shape,))


2019-06-05 13:30:02,641 [INFO] Filtering and fillna features
100%|██████████| 26/26 [00:08<00:00,  8.32it/s]
100%|██████████| 13/13 [00:00<00:00, 420.94it/s]
2019-06-05 13:30:11,698 [INFO] Ordinal encoding cate features
2019-06-05 13:30:14,632 [INFO] Target encoding cate features
100%|██████████| 26/26 [00:06<00:00,  4.15it/s]
2019-06-05 13:30:21,066 [INFO] Start manual binary encoding
100%|██████████| 65/65 [00:07<00:00,  8.29it/s]
100%|██████████| 26/26 [00:15<00:00,  1.70it/s]
2019-06-05 13:30:44,676 [INFO] Filtering and fillna features
100%|██████████| 26/26 [00:00<00:00, 132.71it/s]
100%|██████████| 13/13 [00:00<00:00, 988.76it/s]
2019-06-05 13:30:44,900 [INFO] Ordinal encoding cate features
2019-06-05 13:30:44,990 [INFO] Target encoding cate features
100%|██████████| 26/26 [00:00<00:00, 35.08it/s]
2019-06-05 13:30:45,736 [INFO] Start manual binary encoding
100%|██████████| 65/65 [00:04<00:00, 14.53it/s]
100%|██████████| 26/26 [00:02<00:00, 11.62it/s]
2019-06-05 13:30:52,601 [INFO

Train Data Shape: X: (80000, 268); Y: (80000, 1).
Valid Data Shape: X: (10000, 268); Y: (10000, 1).
Test Data Shape: X: (10000, 268); Y: (10000, 1).
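
Part of the growth in the number of columns comes from binary encoding, which writes each ordinal code in base 2 and stores one bit per new column. The sketch below illustrates the idea on a single toy column. The function `binary_encode`, the most-significant-bit-first order and the fixed width are assumptions for this example, and the exact column layout produced by `NumEncoder` may differ.

```python
def binary_encode(ordinals):
    """Expand each non-negative ordinal code into a fixed-width list of bits (MSB first)."""
    width = max(1, max(ordinals).bit_length())
    return [
        [(v >> shift) & 1 for shift in range(width - 1, -1, -1)]
        for v in ordinals
    ]


# Ordinal codes 1..5 need three bit columns.
bits = binary_encode([1, 2, 3, 4, 5])
for row in bits:
    print(row)
```

A column with K distinct codes needs roughly log2(K) bit columns, compared with K columns for one-hot encoding.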



### Training and Evaluation

In [12]:
lgb_train = lgb.Dataset(train_x, train_y.reshape(-1), params=params)
lgb_valid = lgb.Dataset(valid_x, valid_y.reshape(-1), reference=lgb_train)
lgb_model = lgb.train(params,
                      lgb_train,
                      num_boost_round=NUM_OF_TREES,
                      early_stopping_rounds=EARLY_STOPPING_ROUNDS,
                      valid_sets=lgb_valid)

[1]	valid_0's auc: 0.731759
Training until validation scores don't improve for 20 rounds.
[2]	valid_0's auc: 0.747705
[3]	valid_0's auc: 0.751667
[4]	valid_0's auc: 0.75589
[5]	valid_0's auc: 0.758054
[6]	valid_0's auc: 0.758094
[7]	valid_0's auc: 0.759904
[8]	valid_0's auc: 0.761098
[9]	valid_0's auc: 0.761744
[10]	valid_0's auc: 0.762308
[11]	valid_0's auc: 0.762473
[12]	valid_0's auc: 0.763606
[13]	valid_0's auc: 0.764222
[14]	valid_0's auc: 0.765004
[15]	valid_0's auc: 0.765933
[16]	valid_0's auc: 0.766507
[17]	valid_0's auc: 0.767192
[18]	valid_0's auc: 0.767284
[19]	valid_0's auc: 0.767859
[20]	valid_0's auc: 0.768619
[21]	valid_0's auc: 0.769045
[22]	valid_0's auc: 0.768987
[23]	valid_0's auc: 0.769601
[24]	valid_0's auc: 0.77011
[25]	valid_0's auc: 0.770183
[26]	valid_0's auc: 0.770539
[27]	valid_0's auc: 0.77096
[28]	valid_0's auc: 0.771164
[29]	valid_0's auc: 0.771296
[30]	valid_0's auc: 0.771402
[31]	valid_0's auc: 0.771596
[32]	valid_0's auc: 0.771476
[33]	valid_0's auc: 0.

In [13]:
test_preds = lgb_model.predict(test_x)
auc = roc_auc_score(np.asarray(test_y.reshape(-1)), np.asarray(test_preds))
logloss = log_loss(np.asarray(test_y.reshape(-1)), np.asarray(test_preds), eps=1e-12)
res_optim = {"auc": auc, "logloss": logloss}
print(res_optim)
pm.record("res_optim", res_optim)

{'auc': 0.7757371640011422, 'logloss': 0.4606505068849181}




## Model saving and loading
We have now finished the basic training and testing for LightGBM. Next, let's save and reload the model, and then evaluate it again.

In [14]:
with TemporaryDirectory() as tmp:
    save_file = os.path.join(tmp, r'finished.model')
    lgb_model.save_model(save_file)
    loaded_model = lgb.Booster(model_file=save_file)

# eval the performance again
test_preds = loaded_model.predict(test_x)

auc = roc_auc_score(np.asarray(test_y.reshape(-1)), np.asarray(test_preds))
logloss = log_loss(np.asarray(test_y.reshape(-1)), np.asarray(test_preds), eps=1e-12)
print({"auc": auc, "logloss": logloss})

{'auc': 0.7757371640011422, 'logloss': 0.4606505068849181}


## Additional Reading

\[1\] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems. 3146–3154.<br>
\[2\] The parameters of LightGBM: https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst <br>
\[3\] Anna Veronika Dorogush, Vasily Ershov, and Andrey Gulin. 2018. CatBoost: gradient boosting with categorical features support. arXiv preprint arXiv:1810.11363 (2018).<br>
\[4\] Scikit-learn. 2018. categorical_encoding. https://github.com/scikit-learn-contrib/categorical-encoding<br>
