## LightGBM estimation
## Tutorial

This project explores the main functionality of the LightGBM library for estimating GBMs, aiming at efficient, high-performance models built from functions and classes whose syntax differs somewhat from that of traditional scikit-learn. All model-estimation code follows the [LightGBM documentation](https://lightgbm.readthedocs.io/en/latest/index.html). Its most useful parts are the [installation guide](https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html), the [tutorials](https://lightgbm.readthedocs.io/en/latest/Python-Intro.html), the [main features](https://lightgbm.readthedocs.io/en/latest/Features.html) of functions and classes, the complete list of [parameters](https://lightgbm.readthedocs.io/en/latest/Parameters.html) and their alternatives, and the guide to [parameters tuning](https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html). Additional tutorials can be found in the [LightGBM GitHub repository](https://github.com/microsoft/LightGBM/tree/master/examples/python-guide).
<br>
<br>
The main functions and classes used during LightGBM estimation cover dataset creation (with the [Dataset class](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Dataset.html#lightgbm.Dataset)), hyper-parameter definition, training on the entire training data (with the [train function](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.train.html#lightgbm.train)), and training with K-folds cross-validation (with the [cv function](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.cv.html#lightgbm.cv)). Hyper-parameter definition is particularly important, since it follows a syntax based on a dictionary whose keys are parameter names and whose values are the choices made for them. Given the huge collection of hyper-parameters that can be set, the [parameters documentation](https://lightgbm.readthedocs.io/en/latest/Parameters.html) is probably the most relevant reference for implementing LightGBM.
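<br>
<br>
Since the native API receives hyper-parameters as a plain dictionary, names familiar from scikit-learn have to be translated into LightGBM names (or passed as aliases). The sketch below illustrates this with a hypothetical helper, `to_lightgbm_params`, which is not part of LightGBM; the name pairs are aliases listed in the parameters documentation, and only a few of them are covered.

```python
# Hypothetical helper (not part of LightGBM): translate scikit-learn style
# names into LightGBM's primary parameter names. The pairs below are aliases
# listed in the LightGBM parameters documentation.
SKLEARN_TO_LIGHTGBM = {
    "n_estimators": "num_iterations",
    "subsample": "bagging_fraction",
    "subsample_freq": "bagging_freq",
    "colsample_bytree": "feature_fraction",
    "min_child_samples": "min_data_in_leaf",
    "min_split_gain": "min_gain_to_split",
    "reg_alpha": "lambda_l1",
    "reg_lambda": "lambda_l2",
}


def to_lightgbm_params(sklearn_style, defaults=None):
    """Return a parameters dictionary keyed by LightGBM's primary names."""
    params = dict(defaults or {})
    for name, value in sklearn_style.items():
        # Names without a known alias are passed through unchanged.
        params[SKLEARN_TO_LIGHTGBM.get(name, name)] = value
    return params


params = to_lightgbm_params(
    {"n_estimators": 500, "subsample": 0.75, "subsample_freq": 1, "max_depth": 3},
    defaults={"objective": "binary", "learning_rate": 0.01, "verbose": -1},
)
print(params)
```

The resulting dictionary has the same shape as the `param` dictionaries used later in this notebook.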
<br>
<br>
Creating a [dataset object](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Dataset.html#lightgbm.Dataset) requires declaring the training data: the *data* argument receives the input variables and the *label* argument receives the output variable. Two additional arguments of interest are *weight*, which changes how much each instance contributes to model training, and *feature_name*, which defines the names of the features (important if plotting tools are to be used).
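<br>
<br>
As an illustration of what could be passed to *weight*, the sketch below computes per-instance weights that give each class the same total weight. This is one common heuristic, assumed here only for illustration; it is not the weighting used in this notebook, and `balanced_weights` is a hypothetical helper.

```python
from collections import Counter


def balanced_weights(labels):
    """Weight each instance by n / (n_classes * count of its class)."""
    counts = Counter(labels)
    n_rows, n_classes = len(labels), len(counts)
    return [n_rows / (n_classes * counts[label]) for label in labels]


labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # toy labels with 20% positives
weights = balanced_weights(labels)

# After weighting, both classes carry the same total weight.
positive_total = sum(w for w, y in zip(weights, labels) if y == 1)
negative_total = sum(w for w, y in zip(weights, labels) if y == 0)
print(positive_total, negative_total)
```

A list such as `weights` has one entry per row, which is the shape the *weight* argument of the Dataset class expects.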
<br>
<br>
Training a LightGBM model with the [train function](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.train.html#lightgbm.train) is straightforward. The two most relevant arguments are *params*, a dictionary containing all explicitly declared hyper-parameters and their values, and *train_set*, the dataset object previously created. The argument *num_boost_round* is optional, since the number of iterations can be declared in the parameters dictionary instead. *valid_sets* and *valid_names* should be declared for monitoring model performance on a validation dataset, which is necessary when implementing early stopping. The arguments *fobj*, which receives a customized cost function, and *feval*, its counterpart for model evaluation, are optional, since standard alternatives can be declared in the parameters dictionary. Finally, other relevant arguments are *learning_rates* (receives a list or function defining a customized learning-rate schedule) and *callbacks* (receives a list of callbacks for model monitoring).
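<br>
<br>
The two forms accepted by *learning_rates* can be sketched as follows. The exponential decay and its constants are assumptions chosen for illustration, and `exponential_decay` is a hypothetical helper.

```python
def exponential_decay(base_rate=0.1, decay=0.99):
    """Build a function mapping the iteration index to a learning rate."""
    def schedule(iteration):
        return base_rate * decay ** iteration
    return schedule


schedule = exponential_decay(base_rate=0.1, decay=0.99)

# Function form: evaluated with the index of the current boosting round.
print([round(schedule(i), 5) for i in (0, 1, 100)])

# List form: one value per boosting round, so its length should match
# the number of rounds.
num_boost_round = 500
rates = [schedule(i) for i in range(num_boost_round)]
```

In recent LightGBM releases the schedule appears to be passed through the `lgb.reset_parameter(learning_rate=...)` callback instead of the *learning_rates* argument, so it is worth checking the installed version.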
<br>
<br>
Training a LightGBM model through [K-folds CV](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.cv.html#lightgbm.cv) is a function call that takes essentially the same arguments as the train function. The most important extra arguments are *nfold* (number of folds for splitting the data), *shuffle* (whether the data should be randomized prior to the split), and *stratified* (whether the folds should preserve the class proportions of the output variable).
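<br>
<br>
The effect of *stratified* can be illustrated with a simplified splitter that deals the rows of each class round-robin across the folds. This is only a sketch of the idea, not the procedure LightGBM actually runs, and `stratified_folds` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_folds(labels, nfold=3, shuffle=True, seed=0):
    """Return `nfold` lists of row indices with similar class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(nfold)]
    for label in sorted(by_class):
        indices = by_class[label]
        if shuffle:
            rng.shuffle(indices)
        # Deal the rows of each class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % nfold].append(index)
    return folds


labels = [1] * 6 + [0] * 24  # toy labels with 20% positives
folds = stratified_folds(labels, nfold=3)
for fold in folds:
    positives = sum(labels[i] for i in fold)
    print(len(fold), positives / len(fold))
```

Each fold keeps the 20% share of positives found in the full toy sample, which is the property stratification is meant to guarantee.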
<br>
<br>
There are several groups of parameters: [learning control parameters](https://lightgbm.readthedocs.io/en/latest/Parameters.html#learning-control-parameters) (hyper-parameters that control how GBM estimation occurs), [dataset parameters](https://lightgbm.readthedocs.io/en/latest/Parameters.html#dataset-parameters) (which affect the creation of the dataset object), [predict parameters](https://lightgbm.readthedocs.io/en/latest/Parameters.html#predict-parameters) (which define how predictions are executed), [objective parameters](https://lightgbm.readthedocs.io/en/latest/Parameters.html#objective-parameters) (which refer to the learning task), and [metric parameters](https://lightgbm.readthedocs.io/en/latest/Parameters.html#metric-parameters) (which refer to the learning or performance metrics evaluated during model training). [This section](https://lightgbm.readthedocs.io/en/latest/Parameters.html#core-parameters) presents the core parameters for applying LightGBM. Except for dataset parameters (used in the Dataset class) and predict parameters (used in the predict method), all of them should be declared in the parameters dictionary passed to the *params* argument of the train function.
<br>
<br>
Special attention should be paid to [core](https://lightgbm.readthedocs.io/en/latest/Parameters.html#core-parameters) and [learning control](https://lightgbm.readthedocs.io/en/latest/Parameters.html#learning-control-parameters) parameters. Core parameters include the objective function for model training (*objective*), the type of boosting to be performed (*boosting*: traditional boosted trees, DART, etc.), the number of estimators in the ensemble (*num_iterations*), and the learning rate (*learning_rate*). Learning control parameters complete the collection of most relevant hyper-parameters for GBM estimation: the maximum number of levels allowed in each tree of the ensemble (*max_depth*) and the subsample parameter (*bagging_fraction*, together with *bagging_freq*, which must be set to a positive integer for stochastic GBM to be implemented, since a value of zero disables bagging).
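<br>
<br>
The interplay between *bagging_fraction* and *bagging_freq* can be sketched as below: a fresh subsample is drawn every *bagging_freq* iterations and reused until the next draw. This is a simplified illustration under that assumption, not LightGBM's internal sampling code, and `bagging_schedule` is a hypothetical helper.

```python
import random


def bagging_schedule(n_rows, num_iterations, bagging_fraction, bagging_freq, seed=0):
    """Yield, for each iteration, the row indices used to grow that tree."""
    rng = random.Random(seed)
    all_rows = list(range(n_rows))
    current = all_rows
    # Bagging is active only with a positive frequency and a fraction below one.
    use_bagging = bagging_freq > 0 and bagging_fraction < 1.0
    for iteration in range(num_iterations):
        if use_bagging and iteration % bagging_freq == 0:
            # Draw a fresh subsample (without replacement) and keep it
            # until the next multiple of `bagging_freq`.
            current = sorted(rng.sample(all_rows, int(n_rows * bagging_fraction)))
        yield current


subsets = list(bagging_schedule(n_rows=8, num_iterations=4,
                                bagging_fraction=0.75, bagging_freq=2))
for iteration, rows in enumerate(subsets):
    print(iteration, rows)
```

With `bagging_freq=0` the generator returns all rows at every iteration, whatever the value of `bagging_fraction`.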
<br>
<br>
Additional hyper-parameters allow further fine tuning of model performance. *pos_bagging_fraction* and *neg_bagging_fraction* help to deal with unbalanced datasets in binary classification. *feature_fraction* and *feature_fraction_bynode* allow the use of only a random sample of features at each iteration and at each split, respectively. Minimum node size (*min_data_in_leaf*) controls tree complexity, while *min_gain_to_split* defines the minimum improvement in the cost function required for a further split. *lambda_l1* and *lambda_l2* provide additional regularization. The complete [list of parameters](https://lightgbm.readthedocs.io/en/latest/Parameters.html#learning-control-parameters) is huge, which illustrates how flexible LightGBM is for implementing boosted models. Finally, a very useful section of the LightGBM documentation covers [parameters tuning](https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html), with several hints concerning computational time and model performance.
<br>
<br>
Early stopping can be easily implemented with LightGBM. It requires the definition of arguments *valid_sets* (list with validation dataset objects), *valid_names* (respective names of validation datasets) and *early_stopping_rounds* (number of tolerated iterations without improvement in the performance metric declared in parameters dictionary). Attribute *best_iteration* can then be used for predicting with the best subset of estimators or for saving the model as it was at the best iteration.
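<br>
<br>
The stopping rule itself can be sketched in a few lines: training halts once the validation metric has gone *early_stopping_rounds* iterations without improving on its best value. The function below is an illustrative re-implementation with made-up metric values, not LightGBM's own code.

```python
def best_iteration_with_patience(scores, early_stopping_rounds, higher_is_better=True):
    """Return (best_iteration, stopped_at), both 1-based like the training log."""
    best_score, best_iteration = None, 0
    for iteration, score in enumerate(scores, start=1):
        improved = (best_score is None
                    or (score > best_score if higher_is_better else score < best_score))
        if improved:
            best_score, best_iteration = score, iteration
        elif iteration - best_iteration >= early_stopping_rounds:
            # No improvement for `early_stopping_rounds` iterations: stop here.
            return best_iteration, iteration
    return best_iteration, len(scores)


# Made-up validation AUC values, for illustration only.
auc_by_iteration = [0.950, 0.958, 0.966, 0.965, 0.964, 0.966, 0.963]
result = best_iteration_with_patience(auc_by_iteration, early_stopping_rounds=3)
print(result)  # (3, 6): best score at iteration 3, training stopped at iteration 6
```

In recent LightGBM releases this behaviour appears to be requested through the `lgb.early_stopping` callback rather than the *early_stopping_rounds* argument, so the installed version should be checked.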
<br>
<br>
In addition to the Python API for dataset construction and model estimation, the documentation also shows how to implement LightGBM through the [sklearn interface](https://lightgbm.readthedocs.io/en/latest/Python-API.html#scikit-learn-api). More sophisticated usages involve the [distributed version](https://lightgbm.readthedocs.io/en/latest/Python-API.html#dask-api) of the library and the development of [callbacks](https://lightgbm.readthedocs.io/en/latest/Python-API.html#callbacks) for monitoring model training and performance. The [plotting](https://lightgbm.readthedocs.io/en/latest/Python-API.html#plotting) of model outcomes is also a useful feature of LightGBM.
<br>
<br>
The native LightGBM API does not seem to offer a straightforward method for choosing hyper-parameters through grid or random search. Therefore, third-party libraries or customized Python modules should be combined with it to improve the performance of LightGBM models.
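<br>
<br>
A customized random search can start from a sampler such as the one sketched below, which uses only the standard library. The helper `sample_params` and the layout of the search space are assumptions made for illustration; the ranges mirror the grid declared later in this notebook.

```python
import random


def sample_params(space, n_samples, seed=0):
    """Draw `n_samples` dictionaries from a search space.

    Each value of `space` is either a list of candidates or a function
    that receives a random generator and returns one draw.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_samples):
        draws.append({name: rng.choice(spec) if isinstance(spec, list) else spec(rng)
                      for name, spec in space.items()})
    return draws


space = {
    "bagging_fraction": lambda rng: rng.uniform(0.5, 1.0),
    "learning_rate": lambda rng: rng.uniform(0.0001, 0.1001),
    "max_depth": lambda rng: rng.randint(1, 9),  # both bounds inclusive
    "num_iterations": [100, 250, 500],
}
default_param = {"objective": "binary", "verbose": -1}

# Each candidate is a complete parameters dictionary.
candidates = [{**default_param, **draw} for draw in sample_params(space, n_samples=10)]
print(candidates[0])
```

Each candidate could then be scored by cross-validation, keeping the dictionary with the best average metric.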

**Summary:**
1. [Libraries](#libraries)<a href='#libraries'></a>.
2. [Functions and classes](#functions_classes)<a href='#functions_classes'></a>.
3. [Settings](#settings)<a href='#settings'></a>.
4. [Importing data](#imports)<a href='#imports'></a>.
    * [Features and label](#feats_label)<a href='#feats_label'></a>.
    * [Model assessment](#model_assess)<a href='#model_assess'></a>.
<br>
<br>
5. [Model estimation](#model_estimation)<a href='#model_estimation'></a>.
    * [Train-test estimation](#train_test)<a href='#train_test'></a>.
    * [Early stopping](#es)<a href='#es'></a>.
    * [CV estimation](#cv)<a href='#cv'></a>.
    * [Grid and random searches](#grid_random_searches)<a href='#grid_random_searches'></a>.

<a id='libraries'></a>

## Libraries

In [1]:
import pandas as pd
import numpy as np

import os
import json

import re
# pip install unidecode
from unidecode import unidecode

from datetime import datetime
import time

from scipy.stats import uniform, norm, randint

# pip install lightgbm
import lightgbm as lgb

from sklearn.metrics import roc_auc_score, average_precision_score, auc, precision_recall_curve, brier_score_loss

<a id='functions_classes'></a>

## Functions and classes

In [2]:
import utils
from utils import loading_data, running_time

In [3]:
import kfolds
from kfolds import Kfolds_fit

<a id='settings'></a>

## Settings

In [4]:
# Declare whether to export results:
export = True

# Define the dataset_id:
dataset_id = 2706

<a id='imports'></a>

## Importing data

<a id='feats_label'></a>

### Features and label

#### Training data

In [5]:
print('----------------------------------------')
print(f'\033[1mDataset {dataset_id}:\033[0m')

df_train = loading_data(path=f'../Datasets/dataset_{dataset_id}_train.csv',
                        dtype={'order_id': str, 'store_id': int, 'epoch': str},
                        id_var='order_id')

print('----------------------------------------')
print('\n')

# Accessory variables:
drop_vars = ['y', 'order_id', 'epoch', 'date']

----------------------------------------
Dataset 2706:
Shape of df: (7217, 1286).
Number of distinct instances: 7217.
Time period: from 2020-12-31 to 2021-02-17.
----------------------------------------




#### Test data

In [6]:
print('----------------------------------------')
print(f'\033[1mDataset {dataset_id}:\033[0m')

df_test = loading_data(path=f'../Datasets/dataset_{dataset_id}_test.csv',
                       dtype={'order_id': str, 'store_id': int, 'epoch': str},
                       id_var='order_id')

print('----------------------------------------')
print('\n')

----------------------------------------
Dataset 2706:
Shape of df: (7217, 1286).
Number of distinct instances: 7217.
Time period: from 2021-02-17 to 2021-03-31.
----------------------------------------




<a id='model_estimation'></a>

## Model estimation

<a id='train_test'></a>

### Train-test estimation

#### Creating the object for training data

In [8]:
features_names = [re.sub('[^A-Za-z0-9_]+', '', f) for f in df_train.drop(drop_vars, axis=1).columns]

# Creating the training data object:
train_data = lgb.Dataset(data=df_train.drop(drop_vars, axis=1).values,
                         label=df_train['y'].values,
                         feature_name=features_names)

#### Training the model

In [9]:
help(lgb.train)

Help on function train in module lightgbm.engine:

train(params, train_set, num_boost_round=100, valid_sets=None, valid_names=None, fobj=None, feval=None, init_model=None, feature_name='auto', categorical_feature='auto', early_stopping_rounds=None, evals_result=None, verbose_eval=True, learning_rates=None, keep_training_booster=False, callbacks=None)
    Perform the training with given parameters.
    
    Parameters
    ----------
    params : dict
        Parameters for training.
    train_set : Dataset
        Data to be trained on.
    num_boost_round : int, optional (default=100)
        Number of boosting iterations.
    valid_sets : list of Datasets or None, optional (default=None)
        List of data to be evaluated on during training.
    valid_names : list of strings or None, optional (default=None)
        Names of ``valid_sets``.
    fobj : callable or None, optional (default=None)
        Customized objective function.
        Should accept two parameters: preds, train_da

In [10]:
# Declaring dictionary with hyper-parameters:
param = {'objective': 'binary',
         'bagging_fraction': 0.75, 'learning_rate': 0.01, 'max_depth': 3, 'num_iterations': 500,
         'bagging_freq': 1,
         'verbose': -1}

In [11]:
# Training the model:
lgb_model = lgb.train(params=param, train_set=train_data,
                      feature_name=features_names,
                      verbose_eval=False)



#### Performance metrics

In [12]:
# Predicting scores for test data:
score_pred_test = lgb_model.predict(df_test.drop(drop_vars, axis=1).values)

In [13]:
# Performance metrics:
test_roc_auc = roc_auc_score(df_test['y'], score_pred_test)
test_prec_avg = average_precision_score(df_test['y'], score_pred_test)
test_brier = brier_score_loss(df_test['y'], score_pred_test)

print('\033[1mPerformance metrics (test data):\033[0m')
print(f'Test ROC-AUC: {test_roc_auc}.')
print(f'Test average precision score: {test_prec_avg}.')
print(f'Test Brier score: {test_brier}.')

Performance metrics (test data):
Test ROC-AUC: 0.992877420577318.
Test average precision score: 0.963687610627367.
Test Brier score: 0.004888418242082601.
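
For reference, two of the metrics reported above can be written out explicitly. The sketch below re-implements the Brier score and the ROC-AUC in plain Python on toy data; it is meant only to show what the numbers measure, and the scikit-learn functions remain the ones used in this notebook.

```python
def brier_score(y_true, y_score):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_score)) / len(y_true)


def roc_auc(y_true, y_score):
    """Share of positive-negative pairs ranked correctly (ties count one half)."""
    positives = [p for y, p in zip(y_true, y_score) if y == 1]
    negatives = [p for y, p in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy labels and scores, for illustration only.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, y_score))  # 0.75: three of the four pairs are ordered correctly
print(brier_score(y_true, y_score))
```

The pairwise loop is quadratic in the sample size, so it is suitable only for small examples.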


#### Relevant attributes or methods of the trained model

Visualizing the ensemble of trees using a dataframe

In [14]:
lgb_model.trees_to_dataframe().head(25)

| | tree_index | node_depth | node_index | left_child | right_child | parent_index | split_feature | split_gain | threshold | decision_type | missing_direction | missing_type | value | weight | count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0-S0 | 0-S1 | 0-S2 | | feat_879 | 3582.459961 | 2.838081 | <= | left | | -3.11388 | 0.0 | 5404 |
| 1 | 0 | 2 | 0-S1 | 0-S3 | 0-L2 | 0-S0 | feat_886 | 728.320007 | 5.626141 | <= | left | | -3.12136 | 212.768 | 5224 |
| 2 | 0 | 3 | 0-S3 | 0-L0 | 0-L4 | 0-S1 | feat_804 | 16.8099 | 3.609614 | <= | left | | -3.1229 | 211.302 | 5188 |
| 3 | 0 | 4 | 0-L0 | | | 0-S3 | | | | | | | -3.123086 | 210.405654 | 5166 |
| 4 | 0 | 4 | 0-L4 | | | 0-S3 | | | | | | | -3.07968 | 0.896036 | 22 |
| 5 | 0 | 3 | 0-L2 | | | 0-S1 | | | | | | | -2.899256 | 1.466241 | 36 |
| 6 | 0 | 2 | 0-S2 | 0-S4 | 0-L3 | 0-S0 | feat_223 | 55.989498 | 0.015152 | <= | left | | -2.89653 | 7.33121 | 180 |
| 7 | 0 | 3 | 0-S4 | 0-L1 | 0-L5 | 0-S2 | feat_228 | 6.13987 | 0.379437 | <= | left | | -2.88517 | 6.27226 | 154 |
| 8 | 0 | 4 | 0-L1 | | | 0-S4 | | | | | | | -2.880714 | 5.213303 | 128 |
| 9 | 0 | 4 | 0-L5 | | | 0-S4 | | | | | | | -2.907126 | 1.058952 | 26 |
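
The *value* column of the leaves can be related to predicted scores. The sketch below assumes a binary objective with the default logistic link, and assumes that leaf values already include the learning rate and, for the first tree, the initial score; under those assumptions the prediction is the logistic function of the sum of one leaf value per tree. `probability_from_leaf_values` is a hypothetical helper.

```python
import math


def probability_from_leaf_values(leaf_values):
    """Sum one leaf value per tree and apply the logistic function."""
    raw_score = sum(leaf_values)
    return 1.0 / (1.0 + math.exp(-raw_score))


# Leaf 0-L0 of the first tree in the table above has value -3.123086.
# Using that single tree gives a probability of roughly 4%.
probability = probability_from_leaf_values([-3.123086])
print(probability)
```

With the full ensemble, the list would hold the value of the leaf reached in each of the trees.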


<a id='es'></a>

### Early stopping

#### Creating the object for training data

In [77]:
features_names = [re.sub('[^A-Za-z0-9_]+', '', f) for f in df_train.drop(drop_vars, axis=1).columns]

# Creating the training data object:
train_data = lgb.Dataset(data=df_train.drop(drop_vars, axis=1).values,
                         label=df_train['y'].values,
                         feature_name=features_names)

# Creating the validation data object (the test data is reused for validation):
val_data = lgb.Dataset(data=df_test.drop(drop_vars, axis=1).values,
                       label=df_test['y'].values,
                       feature_name=features_names)

#### Training the model

In [86]:
# Declaring dictionary with hyper-parameters:
param = {'objective': 'binary', 'metric': 'auc',
         'bagging_fraction': 0.75, 'learning_rate': 0.01, 'max_depth': 3, 'num_iterations': 500,
         'bagging_freq': 1,
         'verbose': -1}

In [88]:
# Training the model:
lgb_model = lgb.train(params=param, train_set=train_data,
                      feature_name=features_names,
                      valid_sets=[val_data], valid_names=['validation_data'], early_stopping_rounds=50,
                      verbose_eval=True)

[1]	validation_data's auc: 0.95666
Training until validation scores don't improve for 50 rounds
[2]	validation_data's auc: 0.956607
[3]	validation_data's auc: 0.958165
[4]	validation_data's auc: 0.95817
[5]	validation_data's auc: 0.958159
[6]	validation_data's auc: 0.958194
[7]	validation_data's auc: 0.958171
[8]	validation_data's auc: 0.966776
[9]	validation_data's auc: 0.966669
[10]	validation_data's auc: 0.966694
[11]	validation_data's auc: 0.966677
[12]	validation_data's auc: 0.966682
[13]	validation_data's auc: 0.966683
[14]	validation_data's auc: 0.966708
[15]	validation_data's auc: 0.966521
[16]	validation_data's auc: 0.970675
[17]	validation_data's auc: 0.970677
[18]	validation_data's auc: 0.97068
[19]	validation_data's auc: 0.970668
[20]	validation_data's auc: 0.970671
[21]	validation_data's auc: 0.970673
[22]	validation_data's auc: 0.970678
[23]	validation_data's auc: 0.972838
[24]	validation_data's auc: 0.972838
[25]	validation_data's auc: 0.972834
[26]	validation_data's auc

[258]	validation_data's auc: 0.992467
[259]	validation_data's auc: 0.99248
[260]	validation_data's auc: 0.992484
[261]	validation_data's auc: 0.992489
[262]	validation_data's auc: 0.992449
[263]	validation_data's auc: 0.992401
[264]	validation_data's auc: 0.992379
[265]	validation_data's auc: 0.992365
[266]	validation_data's auc: 0.992322
[267]	validation_data's auc: 0.992325
[268]	validation_data's auc: 0.992348
[269]	validation_data's auc: 0.992414
[270]	validation_data's auc: 0.992412
[271]	validation_data's auc: 0.992396
[272]	validation_data's auc: 0.992386
[273]	validation_data's auc: 0.992392
[274]	validation_data's auc: 0.992441
[275]	validation_data's auc: 0.992441
[276]	validation_data's auc: 0.992529
[277]	validation_data's auc: 0.992535
[278]	validation_data's auc: 0.992565
[279]	validation_data's auc: 0.992579
[280]	validation_data's auc: 0.99259
[281]	validation_data's auc: 0.992591
[282]	validation_data's auc: 0.992573
[283]	validation_data's auc: 0.992778
[284]	validati

#### Performance metrics

In [89]:
# Predicting scores for test data:
score_pred_val = lgb_model.predict(df_test.drop(drop_vars, axis=1).values,
                                   num_iteration=lgb_model.best_iteration)

In [91]:
# Performance metrics:
val_roc_auc = roc_auc_score(df_test['y'], score_pred_val)
val_prec_avg = average_precision_score(df_test['y'], score_pred_val)
val_brier = brier_score_loss(df_test['y'], score_pred_val)

print('\033[1mPerformance metrics (validation data):\033[0m')
print(f'Test ROC-AUC: {val_roc_auc}.')
print(f'Test average precision score: {val_prec_avg}.')
print(f'Test Brier score: {val_brier}.')

Performance metrics (validation data):
Test ROC-AUC: 0.9928389315792501.
Test average precision score: 0.9594216812829266.
Test Brier score: 0.005455706706754901.


<a id='cv'></a>

### CV estimation

In [25]:
help(lgb.cv)

Help on function cv in module lightgbm.engine:

cv(params, train_set, num_boost_round=100, folds=None, nfold=5, stratified=True, shuffle=True, metrics=None, fobj=None, feval=None, init_model=None, feature_name='auto', categorical_feature='auto', early_stopping_rounds=None, fpreproc=None, verbose_eval=None, show_stdv=True, seed=0, callbacks=None, eval_train_metric=False, return_cvbooster=False)
    Perform the cross-validation with given paramaters.
    
    Parameters
    ----------
    params : dict
        Parameters for Booster.
    train_set : Dataset
        Data to be trained on.
    num_boost_round : int, optional (default=100)
        Number of boosting iterations.
    folds : generator or iterator of (train_idx, test_idx) tuples, scikit-learn splitter object or None, optional (default=None)
        If generator or iterator, it should yield the train and test indices for each fold.
        If object, it should be one of the scikit-learn splitter classes
        (https://scikit-

In [26]:
# Declaring dictionary with hyper-parameters:
param = {'metric': 'cross_entropy', 'objective': 'binary',
         'bagging_fraction': 1.0, 'learning_rate': 0.01, 'max_depth': 3, 'num_iterations': 500,
         'verbose': -1}

In [27]:
# Creating the training data object:
train_data = lgb.Dataset(data=df_train.drop(drop_vars, axis=1).values,
                         label=df_train['y'].values)

In [28]:
# Training the model:
lgb_cv = lgb.cv(param, train_data, nfold=3, shuffle=False, metrics=['cross_entropy', 'auc'],
                verbose_eval=False)
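
The object returned by `lgb.cv` is a dictionary holding, for each metric, one list with the mean and one with the standard deviation across folds at every boosting round. The sketch below picks the best round from such a dictionary; the values are made up, `best_round` is a hypothetical helper, and keys are matched by suffix because the exact key names seem to vary between LightGBM releases.

```python
def best_round(cv_results, metric="auc", higher_is_better=True):
    """Return (1-based round, mean score) for the best mean of `metric`."""
    # Match by suffix so that an optional prefix in the key is tolerated.
    key = next(k for k in cv_results if k.endswith(f"{metric}-mean"))
    means = cv_results[key]
    pick = max if higher_is_better else min
    index = pick(range(len(means)), key=means.__getitem__)
    return index + 1, means[index]


# Made-up values shaped like the dictionary returned by lgb.cv.
toy_results = {
    "auc-mean": [0.950, 0.970, 0.981, 0.979],
    "auc-stdv": [0.010, 0.008, 0.006, 0.007],
    "cross_entropy-mean": [0.20, 0.12, 0.09, 0.10],
    "cross_entropy-stdv": [0.02, 0.01, 0.01, 0.01],
}
print(best_round(toy_results, "auc"))
print(best_round(toy_results, "cross_entropy", higher_is_better=False))
```

The selected round can then be used as the number of iterations when refitting on the full training data.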



<a id='grid_random_searches'></a>

### Grid and random searches

In [29]:
estimation_id = int(time.time())

In [31]:
# Declare grid of hyper-parameters:
grid_param = {'bagging_fraction': uniform(0.5, 0.5),
              'learning_rate': uniform(0.0001, 0.1),
              'max_depth': randint(1, 10),
              'num_iterations': [100, 250, 500]}
default_param = {'bagging_fraction': 1.0, 'learning_rate': 0.01, 'max_depth': 3, 'num_iterations': 500,
                 'verbose': -1}

# Declare estimation object:
model = Kfolds_fit(task='binary', method='light_gbm',
                   metric='roc_auc', num_folds=3, pre_selecting=False,
                   random_search=True, n_samples=10,
                   grid_param=grid_param, default_param=default_param)

# Running train-test estimation:
model.fit(train_inputs=df_train.drop(drop_vars, axis=1),
          train_output=df_train['y'],
          test_inputs=df_test.drop(drop_vars, axis=1),
          test_output=df_test['y'])


---------------------------------------------------------------------
Train-test estimation outcomes:


Outcomes from K-folds CV estimation:
   Number of data folds: 3.
   Number of samples for random search: 10.
   Estimation method: light gbm.
   Metric for choosing best hyper-parameter: roc_auc.
   Best hyper-parameters: {'bagging_fraction': 0.7885873020172298, 'learning_rate': 0.05315523619935781, 'max_depth': 6, 'num_iterations': 100}.
   CV performance metric associated with best hyper-parameters: 0.983.


Performance metrics evaluated at test data:
   test_roc_auc = 0.9906
   test_prec_avg = 0.9635
   test_brier = 0.0045
---------------------------------------------------------------------


------------------------------------
Running time: 1.19 minutes.
Start time: 2021-06-06, 19:13:10
End time: 2021-06-06, 19:14:21
------------------------------------


#### Performance metrics

In [36]:
# Performance metrics:
test_roc_auc = model.performance_metrics['test_roc_auc']
test_prec_avg = model.performance_metrics['test_prec_avg']
test_brier = model.performance_metrics['test_brier']

print('\033[1mPerformance metrics (test data):\033[0m')
print(f'Test ROC-AUC: {test_roc_auc}.')
print(f'Test average precision score: {test_prec_avg}.')
print(f'Test Brier score: {test_brier}.')
print('\n')

print('\033[1mBest hyper-parameters:\033[0m')
print(model.best_param)
print('\n')

print('\033[1mLightGBM parameters:\033[0m')
print(model.model.params)

Performance metrics (test data):
Test ROC-AUC: 0.9905934580546053.
Test average precision score: 0.9635302207608056.
Test Brier score: 0.004497669768866669.


Best hyper-parameters:
{'bagging_fraction': 0.7885873020172298, 'learning_rate': 0.05315523619935781, 'max_depth': 6, 'num_iterations': 100}


LightGBM parameters:
{'metric': 'cross_entropy', 'objective': 'binary', 'bagging_fraction': 0.7885873020172298, 'learning_rate': 0.05315523619935781, 'max_depth': 6, 'verbose': -1, 'num_iterations': 100, 'early_stopping_round': None}


#### Model outcomes

In [37]:
# Features importances:
importances = list(model.model.feature_importance())
features = list(df_train.drop(drop_vars, axis=1).columns)

model_outcomes_gbm = {
    'features': features,
    'importances': importances
}
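
A dictionary with this layout can be turned into a ranking of the most important features. The sketch below uses toy values and a hypothetical helper, `top_features`; with the default settings of `feature_importance()`, the importance of a feature is the number of splits that use it.

```python
def top_features(features, importances, k=10):
    """Return the `k` (feature, importance) pairs with the largest importance."""
    ranked = sorted(zip(features, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Toy values shaped like the dictionary built above.
toy_outcomes = {"features": ["feat_a", "feat_b", "feat_c"], "importances": [3, 12, 0]}
ranking = top_features(toy_outcomes["features"], toy_outcomes["importances"], k=2)
print(ranking)  # [('feat_b', 12), ('feat_a', 3)]
```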

#### Exporting the model

In [38]:
if export:
    # JSON format:
    exp_model = model.model.dump_model()

    with open(f'../Models/model_{estimation_id}.json', 'w') as json_file:
        json.dump(exp_model, json_file, indent=2)

    # Txt format:
    model.model.save_model(f'../Models/model_{estimation_id}.txt')

#### Importing the model

In [39]:
imp_model = lgb.Booster(model_file=f'../Models/model_{estimation_id}.txt')

#### Making predictions

In [40]:
# Predicting the score of one randomly chosen test instance:
random_row = np.random.randint(low=0, high=len(df_test), size=1)
imp_model.predict(df_test.drop(drop_vars, axis=1).iloc[random_row, :])

array([0.00065104])