## lightGBM

2016, Microsoft implementation of GBM

[[paper]](https://papers.nips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf)


#### Key ideas:

- Histogram based split finding

    this is not a new idea: it was well known before (for example, in decision tree construction), but because it is more efficient than pre-sorted split finding, LightGBM relies on it heavily


- GOSS (Gradient-based one side sampling) 
    
    during the model update it makes sense to pay more attention to under-trained examples (those with a large loss gradient), while examples with a small gradient can safely be undersampled. This accelerates the training process significantly.


- EFB (Exclusive Feature Bundling)

    features that are frequently mutually exclusive can be combined into one feature. In the setting of highly sparse data this reduces algorithm complexity significantly.

All other innovations have to do with efficient and convenient implementation, not with the algorithm.
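
The histogram-based split finding from the first bullet can be illustrated with a minimal pure-Python sketch. It assumes equal-width bins and a second-order gain of the form $G^2/(H+\lambda)$; the function name `best_histogram_split` and its arguments are hypothetical, and LightGBM's real binning is more elaborate.

```python
def best_histogram_split(values, gradients, hessians, n_bins=8, reg_lambda=1.0):
    """Find the best threshold for one feature from a gradient histogram."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for constant features

    # 1. One pass over the data: accumulate gradient statistics per bin.
    grad_hist = [0.0] * n_bins
    hess_hist = [0.0] * n_bins
    for v, g, h in zip(values, gradients, hessians):
        b = min(int((v - lo) / width), n_bins - 1)
        grad_hist[b] += g
        hess_hist[b] += h

    def score(g, h):
        return g * g / (h + reg_lambda)

    # 2. Scan only the n_bins - 1 bin boundaries instead of every sorted value.
    g_total, h_total = sum(grad_hist), sum(hess_hist)
    best_gain, best_threshold = 0.0, None
    g_left = h_left = 0.0
    for b in range(n_bins - 1):
        g_left += grad_hist[b]
        h_left += hess_hist[b]
        gain = (score(g_left, h_left)
                + score(g_total - g_left, h_total - h_left)
                - score(g_total, h_total))
        if gain > best_gain:
            best_gain, best_threshold = gain, lo + (b + 1) * width
    return best_threshold, best_gain


values = list(range(16))
gradients = [-1.0 if v < 8 else 1.0 for v in values]
hessians = [1.0] * 16
threshold, gain = best_histogram_split(values, gradients, hessians)
```

The cost of the scan depends on the number of bins, not on the number of distinct feature values, which is where the speed-up over pre-sorted split finding comes from.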


#### How exactly does GOSS work

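On each step they:
1. keep the top $\alpha$ fraction of examples with the largest absolute gradients, and then
2. randomly sample a $\beta$ fraction of examples from all other cases, up-weighting them by $\frac{1-\alpha}{\beta}$ so that the gradient statistics stay approximately unbiased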
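
A minimal sketch of this sampling step, assuming the gradients are given as a plain list. The function name `goss_sample` and its arguments are hypothetical and not part of the LightGBM API; the $(1-\alpha)/\beta$ re-weighting of the sampled small-gradient examples follows the paper.

```python
import random


def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    """Return {example index: weight} for one GOSS sampling step."""
    n = len(gradients)
    top_n = int(a * n)    # examples kept because their gradient is large
    rand_n = int(b * n)   # examples sampled from the remaining ones

    # Sort indices by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_idx, rest_idx = order[:top_n], order[top_n:]

    rng = random.Random(seed)
    sampled_idx = rng.sample(rest_idx, rand_n)

    # Small-gradient examples are amplified to compensate for the undersampling.
    amplify = (1.0 - a) / b
    weights = {i: 1.0 for i in top_idx}
    weights.update({i: amplify for i in sampled_idx})
    return weights


gradients = [i / 100 for i in range(100)]
weights = goss_sample(gradients, a=0.2, b=0.1)
```

With these settings only 30% of the examples take part in the split search for the current tree.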

#### How exactly does EFB work

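They build a graph whose vertices are features and whose edges connect features that conflict (are non-zero on the same rows), then greedily color that graph to partition the features into (nearly) exclusive sets.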
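
The greedy grouping can be sketched as follows. This is a simplified illustration, not LightGBM's implementation: conflicts are checked against the union of rows already occupied by a bundle, features are ordered by their non-zero count (which the paper mentions as a cheaper alternative to ordering by graph degree), and the name `greedy_bundle` is hypothetical.

```python
def greedy_bundle(columns, max_conflicts=0):
    """Group features so that features in one bundle are rarely non-zero together.

    columns: dict mapping feature name -> list of values (0 means "empty").
    """
    nonzero = {name: {i for i, v in enumerate(col) if v != 0}
               for name, col in columns.items()}

    # Features with many non-zero rows are placed first.
    order = sorted(columns, key=lambda f: len(nonzero[f]), reverse=True)

    bundles = []  # list of (feature names, rows already occupied)
    for f in order:
        for names, rows in bundles:
            # Conflict = rows where both the bundle and the feature are non-zero.
            if len(nonzero[f] & rows) <= max_conflicts:
                names.append(f)
                rows |= nonzero[f]  # in-place update of the bundle's row set
                break
        else:
            bundles.append(([f], set(nonzero[f])))
    return [names for names, _ in bundles]


columns = {
    "a": [1, 0, 0, 0],
    "b": [0, 2, 0, 0],
    "c": [3, 0, 4, 0],
    "d": [0, 0, 0, 5],
}
bundles = greedy_bundle(columns)
```

Here `a` and `c` are both non-zero on the first row, so they end up in different bundles, while `b` and `d` can share a bundle with `c`.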

#### How categorical features are tackled

They claim that one-hot encoding is inefficient. Instead they 
1. compute a categorical histogram with averaged gradients (sum of gradients / sum of hessians) for each category
2. sort the categories by that value and find the best partition of the categories into 2 sets (they always do binary splits)
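
A minimal sketch of this procedure, assuming per-example gradients and hessians are available. The function name `best_categorical_split` and the gain formula $G^2/(H+\lambda)$ are illustrative assumptions; the `cat_smooth` argument only borrows its name from the LightGBM parameter and is used here as a simple smoothing term.

```python
from collections import defaultdict


def best_categorical_split(categories, gradients, hessians,
                           reg_lambda=1.0, cat_smooth=10.0):
    """Return (set of categories sent to the left child, gain)."""
    # 1. Histogram: accumulate gradient statistics per category.
    grad_sum = defaultdict(float)
    hess_sum = defaultdict(float)
    for c, g, h in zip(categories, gradients, hessians):
        grad_sum[c] += g
        hess_sum[c] += h

    # 2. Sort categories by their smoothed gradient/hessian ratio.
    ordered = sorted(grad_sum,
                     key=lambda c: grad_sum[c] / (hess_sum[c] + cat_smooth))

    def score(g, h):
        return g * g / (h + reg_lambda)

    # 3. Scan the sorted order: every prefix is a candidate left set.
    g_total, h_total = sum(grad_sum.values()), sum(hess_sum.values())
    best_gain, best_left = 0.0, set()
    g_left = h_left = 0.0
    for k, c in enumerate(ordered[:-1]):
        g_left += grad_sum[c]
        h_left += hess_sum[c]
        gain = (score(g_left, h_left)
                + score(g_total - g_left, h_total - h_left)
                - score(g_total, h_total))
        if gain > best_gain:
            best_gain, best_left = gain, set(ordered[:k + 1])
    return best_left, best_gain


categories = ["a", "b", "c", "d"] * 5
gradients = [-1.0 if c in ("a", "c") else 1.0 for c in categories]
hessians = [1.0] * len(categories)
left, gain = best_categorical_split(categories, gradients, hessians)
```

Sorting first means only k - 1 prefixes have to be checked for k categories, instead of every possible two-way partition.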

Available algorithms:
- GBDT - standard approach
- DART - boosting with dropouts
- GOSS - sampled training
- RF - they added it just for completeness (?)


# Implementation (Python)

In [None]:
import lightgbm as lgb

## Getting Data

Key class for storing data is called **lgb.Dataset**

API is pretty much the same as in XGBoost.

<img src="img/lightgbm_datatset.png" width=350>

Available data sources:
- Pandas dataframe
- numpy ndarray
- scipy sparse matrix
- CSV, TXT, LibSVM files
- binary lightGBM files

Differences with XGB:
- class is called Dataset instead of DMatrix
- you can also load from TXT
- there is a support for categorical features - one must define them explicitly

#### Type requirements
All features must be int, float or bool. Strings must be converted to numeric types.

There is more flexibility with parameters:
- two_round (alias two_round_loading)
- enable_bundle (feature bundling)
- bucketing strategy:
  - max_bin
  - min_data_in_bin
  - bin_construct_sample_cnt
- free_raw_data

Dataset is a <u>link</u> to data. It does not load anything when defined. The first load happens during the training process.

Setting **free_raw_data** to False is mandatory when using categorical features. Otherwise you could only call train() once.

## Overall schema

<img src="img/lightgbm.png" width=750>

Let's create some training data

In [3]:
from sklearn.datasets import fetch_kddcup99
import pandas as pd
import numpy as np
from tqdm import tqdm

X,y = fetch_kddcup99(return_X_y = True)

feature_names = ["feature_"+str(x) for x in range(X.shape[1])]
exclude_features = ['feature_1','feature_2','feature_3']

X = pd.DataFrame(X, columns=feature_names, dtype=None)
# np.float was removed from NumPy (1.24+); the builtin float is equivalent
X = X[X.columns.difference(exclude_features)].astype(float)

y = np.random.choice([0,1], X.shape[0])

categorical_features = ['feature_11','feature_13','feature_14','feature_6','feature_7','feature_8']

from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X,y)

train_data = lgb.Dataset(X_train, label=y_train, categorical_feature=categorical_features, free_raw_data=False)
valid_data = lgb.Dataset(X_valid, label=y_valid, categorical_feature=categorical_features, free_raw_data=False, reference=train_data)
# train_data = lgb.Dataset(X, label=y, free_raw_data=True)


## Setting parameters

[[nice tutorial on LGB parameters]](https://neptune.ai/blog/lightgbm-parameters-guide)

In lightGBM all parameters are global. 

There are several ways you can set them up:
- in config file
- using a param dictionary
- in particular classes or methods

There are 100+ parameters.
- core params = objective and evaluation metrics / algorithm
- training params = n_estimators / learning_rate
- base learner params = max_depth / num_leaves (alias max_leaves)
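
As an illustration, a param dictionary grouped in the same way might look like this. The values are arbitrary examples, not recommendations.

```python
params = {
    # core params: task, metrics, boosting algorithm
    "objective": "binary",
    "metric": ["auc", "binary_logloss"],
    "boosting": "gbdt",
    # training params (num_iterations has the alias n_estimators)
    "num_iterations": 200,
    "learning_rate": 0.05,
    # base learner params
    "max_depth": -1,      # -1 means no depth limit
    "num_leaves": 31,
    # logging
    "verbosity": -1,
}
```

Such a dictionary is passed as the `params` argument of `lgb.train`.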




## Training

Unlike the scikit-learn wrappers (LGBMClassifier, LGBMRegressor), the native API has no separate models for different tasks.

The type of task is defined only by its objective:
- regression (L2 loss, i.e. MSE) for regression tasks
- binary / multiclass for classification tasks
- lambdarank for ranking tasks

Implemented boosting types:
- GBDT
- DART
- GOSS
- RF

#### Minimal Training

In [None]:
bst = lgb.train(
    train_set = train_data, 
    params = {'objective':'binary'})

#### How to track training process using evaluation metrics

- Parameter **valid_sets** defines the datasets on which evaluation will be done (including training data if necessary)
- Parameter __metric__ (inside `params`) defines built-in metrics; __feval__ accepts custom metric functions only
- Parameter __valid_names__ defines the dataset names in the evaluation results

You must provide the dictionary in which the evaluation results will be stored yourself

In [None]:
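results_dict = {}

lgb.train(
    # built-in metrics go into params; feval is reserved for custom metric callables
    params = {'objective':'binary', 'metric':['auc','binary_logloss']},
    train_set = train_data,
    valid_sets = [train_data, valid_data],
    valid_names = ['super_train_data','super_valid_data'],
    # record_evaluation fills results_dict in place
    # (the evals_result argument was removed in LightGBM 4.0)
    callbacks = [lgb.record_evaluation(results_dict)])

print(results_dict)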

Training visualization

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# the keys follow valid_names passed to lgb.train
plt.plot(results_dict['super_train_data']['auc'], label='train', color='red')
plt.plot(results_dict['super_valid_data']['auc'], label='valid', color='green')
plt.grid(True, linestyle='dotted')
plt.legend()

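To suppress printing each iteration:
`verbose_eval=False` (LightGBM < 4.0; from 4.0 on this argument is gone and per-iteration output is controlled by the `lgb.log_evaluation()` callback)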

#### How to use early stopping

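`lgb.train(..., callbacks=[lgb.early_stopping(stopping_rounds=10)])` (LightGBM < 4.0 also accepted `lgb.train(early_stopping_rounds = 10)`)

or 

`params['early_stopping_rounds']=10`

will stop training if the validation score has not improved for 10 consecutive rounds.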

- It checks only **validation** sets

    there must be at least one validation set (with a reference) defined in valid_sets


- It checks **all** metrics
    
    you can force it to check only the first metric (for example ROC_AUC) by setting the **first_metric_only** parameter.
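
The stopping rule itself can be sketched in a few lines of plain Python. This is an illustration for a single metric on a single validation set, not LightGBM's code; the function name `early_stopping_iteration` is hypothetical.

```python
def early_stopping_iteration(scores, stopping_rounds=10, higher_is_better=True):
    """Return (best iteration, iteration at which training stops)."""
    best_score, best_iter = None, -1
    for i, s in enumerate(scores):
        if best_score is None:
            improved = True
        elif higher_is_better:
            improved = s > best_score
        else:
            improved = s < best_score

        if improved:
            best_score, best_iter = s, i
        elif i - best_iter >= stopping_rounds:
            # No improvement for `stopping_rounds` consecutive iterations.
            return best_iter, i
    return best_iter, len(scores) - 1


auc_per_iteration = [0.70, 0.75, 0.80, 0.79, 0.78, 0.80, 0.77, 0.81]
best_iter, stop_iter = early_stopping_iteration(auc_per_iteration, stopping_rounds=3)
```

In this example training stops at iteration 5 and never reaches the better score at iteration 7, which is the usual trade-off when the patience is set too low.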

## Distributed training

params['tree_learner'] = 
- serial
- feature (feature parallel)
- data (data parallel)
- voting (voting parallel)

params['device_type'] = 
- cpu
- gpu
- cuda



# Implementation (Sklearn)

There are several wrappers for scikit-learn. They make it possible to plug LGBM models into standard scikit-learn pipelines:
- LGBMClassifier()
- LGBMRegressor()
- LGBMRanker()

They differ only in the default objective function. The ensemble algorithm is the same.

The wrappers follow scikit-learn naming conventions: for example, they provide the standard __n_estimators__ parameter instead of __num_boost_round__.

boosting_type =
- gbdt
- dart
- goss
- rf (random forest)

importance_type = 
- split - number of times feature is used in splits
- gain - total gain achieved by splitting this feature

To train and apply the model, use fit() and predict() respectively.

You can still set most of the parameters using kwargs.