# LightGBM

## References

* *统计学习方法* (*Statistical Learning Methods*)
* [LightGBM: A Highly Efficient Gradient Boosting Decision Tree](https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf)

## Practice

### LightGBM native interface

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from sklearn import metrics
from sklearn import datasets as sk_datasets
from sklearn.model_selection import train_test_split
import lightgbm as lgb

Installation note for macOS: when LightGBM is installed from PyPI with ``pip install lightgbm``, the gcc compiler is no longer needed.
Instead, the OpenMP library must be installed, because it is required to run LightGBM on a system with the Apple Clang compiler.
Install it with ``brew install libomp``.


Load the dataset

In [2]:
iris_dataset = sk_datasets.load_iris()

In [3]:
pd.DataFrame(iris_dataset.data, columns=iris_dataset.feature_names).head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [4]:
iris_dataset.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [5]:
iris_dataset.data.shape, iris_dataset.target.shape

((150, 4), (150,))

Split the dataset into a training set and a test set

In [6]:
X_train, X_test, y_train, y_test = train_test_split(iris_dataset.data, iris_dataset.target,
                                                    test_size=0.3, random_state=42)

In [7]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((105, 4), (45, 4), (105,), (45,))

Training

In [8]:
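lgb_params = {
    'boosting_type': 'gbdt',
    'objective': 'multiclass',
    'metric': 'multi_error',  # canonical name; 'metrics' is an accepted alias
    'verbose': 1,
    'num_class': 3  # the native interface needs the class count for multiclass
}
train_set = lgb.Dataset(X_train, y_train)
# Tie the validation set to the training set so both share the same feature binning.
test_set = lgb.Dataset(X_test, y_test, reference=train_set)

# num_boost_round is left at its default of 100. verbose_eval=10 logs every 10 rounds;
# LightGBM 4.0+ removed it in favor of callbacks=[lgb.log_evaluation(10)].
model = lgb.train(lgb_params, train_set=train_set, valid_sets=[train_set, test_set],
                  verbose_eval=10)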

[10]	training's multi_error: 0.0666667	valid_1's multi_error: 0
[20]	training's multi_error: 0.0571429	valid_1's multi_error: 0
[30]	training's multi_error: 0.0571429	valid_1's multi_error: 0
[40]	training's multi_error: 0.0285714	valid_1's multi_error: 0
[50]	training's multi_error: 0.0190476	valid_1's multi_error: 0
[60]	training's multi_error: 0.00952381	valid_1's multi_error: 0
[70]	training's multi_error: 0.00952381	valid_1's multi_error: 0
[80]	training's multi_error: 0.00952381	valid_1's multi_error: 0
[90]	training's multi_error: 0.00952381	valid_1's multi_error: 0.0222222
[100]	training's multi_error: 0.00952381	valid_1's multi_error: 0.0222222
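
`multi_error` is the multiclass misclassification rate, so every value in the log above is a count of wrong predictions divided by the size of the set (105 training rows, 45 validation rows). The sketch below checks that arithmetic; `error_rate` is a hypothetical helper written for this note, not a LightGBM function, and the counts of wrong predictions are inferred from the logged values.

```python
def error_rate(num_wrong, num_samples):
    """Fraction of misclassified samples, which is what multi_error reports."""
    return num_wrong / num_samples

# Iteration 10 on the training set: 7 of 105 rows misclassified.
train_error_10 = round(error_rate(7, 105), 7)
# Iteration 100 on the training set: 1 of 105 rows misclassified.
train_error_100 = round(error_rate(1, 105), 8)
# Iteration 100 on the validation set: 1 of 45 rows misclassified.
valid_error_100 = round(error_rate(1, 45), 7)

print(train_error_10)   # 0.0666667
print(train_error_100)  # 0.00952381
print(valid_error_100)  # 0.0222222
```

Read this way, the validation error moving from 0 to 0.0222222 at iteration 90 corresponds to a single test row changing from correct to incorrect.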


Prediction

In [9]:
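# predict returns one row of class probabilities per sample; argmax picks the class.
# No early stopping was used during training, so all trained iterations are applied.
pred_test = np.argmax(model.predict(X_test, num_iteration=model.best_iteration), axis=1)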

In [10]:
pred_test.shape

(45,)
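
For a multiclass objective, `model.predict` returns an array of shape `(n_samples, num_class)`, and `np.argmax(..., axis=1)` reduces it to one label per sample. A minimal sketch with made-up probabilities (the values below are illustrative, not output of the model above):

```python
import numpy as np

# Hypothetical class probabilities for 3 samples over 3 classes; each row sums to 1.
proba = np.array([
    [0.90, 0.07, 0.03],
    [0.10, 0.25, 0.65],
    [0.20, 0.70, 0.10],
])

# axis=1 takes the argmax across the class columns, giving one label per row.
labels = np.argmax(proba, axis=1)
print(labels)        # [0 2 1]
print(labels.shape)  # (3,)
```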

Model evaluation

In [11]:
precision = metrics.precision_score(y_test, pred_test, average='macro')
recall = metrics.recall_score(y_test, pred_test, average='macro')

'precision: {}, recall: {}'.format(precision, recall)

'precision: 0.9761904761904763, recall: 0.9743589743589745'
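
With `average='macro'`, precision and recall are computed per class and then averaged with equal weight, regardless of how many samples each class has. The sketch below reimplements that idea with NumPy on toy labels; `macro_precision_recall` is a hypothetical helper for illustration, and returning 0 for a class with no predictions or no true samples is an assumption of this sketch.

```python
import numpy as np

def macro_precision_recall(y_true, y_pred, num_class):
    """Unweighted mean of per-class precision and per-class recall."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    precisions, recalls = [], []
    for c in range(num_class):
        true_positive = np.sum((y_pred == c) & (y_true == c))
        predicted = np.sum(y_pred == c)  # samples predicted as class c
        actual = np.sum(y_true == c)     # samples that truly are class c
        # Assumption: score an empty class as 0 instead of dividing by zero.
        precisions.append(true_positive / predicted if predicted else 0.0)
        recalls.append(true_positive / actual if actual else 0.0)
    return float(np.mean(precisions)), float(np.mean(recalls))

# Toy labels: one class-1 sample is misclassified as class 2.
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 0, 1, 2, 2, 2]
toy_precision, toy_recall = macro_precision_recall(toy_true, toy_pred, num_class=3)
print(toy_precision, toy_recall)
```

Here the per-class precisions are 1, 1 and 2/3 (mean 8/9), and the per-class recalls are 1, 1/2 and 1 (mean 5/6).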

### scikit-learn interface

In [12]:
lgb_params = {
    'learning_rate': 0.1,
    'max_bin': 150,
    'num_leaves': 32,
    'max_depth': 11,
    'objective': 'multiclass',
    'n_estimators': 300
}

model = lgb.LGBMClassifier(**lgb_params)

In [13]:
model.fit(X_train, y_train)

LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_bin=150,
               max_depth=11, min_child_samples=20, min_child_weight=0.001,
               min_split_gain=0.0, n_estimators=300, n_jobs=-1, num_leaves=32,
               objective='multiclass', random_state=None, reg_alpha=0.0,
               reg_lambda=0.0, silent=True, subsample=1.0,
               subsample_for_bin=200000, subsample_freq=0)

In [14]:
pred_test = model.predict(X_test)

In [15]:
precision = metrics.precision_score(y_test, pred_test, average='macro')
recall = metrics.recall_score(y_test, pred_test, average='macro')

'precision: {}, recall: {}'.format(precision, recall)

'precision: 1.0, recall: 1.0'

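### Summary

* In lgb.train the regularization parameters are "lambda_l1" and "lambda_l2"; in the sklearn interface they are 'reg_alpha' and 'reg_lambda'.
* For multiclass classification, lgb.train needs the number of classes (e.g. "num_class": 5 for five classes; 3 for the iris data above) in addition to 'objective': 'multiclass', whereas the sklearn interface only needs 'objective': 'multiclass' and infers the number of classes from the labels.
* In the sklearn interface the number of boosting iterations is set with 'n_estimators' (e.g. 'n_estimators': 20) when the model is constructed; in lgb.train it is the num_boost_round argument, which defaults to 100.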
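
The naming differences listed above can be captured in a small lookup table. The helper below is hypothetical (it is not part of LightGBM) and covers only the parameters discussed in this summary. LightGBM also accepts several of these names as aliases, so treat this as an illustration of the mapping rather than a required conversion step.

```python
# scikit-learn-style name -> native lgb.train name, limited to this summary.
SKLEARN_TO_NATIVE = {
    'reg_alpha': 'lambda_l1',
    'reg_lambda': 'lambda_l2',
}

def to_native_params(sklearn_params, num_class):
    """Hypothetical helper: return (params, num_boost_round) for lgb.train."""
    params = dict(sklearn_params)
    # n_estimators becomes the num_boost_round argument, not a params entry.
    num_boost_round = params.pop('n_estimators', 100)
    params = {SKLEARN_TO_NATIVE.get(key, key): value for key, value in params.items()}
    # The native interface needs the class count spelled out for multiclass.
    if params.get('objective') == 'multiclass':
        params['num_class'] = num_class
    return params, num_boost_round

native_params, num_rounds = to_native_params(
    {'objective': 'multiclass', 'reg_alpha': 0.1, 'reg_lambda': 0.2, 'n_estimators': 20},
    num_class=3,
)
print(native_params)
# {'objective': 'multiclass', 'lambda_l1': 0.1, 'lambda_l2': 0.2, 'num_class': 3}
print(num_rounds)  # 20
```

The resulting pair could then be passed on as `lgb.train(native_params, train_set, num_boost_round=num_rounds)`.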
