# XGBoost Model

## References

* *Statistical Learning Methods* (统计学习方法)
* [XGBoost: A Scalable Tree Boosting System](https://arxiv.org/pdf/1603.02754.pdf)
* [Tianqi Chen, Introduction to Boosted Trees](https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf)

## XGBoost native interface

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb
import sklearn
from sklearn import datasets as sklearn_datasets
from sklearn import metrics
from sklearn.model_selection import train_test_split

Load the data

In [2]:
iris_dataset = sklearn_datasets.load_iris()

iris_df = pd.DataFrame(iris_dataset.data, columns=iris_dataset.feature_names)
iris_label_df = pd.DataFrame(iris_dataset.target, columns=['label'])

In [3]:
iris_df.tail()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 145 | 6.7 | 3.0 | 5.2 | 2.3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 |


Split the dataset into training and test sets

In [4]:
X_train, X_test, y_train, y_test = train_test_split(iris_dataset.data, iris_dataset.target,
                                                    test_size=0.3, random_state=42)

In [5]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((105, 4), (45, 4), (105,), (45,))

Training

In [6]:
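xgb_params = {
    'eta': 0.3,  # learning rate (shrinkage applied to each new tree)
    'verbosity': 1,  # replaces the deprecated 'silent' flag; 1 prints warnings only
    'objective': 'multi:softprob',  # output one probability per class
    'num_class': 3,  # iris has three classes
    'max_depth': 3  # maximum depth of each tree
}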

NUM_BOOST_ROUND = 10

model = xgb.train(xgb_params, dtrain=xgb.DMatrix(X_train, label=y_train),
                  num_boost_round=NUM_BOOST_ROUND)

Prediction

In [7]:
test_pred = model.predict(xgb.DMatrix(X_test))  # labels are not needed for prediction

In [8]:
X_test.shape, y_test.shape, test_pred.shape

((45, 4), (45,), (45, 3))

In [9]:
pd.DataFrame(test_pred).head()

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.024572 | 0.926167 | 0.049261 |
| 1 | 0.94096 | 0.034371 | 0.024669 |
| 2 | 0.025954 | 0.030035 | 0.944011 |
| 3 | 0.024273 | 0.914921 | 0.060806 |
| 4 | 0.022549 | 0.849922 | 0.127529 |


In [10]:
print('predicted labels: {}'.format(np.argmax(test_pred, axis=1)))

predicted labels: [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0 0 0 1 0 0 2 1
 0 0 0 2 1 1 0 0]
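
With `multi:softprob`, each row of the prediction is a probability distribution over the three classes, and `np.argmax(..., axis=1)` picks the most probable class per row. A small sketch of that step, using the first five probability rows shown earlier (copied by hand, so they are rounded as displayed):

```python
import numpy as np

# First five rows of test_pred as displayed in the notebook output (rounded).
probs = np.array([
    [0.024572, 0.926167, 0.049261],
    [0.940960, 0.034371, 0.024669],
    [0.025954, 0.030035, 0.944011],
    [0.024273, 0.914921, 0.060806],
    [0.022549, 0.849922, 0.127529],
])

# Each row sums to about 1 because the probabilities are normalized across classes.
row_sums = probs.sum(axis=1)

# argmax along axis=1 returns the column index (class label) with the highest probability.
labels = np.argmax(probs, axis=1)
print(labels)
```

The resulting labels agree with the first five entries of the predicted label array.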


Model evaluation

In [11]:
pred_test = np.argmax(test_pred, axis=1)

precision = metrics.precision_score(y_test, pred_test, average='macro')
recall = metrics.recall_score(y_test, pred_test, average='macro')
print('precision: {}, recall: {}'.format(precision, recall))

precision: 1.0, recall: 1.0
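
`average='macro'` computes precision and recall separately for each class and then takes their unweighted mean. The sketch below reproduces that calculation by hand; the helper `macro_precision_recall` and the demo labels are hypothetical and are not taken from the iris run.

```python
import numpy as np

def macro_precision_recall(y_true, y_pred, num_classes):
    """Return the unweighted mean of per-class precision and recall."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    precisions, recalls = [], []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))  # correctly predicted as class c
        predicted_c = np.sum(y_pred == c)           # everything predicted as class c
        actual_c = np.sum(y_true == c)              # everything that really is class c
        # Fall back to 0.0 when a class is never predicted or never present.
        precisions.append(tp / predicted_c if predicted_c else 0.0)
        recalls.append(tp / actual_c if actual_c else 0.0)
    return float(np.mean(precisions)), float(np.mean(recalls))

# Hypothetical labels with a single mistake: one class-1 sample predicted as class 2.
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 0, 1, 2, 2, 2]
precision_demo, recall_demo = macro_precision_recall(y_true_demo, y_pred_demo, num_classes=3)
print('precision: {}, recall: {}'.format(precision_demo, recall_demo))
```

Because every class counts equally, a single error in a small class moves the macro scores more than it would move plain accuracy.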


## XGBoost scikit-learn interface

In [12]:
from xgboost import XGBClassifier

Training

In [13]:
model = XGBClassifier(learning_rate=0.01, n_estimators=3000, max_depth=4,
                      objective='multi:softprob',  # iris has three classes, so a binary objective does not apply
                      seed=27)
model.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.01, max_delta_step=0, max_depth=4,
              min_child_weight=1, missing=None, n_estimators=3000, n_jobs=1,
              nthread=None, objective='multi:softprob', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=27,
              silent=None, subsample=1, verbosity=1)

Prediction

In [14]:
pred_test = model.predict(X_test)

In [15]:
pred_test

array([1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 2,
       0, 2, 2, 2, 2, 2, 0, 0, 0, 0, 1, 0, 0, 2, 1, 0, 0, 0, 2, 1, 1, 0,
       0])

Model evaluation

In [16]:
precision = metrics.precision_score(y_test, pred_test, average='macro')
recall = metrics.recall_score(y_test, pred_test, average='macro')

'precision: {}, recall: {}'.format(precision, recall)

'precision: 1.0, recall: 1.0'