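## XGBoost-demo-1
### 1. Saving a DMatrix as a binary file
Data that has already been through feature processing and converted to the DMatrix format can be saved to disk, so that it can be loaded and used directly next time.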

In [1]:
import xgboost as xgb
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split


In [2]:
# Load the breast cancer dataset
cancer = datasets.load_breast_cancer()
X = cancer.data
y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2,random_state = 8)

In [3]:
xgb_train = xgb.DMatrix(X_train,label=y_train)
xgb_test = xgb.DMatrix(X_test,label=y_test)

In [42]:
params = {
    'objective': "binary:logistic",
    'booster': "gbtree",
    'eta': 0.1,
    'min_child_weight': 1,
    'max_depth': 5
    }
num_round = 10
watchlist = [(xgb_train,'train'),(xgb_test,'test')]
model = xgb.train(params,xgb_train,num_round,evals = watchlist)
preds = model.predict(xgb_test)

[0]	train-error:0.024176	test-error:0.035088
[1]	train-error:0.015385	test-error:0.052632
[2]	train-error:0.013187	test-error:0.070175
[3]	train-error:0.004396	test-error:0.078947
[4]	train-error:0.004396	test-error:0.078947
[5]	train-error:0.004396	test-error:0.04386
[6]	train-error:0.004396	test-error:0.052632
[7]	train-error:0.004396	test-error:0.061404
[8]	train-error:0.004396	test-error:0.052632
[9]	train-error:0.004396	test-error:0.04386


In [5]:
xgb_test.save_binary('dtest.buffer')  # Save the DMatrix as a binary file
xgb_test2 = xgb.DMatrix('dtest.buffer')  # Load it back from disk
preds2 = model.predict(xgb_test2)  # Predict with the reloaded data

[02:02:41] 114x30 matrix with 3420 entries loaded from dtest.buffer


In [6]:
preds==preds2

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True])

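### 2. Training from previous predictions
Training on top of earlier predictions lets the model reach high accuracy quickly and saves time. The predictions used here are the raw values, before any transformation (such as softmax or sigmoid) is applied.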

In [7]:
# Train the model
bst = xgb.train(params,xgb_train,10,watchlist)

# output_margin=True returns the raw prediction, before the sigmoid transformation
pred_train = bst.predict(xgb_train,output_margin = True)
pred_test = bst.predict(xgb_test,output_margin=True)

# Use the previous predictions as the starting point (base margin)
xgb_train.set_base_margin(pred_train)
xgb_test.set_base_margin(pred_test)

print('Results when training starts from the previous predictions:')
bst = xgb.train(params,xgb_train,10,watchlist)

[0]	train-error:0.024176	test-error:0.035088
[1]	train-error:0.015385	test-error:0.052632
[2]	train-error:0.013187	test-error:0.070175
[3]	train-error:0.004396	test-error:0.078947
[4]	train-error:0.004396	test-error:0.078947
[5]	train-error:0.004396	test-error:0.04386
[6]	train-error:0.004396	test-error:0.052632
[7]	train-error:0.004396	test-error:0.061404
[8]	train-error:0.004396	test-error:0.052632
[9]	train-error:0.004396	test-error:0.04386
Results when training starts from the previous predictions:
[0]	train-error:0.004396	test-error:0.04386
[1]	train-error:0.004396	test-error:0.04386
[2]	train-error:0.004396	test-error:0.04386
[3]	train-error:0.004396	test-error:0.04386
[4]	train-error:0.004396	test-error:0.035088
[5]	train-error:0.004396	test-error:0.035088
[6]	train-error:0.004396	test-error:0.035088
[7]	train-error:0.004396	test-error:0.04386
[8]	train-error:0.004396	test-error:0.04386
[9]	train-error:0.004396	test-error:0.04386
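
The idea behind `set_base_margin` can be sketched without xgboost. For `binary:logistic`, the model accumulates a raw score (the margin) and applies the sigmoid only at the end, so continuing from earlier predictions means starting from their margins, not from their probabilities. The following is a minimal standard-library sketch; the numbers and the helper names `sigmoid` and `logit` are illustrative assumptions, not part of the xgboost API.

```python
import math

def sigmoid(margin):
    """Map a raw margin score to a probability."""
    return 1.0 / (1.0 + math.exp(-margin))

def logit(prob):
    """Inverse of sigmoid: map a probability back to a raw margin."""
    return math.log(prob / (1.0 - prob))

# Hypothetical raw margin from a first round of training, i.e. the kind of
# value predict(..., output_margin=True) returns for one sample.
base_margin = 1.2

# Hypothetical correction learned by the second round of training.
new_tree_output = 0.5

# Boosting is additive in margin space: the new trees refine the old score.
final_margin = base_margin + new_tree_output
final_prob = sigmoid(final_margin)

# Adding the correction to the probability instead would leave [0, 1].
wrong_prob = sigmoid(base_margin) + new_tree_output

print(final_prob, wrong_prob)
```

This is why the predictions passed to `set_base_margin` are taken with `output_margin=True`.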


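### 3. Custom objective and evaluation functions
A custom objective function in xgboost must ***return the first- and second-order gradients***, so the loss has to be **twice differentiable**. Predictions obtained with a custom objective are the model's raw values, with no transformation applied. Once the objective is customized, xgboost's built-in evaluation metrics are not necessarily appropriate, because by default they assume predictions that have already been transformed (sigmoid, softmax).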
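
For reference, the gradient and Hessian that a log-loss objective returns can be derived as follows, writing $x$ for the raw prediction (margin), $y \in \{0, 1\}$ for the label, and $p = \sigma(x)$ for the sigmoid-transformed prediction. This is the standard textbook derivation, included as a reading aid; the symbols are chosen here and do not appear in the original notebook.

```latex
\begin{aligned}
p &= \sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \frac{\partial p}{\partial x} = p\,(1 - p) \\
L(y, x) &= -\bigl[\, y \log p + (1 - y) \log (1 - p) \,\bigr] \\
\frac{\partial L}{\partial x} &= \Bigl(-\frac{y}{p} + \frac{1 - y}{1 - p}\Bigr)\, p\,(1 - p) = p - y \\
\frac{\partial^2 L}{\partial x^2} &= \frac{\partial p}{\partial x} = p\,(1 - p)
\end{aligned}
```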

In [8]:
import xgboost as xgb
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split

In [9]:
cancer = datasets.load_breast_cancer()
X = cancer.data
y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2,random_state = 8)

In [10]:
xgb_train = xgb.DMatrix(X_train,label=y_train)
xgb_test = xgb.DMatrix(X_test,label=y_test)

In [16]:
params = {
    "booster": "gbtree",
    "eta": 0.1,
    "max_depth": 5
}
num_round = 50
watchlist = [(xgb_train,'train'),(xgb_test,'test')]

In [28]:
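# Custom log-loss objective: given the predictions, return the first- and second-order gradients
def logregobj(preds,dtrain):
  labels = dtrain.get_label()  # true labels
  preds = 1.0/ (1.0 + np.exp(-preds))  # raw predictions mapped to probabilities by the sigmoid
  grad = preds - labels  # first-order gradient
  hess = preds * (1.0 - preds)  # second-order gradient (Hessian)
  return grad,hess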
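
One way to sanity-check a custom objective is to compare its analytic gradient and Hessian against finite differences of the loss. The sketch below does this for the log loss on a single sample, using only the standard library; `logloss`, `grad_hess`, and the step size `eps` are names and values chosen for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def logloss(margin, label):
    """Log loss of one sample as a function of the raw margin."""
    p = sigmoid(margin)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))

def grad_hess(margin, label):
    """Analytic first and second derivatives with respect to the margin."""
    p = sigmoid(margin)
    return p - label, p * (1.0 - p)

margin, label, eps = 0.3, 1.0, 1e-4

# Central finite differences of the loss with respect to the margin.
num_grad = (logloss(margin + eps, label) - logloss(margin - eps, label)) / (2 * eps)
num_hess = (logloss(margin + eps, label) - 2 * logloss(margin, label)
            + logloss(margin - eps, label)) / eps ** 2

grad, hess = grad_hess(margin, label)
print(abs(grad - num_grad), abs(hess - num_hess))
```

Both differences should be close to zero; a large gap usually points to a sign error or a missing sigmoid in the objective.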

In [29]:
# Custom evaluation function
def evalerror(preds, dtrain):
  labels = dtrain.get_label()
  # preds are raw values without the sigmoid transformation, so the classification threshold is 0
  return 'error', float(sum(labels != (preds > 0.0)))/ len(labels)
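
The threshold of 0 follows from the sigmoid being monotonic with $\sigma(0) = 0.5$: a raw margin above 0 is the same decision as a probability above 0.5. A small standard-library check with made-up margins and labels (the two helper functions are hypothetical, written only for this comparison):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def error_from_margins(margins, labels):
    """Classification error when thresholding raw margins at 0."""
    wrong = sum(1 for m, y in zip(margins, labels) if (m > 0.0) != bool(y))
    return wrong / len(labels)

def error_from_probs(probs, labels):
    """Classification error when thresholding probabilities at 0.5."""
    wrong = sum(1 for p, y in zip(probs, labels) if (p > 0.5) != bool(y))
    return wrong / len(labels)

# Hypothetical raw predictions and true labels.
margins = [-2.0, -0.1, 0.4, 3.0, -1.5]
labels = [0, 1, 1, 1, 1]

probs = [sigmoid(m) for m in margins]
print(error_from_margins(margins, labels), error_from_probs(probs, labels))
```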

In [30]:
bst = xgb.train(params,xgb_train,num_round,watchlist,obj=logregobj,feval=evalerror)

[0]	train-rmse:0.336644	test-rmse:0.352463	train-error:0.364835	test-error:0.403509
[1]	train-rmse:0.216709	test-rmse:0.253139	train-error:0.364835	test-error:0.403509
[2]	train-rmse:0.171317	test-rmse:0.230497	train-error:0.048352	test-error:0.078947
[3]	train-rmse:0.237105	test-rmse:0.290008	train-error:0.048352	test-error:0.087719
[4]	train-rmse:0.337897	test-rmse:0.383917	train-error:0.035165	test-error:0.078947
[5]	train-rmse:0.448986	test-rmse:0.487307	train-error:0.01978	test-error:0.070175
[6]	train-rmse:0.56289	test-rmse:0.596039	train-error:0.013187	test-error:0.052632
[7]	train-rmse:0.674882	test-rmse:0.698743	train-error:0.008791	test-error:0.052632
[8]	train-rmse:0.785773	test-rmse:0.805449	train-error:0.008791	test-error:0.052632
[9]	train-rmse:0.89346	test-rmse:0.906848	train-error:0.008791	test-error:0.052632
[10]	train-rmse:0.997078	test-rmse:1.00828	train-error:0.008791	test-error:0.052632
[11]	train-rmse:1.10112	test-rmse:1.10701	train-error:0.008791	test-error:0.061

### 4. Cross-validation


In [None]:
params = {
    "objective":"binary:logistic",
    "booster":"gbtree",
    "eta": 0.1,
    "max_depth": 5
}
num_round = 50

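xgboost performs cross-validation with the cv function. The nfold parameter is the number of folds the dataset is split into, and metrics specifies the evaluation metrics used during cross-validation. callbacks accepts a list of callback functions that are invoked at the end of every iteration; xgboost's built-in callback functions can be used as well.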
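
To make the `mean+std` values in the cross-validation output concrete, here is a rough standard-library sketch of a k-fold split and the per-round aggregation. It is not xgboost's implementation: `kfold_indices` is a hypothetical helper, the sample count and fold scores are made up, and the use of the population standard deviation is an assumption about how the spread is reported.

```python
import random
import statistics

def kfold_indices(n_samples, nfold, seed=0):
    """Shuffle sample indices and split them into nfold test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::nfold] for i in range(nfold)]

n_samples = 455  # assumed size of the training set
folds = kfold_indices(n_samples, nfold=5)

for k, test_idx in enumerate(folds):
    # Every fold except the k-th one forms the training split.
    train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
    # A model would be trained on train_idx and scored on test_idx here.
    assert len(train_idx) + len(test_idx) == n_samples

# Hypothetical AUC of each of the 5 fold models after one boosting round.
fold_auc = [0.95, 0.97, 0.98, 0.94, 0.96]
mean_auc = statistics.mean(fold_auc)
std_auc = statistics.pstdev(fold_auc)
print(f"test-auc:{mean_auc:.6f}+{std_auc:.6f}")
```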

In [18]:
# xgb.callback.print_evaluation(show_stdv=True) prints the mean and standard deviation of each metric
res = xgb.cv(params,xgb_train,num_round,nfold=5,metrics={'auc'},seed=0,
             callbacks=[xgb.callback.print_evaluation(show_stdv=True)])

[0]	train-auc:0.996496+0.00175595	test-auc:0.962077+0.0177871
[1]	train-auc:0.996454+0.00177517	test-auc:0.963754+0.0200165
[2]	train-auc:0.996424+0.00179116	test-auc:0.964545+0.0198595
[3]	train-auc:0.996979+0.00189768	test-auc:0.963221+0.0218959
[4]	train-auc:0.996952+0.00191968	test-auc:0.964236+0.0240807
[5]	train-auc:0.999015+0.000712626	test-auc:0.966035+0.0234331
[6]	train-auc:0.999116+0.000696901	test-auc:0.968592+0.0240493
[7]	train-auc:0.999345+0.00072586	test-auc:0.967956+0.0241695
[8]	train-auc:0.99948+0.000565826	test-auc:0.967996+0.0244328
[9]	train-auc:0.999474+0.000573455	test-auc:0.968494+0.0247489
[10]	train-auc:0.999487+0.000557343	test-auc:0.968599+0.0247509
[11]	train-auc:0.999719+0.000368483	test-auc:0.967738+0.0258376
[12]	train-auc:0.999835+0.000198995	test-auc:0.967957+0.0255959
[13]	train-auc:0.999874+0.000174767	test-auc:0.967434+0.0261129
[14]	train-auc:0.999941+6.44962e-05	test-auc:0.968084+0.0250049
[15]	train-auc:0.999974+3.77306e-05	test-auc:0.974219+0.0

In [20]:
# Do not print the standard deviation; stop training if the metric has not improved for 5 rounds
res = xgb.cv(params,xgb_train,num_round,nfold=5,metrics={'auc'},seed=0,
             callbacks=[xgb.callback.print_evaluation(show_stdv=False),
                        xgb.callback.early_stop(5)])

[0]	train-auc:0.996496	test-auc:0.962077
Multiple eval metrics have been passed: 'test-auc' will be used for early stopping.

Will train until test-auc hasn't improved in 5 rounds.
[1]	train-auc:0.996454	test-auc:0.963754
[2]	train-auc:0.996424	test-auc:0.964545
[3]	train-auc:0.996979	test-auc:0.963221
[4]	train-auc:0.996952	test-auc:0.964236
[5]	train-auc:0.999015	test-auc:0.966035
[6]	train-auc:0.999116	test-auc:0.968592
[7]	train-auc:0.999345	test-auc:0.967956
[8]	train-auc:0.99948	test-auc:0.967996
[9]	train-auc:0.999474	test-auc:0.968494
[10]	train-auc:0.999487	test-auc:0.968599
[11]	train-auc:0.999719	test-auc:0.967738
[12]	train-auc:0.999835	test-auc:0.967957
[13]	train-auc:0.999874	test-auc:0.967434
[14]	train-auc:0.999941	test-auc:0.968084
[15]	train-auc:0.999974	test-auc:0.974219
[16]	train-auc:0.999981	test-auc:0.974204
[17]	train-auc:0.999981	test-auc:0.976253
[18]	train-auc:0.999994	test-auc:0.975778
[19]	train-auc:1	test-auc:0.97664
[20]	train-auc:1	test-auc:0.976759
[21]
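
The early-stopping rule used in this call can be sketched independently of xgboost: track the best score seen so far and stop once it has not improved for a given number of rounds. `early_stop_round` and the score list are hypothetical, and the sketch assumes a metric that should be maximized, such as AUC.

```python
def early_stop_round(scores, patience):
    """Return (stop_round, best_round) for a metric that should be maximized.

    Training stops at the first round where the best score is `patience`
    rounds old; if that never happens, the last round is returned.
    """
    best_round, best_score = 0, float("-inf")
    for rnd, score in enumerate(scores):
        if score > best_score:
            best_round, best_score = rnd, score
        elif rnd - best_round >= patience:
            return rnd, best_round
    return len(scores) - 1, best_round

# Hypothetical validation AUC per boosting round.
scores = [0.960, 0.965, 0.970, 0.969, 0.968, 0.970, 0.969, 0.967, 0.971]
stop_round, best_round = early_stop_round(scores, patience=5)
print(stop_round, best_round)
```

In this made-up sequence the later value 0.971 is never reached, which illustrates the trade-off of a small patience: training time is saved, but a late improvement can be missed.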

In [22]:
print(res)

    train-auc-mean  train-auc-std  test-auc-mean  test-auc-std
0         0.996496       0.001756       0.962077      0.017787
1         0.996454       0.001775       0.963754      0.020016
2         0.996424       0.001791       0.964545      0.019859
3         0.996979       0.001898       0.963221      0.021896
4         0.996952       0.001920       0.964236      0.024081
5         0.999015       0.000713       0.966035      0.023433
6         0.999116       0.000697       0.968592      0.024049
7         0.999345       0.000726       0.967956      0.024170
8         0.999480       0.000566       0.967996      0.024433
9         0.999474       0.000573       0.968494      0.024749
10        0.999487       0.000557       0.968599      0.024751
11        0.999719       0.000368       0.967738      0.025838
12        0.999835       0.000199       0.967957      0.025596
13        0.999874       0.000175       0.967434      0.026113
14        0.999941       0.000064       0.968084      0

xgboost's cross-validation also supports a custom preprocessing function.

In [23]:
def fpreproc(xgb_train,xgb_test,params):
  label = xgb_train.get_label()
  ratio = float(np.sum(label==0))/np.sum(label==1)
  params['scale_pos_weight'] = ratio  # set scale_pos_weight to the negative/positive ratio
  return (xgb_train,xgb_test,params)

xgb.cv(params,xgb_train,num_round,nfold=5,metrics={'auc'},seed=0,fpreproc=fpreproc)

   train-auc-mean  train-auc-std  test-auc-mean  test-auc-std
0        0.995407       0.001344       0.953464      0.013351
1        0.995437       0.001356       0.953549      0.013661
2        0.998148       0.001816       0.965869      0.022223
3        0.999094       0.001178       0.972817      0.016854
4        0.999014       0.001167       0.974063      0.016056
5         0.99908       0.001115        0.97504      0.015884
6        0.999474       0.000588       0.976908      0.013069
7          0.9995       0.000632       0.977014      0.012762
8        0.999787       0.000287       0.977067      0.012741
9        0.999895       0.000138       0.977106      0.013299
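
The ratio computed in `fpreproc` is simply the number of negative samples divided by the number of positive samples in the training fold. A tiny standard-library illustration with made-up labels (`neg_pos_ratio` is a hypothetical helper name):

```python
def neg_pos_ratio(labels):
    """Number of negative samples divided by number of positive samples."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0:
        raise ValueError("no positive samples; the ratio is undefined")
    return n_neg / n_pos

# Hypothetical imbalanced fold: 6 negatives, 2 positives.
labels = [0, 0, 0, 1, 0, 0, 1, 0]
params = {"objective": "binary:logistic"}
params["scale_pos_weight"] = neg_pos_ratio(labels)
print(params["scale_pos_weight"])  # 3.0
```

Because the preprocessing function runs on each fold, the ratio reflects the class balance of that fold's training split rather than of the whole dataset.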


In [24]:
# Cross-validate with a user-defined objective and evaluation function, reusing the ones defined earlier
xgb.cv(params,xgb_train,num_round,nfold=5,obj=logregobj,feval=evalerror)

   train-error-mean  train-error-std  train-rmse-mean  train-rmse-std  test-error-mean  test-error-std  test-rmse-mean  test-rmse-std
0          0.364835         0.013725         0.336757        0.004438         0.364835        0.054901        0.361129       0.009725
1          0.364835         0.013725          0.21362        0.009275         0.364835        0.054901        0.261732       0.033812
2          0.066484         0.003645         0.171087        0.009007         0.107692         0.03362        0.240469       0.053229
3          0.039561         0.010654         0.229814        0.004164         0.096703        0.032894        0.287616       0.046761
4          0.026923         0.008038          0.33355        0.006268         0.076923        0.030295        0.376795        0.04314
5           0.02033         0.006408         0.447018        0.007996         0.072527        0.022628        0.475815       0.039896
6          0.017033          0.00636         0.560788        0.009092         0.061538         0.02467        0.580189       0.040293
7          0.015385         0.006407         0.673292        0.011197         0.059341        0.026556        0.683859       0.043168
8          0.014286         0.005327         0.782944         0.01215         0.057143        0.029812        0.787032       0.042268
9          0.013187         0.004395         0.889184        0.013959         0.057143        0.030612        0.889679        0.04232


### 5. Saving evaluation results
This section covers saving the model and saving the evaluation results.

In [None]:
model = xgb.train(params,xgb_train,10,watchlist)
model.save_model("./model.xgb")

[0]	train-error:0.004396	test-error:0.04386
[1]	train-error:0.004396	test-error:0.04386
[2]	train-error:0.004396	test-error:0.04386
[3]	train-error:0.004396	test-error:0.04386
[4]	train-error:0.004396	test-error:0.035088
[5]	train-error:0.004396	test-error:0.035088
[6]	train-error:0.004396	test-error:0.035088
[7]	train-error:0.004396	test-error:0.04386
[8]	train-error:0.004396	test-error:0.04386
[9]	train-error:0.004396	test-error:0.04386


In [None]:
bst = xgb.Booster()
bst.load_model("./model.xgb")
pred = bst.predict(xgb_test)
pred

array([0.9303397 , 0.9119321 , 0.9303397 , 0.8942841 , 0.92795146,
       0.9303397 , 0.4654373 , 0.46111533, 0.9303397 , 0.9303397 ,
       0.91980803, 0.9303397 , 0.9303397 , 0.07041784, 0.07041784,
       0.9303397 , 0.89680326, 0.07595936, 0.5600603 , 0.07041784,
       0.7233341 , 0.19820791, 0.72285205, 0.9303397 , 0.9303397 ,
       0.3644025 , 0.07041784, 0.07041784, 0.9303397 , 0.8813913 ,
       0.9303397 , 0.9303397 , 0.08022156, 0.07041784, 0.0756431 ,
       0.92710835, 0.0839306 , 0.07041784, 0.8252453 , 0.07041784,
       0.9303397 , 0.0756431 , 0.09421231, 0.9303397 , 0.59778714,
       0.9303397 , 0.8805882 , 0.07041784, 0.07041784, 0.07041784,
       0.5872261 , 0.10664127, 0.9087783 , 0.1211081 , 0.9257959 ,
       0.07041784, 0.07041784, 0.92458844, 0.9303397 , 0.9303397 ,
       0.07041784, 0.07041784, 0.9303397 , 0.74109215, 0.9303397 ,
       0.9303397 , 0.92458844, 0.12638171, 0.9303397 , 0.86494315,
       0.09925666, 0.9028226 , 0.07041784, 0.9303397 , 0.07041

In [None]:
bst.dump_model("./dump.txt")  # Dump the model to a text file; dump_model returns None, so nothing is assigned

Saving the evaluation results:

In [None]:
evals_result = {}  # dict that will hold the evaluation metrics
bst = xgb.train(params,xgb_train,num_round,watchlist,evals_result=evals_result)
print(evals_result)

[0]	train-error:0.004396	test-error:0.04386
[1]	train-error:0.004396	test-error:0.04386
[2]	train-error:0.004396	test-error:0.04386
[3]	train-error:0.004396	test-error:0.04386
[4]	train-error:0.004396	test-error:0.035088
[5]	train-error:0.004396	test-error:0.035088
[6]	train-error:0.004396	test-error:0.035088
[7]	train-error:0.004396	test-error:0.04386
[8]	train-error:0.004396	test-error:0.04386
[9]	train-error:0.004396	test-error:0.04386
{'train': {'error': [0.004396, 0.004396, 0.004396, 0.004396, 0.004396, 0.004396, 0.004396, 0.004396, 0.004396, 0.004396]}, 'test': {'error': [0.04386, 0.04386, 0.04386, 0.04386, 0.035088, 0.035088, 0.035088, 0.04386, 0.04386, 0.04386]}}
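
Once the metrics are in a plain dictionary, they can be inspected with ordinary Python. As a small sketch, the following finds the round with the lowest test error, using the values from the printed `evals_result`; `best_round` is just an illustrative variable name.

```python
# Evaluation history copied from the printed evals_result.
evals_result = {
    "train": {"error": [0.004396] * 10},
    "test": {"error": [0.04386, 0.04386, 0.04386, 0.04386, 0.035088,
                       0.035088, 0.035088, 0.04386, 0.04386, 0.04386]},
}

test_error = evals_result["test"]["error"]

# Index of the first round that reaches the minimum test error.
best_round = min(range(len(test_error)), key=test_error.__getitem__)
print(best_round, test_error[best_round])
```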


### 6. Predicting with the first n trees
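
A boosted model predicts by summing the outputs of its trees, so predicting with the first n trees amounts to truncating that sum. The standard-library sketch below illustrates this with made-up per-tree outputs for a single sample; `predict_first_n` and the base margin of 0.0 are assumptions for illustration, not xgboost internals.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_first_n(tree_outputs, n=None, base_margin=0.0):
    """Probability from the first n trees (all trees when n is None)."""
    used = tree_outputs if n is None else tree_outputs[:n]
    return sigmoid(base_margin + sum(used))

# Hypothetical leaf values (already scaled by the learning rate)
# that each of 5 trees assigns to one sample.
tree_outputs = [0.20, 0.15, 0.12, 0.08, 0.05]

p_first_2 = predict_first_n(tree_outputs, n=2)
p_all = predict_first_n(tree_outputs)
print(p_first_2, p_all)
```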

In [36]:
from sklearn.metrics import roc_auc_score

In [45]:
params = {
    "objective":"binary:logistic",
    "booster":"gbtree",
    "eta": 0.1,
    "max_depth": 5
}
num_round = 50
watchlist = [(xgb_test,'eval'),(xgb_train,'train')]

In [46]:
bst = xgb.train(params,xgb_train,num_round,watchlist)
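print('Predicting with the first 10 trees')
label = xgb_test.get_label()
# ntree_limit restricts prediction to the first 10 trees (newer xgboost releases use iteration_range=(0, 10) instead)
pred1 = bst.predict(xgb_test,ntree_limit=10)
# By default, all trees are used for prediction
pred2 = bst.predict(xgb_test)
print("AUC with the first 10 trees: %f" %roc_auc_score(y_test,pred1))
print("AUC with all trees: %f" %roc_auc_score(y_test,pred2))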

[0]	eval-error:0.035088	train-error:0.024176
[1]	eval-error:0.052632	train-error:0.015385
[2]	eval-error:0.070175	train-error:0.013187
[3]	eval-error:0.078947	train-error:0.004396
[4]	eval-error:0.078947	train-error:0.004396
[5]	eval-error:0.04386	train-error:0.004396
[6]	eval-error:0.052632	train-error:0.004396
[7]	eval-error:0.061404	train-error:0.004396
[8]	eval-error:0.052632	train-error:0.004396
[9]	eval-error:0.04386	train-error:0.004396
[10]	eval-error:0.04386	train-error:0.004396
[11]	eval-error:0.04386	train-error:0.004396
[12]	eval-error:0.04386	train-error:0.004396
[13]	eval-error:0.04386	train-error:0.004396
[14]	eval-error:0.035088	train-error:0.004396
[15]	eval-error:0.035088	train-error:0.004396
[16]	eval-error:0.035088	train-error:0.004396
[17]	eval-error:0.04386	train-error:0.004396
[18]	eval-error:0.04386	train-error:0.004396
[19]	eval-error:0.04386	train-error:0.004396
[20]	eval-error:0.04386	train-error:0.004396
[21]	eval-error:0.04386	train-error:0.004396
[22]	eval