<a href="https://colab.research.google.com/github/pengfei123xiao/ML_Basic/blob/master/models/Ch05-DecisionTree/XGBoost/1/1_Mushroom_skearn_GridSearchCV.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# XGBoost+scikit-learn-GridSearchCV

In [0]:
from xgboost import XGBClassifier
from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from matplotlib import pyplot

## Load data

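scikit-learn supports several data formats, including LibSVM-format data.
XGBoost can also load text data in LibSVM format. The LibSVM file format (sparse features) looks like this:
1 101:1.2 102:0.03
0 1:2.1 10001:300 10002:400
...

Each line represents one sample. The "1" at the start of the first line is that sample's label; "101" and "102" are feature indices, and "1.2" and "0.03" are the corresponding feature values.
For binary classification, "1" marks a positive sample and "0" a negative sample. A label in [0, 1] is also supported and is interpreted as the probability that the sample is positive.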
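
To make the format concrete, here is a minimal sketch of how a single LibSVM line could be parsed by hand. The helper name `parse_libsvm_line` is hypothetical and exists only for illustration; in this notebook the actual loading is done by `load_svmlight_file`.

```python
def parse_libsvm_line(line):
    """Parse one LibSVM-format line into (label, {feature_index: value})."""
    tokens = line.split()
    # The first token is the label; the rest are "index:value" pairs.
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features


# The first sample line from the format description above.
label, features = parse_libsvm_line("1 101:1.2 102:0.03")
print(label, features)
```

Only the features that appear on the line are stored, which is why the format suits sparse data: any index that is not listed is treated as zero.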

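The example data below asks us to decide, from a number of a mushroom's attributes, whether the species is poisonous.
UCI data description: http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/
Each sample describes 22 attributes of a mushroom, such as shape and odor (126 features after conversion to LibSVM format), together with whether the mushroom is edible. 6513 samples are used for training and 1611 for testing.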
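
The text does not say how 22 categorical attributes become 126 features. A common way to get such an expansion is one-hot encoding, where every distinct (attribute, value) pair gets its own binary feature. The sketch below assumes that scheme; the helper `one_hot_encode` and the attribute values are made up for illustration and are not taken from the dataset.

```python
def one_hot_encode(rows):
    """Map rows of categorical attributes to sparse {feature_index: 1.0} dicts."""
    # Give each distinct (column, value) pair a 1-based feature index,
    # in order of first appearance.
    index_of = {}
    for row in rows:
        for col, value in enumerate(row):
            index_of.setdefault((col, value), len(index_of) + 1)
    encoded = [
        {index_of[(col, value)]: 1.0 for col, value in enumerate(row)}
        for row in rows
    ]
    return encoded, index_of


# Hypothetical rows with two attributes each (shape, odor).
rows = [("convex", "almond"), ("bell", "foul"), ("convex", "foul")]
encoded, index_of = one_hot_encode(rows)
print(len(index_of))  # number of binary features produced
print(encoded)
```

Two attributes with two values each produce four binary features here; the same effect, applied to 22 attributes with several values each, would yield a much wider sparse matrix.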

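XGBoost stores the data it loads in a DMatrix object.
DMatrix is XGBoost's own data matrix class, optimized for storage and computation speed.
DMatrix documentation: http://xgboost.readthedocs.io/en/latest/python/python_api.html
This notebook uses the scikit-learn wrapper, so the data is read with load_svmlight_file rather than into a DMatrix.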

In [6]:
# read in data; the files ship in the demo directory of the XGBoost installation and were copied to the data directory next to this code
my_workpath = './data/'
# X_train,y_train = load_svmlight_file(my_workpath + 'agaricus.txt.train')
# X_test,y_test = load_svmlight_file(my_workpath + 'agaricus.txt.test')
X_train,y_train = load_svmlight_file('agaricus.txt.train')
X_test,y_test = load_svmlight_file('agaricus.txt.test')
print(X_train.shape)
print(X_test.shape)

(6513, 126)
(1611, 126)


## parameter settings

In [8]:
# specify parameters via map
params = {'max_depth':2, 'eta':0.1, 'silent':0, 'objective':'binary:logistic' }
print(params)

{'max_depth': 2, 'eta': 0.1, 'silent': 0, 'objective': 'binary:logistic'}


## build models

In [0]:
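# Passing the dict positionally, XGBClassifier(params), does not set these options;
# the scikit-learn wrapper takes them as keyword arguments instead.
# Newer XGBoost releases replace `silent` with `verbosity` (e.g. verbosity=0).
bst = XGBClassifier(max_depth=2, learning_rate=0.1, silent=True, objective='binary:logistic')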

## cross-validation

In [14]:
# candidate values for the number of boosting iterations
param_test = {
 'n_estimators': range(1, 51, 2)
}
clf = GridSearchCV(estimator = bst, param_grid = param_test, scoring='accuracy', cv=5)
clf.fit(X_train, y_train)
print(clf.cv_results_)
print(clf.best_params_)
print(clf.best_score_)

{'mean_fit_time': array([0.05043974, 0.05420499, 0.06067576, 0.06380911, 0.06899858,
       0.07582035, 0.07988787, 0.08605852, 0.0901618 , 0.0971066 ,
       0.10076823, 0.11891489, 0.11622629, 0.12000985, 0.12458234,
       0.12912359, 0.1379509 , 0.13888717, 0.14431915, 0.15023499,
       0.15559778, 0.16214075, 0.16618934, 0.17119193, 0.17684126]), 'std_fit_time': array([0.00296757, 0.00073235, 0.00162257, 0.00049193, 0.00046373,
       0.0023111 , 0.00036929, 0.00195421, 0.00069291, 0.0025886 ,
       0.00058855, 0.02222305, 0.00716246, 0.00445608, 0.00162269,
       0.00101656, 0.00403366, 0.00103187, 0.00332253, 0.00197547,
       0.00128598, 0.00256371, 0.00320654, 0.00309259, 0.00337979]), 'mean_score_time': array([0.01078362, 0.01173801, 0.01183562, 0.01084042, 0.01081595,
       0.01075363, 0.01088805, 0.01077957, 0.01094556, 0.01082301,
       0.01093807, 0.01153059, 0.01124797, 0.01093855, 0.01203475,
       0.01107521, 0.0110395 , 0.01190333, 0.01096921, 0.01221805,
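
To see how much work the grid search above does, the sketch below counts the candidate values and builds 5 train/validation splits by hand. It is a simplification under stated assumptions: `k_fold_indices` is a hypothetical helper that cuts plain contiguous folds, whereas scikit-learn uses stratified folds by default when the estimator is a classifier and `cv` is an integer.

```python
def k_fold_indices(n_samples, n_splits):
    """Return n_splits (train_indices, validation_indices) pairs over range(n_samples)."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1

    folds = []
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, validation))
        start += size
    return folds


# Same grid and fold count as in the cell above.
candidates = list(range(1, 51, 2))   # n_estimators = 1, 3, ..., 49
folds = k_fold_indices(6513, 5)      # 6513 training samples, cv=5
total_fits = len(candidates) * len(folds)
print(len(candidates), len(folds), total_fits)
```

Each candidate is trained once per fold and scored on the held-out part, and the per-candidate averages are what `cv_results_` reports as `mean_test_score`. With the default `refit=True`, the best candidate is then trained once more on the full training set.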
     

## testing

In [15]:
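# make predictions on the test set
preds = clf.predict(X_test)
# predict() returns class labels rather than probabilities, so round() does not change them
predictions = [round(value) for value in preds]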

test_accuracy = accuracy_score(y_test, predictions)
print("Test Accuracy of gridsearchcv: %.2f%%" % (test_accuracy * 100.0))

Test Accuracy of gridsearchcv: 97.27%
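
For reference, accuracy is the fraction of predictions that match the true labels. The sketch below is a plain-Python equivalent of that definition; the function name `accuracy` and the toy labels are illustrative only, and the notebook itself relies on `accuracy_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy example: three of four predictions are correct.
toy_score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
print("Toy accuracy: %.2f%%" % (toy_score * 100.0))

# On 1611 test samples, 1567 correct predictions round to the 97.27% reported above.
reported = round(1567 / 1611 * 100.0, 2)
print(reported)
```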
