# XGBoost调参技巧（二）Titanic实战预测进入9%

Titanic是Kaggle竞赛里的入门比赛之一，要求参赛者根据乘客的属性来预测是否幸存，是典型的二分类（Binary Classifier）问题。解决二分类问题的算法有很多：决策树、随机森林、GBM，而XGBoost是GBM的优化实现。因此本文以Titanic幸存者预测竞赛为例，介绍XGBoost的调参技巧。

## 一、读取数据，清洗数据

### 1.读取数据

In [1]:
#coding:utf-8
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from sklearn.cross_validation import KFold
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import accuracy_score

#read data
train = pd.read_csv("../data/train.csv")
test = pd.read_csv("../data/test.csv")

导入需要用到的包，注意我导入的是xgboost下的XGBClassifier包，可以结合sciket-learn下的grid_search来对参数进行暴力猜解。

### 2.清洗数据

In [2]:
def clean_data(titanic):#填充空数据 和 把string数据转成integer表示
    titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
    # child
    titanic["child"] = titanic["Age"].apply(lambda x: 1 if x < 15 else 0)

    # sex
    titanic["sex"] = titanic["Sex"].apply(lambda x: 1 if x == "male" else 0)

    titanic["Embarked"] = titanic["Embarked"].fillna("S")
    # embark
    def getEmbark(Embarked):
        if Embarked == "S":
            return 1
        elif Embarked == "C":
            return 2
        else:
            return 3
    titanic["embark"] = titanic["Embarked"].apply(getEmbark)
    
    # familysize
    titanic["fimalysize"] = titanic["SibSp"] + titanic["Parch"] + 1

    # cabin
    def getCabin(cabin):
        if cabin == "N":
            return 0
        else:
            return 1
    titanic["cabin"] = titanic["Cabin"].apply(getCabin)
    
    # name
    def getName(name):
        if "Mr" in str(name):
            return 1
        elif "Mrs" in str(name):
            return 2
        else:
            return 0
    titanic["name"] = titanic["Name"].apply(getName)

    titanic["Fare"] = titanic["Fare"].fillna(titanic["Fare"].median())

    return titanic
# 对数据进行清洗
train_data = clean_data(train)
test_data = clean_data(test)

## 二、特征工程

Kaggle竞赛的三个核心步骤：**特征工程、调参、模型融合**。俗话说：**数据和特征决定机器学习的上限，而算法只是用来逼近这个上限**，所以特征工程是机器学习能否成功的关键。我们在每个比赛中需要花大量时间来反复完成这个工作。

In [3]:
features = ["Pclass", "sex", "child", "fimalysize", "Fare", "embark", "cabin"]

## 三、模型选择

### 1.构造模型

In [4]:
# 简单初始化xgb的分类器就可以
clf =XGBClassifier(learning_rate=0.1, max_depth=2, silent=True, objective='binary:logistic')

### 2.交叉验证kfold
利用skean提供的grid_search来进行交叉验证选择参数

In [5]:
# 设置boosting迭代计算次数
param_test = {
    'n_estimators': range(30, 50, 2),
    'max_depth': range(2, 7, 1)
}
grid_search = GridSearchCV(estimator = clf, param_grid = param_test, scoring='accuracy', cv=5)
grid_search.fit(train[features], train["Survived"])
grid_search.grid_scores_, grid_search.best_params_, grid_search.best_score_

([mean: 0.81594, std: 0.00673, params: {'n_estimators': 30, 'max_depth': 2},
  mean: 0.81930, std: 0.00916, params: {'n_estimators': 32, 'max_depth': 2},
  mean: 0.82267, std: 0.00978, params: {'n_estimators': 34, 'max_depth': 2},
  mean: 0.82043, std: 0.01423, params: {'n_estimators': 36, 'max_depth': 2},
  mean: 0.82267, std: 0.01585, params: {'n_estimators': 38, 'max_depth': 2},
  mean: 0.82604, std: 0.01800, params: {'n_estimators': 40, 'max_depth': 2},
  mean: 0.82604, std: 0.01800, params: {'n_estimators': 42, 'max_depth': 2},
  mean: 0.82379, std: 0.01629, params: {'n_estimators': 44, 'max_depth': 2},
  mean: 0.82379, std: 0.01629, params: {'n_estimators': 46, 'max_depth': 2},
  mean: 0.82267, std: 0.01545, params: {'n_estimators': 48, 'max_depth': 2},
  mean: 0.82043, std: 0.01642, params: {'n_estimators': 30, 'max_depth': 3},
  mean: 0.81930, std: 0.01690, params: {'n_estimators': 32, 'max_depth': 3},
  mean: 0.81818, std: 0.01863, params: {'n_estimators': 34, 'max_depth': 3},

In [43]:
pre = grid_search.predict(test[features])
predict_dataframe = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": pre
})
predict_dataframe.to_csv('../data/xgboost-gridsearch.csv',index=False,encoding="utf-8")