# Predicting Titanic Survival with Decision Trees, Random Forests, and Gradient-Boosted Trees

## Load the Titanic passenger records from the internet into the variable titanic

In [14]:
# Import pandas under the alias pd
import pandas as pd

# Read the Titanic passenger records from the internet into the variable titanic
titanic = pd.read_csv(
    'http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt')

In [15]:
# Manually select pclass, age, and sex as the features for predicting whether a passenger survived
X = titanic[['pclass', 'age', 'sex']]
y = titanic['survived']

In [16]:
# Replace missing ages with the mean age of all passengers, so the model can be
# trained while affecting the prediction task as little as possible.
# X is a slice of titanic, so work on an explicit copy and assign the result back;
# an in-place fillna on the slice triggers pandas' SettingWithCopyWarning.
X = X.copy()
X['age'] = X['age'].fillna(X['age'].mean())
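The imputation above takes the mean over all passengers before the data is split. A common alternative is to learn the mean from the training rows only and reuse it for the test rows. The sketch below shows that variant with scikit-learn's `SimpleImputer`; the two small frames and the names `train`, `test`, and `imputer` are made up for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up ages for illustration only; NaN marks a missing value.
train = pd.DataFrame({'age': [20.0, 40.0, np.nan, 30.0]})
test = pd.DataFrame({'age': [np.nan, 50.0]})

imputer = SimpleImputer(strategy='mean')
train_age = imputer.fit_transform(train[['age']])  # learns the mean from the training rows
test_age = imputer.transform(test[['age']])        # reuses the training mean on the test rows

print(imputer.statistics_)
print(train_age.ravel())
print(test_age.ravel())
```

Whether this changes the results on the Titanic data is not something the notebook measures; the sketch only shows the mechanics.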


## Split the data, holding out 25% of the passengers for testing

In [17]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=33)
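As a quick check of what `test_size=0.25` means in row counts, the sketch below splits 1313 placeholder rows (984 + 329, the training and test sizes that appear later in this notebook). The arrays `X_demo` and `y_demo` are stand-ins made up for illustration, not the Titanic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1313 placeholder rows; the values themselves carry no meaning.
X_demo = np.arange(1313).reshape(-1, 1)
y_demo = np.arange(1313) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=33)

# The test share is rounded up, and the remaining rows go to training.
print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the shuffle repeatable, so rerunning the cell yields the same partition.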

In [19]:
X_train.head()

     pclass        age     sex
1086    3rd  31.194181    male
12      1st  31.194181  female
1036    3rd  31.194181    male
833     3rd  32.000000    male
1108    3rd  31.194181    male


### One-hot encoding of the categorical variables

In [20]:
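# Convert the categorical features into numeric feature vectors
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False)
# Fit the vectorizer on the training rows only, then apply the same mapping to the test rows.
# The orient value must be spelled 'records'; recent pandas versions reject the abbreviation 'record'.
X_train = vec.fit_transform(X_train.to_dict(orient='records'))
X_test = vec.transform(X_test.to_dict(orient='records'))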

In [21]:
X_train

array([[31.19418104,  0.        ,  0.        ,  1.        ,  0.        ,
         1.        ],
       [31.19418104,  1.        ,  0.        ,  0.        ,  1.        ,
         0.        ],
       [31.19418104,  0.        ,  0.        ,  1.        ,  0.        ,
         1.        ],
       ...,
       [12.        ,  0.        ,  1.        ,  0.        ,  1.        ,
         0.        ],
       [18.        ,  0.        ,  1.        ,  0.        ,  0.        ,
         1.        ],
       [31.19418104,  0.        ,  0.        ,  1.        ,  1.        ,
         0.        ]])

In [22]:
X_train.shape

(984, 6)

## Train a single decision tree and evaluate its predictions

In [23]:
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
dtc_y_pred = dtc.predict(X_test)
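A fitted tree can be printed as nested if/else rules, which is a simple way to see what it learned. The sketch below does this with `sklearn.tree.export_text` on a tiny made-up dataset with a single indicator column; the data and the names `X_toy`, `y_toy`, and `toy_tree` are assumptions for illustration only.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data: one indicator column that fully determines the label.
X_toy = [[0], [0], [1], [1]]
y_toy = [0, 0, 1, 1]

toy_tree = DecisionTreeClassifier(random_state=0)
toy_tree.fit(X_toy, y_toy)

# Render the learned splits as indented text rules.
rules = export_text(toy_tree, feature_names=['sex=female'])
print(rules)
```

On the notebook's own model, the same call would need the six column names produced by the vectorizer.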

In [24]:
from sklearn.metrics import classification_report

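# Print the decision tree's accuracy on the test set, along with per-class precision, recall, and F1.
print('The accuracy of decision tree is', dtc.score(X_test, y_test))
# classification_report expects (y_true, y_pred). The report shown below was produced
# with the two arguments reversed, so its precision and recall columns are swapped
# and its support column counts predictions rather than true labels.
print(classification_report(y_test, dtc_y_pred))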

The accuracy of decision tree is 0.7811550151975684
             precision    recall  f1-score   support

          0       0.91      0.78      0.84       236
          1       0.58      0.80      0.67        93

avg / total       0.81      0.78      0.79       329
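The metric functions in `sklearn.metrics` take the true labels first and the predictions second, and the order matters. A minimal sketch with made-up labels (`y_true_demo` and `y_pred_demo` are illustrative names) shows how reversing them exchanges precision and recall:

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels for illustration only.
y_true_demo = [0, 0, 0, 1]
y_pred_demo = [0, 0, 1, 1]

# Documented order: (y_true, y_pred).
p = precision_score(y_true_demo, y_pred_demo)  # 1 true positive / 2 predicted positives
r = recall_score(y_true_demo, y_pred_demo)     # 1 true positive / 1 actual positive

# Reversed order: the two metrics trade places.
p_rev = precision_score(y_pred_demo, y_true_demo)
r_rev = recall_score(y_pred_demo, y_true_demo)

print(p, r)
print(p_rev, r_rev)
```

The F1 score is symmetric in precision and recall, so it is the one column the argument order does not change.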



## Train a random forest classifier (an ensemble model) and evaluate its predictions

In [26]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
rfc_y_pred = rfc.predict(X_test)

In [27]:
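# Print the random forest's accuracy on the test set, along with per-class precision, recall, and F1.
print('The accuracy of random forest classifier is', rfc.score(X_test, y_test))
# classification_report expects (y_true, y_pred). The report shown below was produced
# with the two arguments reversed, so its precision and recall columns are swapped
# and its support column counts predictions rather than true labels.
print(classification_report(y_test, rfc_y_pred))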

The accuracy of random forest classifier is 0.7781155015197568
             precision    recall  f1-score   support

          0       0.90      0.78      0.83       233
          1       0.59      0.78      0.67        96

avg / total       0.81      0.78      0.79       329
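A random forest is trained with random bootstrap samples and random feature subsets, so scores can shift slightly from run to run unless the seed is fixed. The sketch below, on synthetic data from `make_classification`, fits two forests with the same `random_state` and checks that they agree; the data, the seed value, and `n_estimators=100` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, not the Titanic records.
X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=0)

forest_a = RandomForestClassifier(n_estimators=100, random_state=33).fit(X_syn, y_syn)
forest_b = RandomForestClassifier(n_estimators=100, random_state=33).fit(X_syn, y_syn)

# With identical seeds, both forests produce identical class probabilities.
same = bool((forest_a.predict_proba(X_syn) == forest_b.predict_proba(X_syn)).all())
print(same)
```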



## Train gradient-boosted decision trees (an ensemble model) and evaluate their predictions

In [28]:
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)
gbc_y_pred = gbc.predict(X_test)

In [12]:
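# Print the gradient-boosted trees' accuracy on the test set, along with per-class precision, recall, and F1.
print('The accuracy of gradient tree boosting is', gbc.score(X_test, y_test))
# classification_report expects (y_true, y_pred). The report shown below was produced
# with the two arguments reversed, so its precision and recall columns are swapped
# and its support column counts predictions rather than true labels.
print(classification_report(y_test, gbc_y_pred))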

The accuracy of gradient tree boosting is 0.790273556231003
             precision    recall  f1-score   support

          0       0.92      0.78      0.84       239
          1       0.58      0.82      0.68        90

avg / total       0.83      0.79      0.80       329
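The three accuracies above differ by about one percentage point on a single 329-row test split, which is a small gap to draw conclusions from. One way to get a steadier comparison is k-fold cross-validation. The sketch below runs 5-fold `cross_val_score` for the same three model classes on synthetic data; the data, seeds, and the names `models` and `cv_scores` are assumptions for illustration, and no scores are claimed for the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the Titanic records.
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)

models = {
    'decision tree': DecisionTreeClassifier(random_state=0),
    'random forest': RandomForestClassifier(random_state=0),
    'gradient boosting': GradientBoostingClassifier(random_state=0),
}

cv_scores = {}
for name, model in models.items():
    # One accuracy value per fold; the mean and spread summarize the model.
    cv_scores[name] = cross_val_score(model, X_syn, y_syn, cv=5)
    print(name, cv_scores[name].mean(), cv_scores[name].std())
```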

