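## Ensemble Models

1. Random Forest Classifier
    Builds many decision trees, then predicts by majority vote across the trees.
2. Gradient Tree Boosting
    Builds trees one at a time; each new tree is fit to improve the predictions of the ensemble built so far.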
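The two ideas above can be sketched in a few lines of pure Python. This is only an illustration, not how scikit-learn implements either model. `majority_vote`, `fit_stump` and `boost_stumps` are hypothetical helper names, and the boosting part assumes squared-error loss on a one-dimensional toy regression problem, with single-split stumps as the weak learners and a learning rate of 0.5.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Random-forest style aggregation: one list of labels per tree."""
    # zip(*...) regroups the labels sample by sample
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*tree_predictions)]


def fit_stump(xs, residuals):
    """Find the single threshold on xs that best fits the residuals (squared error)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # drop the largest value so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean


def boost_stumps(xs, ys, n_rounds=20, learning_rate=0.5):
    """Boosting for squared error: each new stump is fit to the current residuals."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    stumps = []

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    for _ in range(n_rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


# Three trees voting on three samples
votes = majority_vote([[0, 1, 1], [0, 0, 1], [1, 1, 1]])
print(votes)  # [0, 1, 1]

# Toy regression data (made up for this sketch)
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = boost_stumps(xs, ys)
mse = sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(round(mse, 4))
```

The forest combines independent trees after the fact, while boosting makes every tree depend on the mistakes of the ones before it.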

In [1]:
import pandas as pd

In [2]:
data = pd.read_excel("./titanic.xls")

In [5]:
# Select pclass, sex and age as the features
X = data[['pclass', 'sex', 'age']]
y = data['survived']

In [6]:
# Replace missing ages with the mean age.
# Work on an explicit copy so pandas does not warn about writing to a slice of `data`.
X = X.copy()
X['age'] = X['age'].fillna(X['age'].mean())


In [7]:
# Split the dataset into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

In [8]:
# Vectorize the categorical feature (sex)
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(X_train.to_dict(orient='records'))
X_test  = vec.transform(X_test.to_dict(orient='records'))
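To see what the vectorization step does, here is a small sketch on two made-up passenger records (hypothetical values, not rows from titanic.xls). String-valued fields are expanded into one 0/1 column per category, while numeric fields pass through unchanged. `vec_demo`, `records` and `encoded` are names introduced only for this illustration.

```python
from sklearn.feature_extraction import DictVectorizer

# Two hypothetical passenger records, not rows from titanic.xls
records = [
    {'age': 29.0, 'sex': 'female'},
    {'age': 40.0, 'sex': 'male'},
]

vec_demo = DictVectorizer(sparse=False)
encoded = vec_demo.fit_transform(records)

# Column names learned during fit, sorted alphabetically
print(vec_demo.feature_names_)  # ['age', 'sex=female', 'sex=male']
# One row per record: age kept as-is, sex one-hot encoded
print(encoded.tolist())         # [[29.0, 1.0, 0.0], [40.0, 0.0, 1.0]]
```

Calling `fit_transform` only on the training set and `transform` on the test set, as in the cell above, keeps the column layout identical for both.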

In [9]:
# 1. Use a single decision tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
dtc_y_predict = dtc.predict(X_test)

print("The accuracy of DecisionTreeClassifier:", dtc.score(X_test, y_test))
print(classification_report(y_test, dtc_y_predict, target_names=['died', 'survived']))

The accuracy of DecisionTreeClassifier: 0.810975609756
             precision    recall  f1-score   support

       died       0.84      0.88      0.86       210
   survived       0.76      0.69      0.73       118

avg / total       0.81      0.81      0.81       328
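As a reading aid for the report columns, the sketch below computes precision, recall and F1 for one class from raw labels. It assumes Python 3 (true division), and `precision_recall_f1` is a hypothetical helper, not a scikit-learn function. The label lists are made up for the example.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels: 1 = survived, 0 = died
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_demo = [1, 0, 1, 0, 1, 0, 0, 1]
print(precision_recall_f1(y_true_demo, y_pred_demo))  # (0.75, 0.75, 0.75)
```

The `support` column in the report is simply the number of true samples of each class in the test set.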



In [11]:
# 2. Use a random forest
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
rfc_y_predict = rfc.predict(X_test)

print("The accuracy of RandomForestClassifier:", rfc.score(X_test, y_test))
print(classification_report(y_test, rfc_y_predict, target_names=['died', 'survived']))

The accuracy of RandomForestClassifier: 0.810975609756
             precision    recall  f1-score   support

       died       0.85      0.86      0.85       210
   survived       0.75      0.72      0.73       118

avg / total       0.81      0.81      0.81       328



In [12]:
# 3. Use gradient tree boosting
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)
gbc_y_predict = gbc.predict(X_test)

print("The accuracy of GradientBoostingClassifier:", gbc.score(X_test, y_test))
print(classification_report(y_test, gbc_y_predict, target_names=['died', 'survived']))

The accuracy of GradientBoostingClassifier: 0.807926829268
             precision    recall  f1-score   support

       died       0.84      0.86      0.85       210
   survived       0.74      0.71      0.73       118

avg / total       0.81      0.81      0.81       328
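The three accuracies above each come from a single run, and the models were created without a `random_state`, so rerunning the cells may shift the numbers slightly. A minimal sketch of a repeatable side-by-side comparison follows. It uses synthetic data from `make_classification` as a stand-in (an assumption, since titanic.xls is not bundled here), so its scores say nothing about the Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Synthetic stand-in data (an assumption; not the Titanic data)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=33)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=33)

# Fixing random_state makes each model give the same result on every run
models = {
    'DecisionTreeClassifier': DecisionTreeClassifier(random_state=33),
    'RandomForestClassifier': RandomForestClassifier(random_state=33),
    'GradientBoostingClassifier': GradientBoostingClassifier(random_state=33),
}

scores = {}
for name, model in models.items():
    model.fit(Xtr, ytr)
    scores[name] = model.score(Xte, yte)  # accuracy on the held-out split
    print(name, round(scores[name], 3))
```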

