# Kaggle Titanic in Practice 2.0

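Drawing on the strengths of several existing write-ups, this version aims to build a better understanding of the data and to stop selecting features by intuition alone.  
References:  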
https://ahmedbesbes.com/how-to-score-08134-in-titanic-kaggle-challenge.html
https://www.kaggle.com/mrisdal/exploring-survival-on-the-titanic
https://zhuanlan.zhihu.com/p/27655949  
http://www.jasongj.com/ml/classification/

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns

# Loading the data

In [None]:
train_data = pd.read_csv('./data/train.csv')
test_data = pd.read_csv('./data/test.csv')
train_data.head()

In [None]:
train_data.info()

In [None]:
test_data.head()

In [None]:
test_data.info()

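# Exploring the data

Before exploring the data, set aside part of it as a test set, so that what we see during exploration cannot bias our decisions.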

In [None]:
from sklearn.model_selection import train_test_split

train_set,test_set = train_test_split(train_data,test_size=0.2,random_state=42)
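As a side note, the split above is purely random, so the share of survivors in the two parts can differ slightly. A minimal sketch on a made-up frame (`toy` is hypothetical and only stands in for `train_data`) of how the `stratify` argument of `train_test_split` keeps the class ratio the same in both parts:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for train_data: 40% survivors.
toy = pd.DataFrame({
    "Survived": [1] * 40 + [0] * 60,
    "Fare": range(100),
})

# stratify=... asks for the same Survived ratio in both parts.
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["Survived"]
)

train_rate = toy_train["Survived"].mean()
test_rate = toy_test["Survived"].mean()
print(train_rate, test_rate)
```

Whether stratifying matters for this dataset is a judgment call; with a few hundred rows the random split is usually close enough.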

In [None]:
train_set.info()

As shown, Age, Cabin and Embarked all have missing values.

In [None]:
train_set.describe()

The mean of Survived is 0.37, meaning that a little over one third of the passengers survived.

In [None]:
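# Correlations are only defined for numeric columns, so select those explicitly
corr_matrix = train_set.drop('PassengerId',axis=1).select_dtypes(include='number').corr()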
corr_matrix

In [None]:
# plt.subplots returns a (figure, axes) pair; draw the heatmap on that axes
fig, ax = plt.subplots(figsize=(12,10))
sns.heatmap(corr_matrix, vmin=-1, vmax=1 , annot=True , square=True, ax=ax)

The correlation analysis gives a first impression of how the features relate to one another.

## Survived vs. Pclass

In [None]:
train_set['die'] = 1 - train_set['Survived']
# Sum only the two count columns, so non-numeric columns are never aggregated
train_set.groupby('Pclass')[['Survived','die']].sum().plot(kind='bar', figsize=(12, 10),
                                                          stacked=True, color=['g', 'r'])

Survival rates differ by class, and class 3 has the lowest. Pclass is a key feature, which the correlation matrix also shows.
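The stacked bars show counts, so the rate has to be read off by eye. Since Survived is 0/1, the survival rate per class is simply the group mean. A small sketch on made-up numbers (the `toy` frame is hypothetical, not the real data):

```python
import pandas as pd

# Hypothetical toy data: 3 passengers in class 1, 2 in class 2, 5 in class 3.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate.
rate_by_class = toy.groupby("Pclass")["Survived"].mean()
print(rate_by_class)
```

The same one-liner should work on `train_set` with any of the categorical columns used in the sections that follow.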

## Survived vs. Name

Every name contains a title (Mr, Mrs, ...). Extract the title first, then check whether it is a useful feature.

In [None]:
train_data.head()

In [None]:
titles = set()
for name in train_data['Name']:
    titles.add(name.split(',')[1].split('.')[0].strip())

In [None]:
titles
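The split-based extraction above can also be written as a single vectorized call. This is only a sketch of an alternative, assuming every name follows the "Surname, Title. Given names" pattern; the regular expression is my own and not taken from the referenced articles:

```python
import pandas as pd

# A few names in the dataset's "Surname, Title. Given names" format.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture everything between the first comma and the next period.
extracted = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(extracted.tolist())
```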

In [None]:
Title_Dictionary = {
    "Capt": "Officer",
    "Col": "Officer",
    "Major": "Officer",
    "Jonkheer": "Royalty",
    "Don": "Royalty",
    "Sir" : "Royalty",
    "Dr": "Officer",
    "Rev": "Officer",
    "the Countess":"Royalty",
    "Mme": "Mrs",
    "Mlle": "Miss",
    "Ms": "Mrs",
    "Mr" : "Mr",
    "Mrs" : "Mrs",
    "Miss" : "Miss",
    "Master" : "Master",
    "Lady" : "Royalty"
}

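def get_titles(df):
    # Extract the title from 'Name' (the text between the comma and the first period)
    df['Title'] = df['Name'].map(lambda name:name.split(',')[1].split('.')[0].strip())
    
    # Map each raw title to a broader group; titles missing from Title_Dictionary become NaN
    df['Title'] = df.Title.map(Title_Dictionary)
    return df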

In [None]:
get_titles(train_set)
train_set.head()

In [None]:
train_set.groupby('Title')[['Survived','die']].sum().plot(kind='bar',figsize=(12,10),
                                                              stacked=True,color=['g','r'])

Women (Miss, Mrs) have a higher survival rate, while men (Mr) and the Officer group have a lower one.

## Survived vs. Sex

In [None]:
train_set.groupby('Sex')[['Survived','die']].sum().plot(kind='bar',figsize=(12,10),
                                                       stacked=True,color=['g','r'])

Women have a higher survival rate than men, which confirms what the title feature showed above.

## Survived vs. Age

In [None]:
facet = sns.FacetGrid(train_set, hue="Survived",aspect=4)
# 'fill' replaces the deprecated 'shade' argument in recent seaborn versions
facet.map(sns.kdeplot,'Age',fill=True)
facet.set(xlim=(0, train_set['Age'].max()))
facet.add_legend()

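Children aged roughly 0 to 10 have a higher survival rate. Sex and age together reflect the "women and children first" policy, so both are important features.

## Survived vs. SibSp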

In [None]:
train_set.groupby('SibSp')[['Survived','die']].sum().plot(kind='bar',figsize=(12,10),
                                                       stacked=True,color=['g','r'])

## Survived vs. Parch

In [None]:
train_set.groupby('Parch')[['Survived','die']].sum().plot(kind='bar',figsize=(12,10),
                                                       stacked=True,color=['g','r'])

## Survived vs. Fare

In [None]:
facet = sns.FacetGrid(train_set, hue="Survived",aspect=4)
# 'fill' replaces the deprecated 'shade' argument in recent seaborn versions
facet.map(sns.kdeplot,'Fare',fill=True)
facet.set(xlim=(0, train_set['Fare'].max()))
facet.add_legend()

The higher the fare, the higher the survival rate.

## Survived vs. Cabin

Cabin has too many missing values, so it can simply be dropped.

## Survived vs. Embarked

In [None]:
train_set.groupby('Embarked')[['Survived','die']].sum().plot(kind='bar',figsize=(12,10),
                                                       stacked=True,color=['g','r'])

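Most passengers embarked at S, and their survival rate is lower; this may also be related to wealth and social status.

# Preparing the data

## Dropping unneeded features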

In [None]:
train_set = train_set.drop(['PassengerId','Ticket','Name','Cabin','die'],axis=1)

In [None]:
train_set.head()

## Handling missing values

In [None]:
train_set.info()

In [None]:
# Build the mask from train_set itself so that its index matches the frame being filtered
train_set[train_set['Embarked'].isna()]

In [None]:
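# Fill the missing Embarked values with 'C'; unlike hard-coded row labels,
# fillna cannot accidentally append a new row if a label is absent from this split
train_set['Embarked'] = train_set['Embarked'].fillna('C')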

In [None]:
train_set.info()

In [None]:
group_train = train_set.groupby(['Sex','Pclass','Title'])
# Restrict the median to numeric columns (Embarked is still text at this point)
group_median_train = group_train.median(numeric_only=True)
group_median_train

In [None]:
group_median_train = group_median_train.reset_index()[['Sex', 'Pclass', 'Title', 'Age']]
group_median_train

In [None]:
def fill_age(row):
    condition=(
        (group_median_train['Sex'] == row['Sex']) &
        (group_median_train['Title'] == row['Title']) &
        (group_median_train['Pclass'] == row['Pclass'])
    )
    return group_median_train[condition]['Age'].values[0]

In [None]:
train_set['Age'] = train_set.apply(lambda row: fill_age(row) if np.isnan(row['Age']) else row['Age'],axis=1)
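The row-wise `fill_age` lookup above can also be expressed with `groupby(...).transform`, which avoids the Python-level loop. The sketch below uses a made-up frame and adds a fallback to the overall median, which is my own assumption for groups that have no known age at all:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two groups and one missing age in each.
toy = pd.DataFrame({
    "Sex":    ["male", "male", "male", "female", "female", "female"],
    "Pclass": [3, 3, 3, 1, 1, 1],
    "Title":  ["Mr", "Mr", "Mr", "Mrs", "Mrs", "Mrs"],
    "Age":    [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# transform("median") returns one value per row: the median age of that row's group.
group_median = toy.groupby(["Sex", "Pclass", "Title"])["Age"].transform("median")
toy["Age"] = toy["Age"].fillna(group_median)

# Assumed fallback: groups without any known age get the overall median.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
print(toy)
```

Note that this computes the medians from the frame being filled, whereas `fill_age` always looks them up in the training medians; for the test data the lookup approach keeps the training statistics in charge.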

In [None]:
train_set.info()

## Converting text features to numeric features

In [None]:
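# Use an explicit mapping: pd.factorize numbers values by order of appearance,
# so the codes could differ between the training and test frames
train_set['Sex'] = train_set['Sex'].map({'male': 0, 'female': 1})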
train_set.head()

In [None]:
train_set = pd.concat([train_set,pd.get_dummies(train_set['Embarked'])],axis=1)
train_set.head()

In [None]:
train_set = pd.concat([train_set,pd.get_dummies(train_set['Title'])],axis=1)
train_set.head()

In [None]:
train_set = train_set.drop(['Embarked','Title'],axis=1)
train_set.head()

## Feature scaling

In [None]:
y_train = train_set['Survived']
train_set = train_set.drop(['Survived'],axis=1 )

In [None]:
from sklearn.preprocessing import StandardScaler

std = StandardScaler()
std.fit(train_set)

## Final data

In [None]:
X_train = std.transform(train_set)
X_train.shape,y_train.shape

# Models

## Feature selection

In [None]:
from sklearn.ensemble import RandomForestClassifier
rng_clf = RandomForestClassifier(n_estimators=50,max_features='sqrt')
rng_clf.fit(X_train,y_train)

In [None]:
rng_clf.feature_importances_

In [None]:
features = pd.DataFrame()
features['feature'] = train_set.columns
features['importance'] = rng_clf.feature_importances_
features.sort_values(by=['importance'], ascending=True, inplace=True)
features.set_index('feature', inplace=True)
features.plot(kind='barh',figsize=(12,10))

In [None]:
from sklearn.feature_selection import SelectFromModel
model = SelectFromModel(rng_clf, prefit=True)
X_train_reduced = model.transform(X_train)
X_train_reduced.shape

In [None]:
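# Keep the first 14 columns; if train_set has more columns than that,
# the trailing ones are dropped here
X_train = X_train[:,:14]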
X_train.shape

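Tree-based estimators can compute feature importances, which can then be used to filter out unimportant features. There are not many features here, so the reduced set is not used and the models are trained on the columns kept in X_train.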
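To see by name which features `SelectFromModel` would keep, its boolean mask from `get_support()` can be applied to the column index. A sketch on synthetic data (the frame, its `f0`...`f5` column names and the forest settings are all made up for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the training matrix: 6 features, 2 of them informative.
X, y = make_classification(n_samples=200, n_features=6, n_informative=2,
                           n_redundant=0, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(6)])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# With prefit=True the selector reuses the fitted forest; by default it keeps
# features whose importance is at least the mean importance.
selector = SelectFromModel(forest, prefit=True)
kept = X.columns[selector.get_support()]
print(list(kept))
```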

## Trying different models

### Logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression

log_clf = LogisticRegression(random_state=42)
log_clf.fit(X_train,y_train)
y_log_pred = log_clf.predict(X_train)

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_log_pred,y_train)

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(log_clf,X_train,y_train,cv=3)

### Decision tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

dt_clf = DecisionTreeClassifier()
dt_clf.fit(X_train,y_train)
y_dt_pred = dt_clf.predict(X_train)
accuracy_score(y_dt_pred,y_train)

In [None]:
cross_val_score(dt_clf,X_train,y_train,cv=3,scoring='accuracy')

### Random forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(random_state=42)
rnd_clf.fit(X_train,y_train)
y_rnd_pred = rnd_clf.predict(X_train)
accuracy_score(y_rnd_pred,y_train)

In [None]:
cross_val_score(rnd_clf,X_train,y_train,cv=3,scoring='accuracy')

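The unconstrained decision tree and random forest both overfit badly: their training accuracy is far above their cross-validated accuracy. The random forest is therefore tuned over combinations of hyperparameters.

## Fine-tuning the model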

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
parameter_grid = {
                 'max_depth' : [2, 3, 4, 5, 6, 7, 8],
                 'n_estimators': [10,20,30,40,50],
                 # 'auto' was an alias of 'sqrt' for classifiers and has been removed from recent scikit-learn
                 'max_features': ['sqrt', 'log2'],
                 'min_samples_split': [2, 3, 5, 10],
                 'min_samples_leaf': [1, 3, 5, 10],
                 'bootstrap': [True, False],
                 }
forest = RandomForestClassifier()
cross_validation = StratifiedKFold(n_splits=3)

grid_search = GridSearchCV(forest,
                            scoring='accuracy',
                            param_grid=parameter_grid,
                            cv=cross_validation,
                            verbose=1
                            )

grid_search.fit(X_train, y_train)
model = grid_search
parameters = grid_search.best_params_

In [None]:
parameters

In [None]:
model
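Besides `best_params_`, a fitted `GridSearchCV` exposes the mean cross-validated score of the winning combination (`best_score_`) and the estimator refitted on all the data (`best_estimator_`). A small sketch on synthetic data with a deliberately tiny grid; the data and grid values are illustrative assumptions, not the ones used above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for X_train / y_train.
X, y = make_classification(n_samples=150, n_features=5, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=10, random_state=0),
    param_grid={"max_depth": [2, 4]},
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=3),
)
search.fit(X, y)

# best_score_ is the mean cross-validated accuracy of the best combination,
# a fairer estimate than accuracy measured on the training data itself.
print(search.best_params_, search.best_score_)
best_forest = search.best_estimator_
```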

In [None]:
y_rnd_pred = model.predict(X_train)
accuracy_score(y_rnd_pred,y_train)

In [None]:
# Cross-validate the tuned forest rather than the untuned rnd_clf
cross_val_score(model.best_estimator_,X_train,y_train,cv=3,scoring='accuracy')

# Running on the test set

In [None]:
test_set.info()

In [None]:
get_titles(test_set)
test_set.head()

In [None]:
test_set = test_set.drop(['PassengerId','Ticket','Name','Cabin'],axis=1)
test_set.head()

In [None]:
test_set['Age'] = test_set.apply(lambda row: fill_age(row) if np.isnan(row['Age']) else row['Age'],axis=1)
test_set.info()

In [None]:
# Same explicit encoding as for the training data (male=0, female=1)
test_set['Sex'] = test_set['Sex'].map({'male': 0, 'female': 1})
test_set = pd.concat([test_set,pd.get_dummies(test_set['Embarked'])],axis=1)
test_set = pd.concat([test_set,pd.get_dummies(test_set['Title'])],axis=1)
test_set = test_set.drop(['Embarked','Title'],axis=1)
test_set.head()
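One caveat: calling `pd.get_dummies` separately on each split can yield different columns when a category (for example a rare title) is missing from one of them. A minimal sketch of aligning the columns with `reindex`, using two hypothetical toy frames:

```python
import pandas as pd

# Hypothetical splits: 'Royalty' and 'Mrs' appear only in the training part.
toy_train = pd.DataFrame({"Title": ["Mr", "Mrs", "Royalty", "Miss"]})
toy_test = pd.DataFrame({"Title": ["Mr", "Miss", "Mr"]})

train_dummies = pd.get_dummies(toy_train["Title"])
test_dummies = pd.get_dummies(toy_test["Title"])

# Give the test frame exactly the training columns, in the same order;
# categories unseen in the test data become all-zero columns.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_aligned.columns))
```

With aligned columns the same fitted scaler and model can be applied to both frames without slicing columns by position.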

In [None]:
y_test = test_set['Survived']
test_set = test_set.drop(['Survived'],axis=1 )

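# Note: this fits a new scaler on the held-out data, so it is scaled with its own
# mean and variance rather than with the statistics learned from the training data
std = StandardScaler()
std.fit(test_set)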

X_test = std.transform(test_set)
X_test.shape,y_test.shape
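The cell above fits a fresh `StandardScaler` on the held-out data. The more common practice is to fit the scaler once on the training data and reuse it, so that both sets are expressed in the same units. A toy sketch of the difference (the arrays are made up, and this assumes both sets have identical columns):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy_train = np.array([[1.0], [2.0], [3.0]])   # mean 2
toy_test = np.array([[10.0], [20.0]])         # much larger values

# Reuse the training statistics: the test values stay visibly "large".
scaler = StandardScaler().fit(toy_train)
reused = scaler.transform(toy_test)

# Refit on the test data: the values are re-centred around 0, which hides
# the shift between the two sets from the model.
refit = StandardScaler().fit_transform(toy_test)

print(reused.ravel(), refit.ravel())
```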

In [None]:
y_test_pred = model.predict(X_test)
accuracy_score(y_test_pred,y_test)

In [None]:
y_test_pred = log_clf.predict(X_test)
accuracy_score(y_test_pred,y_test)

The random forest is somewhat more accurate.

# Generating results for scoring on Kaggle

In [None]:
test_data = pd.read_csv('./data/test.csv')
test_data.head()
PassengerId = test_data['PassengerId']
test_data.info()

In [None]:
get_titles(test_data)
test_data = test_data.drop(['PassengerId','Ticket','Name','Cabin'],axis=1)
test_data['Age'] = test_data.apply(lambda row: fill_age(row) if np.isnan(row['Age']) else row['Age'],axis=1)
# Same explicit encoding as for the training data (male=0, female=1)
test_data['Sex'] = test_data['Sex'].map({'male': 0, 'female': 1})
test_data = pd.concat([test_data,pd.get_dummies(test_data['Embarked'])],axis=1)
test_data = pd.concat([test_data,pd.get_dummies(test_data['Title'])],axis=1)
test_data = test_data.drop(['Embarked','Title'],axis=1)
test_data.head()

In [None]:
test_data.info()

In [None]:
test_data[test_data['Fare'].isna()]

In [None]:
# Fill in using the median computed earlier
test_data.loc[152,'Fare'] = 7.8
test_data.info()

In [None]:
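# Note: as before, this fits a new scaler on the Kaggle test data instead of
# reusing the statistics learned from the training data
std = StandardScaler()
std.fit(test_data)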

X_test = std.transform(test_data)
X_test.shape

In [None]:
y_test_log_pred = log_clf.predict(X_test)

In [None]:
# Use the tuned random forest (model) here, not the logistic regression
y_test_rnd_pred = model.predict(X_test)

In [None]:
OutDf = pd.DataFrame(index= PassengerId,columns=['Survived'])
OutDf['Survived'] = y_test_log_pred
OutDf.to_csv('log_clf_result.csv')

In [None]:
OutDf = pd.DataFrame(index= PassengerId,columns=['Survived'])
OutDf['Survived'] = y_test_rnd_pred
OutDf.to_csv('rnd_clf_result.csv')
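Before uploading, it can help to check that the file has the layout Kaggle expects for this competition, which to my knowledge is a header row followed by two columns, `PassengerId` and `Survived`. A sketch with made-up ids and predictions, written to an in-memory buffer instead of a real file:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical ids and predictions standing in for PassengerId / y_test_rnd_pred.
passenger_ids = pd.Series([892, 893, 894], name="PassengerId")
predictions = np.array([0, 1, 0])

submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions})

# index=False keeps the row index out of the file, leaving exactly two columns.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```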