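# Titanic Passenger Survival Prediction with Various Estimators
## General steps for data processing and analysis
- Obtain the data
- Process the data
    - Feature values x
    - Target values y
- Feature engineering: standardization
- Run the estimator workflow
- Model selection and tuning
- Model evaluation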

- [1 Data source](#1数据来源)
- [2 Data fields](#2数据信息)
- [3 Data processing](#3数据处理)
- [4 Choose the features and the target](#4确定特征值目标值)
- [5 Split the dataset](#5划分数据集)
- [6 Estimators](#6预估器)
- [7 Summary](#7小结)


# 1 Data source

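The Titanic was the largest and most luxuriously appointed passenger liner in the world at the time, and was reputed to be "unsinkable". On her maiden voyage, however, disaster struck. She sailed from Southampton, England, calling at Cherbourg-Octeville in France and Cobh in Ireland, bound for New York. At about 23:40 on 14 April 1912 the Titanic struck an iceberg, which ruptured the starboard hull from the bow to amidships and flooded five watertight compartments. At about 02:20 the next morning the ship broke in two and sank to the floor of the Atlantic, about 3,700 m down. Of the 2,224 crew and passengers aboard, more than 1,500 died, and only 333 bodies were recovered.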

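# 2 Data fields
- PassengerId    Passenger ID
- Survived       Survival (0 = died, 1 = survived)
- Pclass         Ticket class (1 = first, 2 = second, 3 = third)
- Name           Name
- Sex            Sex
- Age            Age
- SibSp          Number of siblings or spouses aboard
- Parch          Number of parents or children aboard
- Ticket         Ticket number
- Fare           Passenger fare
- Cabin          Cabin number
- Embarked       Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)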

# 3 Data processing
## 3.1 Load the data

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [2]:
titanic = pd.read_csv("./titanic_train.csv")
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [3]:
titanic.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


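## 3.2 Data processing
- Convert categorical features to dictionaries so they can later be one-hot encoded together
- Or find another way to convert every feature to a numeric form
- Prepare the feature values and the target values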
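
As an illustration of the one-hot option mentioned in the list above, here is a minimal sketch using `pandas.get_dummies`. The toy frame and its values are assumptions made for the example; the notebook itself goes on to use integer codes instead.

```python
import pandas as pd

# Toy frame standing in for two categorical Titanic columns (illustrative values).
df = pd.DataFrame({"Sex": ["male", "female", "female"],
                   "Embarked": ["S", "C", "Q"]})

# One-hot encode both columns: every category becomes its own 0/1 column.
encoded = pd.get_dummies(df, columns=["Sex", "Embarked"], dtype=int)
print(encoded.columns.tolist())
```

One-hot columns avoid implying an order between categories such as ports, at the cost of a wider feature matrix.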

### 3.3 Fill in missing ages

In [4]:
titanic.isnull().any()

PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked        True
dtype: bool

In [5]:
age_mean = titanic['Age'].mean()
age_mean

29.69911764705882

In [6]:
age_std = titanic['Age'].std()
age_std

14.526497332334044

In [7]:
age_null_number = titanic['Age'].isnull().sum()
age_null_number

177

In [8]:
# randint expects integer bounds, so truncate mean - std and mean + std explicitly
rand_age = np.random.randint(int(age_mean - age_std), int(age_mean + age_std), age_null_number)
rand_age

array([40, 24, 15, 19, 27, 43, 38, 42, 23, 19, 41, 35, 31, 42, 34, 16, 32,
       15, 40, 39, 39, 26, 41, 37, 36, 23, 16, 38, 39, 29, 17, 43, 36, 18,
       30, 26, 20, 42, 35, 35, 25, 36, 38, 27, 38, 35, 36, 19, 23, 32, 25,
       32, 42, 38, 41, 29, 20, 30, 26, 30, 30, 15, 24, 31, 30, 20, 22, 24,
       33, 19, 41, 31, 32, 25, 24, 33, 22, 15, 28, 26, 43, 40, 33, 25, 32,
       22, 39, 40, 37, 38, 42, 26, 43, 38, 27, 38, 27, 17, 33, 28, 20, 35,
       30, 40, 41, 22, 25, 43, 32, 34, 33, 31, 29, 38, 35, 21, 29, 31, 18,
       24, 39, 15, 29, 41, 42, 31, 23, 33, 24, 40, 38, 40, 43, 38, 27, 28,
       24, 41, 25, 26, 40, 39, 28, 34, 24, 42, 21, 20, 29, 21, 38, 36, 21,
       15, 36, 18, 39, 33, 35, 18, 33, 20, 32, 21, 36, 26, 24, 36, 36, 17,
       17, 43, 35, 41, 28, 27, 30])

In [9]:
# Assign through .loc with a boolean mask to avoid chained assignment
titanic.loc[titanic['Age'].isnull(), 'Age'] = rand_age


In [10]:
titanic['Age'].isnull().any()

False
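
The random draws in this section are unseeded, so the imputed ages change on every run. Below is a minimal sketch of a reproducible variant. It assumes a seeded NumPy `Generator` and a toy `ages` Series, neither of which is part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy ages with gaps; the real column has 177 missing values.
ages = pd.Series([22.0, np.nan, 38.0, 26.0, np.nan, 35.0], name="Age")

rng = np.random.default_rng(0)  # fixed seed (an assumption, chosen for reproducibility)
low = int(ages.mean() - ages.std())
high = int(ages.mean() + ages.std())

missing = ages.isnull()
# Draw one integer per missing entry from [low, high) and assign via the mask.
ages.loc[missing] = rng.integers(low, high, size=int(missing.sum()))
print(ages.tolist())
```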

### 3.4 Fill in the missing ports
- There are three ports; fill the two missing values with the most common one, S (Southampton)
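
The most common port can also be read off directly instead of counting each port in turn. Here is a minimal sketch on a toy Series; the values are made up for illustration.

```python
import pandas as pd

embarked = pd.Series(["S", "C", None, "S", "Q", "S", None], name="Embarked")

# mode() ignores missing values; take the first entry in case of ties.
most_common = embarked.mode()[0]
embarked = embarked.fillna(most_common)
print(most_common, embarked.isnull().sum())
```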

In [11]:
titanic[titanic['Embarked']=='S']['PassengerId'].count()

644

In [12]:
titanic[titanic['Embarked']=='C']['PassengerId'].count()

168

In [13]:
titanic[titanic['Embarked']=='Q']['PassengerId'].count()

77

In [14]:
titanic['Embarked'] = titanic['Embarked'].fillna('S')

In [15]:
titanic['Embarked'].isnull().any()

False

In [16]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       891 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


### 3.5 Drop the Cabin column because too many values are missing, and drop the Ticket column because its meaning is unclear

In [17]:
titanic = titanic.drop(['Cabin','Ticket'],axis=1)

In [18]:
titanic.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.863266 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 13.498881 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 21.0 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 29.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


### 3.6 Encode sex: 1 = female, 0 = male

In [19]:
titanic.loc[titanic['Sex']=='male','Sex'] = 0
titanic.loc[titanic['Sex']=='female','Sex'] = 1

In [20]:
titanic.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 0 | 22.0 | 1 | 0 | 7.25 | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 1 | 26.0 | 0 | 0 | 7.925 | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 35.0 | 1 | 0 | 53.1 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 0 | 35.0 | 0 | 0 | 8.05 | S |


### 3.7 Encode the three ports of embarkation as numbers

In [21]:
titanic['Embarked'].unique()

array(['S', 'C', 'Q'], dtype=object)

In [22]:
titanic.loc[titanic['Embarked']=='S','Embarked']=0
titanic.loc[titanic['Embarked']=='C','Embarked']=1
titanic.loc[titanic['Embarked']=='Q','Embarked']=2

In [220]:
titanic.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 0 | 22.0 | 1 | 0 | 7.25 | 0 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 38.0 | 1 | 0 | 71.2833 | 1 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 1 | 26.0 | 0 | 0 | 7.925 | 0 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 35.0 | 1 | 0 | 53.1 | 0 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 0 | 35.0 | 0 | 0 | 8.05 | 0 |


### 3.8 Bin ages into 5 categories

In [23]:
titanic.loc[ titanic['Age'] <= 16, 'Age'] = 0
titanic.loc[(titanic['Age'] > 16) & (titanic['Age'] <= 32), 'Age'] = 1
titanic.loc[(titanic['Age'] > 32) & (titanic['Age'] <= 48), 'Age'] = 2
titanic.loc[(titanic['Age'] > 48) & (titanic['Age'] <= 64), 'Age'] = 3
titanic.loc[ titanic['Age'] > 64, 'Age']=4

In [24]:
titanic.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 0 | 1.0 | 1 | 0 | 7.25 | 0 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 2.0 | 1 | 0 | 71.2833 | 1 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 1 | 1.0 | 0 | 0 | 7.925 | 0 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 2.0 | 1 | 0 | 53.1 | 0 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 0 | 2.0 | 0 | 0 | 8.05 | 0 |
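
The five age bands can also be expressed in one call with `pandas.cut`. This is a minimal sketch that assumes the same cut points (16, 32, 48, 64) and uses a toy `ages` Series, not the notebook's data.

```python
import pandas as pd

# Toy ages covering every band, including the boundary values.
ages = pd.Series([0.42, 16, 22, 32, 35, 48, 54, 64, 80])

# Right-closed intervals: (-inf, 16], (16, 32], (32, 48], (48, 64], (64, inf)
bins = [-float("inf"), 16, 32, 48, 64, float("inf")]
age_band = pd.cut(ages, bins=bins, labels=[0, 1, 2, 3, 4]).astype(int)
print(age_band.tolist())
```

Because the result goes into a new Series, the raw ages stay available and the band codes cannot be mistaken for real ages by a later step.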


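### 3.9 Flag whether a passenger boarded alone
- Despite its name, `isAlone` as computed here is 0 for passengers with no family aboard and 1 for everyone else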

In [25]:
titanic['isAlone'] = titanic['SibSp'] + titanic['Parch']

In [26]:
titanic.loc[titanic['isAlone'] == 0,'isAlone'] = 0

In [27]:
titanic.loc[titanic['isAlone'] != 0,'isAlone'] = 1

In [28]:
titanic.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked | isAlone |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 0 | 1.0 | 1 | 0 | 7.25 | 0 | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 2.0 | 1 | 0 | 71.2833 | 1 | 1 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 1 | 1.0 | 0 | 0 | 7.925 | 0 | 0 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 2.0 | 1 | 0 | 53.1 | 0 | 1 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 0 | 2.0 | 0 | 0 | 8.05 | 0 | 0 |
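
The flag can be computed in a single vectorized step. The sketch below uses a toy frame and two hypothetical column names, `has_family` and `travels_alone`, to show both polarities side by side. `has_family` matches the coding of the notebook's `isAlone` column (1 when `SibSp + Parch` is non-zero).

```python
import pandas as pd

# Toy values for the two family-count columns.
df = pd.DataFrame({"SibSp": [1, 0, 0, 3], "Parch": [0, 0, 2, 1]})

family_size = df["SibSp"] + df["Parch"]
df["has_family"] = (family_size > 0).astype(int)      # same coding as isAlone in this notebook
df["travels_alone"] = (family_size == 0).astype(int)  # the opposite polarity
print(df)
```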


### 3.10 Bin fares
- Fare cut points (minimum, quartiles, maximum): 0; 7.91; 14.45; 31; 512

In [29]:
titanic.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | isAlone |
|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 1.37037 | 0.523008 | 0.381594 | 32.204208 | 0.397306 |
| std | 257.353842 | 0.486592 | 0.836071 | 0.839862 | 1.102743 | 0.806057 | 49.693429 | 0.489615 |
| min | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 1.0 | 0.0 | 0.0 | 7.9104 | 0.0 |
| 50% | 446.0 | 0.0 | 3.0 | 1.0 | 0.0 | 0.0 | 14.4542 | 0.0 |
| 75% | 668.5 | 1.0 | 3.0 | 2.0 | 1.0 | 0.0 | 31.0 | 1.0 |
| max | 891.0 | 1.0 | 3.0 | 4.0 | 8.0 | 6.0 | 512.3292 | 1.0 |


In [30]:
titanic.loc[titanic['Fare']<=7.91,'Fare']=0
titanic.loc[(titanic['Fare']>7.91) & (titanic['Fare']<=14.45),'Fare']=1
titanic.loc[(titanic['Fare']>14.45) & (titanic['Fare']<=31),'Fare']=2
titanic.loc[(titanic['Fare']>31) ,'Fare']=3

In [31]:
titanic.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked | isAlone |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 0 | 1.0 | 1 | 0 | 0.0 | 0 | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 2.0 | 1 | 0 | 3.0 | 1 | 1 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 1 | 1.0 | 0 | 0 | 1.0 | 0 | 0 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 2.0 | 1 | 0 | 3.0 | 0 | 1 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 0 | 2.0 | 0 | 0 | 1.0 | 0 | 0 |
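
The fare cut points used here are the quartiles reported by `describe()`, so the same banding can be derived from the data rather than typed in by hand. A minimal sketch with `pandas.qcut` on a toy `fares` Series (the values are illustrative):

```python
import pandas as pd

# Eight toy fares; qcut puts roughly a quarter of them in each band.
fares = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05, 13.0, 26.0, 30.0])

# retbins=True also returns the bin edges that were computed from the data.
fare_band, edges = pd.qcut(fares, q=4, labels=[0, 1, 2, 3], retbins=True)
print(fare_band.astype(int).tolist())
print(edges)
```

Note that `qcut` recomputes the edges from whatever data it is given, so edges learned on the training data would need to be reused (for example with `pandas.cut`) when binning new passengers.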


### 3.11 Classify by the title in the name

In [32]:
# Raw string so that \. is passed to the regex engine as an escaped dot
titanic['Title'] = titanic.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)

In [33]:
titanic[['Title','Survived']].groupby('Title',as_index = False).mean()

| | Title | Survived |
|---|---|---|
| 0 | Capt | 0.0 |
| 1 | Col | 0.5 |
| 2 | Countess | 1.0 |
| 3 | Don | 0.0 |
| 4 | Dr | 0.428571 |
| 5 | Jonkheer | 0.0 |
| 6 | Lady | 1.0 |
| 7 | Major | 0.5 |
| 8 | Master | 0.575 |
| 9 | Miss | 0.697802 |


In [34]:
titanic['Title'] = titanic['Title'].replace('Mlle', 'Miss')
titanic['Title'] = titanic['Title'].replace('Ms', 'Miss')
titanic['Title'] = titanic['Title'].replace('Mme', 'Mrs')

In [35]:
titanic['Title'] = titanic['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'others')

In [36]:
titanic[['Title','Survived']].groupby('Title',as_index = False).mean()

| | Title | Survived |
|---|---|---|
| 0 | Master | 0.575 |
| 1 | Miss | 0.702703 |
| 2 | Mr | 0.156673 |
| 3 | Mrs | 0.793651 |
| 4 | others | 0.347826 |


In [37]:
titanic.loc[titanic['Title'] =='Master','Title']=0
titanic.loc[titanic['Title'] =='Miss','Title']=1
titanic.loc[titanic['Title'] =='Mr','Title']=2
titanic.loc[titanic['Title'] =='Mrs','Title']=3
titanic.loc[titanic['Title'] =='others','Title']=4

In [38]:
titanic.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked | isAlone | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 0 | 1.0 | 1 | 0 | 0.0 | 0 | 1 | 2 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 2.0 | 1 | 0 | 3.0 | 1 | 1 | 3 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 1 | 1.0 | 0 | 0 | 1.0 | 0 | 0 | 1 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 2.0 | 1 | 0 | 3.0 | 0 | 1 | 3 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 0 | 2.0 | 0 | 0 | 1.0 | 0 | 0 | 2 |
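
The extract, merge, and encode steps of this section can be condensed into a few lines. This is a minimal sketch on a toy `names` Series; the last two names are hypothetical and exist only to exercise the synonym merge and the catch-all code 4.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Example, Mlle. Anne",   # hypothetical name
    "Example, Dr. John",     # hypothetical name
])

# The title is the word that follows a space and ends with a period.
title = names.str.extract(r" ([A-Za-z]+)\.", expand=False)
# Merge the French and alternative spellings into the common titles.
title = title.replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"})
# Map the four common titles; anything unmapped falls into code 4 ("others").
codes = {"Master": 0, "Miss": 1, "Mr": 2, "Mrs": 3}
title_code = title.map(codes).fillna(4).astype(int)
print(title_code.tolist())
```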


## 4 Choose the features and the target

In [39]:
# Feature values
x = titanic[['Pclass','Sex','Age','Fare','Embarked','isAlone','Title']]
# Target values
y = titanic['Survived']

In [40]:
x.head()

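| | Pclass | Sex | Age | Fare | Embarked | isAlone | Title |
|---|---|---|---|---|---|---|---|
| 0 | 3 | 0 | 1.0 | 0.0 | 0 | 1 | 2 |
| 1 | 1 | 1 | 2.0 | 3.0 | 1 | 1 | 3 |
| 2 | 3 | 1 | 1.0 | 1.0 | 0 | 0 | 1 |
| 3 | 1 | 1 | 2.0 | 3.0 | 0 | 1 | 3 |
| 4 | 3 | 0 | 2.0 | 1.0 | 0 | 0 | 2 |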


In [241]:
y.head()

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

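## 5 Split the dataset
- Usually there are two datasets, one used for training and one for testing
- Here we split the existing titanic dataset into two parts
- Every feature is already numeric, so no further feature engineering (dictionary feature extraction or standardization) is applied here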

In [41]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state=22)

In [42]:
x_train.shape

(668, 7)

In [43]:
x_test.shape

(223, 7)

In [44]:
y_train.shape

(668,)
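
The 668/223 shapes follow from `train_test_split`'s default `test_size` of 0.25. The sketch below makes that explicit on placeholder arrays of the same size and adds `stratify`, an optional argument the notebook does not use, to keep the class ratio equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the notebook's dimensions: 891 rows, 7 features.
X = np.arange(891 * 7).reshape(891, 7)
y = np.arange(891) % 2  # placeholder binary labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=22, stratify=y
)
print(X_train.shape, X_test.shape)
```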

## 6 Estimators

### 6.1 Decision tree

In [47]:
from sklearn.tree import DecisionTreeClassifier
estimator = DecisionTreeClassifier(criterion="entropy", max_depth=8)
estimator.fit(x_train, y_train)
y_predict = estimator.predict(x_test)
score_DecisionTree = round(estimator.score(x_test, y_test)*100,2)
score_DecisionTree

82.060000000000002
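
The tree above uses a fixed `max_depth=8`. As a sketch of the "model selection and tuning" step, the depth could instead be chosen by cross-validated grid search. The example below runs on synthetic data from `make_classification`, and the candidate depths are assumptions, so its result says nothing about the Titanic data itself.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 7 features, like the notebook's feature matrix.
X, y = make_classification(n_samples=300, n_features=7, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8]},  # candidate depths (assumed)
    cv=5,                                    # 5-fold cross-validation
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```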

### 6.2 Random forest

In [48]:
from sklearn.ensemble import RandomForestClassifier
estimator = RandomForestClassifier(n_estimators=100)
estimator.fit(x_train, y_train)
y_pred = estimator.predict(x_test)
estimator.score(x_train, y_train)
score_RandomForest = round(estimator.score(x_test, y_test) * 100, 2)
score_RandomForest   

78.030000000000001

### 6.3 Logistic regression

In [53]:
#Logistic Regression
from sklearn.linear_model import LogisticRegression
estimator = LogisticRegression()
estimator.fit(x_train, y_train)
y_pred = estimator.predict(x_test)
score_LogisticRegression = round(estimator.score(x_test, y_test) * 100, 2)
score_LogisticRegression  

77.129999999999995

In [55]:
estimator.coef_

array([[-0.98266525,  2.56933044, -0.22932526,  0.05176923,  0.24219753,
         0.05043881, -0.37696767]])

In [56]:
estimator.intercept_

array([ 1.51775427])
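
To see how these numbers turn into a prediction, the sketch below applies the printed coefficients and intercept by hand to the second row of `x` (index 1: Pclass=1, Sex=1, Age=2, Fare=3, Embarked=1, isAlone=1, Title=3). It uses only the standard library; the values are copied from the output above.

```python
import math

feature_names = ["Pclass", "Sex", "Age", "Fare", "Embarked", "isAlone", "Title"]
coef = [-0.98266525, 2.56933044, -0.22932526, 0.05176923,
        0.24219753, 0.05043881, -0.37696767]
intercept = 1.51775427
passenger = [1, 1, 2, 3, 1, 1, 3]  # row with index 1 of x

# Linear score: intercept plus the weighted sum of the features.
z = intercept + sum(w * v for w, v in zip(coef, passenger))
# The logistic (sigmoid) function maps the score to a survival probability.
p_survive = 1.0 / (1.0 + math.exp(-z))
print(round(z, 3), round(p_survive, 3))
```

The large positive weight on `Sex` (1 = female) and the negative weight on `Pclass` dominate the score, so this first-class female passenger gets a probability of roughly 0.88.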

### 6.4 SVC

In [57]:
from sklearn.svm import SVC
estimator = SVC()
estimator.fit(x_train, y_train)
y_pred = estimator.predict(x_test)
score_SVC = round(estimator.score(x_test, y_test) * 100, 2)
score_SVC

78.030000000000001

### 6.5 K-nearest neighbors

In [58]:
from sklearn.neighbors import KNeighborsClassifier
#KNeighbors
estimator = KNeighborsClassifier(n_neighbors = 3)
estimator.fit(x_train, y_train)
y_pred = estimator.predict(x_test)
score_KNeighbors = round(estimator.score(x_test, y_test) * 100, 2)
score_KNeighbors

80.269999999999996

### 6.6 Naive Bayes

In [59]:
from sklearn.naive_bayes import GaussianNB
# Gaussian Naive Bayes
estimator = GaussianNB()
estimator.fit(x_train, y_train)
y_pred = estimator.predict(x_test)
score_naive_bayes = round(estimator.score(x_test, y_test) * 100, 2)
score_naive_bayes

75.340000000000003

### 6.7 Perceptron

In [60]:
from sklearn.linear_model import Perceptron
#Perceptron
estimator = Perceptron()
estimator.fit(x_train, y_train)
y_pred = estimator.predict(x_test)
score_Perceptron = round(estimator.score(x_test, y_test) * 100, 2)
score_Perceptron



64.569999999999993

### 6.8 Linear SVC

In [61]:
from sklearn.svm import LinearSVC
# Linear SVC
estimator = LinearSVC()
estimator.fit(x_train, y_train)
y_pred = estimator.predict(x_test)
score_LinearSVC = round(estimator.score(x_test, y_test) * 100, 2)
score_LinearSVC

76.680000000000007

### 6.9 Stochastic gradient descent

In [62]:
from sklearn.linear_model import SGDClassifier
# Stochastic Gradient Descent 
estimator = SGDClassifier()
estimator.fit(x_train, y_train)
y_pred = estimator.predict(x_test)
score_SGD = round(estimator.score(x_test, y_test) * 100, 2)
score_SGD



71.75

## 7 Summary

In [63]:
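models = pd.DataFrame({'Model':['Decision tree','Random forest','Logistic regression','SVC','K-nearest neighbors','Naive Bayes','Perceptron','Linear SVC','Stochastic gradient descent'],'Score':[score_DecisionTree,score_RandomForest,score_LogisticRegression,score_SVC,score_KNeighbors,score_naive_bayes,score_Perceptron,score_LinearSVC,score_SGD]})
models.sort_values(by='Score',ascending = False)

| | Model | Score |
|---|---|---|
| 0 | Decision tree | 82.06 |
| 4 | K-nearest neighbors | 80.27 |
| 1 | Random forest | 78.03 |
| 3 | SVC | 78.03 |
| 2 | Logistic regression | 77.13 |
| 7 | Linear SVC | 76.68 |
| 5 | Naive Bayes | 75.34 |
| 8 | Stochastic gradient descent | 71.75 |
| 6 | Perceptron | 64.57 |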
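
These scores come from a single 223-row test split, and several of the estimators are randomized, so differences of a few points may not be stable. Here is a sketch of a steadier comparison using 5-fold cross-validation. It runs on synthetic data and only three of the estimators, both chosen for brevity, so the printed numbers are not Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 7-feature matrix.
X, y = make_classification(n_samples=400, n_features=7, random_state=0)

candidates = {
    "Decision tree": DecisionTreeClassifier(criterion="entropy", max_depth=8, random_state=0),
    "K-nearest neighbors": KNeighborsClassifier(n_neighbors=3),
    "Logistic regression": LogisticRegression(max_iter=1000),
}

# Mean accuracy over 5 folds for each estimator.
cv_scores = {name: cross_val_score(model, X, y, cv=5).mean()
             for name, model in candidates.items()}

for name, score in sorted(cv_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score * 100:.2f}")
```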
