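# Titanic Passenger Survival Prediction: Comparing Estimators 2
## General Steps for Data Processing and Analysis
- Obtain the data
- Process the data
    - Feature values x
    - Target values y
- Feature engineering: standardization
- Fit and predict with an estimator
- Model selection and tuning
- Model evaluation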

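- [1 Data Source](#1-data-source)
- [2 Data Fields](#2-data-fields)
- [3 Data Processing](#3-data-processing)
- [4 Determining Features and Target](#4-determining-features-and-target)
- [5 Splitting the Dataset](#5-splitting-the-dataset)
- [6 Estimators](#6-estimators)
- [7 Summary](#7-summary)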


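# 1 Data Source

The Titanic was the largest and most luxuriously appointed passenger liner in the world at the time, and it was reputed to be "unsinkable". On its maiden voyage, however, disaster struck. The ship sailed from Southampton, England, called at Cherbourg-Octeville, France, and Cobh, Ireland (then known as Queenstown), and headed for New York. At about 23:40 on 14 April 1912, the Titanic struck an iceberg, which ruptured the hull from the starboard bow to amidships and flooded five watertight compartments. At about 02:20 the next morning, the hull broke in two and sank to the floor of the Atlantic, about 3,700 meters down. Of the 2,224 crew and passengers aboard, more than 1,500 died, and only 333 of the victims' bodies were recovered.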

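# 2 Data Fields
- PassengerId    Passenger ID
- Survived       Whether the passenger survived (0 = died, 1 = survived)
- Pclass         Ticket class (1 = first class, 2 = second class, 3 = third class)
- Name           Name
- Sex            Sex
- Age            Age
- SibSp          Number of the passenger's siblings or spouses aboard
- Parch          Number of the passenger's parents or children aboard
- Ticket         Ticket number
- Fare           Passenger fare
- Cabin          Cabin number
- Embarked       Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)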

# 3 Data Processing
## 3.1 Importing the Data

In [3]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [60]:
# Load the data and inspect it
titanic = pd.read_csv("./titanic_train.csv")
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [61]:
type(titanic)

pandas.core.frame.DataFrame

In [62]:
# Look at summary statistics of the data
titanic.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


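## 3.2 Data Processing
- Clean the data
- Drop columns with many missing values
- Fill in columns with missing data

### How to Handle Each Kind of Column
- Numeric columns can be used directly
- Time series (a series formed by repeated measurement over a long period) can be split into separate year, month, and day columns
- Categorical data (such as sex) is replaced with dummy variables, for example female = 1, male = 0; if there are more than two categories, use one-hot encoding
- Prepare the feature values and the target values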
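The column-handling rules above can be sketched on a toy table. This is only an illustration under stated assumptions: the column names `sex`, `port`, and `boarded` and all values are invented, and the Titanic file itself has no date column.

```python
import pandas as pd

# Toy data; the column names and values are invented for illustration.
df = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "port": ["S", "C", "Q"],
    "boarded": pd.to_datetime(["1912-04-10", "1912-04-10", "1912-04-11"]),
})

# Two categories: a single 0/1 dummy variable is enough.
df["sex_code"] = df["sex"].map({"male": 0, "female": 1})

# More than two categories: one-hot encode into one column per category.
port_onehot = pd.get_dummies(df["port"], prefix="port")

# Timestamps: split into separate year, month, and day columns.
df["year"] = df["boarded"].dt.year
df["month"] = df["boarded"].dt.month
df["day"] = df["boarded"].dt.day

print(pd.concat([df, port_onehot], axis=1))
```

Numeric columns need no conversion, which is why only the categorical and date columns are touched here.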
### 3.2.1 Drop Columns with Many Missing Values or Unclear Meaning

In [63]:
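# Handle the Cabin and Ticket columns
# Drop Cabin because too much of it is missing (only 204 of 891 values are present); drop Ticket because its meaning is unclear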
titanic = titanic.drop(['Cabin','Ticket'],axis=1)
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,8.05,S
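Deciding which columns have "too many" missing values can be made explicit by computing the missing fraction per column. The sketch below uses an invented toy frame, and the 0.5 cutoff is a hypothetical choice rather than a rule taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame; values are invented, only the pattern matters.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Fraction of missing values per column.
missing_ratio = df.isnull().mean()

# Hypothetical cutoff: drop columns that are more than half empty.
threshold = 0.5
to_drop = missing_ratio[missing_ratio > threshold].index.tolist()
cleaned = df.drop(columns=to_drop)

print(missing_ratio)
print(to_drop)
```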


### 3.2.2 Find and Fill Missing Data

In [64]:
titanic.isnull().any()

PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Fare           False
Embarked        True
dtype: bool

In [65]:
titanic.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Fare             0
Embarked         2
dtype: int64

In [66]:
# Fill in the missing ages
age_null_number = titanic['Age'].isnull().sum()
age_mean = titanic['Age'].mean()
age_std = titanic['Age'].std()
print('Passengers with unknown age: {}'.format(age_null_number))
print('Mean of the known ages: {:.2f} years'.format(age_mean))
print('Standard deviation of the known ages: {:.2f} years'.format(age_std))

Passengers with unknown age: 177
Mean of the known ages: 29.70 years
Standard deviation of the known ages: 14.53 years


In [67]:
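# Draw random integer ages between (mean - std) and (mean + std) to fill the missing values
# randint expects integer bounds, and the upper bound is exclusive
rand_age = np.random.randint(int(age_mean - age_std), int(age_mean + age_std), age_null_number)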
rand_age

array([32, 25, 28, 21, 42, 41, 39, 26, 25, 40, 40, 24, 40, 36, 20, 37, 43,
       26, 32, 28, 38, 22, 34, 32, 42, 22, 36, 32, 19, 36, 27, 42, 19, 36,
       39, 43, 27, 20, 36, 22, 18, 33, 22, 18, 43, 28, 42, 24, 32, 39, 23,
       40, 28, 28, 28, 33, 37, 38, 21, 28, 17, 33, 19, 32, 29, 34, 27, 19,
       15, 40, 15, 43, 31, 30, 27, 15, 16, 34, 29, 27, 41, 37, 39, 24, 33,
       16, 30, 33, 15, 16, 27, 21, 30, 40, 16, 17, 17, 16, 31, 37, 35, 24,
       20, 21, 16, 27, 29, 32, 23, 39, 29, 21, 15, 23, 15, 31, 15, 29, 36,
       37, 16, 40, 38, 41, 26, 28, 21, 32, 20, 35, 29, 19, 17, 31, 17, 34,
       24, 43, 18, 32, 19, 26, 16, 38, 41, 19, 28, 36, 18, 41, 33, 15, 32,
       23, 36, 20, 32, 31, 42, 32, 29, 15, 42, 27, 21, 31, 21, 26, 43, 25,
       31, 38, 37, 17, 40, 19, 18])

In [68]:
# Assign through .loc so the write reaches the DataFrame itself rather than a temporary copy
titanic.loc[titanic['Age'].isnull(), 'Age'] = rand_age


In [69]:
# Check again: the missing ages have been filled
titanic['Age'].isnull().any()

False
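Because the fill values are random, each run of the notebook produces different ages and therefore slightly different scores later on. A minimal sketch of a repeatable version, assuming a seeded NumPy generator is acceptable; the toy ages and the seed 42 are arbitrary choices, not values from this notebook.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan, 54.0], name="Age")

mean, std = ages.mean(), ages.std()
n_missing = int(ages.isnull().sum())

# A seeded generator makes the random fill repeatable; 42 is an arbitrary seed.
rng = np.random.default_rng(42)
low, high = int(mean - std), int(mean + std)
fill_values = rng.integers(low, high, size=n_missing)  # high is exclusive

filled = ages.copy()
filled.loc[filled.isnull()] = fill_values
print(filled)
```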

In [70]:
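# Fill in the missing ports of embarkation
## There are three ports; fill the two missing values with the mode, port S
S_number = titanic[titanic['Embarked']=='S']['PassengerId'].count()
C_number = titanic[titanic['Embarked']=='C']['PassengerId'].count()
Q_number = titanic[titanic['Embarked']=='Q']['PassengerId'].count()
print('Passengers who boarded at port S: {0}'.format(S_number))
print('Passengers who boarded at port C: {0}'.format(C_number))
print('Passengers who boarded at port Q: {0}'.format(Q_number))

Passengers who boarded at port S: 644
Passengers who boarded at port C: 168
Passengers who boarded at port Q: 77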


In [71]:
# Assign the result back instead of calling fillna(inplace=True) on a selected column
titanic['Embarked'] = titanic['Embarked'].fillna('S')

In [72]:
titanic['Embarked'].isnull().any()

False
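The per-port counting above can also be done without one filter per port. The following is a small sketch on invented values, using `value_counts()` and `mode()` to find the most common port before filling.

```python
import numpy as np
import pandas as pd

embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan], name="Embarked")

# value_counts() gives the per-port counts in one call (NaN is excluded by default).
counts = embarked.value_counts()

# mode() returns a Series because ties are possible; take the first entry.
most_common = embarked.mode()[0]

filled = embarked.fillna(most_common)
print(counts)
print(most_common)
```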

In [73]:
# Inspect the data: every missing value has been filled, so the later steps can proceed
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Embarked       891 non-null object
dtypes: float64(2), int64(5), object(3)
memory usage: 69.7+ KB


### 3.2.3 Categorize the Data and One-Hot Encode It

In [74]:
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.925,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,8.05,S


In [75]:
type(titanic)

pandas.core.frame.DataFrame

In [76]:
Survived = titanic.loc[:,'Survived']
Survived.head()

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

In [77]:
# One-hot encode Pclass
pclass_data = pd.DataFrame()
pclass_data = pd.get_dummies(titanic['Pclass'],prefix='Pclass')
pclass_data.head() 

Unnamed: 0,Pclass_1,Pclass_2,Pclass_3
0,0,0,1
1,1,0,0
2,0,0,1
3,1,0,0
4,0,0,1


In [78]:
# Append the one-hot columns to the main table
titanic = pd.concat((titanic,pclass_data),axis=1)
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,Pclass_1,Pclass_2,Pclass_3
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,S,0,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C,1,0,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.925,S,0,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1,S,1,0,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,8.05,S,0,0,1


In [79]:
# Encode sex: 1 = female, 0 = male
titanic.loc[titanic['Sex']=='male','Sex'] = 0
titanic.loc[titanic['Sex']=='female','Sex'] = 1
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,Pclass_1,Pclass_2,Pclass_3
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,7.25,S,0,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,71.2833,C,1,0,0
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,7.925,S,0,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,53.1,S,1,0,0
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,8.05,S,0,0,1


In [80]:
# Bin the ages into five categories (one-hot encoded in the next cell)
titanic.loc[ titanic['Age'] <= 16, 'Age'] = 0
titanic.loc[(titanic['Age'] > 16) & (titanic['Age'] <= 32), 'Age'] = 1
titanic.loc[(titanic['Age'] > 32) & (titanic['Age'] <= 48), 'Age'] = 2
titanic.loc[(titanic['Age'] > 48) & (titanic['Age'] <= 64), 'Age'] = 3
titanic.loc[ titanic['Age'] > 64, 'Age']=4

In [81]:
# One-hot encode the age categories
age_data = pd.DataFrame()
age_data = pd.get_dummies(titanic['Age'],prefix='Age')
age_data.head()

Unnamed: 0,Age_0.0,Age_1.0,Age_2.0,Age_3.0,Age_4.0
0,0,1,0,0,0
1,0,0,1,0,0
2,0,1,0,0,0
3,0,0,1,0,0
4,0,0,1,0,0


In [82]:
# Append the one-hot columns to the main table
titanic = pd.concat((titanic,age_data),axis=1)
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,Pclass_1,Pclass_2,Pclass_3,Age_0.0,Age_1.0,Age_2.0,Age_3.0,Age_4.0
0,1,0,3,"Braund, Mr. Owen Harris",0,1.0,1,0,7.25,S,0,0,1,0,1,0,0,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,2.0,1,0,71.2833,C,1,0,0,0,0,1,0,0
2,3,1,3,"Heikkinen, Miss. Laina",1,1.0,0,0,7.925,S,0,0,1,0,1,0,0,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,2.0,1,0,53.1,S,1,0,0,0,0,1,0,0
4,5,0,3,"Allen, Mr. William Henry",0,2.0,0,0,8.05,S,0,0,1,0,0,1,0,0
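The chain of `.loc` assignments used for the age bins overwrites `Age` in place, so each later condition is evaluated on a partly recoded column. A possible alternative, shown here as a sketch on invented ages, is `pd.cut`, which bins the original values in one step with the same edges.

```python
import numpy as np
import pandas as pd

ages = pd.Series([0.42, 16.0, 22.0, 32.0, 38.0, 54.0, 80.0])

# Same edges as the .loc chain; intervals are right-closed, e.g. (16, 32].
bins = [-np.inf, 16, 32, 48, 64, np.inf]
age_group = pd.cut(ages, bins=bins, labels=[0, 1, 2, 3, 4]).astype(int)

# One column per age group, ready to concatenate onto the main table.
age_onehot = pd.get_dummies(age_group, prefix="Age")
print(age_group.tolist())
```

With integer labels the dummy columns come out as `Age_0` to `Age_4` rather than `Age_0.0` to `Age_4.0`.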


In [83]:
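# Flag whether the passenger boarded alone (already binary, so no one-hot encoding is needed)
# Note: despite its name, isAlone is 1 when the passenger has family aboard (SibSp + Parch > 0) and 0 when travelling alone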
titanic['isAlone'] = titanic['SibSp'] + titanic['Parch']
titanic.loc[titanic['isAlone'] == 0,'isAlone'] = 0
titanic.loc[titanic['isAlone'] != 0,'isAlone'] = 1
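The two assignments above leave `isAlone` equal to 1 for passengers who have relatives aboard. A short sketch on invented rows shows that encoding next to the literal "alone" flag; the names `has_family` and `is_alone` are hypothetical and do not appear in the notebook.

```python
import pandas as pd

df = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 1]})

family_size = df["SibSp"] + df["Parch"]

# Same values as the notebook's isAlone column: 1 when relatives are aboard.
df["has_family"] = (family_size > 0).astype(int)

# The literal meaning of "alone": 1 when no relatives are aboard.
df["is_alone"] = (family_size == 0).astype(int)
print(df)
```

Either column carries the same information for a model, since one is simply the complement of the other.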

In [39]:
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,...,Age_1.0,Age_2.0,Age_3.0,Age_4.0,Age_0.0,Age_1.0.1,Age_2.0.1,Age_3.0.1,Age_4.0.1,isAlone
0,1,0,3,"Braund, Mr. Owen Harris",0,1.0,1,0,7.25,S,...,1,0,0,0,0,1,0,0,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,2.0,1,0,71.2833,C,...,0,1,0,0,0,0,1,0,0,1
2,3,1,3,"Heikkinen, Miss. Laina",1,1.0,0,0,7.925,S,...,1,0,0,0,0,1,0,0,0,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,2.0,1,0,53.1,S,...,0,1,0,0,0,0,1,0,0,1
4,5,0,3,"Allen, Mr. William Henry",0,2.0,0,0,8.05,S,...,0,1,0,0,0,0,1,0,0,0


In [84]:
# Bin the fares into four categories (one-hot encoded below)
titanic.loc[titanic['Fare']<=7.91,'Fare']=0
titanic.loc[(titanic['Fare']>7.91) & (titanic['Fare']<=14.45),'Fare']=1
titanic.loc[(titanic['Fare']>14.45) & (titanic['Fare']<=31),'Fare']=2
titanic.loc[(titanic['Fare']>31) ,'Fare']=3

In [41]:
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,...,Age_1.0,Age_2.0,Age_3.0,Age_4.0,Age_0.0,Age_1.0.1,Age_2.0.1,Age_3.0.1,Age_4.0.1,isAlone
0,1,0,3,"Braund, Mr. Owen Harris",0,1.0,1,0,0.0,S,...,1,0,0,0,0,1,0,0,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,2.0,1,0,3.0,C,...,0,1,0,0,0,0,1,0,0,1
2,3,1,3,"Heikkinen, Miss. Laina",1,1.0,0,0,1.0,S,...,1,0,0,0,0,1,0,0,0,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,2.0,1,0,3.0,S,...,0,1,0,0,0,0,1,0,0,1
4,5,0,3,"Allen, Mr. William Henry",0,2.0,0,0,1.0,S,...,0,1,0,0,0,0,1,0,0,0
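The fare cut points 7.91, 14.45, and 31 match the 25%, 50%, and 75% rows of the `describe()` output for `Fare`, so the bins are quartiles. A sketch of deriving such bins directly with `pd.qcut`, on invented fares:

```python
import pandas as pd

fares = pd.Series([5.0, 7.5, 8.0, 10.0, 15.0, 26.0, 40.0, 80.0])

# q=4 splits at the 25%, 50%, and 75% quantiles of the data itself.
fare_group, edges = pd.qcut(fares, q=4, labels=[0, 1, 2, 3], retbins=True)
fare_group = fare_group.astype(int)

print(fare_group.tolist())
print(edges)
```

`retbins=True` returns the computed edges, which is useful if the same thresholds must later be applied to a separate test file.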


In [85]:
# One-hot encode the binned fares
price_data = pd.DataFrame()
price_data = pd.get_dummies(titanic['Fare'],prefix='Fare')
price_data.head()

Unnamed: 0,Fare_0.0,Fare_1.0,Fare_2.0,Fare_3.0
0,1,0,0,0
1,0,0,0,1
2,0,1,0,0
3,0,0,0,1
4,0,1,0,0


In [86]:
# Append the one-hot columns to the main table
titanic = pd.concat((titanic,price_data),axis=1)
titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,...,Age_0.0,Age_1.0,Age_2.0,Age_3.0,Age_4.0,isAlone,Fare_0.0,Fare_1.0,Fare_2.0,Fare_3.0
0,1,0,3,"Braund, Mr. Owen Harris",0,1.0,1,0,0.0,S,...,0,1,0,0,0,1,1,0,0,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,2.0,1,0,3.0,C,...,0,0,1,0,0,1,0,0,0,1
2,3,1,3,"Heikkinen, Miss. Laina",1,1.0,0,0,1.0,S,...,0,1,0,0,0,0,0,1,0,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,2.0,1,0,3.0,S,...,0,0,1,0,0,1,0,0,0,1
4,5,0,3,"Allen, Mr. William Henry",0,2.0,0,0,1.0,S,...,0,0,1,0,0,0,0,1,0,0
5,6,0,3,"Moran, Mr. James",0,1.0,0,0,1.0,Q,...,0,1,0,0,0,0,0,1,0,0
6,7,0,1,"McCarthy, Mr. Timothy J",0,3.0,0,0,3.0,S,...,0,0,0,1,0,0,0,0,0,1
7,8,0,3,"Palsson, Master. Gosta Leonard",0,0.0,3,1,2.0,S,...,1,0,0,0,0,1,0,0,1,0
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",1,1.0,0,2,1.0,S,...,0,1,0,0,0,1,0,1,0,0
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",1,0.0,1,0,2.0,C,...,1,0,0,0,0,1,0,0,1,0


In [87]:
# One-hot encode the port of embarkation
titanic_Embarked = pd.DataFrame()
titanic_Embarked = pd.get_dummies(titanic['Embarked'],prefix='Embarked')
titanic_Embarked.head()

Unnamed: 0,Embarked_C,Embarked_Q,Embarked_S
0,0,0,1
1,1,0,0
2,0,0,1
3,0,0,1
4,0,0,1


In [88]:
# Append the one-hot columns to the main table
titanic = pd.concat((titanic,titanic_Embarked),axis=1)
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,...,Age_3.0,Age_4.0,isAlone,Fare_0.0,Fare_1.0,Fare_2.0,Fare_3.0,Embarked_C,Embarked_Q,Embarked_S
0,1,0,3,"Braund, Mr. Owen Harris",0,1.0,1,0,0.0,S,...,0,0,1,1,0,0,0,0,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,2.0,1,0,3.0,C,...,0,0,1,0,0,0,1,1,0,0
2,3,1,3,"Heikkinen, Miss. Laina",1,1.0,0,0,1.0,S,...,0,0,0,0,1,0,0,0,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,2.0,1,0,3.0,S,...,0,0,1,0,0,0,1,0,0,1
4,5,0,3,"Allen, Mr. William Henry",0,2.0,0,0,1.0,S,...,0,0,0,0,1,0,0,0,0,1


In [89]:
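# Extract the title from each name and group by it (one-hot encoding follows below)
# A raw string keeps the regex escape \. from being treated as a string escape
titanic['Title'] = titanic.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)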
titanic[['Title','Survived']].groupby('Title',as_index = False).mean()

Unnamed: 0,Title,Survived
0,Capt,0.0
1,Col,0.5
2,Countess,1.0
3,Don,0.0
4,Dr,0.428571
5,Jonkheer,0.0
6,Lady,1.0
7,Major,0.5
8,Master,0.575
9,Miss,0.697802


In [90]:
titanic['Title'] = titanic['Title'].replace('Mlle', 'Miss')
titanic['Title'] = titanic['Title'].replace('Ms', 'Miss')
titanic['Title'] = titanic['Title'].replace('Mme', 'Mrs')

In [91]:
titanic['Title'] = titanic['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'others')

In [92]:
titanic[['Title','Survived']].groupby('Title',as_index = False).mean()

Unnamed: 0,Title,Survived
0,Master,0.575
1,Miss,0.702703
2,Mr,0.156673
3,Mrs,0.793651
4,others,0.347826
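The title extraction relies on every name containing a word followed by a dot, such as "Mr." or "Miss.". A small sketch of the same regex on four names taken from the table output, followed by a dictionary-based replacement; the mapping shown is a shortened, illustrative subset of the replacements in the notebook.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Nasser, Mrs. Nicholas (Adele Achem)",
])

# A space, then letters, then a literal dot: captures the first word that ends in ".".
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

# Map variant or rare titles onto the main groups in one pass (illustrative subset).
title_map = {"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs", "Dr": "others", "Rev": "others"}
titles = titles.replace(title_map)
print(titles.tolist())
```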


In [93]:
titanic.loc[titanic['Title'] =='Master','Title']=0
titanic.loc[titanic['Title'] =='Miss','Title']=1
titanic.loc[titanic['Title'] =='Mr','Title']=2
titanic.loc[titanic['Title'] =='Mrs','Title']=3
titanic.loc[titanic['Title'] =='others','Title']=4

In [94]:
# One-hot encode Title
title_data = pd.DataFrame()
title_data = pd.get_dummies(titanic['Title'],prefix='Title')
title_data.head()

Unnamed: 0,Title_0,Title_1,Title_2,Title_3,Title_4
0,0,0,1,0,0
1,0,0,0,1,0
2,0,1,0,0,0
3,0,0,0,1,0
4,0,0,1,0,0


In [95]:
# Append the one-hot columns to the main table
titanic = pd.concat((titanic,title_data),axis=1)
titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,...,Fare_3.0,Embarked_C,Embarked_Q,Embarked_S,Title,Title_0,Title_1,Title_2,Title_3,Title_4
0,1,0,3,"Braund, Mr. Owen Harris",0,1.0,1,0,0.0,S,...,0,0,0,1,2,0,0,1,0,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,2.0,1,0,3.0,C,...,1,1,0,0,3,0,0,0,1,0
2,3,1,3,"Heikkinen, Miss. Laina",1,1.0,0,0,1.0,S,...,0,0,0,1,1,0,1,0,0,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,2.0,1,0,3.0,S,...,1,0,0,1,3,0,0,0,1,0
4,5,0,3,"Allen, Mr. William Henry",0,2.0,0,0,1.0,S,...,0,0,0,1,2,0,0,1,0,0
5,6,0,3,"Moran, Mr. James",0,1.0,0,0,1.0,Q,...,0,0,1,0,2,0,0,1,0,0
6,7,0,1,"McCarthy, Mr. Timothy J",0,3.0,0,0,3.0,S,...,1,0,0,1,2,0,0,1,0,0
7,8,0,3,"Palsson, Master. Gosta Leonard",0,0.0,3,1,2.0,S,...,0,0,0,1,0,1,0,0,0,0
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",1,1.0,0,2,1.0,S,...,0,0,0,1,3,0,0,0,1,0
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",1,0.0,1,0,2.0,C,...,0,1,0,0,3,0,0,0,1,0


In [96]:
# Drop the original columns that have been replaced by their encoded versions
titanic = titanic.drop(['Pclass','Name','Age','SibSp','Parch','Fare','Embarked','Title'],axis=1)
titanic.head()

Unnamed: 0,PassengerId,Survived,Sex,Pclass_1,Pclass_2,Pclass_3,Age_0.0,Age_1.0,Age_2.0,Age_3.0,...,Fare_2.0,Fare_3.0,Embarked_C,Embarked_Q,Embarked_S,Title_0,Title_1,Title_2,Title_3,Title_4
0,1,0,0,0,0,1,0,1,0,0,...,0,0,0,0,1,0,0,1,0,0
1,2,1,1,1,0,0,0,0,1,0,...,0,1,1,0,0,0,0,0,1,0
2,3,1,1,0,0,1,0,1,0,0,...,0,0,0,0,1,0,1,0,0,0
3,4,1,1,1,0,0,0,0,1,0,...,0,1,0,0,1,0,0,0,1,0
4,5,0,0,0,0,1,0,0,1,0,...,0,0,0,0,1,0,0,1,0,0


In [99]:
titanic.columns.values

array(['PassengerId', 'Survived', 'Sex', 'Pclass_1', 'Pclass_2',
       'Pclass_3', 'Age_0.0', 'Age_1.0', 'Age_2.0', 'Age_3.0', 'Age_4.0',
       'isAlone', 'Fare_0.0', 'Fare_1.0', 'Fare_2.0', 'Fare_3.0',
       'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Title_0', 'Title_1',
       'Title_2', 'Title_3', 'Title_4'], dtype=object)
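The encode, concatenate, and drop steps used throughout this section can be collapsed into one call when several columns need the same treatment. This sketch uses a three-row toy frame and the `columns` parameter of `pd.get_dummies`; it is an alternative formulation, not the approach the notebook takes.

```python
import pandas as pd

df = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Embarked": ["S", "C", "S"],
})

# columns=... encodes the listed columns and drops the originals in one step.
encoded = pd.get_dummies(df, columns=["Pclass", "Embarked"])
print(encoded.columns.tolist())
```

Only categories that actually occur in the data get a column, so `Pclass_2` and `Embarked_Q` are absent from this toy result.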

## 4 Determining Features and Target

In [100]:
# Feature values
x = titanic[['PassengerId', 'Sex', 'Pclass_1', 'Pclass_2',
       'Pclass_3', 'Age_0.0', 'Age_1.0', 'Age_2.0', 'Age_3.0', 'Age_4.0',
       'isAlone', 'Fare_0.0', 'Fare_1.0', 'Fare_2.0', 'Fare_3.0',
       'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Title_0', 'Title_1',
       'Title_2', 'Title_3', 'Title_4']]
# Target value
y = titanic['Survived']

In [101]:
x.head()

Unnamed: 0,PassengerId,Sex,Pclass_1,Pclass_2,Pclass_3,Age_0.0,Age_1.0,Age_2.0,Age_3.0,Age_4.0,...,Fare_2.0,Fare_3.0,Embarked_C,Embarked_Q,Embarked_S,Title_0,Title_1,Title_2,Title_3,Title_4
0,1,0,0,0,1,0,1,0,0,0,...,0,0,0,0,1,0,0,1,0,0
1,2,1,1,0,0,0,0,1,0,0,...,0,1,1,0,0,0,0,0,1,0
2,3,1,0,0,1,0,1,0,0,0,...,0,0,0,0,1,0,1,0,0,0
3,4,1,1,0,0,0,0,1,0,0,...,0,1,0,0,1,0,0,0,1,0
4,5,0,0,0,1,0,0,1,0,0,...,0,0,0,0,1,0,0,1,0,0


In [102]:
y.head()

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64
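Listing every feature column by hand is easy to get out of sync with the table. A sketch of selecting features by exclusion instead, on an invented four-column frame. Leaving out `PassengerId` here is a modelling assumption (a row identifier is not expected to carry predictive signal); the notebook itself keeps it as a feature.

```python
import pandas as pd

df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Sex": [0, 1, 1],
    "Pclass_1": [0, 1, 0],
})

target = "Survived"
y = df[target]

# Everything except the target and the row identifier becomes a feature.
X = df.drop(columns=[target, "PassengerId"])
print(X.columns.tolist())
```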

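## 5 Splitting the Dataset
- In general two datasets are needed: one for training and one for testing
- Here the existing `titanic` dataset is split into those two parts
- The categorical columns are already one-hot encoded, so no further feature engineering (dictionary feature extraction or standardization) is applied here. Note that `PassengerId`, which ranges from 1 to 891, stays among the features unscaled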

In [103]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state=22)

In [104]:
x_train.shape

(668, 23)

In [105]:
x_test.shape

(223, 23)

In [106]:
y_train.shape

(668,)
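The 668 / 223 split follows from the default `test_size` of 0.25 applied to 891 rows. The sketch below reproduces those shapes on synthetic data and adds `stratify=y`, an optional variation not used in the notebook, which keeps the class ratio similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the Titanic training file.
X = np.arange(891 * 2).reshape(891, 2)
y = np.array([0, 1, 1] * 297)  # imbalanced labels: one third 0, two thirds 1

# test_size=0.25 is also the default; stratify=y preserves the label proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=22
)
print(X_train.shape, X_test.shape)
```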

## 6 Estimators
### 6.1 Decision Tree

In [107]:
from sklearn.tree import DecisionTreeClassifier
estimator = DecisionTreeClassifier(criterion="entropy", max_depth=8)
estimator.fit(x_train, y_train)
y_predict = estimator.predict(x_test)
score_DecisionTree = round(estimator.score(x_test, y_test)*100,2)
score_DecisionTree

76.230000000000004
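The depth of 8 is fixed by hand here, while the step list at the top mentions model selection and tuning. One common way to do that in scikit-learn is a cross-validated grid search. The sketch below runs on synthetic data, and the candidate depths are arbitrary; it does not reproduce or replace the score reported above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for x_train / y_train.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_grid = {"max_depth": [3, 5, 8]}  # candidate depths, chosen arbitrarily
search = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    param_grid,
    cv=5,  # 5-fold cross-validation on the training data only
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is refit on all the supplied data and can be scored on a held-out test set.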

### 6.2 Random Forest

In [109]:
from sklearn.ensemble import RandomForestClassifier
estimator = RandomForestClassifier(n_estimators=100)
estimator.fit(x_train, y_train)
y_pred = estimator.predict(x_test)
estimator.score(x_train, y_train)
score_RandomForest = round(estimator.score(x_test, y_test) * 100, 2)
score_RandomForest   

78.030000000000001

### 6.3 Logistic Regression

In [111]:
#Logistic Regression
from sklearn.linear_model import LogisticRegression
estimator = LogisticRegression()
estimator.fit(x_train, y_train)
y_pred = estimator.predict(x_test)
score_LogisticRegression = round(estimator.score(x_test, y_test) * 100, 2)
score_LogisticRegression 

78.480000000000004

In [112]:
estimator.coef_

array([[ -2.19095488e-07,   1.55317286e+00,   1.08969494e+00,
          6.46836386e-02,  -1.22250869e+00,   3.39917695e-01,
          2.39812641e-01,   4.12878285e-02,  -1.24026494e-02,
         -6.76745631e-01,  -4.26295981e-01,  -1.30725017e-01,
          3.97273297e-01,  -3.90324512e-03,  -3.30775151e-01,
          3.65187518e-01,   1.64979513e-02,  -4.49815585e-01,
          1.70546988e+00,   2.75855333e-01,  -1.22599022e+00,
          6.93542874e-01,  -1.51700799e+00]])

In [113]:
estimator.intercept_

array([-0.06813012])
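The raw `coef_` array is easier to read when paired with the feature names, since the coefficients follow the column order of the training data (in the output above, the first value belongs to `PassengerId` and the second to `Sex`). A sketch on synthetic data with hypothetical feature names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical names
X = pd.DataFrame(X, columns=feature_names)

model = LogisticRegression().fit(X, y)

# coef_ has shape (1, n_features) for a binary problem; row 0 lines up with the columns.
coef = pd.Series(model.coef_[0], index=X.columns)

# Order by absolute size so the most influential features come first.
coef = coef.reindex(coef.abs().sort_values(ascending=False).index)
print(coef)
```

Coefficient sizes are only comparable across features when the features share a similar scale.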

### 6.4 SVC

In [115]:
from sklearn.svm import SVC
estimator = SVC()
estimator.fit(x_train, y_train)
y_pred = estimator.predict(x_test)
score_SVC = round(estimator.score(x_test, y_test) * 100, 2)
score_SVC

59.189999999999998
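One possible reason for the low score, not verified in this notebook, is that the unscaled `PassengerId` column (1 to 891) dominates the distances used by the RBF kernel, while every other feature is 0 or 1. A sketch of how standardization can be attached to the estimator with a pipeline, using synthetic data plus an artificial ID-like column:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=6, random_state=0)

# Prepend an ID-like column on a much larger scale, mimicking PassengerId.
ids = np.arange(1, len(X) + 1).reshape(-1, 1)
X = np.hstack([ids, X])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=22)

# Baseline: SVC on the raw features.
raw_score = SVC().fit(X_train, y_train).score(X_test, y_test)

# The scaler is fit on the training part only, then applied to the test part.
scaled_model = make_pipeline(StandardScaler(), SVC())
scaled_score = scaled_model.fit(X_train, y_train).score(X_test, y_test)
print(raw_score, scaled_score)
```

The same pipeline pattern applies to the other scale-sensitive estimators in this section, such as K-nearest neighbors and stochastic gradient descent.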

### 6.5 K-Nearest Neighbors

In [116]:
from sklearn.neighbors import KNeighborsClassifier
#KNeighbors
estimator = KNeighborsClassifier(n_neighbors = 3)
estimator.fit(x_train, y_train)
y_pred = estimator.predict(x_test)
score_KNeighbors = round(estimator.score(x_test, y_test) * 100, 2)
score_KNeighbors

58.299999999999997

### 6.6 Naive Bayes

In [117]:
from sklearn.naive_bayes import GaussianNB
# Gaussian Naive Bayes
estimator = GaussianNB()
estimator.fit(x_train, y_train)
y_pred = estimator.predict(x_test)
score_naive_bayes = round(estimator.score(x_test, y_test) * 100, 2)
score_naive_bayes

71.299999999999997

### 6.7 Perceptron

In [121]:
from sklearn.linear_model import Perceptron
#Perceptron
estimator = Perceptron()
estimator.fit(x_train, y_train)
y_pred = estimator.predict(x_test)
score_Perceptron = round(estimator.score(x_test, y_test) * 100, 2)
score_Perceptron



60.090000000000003

### 6.8 Linear SVC

In [119]:
from sklearn.svm import LinearSVC
# Linear SVC
estimator = LinearSVC()
estimator.fit(x_train, y_train)
y_pred = estimator.predict(x_test)
score_LinearSVC = round(estimator.score(x_test, y_test) * 100, 2)
score_LinearSVC

69.060000000000002

### 6.9 Stochastic Gradient Descent

In [120]:
from sklearn.linear_model import SGDClassifier
# Stochastic Gradient Descent 
estimator = SGDClassifier()
estimator.fit(x_train, y_train)
y_pred = estimator.predict(x_test)
score_SGD = round(estimator.score(x_test, y_test) * 100, 2)
score_SGD



40.359999999999999

## 7 Summary

In [123]:
models = pd.DataFrame({
    'Model': ['Decision Tree', 'Random Forest', 'Logistic Regression', 'SVC',
              'K-Nearest Neighbors', 'Naive Bayes', 'Perceptron', 'Linear SVC',
              'Stochastic Gradient Descent'],
    'Score': [score_DecisionTree, score_RandomForest, score_LogisticRegression, score_SVC,
              score_KNeighbors, score_naive_bayes, score_Perceptron, score_LinearSVC,
              score_SGD]})
models.sort_values(by='Score',ascending = False)

Unnamed: 0,Model,Score
2,Logistic Regression,78.48
1,Random Forest,78.03
0,Decision Tree,76.23
5,Naive Bayes,71.3
7,Linear SVC,69.06
6,Perceptron,60.09
3,SVC,59.19
4,K-Nearest Neighbors,58.3
8,Stochastic Gradient Descent,40.36
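Each score in the table comes from a single train/test split, so small differences between models may not be stable. A sketch of a comparison based on 5-fold cross-validation, run on synthetic data with three of the estimators; the numbers it prints are unrelated to the table above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for the prepared Titanic features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "Decision Tree": DecisionTreeClassifier(max_depth=8, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

rows = []
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=5)  # accuracy on each of 5 folds
    rows.append({"Model": name, "Mean": scores.mean(), "Std": scores.std()})

summary = pd.DataFrame(rows).sort_values("Mean", ascending=False)
print(summary)
```

Reporting the standard deviation next to the mean gives a rough sense of whether two models actually differ.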
