![](./lesson11/1.jpg)

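### As the figures below show, predicting passenger survival with a multilayer perceptron (MLP) model consists of two parts: training and prediction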

![](./lesson11/2.jpg)

![](./lesson11/3.png)

***

## 11.1 Downloading the Titanic passenger dataset

In [1]:
# 1. Import the required modules
import urllib.request
import os

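### Notes
**`import urllib.request` loads the urllib request module, used to download the file**  
**`import os` loads the os module, used to check whether the file already exists**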

In [3]:
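url="http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls"
filepath = "data/titanic3.xls"
# Create the data/ directory if it is missing; urlretrieve does not create it
os.makedirs(os.path.dirname(filepath), exist_ok=True)
# Download only when the file is not already present
if not os.path.isfile(filepath):
    result = urllib.request.urlretrieve(url, filepath)
    print('download:', result)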

download: ('data/titanic3.xls', <http.client.HTTPMessage object at 0x00000293ABC0F048>)


![](./lesson11/4.jpg)

***

## 11.2 Reading and preprocessing the data with a Pandas DataFrame

In [5]:
import pandas as pd
import numpy as np

In [7]:
# Read the titanic3.xls file (pandas needs the xlrd package to read .xls files)
all_df = pd.read_excel(filepath)

In [8]:
# View the first 5 rows
all_df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


![](./lesson11/5.jpg)

![](./lesson11/6.jpg)

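### Of the fields above, survived (whether the passenger survived) is the label field, i.e. the target we want to predict; the remaining fields are features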

In [12]:
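# We consider ticket (ticket number) and cabin (cabin number) to have little bearing on survived (the prediction target), so we drop them; cols also leaves out boat, body, and home.dest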
cols = ['survived', 'pclass', 'name', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']
all_df = all_df[cols]

In [13]:
all_df.head()

Unnamed: 0,survived,pclass,name,sex,age,sibsp,parch,fare,embarked
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,211.3375,S
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,151.55,S
2,0,1,"Allison, Miss. Helen Loraine",female,2.0,1,2,151.55,S
3,0,1,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,151.55,S
4,0,1,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,151.55,S


### The data above still has some problems and must be preprocessed before it can be used for machine learning training
![](./lesson11/7.jpg)

## 11.3 Preprocessing the data with a Pandas DataFrame

In [14]:
# drop() uses axis=0 by default, which removes rows; set axis=1 to remove columns
# The code below removes the name field
df = all_df.drop(['name'], axis=1)

In [15]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked
0,1,1,female,29.0,0,0,211.3375,S
1,1,1,male,0.9167,1,2,151.55,S
2,0,1,female,2.0,1,2,151.55,S
3,0,1,male,30.0,1,2,151.55,S
4,0,1,female,25.0,1,2,151.55,S


In [20]:
# Count the null values in each field
all_df.isnull().sum()

survived      0
pclass        0
name          0
sex           0
age         263
sibsp         0
parch         0
fare          1
embarked      2
dtype: int64

![](./lesson11/8.jpg)

In [21]:
# Replace null values in the age field with the mean age
age_mean = all_df['age'].mean()

In [23]:
df['age'] = df['age'].fillna(age_mean)

In [25]:
# Replace null values in the fare field with the mean fare
fare_mean = df['fare'].mean()
print(fare_mean)
df['fare'] = df['fare'].fillna(fare_mean)

33.29547928134557
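
The mean-filling step above can be tried on a tiny table first. The following is a minimal sketch on made-up values (the `toy` DataFrame and its numbers are hypothetical and not taken from titanic3.xls); it relies on `Series.mean()` skipping null values by default.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column (hypothetical values)
toy = pd.DataFrame({"age": [20.0, np.nan, 40.0], "fare": [10.0, 30.0, np.nan]})

# Series.mean() skips NaN by default, so only the known values are averaged
age_mean = toy["age"].mean()    # (20 + 40) / 2
fare_mean = toy["fare"].mean()  # (10 + 30) / 2

# Replace each missing value with the mean of its own column
toy["age"] = toy["age"].fillna(age_mean)
toy["fare"] = toy["fare"].fillna(fare_mean)

# Count the missing values that remain in the whole table
missing_after = int(toy.isnull().sum().sum())
print(toy)
print("missing values left:", missing_after)
```

Checking `isnull().sum()` again after filling is a cheap way to confirm that no null values are left in the columns that were filled.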


In [27]:
# The sex field is text and must be converted to 0 and 1 before it can be used for machine learning training
# Use map() to convert 'female' to 0 and 'male' to 1
df['sex'] = df['sex'].map({'female':0, 'male':1}).astype(int)

In [28]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked
0,1,1,0,29.0,0,0,211.3375,S
1,1,1,1,0.9167,1,2,151.55,S
2,0,1,0,2.0,1,2,151.55,S
3,0,1,1,30.0,1,2,151.55,S
4,0,1,0,25.0,1,2,151.55,S
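
One detail of `map()` is worth knowing before reusing this step: a value that is missing from the dictionary becomes NaN, and `astype(int)` then fails on it. The sketch below uses hypothetical toy data (the names `toy`, `sex_codes`, and the value `'unknown'` are illustrative only) to show a check that can be run before the cast.

```python
import pandas as pd

sex_codes = {"female": 0, "male": 1}

# Toy column that contains only values present in the dictionary
toy = pd.DataFrame({"sex": ["female", "male", "male"]})
mapped = toy["sex"].map(sex_codes)

# Count values that had no entry in the dictionary (they show up as NaN)
unmapped = int(mapped.isnull().sum())

# The cast to int is safe only when nothing was left unmapped
if unmapped == 0:
    toy["sex"] = mapped.astype(int)

# A value outside the dictionary ('unknown' is hypothetical) maps to NaN
with_gap = pd.Series(["female", "unknown"]).map(sex_codes)
gap_count = int(with_gap.isnull().sum())

print(toy["sex"].tolist(), "unmapped:", unmapped, "gap_count:", gap_count)
```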


In [None]:
# One-hot encode the embarked field

![](./lesson11/9.jpg)

In [29]:
x_OneHot_df = pd.get_dummies(data=df,  columns=['embarked'])

In [30]:
# View the converted DataFrame
x_OneHot_df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked_C,embarked_Q,embarked_S
0,1,1,0,29.0,0,0,211.3375,0,0,1
1,1,1,1,0.9167,1,2,151.55,0,0,1
2,0,1,0,2.0,1,2,151.55,0,0,1
3,0,1,1,30.0,1,2,151.55,0,0,1
4,0,1,0,25.0,1,2,151.55,0,0,1


### As shown above, sex has been converted to 0 or 1, and the original embarked field has been converted into 3 fields (embarked_C, embarked_Q, embarked_S)
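
To see what `pd.get_dummies` does in isolation, here is a minimal sketch on hypothetical toy data (the `toy` DataFrame is not part of the dataset). It passes `dtype=int` as an assumption about the desired output: recent pandas versions return boolean columns by default, and `dtype=int` yields the 0/1 values shown in the output above. The last toy row has no embarked value, like the 2 null embarked rows counted earlier.

```python
import pandas as pd

# Toy data: three known ports and one missing value (hypothetical rows)
toy = pd.DataFrame({
    "fare": [7.25, 71.28, 8.05, 80.0],
    "embarked": ["S", "C", "Q", None],
})

# One column per category; dtype=int requests 0/1 instead of booleans
encoded = pd.get_dummies(data=toy, columns=["embarked"], dtype=int)

port_cols = ["embarked_C", "embarked_Q", "embarked_S"]
first_row = encoded.loc[0, port_cols].tolist()    # row with embarked == "S"
missing_row = encoded.loc[3, port_cols].tolist()  # row with no embarked value

print(encoded.columns.tolist())
print(first_row, missing_row)
```

With the default `dummy_na=False`, a row whose embarked value is missing gets 0 in all three new columns rather than a column of its own.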

## 11.4 Converting the DataFrame to an Array

### The deep learning training that follows works on arrays, so the DataFrame must first be converted to an Array

In [31]:
# Convert the DataFrame to an Array
ndarray = x_OneHot_df.values

In [34]:
# View the first row of ndarray
ndarray[0]

array([  1.    ,   1.    ,   0.    ,  29.    ,   0.    ,   0.    ,
       211.3375,   0.    ,   0.    ,   1.    ])

![](./lesson11/10.jpg)

In [38]:
# View the shape of ndarray
ndarray.shape

(1309, 10)

In [39]:
# Extract the features and the label
Label = ndarray[:, 0]
Features= ndarray[:, 1:]

In [43]:
print(Label[:5])
print(Features[:2])

[1. 1. 0. 0. 0.]
[[  1.       0.      29.       0.       0.     211.3375   0.       0.
    1.    ]
 [  1.       1.       0.9167   1.       2.     151.55     0.       0.
    1.    ]]
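
The two slices used above can be checked on a small array. This is a minimal sketch with a hypothetical 2x3 array named `table`: `[:, 0]` takes the first column of every row as a 1-D array, and `[:, 1:]` takes all remaining columns as a 2-D array.

```python
import numpy as np

# Toy array: column 0 is the label, the other columns are features (hypothetical values)
table = np.array([[1.0, 3.0, 22.0],
                  [0.0, 1.0, 38.0]])

label = table[:, 0]      # every row, first column only  -> shape (2,)
features = table[:, 1:]  # every row, columns 1 onwards  -> shape (2, 2)

print(label.shape, features.shape)
```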


![](./lesson11/11.jpg)

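### The results above show that the numeric feature fields are on different scales, for example an age of 29 versus a fare of 211, so the numbers differ widely and share no common standard. Normalization rescales every value into the range 0 to 1, which gives the numeric feature fields a common standard.
### Normalizing the features can improve the accuracy of the trained model.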

***

## 11.5 Normalizing the ndarray feature fields

### Use the preprocessing module from sklearn to normalize the data

In [45]:
# 1. Import the module
from sklearn import preprocessing

In [46]:
# 2. Create the MinMaxScaler normalization scaler, minmax_scale
minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1))

![](./lesson11/12.jpg)

In [47]:
# 3. Normalize with minmax_scale.fit_transform
scaledFeatures = minmax_scale.fit_transform(Features)

In [49]:
# View the normalized data
scaledFeatures[:2]

array([[0.        , 0.        , 0.36116884, 0.        , 0.        ,
        0.41250333, 0.        , 0.        , 1.        ],
       [0.        , 1.        , 0.00939458, 0.125     , 0.22222222,
        0.2958059 , 0.        , 0.        , 1.        ]])

### As shown, all normalized values lie between 0 and 1
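
For a feature range of (0, 1), min-max scaling maps each column with (x - min) / (max - min), using that column's own minimum and maximum. The sketch below reproduces the arithmetic with plain NumPy on hypothetical toy values; it is an illustration of the formula rather than the scaler's actual implementation, and it assumes every column contains at least two distinct values so the denominator is never zero.

```python
import numpy as np

# Toy feature matrix: two columns on very different scales (hypothetical values)
features = np.array([[1.0, 20.0],
                     [2.0, 30.0],
                     [3.0, 60.0]])

# Per-column minimum and maximum (axis=0 works down the rows)
col_min = features.min(axis=0)
col_max = features.max(axis=0)

# Min-max scaling: the smallest value maps to 0 and the largest to 1
scaled = (features - col_min) / (col_max - col_min)

print(scaled)
```

After scaling, both columns span exactly 0 to 1 even though their original ranges (1 to 3 and 20 to 60) were very different.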

In [50]:
print(str(Features[0]))
print(str(scaledFeatures[0]))

[  1.       0.      29.       0.       0.     211.3375   0.       0.
   1.    ]
[0.         0.         0.36116884 0.         0.         0.41250333
 0.         0.         1.        ]


___

## 11.6 Splitting the data into training data and test data

In [110]:
# Re-read the file
all_df = pd.read_excel(filepath)

In [111]:
all_df = all_df[cols]

In [112]:
all_df.head()

Unnamed: 0,survived,pclass,name,sex,age,sibsp,parch,fare,embarked
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,211.3375,S
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,151.55,S
2,0,1,"Allison, Miss. Helen Loraine",female,2.0,1,2,151.55,S
3,0,1,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,151.55,S
4,0,1,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,151.55,S


In [113]:
# Split the data into training data and test data
# np.random.rand() returns one or more random samples drawn uniformly from [0, 1), excluding 1, so about 80% of the rows fall below 0.8.
msk = np.random.rand(len(all_df)) < 0.8

In [114]:
train_df = all_df[msk]
test_df = all_df[~msk]

In [115]:
# Check the sizes
print('total: ', len(all_df))
print('train: ', len(train_df))
print('test: ', len(test_df))

total:  1309
train:  1027
test:  282
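
Because the mask is random, the split is only approximately 80/20 and the exact counts change on every run. If a repeatable split is wanted, one option is a seeded generator. The sketch below is an assumption-laden variation rather than the notebook's method: it swaps `np.random.rand` for `np.random.default_rng`, and the seed value 42 is arbitrary.

```python
import numpy as np

n_rows = 1309  # number of rows reported for the full dataset above

# A seeded generator makes the random mask repeatable (seed 42 is arbitrary)
rng = np.random.default_rng(42)
msk = rng.random(n_rows) < 0.8  # True for roughly 80% of the rows

n_train = int(msk.sum())      # rows that would go to the training set
n_test = int((~msk).sum())    # remaining rows go to the test set

# Rebuilding the generator with the same seed reproduces the same mask
msk_again = np.random.default_rng(42).random(n_rows) < 0.8

print("train:", n_train, "test:", n_test)
```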


In [116]:
# Create the PreprocessData function to preprocess the data
# It collects all the preprocessing commands used earlier so they are easy to reuse
def PreprocessData(raw_df):
    df = raw_df.drop(['name'], axis=1)
    # Fill null values with the column mean
    age_mean = df['age'].mean()
    df['age'] = df['age'].fillna(age_mean)
    fare_mean = df['fare'].mean()
    df['fare'] = df['fare'].fillna(fare_mean)
    # Convert the sex field to numeric values
    df['sex'] = df['sex'].map({'female':0, 'male':1}).astype(int)
    # One-hot encode the embarked field
    x_OneHot_df = pd.get_dummies(data=df, columns=['embarked'])

    # Convert the DataFrame to an array, then extract the features and the label
    ndarray = x_OneHot_df.values
    Features = ndarray[:, 1:]
    Label = ndarray[:, 0]

    # Rescale every feature column into the range 0 to 1
    minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1))
    scaledFeatures = minmax_scale.fit_transform(Features)
    return scaledFeatures, Label

In [117]:
# Preprocess the training data and the test data
train_features, train_label = PreprocessData(train_df)
test_features, test_label = PreprocessData(test_df)

In [119]:
# View the feature fields of the training data after preprocessing
train_features[:2]

array([[0.        , 0.        , 0.38021951, 0.        , 0.        ,
        0.41250333, 0.        , 0.        , 1.        ],
       [0.        , 1.        , 0.00989011, 0.125     , 0.22222222,
        0.2958059 , 0.        , 0.        , 1.        ]])

In [121]:
# View the label field of the training data after preprocessing
train_label[:2]

array([1., 1.])
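
Note that `PreprocessData` computes its minimum and maximum from whichever frame it receives, so calling it separately on the training and test data scales the two sets with different statistics. A common alternative, sketched here as an assumption rather than as the notebook's method, is to compute the statistics on the training data only and reuse them for the test data. The helpers `fit_min_max` and `apply_min_max` and the toy arrays are hypothetical; with scikit-learn the same idea is calling `fit` on the training features and `transform` on the test features.

```python
import numpy as np

def fit_min_max(train):
    """Return the per-column minimum and maximum of the training data."""
    return train.min(axis=0), train.max(axis=0)

def apply_min_max(data, col_min, col_max):
    """Scale data with statistics that were computed elsewhere."""
    return (data - col_min) / (col_max - col_min)

# Toy training and test features (hypothetical values)
train = np.array([[0.0, 10.0],
                  [10.0, 30.0]])
test = np.array([[5.0, 20.0],
                 [20.0, 10.0]])

# Statistics come from the training data only
col_min, col_max = fit_min_max(train)

train_scaled = apply_min_max(train, col_min, col_max)
test_scaled = apply_min_max(test, col_min, col_max)

print(train_scaled)
print(test_scaled)
```

With this approach a test value outside the training range lands outside 0 to 1 (the 20.0 in the first test column becomes 2.0), which is expected: the test set is measured on the scale learned from the training set.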

***

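## 11.7 Conclusion
### In this chapter we downloaded and read the Titanic passenger dataset, introduced its characteristics, and completed the data preprocessing. In the next chapter we use Keras to build a multilayer perceptron model, train it, and make predictions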