- Data overview:
    - The training set has 891 samples: 549 did not survive and 342 survived.
    - The test set has 418 samples.
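
As a quick reading of the counts above, the sketch below works out the survival rate and the accuracy that always predicting the majority class would reach. The variable names are illustrative, and the majority-class baseline is an added point of comparison, not something the notebook itself computes.

```python
# Class counts quoted in the data overview
n_died, n_survived = 549, 342
n_total = n_died + n_survived

# Share of passengers in the training set who survived
survival_rate = n_survived / n_total

# Accuracy of a trivial model that always predicts the larger class
majority_baseline = max(n_died, n_survived) / n_total

print(f"survival rate: {survival_rate:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

A classifier is only informative on this data if its validation accuracy clearly exceeds that baseline.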

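|No.|Column|Meaning|Values|
|--|--|--|--|
|1|PassengerId|Passenger ID|Integer|
|2|Survived|Whether the passenger survived|0: no; 1: yes|
|3|Pclass|Ticket class|1: first class; 2: second class; 3: third class|
|4|Name|Passenger name|String|
|5|Sex|Sex|female; male|
|6|Age|Age|Fractional if under 1; estimated ages have the form xx.5; has missing values|
|7|SibSp|Number of siblings/spouses aboard|Integer|
|8|Parch|Number of parents/children aboard|Integer|
|9|Ticket|Ticket number|String|
|10|Fare|Fare|Float|
|11|Cabin|Cabin|String; has missing values|
|12|Embarked|Port of embarkation|C: Cherbourg; Q: Queenstown; S: Southampton; has missing values|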

In [None]:
import numpy as np
import pandas as pd
import matplotlib as mpl

# Set the default font so that plots can render Chinese text
mpl.rcParams['font.sans-serif'] = ['Microsoft YaHei']
# Keep the minus sign '-' from rendering as a box in saved figures
mpl.rcParams['axes.unicode_minus'] = False

# Load the data
data_train = pd.read_csv('./data/train.csv')
data_test = pd.read_csv('./data/test.csv')


In [None]:
# info() prints its summary itself and returns None, so no print() is needed
data_train.info()
data_test.info()


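- Attribute overview
  - Numeric: Age (missing in 177 training samples and 86 test samples), SibSp, Parch, Fare (missing in 1 test sample);
  - Categorical: Pclass, Sex, Embarked (missing in 2 training samples, none in the test set);
  - ~~Text: Name, Ticket, Cabin (missing in 687 training samples and 327 test samples).~~
- Drop Name, Ticket, and Cabin; provisionally keep Age, SibSp, Parch, Fare, Pclass, Sex, and Embarked as features.
- Only two training samples lack an Embarked value, so drop those two samples.
- One-hot encode Pclass, Sex, and Embarked.
- Too many training samples lack Age to drop them, so fill Age by KNN imputation.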
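
The preprocessing steps listed above can be tried on a small made-up table first. The sketch below is an illustration only: the `toy` frame, its values, and `n_neighbors=2` are assumptions chosen to keep the example tiny, not values from the dataset or settings from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data, invented for illustration (not rows from train.csv)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0, np.nan],
    'Fare': [7.25, 71.28, 7.92, 53.10, 8.05],
    'Sex': ['male', 'female', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', 'S', None, 'S'],
})

# Count the missing values in each column
missing = toy.isna().sum()

# Drop the rows whose Embarked value is missing
toy = toy.dropna(subset=['Embarked'])

# One-hot encode the categorical columns; dtype=float keeps the frame numeric
toy = pd.get_dummies(toy, columns=['Sex', 'Embarked'], dtype=float)

# Fill the remaining gaps (only Age here) from the 2 nearest rows
imputed = KNNImputer(n_neighbors=2).fit_transform(toy)
toy_filled = pd.DataFrame(imputed, columns=toy.columns, index=toy.index)
print(toy_filled)
```

After the Embarked row is dropped, only two rows still carry an Age, so both missing ages are filled with the mean of those two values. On the real data the imputer has many more donor rows to choose from, and the distances depend on how the other columns are scaled.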

In [None]:
# Drop PassengerId, Name, Ticket, and Cabin
data_train = data_train.drop(
    columns=['PassengerId', 'Name', 'Ticket', 'Cabin'])


In [None]:
# Drop the samples whose Embarked value is missing
data_train = data_train.dropna(subset=['Embarked'])


In [None]:
# One-hot encode Pclass, Sex, and Embarked
pclass = pd.get_dummies(data_train['Pclass'], prefix='Pclass')
sex = pd.get_dummies(data_train['Sex'], prefix='Sex')
embarked = pd.get_dummies(data_train['Embarked'], prefix='Embarked')

y_train = data_train['Survived']
x_train = data_train.drop(columns=['Survived'])
x_train = x_train.drop(columns=['Pclass', 'Sex', 'Embarked'])
x_train = pd.concat([x_train, pclass, sex, embarked], axis=1)


In [None]:
# Standardize SibSp, Parch, and Fare (zero mean, unit variance)
sibSp_mean = np.mean(x_train['SibSp'])
sibSp_std = np.std(x_train['SibSp'])

parch_mean = np.mean(x_train['Parch'])
parch_std = np.std(x_train['Parch'])

fare_mean = np.mean(x_train['Fare'])
fare_std = np.std(x_train['Fare'])

x_train['SibSp'] = (x_train['SibSp']-sibSp_mean)/sibSp_std
x_train['Parch'] = (x_train['Parch']-parch_mean)/parch_std
x_train['Fare'] = (x_train['Fare']-fare_mean)/fare_std


In [None]:
# Fill in the missing Age values with KNN imputation
from sklearn.impute import KNNImputer
knn_imputer = KNNImputer(n_neighbors=10)
# fit_transform returns a NumPy array; column 0 is Age
x_train = knn_imputer.fit_transform(x_train)

# Standardize Age
age_mean = np.mean(x_train[:, 0])
age_std = np.std(x_train[:, 0])

x_train[:, 0] = (x_train[:, 0]-age_mean)/age_std


- Model: KNN
- Training: grid search
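
To show what the grid search does in isolation, here is a minimal sketch on synthetic data. The data from `make_classification`, the shortened parameter grid, and `cv=5` are assumptions made for the example; the notebook's own cell searches a wider range of `n_neighbors` on the Titanic features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the preprocessed features and labels
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Every combination in the grid is scored by cross-validation on X_fit
param_grid = {'n_neighbors': [1, 3, 5, 7],
              'weights': ['uniform', 'distance']}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_fit, y_fit)

best_params = search.best_params_          # best combination found
cv_score = search.best_score_              # its mean cross-validated accuracy
val_score = search.score(X_val, y_val)     # accuracy on the held-out split

print(best_params, cv_score, val_score)
```

`best_score_` is an average over the cross-validation folds of the fitting split, while `score` on the held-out split is a separate check that the chosen parameters were not selected by chance.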

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV


_x_train, _x_test, _y_train, _y_test = train_test_split(
    x_train, y_train, test_size=0.3, random_state=10)

model = KNeighborsClassifier()
params = [{'n_neighbors': list(range(1, 20)), 'weights': [
    'uniform', 'distance']}]


model = GridSearchCV(model, params)
model.fit(_x_train, _y_train)

print(model.best_params_)
print(model.best_score_)
print(model.score(_x_test, _y_test))


- Predict on the test set
- Save the predictions in the required format
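
The saved file pairs each `PassengerId` with a predicted `Survived` value, as the final cell of this notebook does. The sketch below shows that layout with hypothetical IDs and predictions, and writes to an in-memory buffer rather than to disk so it can be run anywhere.

```python
import io
import pandas as pd

# Hypothetical IDs and predictions, for illustration only
passenger_ids = pd.Series([892, 893, 894], name='PassengerId')
predictions = [0, 1, 0]

submission = pd.DataFrame(
    {'PassengerId': passenger_ids, 'Survived': predictions})

# Write to an in-memory buffer here; pass a file path instead to save to disk
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` matters: without it pandas writes the row index as an extra unnamed first column.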

In [None]:
# Keep PassengerId for the submission file, then drop PassengerId, Name, Ticket, and Cabin
passengerId = data_test['PassengerId']
data_test = data_test.drop(
    columns=['PassengerId', 'Name', 'Ticket', 'Cabin'])


In [None]:
# Fill the missing Fare value with the column mean
data_test['Fare'] = data_test['Fare'].fillna(data_test['Fare'].mean())


In [None]:
# One-hot encode Pclass, Sex, and Embarked
pclass = pd.get_dummies(data_test['Pclass'], prefix='Pclass')
sex = pd.get_dummies(data_test['Sex'], prefix='Sex')
embarked = pd.get_dummies(data_test['Embarked'], prefix='Embarked')

x_test = data_test.drop(columns=['Pclass', 'Sex', 'Embarked'])
x_test = pd.concat([x_test, pclass, sex, embarked], axis=1)


In [None]:
# Standardize SibSp, Parch, and Fare with the training-set mean and standard deviation
x_test['SibSp'] = (x_test['SibSp']-sibSp_mean)/sibSp_std
x_test['Parch'] = (x_test['Parch']-parch_mean)/parch_std
x_test['Fare'] = (x_test['Fare']-fare_mean)/fare_std


In [None]:
# Fill in the missing Age values with KNN imputation
knn_imputer = KNNImputer(n_neighbors=10)
x_test = knn_imputer.fit_transform(x_test)

# Standardize Age with the training-set mean and standard deviation
x_test[:, 0] = (x_test[:, 0]-age_mean)/age_std

In [158]:
# Predict on the test set
y_test_pre = model.predict(x_test)
y_test_pre = pd.Series(y_test_pre, name='Survived')

# Combine the IDs with the predictions and save the result
test_pre = pd.DataFrame({'PassengerId': passengerId, 'Survived': y_test_pre})
test_pre.to_csv('gender_submission.csv', index=False)
