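## 1. Abstract

The sinking of the Titanic is one of the most famous shipwrecks in history. Predicting which Titanic passengers survived is also an introductory competition on Kaggle, which is why it was chosen as the subject of this study.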

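## 2. Background and Objectives

On April 10, 1912, the Titanic set out on her maiden voyage, which was also her only passenger voyage. She struck an iceberg late on April 14 and sank in the early hours of April 15, 1912. Of the 2224 people on board, 1514 died, making it one of the deadliest peacetime maritime disasters in modern history. The tragedy shocked the international community and led to stricter ship safety regulations.

One reason so many lives were lost was that there were not enough lifeboats for the passengers and crew. Although survival involved some luck, some groups were more likely to survive than others, such as women, children, and the upper class. The goal in this competition is therefore to use machine learning tools to predict which passengers survived the tragedy.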


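## 3. Dataset Description (Including Features) and Source

The columns of the train dataset are as follows:

Variable    | Definition                                                 | Key
:-----------|:-----------------------------------------------------------|:------
PassengerId | Passenger ID                                               |   
Survived    | Whether the passenger survived                             | 0 = died, 1 = survived
Pclass      | Ticket class                                               | 1 = 1st class, 2 = 2nd class, 3 = 3rd class
Name        | Passenger name                                             |   
Sex         | Passenger sex                                              |   
Age         | Passenger age                                              |   
SibSp       | Number of siblings or spouses traveling with the passenger |   
Parch       | Number of parents or children traveling with the passenger |   
Ticket      | Ticket number                                              |   
Fare        | Ticket fare                                                |   
Cabin       | Cabin number                                               |   
Embarked    | Port of embarkation                                        | C = Cherbourg, Q = Queenstown, S = Southampton
Set         | Marks training vs. test rows (added in this notebook)      |   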

## 四、資料預處理

### 4-1. Importing the data

In [1]:
# Read the Titanic training data (train.csv) and test data (test.csv); test.csv has no Survived column.
import pandas as pd
titanic_train = pd.read_csv('data/train.csv')
titanic_test = pd.read_csv('data/test.csv')

In [2]:
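# Add a "Set" column to tell training and test rows apart
titanic_train['Set'] = 'Train'
titanic_test['Set'] = 'Test'

# Combine the training and test data; sort=False keeps the column order of the first frame
titanic_test['Survived'] = None
full_data = pd.concat([titanic_train, titanic_test], sort=False)
# reindex_axis is deprecated and no longer available in current pandas; reindex(columns=...) does the same job
full_data = full_data.reindex(columns=titanic_train.columns)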
full_data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Set
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,Train
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Train
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,Train
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,Train
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,Train
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,Train
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,Train
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S,Train
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,Train
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,Train


### 4-2. Preprocessing the data

In [3]:
# Count the missing values in each column. Cabin has too many missing values, so it is set aside for now; only Embarked, Age, and Fare are handled.
full_data.isnull().sum()

PassengerId       0
Survived        418
Pclass            0
Name              0
Sex               0
Age             263
SibSp             0
Parch             0
Ticket            0
Fare              1
Cabin          1014
Embarked          2
Set               0
dtype: int64

In [4]:
# See which value is most common in the Embarked column
full_data['Embarked'].value_counts()

S    914
C    270
Q    123
Name: Embarked, dtype: int64

In [5]:
# Find the rows where Embarked is NaN
full_data[full_data['Embarked'].isnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Set
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,,Train
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,,Train


In [6]:
# S is the most common value, so by majority rule the NaN values are assumed to be S
full_data['Embarked'] = full_data['Embarked'].fillna('S')

In [7]:
# Confirm that the NaN values are now S
full_data[full_data['PassengerId'].isin([62, 830])]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Set
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,S,Train
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,S,Train
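
The majority-rule fill above hard-codes `'S'` after reading it off `value_counts()`. As a minimal sketch, the same value could be derived programmatically with `Series.mode()`. The `toy` frame below is made-up data standing in for `full_data`, not part of the original notebook.

```python
import pandas as pd

# Made-up stand-in for full_data['Embarked'] (assumption: not the real Titanic rows)
toy = pd.DataFrame({"Embarked": ["S", "C", None, "S", "Q", None]})

# mode() skips missing values and returns a Series (ties would give several rows),
# so take the first entry as the fill value.
most_common_port = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_common_port)

print(most_common_port)
print(toy["Embarked"].isnull().sum())
```

Deriving the value this way keeps the cell correct if the data changes, at the cost of hiding which port was actually chosen unless it is printed.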


In [8]:
# Next, handle the Age column: list the rows where Age is missing
full_data[full_data['Age'].isnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Set
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,Train
17,18,1,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0000,,S,Train
19,20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.2250,,C,Train
26,27,0,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.2250,,C,Train
28,29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.8792,,Q,Train
29,30,0,3,"Todoroff, Mr. Lalio",male,,0,0,349216,7.8958,,S,Train
31,32,1,1,"Spencer, Mrs. William Augustus (Marie Eugenie)",female,,1,0,PC 17569,146.5208,B78,C,Train
32,33,1,3,"Glynn, Miss. Mary Agatha",female,,0,0,335677,7.7500,,Q,Train
36,37,1,3,"Mamee, Mr. Hanna",male,,0,0,2677,7.2292,,C,Train
42,43,0,3,"Kraeff, Mr. Theodor",male,,0,0,349253,7.8958,,C,Train


In [9]:
# Compute the mean age for each passenger class
mean_age_by_pclass = full_data[['Pclass', 'Age']]
mean_age_by_pclass = mean_age_by_pclass.groupby('Pclass', as_index=False).mean()
mean_age_by_pclass

Unnamed: 0,Pclass,Age
0,1,39.15993
1,2,29.506705
2,3,24.816367


In [10]:
p1_age = round(mean_age_by_pclass.loc[mean_age_by_pclass['Pclass'] == 1, ['Age']].values[0][0], 2)
p2_age = round(mean_age_by_pclass.loc[mean_age_by_pclass['Pclass'] == 2, ['Age']].values[0][0], 2)
p3_age = round(mean_age_by_pclass.loc[mean_age_by_pclass['Pclass'] == 3, ['Age']].values[0][0], 2)

In [11]:
full_data.loc[(full_data['Pclass'] == 1) & (full_data['Age'].isnull()), ['Age']] = p1_age
full_data.loc[(full_data['Pclass'] == 2) & (full_data['Age'].isnull()), ['Age']] = p2_age
full_data.loc[(full_data['Pclass'] == 3) & (full_data['Age'].isnull()), ['Age']] = p3_age
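
The per-class mean imputation in cells 9 to 11 can also be sketched in one step with `groupby(...).transform("mean")`, which returns a Series aligned row by row with the original frame. This is an alternative formulation under stated assumptions, not the notebook's own code: the `toy` frame holds made-up values, and unlike the cells above it does not round the means to two decimals.

```python
import pandas as pd

# Made-up stand-in for full_data[['Pclass', 'Age']] (assumption: illustrative values only)
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 38.0, None, 20.0, 30.0, None],
})

# For every row, the mean Age of that row's Pclass (missing ages are ignored by mean)
class_mean_age = toy.groupby("Pclass")["Age"].transform("mean")

# Fill only the missing ages; existing ages are left untouched
toy["Age"] = toy["Age"].fillna(class_mean_age)

print(toy)
```

The advantage of this form is that it scales to any number of classes without one line per class.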

In [12]:
# Next, find the rows where Fare is NaN.
full_data[full_data['Fare'].isnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Set
152,1044,,3,"Storey, Mr. Thomas",male,60.5,0,0,3701,,,S,Test


In [13]:
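# This passenger traveled 3rd class and boarded at port S, but the fare is unknown, so use the median fare of 3rd-class passengers who boarded at S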
fare_median = full_data.loc[(full_data['Embarked'] == 'S') & (full_data['Pclass'] == 3), ['Fare']].median().values[0]
fare_median

8.05

In [14]:
full_data.loc[full_data['Fare'].isnull(), ['Fare']] = fare_median

In [15]:
# Create a new variable `family_size`; the +1 counts the passenger themselves
full_data["family_size"] = full_data["SibSp"] + full_data["Parch"] + 1

In [16]:
full_data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Set,family_size
0,1,0,3,"Braund, Mr. Owen Harris",male,22.00,1,0,A/5 21171,7.2500,,S,Train,2
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.00,1,0,PC 17599,71.2833,C85,C,Train,2
2,3,1,3,"Heikkinen, Miss. Laina",female,26.00,0,0,STON/O2. 3101282,7.9250,,S,Train,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.00,1,0,113803,53.1000,C123,S,Train,2
4,5,0,3,"Allen, Mr. William Henry",male,35.00,0,0,373450,8.0500,,S,Train,1
5,6,0,3,"Moran, Mr. James",male,24.82,0,0,330877,8.4583,,Q,Train,1
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.00,0,0,17463,51.8625,E46,S,Train,1
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.00,3,1,349909,21.0750,,S,Train,5
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.00,0,2,347742,11.1333,,S,Train,3
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.00,1,0,237736,30.0708,,C,Train,2


In [17]:
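# Check for remaining missing values. Survived is empty only for the test rows (it was added so the two sets could be stacked); Cabin is still ignored because too many values are missing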
full_data.isnull().sum()

PassengerId       0
Survived        418
Pclass            0
Name              0
Sex               0
Age               0
SibSp             0
Parch             0
Ticket            0
Fare              0
Cabin          1014
Embarked          0
Set               0
family_size       0
dtype: int64

## 5. Machine Learning

### 5.1 Selecting variables

In [18]:
training_data = full_data[full_data.Set == "Train"]
training_data = training_data[['Pclass', 'Sex', 'Age', 'family_size', 'Fare', 'Survived']]
X_train = pd.get_dummies(training_data.drop(['Survived'], axis = 1))
y_train = training_data['Survived'].values.astype('int')

test_data = full_data[full_data.Set == "Test"]
X_test = pd.get_dummies(test_data[['Pclass', 'Sex', 'Age', 'family_size', 'Fare']])

In [19]:
X_train

Unnamed: 0,Pclass,Age,family_size,Fare,Sex_female,Sex_male
0,3,22.00,2,7.2500,0,1
1,1,38.00,2,71.2833,1,0
2,3,26.00,1,7.9250,1,0
3,1,35.00,2,53.1000,1,0
4,3,35.00,1,8.0500,0,1
5,3,24.82,1,8.4583,0,1
6,1,54.00,1,51.8625,0,1
7,3,2.00,5,21.0750,0,1
8,3,27.00,3,11.1333,1,0
9,2,14.00,2,30.0708,1,0


In [20]:
X_train.shape, y_train.shape, X_test.shape

((891, 6), (891,), (418, 6))
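
Cell 18 calls `pd.get_dummies` separately on the training and test data. The shapes above match because both sets contain both values of `Sex`, but that is not guaranteed in general. A minimal sketch of a guard, using made-up `train` and `test` frames (hypothetical names, not from the notebook), reindexes the test columns to the training columns:

```python
import pandas as pd

# Made-up frames (assumption): the test frame happens to contain no 'female' rows
train = pd.DataFrame({"Pclass": [1, 3, 2], "Sex": ["male", "female", "male"]})
test = pd.DataFrame({"Pclass": [3, 1], "Sex": ["male", "male"]})

X_tr = pd.get_dummies(train)
X_te = pd.get_dummies(test)   # has no Sex_female column

# Give the test matrix exactly the training columns, in the same order;
# dummy columns missing from the test data are filled with 0.
X_te_aligned = X_te.reindex(columns=X_tr.columns, fill_value=0)

print(list(X_tr.columns))
print(list(X_te_aligned.columns))
```

With the Titanic data this step would leave `X_test` unchanged, so it only serves as a safety check.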

### 5.2 Building a KNN classification model

In [21]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# k-Nearest Neighbors: compare several values of k using 5-fold cross-validation
ks = [3, 5, 7, 9, 11]
for k in ks:
    knn = KNeighborsClassifier(n_neighbors = k)
    knn_acc = cross_val_score(knn, X_train, y_train, cv = 5, scoring = "accuracy").mean()
    print("[n_neighbors = %i] accuracy: %.2f%%" % (k, knn_acc * 100))

[n_neighbors = 3] accuracy: 71.39%
[n_neighbors = 5] accuracy: 71.84%
[n_neighbors = 7] accuracy: 69.70%
[n_neighbors = 9] accuracy: 70.04%
[n_neighbors = 11] accuracy: 70.05%
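
KNN classifies by distance, and the features here sit on very different scales: `Fare` runs from single digits to well over 100, while the `Sex` dummies are 0 or 1, so `Fare` can dominate the distance. One commonly used remedy, which the notebook above does not apply, is to standardize the features inside a pipeline so that scaling is refit on each cross-validation fold. The sketch below uses synthetic data (`X_demo` and `y_demo` are made up) purely to show the mechanics; it makes no claim about the score this would give on the Titanic data.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train (assumption: made-up numbers, not Titanic data)
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(1, 80, n)      # small-scale feature
fare = rng.uniform(5, 500, n)    # large-scale feature
X_demo = np.column_stack([age, fare])
y_demo = (age < 30).astype(int)  # label depends only on the small-scale feature

# StandardScaler is fit on the training part of each fold, then KNN runs on scaled data
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(scaled_knn, X_demo, y_demo, cv=5, scoring="accuracy")

print("mean CV accuracy: %.2f%%" % (scores.mean() * 100))
```

To try this on the real data, `X_demo` and `y_demo` would be replaced with `X_train` and `y_train`, and the result compared with the unscaled accuracies above.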


In [22]:
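# n_neighbors = 5 had the highest cross-validated accuracy above, so use it for the final model
knn = KNeighborsClassifier(n_neighbors = 5)
# Fit on the training data, then predict on the test data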
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

In [23]:
# Per the competition rules, produce a file with 2 columns: PassengerId and Survived (predicted)
submission = pd.DataFrame({
    "PassengerId": test_data["PassengerId"],
    "Survived": y_pred
})
submission.to_csv("result/knn_submission.csv", index = False)

## 6. Results and Discussion (Including Model Evaluation and Improvement)

In [24]:
# Results: the predicted Survived value for each test passenger
submission

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,0
5,897,0
6,898,0
7,899,1
8,900,0
9,901,1


### Model evaluation: cell 21 shows a 5-fold cross-validated accuracy of 71.84% for the chosen model (n_neighbors = 5)