# Kaggle - Titanic: Machine Learning from Disaster

Worked through with reference to the two pages below.  
u++:  
・https://www.kaggle.com/sishihara/hypothesis-and-visualization-for-titanic-in-kaggle/notebook  
  
"How to get into the top 3% of the Kaggle Titanic tutorial" (0.82297):  
・https://lp-tech.net/articles/0QUUd

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re  # regular expression library

## Loading the dataset

In [2]:
train = pd.read_csv('./input/train.csv')    # 891 x 12
test = pd.read_csv('./input/test.csv')    #418 x 11

### Concatenate the datasets to make feature engineering easier

In [3]:
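# DataFrame.append was removed in pandas 2.0; pd.concat with sort=True gives the same column order
data = pd.concat([train, test], sort=True)  # Survived is NaN for the test rows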
data

Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket
0,22.0,,S,7.2500,"Braund, Mr. Owen Harris",0,1,3,male,1,0.0,A/5 21171
1,38.0,C85,C,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,2,1,female,1,1.0,PC 17599
2,26.0,,S,7.9250,"Heikkinen, Miss. Laina",0,3,3,female,0,1.0,STON/O2. 3101282
3,35.0,C123,S,53.1000,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,4,1,female,1,1.0,113803
4,35.0,,S,8.0500,"Allen, Mr. William Henry",0,5,3,male,0,0.0,373450
5,,,Q,8.4583,"Moran, Mr. James",0,6,3,male,0,0.0,330877
6,54.0,E46,S,51.8625,"McCarthy, Mr. Timothy J",0,7,1,male,0,0.0,17463
7,2.0,,S,21.0750,"Palsson, Master. Gosta Leonard",1,8,3,male,3,0.0,349909
8,27.0,,S,11.1333,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",2,9,3,female,0,1.0,347742
9,14.0,,C,30.0708,"Nasser, Mrs. Nicholas (Adele Achem)",0,10,2,female,1,1.0,237736


## Feature engineering

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 12 columns):
Age            1046 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
Fare           1308 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 132.9+ KB


### 1. Pclass
Passenger class  
・No null values  
・int64 → no mapping needed

### 2. Sex
Passenger sex  
・No null values  
・object → mapping needed   
  
Mapping  
male: 0  
female: 1

In [5]:
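# The column is overwritten, so in the original run executing this cell a second time raised an error
# Assign the result back rather than calling inplace=True on a selected column
train['Sex'] = train['Sex'].replace(['male', 'female'], [0, 1])
test['Sex'] = test['Sex'].replace(['male', 'female'], [0, 1])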

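### 3. Embarked
Port of embarkation: one of Cherbourg, Queenstown, or Southampton  
`Embarked       1307 non-null object`  
・2 null values → need to be filled  
・object → mapping needed  

Reference 1 fills them with the most frequent value, S (Southampton).  
Other possible ways to fill them:  
・If there were many null values, they could be filled by sampling 'C', 'Q', 'S' in proportion to how often each occurs  
・If family relationships can be identified, a passenger could be assumed to have boarded at the same port as their family  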
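
The frequency-proportional fill mentioned above could look roughly like the following. This is only a sketch on a made-up toy column (`embarked`), not what the notebook actually does; the seed and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train['Embarked'] (values made up for illustration)
embarked = pd.Series(['S', 'C', 'S', None, 'Q', 'S', None, 'C'])

# Empirical distribution of the observed ports (missing values are ignored)
freq = embarked.value_counts(normalize=True)

# Draw one port per missing row, weighted by the observed frequencies
rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
missing = embarked.isnull()
filled = embarked.copy()
filled.loc[missing] = rng.choice(
    freq.index.to_numpy(), size=int(missing.sum()), p=freq.to_numpy()
)

print(filled.tolist())
```

With only 2 missing values in the real data, this will rarely differ much from simply filling with 'S'.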

In [22]:
# Start by filling the null values with 'S'
train['Embarked'] = train['Embarked'].fillna('S')
test['Embarked'] = test['Embarked'].fillna('S')

In [24]:
SouthamptonPassenger = train[train['Embarked'] == 'S']
SPasNum = len(SouthamptonPassenger)
S_SurvivedNum = len(SouthamptonPassenger[SouthamptonPassenger['Survived'] == 1])

QueenstownPassenger = train[train['Embarked'] == 'Q']
QPasNum = len(QueenstownPassenger)
Q_SurvivedNum = len(QueenstownPassenger[QueenstownPassenger['Survived'] == 1])

CherbourgPassenger = train[train['Embarked'] == 'C']
CPasNum = len(CherbourgPassenger)
C_SurvivedNum = len(CherbourgPassenger[CherbourgPassenger['Survived'] == 1])

print("Survival rate of passengers who embarked at Southampton: {}%".format(S_SurvivedNum/SPasNum * 100))
print("Survival rate of passengers who embarked at Queenstown:  {}%".format(Q_SurvivedNum/QPasNum * 100))
print("Survival rate of passengers who embarked at Cherbourg:   {}%".format(C_SurvivedNum/CPasNum * 100))

Survival rate of passengers who embarked at Southampton: 33.90092879256966%
Survival rate of passengers who embarked at Queenstown:  38.961038961038966%
Survival rate of passengers who embarked at Cherbourg:   55.35714285714286%
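
The same per-port survival rates can probably be computed more compactly with `groupby`. The sketch below uses a small made-up frame (`toy`) instead of the real training data, so its numbers are not the ones shown above.

```python
import pandas as pd

# Toy stand-in for the training frame (values are made up for illustration)
toy = pd.DataFrame({
    'Embarked': ['S', 'S', 'S', 'C', 'C', 'Q', 'Q', 'Q'],
    'Survived': [0, 1, 0, 1, 1, 0, 0, 1],
})

# The mean of a 0/1 column is the survival rate; multiply by 100 for percent
rate = toy.groupby('Embarked')['Survived'].mean() * 100
print(rate)
```

Applied to the notebook's data, the equivalent call would be `train.groupby('Embarked')['Survived'].mean() * 100`.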


In [None]:
# Convert the object-typed column to int
train['Embarked'] = train['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
test['Embarked'] = test['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)

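### 4. Fare
Ticket fare  
`Fare           1308 non-null float64`  
・1 null value → needs to be filled  
・float64 → no mapping needed  

Reference 1 fills it with the overall mean.  
Another possible way to fill it:  
・If family relationships can be identified, the passenger could be given the same fare as their family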
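
One way the family-based fill might be approximated is to treat passengers who share a ticket number as one group. That grouping is an assumption of this sketch, not something the notebook establishes, and the toy values below are made up.

```python
import numpy as np
import pandas as pd

# Toy frame: passengers on ticket 'A1' are assumed to travel together
toy = pd.DataFrame({
    'Ticket': ['A1', 'A1', 'B2', 'C3'],
    'Fare': [20.0, np.nan, 7.5, np.nan],
})

# Mean fare of each ticket group (NaN when nobody in the group has a fare)
group_fare = toy.groupby('Ticket')['Fare'].transform('mean')

# Prefer the group's fare, then fall back to the overall mean
overall_mean = toy['Fare'].mean()
toy['Fare'] = toy['Fare'].fillna(group_fare).fillna(overall_mean)

print(toy)
```

The second passenger on 'A1' inherits 20.0, while the lone passenger on 'C3' falls back to the overall mean.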

In [84]:
fare_mean = train['Fare'].mean()
train['Fare'] = train['Fare'].fillna(fare_mean)
test['Fare'] = test['Fare'].fillna(fare_mean)  # fill test with the mean of the train data, not the test data

Split the fare into 4 categories

In [85]:
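# Quantile-based binning; keep the bin edges so the test data can reuse them
train['Categorical_Fare'], fare_bins = pd.qcut(train['Fare'], 4, labels=False, retbins=True)
# Open the outer edges so test fares outside the train range still fall into a bin
fare_bins[0], fare_bins[-1] = -np.inf, np.inf
# Bin the test fares with the edges learned from train (the original assigned train's bins to test by index)
test['Categorical_Fare'] = pd.cut(test['Fare'], bins=fare_bins, labels=False)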

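### 5. Age
Age  
`Age            1046 non-null float64`  
・Quite a few null values → need to be filled in some way  
・float64 → no mapping needed  

Reference 1 fills them with random numbers within ±σ of the mean.  
Once all values are filled in, they are split into 5 categories.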

In [86]:
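age_ave = train['Age'].mean()
age_std = train['Age'].std()
low, high = int(age_ave - age_std), int(age_ave + age_std)

# Draw a separate random age for every missing row
# (randint without size returns a single number, which would fill every gap with the same age)
for df in (train, test):
    missing = df['Age'].isnull()
    df.loc[missing, 'Age'] = np.random.randint(low, high, size=missing.sum())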

Split the age into 5 categories

In [87]:
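# Equal-width binning; keep the bin edges so the test data can reuse them
train['Categorical_Age'], age_bins = pd.cut(train['Age'], 5, labels=False, retbins=True)
age_bins[0], age_bins[-1] = -np.inf, np.inf
# Bin the test ages with the edges learned from train (the original assigned train's bins to test by index)
test['Categorical_Age'] = pd.cut(test['Age'], bins=age_bins, labels=False)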
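
Fare is binned with `pd.qcut` and Age with `pd.cut`. The difference between the two can be seen on a small made-up series with one outlier; this is only an illustration and does not use the Titanic data.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# pd.cut: bins of equal width over the value range
equal_width = pd.cut(values, 4, labels=False)

# pd.qcut: bins holding (roughly) equal numbers of values
equal_count = pd.qcut(values, 4, labels=False)

print(equal_width.tolist())  # [0, 0, 0, 0, 0, 0, 0, 3]
print(equal_count.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3]
```

With a skewed variable such as Fare, equal-width bins put almost everything in the first bin, which is one plausible reason to prefer quantile bins there.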

### 6. Name
Name  
`Name           1309 non-null object`  
・No null values  
・object → mapping needed

`name` is a `str`  
`title_search` is an `_sre.SRE_Match` object

In [92]:
# Extract the title (e.g. 'Mr', 'Mrs') from a passenger name
def get_title(name):
    title_search = re.search(r'([A-Za-z]+)\.', name)  # match a run of letters followed by a dot
    # If the title exists, extract and return it.
    if title_search:
        #print(type(title_search))
        print('{}:{}'.format(title_search, title_search.groups()))
        return title_search.group(1)  # group(0) is 'Mr.'; group(1) is 'Mr' (without the dot)
    return ""

train['Title'] = train['Name'].apply(get_title)
test['Title'] = test['Name'].apply(get_title)
print(train['Title'].head())

<_sre.SRE_Match object; span=(8, 11), match='Mr.'>:('Mr',)
<_sre.SRE_Match object; span=(9, 13), match='Mrs.'>:('Mrs',)
<_sre.SRE_Match object; span=(11, 16), match='Miss.'>:('Miss',)
<_sre.SRE_Match object; span=(10, 14), match='Mrs.'>:('Mrs',)
<_sre.SRE_Match object; span=(7, 10), match='Mr.'>:('Mr',)
<_sre.SRE_Match object; span=(7, 10), match='Mr.'>:('Mr',)
<_sre.SRE_Match object; span=(10, 13), match='Mr.'>:('Mr',)
<_sre.SRE_Match object; span=(9, 16), match='Master.'>:('Master',)
<_sre.SRE_Match object; span=(9, 13), match='Mrs.'>:('Mrs',)
<_sre.SRE_Match object; span=(8, 12), match='Mrs.'>:('Mrs',)
<_sre.SRE_Match object; span=(11, 16), match='Miss.'>:('Miss',)
<_sre.SRE_Match object; span=(9, 14), match='Miss.'>:('Miss',)
<_sre.SRE_Match object; span=(13, 16), match='Mr.'>:('Mr',)
<_sre.SRE_Match object; span=(11, 14), match='Mr.'>:('Mr',)
<_sre.SRE_Match object; span=(9, 14), match='Miss.'>:('Miss',)
<_sre.SRE_Match object; span=(9, 13), match='Mrs.'>:('Mrs',)
<_sre.SRE_Match 

'Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'  
→ converted to 'Rare'

'Mlle', 'Ms', 'Mme'  
→ converted to 'Miss', 'Miss', 'Mrs'

Then mapped as  
'Mr':1, 'Miss':2, 'Mrs':3, 'Master':4, 'Rare':5
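
The extraction and mapping steps can be tried end to end on a few names. The sketch below uses the vectorised `Series.str.extract` instead of the notebook's `get_title` helper, and the last two names are made up for illustration.

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Hypothetical, Mlle. Example',  # made-up name
    'Hypothetical, Col. Example',   # made-up name
])

# First run of letters that is followed by a dot
title = names.str.extract(r'([A-Za-z]+)\.', expand=False)

# Collapse rare titles, then normalise the French/alternative forms
rare = ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major',
        'Rev', 'Sir', 'Jonkheer', 'Dona']
title = title.replace(rare, 'Rare')
title = title.replace({'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'})

# Integer codes; anything outside the mapping becomes 0
title_mapping = {'Mr': 1, 'Miss': 2, 'Mrs': 3, 'Master': 4, 'Rare': 5}
codes = title.map(title_mapping).fillna(0).astype(int)

print(codes.tolist())  # [1, 2, 2, 5]
```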

In [69]:
rare_titles = ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona']
train['Title'] = train['Title'].replace(rare_titles, 'Rare')
test['Title'] = test['Title'].replace(rare_titles, 'Rare')

train['Title'] = train['Title'].replace(['Mlle', 'Ms', 'Mme'], ['Miss', 'Miss', 'Mrs'])
test['Title'] = test['Title'].replace(['Mlle', 'Ms', 'Mme'], ['Miss', 'Miss', 'Mrs'])

title_mapping = {'Mr':1, 'Miss':2, 'Mrs':3, 'Master':4, 'Rare':5}
train['Title'] = train['Title'].map(title_mapping)
test['Title'] = test['Title'].map(title_mapping)

# Titles that are not in the mapping become 0
train['Title'] = train['Title'].fillna(0)
test['Title'] = test['Title'].fillna(0)
print(train['Title'].head())

0    1
1    3
2    2
3    3
4    1
Name: Title, dtype: int64


In [70]:
data = pd.concat([train, test], sort=True)  # DataFrame.append was removed in pandas 2.0

## Data cleaning

In [71]:
data.head()

Unnamed: 0,Age,Cabin,Categorical_Age,Categorical_Fare,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket,Title
0,22.0,,1,0,0,7.25,"Braund, Mr. Owen Harris",0,1,3,0,1,0.0,A/5 21171,1
1,38.0,C85,2,3,1,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,2,1,1,1,1.0,PC 17599,3
2,26.0,,1,1,0,7.925,"Heikkinen, Miss. Laina",0,3,3,1,0,1.0,STON/O2. 3101282,2
3,35.0,C123,2,3,0,53.1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,4,1,1,1,1.0,113803,3
4,35.0,,2,1,0,8.05,"Allen, Mr. William Henry",0,5,3,0,0,0.0,373450,1


In [72]:
delete_columns = ['Fare', 'Age', 'Name', 'PassengerId', 'SibSp', 'Parch', 'Ticket', 'Cabin']
train.drop(delete_columns, axis=1, inplace = True)
test.drop(delete_columns, axis=1, inplace = True)
train.head()

Unnamed: 0,Survived,Pclass,Sex,Embarked,Categorical_Fare,Categorical_Age,Title
0,0,3,0,0,0,1,1
1,1,1,1,1,3,2,3
2,1,3,1,0,1,1,2
3,1,1,1,0,3,2,3
4,0,3,0,0,1,2,1


## Classification

Build X and y  
X : train with Survived removed  
y : the Survived column of train only (the labels)

In [73]:
X = train.drop('Survived', axis=1)
y = train['Survived']
X_test = test.copy()

### Feature scaling

In [74]:
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
X = std_scaler.fit_transform(X)
X_test = std_scaler.transform(X_test)  # reuse the mean/std fitted on train; do not refit on test
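
A small sketch of why the scaler should be fitted on the training data only and then reused for the test data. The arrays are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (made up for illustration)
X_train_toy = np.array([[0.0], [2.0], [4.0]])
X_test_toy = np.array([[2.0], [6.0]])

# Learn mean and standard deviation from the training data only
scaler = StandardScaler().fit(X_train_toy)

train_scaled = scaler.transform(X_train_toy)
test_scaled = scaler.transform(X_test_toy)  # same mean/std as train

print(scaler.mean_, scaler.scale_)
print(test_scaled.ravel())
```

Here the test value 2.0 maps to 0 because it equals the training mean. Refitting on the test set would instead centre the test data on its own mean, so train and test would no longer share a scale.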

## Grid Search CV

### Using k-NN

In [75]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

n_neighbors = list(range(5, 20, 1))
algorithm = ['auto']
weights = ['uniform', 'distance']
leaf_size = list(range(1, 50, 5))
hyperparams = {'algorithm' : algorithm, 'weights' : weights, 'leaf_size' : leaf_size, 'n_neighbors' : n_neighbors}
gd = GridSearchCV(estimator = KNeighborsClassifier(), param_grid = hyperparams, verbose=True, cv=10, scoring='roc_auc', n_jobs=10)
gd.fit(X, y)
print(gd.best_score_)
print(gd.best_estimator_)

Fitting 10 folds for each of 300 candidates, totalling 3000 fits


[Parallel(n_jobs=10)]: Done  30 tasks      | elapsed:   10.7s
[Parallel(n_jobs=10)]: Done 180 tasks      | elapsed:   12.6s
[Parallel(n_jobs=10)]: Done 678 tasks      | elapsed:   17.8s
[Parallel(n_jobs=10)]: Done 2078 tasks      | elapsed:   29.0s
[Parallel(n_jobs=10)]: Done 3000 out of 3000 | elapsed:   37.7s finished


0.8610769072675721
KNeighborsClassifier(algorithm='auto', leaf_size=26, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=13, p=2,
           weights='distance')


### Use the model found by grid search

In [76]:
gd.best_estimator_.fit(X,y)
y_pred = gd.best_estimator_.predict(X_test)  # predict on the scaled test features, not the raw test frame
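
With the default `refit=True`, `GridSearchCV` already refits the best configuration on all the data it was given, so the fitted search object can be used for prediction directly. The sketch below shows this on synthetic data from `make_classification`; the dataset and the small parameter grid are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled Titanic features
X_toy, y_toy = make_classification(n_samples=120, n_features=6, random_state=0)

param_grid = {'n_neighbors': [3, 5, 7], 'weights': ['uniform', 'distance']}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='roc_auc')
search.fit(X_toy, y_toy)

# refit=True (the default) means the best model is already trained on X_toy, y_toy
pred = search.predict(X_toy[:10])

print(search.best_params_)
print(pred)
```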

### Submit

In [77]:
temp = pd.DataFrame(pd.read_csv('./input/test.csv')['PassengerId'])
temp['Survived'] = list(map(int, y_pred))
temp.to_csv('submission.csv', index=False)