## Background
This kernel applies Keras to the classic Titanic survivors dataset. It assumes that you are familiar with the Titanic survivors data and skips most of the very necessary EDA.

Specifically, I want to see whether some of the SibSp and Parch feature engineering can be avoided by using a deep learning architecture while still getting a decent score.

## Load environment

In [1]:
from __future__ import print_function
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.optimizers import SGD, RMSprop, Adam
from keras.layers import Dense, Activation, Dropout

In [2]:
raw_train = pd.read_csv('../input/train.csv', index_col=0)
raw_train['is_test'] = 0
raw_test = pd.read_csv('../input/test.csv', index_col=0)
raw_test['is_test'] = 1

In [3]:
all_data = pd.concat((raw_train, raw_test), axis=0)

## Functions to preprocess the data

In [4]:
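def get_title_last_name(name):
    # names look like "Braund, Mr. Owen Harris": last name, title, given names
    full_name = name.str.split(', ', n=0, expand=True)
    last_name = full_name[0]  # parsed but not used below
    # the title is everything between the comma and the first period
    titles = full_name[1].str.split('.', n=0, expand=True)
    titles = titles[0]
    return(titles)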

def get_titles_from_names(df):
    df['Title'] = get_title_last_name(df['Name'])
    df = df.drop(['Name'], axis=1)
    return(df)

def get_dummy_cats(df):
    return(pd.get_dummies(df, columns=['Title', 'Pclass', 'Sex', 'Embarked',
                                       'Cabin', 'Cabin_letter']))

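def get_cabin_letter(df):
    # missing cabins become 'Z' so every passenger gets a cabin letter;
    # plain assignment avoids the chained in-place fillna, which newer pandas warns about
    df['Cabin'] = df['Cabin'].fillna('Z')
    df['Cabin_letter'] = df['Cabin'].str[0]
    return(df)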

def process_data(df):
    # preprocess titles, cabin, embarked
    df = get_titles_from_names(df)    
    df['Embarked'] = df['Embarked'].fillna('S')
    df = get_cabin_letter(df)
    
    # drop remaining features
    df = df.drop(['Ticket', 'Fare'], axis=1)
    
    # create dummies for categorical features
    df = get_dummy_cats(df)
    
    return(df)

proc_data = process_data(all_data)
proc_train = proc_data[proc_data['is_test'] == 0]
proc_test = proc_data[proc_data['is_test'] == 1]
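
To make the title extraction concrete, here is a small sketch of the same two-step split on three sample names in the dataset's `Last, Title. Given names` format. The variable names `names`, `after_comma`, and `titles` are illustrative only and are not part of the kernel.

```python
import pandas as pd

# Sample names in the "Last, Title. Given names" format used by the dataset
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])

# Step 1: keep everything after the comma that follows the last name
after_comma = names.str.split(", ", expand=True)[1]
# Step 2: keep everything before the first period, which is the title
titles = after_comma.str.split(".", expand=True)[0]

print(titles.tolist())  # ['Mr', 'Mrs', 'Miss']
```

Each distinct title later becomes its own `Title_*` dummy column.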

In [5]:
proc_data.head()

PassengerId,Survived,Age,SibSp,Parch,is_test,Title_Capt,Title_Col,Title_Don,Title_Dona,Title_Dr,...,Cabin_Z,Cabin_letter_A,Cabin_letter_B,Cabin_letter_C,Cabin_letter_D,Cabin_letter_E,Cabin_letter_F,Cabin_letter_G,Cabin_letter_T,Cabin_letter_Z
1,0.0,22.0,1,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
2,1.0,38.0,1,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
3,1.0,26.0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
4,1.0,35.0,1,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
5,0.0,35.0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1


## Build Network to predict missing ages

Use the features with Age removed as the independent variables (`X_train_age`) and Age itself as the dependent variable (`y_train_age`).

In [6]:
for_age_train = proc_data.drop(['Survived', 'is_test'], axis=1).dropna(axis=0)
X_train_age = for_age_train.drop('Age', axis=1)
y_train_age = for_age_train['Age']

In [7]:
X_train_age.shape

(1046, 224)

In [8]:
y_train_age.shape

(1046,)
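
The order of operations in cell 6 matters: `Survived` is missing for every test row, so it has to be dropped before `dropna`, otherwise all test rows would be discarded along with the rows that lack an age. A minimal sketch on a made-up four-row frame (the `toy` data is hypothetical and only mirrors the column names used above):

```python
import numpy as np
import pandas as pd

# Toy stand-in for proc_data: two train rows and two test rows
toy = pd.DataFrame(
    {
        "Survived": [0.0, 1.0, np.nan, np.nan],  # NaN for test rows
        "Age": [22.0, np.nan, 30.0, np.nan],
        "SibSp": [1, 0, 0, 2],
        "is_test": [0, 0, 1, 1],
    },
    index=pd.Index([1, 2, 3, 4], name="PassengerId"),
)

# Drop the label columns first, then keep only rows with a known Age
known_age = toy.drop(["Survived", "is_test"], axis=1).dropna(axis=0)
X_toy = known_age.drop("Age", axis=1)
y_toy = known_age["Age"]

print(X_toy.index.tolist())         # [1, 3]
print(toy.dropna().index.tolist())  # [1]: dropna before dropping Survived loses the test row
```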

> Predict the missing ages with a deep network

In [29]:
# create model
tmodel = Sequential()
# input_dim = number of input features (independent variables); units = number of neurons
tmodel.add(Dense(input_dim=X_train_age.shape[1], units=128,
                 kernel_initializer='normal', bias_initializer='zeros'))
# How initializers are used:
# an initializer defines how a Keras layer's starting weights are set.
# The keyword argument that passes an initializer depends on the layer type.
# How the weights are initialized affects how well gradients propagate
# and therefore how many layers can be stacked.
# kernel_initializer='normal' draws the starting weights from a normal distribution;
# bias_initializer='zeros' creates a bias tensor initialized to 0.

tmodel.add(Activation('relu'))  # activation function: decides how strongly each unit passes its signal on

for i in range(0, 8):
    tmodel.add(Dense(units=64, kernel_initializer='normal',  # Dense is a standard fully connected layer
                     bias_initializer='zeros'))
    tmodel.add(Activation('relu'))
    tmodel.add(Dropout(.25))  # randomly drops 25% of the units during training

tmodel.add(Dense(units=1))  # one output unit for the single dependent variable (Age)
tmodel.add(Activation('linear'))

tmodel.compile(loss='mean_squared_error', optimizer='rmsprop')
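
As a rough illustration of what the `relu` activation and the `Dropout(.25)` layers do to a vector of unit outputs, here is a NumPy sketch. It is not how Keras implements these layers internally; `relu` and `dropout` are hypothetical helper names, and the scaling of surviving units by `1 / (1 - rate)` follows the usual "inverted dropout" convention applied at training time.

```python
import numpy as np

def relu(x):
    # Negative inputs are clipped to zero, positive inputs pass through unchanged
    return np.maximum(x, 0.0)

def dropout(x, rate, rng):
    # Zero each unit with probability `rate` and rescale the survivors
    # so the expected value of the output matches the input
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(0)
activations = relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0]))
dropped = dropout(np.ones(10_000), rate=0.25, rng=rng)

print(activations.tolist())   # [0.0, 0.0, 0.0, 1.5, 3.0]
print((dropped == 0).mean())  # close to 0.25
```

At prediction time dropout is inactive, so every unit contributes.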

In [30]:
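tmodel.fit(X_train_age.values, y_train_age.values, epochs=600, verbose=2)  # epochs: number of passes over the full dataset
# The loss shows how training is progressing: each epoch it reports how far the model's predictions are from the targets.
# Here the loss is the mean of the squared differences; training repeats until it falls to an acceptable level.
# verbose controls whether (and how) training progress is displayed.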

Epoch 1/600
33/33 - 0s - loss: 517.5284
Epoch 2/600
33/33 - 0s - loss: 226.2280
Epoch 3/600
33/33 - 0s - loss: 206.2910
Epoch 4/600
33/33 - 0s - loss: 203.2900
Epoch 5/600
33/33 - 0s - loss: 207.6428
Epoch 6/600
33/33 - 0s - loss: 194.2111
Epoch 7/600
33/33 - 0s - loss: 176.3115
Epoch 8/600
33/33 - 0s - loss: 175.9699
Epoch 9/600
33/33 - 0s - loss: 183.7759
Epoch 10/600
33/33 - 0s - loss: 164.1871
Epoch 11/600
33/33 - 0s - loss: 162.5147
Epoch 12/600
33/33 - 0s - loss: 143.3214
Epoch 13/600
33/33 - 0s - loss: 151.7272
Epoch 14/600
33/33 - 0s - loss: 150.7619
Epoch 15/600
33/33 - 0s - loss: 154.0368
Epoch 16/600
33/33 - 0s - loss: 135.6397
Epoch 17/600
33/33 - 0s - loss: 143.8985
Epoch 18/600
33/33 - 0s - loss: 147.5223
Epoch 19/600
33/33 - 0s - loss: 136.1484
Epoch 20/600
33/33 - 0s - loss: 144.4813
Epoch 21/600
33/33 - 0s - loss: 129.9934
Epoch 22/600
33/33 - 0s - loss: 119.8342
Epoch 23/600
33/33 - 0s - loss: 128.8457
Epoch 24/600
33/33 - 0s - loss: 123.4320
...

Epoch 202/600
33/33 - 0s - loss: 85.0590
Epoch 203/600
33/33 - 0s - loss: 86.0494
Epoch 204/600
33/33 - 0s - loss: 83.8481
Epoch 205/600
33/33 - 0s - loss: 81.9991
Epoch 206/600
33/33 - 0s - loss: 84.6423
Epoch 207/600
33/33 - 0s - loss: 86.4496
Epoch 208/600
33/33 - 0s - loss: 86.3139
Epoch 209/600
33/33 - 0s - loss: 84.7838
Epoch 210/600
33/33 - 0s - loss: 83.6881
Epoch 211/600
33/33 - 0s - loss: 86.7827
Epoch 212/600
33/33 - 0s - loss: 84.5842
Epoch 213/600
33/33 - 0s - loss: 82.4114
Epoch 214/600
33/33 - 0s - loss: 82.1651
Epoch 215/600
33/33 - 0s - loss: 84.7647
Epoch 216/600
33/33 - 0s - loss: 82.5006
Epoch 217/600
33/33 - 0s - loss: 84.6013
Epoch 218/600
33/33 - 0s - loss: 81.2997
Epoch 219/600
33/33 - 0s - loss: 87.4687
Epoch 220/600
33/33 - 0s - loss: 85.6525
Epoch 221/600
33/33 - 0s - loss: 82.8215
Epoch 222/600
33/33 - 0s - loss: 81.6132
Epoch 223/600
33/33 - 0s - loss: 85.8833
Epoch 224/600
33/33 - 0s - loss: 82.7517
Epoch 225/600
33/33 - 0s - loss: 83.5191
...

Epoch 402/600
33/33 - 0s - loss: 79.7365
Epoch 403/600
33/33 - 0s - loss: 78.6804
Epoch 404/600
33/33 - 0s - loss: 78.6986
Epoch 405/600
33/33 - 0s - loss: 77.4133
Epoch 406/600
33/33 - 0s - loss: 80.7930
Epoch 407/600
33/33 - 0s - loss: 76.7141
Epoch 408/600
33/33 - 0s - loss: 76.0574
Epoch 409/600
33/33 - 0s - loss: 82.4295
Epoch 410/600
33/33 - 0s - loss: 81.4196
Epoch 411/600
33/33 - 0s - loss: 78.8108
Epoch 412/600
33/33 - 0s - loss: 80.0909
Epoch 413/600
33/33 - 0s - loss: 79.4249
Epoch 414/600
33/33 - 0s - loss: 76.8454
Epoch 415/600
33/33 - 0s - loss: 76.7523
Epoch 416/600
33/33 - 0s - loss: 79.8211
Epoch 417/600
33/33 - 0s - loss: 79.8144
Epoch 418/600
33/33 - 0s - loss: 75.7859
Epoch 419/600
33/33 - 0s - loss: 75.9410
Epoch 420/600
33/33 - 0s - loss: 78.7568
Epoch 421/600
33/33 - 0s - loss: 79.3011
Epoch 422/600
33/33 - 0s - loss: 79.7741
Epoch 423/600
33/33 - 0s - loss: 79.2119
Epoch 424/600
33/33 - 0s - loss: 78.1028
Epoch 425/600
33/33 - 0s - loss: 78.8777
...
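
Two numbers in this log can be reproduced by hand. The `33/33` prefix is the number of batches per epoch, assuming the Keras default `batch_size=32` (the `fit` call does not set it), and the reported loss is the mean squared error in years squared. The sketch below uses a hypothetical `mse` helper and made-up ages, not values from the run.

```python
import math
import numpy as np

def mse(y_true, y_pred):
    # Mean of the squared differences between targets and predictions
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.mean(diff ** 2))

batch_size = 32                         # assumed Keras default for fit()
steps_per_epoch = math.ceil(1046 / batch_size)
print(steps_per_epoch)                  # 33, matching the "33/33" in the log

# Made-up true and predicted ages, only to show the loss computation
print(mse([22.0, 38.0, 26.0], [25.0, 30.0, 26.0]))

# A training loss near 78 corresponds to a typical error of roughly 8.8 years
print(round(math.sqrt(78.0), 1))        # 8.8
```

Note that this is the training loss only; no validation split is used, so it says little about how well the ages generalize.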

<tensorflow.python.keras.callbacks.History at 0x7ff9fef6a050>

In [31]:
train_data = proc_train
train_data.loc[train_data['Age'].isnull()]

PassengerId,Survived,Age,SibSp,Parch,is_test,Title_Capt,Title_Col,Title_Don,Title_Dona,Title_Dr,...,Cabin_Z,Cabin_letter_A,Cabin_letter_B,Cabin_letter_C,Cabin_letter_D,Cabin_letter_E,Cabin_letter_F,Cabin_letter_G,Cabin_letter_T,Cabin_letter_Z
6,0.0,,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
18,1.0,,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
20,1.0,,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
27,0.0,,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
29,1.0,,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
860,0.0,,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
864,0.0,,8,2,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
869,0.0,,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
879,0.0,,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1


In [32]:
train_data.loc[train_data['Age'].isnull(), 'Age'].shape

(177,)

In [33]:
to_pred = train_data.loc[train_data['Age'].isnull()].drop(
          ['Age', 'Survived', 'is_test'], axis=1)
p = tmodel.predict(to_pred.values)
train_data.loc[train_data['Age'].isnull(), 'Age'] = p

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(loc, value[:, i].tolist())
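
The warning appears because the frame being modified was produced by boolean filtering of a larger frame, so pandas cannot tell whether the assignment is meant to reach the original. One common way to avoid it, sketched here on made-up data, is to take an explicit `.copy()` of the filtered frame and to flatten the `(n, 1)` prediction array before assigning. The names `frame`, `train_part`, and `predicted` are illustrative, and `predicted` is a stand-in for the model output.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {"Age": [22.0, np.nan, 35.0, np.nan], "is_test": [0, 0, 1, 1]}
)

# Explicit copy: later assignments clearly target this frame only
train_part = frame[frame["is_test"] == 0].copy()

missing = train_part["Age"].isnull()
predicted = np.array([[30.0]])  # stand-in for model.predict output, shape (n, 1)
train_part.loc[missing, "Age"] = predicted.ravel()

print(train_part["Age"].tolist())   # [22.0, 30.0]
print(frame["Age"].isnull().sum())  # 2, the original frame is untouched
```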


In [34]:
test_data = proc_test
to_pred = test_data.loc[test_data['Age'].isnull()].drop(
          ['Age', 'Survived', 'is_test'], axis=1)
p = tmodel.predict(to_pred.values)
# test_data['Age'].loc[test_data['Age'].isnull()] = p
test_data.loc[test_data['Age'].isnull(), 'Age'] = p

In [35]:
train_data.loc[train_data['Age'].isnull()]

PassengerId,Survived,Age,SibSp,Parch,is_test,Title_Capt,Title_Col,Title_Don,Title_Dona,Title_Dr,...,Cabin_Z,Cabin_letter_A,Cabin_letter_B,Cabin_letter_C,Cabin_letter_D,Cabin_letter_E,Cabin_letter_F,Cabin_letter_G,Cabin_letter_T,Cabin_letter_Z


In [36]:
y = pd.get_dummies(train_data['Survived'])
y.head()

PassengerId,0.0,1.0
1,1,0
2,0,1
3,0,1
4,0,1
5,1,0
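
The two-column target pairs with the two-unit softmax output used in the model: column `0.0` marks passengers who died and column `1.0` marks survivors. A small sketch with made-up labels shows the encoding and how the original label is recovered with an argmax (the names `survived`, `one_hot`, and `recovered` are illustrative):

```python
import pandas as pd

survived = pd.Series([0.0, 1.0, 1.0, 0.0], name="Survived")

# One column per class; cast to int so the values print as 0/1 on any pandas version
one_hot = pd.get_dummies(survived).astype(int)
print(one_hot.columns.tolist())  # [0.0, 1.0]
print(one_hot.values.tolist())   # [[1, 0], [0, 1], [0, 1], [1, 0]]

# The position of the 1 in each row is the original class label
recovered = one_hot.values.argmax(axis=1)
print(recovered.tolist())        # [0, 1, 1, 0]
```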


In [37]:
X = train_data.drop(['Survived', 'is_test'], axis=1)

In [38]:
# create model
model = Sequential()
model.add(Dense(input_dim=X.shape[1], units=128,
                 kernel_initializer='normal', bias_initializer='zeros'))
model.add(Activation('relu'))

for i in range(0, 15):
    model.add(Dense(units=128, kernel_initializer='normal',
                     bias_initializer='zeros'))
    model.add(Activation('relu'))
    model.add(Dropout(.40))

model.add(Dense(units=2))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [39]:
model.fit(X.values, y.values, epochs=500, verbose=2)

Epoch 1/500
28/28 - 0s - loss: 0.6782 - accuracy: 0.6027
Epoch 2/500
28/28 - 0s - loss: 0.6678 - accuracy: 0.6162
Epoch 3/500
28/28 - 0s - loss: 0.6510 - accuracy: 0.6162
Epoch 4/500
28/28 - 0s - loss: 0.6150 - accuracy: 0.6162
Epoch 5/500
28/28 - 0s - loss: 0.6523 - accuracy: 0.5556
Epoch 6/500
28/28 - 0s - loss: 0.6829 - accuracy: 0.6027
Epoch 7/500
28/28 - 0s - loss: 0.6711 - accuracy: 0.6162
Epoch 8/500
28/28 - 0s - loss: 0.6659 - accuracy: 0.6162
Epoch 9/500
28/28 - 0s - loss: 0.6410 - accuracy: 0.6162
Epoch 10/500
28/28 - 0s - loss: 0.7138 - accuracy: 0.6162
Epoch 11/500
28/28 - 0s - loss: 0.6779 - accuracy: 0.6162
Epoch 12/500
28/28 - 0s - loss: 0.6708 - accuracy: 0.6162
Epoch 13/500
28/28 - 0s - loss: 0.6667 - accuracy: 0.6162
Epoch 14/500
28/28 - 0s - loss: 0.6678 - accuracy: 0.6162
Epoch 15/500
28/28 - 0s - loss: 0.6669 - accuracy: 0.6162
Epoch 16/500
28/28 - 0s - loss: 0.6676 - accuracy: 0.6162
Epoch 17/500
28/28 - 0s - loss: 0.6676 - accuracy: 0.6162
...

Epoch 142/500
28/28 - 0s - loss: 0.6768 - accuracy: 0.6162
Epoch 143/500
28/28 - 0s - loss: 0.6598 - accuracy: 0.6162
Epoch 144/500
28/28 - 0s - loss: 0.6573 - accuracy: 0.6162
Epoch 145/500
28/28 - 0s - loss: 0.6551 - accuracy: 0.6162
Epoch 146/500
28/28 - 0s - loss: 0.6732 - accuracy: 0.6162
Epoch 147/500
28/28 - 0s - loss: 0.6579 - accuracy: 0.6162
Epoch 148/500
28/28 - 0s - loss: 0.6572 - accuracy: 0.6162
Epoch 149/500
28/28 - 0s - loss: 0.6560 - accuracy: 0.6162
Epoch 150/500
28/28 - 0s - loss: 0.6553 - accuracy: 0.6162
Epoch 151/500
28/28 - 0s - loss: 0.6558 - accuracy: 0.6162
Epoch 152/500
28/28 - 0s - loss: 0.6544 - accuracy: 0.6162
Epoch 153/500
28/28 - 0s - loss: 0.6545 - accuracy: 0.6162
Epoch 154/500
28/28 - 0s - loss: 0.6524 - accuracy: 0.6162
Epoch 155/500
28/28 - 0s - loss: 0.6554 - accuracy: 0.6162
Epoch 156/500
28/28 - 0s - loss: 0.6587 - accuracy: 0.6162
Epoch 157/500
28/28 - 0s - loss: 0.6605 - accuracy: 0.6162
Epoch 158/500
28/28 - 0s - loss: 0.6596 - accuracy: 0.61

Epoch 281/500
28/28 - 0s - loss: 0.3449 - accuracy: 0.8754
Epoch 282/500
28/28 - 0s - loss: 0.3396 - accuracy: 0.8721
Epoch 283/500
28/28 - 0s - loss: 0.3208 - accuracy: 0.8833
Epoch 284/500
28/28 - 0s - loss: 0.3260 - accuracy: 0.8799
Epoch 285/500
28/28 - 0s - loss: 0.3563 - accuracy: 0.8664
Epoch 286/500
28/28 - 0s - loss: 0.3786 - accuracy: 0.8642
Epoch 287/500
28/28 - 0s - loss: 0.3388 - accuracy: 0.8721
Epoch 288/500
28/28 - 0s - loss: 0.3190 - accuracy: 0.8833
Epoch 289/500
28/28 - 0s - loss: 0.3216 - accuracy: 0.8799
Epoch 290/500
28/28 - 0s - loss: 0.3020 - accuracy: 0.8900
Epoch 291/500
28/28 - 0s - loss: 0.3196 - accuracy: 0.8810
Epoch 292/500
28/28 - 0s - loss: 0.3196 - accuracy: 0.8822
Epoch 293/500
28/28 - 0s - loss: 0.3395 - accuracy: 0.8698
Epoch 294/500
28/28 - 0s - loss: 0.2978 - accuracy: 0.8833
Epoch 295/500
28/28 - 0s - loss: 0.3308 - accuracy: 0.8788
Epoch 296/500
28/28 - 0s - loss: 0.3123 - accuracy: 0.8855
Epoch 297/500
28/28 - 0s - loss: 0.3108 - accuracy: 0.88

Epoch 420/500
28/28 - 0s - loss: 0.3078 - accuracy: 0.8855
Epoch 421/500
28/28 - 0s - loss: 0.3139 - accuracy: 0.8754
Epoch 422/500
28/28 - 0s - loss: 0.3076 - accuracy: 0.8810
Epoch 423/500
28/28 - 0s - loss: 0.3130 - accuracy: 0.8855
Epoch 424/500
28/28 - 0s - loss: 0.3221 - accuracy: 0.8765
Epoch 425/500
28/28 - 0s - loss: 0.3041 - accuracy: 0.8844
Epoch 426/500
28/28 - 0s - loss: 0.3143 - accuracy: 0.8732
Epoch 427/500
28/28 - 0s - loss: 0.3124 - accuracy: 0.8844
Epoch 428/500
28/28 - 0s - loss: 0.3062 - accuracy: 0.8855
Epoch 429/500
28/28 - 0s - loss: 0.3038 - accuracy: 0.8866
Epoch 430/500
28/28 - 0s - loss: 0.3046 - accuracy: 0.8866
Epoch 431/500
28/28 - 0s - loss: 0.3320 - accuracy: 0.8799
Epoch 432/500
28/28 - 0s - loss: 0.3058 - accuracy: 0.8900
Epoch 433/500
28/28 - 0s - loss: 0.3064 - accuracy: 0.8844
Epoch 434/500
28/28 - 0s - loss: 0.3285 - accuracy: 0.8822
Epoch 435/500
28/28 - 0s - loss: 0.3117 - accuracy: 0.8833
Epoch 436/500
28/28 - 0s - loss: 0.3125 - accuracy: 0.87

<tensorflow.python.keras.callbacks.History at 0x7ff9fd322650>

In [40]:
test_data.columns

Index(['Survived', 'Age', 'SibSp', 'Parch', 'is_test', 'Title_Capt',
       'Title_Col', 'Title_Don', 'Title_Dona', 'Title_Dr',
       ...
       'Cabin_Z', 'Cabin_letter_A', 'Cabin_letter_B', 'Cabin_letter_C',
       'Cabin_letter_D', 'Cabin_letter_E', 'Cabin_letter_F', 'Cabin_letter_G',
       'Cabin_letter_T', 'Cabin_letter_Z'],
      dtype='object', length=227)

In [41]:
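# predict_classes is not available in recent TensorFlow releases;
# taking the argmax of the softmax outputs gives the same class labels
p_survived = np.argmax(model.predict(test_data.drop(['Survived', 'is_test'], axis=1).values), axis=-1)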
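
Turning the network's softmax outputs into 0/1 labels amounts to picking the column with the larger probability. A sketch with made-up probabilities (the `probs` array is hypothetical, standing in for the output of `model.predict`):

```python
import numpy as np

# Each row: [P(did not survive), P(survived)]
probs = np.array([
    [0.80, 0.20],
    [0.35, 0.65],
    [0.10, 0.90],
])

# Index of the most probable class per row
labels = np.argmax(probs, axis=-1)
print(labels.tolist())  # [0, 1, 1]
```

The resulting integers can be written straight into the `Survived` column of the submission.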

In [42]:
submission = pd.DataFrame()
submission['PassengerId'] = test_data.index
submission['Survived'] = p_survived

In [43]:
submission.shape

(418, 2)

In [44]:
submission.to_csv('titanic_keras_cs.csv', index=False)