# Predicting Survivors of the Titanic
### A First Project on Kaggle

In this project, I analyze data on passengers of the Titanic to find relationships between survival and other variables, such as age, class, and family details. I use a labeled subset of passengers to train a machine learning model (in this case a small neural network built with TensorFlow's Keras API), which is then used to predict the survival of the remaining passengers.

We first import the necessary packages and read in the training and test data:

In [114]:
import pandas as pd
import numpy as np
import tensorflow as tf

df = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


Next, we drop the Name, PassengerId, and Ticket columns. These are unique (or nearly unique) to each passenger, so we do not use them as features here. We then compute the fraction of missing values in each remaining column:

In [115]:
df_clean = df.drop(columns=['Name', 'PassengerId', 'Ticket'])
dft_clean = df_test.drop(columns=['Name', 'PassengerId', 'Ticket'])

print('Fraction of values missing in training set')
for column in df_clean.columns:
    print(column + ': ' + str(len(df_clean[df_clean[column].isnull()])/len(df_clean)))

print('\nFraction of values missing in test set')
for column in dft_clean.columns:
    print(column + ': ' + str(len(dft_clean[dft_clean[column].isnull()])/len(dft_clean)))

Fraction of values missing in training set
Survived: 0.0
Pclass: 0.0
Sex: 0.0
Age: 0.19865319865319866
SibSp: 0.0
Parch: 0.0
Fare: 0.0
Cabin: 0.7710437710437711
Embarked: 0.002244668911335578

Fraction of values missing in test set
Pclass: 0.0
Sex: 0.0
Age: 0.20574162679425836
SibSp: 0.0
Parch: 0.0
Fare: 0.0023923444976076554
Cabin: 0.7822966507177034
Embarked: 0.0
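
The same per-column check can be written more compactly: `isnull()` yields a boolean frame, and the mean of a boolean column is the fraction of missing entries. A minimal sketch, using a small made-up frame (`toy`) rather than the Kaggle data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_clean; values are illustrative only.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Fare': [7.25, 71.28, 7.93, 53.10],
})

# The mean of a boolean column is the fraction of True values,
# i.e. the fraction of missing entries in that column.
missing_fraction = toy.isnull().mean()
print(missing_fraction)
```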


Most of the Cabin data is missing (about 77% in the training set), so we drop the Cabin variable entirely. Only a small number of Embarked entries are missing, so we simply drop those passengers from the training set. (There are no missing Embarked entries in the test set.) For Age, we replace missing values with the mean age of the respective set, for both the training and test sets (a procedure which could be improved upon in future iterations). For the small number of missing Fare values in the test set, we replace them with the mean test-set Fare.

In [116]:
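# Drop Cabin (mostly missing) and the few training rows without Embarked
df_clean = df_clean.drop(columns=['Cabin'])
df_clean = df_clean.dropna(subset=['Embarked'])
# Assign the result back: calling fillna(inplace=True) on a selected column
# may act on a temporary copy in recent pandas versions
df_clean['Age'] = df_clean['Age'].fillna(df_clean['Age'].mean())

dft_clean = dft_clean.drop(columns=['Cabin'])
dft_clean['Fare'] = dft_clean['Fare'].fillna(dft_clean['Fare'].mean())
dft_clean['Age'] = dft_clean['Age'].fillna(dft_clean['Age'].mean())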
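
The cell above fills test-set gaps with test-set means. A common alternative, sketched here under the assumption that the test set should be treated as unseen data, is to compute the fill values on the training set only and reuse them for both sets. The frames `train_toy` and `test_toy` and the helper `fill_from_train` are hypothetical names, not part of the original notebook:

```python
import numpy as np
import pandas as pd

def fill_from_train(train, test, columns):
    """Fill missing values in both frames using training-set means."""
    means = train[columns].mean()  # NaN entries are skipped by default
    # fillna with a Series fills each column whose name matches an index label
    return train.fillna(means), test.fillna(means)

# Made-up values for illustration only
train_toy = pd.DataFrame({'Age': [20.0, 40.0, np.nan], 'Fare': [10.0, 20.0, 30.0]})
test_toy = pd.DataFrame({'Age': [np.nan, 30.0], 'Fare': [np.nan, 5.0]})

train_filled, test_filled = fill_from_train(train_toy, test_toy, ['Age', 'Fare'])
print(test_filled)
```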

We rescale the Age and Fare variables to lie between 0 and 1 by dividing each by its maximum value within the respective dataset. We encode the Sex variable as 0 (male) or 1 (female). Similarly, in the Embarked column, we represent the ports (S, C, Q) as (0, 1, 2).

In [117]:
# Scale Age and Fare to [0, 1] by dividing by the column maximum
df_clean['Age'] = df_clean['Age'] / df_clean['Age'].max()
df_clean['Fare'] = df_clean['Fare'] / df_clean['Fare'].max()
# Encode the categorical columns explicitly, one column at a time
df_clean['Sex'] = df_clean['Sex'].map({'male': 0, 'female': 1})
df_clean['Embarked'] = df_clean['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
df_clean.reset_index(drop=True, inplace=True)

dft_clean['Age'] = dft_clean['Age'] / dft_clean['Age'].max()
dft_clean['Fare'] = dft_clean['Fare'] / dft_clean['Fare'].max()
dft_clean['Sex'] = dft_clean['Sex'].map({'male': 0, 'female': 1})
dft_clean['Embarked'] = dft_clean['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
dft_clean.reset_index(drop=True, inplace=True)

df_clean.head()

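| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 0.275 | 1 | 0 | 0.014151 | 0 |
| 1 | 1 | 1 | 1 | 0.475 | 1 | 0 | 0.139136 | 1 |
| 2 | 1 | 3 | 1 | 0.325 | 0 | 0 | 0.015469 | 0 |
| 3 | 1 | 1 | 1 | 0.4375 | 1 | 0 | 0.103644 | 0 |
| 4 | 0 | 3 | 0 | 0.4375 | 0 | 0 | 0.015713 | 0 |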
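
Two aspects of the encoding above can be varied. Dividing each set by its own maximum means the same raw Age maps to different scaled values in the training and test sets whenever the maxima differ, and coding Embarked as 0, 1, 2 imposes an ordering on the ports. The sketch below, which is an alternative and not what this notebook does, reuses the training-set maximum for both sets and one-hot encodes Embarked with `pd.get_dummies`. The toy frames are made up for illustration:

```python
import pandas as pd

# Made-up values for illustration only
train_toy = pd.DataFrame({'Age': [20.0, 80.0, 40.0], 'Embarked': ['S', 'C', 'Q']})
test_toy = pd.DataFrame({'Age': [76.0, 38.0], 'Embarked': ['S', 'S']})

# Scale both sets by the training-set maximum so equal ages map to equal values.
age_max = train_toy['Age'].max()
train_toy['Age'] = train_toy['Age'] / age_max
test_toy['Age'] = test_toy['Age'] / age_max

# One-hot encode the port; fixing the categories keeps the columns identical
# in both frames even when a port is absent from one of them.
ports = pd.CategoricalDtype(categories=['S', 'C', 'Q'])
train_enc = pd.get_dummies(train_toy.astype({'Embarked': ports}), columns=['Embarked'])
test_enc = pd.get_dummies(test_toy.astype({'Embarked': ports}), columns=['Embarked'])
print(test_enc)
```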


To prepare for applying our machine learning algorithm, we use the Survived data as our dependent variable y, while all other columns are grouped into the independent variable x. We also convert the data into NumPy arrays, which can be fed directly into a Keras neural network.

In [118]:
y = np.asarray(df_clean['Survived'])
x = np.asarray(df_clean[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']])
x_test = np.asarray(dft_clean)
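
Since the Kaggle test set carries no labels, one way to estimate accuracy before submitting is to hold back part of the training data. A minimal NumPy sketch, with a hypothetical helper `holdout_split` and random stand-in arrays in place of `x` and `y`:

```python
import numpy as np

def holdout_split(x, y, val_fraction=0.2, seed=0):
    """Shuffle the rows and split them into training and validation parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n_val = int(len(x) * val_fraction)
    val_idx, train_idx = order[:n_val], order[n_val:]
    return x[train_idx], y[train_idx], x[val_idx], y[val_idx]

# Random stand-in data with the same shape as the arrays above (889 rows, 7 features)
x_demo = np.random.default_rng(1).random((889, 7))
y_demo = np.random.default_rng(2).integers(0, 2, size=889)

x_tr, y_tr, x_val, y_val = holdout_split(x_demo, y_demo)
print(x_tr.shape, x_val.shape)
```

The held-out arrays could then be passed to Keras through the `validation_data` argument of `model.fit`, so that validation accuracy is reported after each epoch.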

We create a three-layer neural network, ending in a single neuron with sigmoid activation, to serve as a binary classifier indicating survival or death of passengers. The first layer contains as many neurons as there are independent variables (7), while the hidden layer in the middle contains three times as many neurons as the first layer (21); both use the SELU activation. The network is trained with stochastic gradient descent on a binary cross-entropy loss, over 15 epochs.

In [119]:
model = tf.keras.Sequential([tf.keras.layers.Dense(7, activation = tf.nn.selu),
                             tf.keras.layers.Dense(21, activation = tf.nn.selu),
                             tf.keras.layers.Dense(1, activation = 'sigmoid')])
                            
model.compile(optimizer = 'sgd',
              loss = 'binary_crossentropy',
              metrics=['accuracy'])

model.fit(x, y, epochs = 15)

Train on 889 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x22e72336b88>
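
To make the architecture concrete, the sketch below reproduces the layer sizes (7, 21, 1) in plain NumPy with random weights. It illustrates the forward pass and the parameter count only; it is not the trained Keras model, and the helper names are hypothetical:

```python
import numpy as np

# Standard SELU constants
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    # expm1 is applied to min(z, 0) only, which avoids overflow for large z
    return SELU_SCALE * np.where(z > 0, z, SELU_ALPHA * np.expm1(np.minimum(z, 0)))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
layer_sizes = [7, 7, 21, 1]  # input features, then the three Dense layers
weights = [rng.normal(size=(m, n)) * 0.1
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def forward(x):
    h = selu(x @ weights[0] + biases[0])
    h = selu(h @ weights[1] + biases[1])
    return sigmoid(h @ weights[2] + biases[2])  # one probability per row

# Each Dense layer has (inputs x units) weights plus one bias per unit
n_params = sum(w.size + b.size for w, b in zip(weights, biases))
probs = forward(rng.random((5, 7)))
print(n_params, probs.shape)
```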

In [120]:
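# Predict survival probabilities for the test passengers (one sigmoid output per row)
predictions = model.predict(x_test)
# Attach the probabilities to the original PassengerId column by row index
output = df_test[['PassengerId']].merge(pd.DataFrame(predictions), left_index=True, right_index=True)
output.rename(columns = {0:'Survived'}, inplace=True)
# Round each probability to the nearest class label (0 = died, 1 = survived)
output['Survived'] = output['Survived'].round().astype(int)
output.head()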

| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 0 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
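
The rounding step amounts to thresholding the sigmoid output at 0.5. Writing the threshold explicitly makes that choice visible and easy to change; the probabilities below are made up for illustration:

```python
import numpy as np

# Made-up sigmoid outputs with shape (n, 1), the shape model.predict returns here
probs = np.array([[0.12], [0.48], [0.50], [0.91]])

threshold = 0.5
# Flatten to one value per passenger, then compare against the threshold
labels = (probs.ravel() >= threshold).astype(int)
print(labels.tolist())
```

One small difference from rounding: NumPy and pandas round exactly 0.5 down to 0 (round half to even), whereas the `>=` comparison assigns it to class 1.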


In [106]:
output.to_csv('my_submission.csv', index=False)