Imports!

In [66]:
import numpy as np
import pandas as pd
import tensorflow as tf
from keras.models import Sequential
from keras.optimizers import SGD, RMSprop
from keras.layers import Dense, Activation

Get training and test data.

In [67]:
train = pd.read_csv('data/train.csv', index_col=0)
test = pd.read_csv('data/test.csv', index_col=0)

Define a function to strip away useless columns in our dataset.

In [68]:
def drop_useless_cols(df):
    # Keep only the columns we model on; anything else is dropped in one call.
    useful = ['Survived', 'Pclass', 'Sex', 'Age', 'Fare']
    return df.drop([col for col in df.columns if col not in useful], axis=1)
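
The same column filtering can also be written as a boolean mask over the column labels. This is only a sketch on a made-up toy frame (the `toy` data and the `kept` name are assumptions, not part of the notebook), but it shows that names listed in `useful` and absent from the frame are simply ignored:

```python
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data.
toy = pd.DataFrame({
    'Survived': [0, 1],
    'Name': ['A', 'B'],
    'Pclass': [3, 1],
    'Cabin': [None, 'C85'],
    'Fare': [7.25, 71.28],
})
useful = ['Survived', 'Pclass', 'Sex', 'Age', 'Fare']

# Boolean mask over the column labels; the original column order is preserved.
kept = toy.loc[:, toy.columns.isin(useful)]
print(list(kept.columns))
```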

What good's a function if you never use it?

In [69]:
train = drop_useless_cols(train)
test = drop_useless_cols(test)

Always good to scope out null values so they don't blindside you down the road.

In [70]:
print(train.isnull().sum())

Survived      0
Pclass        0
Sex           0
Age         177
Fare          0
dtype: int64


Let's explore the data with the aim of writing a function that cleans up those null Age values in a reasonable way.

In [71]:
s = train.groupby(['Pclass', 'Sex'])[['Age']].apply(np.nanmedian)
print(s)

Pclass  Sex   
1       female    35.0
        male      40.0
2       female    28.0
        male      30.0
3       female    21.5
        male      25.0
dtype: float64


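I might just be navigating pandas Series all wrong, but if you group by a column with integer values first and then index the resulting Series with bracket notation, you get funny results. The Series has a MultiIndex, so an integer that matches a Pclass label (1, 2 or 3) is treated as a label and returns that whole class's sub-Series, while 0, 4 and 5 match no label and fall back to positional access: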

In [72]:
print("s[0], Expected (Actual) | 35.0 ({0})".format(s[0]))
print("s[1], Expected (Actual) | 40.0 ({0})".format(s[1]))
print("s[2], Expected (Actual) | 28.0 ({0})".format(s[2]))
print("s[3], Expected (Actual) | 30.0 ({0})".format(s[3]))
print("s[4], Expected (Actual) | 21.5 ({0})".format(s[4]))
print("s[5], Expected (Actual) | 25.0 ({0})".format(s[5]))

s[0], Expected (Actual) | 35.0 (35.0)
s[1], Expected (Actual) | 40.0 (Sex
female    35.0
male      40.0
dtype: float64)
s[2], Expected (Actual) | 28.0 (Sex
female    28.0
male      30.0
dtype: float64)
s[3], Expected (Actual) | 30.0 (Sex
female    21.5
male      25.0
dtype: float64)
s[4], Expected (Actual) | 21.5 (21.5)
s[5], Expected (Actual) | 25.0 (25.0)
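
One way to sidestep the ambiguity is to say explicitly whether a key is a label or a position. The sketch below rebuilds the medians by hand from the values printed earlier (the `medians` Series is a stand-in, not the notebook's `s`) and uses `.loc` for labels and `.iloc` for positions:

```python
import pandas as pd

# Hypothetical stand-in for the grouped medians, using the values printed above.
idx = pd.MultiIndex.from_product([[1, 2, 3], ['female', 'male']],
                                 names=['Pclass', 'Sex'])
medians = pd.Series([35.0, 40.0, 28.0, 30.0, 21.5, 25.0], index=idx)

# Label-based: a full (Pclass, Sex) tuple returns a single value.
first_class_men = medians.loc[(1, 'male')]

# Label-based on the outer level only: returns the sub-Series for that class.
third_class = medians.loc[3]

# Position-based: always the n-th row, regardless of labels.
second_row = medians.iloc[1]

print(first_class_men, second_row)
print(third_class.to_dict())
```

With `.loc` and `.iloc` the order of the grouping columns no longer matters for correctness.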


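Anyway, switching those columns around works for now: with Sex as the first level, no integer can be mistaken for a label, so bracket access is purely positional.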

In [73]:
s = train.groupby(['Sex', 'Pclass'])[['Age']].apply(np.nanmedian)
print(s)

Sex     Pclass
female  1         35.0
        2         28.0
        3         21.5
male    1         40.0
        2         30.0
        3         25.0
dtype: float64


Just to make sure...

In [74]:
print("s[0], Expected (Actual) | 35.0 ({0})".format(s[0]))
print("s[1], Expected (Actual) | 28.0 ({0})".format(s[1]))
print("s[2], Expected (Actual) | 21.5 ({0})".format(s[2]))
print("s[3], Expected (Actual) | 40.0 ({0})".format(s[3]))
print("s[4], Expected (Actual) | 30.0 ({0})".format(s[4]))
print("s[5], Expected (Actual) | 25.0 ({0})".format(s[5]))

s[0], Expected (Actual) | 35.0 (35.0)
s[1], Expected (Actual) | 28.0 (28.0)
s[2], Expected (Actual) | 21.5 (21.5)
s[3], Expected (Actual) | 40.0 (40.0)
s[4], Expected (Actual) | 30.0 (30.0)
s[5], Expected (Actual) | 25.0 (25.0)


Awesome, everything's working properly. Let's use those values to denullify the Age column.

In [75]:
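# Look the medians up by (Sex, Pclass) label rather than by position.
med_age_upperclass_woman = s.loc[('female', 1)]
med_age_middleclass_woman = s.loc[('female', 2)]
med_age_lowerclass_woman = s.loc[('female', 3)]
med_age_upperclass_man = s.loc[('male', 1)]
med_age_middleclass_man = s.loc[('male', 2)]
med_age_lowerclass_man = s.loc[('male', 3)]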

def denullify_age_col(df):
    # Fill each (Sex, Pclass) group with its median age from the training data.
    # Caution: fillna() on the whole sub-frame fills every NaN in the group,
    # not only the ones in the Age column.
    df_ucw = df[(df['Sex'] == 'female') & (df['Pclass'] == 1)].fillna(med_age_upperclass_woman)
    df_mcw = df[(df['Sex'] == 'female') & (df['Pclass'] == 2)].fillna(med_age_middleclass_woman)
    df_lcw = df[(df['Sex'] == 'female') & (df['Pclass'] == 3)].fillna(med_age_lowerclass_woman)
    df_ucm = df[(df['Sex'] == 'male') & (df['Pclass'] == 1)].fillna(med_age_upperclass_man)
    df_mcm = df[(df['Sex'] == 'male') & (df['Pclass'] == 2)].fillna(med_age_middleclass_man)
    df_lcm = df[(df['Sex'] == 'male') & (df['Pclass'] == 3)].fillna(med_age_lowerclass_man)
    # Concatenating the groups reorders the rows: women by class, then men by class.
    df = pd.concat([df_ucw, df_mcw, df_lcw, df_ucm, df_mcm, df_lcm], axis=0)
    return df

train = denullify_age_col(train)
test = denullify_age_col(test)
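
For comparison, here is a sketch of a different way to do group-wise imputation, using `groupby(...).transform('median')`. It is not what the cell above does: it touches only the Age column and keeps the original row order. The toy frame is made up, and unlike the notebook (which applies the training medians to the test set) the medians here come from the same frame being filled:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real data has many more rows and columns.
toy = pd.DataFrame({
    'Sex':    ['female', 'female', 'female', 'male', 'male', 'male'],
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Age':    [30.0, 40.0, np.nan, 20.0, np.nan, 30.0],
    'Fare':   [80.0, np.nan, 90.0, 7.0, 8.0, 9.0],
})

# Median age of each (Sex, Pclass) group, broadcast back to every row.
group_median = toy.groupby(['Sex', 'Pclass'])['Age'].transform('median')

# Fill only the Age column; row order and the other columns are untouched.
filled = toy.copy()
filled['Age'] = filled['Age'].fillna(group_median)
print(filled)
```

Note that the missing Fare in this toy frame stays missing, so any other null columns would still need their own treatment.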

We need get_dummies here because Sex is a string column and the models below expect numbers. It replaces Sex with Sex_female and Sex_male indicator columns and leaves the numeric columns alone.

In [76]:
train = pd.get_dummies(train)
test = pd.get_dummies(test)

Now all we've got between us and building our models is the fact that Pclass is currently a quantitative column when it should be categorical.

In [77]:
def categorize_pclass(df):
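    # prefix='Pclass' yields Pclass_1, Pclass_2, Pclass_3 (a trailing underscore in the prefix would be doubled).
    df = pd.concat([df, pd.get_dummies(df['Pclass'], prefix='Pclass')], axis=1).drop('Pclass', axis=1)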
    return df

train = categorize_pclass(train)
test = categorize_pclass(test)
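
As a side note, `pd.get_dummies` accepts a `columns` argument that forces one-hot encoding of the listed columns, including numeric ones, so both encoding steps could in principle be done in a single call. A minimal sketch on a made-up frame (the `toy` and `encoded` names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one string categorical column.
toy = pd.DataFrame({'Pclass': [1, 3, 2], 'Sex': ['male', 'female', 'male']})

# `columns` forces one-hot encoding of the listed columns, even numeric ones.
encoded = pd.get_dummies(toy, columns=['Pclass', 'Sex'])
print(list(encoded.columns))
```

Encoding train and test separately, as done above, only lines up when both frames contain the same set of categories.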

Final data input preparations.

In [78]:
y = pd.get_dummies(train['Survived'])
y.head()

             0  1
PassengerId      
2            0  1
4            0  1
12           0  1
32           0  1
53           0  1


In [79]:
X = train.drop('Survived', axis=1)

Let's test some conventional machine learning models on this data and see what kind of training accuracy we can get (both models are scored on the same rows they were fit on).

In [80]:
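# X has 7 columns; the 0:6 slice keeps the first six and drops the last
# Pclass dummy, which is implied by the other two.
X_vals = X.values[:,0:6]
# Column 1 of y is the "survived" indicator, which gives a 1-D label vector.
y_vals = y.values[:,1]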

from sklearn.metrics import accuracy_score

# Naive Bayes
from sklearn.naive_bayes import GaussianNB
clf_nb = GaussianNB()
clf_nb.fit(X_vals, y_vals)
pred_nb = clf_nb.predict(X_vals)
acc_nb = accuracy_score(y_vals, pred_nb)
print("Naive Bayes training accuracy: {0}%".format(acc_nb*100.))

# Support Vector Machines
from sklearn.svm import SVC
clf_svm = SVC()
clf_svm.fit(X_vals, y_vals)
pred_svm = clf_svm.predict(X_vals)
acc_svm = accuracy_score(y_vals, pred_svm)
print("SVM training accuracy: {0}%".format(acc_svm*100.))

# AdaBoosted Decision Trees: not implemented in this notebook

Naive Bayes training accuracy: 78.33894500561168%
SVM training accuracy: 89.67452300785635%
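
These numbers are measured on the training rows, so they say little about how the models do on unseen passengers. A rough sketch of how held-out accuracy could be estimated with scikit-learn's `cross_val_score` follows. It runs on synthetic stand-in data (`X_demo` and `y_demo` are made up), so its scores are not comparable with the figures above:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (hypothetical): 200 rows, 6 features, binary labels.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 6))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# 5-fold cross-validation: each fold is scored on rows the model never saw.
scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=5, scoring='accuracy')
print("Held-out accuracy per fold:", scores)
print("Mean held-out accuracy: {0}%".format(scores.mean() * 100.))
```

Swapping in `X_vals` and `y_vals` would give a cross-validated counterpart to the training accuracies printed above.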


Time for the neural net models.

In [81]:
model = Sequential()
# Only the first layer needs input_dim; Keras infers the input size of later layers.
model.add(Dense(units=200, input_dim=X.shape[1]))
model.add(Activation('relu'))
model.add(Dense(units=200))
model.add(Activation('relu'))
model.add(Dense(units=24))
model.add(Activation('relu'))
# Two softmax outputs, one per column of the one-hot encoded y.
model.add(Dense(units=2))
model.add(Activation('softmax'))

optim = SGD(lr=0.001)

model.compile(loss='categorical_crossentropy', optimizer=optim, metrics=['accuracy'])

model.fit(X.values, y.values, epochs=500, verbose=0)

<keras.callbacks.History at 0x116598ef0>

In [82]:
print("Keras neural net accuracy: {0}%".format(model.evaluate(X.values, y.values)[1]*100.))

Keras neural net accuracy: 77.21661057001279%


In [83]:
p_sub = model.predict_classes(test.values)

predictions = pd.DataFrame()
predictions['PassengerId'] = test.index
predictions['Survived'] = p_sub

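
A note for anyone rerunning this: `predict_classes` has been removed from newer Keras/TensorFlow releases. The usual replacement is to call `model.predict` and take the argmax over the class axis. The sketch below shows just that argmax step on made-up probabilities (the `probs` array is hypothetical, not model output):

```python
import numpy as np

# Hypothetical softmax outputs for three passengers: columns are
# [P(died), P(survived)], matching the column order of y above.
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4]])

# The predicted class is the index of the largest probability in each row.
classes = np.argmax(probs, axis=1)
print(classes)
```

In the notebook this would correspond to something like `np.argmax(model.predict(test.values), axis=1)`.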

Make sure our predictions dataframe has the right dimensions.

In [84]:
predictions.shape

(418, 2)

Save predictions to a local csv file.

In [85]:
predictions.to_csv('predictions.csv', index=False)