# Spaceship Titanic - Challenge

This is an evolution of the Titanic entry challenge on Kaggle. I want to train a deep neural network (DNN) model and tune it with the Keras Tuner. The notebook closely follows what I did for the Titanic challenge, except that it makes use of the tuner later on.

In [2]:
import pandas as pd
import numpy as np
import tensorflow as tf
import seaborn as sns
import matplotlib.pyplot as plt
from tensorflow import keras
from tensorflow.keras import layers
import keras_tuner as kt

In [3]:
train_data_raw = pd.read_csv("../input/spaceship-titanic/train.csv")
test_data_raw = pd.read_csv("../input/spaceship-titanic/test.csv")

column_names = train_data_raw.columns
print(column_names)

### Missing data

In [4]:
missing = train_data_raw.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()
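
The bar plot shows absolute counts. As a complement, here is a minimal sketch that expresses the gaps as a share of rows instead. It runs on a small made-up frame (`toy` is a stand-in, not the competition data), so the numbers are purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data_raw; the values are made up.
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0003_01", "0004_01"],
    "Age": [39.0, np.nan, 58.0, 33.0],
    "VIP": [False, False, np.nan, np.nan],
})

# Fraction of missing values per column, keeping only columns with gaps.
missing_share = toy.isnull().mean()
missing_share = missing_share[missing_share > 0].sort_values()
print(missing_share)
```

Taking the mean of the boolean mask gives the fraction directly, which is easier to compare across datasets of different sizes.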

There is missing data in every explanatory variable except the passenger ID. Let's do a small analysis of the response variable, `Transported`. To start, this variable does not contain missing data.

In [5]:
train_data_raw['Transported'].describe()

So, about half the passengers in the training dataset were transported to another dimension. Let's start by dropping a few useless variables, namely Passenger ID and Name.

In [6]:
train_data = train_data_raw.drop(['PassengerId', 'Name'], axis=1)

We also need to encode some data. Let's look at the distributions of the luxury amenities:

In [7]:
luxury = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
f = pd.melt(train_data, value_vars=luxury)
g = sns.FacetGrid(f, col='variable', col_wrap=5, sharex=False, sharey=False)
# sns.distplot is deprecated in recent seaborn releases; histplot is its replacement
g = g.map(sns.histplot, "value")

From this, it seems that most people made very little use of these amenities. But is this relevant?

- The numerical variables are: Age, RoomService, FoodCourt, ShoppingMall, Spa, and VRDeck.
- The boolean variables are: CryoSleep, VIP, and Transported (the response).
- The categorical variables are: HomePlanet, Cabin, and Destination.

First, we need to split Cabin into three variables, as it's a composite variable.

### The Cabin variable

We need to encode the Cabin variable in such a way that each of its three components becomes a separate feature. First, then, let's split the variable in three.

In [8]:
train_data['Cabin'].head()

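Each entry has the form `deck/num/side`: a deck letter, a cabin number, and a side (`P` for port or `S` for starboard). Picking characters by position (`entry[0]`, `entry[2]`, `entry[4]`) only works for single-digit cabin numbers, so we split on `/` instead. Note that the column names used in the code are slightly misleading: `'Port'` holds the cabin number and `'Starboard'` holds the side.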
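
A minimal sketch on made-up cabin strings shows why splitting on `/` is safer than picking characters by position. It assumes the `deck/num/side` layout; the column names `deck`, `num` and `side` are illustrative and not the ones used in this notebook.

```python
import pandas as pd

# Made-up cabin strings in the deck/num/side layout, plus one missing value.
cabins = pd.Series(["B/0/P", "F/123/S", None])

# Split on "/" into three columns; missing cabins stay missing.
parts = cabins.str.split("/", expand=True)
parts.columns = ["deck", "num", "side"]  # illustrative names
print(parts)

# Character positions break once the number has more than one digit:
# this picks two digits of the number, not the number and the side.
print(cabins[1][2], cabins[1][4])
```

A missing cabin produces a row of missing values in all three new columns, so it still has to be handled when filling NAs.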

In [9]:
train_data[['Deck', 'Port', 'Starboard']] = train_data['Cabin'].str.split("/", expand=True)
train_data = train_data.drop(['Cabin'], axis=1)

In [10]:
test_data_raw[['Deck', 'Port', 'Starboard']] = test_data_raw['Cabin'].str.split("/", expand=True)
test_data = test_data_raw.drop(['Cabin'], axis=1)
test_data = test_data.drop(['PassengerId', 'Name'], axis=1)

Next, we encode the categorical variables as numbers. I like one-hot encoding, so that's what I'm going to do:

In [11]:
train_data = pd.get_dummies(train_data, columns=['Destination', 'HomePlanet', 'Deck', 'Port', 'Starboard'])
test_data = pd.get_dummies(test_data, columns=['Destination', 'HomePlanet', 'Deck', 'Port', 'Starboard'])

We drop all columns that are not found in `test_data`, to ensure that prediction can be performed at the end.

In [12]:
# axis must be passed by keyword: pandas 2.x no longer accepts it positionally
train_data = train_data.drop(train_data.columns[~train_data.columns.isin(test_data.columns)], axis=1)

In [13]:
print(train_data.columns)
print(test_data.columns)

This leads to two dataframes with different numbers of columns! This means that the test set has categories that the training set does not have. Where exactly?

In [14]:
test_data.columns[~test_data.columns.isin(train_data.columns)]

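These are all `'Port'` features. The second component of `Cabin` is a cabin number, so one-hot encoding it creates one column per distinct value, and many of those values appear in only one of the two datasets. The feature is therefore unlikely to be useful as it stands. I think we should drop the `'Port'` feature completely. Let's redefine the whole dataset: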

In [15]:
train_data = train_data_raw.drop(['PassengerId', 'Name'], axis=1)
test_data = test_data_raw.drop(['PassengerId', 'Name'], axis=1)

train_data[['Deck', 'Port', 'Starboard']] = train_data['Cabin'].str.split("/", expand=True)
train_data = train_data.drop(['Cabin', 'Port'], axis=1)
test_data[['Deck', 'Port', 'Starboard']] = test_data['Cabin'].str.split("/", expand=True)
test_data = test_data.drop(['Cabin', 'Port'], axis=1)

train_data = pd.get_dummies(train_data, columns=['Destination', 'HomePlanet', 'Deck', 'Starboard'])
test_data = pd.get_dummies(test_data, columns=['Destination', 'HomePlanet', 'Deck', 'Starboard'])

And, to be sure, we again drop all training columns that aren't found in `test_data`.

In [16]:
# axis must be passed by keyword: pandas 2.x no longer accepts it positionally
train_data = train_data.drop(train_data.columns[~train_data.columns.isin(test_data.columns)], axis=1)

In [17]:
train_data.shape[1] == test_data.shape[1]
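
Dropping columns by hand only removes extras from one side. As an alternative sketch (not what this notebook does), `DataFrame.align` with `join="inner"` keeps only the columns both frames share, in the same order. The toy frames and their values below are made up for illustration.

```python
import pandas as pd

# Toy data: each frame has one category the other lacks (values are made up).
toy_train = pd.DataFrame({"Age": [30.0, 41.0], "Deck": ["A", "B"]})
toy_test = pd.DataFrame({"Age": [25.0, 37.0], "Deck": ["A", "T"]})

toy_train = pd.get_dummies(toy_train, columns=["Deck"])
toy_test = pd.get_dummies(toy_test, columns=["Deck"])

# Keep only the columns present in both frames, in a common order.
toy_train, toy_test = toy_train.align(toy_test, join="inner", axis=1)
print(list(toy_train.columns))
print(list(toy_test.columns))
```

This handles both directions at once, so a category that exists only in the test set is removed as well.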

### a) fill NAs

`'CryoSleep'` and `'VIP'` are booleans, so we fill them with 0s. Because the `luxury` variables are mostly 0, we do the same for them. The categorical variables are already one-hot encoded, so a missing category is simply a row of 0s. Only `'Age'` is filled with its mean.

In [18]:
for column in train_data.columns:
    if column == 'Age':
        train_data[column] = train_data[column].fillna(np.nanmean(train_data['Age']))
    else: 
        train_data[column] = train_data[column].fillna(0)

for column in test_data.columns:
    if column == 'Age':
        test_data[column] = test_data[column].fillna(np.nanmean(test_data['Age']))
    else: 
        test_data[column] = test_data[column].fillna(0)
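
The same filling rule can be written without the loops. The sketch below is a variation, not the cell above: it reuses the training mean of `Age` for both frames, so the test set is filled with a value computed only from training data. The toy frames and their values are made up.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for train_data and test_data (values are made up).
toy_train = pd.DataFrame({"Age": [20.0, np.nan, 40.0], "Spa": [0.0, 5.0, np.nan]})
toy_test = pd.DataFrame({"Age": [np.nan, 50.0], "Spa": [np.nan, 1.0]})

# Mean of the observed training ages; NaNs are skipped by default.
age_mean = toy_train["Age"].mean()

# Fill Age with the training mean first, then everything else with 0.
toy_train = toy_train.fillna({"Age": age_mean}).fillna(0)
toy_test = toy_test.fillna({"Age": age_mean}).fillna(0)
print(toy_train)
print(toy_test)
```

Passing a dict to `fillna` sets a per-column fill value, and the second `fillna(0)` catches whatever is left.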


### b) turn booleans into 0 and 1

In [19]:
train_data[['CryoSleep', 'VIP']]

In [20]:
train_data['CryoSleep'] = train_data['CryoSleep'].astype(int)
train_data['VIP'] = train_data['VIP'].astype(int)

test_data['CryoSleep'] = test_data['CryoSleep'].astype(int)
test_data['VIP'] = test_data['VIP'].astype(int)

### c) binarize `luxury` variables

In [21]:
for column in luxury:
    train_data[column] = train_data[column].map(lambda x: 0 if x == 0 else 1)
    test_data[column] = test_data[column].map(lambda x: 0 if x == 0 else 1)

In [22]:
train_data[luxury]
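
The row-by-row `map` can also be written as a single vectorized comparison. This is a minimal sketch on a made-up frame with a shortened `luxury` list; it applies the same rule (0 stays 0, anything else becomes 1).

```python
import pandas as pd

luxury = ["RoomService", "Spa"]  # shortened list for the toy example
toy = pd.DataFrame({"RoomService": [0.0, 109.0, 0.0], "Spa": [549.0, 0.0, 0.0]})

# 1 if the passenger spent anything on the amenity, 0 otherwise.
toy[luxury] = (toy[luxury] != 0).astype(int)
print(toy)
```

Both versions turn the spending amounts into a simple "used it or not" flag, which throws away how much was spent.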

## Build and tune the model

In [23]:
# rename for modelling; cast to float so that bool/int/float columns form one numeric array
train_features = train_data.astype("float32")
train_labels = train_data_raw['Transported'].astype(int)

We apply a normalization layer:

In [24]:
normalizer = layers.Normalization(input_shape=[train_features.shape[1],], axis=-1)
normalizer.adapt(np.array(train_features))
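
For intuition, the normalization layer standardizes each feature with the mean and variance it estimated during `adapt`. The NumPy sketch below reproduces that arithmetic on a made-up matrix; it leaves out the small epsilon guard the layer uses against zero variance.

```python
import numpy as np

# Toy feature matrix: rows are passengers, columns are features (made-up values).
x = np.array([[20.0, 0.0],
              [30.0, 1.0],
              [40.0, 1.0]])

# Per-feature mean and variance, the statistics estimated from the data.
mean = x.mean(axis=0)
var = x.var(axis=0)

# Standardize each column to zero mean and unit variance.
x_norm = (x - mean) / np.sqrt(var)
print(x_norm.mean(axis=0), x_norm.std(axis=0))
```

Because the statistics are stored inside the layer, the same shift and scale are applied to the test data at prediction time.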

In [50]:
def model_builder(hp):
    model = keras.Sequential()
    model.add(normalizer)
    
    hp_units_1 = hp.Int('units_1', min_value=32, max_value=512, step=32)
    hp_dropout = hp.Float('dropout', min_value=0.1, max_value=0.9, step=0.1)
    model.add(keras.layers.Dense(units=hp_units_1, activation='relu'))
    model.add(keras.layers.Dropout(hp_dropout))
    model.add(keras.layers.Dense(1))
    
    hp_learning_rate = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])

    model.compile(optimizer=keras.optimizers.Adam(learning_rate=hp_learning_rate),
                  loss=keras.losses.BinaryCrossentropy(from_logits=True),
                  metrics=[keras.metrics.BinaryAccuracy(threshold=0.0)])

    return model
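
The last `Dense(1)` layer has no activation, so the model outputs logits rather than probabilities. That is why the loss uses `from_logits=True` and the accuracy metric uses `threshold=0.0`. A small pure-Python sketch (with made-up logit values) shows that thresholding a logit at 0 matches thresholding its sigmoid at 0.5.

```python
import math

def sigmoid(z):
    """Map a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Made-up raw outputs of a final Dense(1) layer without activation.
logits = [-2.0, -0.1, 0.0, 0.3, 4.0]

by_logit = [z > 0.0 for z in logits]
by_probability = [sigmoid(z) > 0.5 for z in logits]

# Both rules give the same class for every value.
print(by_logit == by_probability)
```

The practical consequence is that raw outputs from `predict` should be compared with 0, or passed through a sigmoid first and then compared with 0.5.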


Let's start the tuning.

In [51]:
tuner = kt.Hyperband(model_builder,
                     objective='val_binary_accuracy',
                     max_epochs=20,
                     factor=3,
                     directory='tuner',
                     project_name='space-titanic',
                     overwrite=True)
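
Hyperband builds on successive halving: many configurations are trained for a few epochs, and only roughly the best `1/factor` of them move on to a longer run. The sketch below is a simplified, single-bracket illustration of that idea, not Keras Tuner's exact schedule. The function name `halving_schedule` and the starting count of 9 configurations are assumptions made for the example.

```python
def halving_schedule(max_epochs, factor, n_configs):
    """Return (configs, epochs) per round for one simplified bracket."""
    # Count how many times the budget can be divided by `factor`.
    steps = 0
    while factor ** (steps + 1) <= max_epochs:
        steps += 1

    schedule = []
    for s in range(steps, -1, -1):
        # Early rounds get a small share of max_epochs, the last round all of it.
        epochs = max(1, round(max_epochs / factor ** s))
        schedule.append((n_configs, epochs))
        # Only about 1/factor of the configurations survive to the next round.
        n_configs = max(1, n_configs // factor)
    return schedule

print(halving_schedule(max_epochs=20, factor=3, n_configs=9))
```

With `max_epochs=20` and `factor=3`, most candidates are discarded after a couple of epochs, which is what keeps the search cheap.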

In [52]:
# 'val_binary_loss' is not a logged quantity; the validation loss is reported as 'val_loss'
stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)
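
With `patience=5`, training stops once the monitored value has failed to improve for five epochs in a row. The pure-Python sketch below is a simplified model of that rule for a loss-like quantity (lower is better), not the Keras implementation. The function name `stopping_epoch` and the loss values are made up.

```python
def stopping_epoch(losses, patience):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            # Improvement: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            # No improvement: stop once patience is used up.
            wait += 1
            if wait >= patience:
                return epoch
    return len(losses)

# Best value at epoch 3, then five epochs without improvement.
val_losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.46, 0.48, 0.47, 0.44]
print(stopping_epoch(val_losses, patience=5))
```

In this example the later improvement at epoch 9 is never reached, which is the trade-off a small patience value makes.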

In [53]:
tuner.search(train_features,
             train_labels,
             epochs=20,
             validation_split=0.2,
             callbacks=[stop_early])

In [56]:
best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]

print(f"""
The optimal number of units in the densely-connected layer is {best_hps.get('units_1')}, and the optimal learning rate for the optimizer is {best_hps.get('learning_rate')}. The best dropout is {best_hps.get('dropout')}.
""")

## Train the model

In [57]:
model = tuner.hypermodel.build(best_hps)
history = model.fit(train_features, train_labels, epochs=50, validation_split=0.2)

In [58]:
val_accuracy = history.history['val_binary_accuracy']
best_epoch = val_accuracy.index(max(val_accuracy)) + 1
print('Best epoch: %d' % (best_epoch,))

In [59]:
hypermodel = tuner.hypermodel.build(best_hps)
hypermodel.fit(train_features, train_labels, epochs=best_epoch, validation_split=0.2)

In [60]:
plt.plot(hypermodel.history.history['binary_accuracy'], label='accuracy')
plt.plot(hypermodel.history.history['val_binary_accuracy'], label='validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

With the model trained, let's predict on the test data.

In [61]:
# The model outputs logits, so the decision boundary is 0 (a probability of 0.5), not 0.5
predictions = hypermodel.predict(test_data.astype("float32")) > 0
predictions[:,0]

output = pd.DataFrame({'PassengerId': test_data_raw.PassengerId, 'Transported': predictions[:,0]})
output.to_csv('tuner_submission.csv', index=False)