> ## 0. Setup


>> ### 0.1. Libraries
* `NumPy` and `pandas` are used for exploratory data analysis (summarizing the main characteristics of the data) and for feature engineering
* `matplotlib` is used for visualization to assist the data analysis
* `sklearn.preprocessing` is used for converting the categorical data into integer labels before one-hot encoding
* `keras` (imported from `tensorflow`) is used for the neural network

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from matplotlib import rc

from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from tensorflow import keras

>> ### 0.2. Loading the data set
After loading the training and test sets into memory, we make deep copies of them with `copy(deep=True)` so that later changes are not reflected in the original data frames. We then assign a `name` attribute to each data frame for later use.

In [None]:
df_train_orig = pd.read_csv('../input/train.csv')
df_test_orig = pd.read_csv('../input/test.csv')

df_train = df_train_orig.copy(deep=True)
df_train.name = 'Training set'
df_test = df_test_orig.copy(deep=True)
df_test.name = 'Test set'

print('Number of Training Examples = {}'.format(df_train.shape[0]))
print('Number of Test Examples = {}'.format(df_test.shape[0]))
print('Training Input Shape = {}'.format(df_train.shape))
print('Training Output Shape = {}'.format(df_train['Survived'].shape[0]))
print('Test Input Shape = {}'.format(df_test.shape))
print('Test Output Shape = {}'.format(df_test.shape[0]))
print(df_train.columns)
print(df_test.columns)
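
>> The effect of the deep copy can be checked on a small example. This is only a sketch: the `original` and `working` names and the values are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Toy stand-in for a loaded data frame (values are made up for illustration)
original = pd.DataFrame({'Age': [22.0, None, 35.0]})

# deep=True copies the data as well as the index, so edits stay local to the copy
working = original.copy(deep=True)
working['Age'] = working['Age'].fillna(working['Age'].median())

print(original['Age'].isnull().sum())  # missing values left in the original
print(working['Age'].isnull().sum())   # missing values left in the copy
```

>> The original frame keeps its missing value while the copy is filled, which is the behavior the later cleaning steps rely on.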

> ## 1. Data Analysis

>> ### 1.1. Overview
* Using `info()` to get an overview of the types of the features
* Using `sample(10)` to get 10 random rows from the data set

In [None]:
print(df_train_orig.info())
df_train_orig.sample(10)

In [None]:
print(df_test_orig.info())
df_test_orig.sample(10)

>> ### 1.2. Fixing the Missing Values
* As seen in the random sample, some columns have null values. They have to be fixed, but first let's see which columns have null values and how many. The `show_nulls` function below prints the number of null values in every column of a data frame, and it is called for both the training and test sets
* The training set has null values in the `Age`, `Cabin` and `Embarked` columns
* The test set has null values in the `Age`, `Fare` and `Cabin` columns

In [None]:
def show_nulls(df):
    print('{} columns with null values '.format(df.name))
    print(df.isnull().sum())
    print("\n")
    
for df in [df_train, df_test]:
    show_nulls(df)

>> The number of missing values in the `Age`, `Embarked` and `Fare` columns is small relative to the total number of training or test examples, but more than three quarters of the `Cabin` column is missing in both the training and test sets. In this case, we fill the missing values of the
* `Age` column with median
* `Embarked` column with mode since it is categorical
* `Fare` column with median

>> Even though a large portion of the `Cabin` column is missing, it can't be ignored completely because some cabins might have a higher survival rate. Only the first letter of each cabin value is kept and used as the cabin tier; the rest of the value is discarded. Missing cabin values are labeled `X`.

>> Finally, the `PassengerId` and `Ticket` columns are dropped because they are identifiers rather than properties of a passenger, so they are not expected to have any impact on the survival of an individual.

In [None]:
for df in [df_train, df_test]:    
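    # Assign the result back instead of calling fillna(inplace=True) on a column selection,
    # which may operate on a temporary copy in recent pandas versions
    df['Age'] = df['Age'].fillna(df['Age'].median())
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
    df['Fare'] = df['Fare'].fillna(df['Fare'].median())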
    df['Cabin'] = df['Cabin'].apply(lambda x: x[0] if pd.notnull(x) else 'X')
    
df_train.drop(['PassengerId','Ticket'], axis=1, inplace=True)
df_test.drop(['PassengerId', 'Ticket'], axis=1, inplace=True)

print(df_train.columns)
print(df_test.columns)
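
>> The filling rules described above (median for numeric columns, mode for the categorical column, first letter or `X` for the cabin) can be traced on a tiny frame. The `toy` frame and its values are assumptions made up for this sketch.

```python
import numpy as np
import pandas as pd

# Three made-up passengers, each with one missing field
toy = pd.DataFrame({
    'Age': [20.0, np.nan, 40.0],
    'Embarked': ['S', 'S', np.nan],
    'Cabin': ['C85', np.nan, 'E46'],
})

toy['Age'] = toy['Age'].fillna(toy['Age'].median())                   # median of 20 and 40
toy['Embarked'] = toy['Embarked'].fillna(toy['Embarked'].mode()[0])   # most frequent port
toy['Cabin'] = toy['Cabin'].apply(lambda x: x[0] if pd.notnull(x) else 'X')  # tier letter or 'X'

print(toy)
```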

>> Checking again after filling confirms that there are no missing values left.

In [None]:
for df in [df_train, df_test]:
    show_nulls(df)

>> ### 1.3. Interpreting the Cabins Feature
The missing values are fixed, but the `Cabin` feature needs further exploration. There is a connection between the cabin tiers and `Pclass` (socio-economic status). For example, cabins `A`, `B` and `C` contain only people from `Pclass 1` (upper class). Going from cabin `A` to `X`, the share of middle and lower class people in the cabins increases. `T` is an exception, an outlier. It might be a king suite or something like that, because there is only one person in the whole training and test set in cabin `T`, and that person is from the upper class. Instead of creating one more column when the feature is one-hot encoded, I am dropping the person in cabin `T`. If that record weren't dropped, the training and test data would end up with different numbers of columns. Most of the people in `X` are from the middle and lower classes. `X` is not actually a cabin; it is the label for missing values. My guess is that cabin data was recorded less often for passengers who were not from the upper class.

In [None]:
t_index = df_train[df_train.loc[:, 'Cabin'] == 'T'].index # Dropping the only row with 'Cabin' column as 'T'
df_train.drop(t_index, inplace=True)

>> First, the `Cabin` and `Pclass` columns are grouped. The training and test sets have similar `Pclass` distributions within the cabins, so I don't expect this feature to behave differently on the test set.

In [None]:
df_train_cabin_pclass = df_train.groupby(['Cabin', 'Pclass']).count().drop(columns=['Survived', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']).rename(columns={'Name':'Count'})
df_train_cabin_pclass = df_train_cabin_pclass.transpose()
print('Training set grouped by Cabin and Pclass')
df_train_cabin_pclass

In [None]:
df_test_cabin_pclass = df_test.groupby(['Cabin', 'Pclass']).count().drop(columns=['Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']).rename(columns={'Name':'Count'})
df_test_cabin_pclass = df_test_cabin_pclass.transpose()
print('Test set grouped by Cabin and Pclass')
df_test_cabin_pclass

>> Some cabins don't contain every `Pclass` value, so the grouped result only has entries for the `Pclass` values that exist in each cabin. To fix that, I created the helper function `get_pclass_counts`, which writes `0` for a `Pclass` that doesn't occur in a cabin. This is useful when building a counter of `Pclass` values inside cabins. The `Pclass` counts of the training and test sets are displayed below the function definition.

In [None]:
def get_pclass_counts(df):
    cabin_names = df.columns.levels[0]
    cabins = {'A':{}, 'B':{}, 'C':{}, 'D':{}, 'E':{}, 'F':{}, 'G':{}, 'X':{}}
    
    for cabin in cabin_names:
        for pclass in range(1,4):
            try:
                count = df[cabin][pclass].iloc[0] # Getting the number of people in that pclass (positional access)
                cabins[cabin][pclass] = count 
            except KeyError:
                cabins[cabin][pclass] = 0 # If there is no one, assigning it to 0
    return cabins

In [None]:
pclass_count_train = get_pclass_counts(df_train_cabin_pclass)
pclass_count_train

In [None]:
pclass_count_test = get_pclass_counts(df_test_cabin_pclass)
pclass_count_test
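
>> As an aside, a similar zero-filled count table can be obtained with `pd.crosstab`. This is an alternative sketch and not the approach used in this notebook; the `toy` frame is made up for illustration.

```python
import pandas as pd

# Made-up cabin tiers and passenger classes
toy = pd.DataFrame({
    'Cabin':  ['A', 'A', 'F', 'X', 'X', 'X'],
    'Pclass': [1, 1, 2, 3, 3, 2],
})

# crosstab counts every (Cabin, Pclass) pair and writes 0 for pairs that never occur;
# reindex guarantees that all three classes appear as columns
counts = pd.crosstab(toy['Cabin'], toy['Pclass']).reindex(columns=[1, 2, 3], fill_value=0)

# Nested dict in the same shape as the helper's output: {cabin: {pclass: count}}
nested = counts.to_dict(orient='index')
print(nested)
```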

>> Then, I made the helper function `get_pclass_percentages` for converting counts into percentages for visualization. It divides every count by the sum of its cabin and multiplies the result by 100. The `Pclass` percentages in the cabins of the training and test sets are displayed below the function definition.

In [None]:
def get_pclass_percentages(pclass_count):
    df_pclass_count = pd.DataFrame(pclass_count)
    
    percentages = {}

    for col in df_pclass_count.columns:
        percentages[col] = [(count / df_pclass_count[col].sum()) * 100 for count in df_pclass_count[col]] # Dividing count by sum and multiplying with 100

    return percentages

In [None]:
pclass_per_train = get_pclass_percentages(pclass_count_train)
pclass_per_train

In [None]:
pclass_per_test = get_pclass_percentages(pclass_count_test)
pclass_per_test
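
>> The count-to-percentage step can also be sketched with the `normalize` argument of `pd.crosstab`, which divides each row by its sum. Again, this is an alternative under the assumption of a made-up `toy` frame, not the helper used above.

```python
import pandas as pd

toy = pd.DataFrame({
    'Cabin':  ['A', 'A', 'F', 'X', 'X', 'X'],
    'Pclass': [1, 1, 2, 3, 3, 2],
})

# normalize='index' turns each cabin's counts into fractions that sum to 1
percentages = pd.crosstab(toy['Cabin'], toy['Pclass'], normalize='index') * 100
print(percentages.round(1))
```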

In [None]:
def plot_pclass_per(pclass_percentages):
    df_pclass_percentages = pd.DataFrame(pclass_percentages).transpose()

    bar_count = np.arange(8)  

    bar_width = 0.85
    cabin_names = ('A', 'B', 'C', 'D', 'E', 'F', 'G', 'X')

    pclass1 = df_pclass_percentages[0]
    pclass2 = df_pclass_percentages[1]
    pclass3 = df_pclass_percentages[2]

    plt.bar(bar_count, pclass1, color='#b5ffb9', edgecolor='white', width=bar_width, label="Pclass 1")
    plt.bar(bar_count, pclass2, bottom=pclass1, color='#f9bc86', edgecolor='white', width=bar_width, label="Pclass 2")
    plt.bar(bar_count, pclass3, bottom=pclass1 + pclass2, color='#a3acff', edgecolor='white', width=bar_width, label="Pclass 3")

    plt.xticks(bar_count, cabin_names)
    plt.xlabel('Cabins')
    plt.ylabel('Percentages')

    plt.legend(loc='upper left', bbox_to_anchor=(1,1), ncol=1)
    plt.title('Percentages of Pclass in Cabins')
    plt.show()

>> Visualizing the percentages of `Pclass` inside cabins clearly illustrates that half of the cabins are mostly occupied by `Pclass 1` (upper class). However, that doesn't necessarily mean that those cabins have a higher survival rate. They might even have sunk before the other cabins. That's why we also have to check the survival rates by cabin.

In [None]:
plot_pclass_per(pclass_per_train)

In [None]:
plot_pclass_per(pclass_per_test)

>> This time the `Cabin` and `Survived` columns are grouped. It can be done only for the training set because the test set doesn't have the `Survived` feature. The process is the same as the one for the `Cabin` and `Pclass` features. First, the counts of individuals who did and did not survive are displayed for every cabin.

In [None]:
df_train_cabin_survived = df_train.groupby(['Cabin', 'Survived']).count().drop(columns=['Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'Pclass']).rename(columns={'Name':'Count'})
df_train_cabin_survived = df_train_cabin_survived.transpose()
print('Training set grouped by Cabin and Survived')
df_train_cabin_survived

In [None]:
cabin_names = df_train_cabin_survived.columns.levels[0]
cabin_survived = {'A':{}, 'B':{}, 'C':{}, 'D':{}, 'E':{}, 'F':{}, 'G':{}, 'X':{}}

for cabin in cabin_names:
    for survive in range(0,2):
        cabin_survived[cabin][survive] = df_train_cabin_survived[cabin][survive].iloc[0] # Positional access to the single count
        
cabin_survived

In [None]:
df_cabin_survived = pd.DataFrame(cabin_survived)
df_cabin_survived

In [None]:
survived_percentages = {}
df_cabin_survived = pd.DataFrame(cabin_survived)

for col in df_cabin_survived.columns:
    survived_percentages[col] = [(count / df_cabin_survived[col].sum()) * 100 for count in df_cabin_survived[col]] # Dividing count by sum and multiplying with 100

survived_percentages

>> It looks like cabins `B`, `C`, `D` and `E` have the highest survival rates. Those cabins are occupied mostly by the upper class. Cabin `X` (missing cabin data), which is mostly lower and middle class, has the lowest survival rate. To conclude, cabins used by upper class individuals have higher survival rates than cabins used by lower and middle class individuals.

In [None]:
df_survived_percentages = pd.DataFrame(survived_percentages).transpose()

bar_count = np.arange(8)  

bar_width = 0.85
cabin_names = ('A', 'B', 'C', 'D', 'E', 'F', 'G', 'X')

not_survived = df_survived_percentages[0]
survived = df_survived_percentages[1]

plt.bar(bar_count, not_survived, color='#b5ffb9', edgecolor='white', width=bar_width, label="Not Survived")
plt.bar(bar_count, survived, bottom=not_survived, color='#f9bc86', edgecolor='white', width=bar_width, label="Survived")

plt.xticks(bar_count, cabin_names)
plt.xlabel('Cabins')
plt.ylabel('Percentages')

plt.legend(loc='upper left', bbox_to_anchor=(1,1), ncol=1)
plt.title('Percentages of Survival in Cabins')
plt.show()
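
>> Because `Survived` is coded as 0 or 1, the survival percentage per cabin is simply the mean of that column within each cabin. A compact sketch of the same calculation, using a made-up `toy` frame:

```python
import pandas as pd

toy = pd.DataFrame({
    'Cabin':    ['B', 'B', 'X', 'X', 'X', 'X'],
    'Survived': [1, 1, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate
survival_rate = toy.groupby('Cabin')['Survived'].mean() * 100
print(survival_rate)
```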

>> ### 1.4. Checking the Distribution of Data
The output classes are not equally distributed (549 passengers did not survive and 342 survived in the training set, roughly 62% vs. 38%), but the gap is not that big, so the bias is not significant. We don't need to balance the distribution in this case.

In [None]:
df_survive = df_train_orig['Survived'].value_counts()
print(df_survive)
ax = df_survive.plot.bar()
ax.set_xticklabels(('Not Survived', 'Survived'))

>> ### 1.5. Feature Engineering
* `Family_Members` is created by adding `SibSp`, `Parch` and `1`. Since `SibSp` counts siblings and spouses and `Parch` counts parents and children, adding those columns gives the number of family members traveling with the person. The extra `1` is the person himself or herself
* `Is_Alone` is based on `Family_Members`. If `Family_Members` is more than `1`, `Is_Alone` is set to `0`; otherwise it is set to `1`
* `Title` is created by extracting the title (the part between the comma and the first period) from the `Name` column

In [None]:
for df in [df_train, df_test]:    
    df['Family_Members'] = df['SibSp'] + df['Parch'] + 1
    
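    # Use a single .loc assignment on the frame; chained df['Is_Alone'].loc[...] = 0 may write to a copy
    df['Is_Alone'] = 1
    df.loc[df['Family_Members'] > 1, 'Is_Alone'] = 0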
    
    df['Title'] = df['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]

df_train.sample(10)
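
>> The three engineered features can be traced on a few rows. This sketch uses a regular expression with `str.extract` for the title instead of the two `split` calls above; the `toy` frame is an illustrative assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris',
             'Heikkinen, Miss. Laina',
             'Cumings, Mrs. John Bradley (Florence Briggs Thayer)'],
    'SibSp': [1, 0, 1],
    'Parch': [0, 0, 0],
})

toy['Family_Members'] = toy['SibSp'] + toy['Parch'] + 1      # relatives plus the person
toy['Is_Alone'] = (toy['Family_Members'] == 1).astype(int)   # 1 only when nobody else is aboard
# Capture the text between ", " and the first "."
toy['Title'] = toy['Name'].str.extract(r',\s*([^.]+)\.', expand=False)

print(toy[['Family_Members', 'Is_Alone', 'Title']])
```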

>> Since the `Title` column is categorical, we can merge some of its values into one bigger group. Titles like Master and Dr might have had a higher priority during the evacuation, so this feature might be worth exploring. We are going to group the rarely occurring titles into `Other` because I think they are not as significant as the common ones.

In [None]:
df_train['Title'].value_counts()

In [None]:
df_test['Title'].value_counts()

>> Titles that occur fewer than 10 times are grouped into `Other`.

In [None]:
train_title_names = (df_train['Title'].value_counts() < 10)
df_train['Title'] = df_train['Title'].apply(lambda x: 'Other' if train_title_names.loc[x] else x)

df_train['Title'].value_counts()

In [None]:
test_title_names = (df_test['Title'].value_counts() < 10)
df_test['Title'] = df_test['Title'].apply(lambda x: 'Other' if test_title_names.loc[x] else x)

df_test['Title'].value_counts()
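
>> The grouping of rare titles can be sketched without `apply`, using `isin` and `where`. The `titles` series and its counts are made up for this example.

```python
import pandas as pd

# Made-up title column: two common titles and two rare ones
titles = pd.Series(['Mr'] * 12 + ['Miss'] * 10 + ['Dr'] * 2 + ['Rev'])

counts = titles.value_counts()
rare = counts[counts < 10].index          # titles that occur fewer than 10 times

# Keep a title where it is not rare, otherwise replace it with 'Other'
grouped = titles.where(~titles.isin(rare), 'Other')
print(grouped.value_counts())
```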

>> ### 1.6. Categorical to Dummy
Categorical data are transformed to numerical data with `LabelEncoder()` from `sklearn.preprocessing`. It labels the `n` categories of a column with the integers from `0` to `n - 1`.

In [None]:
df_train.head(10)

In [None]:
le = LabelEncoder()
for df in [df_train, df_test]:
    df['Pclass'] = le.fit_transform(df['Pclass'])
    df['Sex'] = le.fit_transform(df['Sex'])
    df['Cabin'] = le.fit_transform(df['Cabin'])
    df['Embarked'] = le.fit_transform(df['Embarked'])
    df['Title'] = le.fit_transform(df['Title'])
    
df_train.head(10)
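
>> `LabelEncoder` numbers the categories in sorted order. To my understanding the same numbering can be reproduced with `np.unique`, which makes the mapping easy to inspect. The `embarked` array below is a made-up example.

```python
import numpy as np

embarked = np.array(['S', 'C', 'Q', 'S', 'C'])

# classes holds the sorted distinct values; codes holds each value's position in classes
classes, codes = np.unique(embarked, return_inverse=True)

print(classes)
print(codes)
```

>> Since the encoder is fitted separately on the training and test sets, the integer codes agree between the two only when both sets contain the same categories in a column.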

>> The categorical columns (`Pclass`, `Sex`, `Cabin`, `Embarked`, `Title`) are converted to one-hot encoding with the `get_dummies()` function, then the original categorical columns are dropped and the remaining columns are renamed.

In [None]:
# dtype=int keeps the dummy columns numeric (recent pandas versions return booleans by default)
df_train_dummy = pd.concat([df_train, pd.get_dummies(df_train['Pclass'], dtype=int)], axis=1)
df_train_dummy = pd.concat([df_train_dummy, pd.get_dummies(df_train['Sex'], dtype=int)], axis=1)
df_train_dummy = pd.concat([df_train_dummy, pd.get_dummies(df_train['Cabin'], dtype=int)], axis=1)
df_train_dummy = pd.concat([df_train_dummy, pd.get_dummies(df_train['Embarked'], dtype=int)], axis=1)
df_train_dummy = pd.concat([df_train_dummy, pd.get_dummies(df_train['Title'], dtype=int)], axis=1)

df_train_dummy.drop(columns=['Pclass', 'Sex', 'Embarked', 'Title', 'Name', 'Cabin'], inplace=True)
df_train_dummy.columns = ('Survived', 'Age', 'SibSp', 'Parch', 'Fare', 'Family_Members', 'Is_Alone',
                         'Pclass_1', 'Pclass_2', 'Pclass_3', 'Female', 'Male', 'Cabin_A', 'Cabin_B', 'Cabin_C',
                          'Cabin_D', 'Cabin_E', 'Cabin_F', 'Cabin_G', 'Cabin_X', 'Embarked_C', 'Embarked_Q', 'Embarked_S',
                         'Title_Master', 'Title_Miss', 'Title_Mr', 'Title_Mrs', 'Title_Other',)
   
df_train_dummy.head(10)

In [None]:
# dtype=int keeps the dummy columns numeric (recent pandas versions return booleans by default)
df_test_dummy = pd.concat([df_test, pd.get_dummies(df_test['Pclass'], dtype=int)], axis=1)
df_test_dummy = pd.concat([df_test_dummy, pd.get_dummies(df_test['Sex'], dtype=int)], axis=1)
df_test_dummy = pd.concat([df_test_dummy, pd.get_dummies(df_test['Cabin'], dtype=int)], axis=1)
df_test_dummy = pd.concat([df_test_dummy, pd.get_dummies(df_test['Embarked'], dtype=int)], axis=1)
df_test_dummy = pd.concat([df_test_dummy, pd.get_dummies(df_test['Title'], dtype=int)], axis=1)

df_test_dummy.drop(columns=['Pclass', 'Sex', 'Embarked', 'Title', 'Name', 'Cabin'], inplace=True)
df_test_dummy.columns = ('Age', 'SibSp', 'Parch', 'Fare', 'Family_Members', 'Is_Alone',
                         'Pclass_1', 'Pclass_2', 'Pclass_3', 'Female', 'Male', 'Cabin_A', 'Cabin_B', 'Cabin_C',
                          'Cabin_D', 'Cabin_E', 'Cabin_F', 'Cabin_G', 'Cabin_X', 'Embarked_C', 'Embarked_Q', 'Embarked_S',
                         'Title_Master', 'Title_Miss', 'Title_Mr', 'Title_Mrs', 'Title_Other',)

df_test_dummy.head(10)
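
>> For comparison, `get_dummies` can also be applied to the whole frame with the `columns` argument, in which case it drops the encoded columns and names the new ones `<column>_<value>` by itself. This is a sketch of that variant on a made-up `toy` frame, not the code used above.

```python
import pandas as pd

toy = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0],
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'S'],
})

# Only the listed columns are encoded; the others are passed through unchanged
encoded = pd.get_dummies(toy, columns=['Sex', 'Embarked'], dtype=int)
print(encoded.columns.tolist())
```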

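>> ### 1.7. Normalizing the Continuous Data
The range of the continuous data is too wide, so we have to normalize it. There are many ways to normalize data. I used standardization (z-score normalization): subtracting the mean of a column and dividing by its standard deviation.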

In [None]:
for df in [df_train_dummy, df_test_dummy]:
    df['SibSp'] = (df['SibSp'] - df['SibSp'].mean()) / df['SibSp'].std()
    df['Parch'] = (df['Parch'] - df['Parch'].mean()) / df['Parch'].std()
    df['Family_Members'] = (df['Family_Members'] - df['Family_Members'].mean()) / df['Family_Members'].std()
    df['Age'] = (df['Age'] - df['Age'].mean()) / df['Age'].std()
    df['Fare'] = (df['Fare'] - df['Fare'].mean()) / df['Fare'].std()

df_train_dummy.head(10)

In [None]:
df_test_dummy.head(10)
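
>> The cell above standardizes each set with its own mean and standard deviation. A common variation, shown here only as a sketch, reuses the training statistics for the test set so that both are on the same scale. The fare values are made up.

```python
import pandas as pd

train_fare = pd.Series([7.25, 71.28, 7.92, 53.10])
test_fare = pd.Series([7.83, 7.00])

# Statistics are computed on the training values only
mean, std = train_fare.mean(), train_fare.std()

train_scaled = (train_fare - mean) / std   # has mean 0 and standard deviation 1
test_scaled = (test_fare - mean) / std     # same shift and scale as the training data

print(train_scaled.round(3).tolist())
print(test_scaled.round(3).tolist())
```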

>> ### 1.8. Separating X and Y
* The data is finally ready for training. The input (X) and output (Y) are separated here. `X_train` is all the columns except `Survived`, since that is the output. `Y_train` is the `Survived` column. `df_test_dummy` doesn't need to be separated because it doesn't have a `Survived` column anyway.

In [None]:
X_train = df_train_dummy.drop(['Survived'], axis=1)
Y_train = df_train_dummy['Survived']

X_train[:5]

In [None]:
Y_train[:5]

> ## 2. Machine Learning (Neural Network)

>> ### 2.1 Neural Network
* Using the ReLU activation function on the hidden layers
* The activation function of the output layer is sigmoid because it is a binary classification problem

In [None]:
model = keras.models.Sequential([
    keras.layers.Dense(32, activation='relu', input_dim=27),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dropout(0.25),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dropout(0.25),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dropout(0.25),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dropout(0.25),
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])
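
>> The size of this network can be worked out by hand: a dense layer with `n_in` inputs and `n_out` units has `(n_in + 1) * n_out` trainable parameters (weights plus one bias per unit), and dropout layers add none. A small sketch of that arithmetic, with `layer_sizes` written out from the model definition above:

```python
# Input width followed by the width of every Dense layer in the model
layer_sizes = [27, 32, 64, 64, 64, 64, 16, 1]

# (inputs + 1 bias) * units for each consecutive pair of widths
params = [(n_in + 1) * n_out
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(params)

print(params)
print(total)
```

>> The total should match the trainable parameter count reported by `model.summary()`.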

>> ### 2.2 Optimizer, Loss Function, Metrics and Callbacks
* The optimizer is stochastic gradient descent with default parameters
* The loss function is binary cross-entropy
* Using accuracy as the metric
* Creating a callback which halves the learning rate if the validation accuracy doesn't improve for 3 epochs

In [None]:
# learning_rate replaces the deprecated lr argument; decay=0.0 (no decay) is omitted
optimizer = keras.optimizers.SGD(learning_rate=0.01, momentum=0.0, nesterov=False)
loss = 'binary_crossentropy'
metrics = ['accuracy']

# With metrics=['accuracy'], TensorFlow 2.x logs the validation metric as 'val_accuracy'
learning_rate_reduction = keras.callbacks.ReduceLROnPlateau(monitor='val_accuracy', factor=0.5, patience=3, verbose=1, mode='auto', min_delta=0.0001, cooldown=0, min_lr=0)

model.compile(optimizer=optimizer, loss=loss, metrics=metrics)
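
>> The learning rate rule can be illustrated in plain Python. This is a simplified sketch of the idea (no cooldown, no lower bound), not Keras's implementation; the function name `reduce_on_plateau` and the accuracy history are made up.

```python
def reduce_on_plateau(history, lr=0.01, factor=0.5, patience=3, min_delta=1e-4):
    """Return the learning rate in effect after each epoch of `history`."""
    best = float('-inf')
    wait = 0
    rates = []
    for acc in history:
        if acc > best + min_delta:
            best = acc          # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1           # no improvement this epoch
            if wait >= patience:
                lr *= factor    # plateau reached: shrink the learning rate
                wait = 0
        rates.append(lr)
    return rates

# Accuracy stalls at 0.75 for three epochs after the last improvement
rates = reduce_on_plateau([0.70, 0.75, 0.75, 0.75, 0.75, 0.80])
print(rates)
```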

In [None]:
epochs = 50
batch_size = 8

model.fit(X_train, Y_train, 
          epochs=epochs, 
          batch_size=batch_size, 
          callbacks=[learning_rate_reduction], 
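          # Note: the validation data is the training data itself, so val_* metrics are not a held-out estimate
          validation_data=(X_train, Y_train))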
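
>> Since the validation data above is the training data itself, the reported validation accuracy does not measure generalization. One possible variation is to hold out part of the training data. The sketch below shows such a split with NumPy; the frame, the labels, the 20% fraction and the variable names are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up features and labels standing in for X_train and Y_train
features = pd.DataFrame({'x': np.arange(10, dtype=float)})
labels = pd.Series([0, 1] * 5)

rng = np.random.default_rng(0)             # fixed seed for a reproducible split
order = rng.permutation(len(features))     # shuffled row positions
n_val = int(0.2 * len(features))           # hold out 20% of the rows

val_idx, train_idx = order[:n_val], order[n_val:]
X_tr, Y_tr = features.iloc[train_idx], labels.iloc[train_idx]
X_val, Y_val = features.iloc[val_idx], labels.iloc[val_idx]

print(len(X_tr), len(X_val))
```

>> The held-out pair could then be passed as `validation_data=(X_val, Y_val)`. Keras `fit` also accepts a `validation_split` argument that sets aside the last fraction of the samples.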

In [None]:
model.summary()

> ## 3. Result

>> ### 3.1 Predicting with the Trained Model
* Predicting the classes of the test set (`df_test_dummy`) with the model trained earlier

In [None]:
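# predict_classes was removed from recent Keras versions; threshold the sigmoid outputs at 0.5 instead
# and flatten the (n, 1) result into a 1-D array of class labels
Y_hat = (model.predict(df_test_dummy, batch_size=None, verbose=0) > 0.5).astype('int32').ravel()
Y_hat.shape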
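
>> The sigmoid output is a probability per passenger, and the class label follows from comparing it with 0.5. A minimal NumPy sketch of that conversion, with made-up probabilities in the `(n, 1)` shape a single-unit output layer produces:

```python
import numpy as np

# Made-up model outputs, one probability per row
probs = np.array([[0.91], [0.12], [0.50], [0.73]])

# Strictly greater than 0.5 counts as class 1; ravel flattens (n, 1) to (n,)
classes = (probs > 0.5).astype(int).ravel()
print(classes)
```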

> ## 4. Submission

In [None]:
submission_df = pd.DataFrame(columns=['PassengerId', 'Survived'])
submission_df['PassengerId'] = df_test_orig['PassengerId']
submission_df['Survived'] = Y_hat

In [None]:
submission_df.head(30)

In [None]:
submission_df.to_csv('submissions.csv', header=True, index=False)