# Introduction

<img src="https://i.imgur.com/qBl2QL2.jpg" width="600px">

I made this kernel for Kaggle's Titanic competition (it also happens to be my first one!). Given some information about each passenger, such as ticket class, sex, age, and fare, I try to predict whether that passenger survived.

# Contents

* Preliminary steps
    * Importing the necessary libraries
    * Converting the CSV file into a pandas dataframe
* Encoding the features of the train data
* Defining the features and prediction target
* Creating the model
* Fitting the model
* Encoding the features of the test data
* Predicting survival
* Ending Note

### Preliminary Steps

Importing the necessary libraries -

In [1]:
# Data handling
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm

# Neural network
import keras
from keras.models import Sequential
from keras.layers import Dense

# Train/validation split
from sklearn.model_selection import train_test_split

# Enable progress_apply on pandas objects
tqdm.pandas()

Using TensorFlow backend.


Converting the CSV file into a pandas dataframe -

In [2]:
train_path = '/kaggle/input/titanic/train.csv'
train_data = pd.read_csv(train_path)
# Fill missing numeric values (e.g. Age) with the mean of their column
train_data = train_data.fillna(train_data.mean(numeric_only=True))
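
To show what filling with column means does, here is a minimal sketch on a toy dataframe. The values are made up for illustration and are not taken from train.csv. I pass `numeric_only=True` so that only numeric columns get a mean, which assumes a pandas version that supports that argument; text columns such as Cabin keep their missing values.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from train.csv)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": ["C85", np.nan, np.nan],
})

# Column means are computed for numeric columns only
means = toy.mean(numeric_only=True)

# fillna with a Series fills each column using the mean with the same name
filled = toy.fillna(means)

print(filled)
```

The missing Age becomes 24.0 (the mean of 22 and 26), while the two missing Cabin values stay missing.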

A look at some of the train data - 

In [3]:
train_data.head(10)

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 |  | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | 29.699118 | 0 | 0 | 330877 | 8.4583 |  | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 |  | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 |  | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 |  | C |


### Encoding the features of the train data

Encoding the 'Sex' and 'Embarked' features - 

In [4]:
def process_sex(x):
    # Encode 'male' as 1 and anything else as 0
    return 1 if x == "male" else 0

def process_embarked(x):
    # One-hot encode the port of embarkation as a 4-tuple:
    # positions 0-2 stand for C, Q and S; position 3 flags a missing or unknown port
    ports = ["C", "Q", "S"]
    code = [0, 0, 0, 0]

    if x in ports:
        code[ports.index(x)] = 1
    else:
        code[-1] = 1

    return tuple(code)

train_data["Sex"] = train_data["Sex"].progress_apply(process_sex)
train_data["Embarked"] = train_data["Embarked"].progress_apply(process_embarked)

Splitting the tuples in the 'Embarked' column into 4 separate columns, each holding a single number per row - 

In [5]:
train_data["Embarked_0"] = [train_data["Embarked"][idx][0] for idx in tqdm(range(len(train_data)))]
train_data["Embarked_1"] = [train_data["Embarked"][idx][1] for idx in tqdm(range(len(train_data)))]
train_data["Embarked_2"] = [train_data["Embarked"][idx][2] for idx in tqdm(range(len(train_data)))]
train_data["Embarked_3"] = [train_data["Embarked"][idx][3] for idx in tqdm(range(len(train_data)))]
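
As an aside, the two steps above (encoding 'Embarked' as a tuple, then splitting it into columns) could probably be done in one call with `pd.get_dummies`. This is only a sketch on made-up values, not what this kernel actually runs, and the resulting column names differ from the `Embarked_0` to `Embarked_3` names used here.

```python
import numpy as np
import pandas as pd

# Toy column (made-up values) including one missing port
ports = pd.Series(["S", "C", np.nan, "Q"], name="Embarked")

# dummy_na=True adds a fourth column for missing values;
# dtype=int yields 0/1 integers rather than booleans
encoded = pd.get_dummies(ports, prefix="Embarked", dummy_na=True, dtype=int)

print(encoded)
```

Each row contains exactly one 1, just like the tuples returned by `process_embarked`.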

A look at the encoded features - 

In [6]:
train_data.head(10)

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Embarked_0 | Embarked_1 | Embarked_2 | Embarked_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | (0, 0, 1, 0) | 0 | 0 | 1 | 0 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | (1, 0, 0, 0) | 1 | 0 | 0 | 0 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | (0, 0, 1, 0) | 0 | 0 | 1 | 0 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | (0, 0, 1, 0) | 0 | 0 | 1 | 0 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.05 |  | (0, 0, 1, 0) | 0 | 0 | 1 | 0 |
| 5 | 6 | 0 | 3 | Moran, Mr. James | 1 | 29.699118 | 0 | 0 | 330877 | 8.4583 |  | (0, 1, 0, 0) | 0 | 1 | 0 | 0 |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | 1 | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | (0, 0, 1, 0) | 0 | 0 | 1 | 0 |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | 1 | 2.0 | 3 | 1 | 349909 | 21.075 |  | (0, 0, 1, 0) | 0 | 0 | 1 | 0 |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | 0 | 27.0 | 0 | 2 | 347742 | 11.1333 |  | (0, 0, 1, 0) | 0 | 0 | 1 | 0 |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | 0 | 14.0 | 1 | 0 | 237736 | 30.0708 |  | (1, 0, 0, 0) | 1 | 0 | 0 | 0 |


### Defining the features and prediction target

In [7]:
y = train_data["Survived"].values.reshape(len(train_data), 1)
X = train_data[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked_0", "Embarked_1", "Embarked_2", "Embarked_3"]].values

Scaling every feature to the range 0 to 1 by dividing each column by its largest value (this works because none of the features are negative) - 

In [8]:
X = X/X.max(axis=0)
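
Here is a small worked example of that division. The matrix is made up (two columns standing in for Age and Fare); it only illustrates how `max(axis=0)` and broadcasting combine.

```python
import numpy as np

# Toy feature matrix (hypothetical values): columns stand in for Age and Fare
X_toy = np.array([[20.0, 10.0],
                  [40.0, 50.0],
                  [80.0, 25.0]])

# Largest value of each column, shape (2,)
col_max = X_toy.max(axis=0)

# Broadcasting divides every row by the per-column maxima
X_scaled = X_toy / col_max

print(col_max)
print(X_scaled)
```

After scaling, the largest entry in each column is exactly 1 and every other entry lies between 0 and 1.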

Splitting the data into a training set and a validation set (by default, train_test_split holds out 25% of the rows for validation) - 

In [9]:
train_X, val_X, train_y, val_y = train_test_split(X, y)
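
The split above is random, so the validation rows change on every run. A sketch of a repeatable variant on toy data follows; `random_state=0` and `stratify` are my own choices for illustration and are not used in this kernel.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 8 samples, 2 features, balanced binary labels (made up)
X_toy = np.arange(16, dtype=float).reshape(8, 2)
y_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# test_size=0.25 matches the default; random_state makes the split repeatable;
# stratify keeps the class ratio the same in both parts
tr_X, va_X, tr_y, va_y = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

print(tr_X.shape, va_X.shape)  # (6, 2) (2, 2)
```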

### Creating the model

In [10]:
model = Sequential()
model.add(Dense(units=20, activation='relu'))
model.add(Dense(units=15, activation='relu'))
model.add(Dense(units=1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
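
For reference, binary cross-entropy averages -[y·log(p) + (1-y)·log(1-p)] over the samples, where y is the true label and p the predicted probability. Below is a minimal NumPy sketch of that formula. The helper name and the clipping constant are my own assumptions; this is not the Keras implementation, whose details may differ.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; hypothetical helper, not the Keras function."""
    # Clip so that log(0) never occurs
    p = np.clip(y_prob, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

y_true = np.array([1.0, 0.0, 1.0])
confident = np.array([0.9, 0.1, 0.8])  # mostly right and fairly sure
unsure = np.array([0.5, 0.5, 0.5])     # always predicts 50%

loss_confident = binary_crossentropy(y_true, confident)
loss_unsure = binary_crossentropy(y_true, unsure)

print(loss_confident, loss_unsure)
```

Predicting 50% everywhere gives a loss of ln 2 (about 0.693), so a model that is learning something should end up below that.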

Telling the model its input size (10 features per passenger) - 

In [11]:
model.build(input_shape=(None, 10))
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 20)                220       
_________________________________________________________________
dense_2 (Dense)              (None, 15)                315       
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 16        
Total params: 551
Trainable params: 551
Non-trainable params: 0
_________________________________________________________________
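
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. A short sketch of that arithmetic:

```python
# Layer sizes: 10 input features, then Dense layers with 20, 15 and 1 units
layer_sizes = [10, 20, 15, 1]

# A Dense layer has (inputs * units) weights plus one bias per unit
params = [n_in * n_out + n_out
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

print(params)       # [220, 315, 16]
print(sum(params))  # 551
```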


### Fitting the model

In [12]:
model.fit(x=train_X, y=train_y, validation_data=(val_X, val_y), epochs=10)

Train on 668 samples, validate on 223 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.callbacks.History at 0x7f8c505baef0>

### Encoding the features of the test data

Converting the CSV file into a pandas dataframe -

In [13]:
test_path = '/kaggle/input/titanic/test.csv'
test_data = pd.read_csv(test_path)
# Fill missing numeric values (e.g. Age) with the mean of their column
test_data = test_data.fillna(test_data.mean(numeric_only=True))

A look at the test data - 

In [14]:
test_data.head(10)

|   | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 |  | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 |  | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 |  | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 |  | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 |  | S |
| 5 | 897 | 3 | Svensson, Mr. Johan Cervin | male | 14.0 | 0 | 0 | 7538 | 9.225 |  | S |
| 6 | 898 | 3 | Connolly, Miss. Kate | female | 30.0 | 0 | 0 | 330972 | 7.6292 |  | Q |
| 7 | 899 | 2 | Caldwell, Mr. Albert Francis | male | 26.0 | 1 | 1 | 248738 | 29.0 |  | S |
| 8 | 900 | 3 | Abrahim, Mrs. Joseph (Sophie Halaut Easu) | female | 18.0 | 0 | 0 | 2657 | 7.2292 |  | C |
| 9 | 901 | 3 | Davies, Mr. John Samuel | male | 21.0 | 2 | 0 | A/4 48871 | 24.15 |  | S |


Encoding the 'Sex' and 'Embarked' features - 

In [15]:
test_data["Sex"] = test_data["Sex"].progress_apply(process_sex)
test_data["Embarked"] = test_data["Embarked"].progress_apply(process_embarked)

A look at the encoded features - 

In [16]:
test_data

|   | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | 1 | 34.50000 | 0 | 0 | 330911 | 7.8292 |  | (0, 1, 0, 0) |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | 0 | 47.00000 | 1 | 0 | 363272 | 7.0000 |  | (0, 0, 1, 0) |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | 1 | 62.00000 | 0 | 0 | 240276 | 9.6875 |  | (0, 1, 0, 0) |
| 3 | 895 | 3 | Wirz, Mr. Albert | 1 | 27.00000 | 0 | 0 | 315154 | 8.6625 |  | (0, 0, 1, 0) |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | 0 | 22.00000 | 1 | 1 | 3101298 | 12.2875 |  | (0, 0, 1, 0) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 413 | 1305 | 3 | Spector, Mr. Woolf | 1 | 30.27259 | 0 | 0 | A.5. 3236 | 8.0500 |  | (0, 0, 1, 0) |
| 414 | 1306 | 1 | Oliva y Ocana, Dona. Fermina | 0 | 39.00000 | 0 | 0 | PC 17758 | 108.9000 | C105 | (1, 0, 0, 0) |
| 415 | 1307 | 3 | Saether, Mr. Simon Sivertsen | 1 | 38.50000 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 |  | (0, 0, 1, 0) |
| 416 | 1308 | 3 | Ware, Mr. Frederick | 1 | 30.27259 | 0 | 0 | 359309 | 8.0500 |  | (0, 0, 1, 0) |


Splitting the tuples in the 'Embarked' column into 4 separate columns, each holding a single number per row - 

In [17]:
test_data["Embarked_0"] = [test_data["Embarked"][idx][0] for idx in tqdm(range(len(test_data)))]
test_data["Embarked_1"] = [test_data["Embarked"][idx][1] for idx in tqdm(range(len(test_data)))]
test_data["Embarked_2"] = [test_data["Embarked"][idx][2] for idx in tqdm(range(len(test_data)))]
test_data["Embarked_3"] = [test_data["Embarked"][idx][3] for idx in tqdm(range(len(test_data)))]
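
The four list comprehensions could also be collapsed into one step by building a dataframe straight from the tuples. This is a sketch on a toy frame with made-up rows, offered as a possible simplification rather than what this kernel runs.

```python
import pandas as pd

# Toy frame with one-hot tuples like those produced by process_embarked
toy = pd.DataFrame({"Embarked": [(0, 1, 0, 0), (0, 0, 1, 0), (1, 0, 0, 0)]})

# Build one column per tuple position, aligned on the original index
names = [f"Embarked_{i}" for i in range(4)]
expanded = pd.DataFrame(toy["Embarked"].tolist(), columns=names, index=toy.index)

# Attach the new columns to the original frame
toy = toy.join(expanded)

print(toy)
```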

Defining a new variable to hold the features - 

In [18]:
X_test = test_data[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked_0", "Embarked_1", "Embarked_2", "Embarked_3"]].values

Scaling every feature to the range 0 to 1 by dividing each column by its largest value. The second line of the cell below handles features whose largest value is 0, which for these non-negative features means the whole column is 0. Dividing 0 by 0 gives NaN in NumPy, so those entries are replaced with 0.

In [19]:
X_test = X_test/X_test.max(axis=0)
X_test[np.isnan(X_test)] = 0
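
A toy example of the all-zero column case, with made-up values. The `np.errstate` context manager is my addition; it only silences the RuntimeWarning that NumPy otherwise prints for 0 / 0.

```python
import numpy as np

# Toy test matrix (hypothetical): the second column is all zeros
X_toy = np.array([[2.0, 0.0],
                  [4.0, 0.0]])

# 0 / 0 produces NaN; silence the RuntimeWarning NumPy would otherwise emit
with np.errstate(invalid="ignore"):
    X_scaled = X_toy / X_toy.max(axis=0)

# Remember where the NaN entries were, then replace them with 0 as in the cell above
nan_mask = np.isnan(X_scaled)
X_scaled[nan_mask] = 0

print(X_scaled)
```

An alternative I have not tried here would be to divide the test features by the maxima computed on the training data, so that both sets share the same scale.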

### Predicting Survival

Running inference on the test data and rounding each predicted probability to either 0 or 1 - 

In [20]:
predictions = model.predict(X_test)
predictions = np.round(predictions).reshape(len(X_test))
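
One detail worth knowing about `np.round`: exact halves are rounded to the nearest even number, so a probability of exactly 0.5 becomes 0. The sketch below uses made-up probabilities and compares rounding with an explicit threshold; the 0.5 cut-off is an assumed choice.

```python
import numpy as np

# Hypothetical sigmoid outputs, shaped (n, 1) like the result of model.predict
probs = np.array([[0.12], [0.50], [0.51], [0.97]])

# np.round rounds exact halves to the nearest even number, so 0.50 becomes 0
rounded = np.round(probs).reshape(len(probs))

# An explicit threshold makes the cut-off visible (0.5 is an assumed choice)
thresholded = (probs > 0.5).astype(np.int32).reshape(len(probs))

print(rounded)
print(thresholded)
```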

gender_submission.csv already has the format required for a submission, so I first load it into a pandas dataframe - 

In [21]:
sub_path = '/kaggle/input/titanic/gender_submission.csv'
submission = pd.read_csv(sub_path)

A look at gender_submission.csv - 

In [22]:
submission.head(10)

|   | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 1 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
| 5 | 897 | 0 |
| 6 | 898 | 1 |
| 7 | 899 | 0 |
| 8 | 900 | 1 |
| 9 | 901 | 0 |


Replacing the 'Survived' column in the dataframe with the values we got - 

In [23]:
submission["Survived"] = np.int32(predictions)

A final look at the dataframe with our predictions - 

In [24]:
submission.head(10)

|   | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 1 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
| 5 | 897 | 0 |
| 6 | 898 | 1 |
| 7 | 899 | 0 |
| 8 | 900 | 1 |
| 9 | 901 | 0 |


Converting the dataframe into a csv file without the index column - 

In [25]:
submission.to_csv('submission.csv', index=False)
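
The same two-column file could also be built directly from the test passenger IDs, without loading gender_submission.csv first. A sketch with stand-in values follows; `test_ids` and `preds` are hypothetical names for the 'PassengerId' column and the rounded predictions.

```python
import numpy as np
import pandas as pd

# Stand-ins for the real objects (hypothetical values)
test_ids = pd.Series([892, 893, 894], name="PassengerId")
preds = np.array([0.0, 1.0, 0.0])

# Two columns, in the same order as gender_submission.csv
sub = pd.DataFrame({
    "PassengerId": test_ids,
    "Survived": preds.astype(np.int32),
})

# index=False leaves out the row index, so the output has exactly two columns;
# with no path given, to_csv returns the CSV text instead of writing a file
csv_text = sub.to_csv(index=False)

print(csv_text)
```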

### Ending Note

This being my first ML model, I learnt the basics, like how to create a neural network and how to encode features. I really enjoyed it and look forward to learning more. I would also really appreciate feedback to help me improve both the accuracy and the efficiency of my model :)