# Titanic - Survival Predictions

This project applies a Decision Tree classifier to predict whether each Titanic passenger survived. It is a supervised learning task: selected features from the data set are used to train the model against the known survival outcomes. The Titanic survival data comes from the Udacity course workspace and is also available online. For this project, the data has already been cleaned and preprocessed in some respects.

In [44]:
#import 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

#load the data into pandas dataframe

df = pd.read_csv(r"C:\Everything On This PC\Udacity\Intro to ML -TensorFlow\Coursework-Projects\Titanic-Project\titanic_data.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


The features are listed below with their explanations:

- **Survived**: Outcome of survival (0 = No; 1 = Yes)
- **Pclass**: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
- **Name**: Name of passenger
- **Sex**: Sex of the passenger
- **Age**: Age of the passenger (Some entries contain `NaN`)
- **SibSp**: Number of siblings and spouses of the passenger aboard
- **Parch**: Number of parents and children of the passenger aboard
- **Ticket**: Ticket number of the passenger
- **Fare**: Fare paid by the passenger
- **Cabin**: Cabin number of the passenger (Some entries contain `NaN`)
- **Embarked**: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)
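
Since **Age** and **Cabin** contain `NaN` entries, it can help to count the missing values per column before preprocessing. The following is a minimal sketch on a hypothetical three-row sample (the `sample` frame is an assumption for illustration, not the project data); on the real data the same call would be `df.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample that mimics a few of the columns listed above.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
    "Embarked": ["S", "C", "S"],
})

# Count the missing entries in each column.
missing_counts = sample.isna().sum()
print(missing_counts)
```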

## Preprocessing Data

Some feature columns contain text, so we apply one-hot encoding to convert them into numeric indicator columns. The **Name** column is removed first: since the names are all different, encoding it would create a separate column for every passenger and add nothing useful to the model.
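
To illustrate what one-hot encoding does, here is a minimal sketch on a hypothetical toy frame (the `toy` data is an assumption, not the project data). `pd.get_dummies` leaves numeric columns untouched and replaces each text column with one indicator column per distinct value.

```python
import pandas as pd

# Hypothetical toy frame with one text column and one numeric column.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Fare": [7.25, 71.2833, 7.925],
})

# Text columns are expanded into indicator columns; numeric columns are kept as they are.
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
```

A column with many distinct values, such as **Ticket** or **Cabin**, produces one new column per value, which is why the encoded table becomes very wide.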

In [45]:
#dropping name column
df = df.drop('Name', axis=1)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,male,35.0,0,0,373450,8.05,,S


In [46]:
#one-hot encode the remaining text columns (numeric columns are left unchanged)
df = pd.get_dummies(df)

#separate the 'Survived' column, which serves as the label for training
outcome = df['Survived']
features = df.drop('Survived', axis=1)


#filling any missing values with the number 0 (not the string '0.0')
features = features.fillna(0.0)
features.head()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Ticket_110152,Ticket_110413,...,Cabin_F G73,Cabin_F2,Cabin_F33,Cabin_F38,Cabin_F4,Cabin_G6,Cabin_T,Embarked_C,Embarked_Q,Embarked_S
0,1,3,22.0,1,0,7.25,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
1,2,1,38.0,1,0,71.2833,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,3,26.0,0,0,7.925,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,4,1,35.0,1,0,53.1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,5,3,35.0,0,0,8.05,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1


## Training the Model

In [49]:
X_train, X_test, y_train, y_test = train_test_split(features, outcome, test_size =0.2, random_state=40)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

#predictions and accuracy
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

#accuracy_score expects the true labels first, then the predictions
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print("Our train accuracy is :{} \nOur test accuracy is {}".format(train_accuracy, test_accuracy))

Our train accuracy is :1.0 
Our test accuracy is 0.8435754189944135


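A training accuracy of 1.0 against a test accuracy of about 0.84 suggests that the unconstrained tree overfits the training data. The test accuracy can be improved by tuning the hyperparameters of the Decision Tree classifier. To do this, the loop below tries combinations of `max_depth` and `min_samples_leaf` (with `min_samples_split` fixed at 10) and keeps track of the highest accuracy score.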

In [48]:
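#start from the accuracies of the unconstrained tree trained above
train_max_acc = accuracy_score(y_train, y_train_pred)
test_max_acc = accuracy_score(y_test, y_test_pred)

#try max_depth and min_samples_leaf values from 1 to 12
for i in range(12):
    for j in range(12):
        model = DecisionTreeClassifier(max_depth=i+1, min_samples_leaf=j+1, min_samples_split=10)
        model.fit(X_train, y_train)
        y_train_pred = model.predict(X_train)
        y_test_pred = model.predict(X_test)
        train_accuracy = accuracy_score(y_train, y_train_pred)
        test_accuracy = accuracy_score(y_test, y_test_pred)
        #keep the best test and train accuracies seen so far (tracked independently)
        if test_max_acc < test_accuracy:
            test_max_acc = test_accuracy
        if train_max_acc < train_accuracy:
            train_max_acc = train_accuracy

#note: train_accuracy is the score of the last model tried (max_depth=12, min_samples_leaf=12),
#not the score of the model with the best test accuracy
print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_max_acc)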

The training accuracy is 0.8497191011235955
The test accuracy is 0.8770949720670391
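
The nested loop reports the best score but not which parameter values produced it, and it selects on the test set. As a hedged alternative, the sketch below uses scikit-learn's `GridSearchCV` to search the same ranges with 5-fold cross-validation on the training data only. The synthetic data from `make_classification` is a stand-in (an assumption) so the block runs on its own; in this project `X` and `y` would be `features` and `outcome`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption); replace with `features` and `outcome`.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=40
)

# Same ranges as the nested loop: values 1 to 12 for both parameters.
param_grid = {
    "max_depth": list(range(1, 13)),
    "min_samples_leaf": list(range(1, 13)),
}

# 5-fold cross-validation on the training split; the test split is not used for selection.
search = GridSearchCV(
    DecisionTreeClassifier(min_samples_split=10, random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Cross-validated accuracy:", search.best_score_)
print("Held-out test accuracy:", search.score(X_test, y_test))
```

Because the test split plays no part in choosing the parameters, the final test accuracy is a fairer estimate of how the tuned tree would perform on unseen passengers.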


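From the results above, the best configuration reached a test accuracy of 87.7% using a single Decision Tree classifier, up from 84.4% for the unconstrained tree. Note that the training accuracy printed above (85.0%) belongs to the last configuration tried, and that choosing hyperparameters by test accuracy makes the 87.7% figure an optimistic estimate. This model uses only one method; other approaches, such as ensemble methods, may produce more accurate predictions.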
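
As one example of an ensemble method, the sketch below trains a random forest, which averages the votes of many decision trees. It is an illustration only: the synthetic data and the choice of `RandomForestClassifier` with 100 trees are assumptions, not part of the original project, and no claim is made about the score it would reach on the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption); replace with `features` and `outcome`.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=40
)

# A forest of 100 decision trees, each fit on a bootstrap sample of the training data.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

forest_test_accuracy = accuracy_score(y_test, forest.predict(X_test))
print("Random forest test accuracy:", forest_test_accuracy)
```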