# Executive Summary
In this project, we explore a dataset about the passengers of the Titanic. The purpose of the project is to practice machine learning skills by building a model that predicts whether a passenger survived. Because the target has two outcomes, this is a classification problem, and it lets us practice data cleaning, visualization, feature engineering, and model building. Based on the performance of the 9 machine learning models we trained, we recommend the Decision Tree Classifier because it has the highest AUC score. Our process and further discussion are explained in the sections below.

In [18]:
#Before starting, we import all the necessary packages
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, neural_network
from sklearn.metrics import precision_score, auc, roc_curve

# Exploratory Data Analysis (EDA)

First, we obtain the Titanic dataset from Kaggle (https://www.kaggle.com/c/titanic) and explore what it looks like.

#### Data Description
<li> PassengerId - id number assigned to the passenger
<li> Survived - whether the passenger survived, where 1 = Yes, 0 = No
<li> Pclass - ticket class, where 1 = 1st, 2 = 2nd, 3 = 3rd
<li> Name - name of the passenger
<li> Sex - sex of the passenger
<li> Age - age of the passenger; an estimated age is given in the form xx.5
<li> SibSp - number of siblings/spouses aboard with the passenger
<li> Parch - number of parents/children aboard with the passenger
<li> Ticket - ticket number
<li> Fare - ticket fare amount
<li> Cabin - cabin number
<li> Embarked - port where the passenger embarked, where C = Cherbourg, Q = Queenstown, S = Southampton

In [19]:
df = pd.read_csv("train.csv") 

In [20]:
#first few rows of the dataset
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [21]:
#there are 891 rows and 12 columns
df.shape

(891, 12)

In [22]:
#a snapshot of each column's data type and non-null count
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [23]:
#basic statistics of each numeric column
df.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


Now that we have a feel for the dataset, we check whether it contains missing values. Missing values would prevent us from building our models, since many machine learning models cannot take NAs as input.

In [24]:
#there are 177 NAs in Age, 687 NAs in Cabin and 2 NAs in Embarked
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
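
To judge how serious each gap is, it can help to look at the share of missing values rather than the raw count. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the Titanic data); the same `isnull().mean()` call should apply to the real `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df; all values are made up.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isnull() gives a boolean frame; the mean of each column is its share of NAs.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

From the counts above, the shares in the real dataset work out to roughly 77% for Cabin (687 of 891), 20% for Age (177 of 891) and 0.2% for Embarked (2 of 891).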

Next, we look at how different passenger attributes relate to survival.

In [25]:
#passengers in a higher class have a higher survival rate
df[["Survived", "Pclass"]].groupby("Pclass").mean().sort_values("Survived", ascending=False)

| Pclass | Survived |
|---|---|
| 1 | 0.62963 |
| 2 | 0.472826 |
| 3 | 0.242363 |


In [26]:
#female passengers have a higher survival rate than male passengers
df[["Survived", "Sex"]].groupby("Sex").mean().sort_values("Survived", ascending=False)

| Sex | Survived |
|---|---|
| female | 0.742038 |
| male | 0.188908 |


In [27]:
#passengers who boarded at Cherbourg have the highest survival rate
df[["Survived", "Embarked"]].groupby("Embarked").mean().sort_values("Survived", ascending=False)

| Embarked | Survived |
|---|---|
| C | 0.553571 |
| Q | 0.38961 |
| S | 0.336957 |


In [28]:
#passengers with 1 or 2 siblings/spouses have the highest survival rate; it drops for 3 or more
df[["Survived", "SibSp"]].groupby("SibSp").mean().sort_values("Survived", ascending=False)

| SibSp | Survived |
|---|---|
| 1 | 0.535885 |
| 2 | 0.464286 |
| 0 | 0.345395 |
| 3 | 0.25 |
| 4 | 0.166667 |
| 5 | 0.0 |
| 8 | 0.0 |


In [29]:
#passengers with 1 to 3 parents/children have higher survival rates than those with none or with 4 or more
df[["Survived", "Parch"]].groupby("Parch").mean().sort_values("Survived", ascending=False)

| Parch | Survived |
|---|---|
| 3 | 0.6 |
| 1 | 0.550847 |
| 2 | 0.5 |
| 0 | 0.343658 |
| 5 | 0.2 |
| 4 | 0.0 |
| 6 | 0.0 |
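
The group means above do not show how many passengers are in each group, and a rate based on a handful of rows is less reliable. As a minimal sketch, the helper below reports the survival rate together with the group size. The name `survival_rate_by` and the `toy` frame are hypothetical and only for illustration.

```python
import pandas as pd

# Hypothetical toy data; in the notebook one would pass the real df instead.
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 1, 0, 0],
    "Pclass":   [1, 1, 2, 3, 3, 3],
})

def survival_rate_by(frame, column):
    """Return the survival rate and group size for each value of `column`."""
    summary = frame.groupby(column)["Survived"].agg(["mean", "count"])
    summary.columns = ["SurvivalRate", "Passengers"]
    return summary.sort_values("SurvivalRate", ascending=False)

rates = survival_rate_by(toy, "Pclass")
print(rates)
```

On the real data, a call such as `survival_rate_by(df, "SibSp")` would show whether the 0.0 rates for the largest families rest on only a few passengers.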


# Cleaning

After EDA, we understand the dataset better, so we now perform some data cleaning to prepare it for building machine learning models.
1) We will drop attributes that will not help us classify passengers and answer our question.

2) We will fill the NAs in the remaining columns.

3) We will convert the categorical "Sex" and "Embarked" columns to integers, because the models require numeric input.

4) We will bin "Age" and "Fare" and convert them from float to integer bin codes to use as model input.

### 1) Dropping unnecessary features
We will drop 4 columns that we think would not improve model performance: PassengerId, Ticket, Cabin and Name.

In [30]:
#Drop PassengerId, Ticket, Cabin and Name, which we do not use as features
df = df.drop(["PassengerId", "Ticket", "Cabin", "Name"], axis=1)

### 2) Filling NAs
There are NAs in the Age and Embarked columns. We fill them using different statistics:
<li> for Age, we use the median, because it is less sensitive to outliers than the mean
<li> for Embarked, we use the mode, because the most frequent port of embarkation is the most likely value for a passenger whose port is unknown.

In [31]:
#fill Age's NA with median
df["Age"] = df.Age.fillna(df.Age.median())

In [32]:
#fill Embarked's NAs with the mode (most frequent port)
df["Embarked"] = df.Embarked.fillna(df.Embarked.mode()[0])
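
The preference for the median over the mean when filling Age can be seen in a small example. The ages below are made up for illustration; adding one large value moves the mean much more than the median.

```python
from statistics import mean, median

# Made-up ages; the appended value plays the role of an outlier.
ages = [22, 26, 28, 35, 38]
ages_with_outlier = ages + [80]

# Shift caused by the outlier in each statistic.
mean_shift = mean(ages_with_outlier) - mean(ages)
median_shift = median(ages_with_outlier) - median(ages)

print(median(ages), median(ages_with_outlier))  # 28 31.5
print(mean_shift > median_shift)                # True
```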

In [33]:
#there are no more NAs in df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Sex         891 non-null object
Age         891 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
Embarked    891 non-null object
dtypes: float64(2), int64(4), object(2)
memory usage: 55.8+ KB


### 3) Converting categorical columns to numeric for the machine learning models
<li> create a "Female" column that is 1 if the passenger is female and 0 otherwise, then drop the "Sex" column.
<li> change "S", "C" and "Q" in "Embarked" to the integers 1, 2 and 3.

In [34]:
#Change Sex -> Female (1,0)
df["Female"] = df["Sex"].apply(lambda x : 1 if x == "female" else 0)

In [35]:
#Drop Sex column
df = df.drop("Sex", axis=1)

In [36]:
#Change Embarked to 1, 2, 3
df["Embarked"] = df["Embarked"].apply(lambda x : 1 if x == "S" 
                                            else 2 if x == "C" else 3)
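
As an alternative sketch of the same encoding, a dictionary passed to `Series.map` keeps the code-to-port mapping in one place. The `embarked` series and `port_codes` name below are illustrative assumptions. Unlike the nested lambda, `map` returns NaN for any value missing from the dictionary instead of silently coding it as 3. The second part shows one-hot encoding with `pd.get_dummies`, which avoids implying an order among the ports; it is not what this notebook uses.

```python
import pandas as pd

# Toy Embarked column; the mapping mirrors the codes used in this notebook.
embarked = pd.Series(["S", "C", "Q", "S"])

port_codes = {"S": 1, "C": 2, "Q": 3}
encoded = embarked.map(port_codes)
print(encoded.tolist())  # [1, 2, 3, 1]

# One-hot alternative: one 0/1 indicator column per port.
one_hot = pd.get_dummies(embarked, prefix="Embarked")
print(one_hot.columns.tolist())
```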

In [38]:
#new look of the df
df.head()

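| | Survived | Pclass | Age | SibSp | Parch | Fare | Embarked | Female |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 1 | 0 | 7.25 | 1 | 0 |
| 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 2 | 1 |
| 2 | 1 | 3 | 26.0 | 0 | 0 | 7.925 | 1 | 1 |
| 3 | 1 | 1 | 35.0 | 1 | 0 | 53.1 | 1 | 1 |
| 4 | 0 | 3 | 35.0 | 0 | 0 | 8.05 | 1 | 0 |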


### 4) Changing float to integer for the machine learning models' input
<li> create a new column "AgeBin", inspect the boundaries of each age bin, then replace the original "Age" column with integer bin codes.
<li> apply the same process to the "Fare" column.

In [39]:
#Create a new column for different Age bins
df["AgeBin"] = pd.cut(df.Age, 5)

In [40]:
#Inspect the range and survival rate of each AgeBin
df[["Survived", "AgeBin"]].groupby("AgeBin").mean()

| AgeBin | Survived |
|---|---|
| (0.34, 16.336] | 0.55 |
| (16.336, 32.252] | 0.344168 |
| (32.252, 48.168] | 0.404255 |
| (48.168, 64.084] | 0.434783 |
| (64.084, 80.0] | 0.090909 |


In [41]:
#Apply 0-4 for Age
df["Age"] = pd.cut(df.Age, 5, labels=False)

In [42]:
df = df.drop("AgeBin", axis=1)

In [48]:
#Create a new column for different Fare bins
df["FareBin"] = pd.cut(df.Fare, 4)

In [49]:
#Inspect the range and survival rate of each FareBin
df[["Survived", "FareBin"]].groupby("FareBin").mean()

| FareBin | Survived |
|---|---|
| (-0.512, 128.082] | 0.368113 |
| (128.082, 256.165] | 0.724138 |
| (256.165, 384.247] | 0.666667 |
| (384.247, 512.329] | 1.0 |


In [50]:
#Apply 0-3 for Fare
df["Fare"] = pd.cut(df.Fare, 4, labels= False)

In [51]:
df = df.drop("FareBin", axis=1)
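
`pd.cut` with an integer creates equal-width bins, so a skewed column such as Fare (75% of fares are at or below 31.0, while the maximum is 512.33) ends up mostly in the first bin. The sketch below, on made-up fares, contrasts this with `pd.qcut`, which creates equal-frequency bins. It illustrates an alternative and is not the method used in this notebook.

```python
import pandas as pd

# Made-up, right-skewed fares: most are small, one is very large.
fares = pd.Series([7.0, 8.0, 8.5, 10.0, 13.0, 15.0, 26.0, 500.0])

# Equal-width bins: the range 7 to 500 is split into 4 intervals of equal length.
equal_width = pd.cut(fares, 4, labels=False)

# Equal-frequency bins: edges are the quartiles, so each bin holds 2 values here.
equal_freq = pd.qcut(fares, 4, labels=False)

print(equal_width.tolist())  # [0, 0, 0, 0, 0, 0, 0, 3]
print(equal_freq.tolist())   # [0, 0, 1, 1, 2, 2, 3, 3]
```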

# Feature Engineering

After cleaning, the "SibSp" and "Parch" columns still hold useful information. We could train our models on them directly, but it may be better to perform feature engineering and create a new feature from these two family-relationship columns. As shown earlier, passengers traveling with a small number of siblings/spouses or parents/children had higher survival rates than passengers traveling with none. We therefore combine the two columns into a single feature that captures this information and train our models on it. We call this column "WithFamily"; it is 1 if the passenger travels with at least one family member, and 0 otherwise.

In [52]:
#Create a new column named WithFamily to combine SibSp and Parch
df["WithFamily"] = df["SibSp"] + df["Parch"]

In [53]:
#Change Withfamily to 1 if x > 0 else 0
df["WithFamily"] = df["WithFamily"].apply(lambda x : 1 if x > 0 else 0)

In [54]:
#drop SibSp and Parch
df = df.drop(["SibSp", "Parch"], axis=1)
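
The two steps that build "WithFamily" can also be written as one vectorized expression without `apply`. This is a sketch on a made-up frame (`toy` is hypothetical) that should give the same 1/0 flag.

```python
import pandas as pd

# Toy rows; the column names follow the notebook.
toy = pd.DataFrame({"SibSp": [1, 0, 0, 3], "Parch": [0, 0, 2, 1]})

# The comparison yields booleans; astype(int) turns them into 1/0.
toy["WithFamily"] = ((toy["SibSp"] + toy["Parch"]) > 0).astype(int)
print(toy["WithFamily"].tolist())  # [1, 0, 1, 1]
```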

In [57]:
#this is the final look of our df
df.head()

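| | Survived | Pclass | Age | Fare | Embarked | Female | WithFamily |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 1 | 2 | 0 | 2 | 1 | 1 |
| 2 | 1 | 3 | 1 | 0 | 1 | 1 | 0 |
| 3 | 1 | 1 | 2 | 0 | 1 | 1 | 1 |
| 4 | 0 | 3 | 2 | 0 | 1 | 0 | 0 |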


# Machine Learning

Finally, we arrive at the fun part: machine learning! In this section, we split our data 80/20: 80% for training and 20% for testing. We use 9 classification models to see which one performs best at classifying whether a passenger survived. The metrics we use to evaluate performance are Precision Score and Area Under the Curve (AUC) Score. We don't use Accuracy Score because it can be misleading when the classes are imbalanced. For example, if 90% of the passengers in a dataset survived, a model that always predicts "survived" is correct 90% of the time without having learned anything about who survives.
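
The accuracy example can be checked with a few lines of plain Python. The labels below are made up to match the 90/10 split in the text, and the "model" is a hypothetical baseline that always predicts the majority class.

```python
# Made-up labels: 90 survivors (1) and 10 non-survivors (0).
y_true = [1] * 90 + [0] * 10

# A baseline that ignores its input and always predicts "survived".
y_pred = [1] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Share of actual non-survivors that the baseline identifies (recall for class 0).
caught = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
recall_non_survivors = caught / y_true.count(0)

print(accuracy, recall_non_survivors)  # 0.9 0.0
```

In this training set about 38% of passengers survived (the mean of Survived is 0.383838), so always predicting "did not survive" would score about 62% accuracy.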

In [58]:
X = df.drop("Survived", axis=1)

y = df.Survived

In [59]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=3)
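
A plain random split does not guarantee that the share of survivors is the same in the training and test parts. As a sketch of one option, `train_test_split` accepts a `stratify` argument that preserves the class ratio. The arrays below are made up, and this is not the split used for the results in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 40% positives, one dummy feature.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 40 + [0] * 60)

# stratify=y_toy keeps the class ratio the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=3, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```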

In [60]:
clfs = [
   linear_model.LogisticRegressionCV(),
   neighbors.KNeighborsClassifier(n_neighbors=6),
   svm.SVC(),
   naive_bayes.GaussianNB(),
   tree.DecisionTreeClassifier(max_depth=5),
   ensemble.RandomForestClassifier(max_depth=5, n_estimators=10),
   discriminant_analysis.QuadraticDiscriminantAnalysis(),
   discriminant_analysis.LinearDiscriminantAnalysis(),
   neural_network.MLPClassifier()
]

In [61]:
# Fit each classifier and record its Precision and AUC on the test set
rows = []
for clf in clfs:
   # predict() returns hard 0/1 labels, so the ROC curve below is built from labels
   y_pred = clf.fit(X_train, y_train).predict(X_test)
   fp, tp, th = roc_curve(y_test, y_pred)
   rows.append({
      "CLF Name": clf.__class__.__name__,
      "Precision": precision_score(y_test, y_pred),
      "AUC": auc(fp, tp),
   })

# One row per classifier, best AUC first
CLF_compare = pd.DataFrame(rows).sort_values(by="AUC", ascending=False)
CLF_compare

| | CLF Name | Precision | AUC |
|---|---|---|---|
| 4 | DecisionTreeClassifier | 0.816667 | 0.799541 |
| 1 | KNeighborsClassifier | 0.803279 | 0.794954 |
| 5 | RandomForestClassifier | 0.753623 | 0.793447 |
| 2 | SVC | 0.818182 | 0.775557 |
| 8 | MLPClassifier | 0.712329 | 0.775098 |
| 3 | GaussianNB | 0.689189 | 0.758781 |
| 7 | LinearDiscriminantAnalysis | 0.6625 | 0.754718 |
| 6 | QuadraticDiscriminantAnalysis | 0.723077 | 0.753145 |
| 0 | LogisticRegressionCV | 0.638554 | 0.740957 |


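From the table above, the best of the 9 models by AUC is the Decision Tree Classifier (AUC 0.800), with K-Nearest Neighbors (0.795) and Random Forest (0.793) close behind. SVC has a marginally higher Precision (0.818 versus 0.817), but a lower AUC.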
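
The AUC values in this notebook are computed from hard 0/1 predictions, which gives a ROC curve with a single operating point. AUC is more commonly computed from continuous scores such as predicted probabilities. The sketch below, on made-up labels and scores, shows that the two can differ; it does not reproduce the notebook's results.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Made-up labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.3, 0.6, 0.4, 0.7, 0.9])

# Hard 0/1 predictions at a 0.5 threshold, similar to what predict() returns.
y_hard = (y_score >= 0.5).astype(int)

# AUC from hard labels: the curve passes through one point besides the corners.
fpr, tpr, _ = roc_curve(y_true, y_hard)
auc_from_labels = float(auc(fpr, tpr))

# AUC over all thresholds, using the scores themselves.
auc_from_scores = float(roc_auc_score(y_true, y_score))

print(round(auc_from_labels, 3), round(auc_from_scores, 3))  # 0.667 0.889
```

With the classifiers above, scores would typically come from `clf.predict_proba(X_test)[:, 1]`. Note that `svm.SVC()` only provides `predict_proba` when constructed with `probability=True`, although `decision_function` is available.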