## Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn import tree
from time import time

## Viewing the dataset

Let's take a look at the dataset itself.

In [2]:
data_raw = pd.read_csv("datasets/titanic_train.csv", index_col='PassengerId')
data_validate = pd.read_csv("datasets/titanic_test.csv", index_col='PassengerId')
data_raw.sample(10)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
134,1,2,"Weisz, Mrs. Leopold (Mathilde Francoise Pede)",female,29.0,1,0,228414,26.0,,S
151,0,2,"Bateman, Rev. Robert James",male,51.0,0,0,S.O.P. 1166,12.525,,S
804,1,3,"Thomas, Master. Assad Alexander",male,0.42,0,1,2625,8.5167,,C
313,0,2,"Lahtinen, Mrs. William (Anna Sylfven)",female,26.0,1,1,250651,26.0,,S
38,0,3,"Cann, Mr. Ernest Charles",male,21.0,0,0,A./5. 2152,8.05,,S
64,0,3,"Skoog, Master. Harald",male,4.0,3,2,347088,27.9,,S
596,0,3,"Van Impe, Mr. Jean Baptiste",male,36.0,1,1,345773,24.15,,S
544,1,2,"Beane, Mr. Edward",male,32.0,1,0,2908,26.0,,S
145,0,2,"Andrew, Mr. Edgardo Samuel",male,18.0,0,0,231945,11.5,,S
696,0,2,"Chapman, Mr. Charles Henry",male,52.0,0,0,248731,13.5,,S


## Cleaning and Wrangling the Data

We see that there are 891 entries in the dataset and 12 columns including the PassengerId as the index.

Of the 891 entries, 687 have a null value for Cabin, so there isn't much we can do with the cabin information.
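Counts like these can be read off with `DataFrame.shape` and `isnull().sum()`. The following is a minimal sketch on a tiny hand-made frame: `toy` is a hypothetical stand-in for `data_raw`, and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data_raw; the values are invented for illustration.
toy = pd.DataFrame(
    {
        "Survived": [1, 0, 1, 0],
        "Age": [29.0, np.nan, 0.42, 26.0],
        "Cabin": [np.nan, np.nan, "C85", np.nan],
    },
    index=pd.Index([1, 2, 3, 4], name="PassengerId"),
)

# Rows and columns; the index (PassengerId) is not counted as a column here.
n_rows, n_cols = toy.shape

# Number of missing values in each column.
null_counts = toy.isnull().sum()
print(n_rows, n_cols)
print(null_counts)
```

On the real data, `data_raw.info()` gives the same information in a single call.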

In addition, the values in the Ticket and Fare columns look more or less arbitrary, so we treat them as uninformative. Furthermore, PassengerId is only a unique identifier and will not affect our model.

While it is possible to reduce each Name to its title alone (Mr., Mrs., Master. and so on), I believe it is not needed.

So all of these columns are dropped.
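The drop step might look like the sketch below. It assumes the columns are removed with `DataFrame.drop`; the `toy` frame and the `dropped` list are hypothetical names standing in for the notebook's own variables.

```python
import pandas as pd

# Hypothetical stand-in for data_raw with the same kinds of columns.
toy = pd.DataFrame(
    {
        "Survived": [1, 0],
        "Pclass": [2, 3],
        "Name": ["A, Mrs. X", "B, Mr. Y"],
        "Sex": ["female", "male"],
        "Age": [29.0, 21.0],
        "Ticket": ["228414", "2625"],
        "Fare": [26.0, 8.05],
        "Cabin": [None, None],
    },
    index=pd.Index([134, 38], name="PassengerId"),
)

# PassengerId is the index, not a column, so it is not listed here
# and does not end up among the feature columns.
dropped = ["Cabin", "Ticket", "Fare", "Name"]
cleaned = toy.drop(columns=dropped)
print(list(cleaned.columns))
```

The same `drop` call would have to be applied to `data_validate` so that both frames keep identical columns.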


We note that 177 entries for Age are missing. Instead of deleting these rows completely, we fill the missing ages with the median age. We choose the median over the mean because the data contain both babies (whose Age is a fraction less than one) and very old people, and such extremes might skew the mean.
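A small sketch of the median fill, using `Series.median` and `Series.fillna`. The ages below are made up to show how one very old passenger pulls the mean up while the median stays put.

```python
import numpy as np
import pandas as pd

# Made-up ages: one baby, one very old passenger, two missing values.
ages = pd.Series([0.42, 4.0, 21.0, 26.0, 80.0, np.nan, np.nan], name="Age")

# Both statistics skip NaN by default.
median_age = ages.median()
mean_age = ages.mean()

# Replace every missing age with the median.
filled = ages.fillna(median_age)
print(median_age, mean_age)
```

Here the median is 21.0 while the mean is above 26, which is the skew the paragraph above is worried about.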

## Splitting up the data

We can now split the data into the labels and features.
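One way the split could be written is sketched below. The rows reuse the Survived, Pclass, Sex and Age values from the sample shown earlier. The integer encoding of Sex, `test_size=0.2` and `random_state=42` are assumptions, not values confirmed by the notebook. `GaussianNB` needs numeric input, so text columns have to be encoded one way or another.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten rows taken from the sample printed earlier (remaining columns dropped).
toy = pd.DataFrame(
    {
        "Survived": [1, 0, 1, 0, 0, 0, 0, 1, 0, 0],
        "Pclass": [2, 2, 3, 2, 3, 3, 3, 2, 2, 2],
        "Sex": ["female", "male", "male", "female", "male",
                "male", "male", "male", "male", "male"],
        "Age": [29.0, 51.0, 0.42, 26.0, 21.0, 4.0, 36.0, 32.0, 18.0, 52.0],
    }
)

# Assumption: encode the text column as integers so the classifiers accept it.
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})

# Labels are the target column; features are everything else.
labels = toy["Survived"]
features = toy.drop(columns="Survived")

# Assumption: hold out 20% of the rows for testing, with a fixed seed.
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)
print(len(features_train), len(features_test))
```

The four resulting names match the ones used by the classifier cells in the following sections.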

## Applying Naive Bayes

In [21]:
nb_classifier = GaussianNB()

In [22]:
t0 = time()
nb_classifier.fit(features_train, labels_train)
print("Training Time: ", time()-t0, "s.", sep='')

Training Time: 0.011408567428588867s.


In [23]:
t1 = time()
nb_pred = nb_classifier.predict(features_test)
print("Testing Time: ", time()-t1, "s.", sep='')

Testing Time: 0.0057599544525146484s.


In [24]:
print("Accuracy: ", accuracy_score(labels_test, nb_pred), ".", sep='')

Accuracy: 0.787709497207.


## Using a Decision Tree

In [25]:
dt_classifier = tree.DecisionTreeClassifier(min_samples_split=40)

In [26]:
t0 = time()
dt_classifier.fit(features_train, labels_train)
print("Training Time: ", round(time() - t0), "s")

Training Time:  0 s


In [27]:
t1 = time()
dt_prediction = dt_classifier.predict(features_test)
print("Prediction Time: ", round(time() - t1), "s")

Prediction Time:  0 s
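Both decision tree timings print 0 because `round()` with no second argument rounds to whole seconds, and these calls finish in a fraction of a second. A sketch of a finer-grained measurement with `time.perf_counter` follows; `timed` is a hypothetical helper, not part of the notebook.

```python
from time import perf_counter


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = perf_counter()
    result = fn(*args, **kwargs)
    return result, perf_counter() - start


# Any callable works; summing a range stands in for fit() or predict().
total, elapsed = timed(sum, range(100_000))
print(f"Elapsed: {elapsed:.6f} s")
```

With the classifier above, `timed(dt_classifier.predict, features_test)` would return the predictions together with the elapsed time.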


In [28]:
print(accuracy_score(labels_test, dt_prediction))

0.826815642458
