Titanic - Machine Learning from Disaster

The following imports are my own implementations of the classification algorithms, written in Python and NumPy.

In [1]:
from MLAlgorithms.Supervised.Classification.knnclassifier import *
from MLAlgorithms.Supervised.Classification.logisticregression import *

Import numpy and pandas

In [2]:
import numpy as np
import pandas as pd

In [3]:
titanic_train = pd.read_csv('titanic/train.csv')
titanic_test = pd.read_csv('titanic/test.csv')
titanic_gender_submission = pd.read_csv('titanic/gender_submission.csv')['Survived']

In [4]:
titanic_train.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
titanic_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


We can apply feature engineering to the categorical variables in the data

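Drop the columns that will not be used as features (PassengerId, Name, Ticket, Cabin and Embarked). The target, Survived, is stored separately first and then dropped as well.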

In [6]:
titanic_y_train = titanic_train['Survived']

In [7]:
dropped_cols = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'Embarked', 'Survived']

In [8]:
titanic_train.drop(dropped_cols, axis=1, inplace=True)

In [9]:
cat_variables = ['Sex',
'Pclass',
]
titanic_train = pd.get_dummies(data=titanic_train, prefix=cat_variables, columns=cat_variables, dtype=int)
titanic_train.head()

,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Pclass_1,Pclass_2,Pclass_3
0,22.0,1,0,7.25,0,1,0,0,1
1,38.0,1,0,71.2833,1,0,1,0,0
2,26.0,0,0,7.925,1,0,0,0,1
3,35.0,1,0,53.1,1,0,1,0,0
4,35.0,0,0,8.05,0,1,0,0,1
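
To make the effect of pd.get_dummies in the cell above easier to see, here is a small sketch on a three-row toy frame. The `toy` frame and its values are made up for the example; only the column names mirror the notebook.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the notebook.
toy = pd.DataFrame({"Sex": ["male", "female", "female"],
                    "Pclass": [3, 1, 3],
                    "Fare": [7.25, 71.28, 7.92]})

cat_cols = ["Sex", "Pclass"]
encoded = pd.get_dummies(toy, prefix=cat_cols, columns=cat_cols, dtype=int)

# Each observed category becomes its own 0/1 column; other columns are kept.
print(encoded.columns.tolist())
print(encoded["Sex_male"].tolist())
```

Only categories that actually occur in the data get a column, so this toy frame produces no Pclass_2 column.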


In [10]:
titanic_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Age         714 non-null    float64
 1   SibSp       891 non-null    int64  
 2   Parch       891 non-null    int64  
 3   Fare        891 non-null    float64
 4   Sex_female  891 non-null    int32  
 5   Sex_male    891 non-null    int32  
 6   Pclass_1    891 non-null    int32  
 7   Pclass_2    891 non-null    int32  
 8   Pclass_3    891 non-null    int32  
dtypes: float64(2), int32(5), int64(2)
memory usage: 45.4 KB


The info output shows that Age is the only remaining column with missing (NaN) values: 714 of its 891 entries are non-null. We handle these first by filling the missing ages with the mean age, which describe() reports as about 29.7 and which we round to 30.

In [11]:
titanic_train.describe()

,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Pclass_1,Pclass_2,Pclass_3
count,714.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,29.699118,0.523008,0.381594,32.204208,0.352413,0.647587,0.242424,0.20651,0.551066
std,14.526497,1.102743,0.806057,49.693429,0.47799,0.47799,0.42879,0.405028,0.497665
min,0.42,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,20.125,0.0,0.0,7.9104,0.0,0.0,0.0,0.0,0.0
50%,28.0,0.0,0.0,14.4542,0.0,1.0,0.0,0.0,1.0
75%,38.0,1.0,0.0,31.0,1.0,1.0,0.0,0.0,1.0
max,80.0,8.0,6.0,512.3292,1.0,1.0,1.0,1.0,1.0


In [12]:
titanic_train['Age'] = titanic_train['Age'].fillna(value=30)
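
The constant 30 is the mean age from describe() rounded by hand. As a sketch of the same idea without a hard-coded value, the fill value can be computed from the column itself. The `ages` series below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries.
ages = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan])

age_mean = ages.mean()           # NaN entries are skipped by default
filled = ages.fillna(age_mean)   # replace every NaN with the mean

print(age_mean)
print(filled.isna().sum())
```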

In [13]:
titanic_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Age         891 non-null    float64
 1   SibSp       891 non-null    int64  
 2   Parch       891 non-null    int64  
 3   Fare        891 non-null    float64
 4   Sex_female  891 non-null    int32  
 5   Sex_male    891 non-null    int32  
 6   Pclass_1    891 non-null    int32  
 7   Pclass_2    891 non-null    int32  
 8   Pclass_3    891 non-null    int32  
dtypes: float64(2), int32(5), int64(2)
memory usage: 45.4 KB


Age and Fare span much larger ranges than the other features (Age goes up to 80 and Fare up to about 512, while the dummy columns are only 0 or 1), so scaling will be useful. Let us first apply the same feature engineering to our test data.

In [14]:
titanic_test.head()

,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [15]:
titanic_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.1+ KB


Both Age and Fare have NaN values in the test data, so we need to fill them with the mean of each feature.

In [16]:
dropped_cols = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'Embarked']
titanic_test.drop(dropped_cols, axis=1, inplace=True)

In [17]:
titanic_test.head()

,Pclass,Sex,Age,SibSp,Parch,Fare
0,3,male,34.5,0,0,7.8292
1,3,female,47.0,1,0,7.0
2,2,male,62.0,0,0,9.6875
3,3,male,27.0,0,0,8.6625
4,3,female,22.0,1,1,12.2875


The test data also has missing values, which we handle in the same way.

In [18]:
titanic_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Pclass  418 non-null    int64  
 1   Sex     418 non-null    object 
 2   Age     332 non-null    float64
 3   SibSp   418 non-null    int64  
 4   Parch   418 non-null    int64  
 5   Fare    417 non-null    float64
dtypes: float64(2), int64(3), object(1)
memory usage: 19.7+ KB


In [19]:
titanic_test.describe()

,Pclass,Age,SibSp,Parch,Fare
count,418.0,332.0,418.0,418.0,417.0
mean,2.26555,30.27259,0.447368,0.392344,35.627188
std,0.841838,14.181209,0.89676,0.981429,55.907576
min,1.0,0.17,0.0,0.0,0.0
25%,1.0,21.0,0.0,0.0,7.8958
50%,3.0,27.0,0.0,0.0,14.4542
75%,3.0,39.0,1.0,0.0,31.5
max,3.0,76.0,8.0,9.0,512.3292


In [20]:
titanic_test['Age'] = titanic_test['Age'].fillna(value=30)
titanic_test['Fare'] = titanic_test['Fare'].fillna(value=35)

In [21]:
titanic_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Pclass  418 non-null    int64  
 1   Sex     418 non-null    object 
 2   Age     418 non-null    float64
 3   SibSp   418 non-null    int64  
 4   Parch   418 non-null    int64  
 5   Fare    418 non-null    float64
dtypes: float64(2), int64(3), object(1)
memory usage: 19.7+ KB


In [22]:
titanic_test = pd.get_dummies(data=titanic_test, prefix=cat_variables, columns=cat_variables, dtype=int)
titanic_test.head()

,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Pclass_1,Pclass_2,Pclass_3
0,34.5,0,0,7.8292,0,1,0,0,1
1,47.0,1,0,7.0,1,0,0,0,1
2,62.0,0,0,9.6875,0,1,0,1,0
3,27.0,0,0,8.6625,0,1,0,0,1
4,22.0,1,1,12.2875,1,0,0,0,1
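
Here the test frame happens to produce the same nine columns as the training frame. If a category were missing from one of the two sets, the column lists would differ. One possible guard, sketched on made-up frames (`train_enc` and `test_enc` are hypothetical names, not part of the notebook), is to reindex the test frame to the training columns.

```python
import pandas as pd

train_enc = pd.get_dummies(pd.DataFrame({"Pclass": [1, 2, 3]}),
                           columns=["Pclass"], dtype=int)
# This made-up test set contains no second-class passengers.
test_enc = pd.get_dummies(pd.DataFrame({"Pclass": [3, 1]}),
                          columns=["Pclass"], dtype=int)

# Add any missing dummy columns as zeros and match the training column order.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned.columns.tolist())
```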


In [23]:
titanic_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Age         418 non-null    float64
 1   SibSp       418 non-null    int64  
 2   Parch       418 non-null    int64  
 3   Fare        418 non-null    float64
 4   Sex_female  418 non-null    int32  
 5   Sex_male    418 non-null    int32  
 6   Pclass_1    418 non-null    int32  
 7   Pclass_2    418 non-null    int32  
 8   Pclass_3    418 non-null    int32  
dtypes: float64(2), int32(5), int64(2)
memory usage: 21.4 KB


Before our algorithms can fit the data, we must first convert it to NumPy arrays.

In [24]:
titanic_X_train = np.array(titanic_train)
titanic_y_train = np.array(titanic_y_train)

titanic_X_test = np.array(titanic_test)
titanic_y_test = np.array(titanic_gender_submission)
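
As a side note, pandas also provides DataFrame.to_numpy(), which gives the same result as np.array for a frame like this one. A small sketch on a made-up frame, which also shows that mixed integer and float columns end up in a single float64 array:

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame with one float column and one integer column.
toy = pd.DataFrame({"Age": [22.0, 38.0], "SibSp": [1, 1]})

a = np.array(toy)     # conversion used in the notebook
b = toy.to_numpy()    # equivalent pandas method

print(a.shape, a.dtype)
print(np.array_equal(a, b))
```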

Let us check the size and dimension of the data

In [25]:
print(titanic_X_train.shape)
print(titanic_X_test.shape)
print(titanic_y_train.shape)
print(titanic_y_test.shape)

(891, 9)
(418, 9)
(891,)
(418,)


In [26]:
slice_index = 600
model = KNNClassifier(titanic_X_train, titanic_y_train, K=5)
predictions, y_true = model.slice_cv(slice_index) #Divide the training data for cross-validation
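
The source of KNNClassifier is not shown in this notebook. As a rough sketch of the general technique (not the author's implementation), a k-nearest-neighbours prediction with Euclidean distance and a majority vote can be written in NumPy as below. The function `knn_predict` and the toy points are hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Predict 0/1 labels by majority vote among the k nearest training points."""
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    # Indices of the k closest training points for each query row.
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Mean of the neighbours' 0/1 labels; above 0.5 means class 1 wins the vote.
    votes = y_train[nearest].mean(axis=1)
    return (votes > 0.5).astype(int)

# Made-up data: three points near the origin (class 0), three near (1, 1) (class 1).
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_query = np.array([[0.05, 0.05], [1.0, 0.95]])

print(knn_predict(X_train, y_train, X_query, k=3))
```

In this sketch a tied vote (possible when k is even) falls to class 0; a real implementation may break ties differently.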

In [27]:
model.accuracy(y_true)

86.5979381443299

In [28]:
count, values = model.error_count(y_true) #Count the number of misclassified points, and return both the count and values
print(count)
print(values['Actual'])
print(values['Predicted'])

39
[1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 0 0 1 0 0 1 0 1 0 1 0 0 0 1 1 0 1 0 0 1 1 0
 0 1]
[0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 1 0 1 1 0 1 0 1 0 1 1 1 0 0 1 0 1 1 0 0 1
 1 0]
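
The error count and the accuracy reported above fit together if slice_cv(600) holds out the rows from index 600 onward for validation. That is an assumption, since the method's code is not shown, but the arithmetic is consistent with it:

```python
n_rows = 891        # training rows, from the info() output
slice_index = 600   # value passed to slice_cv
errors = 39         # misclassified points reported by error_count

n_val = n_rows - slice_index                 # rows assumed to be held out
accuracy = (n_val - errors) / n_val * 100    # percentage classified correctly

print(n_val, accuracy)
```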


In [29]:
predictions = model.predict(titanic_X_test)

In [30]:
model.accuracy(titanic_y_test)

71.77033492822966

In [31]:
model = KNNClassifier(titanic_X_train, titanic_y_train, K=5)

In [32]:
CV_k = model.kfold_cv(9) #Apply K-fold Cross-validation for K = 9
CV_k

0.7059483726150392
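
kfold_cv is a method of the author's own class and its code is not shown. As a generic sketch of k-fold cross-validation, assuming contiguous folds and accuracy as the score, the procedure could look like this. `kfold_accuracy`, `nearest_neighbour` and the data are hypothetical.

```python
import numpy as np

def kfold_accuracy(X, y, n_folds, predict_fn):
    """Average validation accuracy over n_folds contiguous folds."""
    folds = np.array_split(np.arange(len(X)), n_folds)
    scores = []
    for val_idx in folds:
        train_mask = np.ones(len(X), dtype=bool)
        train_mask[val_idx] = False          # hold this fold out
        preds = predict_fn(X[train_mask], y[train_mask], X[val_idx])
        scores.append(np.mean(preds == y[val_idx]))
    return float(np.mean(scores))

def nearest_neighbour(X_train, y_train, X_query):
    """1-nearest-neighbour on the first feature, used only as a stand-in model."""
    dists = np.abs(X_query[:, None, 0] - X_train[None, :, 0])
    return y_train[np.argmin(dists, axis=1)]

# Made-up, well-separated 1-D data: class 0 near 0, class 1 near 10.
X = np.array([[0.0], [10.0], [0.5], [10.5], [1.0], [11.0]])
y = np.array([0, 1, 0, 1, 0, 1])

print(kfold_accuracy(X, y, n_folds=3, predict_fn=nearest_neighbour))
```

Setting n_folds equal to the number of rows turns the same loop into leave-one-out cross-validation.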

In [33]:
CV_n = model.loocv()  # Apply Leave-One-Out Cross-Validation
CV_n

0.7361111111111145

In [34]:
model = KNNClassifier(titanic_X_train, titanic_y_train, K=5)

In [35]:
model.adapt(which='zscore')  # Learn the mean, std etc. of the training data and scale the training data
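
The adapt method belongs to the author's class and its code is not shown. As a sketch of z-score scaling in general, assuming a per-column mean and standard deviation learned from the training data and reused for new data, the computation in NumPy could look like this. All arrays are made up.

```python
import numpy as np

# Made-up feature matrix: column 0 on a small scale, column 1 on a large one.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_new = np.array([[2.0, 400.0]])

mean = X_train.mean(axis=0)   # per-column mean of the training data
std = X_train.std(axis=0)     # per-column (population) standard deviation

X_train_scaled = (X_train - mean) / std
X_new_scaled = (X_new - mean) / std   # reuse the training statistics

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates the distance computation.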

In [36]:
predictions, y_true = model.slice_cv(slice_index)

In [37]:
model.accuracy(y_true)

89.69072164948454

In [38]:
CV_k = model.kfold_cv(9)
CV_k

0.7957351290684623

In [39]:
CV_n = model.loocv()
CV_n

0.8333333333333395

Let us increase K to roughly the square root of the number of rows in the training or test data.
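
The square roots behind this choice can be checked directly. The row counts are the ones printed by the shape check earlier in the notebook.

```python
import math

n_train, n_test = 891, 418   # row counts from the shape output

sqrt_train = math.sqrt(n_train)
sqrt_test = math.sqrt(n_test)

print(round(sqrt_train, 1), round(sqrt_test, 1))
```

The K=20 used in the next cell is close to the square root of the test-set size, while the training set would suggest a K of about 30.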

In [40]:
model = KNNClassifier(titanic_X_train, titanic_y_train, K=20)
model.adapt(which='zscore') #Learn the mean, std etc of the training data and scale the training data
scaled_X_test = model.zscore_norm(titanic_X_test) #Use the mean and std to also scale the test data
predictions = model.predict(scaled_X_test)
model.accuracy(titanic_y_test)

94.49760765550239

In [41]:
count, values = model.error_count(titanic_y_test) #Count the number of misclassified points, and return both the count and values
print(count)
print(values['Actual'])
print(values['Predicted'])

23
[1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 1 1 1]
[0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0]


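Increasing K to 80 raises this figure to about 98.6%. Keep in mind that titanic_y_test comes from gender_submission.csv, Kaggle's sample submission that marks every female passenger as a survivor, so the score measures agreement with that baseline rather than accuracy on the true test labels.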

In [42]:
model = KNNClassifier(titanic_X_train, titanic_y_train, K=80)
model.adapt(which='zscore') #Learn the mean, std etc of the training data and scale the training data
scaled_X_test = model.zscore_norm(titanic_X_test) #Use the mean and std to also scale the test data
predictions = model.predict(scaled_X_test)
model.accuracy(titanic_y_test)

98.56459330143541

In [43]:
count, values = model.error_count(titanic_y_test) #Count the number of misclassified points, and return both the count and values
print(count)
print(values['Actual'])
print(values['Predicted'])

6
[0 1 0 1 1 1]
[1 0 1 0 0 0]


To check my implementation, I compare it against KNeighborsClassifier from scikit-learn with the same K and the same scaling.

In [44]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

knn = KNeighborsClassifier(n_neighbors = 80)

scaler = StandardScaler()
scaled_X_train_features = scaler.fit_transform(titanic_X_train)
scaled_X_test = scaler.transform(titanic_X_test)

knn.fit(scaled_X_train_features, titanic_y_train)
knn_predictions = knn.predict(scaled_X_test)

accuracy = np.mean(knn_predictions == titanic_y_test) * 100  # percentage of matching labels
print(accuracy)
error_count = int(np.sum(knn_predictions != titanic_y_test))  # number of mismatches
error_count

98.56459330143541


6