The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable”, sank after colliding with an iceberg. There were not enough lifeboats for everyone on board, and 1502 of the 2224 passengers and crew died.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

In [1]:
# Standard library
import math
import os
import time

# Third-party
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import Binarizer, KBinsDiscretizer, OrdinalEncoder

Processing train data

In [35]:
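def loadTrainData(filename):
    """Read the training CSV and return (dataframe, attribute names, label name)."""
    dataframe = pd.read_csv(filename, delimiter=",")
    label = 'DEFCON_Level'
    # The ID column is a row identifier, not a feature.
    dataframe = dataframe.drop(['ID'], axis=1)
    # Every remaining column except the label is an attribute.
    attributes = list(dataframe.columns.values)
    attributes.remove(label)

    return dataframe, attributes, label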


Processing test data

In [37]:
def loadTestData(filename):
    """Read the test CSV unchanged; the ID column is kept for the submission file."""
    dataframe = pd.read_csv(filename, delimiter=",")

    return dataframe


Load train data

In [25]:
trainData, attributes, label = loadTrainData('Dataset/train.csv')

# Continuous columns to discretize into 5 equal-width ordinal bins.
binned_columns = [
    'Percent_Of_Forces_Mobilized',
    'Active_Threats',
    'Inactive_Threats',
    'Citizen_Fear_Index',
    'Closest_Threat_Distance(km)',
    'Troops_Mobilized(thousands)',
]
disc = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
for col in binned_columns:
    # The discretizer is re-fit for each column, as in the original cell.
    trainData[col] = disc.fit_transform(trainData[[col]]).reshape(-1)

print(trainData['Troops_Mobilized(thousands)'].nunique())

5
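To make the binning step concrete, here is a minimal sketch of what `KBinsDiscretizer` with `strategy='uniform'` does to a single column. The input values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical values standing in for one numeric column (not real data).
values = np.array([0.0, 0.1, 0.25, 0.5, 0.75, 1.0]).reshape(-1, 1)

# Same settings as the notebook cell: 5 equal-width bins, ordinal codes.
disc = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
binned = disc.fit_transform(values).reshape(-1)

# bin_edges_ holds one array of edges per input column.
bin_edges = disc.bin_edges_[0]
print(bin_edges)
print(binned)
```

With uniform bins the edges are evenly spaced between the column's minimum and maximum, so a skewed column can end up with most rows in one or two bins.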


Create Training and Test Sets and Apply Scaling

In [26]:
from sklearn.model_selection import train_test_split
trainLabel = trainData[label]
trainData = trainData.drop([label], axis=1)

print(trainData.head())
print(trainLabel.head())
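# The hold-out split and scaling below are disabled, so every model in this
# notebook is fit and scored on the full training set.
# X_train, X_test, y_train, y_test = train_test_split(trainData, trainLabel, random_state=0)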
# from sklearn.preprocessing import MinMaxScaler
# scaler = MinMaxScaler()
# X_train = scaler.fit_transform(X_train)
# X_test = scaler.transform(X_test)
# print(X_train.shape)
# print(X_test.shape)


   Allied_Nations  Diplomatic_Meetings_Set  Percent_Of_Forces_Mobilized  \
0              16                        1                          3.0   
1               8                        1                          0.0   
2               9                        1                          2.0   
3               7                        0                          1.0   
4               8                        1                          0.0   

   Hostile_Nations  Active_Threats  Inactive_Threats  Citizen_Fear_Index  \
0                3             0.0               0.0                 3.0   
1                2             2.0               0.0                 2.0   
2                3             2.0               1.0                 2.0   
3                2             2.0               0.0                 2.0   
4                5             0.0               0.0                 2.0   

   Closest_Threat_Distance(km)  Aircraft_Carriers_Responding  \
0                          1
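The split and scaling step that is commented out in the cell above could look roughly like the sketch below. It runs on synthetic data, so the feature matrix `X` and labels `y` are placeholders, not the real columns. The point it illustrates is that the scaler is fit on the training portion only and then reused on the held-out portion.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-ins for the features and labels (assumptions, not real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = rng.integers(1, 6, size=200)

# Default test_size is 0.25, so 150 rows go to training and 50 to testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the scaler on the training rows only, then apply it to the test rows.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Because the scaler never sees the test rows during fitting, scaled test values may fall slightly outside the [0, 1] range.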

Logistic Regression


In [27]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(trainData, trainLabel)
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
     .format(logreg.score(trainData, trainLabel)))
# print('Accuracy of Logistic regression classifier on test set: {:.2f}'
#      .format(logreg.score(X_test, y_test)))



Accuracy of Logistic regression classifier on training set: 0.55


Decision tree

In [28]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier().fit(trainData, trainLabel)
print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf.score(trainData, trainLabel)))
# print('Accuracy of Decision Tree classifier on test set: {:.2f}'
#      .format(clf.score(X_test, y_test)))

Accuracy of Decision Tree classifier on training set: 0.72


K-nearest neighbour

In [29]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(trainData, trainLabel)
print('Accuracy of K-NN classifier on training set: {:.2f}'
     .format(knn.score(trainData, trainLabel)))
# print('Accuracy of K-NN classifier on test set: {:.2f}'
#      .format(knn.score(X_test, y_test)))

Accuracy of K-NN classifier on training set: 0.61


Linear Discriminant Analysis

In [30]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis()
lda.fit(trainData, trainLabel)
print('Accuracy of LDA classifier on training set: {:.2f}'
     .format(lda.score(trainData, trainLabel)))
# print('Accuracy of LDA classifier on test set: {:.2f}'
#      .format(lda.score(X_test, y_test)))

Accuracy of LDA classifier on training set: 0.55


Gaussian Naive Bayes


In [31]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(trainData, trainLabel)
print('Accuracy of GNB classifier on training set: {:.2f}'
     .format(gnb.score(trainData, trainLabel)))
# print('Accuracy of GNB classifier on test set: {:.2f}'
#      .format(gnb.score(X_test, y_test)))

Accuracy of GNB classifier on training set: 0.52
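All of the scores above are measured on the same rows the models were trained on, which tends to flatter flexible models such as the decision tree. One common alternative is k-fold cross-validation. The sketch below uses `cross_val_score` on synthetic data from `make_classification`; the dataset, the choice of three models, and `max_iter=1000` are illustrative assumptions rather than settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-class data standing in for the real training set.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)

models = {
    'Logistic regression': LogisticRegression(max_iter=1000),
    'Decision tree': DecisionTreeClassifier(random_state=0),
    'K-NN': KNeighborsClassifier(),
}

# 5-fold cross-validation: each model is scored on rows it was not fit on.
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    cv_scores[name] = scores.mean()
    print('{}: mean CV accuracy {:.2f}'.format(name, scores.mean()))
```

Comparing these means, rather than training accuracy, gives a fairer basis for picking the model used for the submission.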


Predict on test data and write the submission

In [47]:

# Load the test set under its own name so it is not confused with the training data.
testData = loadTestData('Dataset/test.csv')
# Note: `describe` without parentheses prints the bound method (and with it the
# frame's repr), not summary statistics; use testData.describe() for the latter.
print(testData.describe)

# Discretize the same columns as in training.
# Caveat: the discretizer is re-fit on the test set here, so the bin edges can
# differ from the ones learned on the training set.
binned_columns = [
    'Percent_Of_Forces_Mobilized',
    'Active_Threats',
    'Inactive_Threats',
    'Citizen_Fear_Index',
    'Closest_Threat_Distance(km)',
    'Troops_Mobilized(thousands)',
]
disc = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
for col in binned_columns:
    testData[col] = disc.fit_transform(testData[[col]]).reshape(-1)

print(testData.head())

# Keep the IDs for the submission file, then drop them before predicting.
my_submission = pd.DataFrame({'ID': testData['ID']})
testData = testData.drop(['ID'], axis=1)
pred = knn.predict(testData)

my_submission['DEFCON_Level'] = pred
my_submission['DEFCON_Level'] = my_submission['DEFCON_Level'].astype('int64')
print(my_submission.head())
my_submission.to_csv('submission.csv', index=False)

<bound method NDFrame.describe of       Allied_Nations  Diplomatic_Meetings_Set  Percent_Of_Forces_Mobilized  \
0                  8                        0                         0.52   
1                  9                        0                         0.44   
2                  8                        0                         0.44   
3                 10                        0                         0.39   
4                  9                        0                         0.44   
5                  8                        1                         0.36   
6                  7                        0                         0.02   
7                  8                        1                         0.05   
8                  8                        0                         0.35   
9                  9                        0                         0.31   
10                 9                        0                         0.39   
11                10          

   ID  DEFCON_Level
0   1             3
1  10             4
2  14             2
3  17             3
4  21             2
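One thing to watch in the prediction cell is that the discretizer is fit again on the test set, so the same raw value can land in a different bin than it did during training. The sketch below, on made-up numbers, contrasts reusing the training-time bin edges via `transform` with re-fitting on the test data. The two small frames are hypothetical and only borrow a column name from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical train/test values for a single column (not real data).
train = pd.DataFrame({'Active_Threats': [0.0, 10.0, 20.0, 30.0, 40.0, 50.0]})
test = pd.DataFrame({'Active_Threats': [5.0, 10.0, 15.0]})

# Fit the bin edges on the training column, then reuse them on the test column.
disc = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
train_binned = disc.fit_transform(train[['Active_Threats']]).reshape(-1)
test_binned = disc.transform(test[['Active_Threats']]).reshape(-1)

# For comparison: re-fitting on the test column yields different bin codes.
refit = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
refit_binned = refit.fit_transform(test[['Active_Threats']]).reshape(-1)

print(test_binned)
print(refit_binned)
```

Applying this to the notebook would mean keeping one fitted discretizer per column from the training cell and calling `transform` on the test columns, which is a suggestion rather than what the cells above do.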
