# Titanic Kaggle Competition

In June of 2019 I attended the Data Science Dojo bootcamp. For our capstone project we built a model for the Kaggle Titanic data set. Below is the code I used to generate my model.

This is a basic model that doesn't do a lot of feature engineering. Partly this was because I wanted to test the theory behind weak learners: an ensemble of many weak learners may perform better than a smaller number of strong learners.

To that end, I built a deliberately limited Random Forest (many shallow trees), and I didn't do much feature engineering beyond creating a variable for family size.
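As a rough illustration of that idea, here is a minimal sketch that compares many shallow trees against a handful of fully grown ones. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the Titanic set), and the tree counts and names such as `many_weak` are illustrative only. Which configuration wins depends on the data, so treat the printed scores as something to explore rather than a result.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the Titanic data set)
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Many shallow ("weak") trees versus a few fully grown ("strong") trees
many_weak = RandomForestClassifier(n_estimators=500, max_depth=3, random_state=0)
few_strong = RandomForestClassifier(n_estimators=5, max_depth=None, random_state=0)

weak_acc = many_weak.fit(X_train, y_train).score(X_test, y_test)
strong_acc = few_strong.fit(X_train, y_train).score(X_test, y_test)

print("500 trees, max_depth=3:   ", weak_acc)
print("5 trees, unlimited depth: ", strong_acc)
```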

### Set the parameters for the code.

In [105]:
"""
This code is a modified version of the 
original code that is part of Data Science Dojo's bootcamp
Copyright (C) 2015-2019

Objective: Machine Learning of the Titanic dataset with a Random Forest
Data Source: bootcamp root/Datasets/titanic.csv
Python Version: 3.4+
Packages: scikit-learn, pandas, numpy, sklearn-pandas
"""
%matplotlib inline

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelBinarizer
from sklearn import metrics
from sklearn_pandas import DataFrameMapper
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import seaborn as sns
import missingno as msno
from sklearn.model_selection import cross_validate
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import train_test_split


In [106]:
### Load in the data

os.chdir('/Users/johnspencer/Desktop/Bootcamp')
titanic = pd.read_csv('Datasets/titanic.csv')
titanic = titanic.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
### CHANGE THIS FOR TEST
#titanic['Survived']='NaN'

In [107]:
## Set values to categorical and clean NaNs as needed
titanic['Survived'] = titanic['Survived'].astype('category')
titanic['Pclass'] = titanic['Pclass'].astype('category')
titanic.loc[pd.isnull(titanic['Embarked']), 'Embarked'] = 'S'
titanic['Embarked'] = titanic['Embarked'].astype('category')
titanic['Sex'] = titanic['Sex'].astype('category')

print(titanic["Embarked"].unique())

[S, C, Q]
Categories (3, object): [S, C, Q]


In [108]:
# fill_missing_fare function from: https://www.kaggle.com/saisivasriram/titanic-feature-understanding-from-plots
def fill_missing_fare(df):
    # Median fare of third-class passengers who embarked at Southampton ('S')
    median_fare = df[(df['Pclass'] == 3) & (df['Embarked'] == 'S')]['Fare'].median()
    df["Fare"] = df["Fare"].fillna(median_fare)
    return df

titanic=fill_missing_fare(titanic)



### Cleaning Missing Data

In the class, missing age values were assigned the median of all ages. This is a *very* rough way to fill in missing values. Because of time constraints in the course, I wasn't able to dig into better imputation methods, so I used a slightly less rough one: passengers with a missing age get the mean age for their sex. This isn't ideal either, but given my need to balance building a quick model with my interest in exploring weak learners, it is a reasonable starting point.
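To make the idea concrete, here is a minimal sketch of group-mean imputation on a toy frame. The `toy` data is made up for illustration, and `groupby(...).transform("mean")` is an alternative spelling of the same idea, not the exact code used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one missing age per sex
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [20.0, 30.0, np.nan, 40.0, 60.0, np.nan],
})

# Mean age within each sex, aligned to the original row order.
# NaN values are skipped when the mean is computed.
group_mean = toy.groupby("Sex")["Age"].transform("mean")

# Fill only the missing ages with the mean for that passenger's sex
toy["Age"] = toy["Age"].fillna(group_mean)
print(toy)
```

In this toy frame the missing female age becomes 25.0 and the missing male age becomes 50.0.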


In [109]:
#This calculates the mean/median by sex. Try running this first. If it doesn't result in a solid model 
titanic=titanic.copy()
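# Select 'Age' before aggregating: pandas 2.x raises a TypeError if mean() is applied to the categorical columns
g_mean = titanic.groupby('Sex')[['Age']].mean()
g_median = titanic.groupby('Sex')[['Age']].median()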
titanic.loc[titanic.Age.isnull() & (titanic.Sex == 'female'),'Age'] = g_mean['Age']['female']
titanic.loc[titanic.Age.isnull() & (titanic.Sex == 'male'), 'Age'] = g_mean['Age']['male']


### Family Size

After exploring the data, family size appeared to play a role in survival, and reading about how others have tackled the problem suggested it could be an important factor.

I decided to add it to my model, so I created a discretized variable that indicates family size.

In [110]:
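titanic['relatives'] = titanic['SibSp'] + titanic['Parch']
# Note: despite its name, not_alone is 1 for passengers travelling alone and 0 for those with relatives aboard
titanic.loc[titanic['relatives'] > 0, 'not_alone'] = 0
titanic.loc[titanic['relatives'] == 0, 'not_alone'] = 1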
titanic['not_alone'] = titanic['not_alone'].astype(int)


In [111]:
# Create a family size variable including the passenger themselves
titanic["FamilySize"] = titanic["SibSp"] + titanic["Parch"]+1
print(titanic["FamilySize"].value_counts())
# Discretize family size
titanic.loc[titanic["FamilySize"] == 1, "FsizeD"] = '1'
titanic.loc[(titanic["FamilySize"] > 1)  &  (titanic["FamilySize"] < 5) , "FsizeD"] = '2'
titanic.loc[titanic["FamilySize"] >4, "FsizeD"] = '3'
#titanic["FsizeD"] = pd.to_numeric(["FsizeD"])
titanic["FsizeD"] = titanic["FsizeD"].astype(np.int64)

#print(titanic["Embarked"].unique())
#print(titanic["FsizeD"].value_counts())

1     537
2     161
3     102
4      29
6      22
5      15
7      12
11      7
8       6
Name: FamilySize, dtype: int64
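The three-bucket discretization above (alone, 2 to 4 people, 5 or more) can also be written with `pd.cut`. This is a sketch on made-up rows, offered as an alternative spelling rather than the code this notebook ran; the `toy` frame and bin edges are my own illustration of the same thresholds.

```python
import pandas as pd

# Hypothetical rows; FamilySize counts the passenger themselves
toy = pd.DataFrame({"SibSp": [0, 1, 1, 3, 5], "Parch": [0, 0, 2, 2, 2]})
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1   # 1, 2, 4, 6, 8

# Bins are right-inclusive: (0, 1] -> 1, (1, 4] -> 2, (4, inf) -> 3
toy["FsizeD"] = pd.cut(
    toy["FamilySize"],
    bins=[0, 1, 4, float("inf")],
    labels=[1, 2, 3],
).astype(int)

print(toy)
```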


#### More data cleaning by casting certain variables as categorical

In [112]:
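## Encode all categorical values as integers
## Survived: 0 = Dead, 1 = Alive
## Embarked: 0 = Cherbourg, 1 = Queenstown, 2 = Southampton (missing values were set to 'S' earlier)
## Sex: 0 = female, 1 = male
## Pclass: 0 = 1st class, 1 = 2nd class, 2 = 3rd class
## rename_categories is used because assigning to .cat.categories was removed in pandas 2.0
titanic['Survived'] = titanic['Survived'].cat.rename_categories([0,1])
titanic['Embarked'] = titanic['Embarked'].cat.rename_categories([0,1,2])
titanic['Sex'] = titanic['Sex'].cat.rename_categories([0,1])
titanic['Pclass'] = titanic['Pclass'].cat.rename_categories([0,1,2])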
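
The integer labels depend on the order pandas assigns to the categories. When a string column is cast with `astype('category')`, the categories are sorted, which is why Cherbourg (`C`) ends up as 0 and Southampton (`S`) as 2. A small sketch on a made-up series shows this, using `.cat.codes` as one way to read the integer encoding:

```python
import pandas as pd

# Made-up values in the order S, C, Q, S
embarked = pd.Series(["S", "C", "Q", "S"]).astype("category")

# Categories are sorted alphabetically, not by order of appearance
categories = list(embarked.cat.categories)
print(categories)

# .cat.codes gives the position of each value within those categories
codes = embarked.cat.codes.tolist()
print(codes)
```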



#### Drop some variables that we don't need.

In [113]:
titanic = titanic.drop(['SibSp', 'Parch', 'FamilySize'], axis=1)



#### Map the variables so the model can be built.

In [114]:
titanic_map = DataFrameMapper([
    (['Pclass'], LabelBinarizer()),
    (['Sex'], LabelBinarizer()),
    ('Age', None),
    ('Fare', None),
    (['Embarked'], LabelBinarizer()),
    ('relatives', None),
    ('not_alone', None),
    ('FsizeD', None),
])
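
To see what `LabelBinarizer` contributes inside the mapper, here is a minimal sketch that applies it directly to small made-up arrays (the values are illustrative, not taken from the data set). A column with three classes expands into three indicator columns, while a two-class column stays a single 0/1 column.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Made-up encoded values: three passenger classes, two sexes
pclass = np.array([0, 1, 2, 1])
sex = np.array([0, 1, 1, 0])

# Three classes -> one indicator column per class
pclass_encoded = LabelBinarizer().fit_transform(pclass)
# Two classes -> a single 0/1 column
sex_encoded = LabelBinarizer().fit_transform(sex)

print(pclass_encoded.shape)
print(sex_encoded.shape)
```

If the mapper behaves the same way, `Pclass` and `Embarked` should each contribute three columns, and `Sex` one, to the feature matrix.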

### Build the model
Now we get down to business by building a `Random Forest Classifier` model.

I chose to set a high value for `n_estimators` (the number of trees the model generates) in order to test the weak learner theory. I selected __3__ for the `max_depth` value (the maximum depth of the tree) and a low value of **2** for `min_samples_split`.

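The `OOB Accuracy` line in the cell below prints `True` rather than a score. That is because `oob_score` is the boolean passed to the constructor; scikit-learn stores the fitted out-of-bag estimate in the `oob_score_` attribute (with a trailing underscore). Swapping that in is something for the next run.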

In [116]:
# Build the feature matrix from the mapped columns
titanic_cln_var_names = [ "Pclass","Sex","Age","Fare", "Embarked","relatives","not_alone","FsizeD"]

titanic_features = titanic_map.fit_transform(titanic)



# Split dataset into training set and test set
np.random.seed(27)
titanic.is_train = np.random.uniform(0,1,len(titanic)) <= .7
titanic_features_train = titanic_features[titanic.is_train]
titanic_features_test = titanic_features[titanic.is_train == False]


# Train Model
titanic_rf_clf = RandomForestClassifier(oob_score=True, n_jobs=3, n_estimators=1000, 
                                        max_features='sqrt', criterion='gini', max_depth=3, 
                                        min_samples_split=2, min_samples_leaf=1
                                       )
titanic_rf_clf = titanic_rf_clf.fit(titanic_features_train, 
                                    titanic.loc[titanic.is_train,'Survived'])
print("OOB Accuracy: " + str(titanic_rf_clf.oob_score))



  


OOB Accuracy: True
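
The `True` printed above is the `oob_score` constructor flag, not an accuracy. A minimal sketch on synthetic data (an assumption; the numbers will not match the Titanic model) shows the difference between the flag and the fitted `oob_score_` attribute:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: NOT the Titanic data set)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

clf = RandomForestClassifier(n_estimators=200, max_depth=3,
                             oob_score=True, random_state=0).fit(X, y)

# The constructor argument is just the boolean flag
print("oob_score  (flag):    ", clf.oob_score)
# The fitted out-of-bag accuracy estimate has a trailing underscore
print("oob_score_ (estimate):", clf.oob_score_)
```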


### Test the model

Now that we have a model, let's test it.

Again, because of time limitations in the course, I wasn't able to cross-validate my model. Next run I'll add that and do more feature engineering.
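
For reference, a k-fold cross-validation pass could look roughly like the sketch below. It runs on synthetic data, and the fold count and seed are arbitrary choices of mine, so it shows the shape of the approach rather than anything I ran on the Titanic model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption: NOT the Titanic data set)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

clf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)

# Five shuffled folds; each fold is held out once while the rest train the model
folds = KFold(n_splits=5, shuffle=True, random_state=27)
scores = cross_val_score(clf, X, y, cv=folds, scoring="accuracy")

print("Fold accuracies:", scores)
print("Mean accuracy:  ", scores.mean())
```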

In [117]:
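# NOTE: testdf and testdf_map are not defined in the cells shown here. They are assumed to come from
# running the same cleaning and mapping steps on Datasets/titanic_test.csv.
testdf_cln_var_names = [ "Pclass","Sex","Age","Fare", "Embarked","relatives","not_alone","FsizeD"]

testdf_features = testdf_map.fit_transform(testdf)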



# Predict classes of test set and evaluate
titanic_rf_pred = titanic_rf_clf.predict(titanic_features_test)

titanic_rf_cm = metrics.confusion_matrix(titanic.loc[titanic.is_train==False, 'Survived'],
                                         titanic_rf_pred, labels=[0,1])
print(titanic_rf_cm)
titanic_rf_acc = metrics.accuracy_score(titanic.loc[titanic.is_train==False, 'Survived'],
                                         titanic_rf_pred)
titanic_rf_prec = metrics.precision_score(titanic.loc[titanic.is_train==False, 'Survived'],
                                          titanic_rf_pred)
titanic_rf_rec = metrics.recall_score(titanic.loc[titanic.is_train==False, 'Survived'],
                                      titanic_rf_pred)
titanic_rf_f1 = metrics.f1_score(titanic.loc[titanic.is_train==False, 'Survived'],
                                      titanic_rf_pred)

# Predict probabilities to calculate AUC
titanic_rf_pred_prob = titanic_rf_clf.predict_proba(titanic_features_test)

titanic_rf_auc = metrics.roc_auc_score(titanic.loc[titanic.is_train==False, 'Survived'],
                                       titanic_rf_pred_prob[:,1])

print("Accuracy: " + str(titanic_rf_acc) + "\nPrecision: " 
      + str(titanic_rf_prec) + "\nRecall: " + str(titanic_rf_rec)
      + "\nF1-score: " + str(titanic_rf_f1) + "\nAUC: " + str(titanic_rf_auc))





[[150  11]
 [ 43  62]]
Accuracy: 0.7969924812030075
Precision: 0.8493150684931506
Recall: 0.5904761904761905
F1-score: 0.696629213483146
AUC: 0.8634723454599231
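
As a sanity check, the accuracy, precision, recall and F1 values above can be reproduced by hand from the confusion matrix. With `labels=[0,1]`, rows are the true class and columns the predicted class, so the matrix reads `[[TN, FP], [FN, TP]]`. The variable names below are my own.

```python
# Counts read off the confusion matrix printed above
tn, fp = 150, 11
fn, tp = 43, 62

total = tn + fp + fn + tp                       # 266 held-out passengers
accuracy = (tp + tn) / total                    # share of correct predictions
precision = tp / (tp + fp)                      # predicted survivors who did survive
recall = tp / (tp + fn)                         # actual survivors the model found
f1 = 2 * precision * recall / (precision + recall)

print("Accuracy: ", accuracy)
print("Precision:", precision)
print("Recall:   ", recall)
print("F1-score: ", f1)
```

The low recall relative to precision says the model misses a fair number of actual survivors (43 of 105) while rarely predicting survival for someone who died.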


In [118]:
titanic_rf_test = titanic_rf_clf.predict(testdf_features)

In [120]:
# Do we have 418 records?
len(titanic_rf_test)

418

In [127]:
predictions = pd.DataFrame(titanic_rf_test, columns=['Survived'])
testread = pd.read_csv('Datasets/titanic_test.csv')
predictions = pd.concat((testread.iloc[:, 0], predictions), axis = 1)
predictions.to_csv('random_forest.csv', sep=",", index = False)

In [131]:
predictions.to_csv('Datasets/random_forest.csv', sep=",", index = False)