Name: Dr D  
Class: Pattern Recognition Spring 2018

# Predicting Titanic survival with a random forest classifier



## Problem Description

In this assignment we are given a sample data set containing demographics and survival outcomes for a large number of Titanic passengers, and we are asked to predict whether a separate set of passengers would have survived.

## Solution Method

We are limited to using decision trees or random forest classifiers for this problem. I am going to try to solve it using a random forest, partially because I think it will work better, but mostly because I have not built one before. A random forest classifier is a meta-classifier which creates a number of decision trees, each trained on a randomly selected subset of the original data set and restricted to randomly selected subsets of the features. It wouldn't be too hard to implement on our own (in theory), but it is already built into [scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), so there is no reason to build it again. I am going to try to pull some information out of the forest to show how it is making its decisions and that I know what it is doing, but classifying with it should be really easy.
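To make the idea concrete, here is a minimal sketch of a hand-rolled forest. It is not what scikit-learn does internally line for line, just an illustration under some assumptions: the data is synthetic (`make_classification`, not the Titanic set), the names `X_demo`, `y_demo`, and `trees` are made up for this example, and the feature randomness is applied at each split via `max_features="sqrt"`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; this is not the Titanic data set).
X_demo, y_demo = make_classification(n_samples=300, n_features=4, n_informative=3,
                                     n_redundant=0, random_state=0)

rng = np.random.RandomState(0)
trees = []
for i in range(25):
    # Bootstrap sample: draw as many row indices as there are rows, with replacement.
    rows = rng.randint(0, len(X_demo), size=len(X_demo))
    # max_features="sqrt" makes every split look at a random subset of the features.
    tree = DecisionTreeClassifier(max_depth=5, max_features="sqrt", random_state=i)
    tree.fit(X_demo[rows], y_demo[rows])
    trees.append(tree)

# Combine the trees by averaging their class probabilities and taking the best class.
mean_proba = np.mean([t.predict_proba(X_demo) for t in trees], axis=0)
forest_pred = mean_proba.argmax(axis=1)
print("training accuracy of the hand-rolled forest:", (forest_pred == y_demo).mean())
```

The built-in `RandomForestClassifier` wraps the same two sources of randomness (bootstrap rows, random feature subsets) and adds parallel training on top.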


## Input

Here we load and pre-process all of the data to prepare it to be used to test and train classifiers.

In [2]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
%matplotlib inline
import pandas as pd
df = pd.read_csv('titanic_sample_data.csv')
ds = pd.read_csv('titanic_challenge.csv')
# OK.  Enough messing around.  Let's get to work.
# We need to make an array X with just numeric features, and y with just the category (survived)
X = df.drop(['survived','name'], axis = 1)
y = df['survived']
X['sex'] = X['sex'].apply(lambda x: 1 if x == 'male' else 0)
from sklearn import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)
X[:5]

Unnamed: 0,pclass,sex,age,fare
0,3,1,42.0,7.55
1,3,1,13.0,20.25
2,3,1,16.0,20.25
3,3,0,35.0,20.25
4,3,0,16.0,7.65


## Analysis

Here we produce a random forest classifier. I went with 1000 trees in the forest arbitrarily, and ran some tests to determine an optimal depth. I also eventually settled on using entropy instead of Gini impurity as the criterion for evaluating the purity of a split.
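The depth tests are not shown in this notebook, so the following is only a sketch of how such a sweep could be done, not a record of what I actually ran. It assumes synthetic data, 5-fold cross-validation, a smaller forest of 50 trees to keep it quick, and an arbitrary list of candidate depths.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; this is not the Titanic data set).
X_demo, y_demo = make_classification(n_samples=300, n_features=4, n_informative=3,
                                     n_redundant=0, random_state=0)

depth_scores = {}
for depth in [2, 3, 5, 8, None]:  # None lets the trees grow until the leaves are pure
    forest = RandomForestClassifier(n_estimators=50, max_depth=depth,
                                    criterion="entropy", random_state=0)
    # Mean accuracy over 5 cross-validation folds for this depth.
    depth_scores[depth] = cross_val_score(forest, X_demo, y_demo, cv=5).mean()
    print(depth, round(depth_scores[depth], 3))

best_depth = max(depth_scores, key=depth_scores.get)
print("best max_depth:", best_depth)
```

Cross-validation averages over several splits, so the comparison between depths is less noisy than scoring each depth on a single random train/test split.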

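The random forest classifier is just a container for a large number of decision trees, each built from a random (bootstrap) sample of the training data and random subsets of the available features. I just stuck with the defaults (`max_features='auto'`), which means each split considers the square root of the number of features, that is, two of the four. Each tree gets a 'vote' on how to classify a new data point. According to the scikit-learn documentation, the votes are weighted by each tree's probability estimates: the predicted class is the one with the highest mean probability across the trees.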
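As far as I can tell from the scikit-learn documentation, the forest does not count hard votes; it averages the class probabilities from every tree and picks the class with the highest mean. The sketch below checks that on synthetic data (an assumption; `X_demo`, `y_demo`, and the forest settings are made up for this example) by rebuilding the prediction from the fitted trees in `estimators_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; this is not the Titanic data set).
X_demo, y_demo = make_classification(n_samples=300, n_features=4, n_informative=3,
                                     n_redundant=0, random_state=0)
forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
forest.fit(X_demo, y_demo)

# Ask every tree for its class probabilities, then average over the trees.
per_tree = np.array([tree.predict_proba(X_demo) for tree in forest.estimators_])
mean_proba = per_tree.mean(axis=0)
manual_pred = forest.classes_[mean_proba.argmax(axis=1)]

# Compare the manual average with what the forest itself reports.
proba_matches = np.allclose(mean_proba, forest.predict_proba(X_demo))
agreement = (manual_pred == forest.predict(X_demo)).mean()
print("probabilities match:", proba_matches)
print("fraction of identical predictions:", agreement)
```

This also explains why `predict_proba` returns values other than 0 and 1: each value is the mean of the per-tree leaf proportions.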


In [3]:
# Build 1000 trees in the forest, each with a max depth of 5.
# verbose=1 prints progress output and n_jobs=-1 trains in parallel on all processors.
clf = RandomForestClassifier(n_estimators=1000, max_depth=5, verbose=1, n_jobs=-1, criterion='entropy')
clf.fit(X_train,y_train)


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:    0.3s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:    0.7s
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:    1.2s
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:    1.6s finished


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=5, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=-1,
            oob_score=False, random_state=None, verbose=1,
            warm_start=False)

In [4]:
# it was real easy to train that forest. Let's see how it did.
print(clf.score(X_test,y_test))
print(clf.feature_importances_)

[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    0.1s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:    0.2s
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed:    0.4s
[Parallel(n_jobs=4)]: Done 1000 out of 1000 | elapsed:    0.4s finished


0.785992217899
[ 0.13903435  0.47152513  0.17725675  0.21218377]


In [5]:
# that's not bad. Let's try classifying the challenge passengers.
name = ds['name']
newX = ds.drop(['name'],axis=1)
newX['sex'] = newX['sex'].apply(lambda x: 1 if x == 'male' else 0)
ans = pd.DataFrame()
ans['name'] = name.values
ans['survived?'] = clf.predict(newX)
probs = clf.predict_proba(newX)
ans['certainty'] = probs[:,1]
# holy cow, pandas support for applying arbitrary lambda functions is awesome
ans['survived?'] = ans['survived?'].apply(lambda x: "yep" if x else "nope")
# certainty rescales the distance of the survival probability from 0.5 to 0-100%:
# a probability of 0.5 gives 0%, a probability of 0 or 1 gives 100%
ans['certainty'] = ans['certainty'].apply(lambda x: abs(x-.5)*200)
ans['certainty'] = ans['certainty'].apply(lambda x: '%.1f%%' %  x)


[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:    0.1s
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed:    0.3s
[Parallel(n_jobs=4)]: Done 1000 out of 1000 | elapsed:    0.4s finished
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    0.1s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:    0.2s
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed:    0.3s
[Parallel(n_jobs=4)]: Done 1000 out of 1000 | elapsed:    0.4s finished


## Results

In [6]:
ans

Unnamed: 0,name,survived?,certainty
0,"Allison, Master. Hudson Trevor",yep,22.1%
1,"Astor, Mrs. John Jacob (Madeleine Talmadge Force)",yep,83.5%
2,"Baclini, Miss. Helene Barbara",yep,33.4%
3,"Braund, Mr. Lewis Richard",nope,69.0%
4,"Carlsson, Mr. Frans Olof",nope,43.9%
5,"Cavendish, Mrs. Tyrell William (Julia Florence...",yep,79.3%
6,"Frolicher-Stehli, Mr. Maxmillian",nope,73.5%
7,"Gracie, Col. Archibald IV",nope,32.4%
8,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",yep,10.7%
9,"Icard, Miss. Amelie",yep,94.6%


In [7]:
ans['survived?']

0      yep
1      yep
2      yep
3     nope
4     nope
5      yep
6     nope
7     nope
8      yep
9      yep
10    nope
11    nope
12    nope
13    nope
14    nope
15    nope
16     yep
17    nope
18     yep
19    nope
20    nope
21    nope
22    nope
23     yep
24    nope
25     yep
26    nope
27    nope
Name: survived?, dtype: object

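You can see here the predicted outcome for each passenger. I have also attempted to approximate the certainty of each prediction by rescaling the distance of the predicted survival probability from 0.5, so that a probability of 0.5 maps to 0% and a probability of 0 or 1 maps to 100%.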
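The certainty mapping can be written as a small standalone function. This is a sketch of the same arithmetic used in the notebook cell, `abs(p - 0.5) * 200`; the function name `certainty_percent` and the example probabilities are made up for illustration.

```python
import numpy as np

def certainty_percent(p_survive):
    """Map survival probabilities in [0, 1] to a 0-100% certainty score."""
    # Distance from the 0.5 decision boundary, rescaled so 0.5 of distance is 100%.
    return np.abs(np.asarray(p_survive, dtype=float) - 0.5) * 200

example_probs = [0.5, 0.75, 0.25, 1.0]  # made-up values, not model output
print(certainty_percent(example_probs).tolist())  # [0.0, 50.0, 50.0, 100.0]
```

Reading the table backwards works the same way: a "yep" with 22.1% certainty corresponds to a survival probability of about 0.61, barely past the 0.5 boundary.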

## Discussion

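I have no way to validate my results on the challenge passengers, since their true outcomes are not given. I expect that the predictions have about the same accuracy as on the held-out test split, which I saw at roughly 80-83% (78.6% in the run shown above; the split is random, so the score varies from run to run). According to my classifier, all of the guys in our class will die, but both of the girls will live. The classifier follows the expected trends: women and children are more likely to survive, and higher-class passengers are more likely to survive.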
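One way to get a second accuracy estimate without setting aside a test split is the out-of-bag (OOB) score: each tree is scored on the training rows its bootstrap sample left out. The sketch below is an assumption about how that could be added, not something I ran for this assignment; it uses synthetic data and a smaller forest.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; this is not the Titanic data set).
X_demo, y_demo = make_classification(n_samples=300, n_features=4, n_informative=3,
                                     n_redundant=0, random_state=0)

# oob_score=True asks the forest to score each row using only the trees
# whose bootstrap sample did not contain that row.
forest = RandomForestClassifier(n_estimators=200, max_depth=5, criterion="entropy",
                                oob_score=True, random_state=0)
forest.fit(X_demo, y_demo)
print("out-of-bag accuracy estimate:", round(forest.oob_score_, 3))
```

Because the OOB estimate uses every training row, it is less sensitive to one lucky or unlucky random split than a single `train_test_split` score.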

I learned a considerable amount about pandas DataFrames while working on this. It is a super helpful library. I learned about random forests, but there is really not that much to them, since they just combine a number of decision trees.