Name: Sam Mulder

Class: Pattern Recognition Spring 2018

# HW 2 
Trees on the Titanic

## Problem Description
1) We must build a classifier that predicts who survived and who died on the Titanic. I explain the problem, how I solve it, and what my results mean. 2) The predictions should be easy to paste into a Google Spreadsheet.


## Solution Method
We are allowed to use either a Decision Tree Classifier (DTC) or a Random Forest Classifier (RFC). Running the RandomForestClassifier with its default settings is very easy; the goal here is to 'twist' the knobs on its available parameters so that the classifier achieves the best accuracy on average. I will also explain the results and output the predicted data for the classifier.

In [102]:
## Input
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
from IPython.display import Image as PImage
from subprocess import check_call
from PIL import Image, ImageDraw, ImageFont
import re
from sklearn import model_selection
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

data = pd.read_csv('titanic_sample_data.csv')
data.describe()

X = data.drop(['survived','name'], axis = 1)
y = data['survived']
X['sex'] = X['sex'].apply(lambda x: 1 if x == 'male' else 0)
X[:5]
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)

## Analysis of Random Forest Classifier

In [103]:
# Note: passengers aged exactly 18 fall in neither group below (age < 18 vs. age > 18)
print('% of children who survived', 100*np.mean(data['survived'][data['age'] < 18]))
print('% of adults who survived', 100*np.mean(data['survived'][data['age'] > 18]))
print('% of females who survived', 100*np.mean(data['survived'][data['sex'] == 'female']))
print('% of males who survived', 100*np.mean(data['survived'][data['sex'] == 'male']))

% of children who survived 51.65562913907284
% of adults who survived 38.70967741935484
% of females who survived 75.06561679790026
% of males who survived 20.0


Our first goal is to choose good values for the parameters available on the RFC. Some choices rest only on an intuitive argument; others require testing along with intuition.

Parameters available to choose from:
n_estimators; 
criterion; 
max_features; 
max_depth; 
min_samples_split; 
min_samples_leaf; 
min_weight_fraction_leaf; 
max_leaf_nodes; 
min_impurity_split; 
min_impurity_decrease; 
bootstrap; 
oob_score; 
n_jobs; 
random_state; 
verbose; 
warm_start; 
class_weight; 

In [104]:
#This is the accuracy of our RFC without any parameter adjustments
DTrfc = RandomForestClassifier()
DTrfc.fit(X_train, y_train)
DTrfc.score(X_test, y_test)

0.82101167315175094

In [105]:
DTdtc = DecisionTreeClassifier()
DTdtc.fit(X_train, y_train)
DTdtc.score(X_test, y_test)

0.78599221789883267

In [106]:
DTbest = RandomForestClassifier(max_depth = 5, max_features = 4, criterion = 'entropy', n_estimators = 1000, min_samples_leaf = 3, verbose = 1)
DTbest.fit(X_train, y_train)
DTbest.score(X_test, y_test)

[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:    0.7s finished
[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:    0.1s finished


0.8443579766536965

In [111]:
print(DTbest.feature_importances_)

[ 0.23020467  0.42556306  0.18805202  0.15618025]


Now to describe why I chose these parameter values. I chose max_depth = 5 empirically: I ran the classifier many times, and the score was highest on average when this parameter was set to 5. I set max_features = 4 because this is the maximum number of features possible. I don't expect much variability from this parameter because the most meaningful features tend to be used first, so the less impactful ones have less effect on the resulting tree. I set criterion = 'entropy' partly because I understand its definition better: splits are chosen by information gain. This parameter controls how each node is split, so the trees use 'entropy' rather than 'gini' impurity. I set n_estimators = 1000 because it allows a large number of trees to be created in the random forest, and because of the structure of the random forest, adding more trees does not by itself risk overfitting. I also set min_samples_leaf = 3.
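The manual search described above (rerun, compare average scores, adjust) can also be automated. The following is a minimal sketch using scikit-learn's GridSearchCV with 5-fold cross-validation. It runs on synthetic data from make_classification as a stand-in for the Titanic features, and the candidate values in `param_grid` are hypothetical; it is not the procedure used to pick the values above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the Titanic features (assumption: 4 numeric columns).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0, random_state=0
)

# Hypothetical candidate values, chosen only for illustration.
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [1, 3, 5],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    param_grid,
    cv=5,                # average accuracy over 5 folds instead of one split
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_)            # best combination found on the synthetic data
print(round(search.best_score_, 3))   # its mean cross-validated accuracy
```

Averaging over folds plays the same role as running the classifier many times: a single train/test split can favor one setting by chance.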

Here we can also see the feature importances according to the fitted classifier DTbest. 'sex' is by far the most important feature for predicting a passenger's survival (about 0.43); ticket class is next (about 0.23) but not close. Recognizing this is important for a full understanding of our model.
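The raw array is easier to read when each value is paired with its feature name. This sketch reuses the four importances printed above and assumes the column order pclass, sex, age, fare (taken from the feature table shown in the Results section); in the notebook itself, `X.columns` would supply the names.

```python
import numpy as np

# Assumed column order of X (an assumption; X.columns would confirm it).
feature_names = ["pclass", "sex", "age", "fare"]
# Values copied from the DTbest.feature_importances_ output above.
importances = np.array([0.23020467, 0.42556306, 0.18805202, 0.15618025])

# Sort from most to least important.
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, value in ranked:
    print(f"{name:<7}{value:.3f}")
```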

## Results
In this section I apply the RFC to the titanic_challenge.csv dataset to see what the classifier predicts on an unknown dataset.

In [112]:
data2 = pd.read_csv('titanic_challenge.csv')
X2 = data2.drop(['name'], axis = 1)
X2['sex'] = X2['sex'].apply(lambda x: 1 if x == 'male' else 0)
X2[19:28]

Unnamed: 0,pclass,sex,age,fare
19,1,1,38.0,6.0
20,2,1,21.0,8.0
21,3,1,21.0,10.0
22,1,1,20.0,12.0
23,2,0,21.0,14.0
24,3,1,21.0,16.0
25,1,0,21.0,18.0
26,2,1,22.0,20.0
27,3,1,20.0,22.0


In [113]:
pred = DTbest.predict(X2)
predprobraw = DTbest.predict_proba(X2)
predprob = predprobraw[:,1]

[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:    0.1s finished


Here we applied the RFC to the challenge dataset to calculate the predicted survival and the mean class probability. 

In [114]:
names = data2['name']
outtable = pd.DataFrame()
outtable['Name:'] = names
outtable['Prediction'] = pred
outtable['Probability'] = predprob
outtable['Certainty'] = predprob
outtable['Certainty'] = outtable['Certainty'].apply(lambda x: abs(x-.5)*200)
outtable['Certainty'] = outtable['Certainty'].apply(lambda x: '%.2f%%' %  x)
outtable

Unnamed: 0,Name:,Prediction,Probability,Certainty
0,"Allison, Master. Hudson Trevor",1,0.933106,86.62%
1,"Astor, Mrs. John Jacob (Madeleine Talmadge Force)",1,0.99034,98.07%
2,"Baclini, Miss. Helene Barbara",1,0.673236,34.65%
3,"Braund, Mr. Lewis Richard",0,0.224595,55.08%
4,"Carlsson, Mr. Frans Olof",0,0.341712,31.66%
5,"Cavendish, Mrs. Tyrell William (Julia Florence...",1,0.849237,69.85%
6,"Frolicher-Stehli, Mr. Maxmillian",0,0.088307,82.34%
7,"Gracie, Col. Archibald IV",0,0.352381,29.52%
8,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",0,0.439404,12.12%
9,"Icard, Miss. Amelie",1,0.998213,99.64%


Here we transformed the values returned by predict_proba, which are the mean predicted class probabilities across the trees, to create the certainty value for each passenger. The probability is useful because its closeness to 1 or 0 shows how strongly the trees agreed on that outcome. The certainty, abs(p - 0.5) * 200, rescales this to a percentage: 0% when the probability is exactly 0.5 and 100% when it is exactly 0 or 1. It gives a clear indication of how close the average is to the ideal value of 0 or 1, depending on whether we predict that the person survives or dies.
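As a small worked example, the sketch below recomputes the certainty for three probabilities copied from the table above and writes the result as tab-separated text, which is one way to make the table easy to paste into a spreadsheet. The helper name `certainty` and the `demo` frame are illustrative, not part of the notebook.

```python
import pandas as pd

def certainty(p):
    """Rescale a class-1 probability to 0-100: 0 at p = 0.5, 100 at p = 0 or 1."""
    return abs(p - 0.5) * 200

# Three rows copied from the output table above.
demo = pd.DataFrame({
    "Name": [
        "Allison, Master. Hudson Trevor",
        "Braund, Mr. Lewis Richard",
        "Hirvonen, Mrs. Alexander (Helga E Lindqvist)",
    ],
    "Probability": [0.933106, 0.224595, 0.439404],
})
demo["Certainty"] = demo["Probability"].apply(lambda p: "%.2f%%" % certainty(p))

# Tab-separated text; spreadsheets typically split pasted tabs into columns.
tsv = demo.to_csv(sep="\t", index=False)
print(tsv)
```

The same `to_csv(sep="\t", index=False)` call should work on `outtable` in the notebook.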

## Discussion
Here we have produced a table of the predicted survival status of each passenger in the challenge data set. The Prediction column holds values of 1/0, with '1' meaning alive and '0' meaning dead. The Probability column holds continuous values from 0 to 1 and represents the predicted class probabilities. This data was produced by the random forest classifier built by passing the training data through the classifier.

The RFC works by creating a 'forest' of decision trees, each based on a random assortment of features and sample data points. There is some overlap based on the settings that we have, but the randomness is beneficial because it increases the variability across all 1000 decision trees that we set it to create. Once all of these trees were created in our random forest, we simply passed more data points through it and reported the result for each new passenger (whether they lived or died).
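To make the sampling concrete: with RandomForestClassifier's default bootstrap=True, each tree is fit on rows drawn with replacement from the training set, so some rows repeat and others are left out. The sketch below simulates one such draw with NumPy; the training-set size of 1000 is a hypothetical value, and this illustrates the idea rather than scikit-learn's internal code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000  # hypothetical training-set size

# A bootstrap sample draws n_samples row indices with replacement.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Fraction of distinct rows that made it into this tree's sample.
unique_fraction = np.unique(bootstrap_idx).size / n_samples
print(f"distinct rows in one bootstrap sample: {unique_fraction:.1%}")
```

On average about 63% of the rows (1 - 1/e) appear in a given bootstrap sample, which is why the trees overlap in the data they see yet still differ from one another.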

The predicted class of an input sample is a vote by the trees in the forest, weighted by their probability estimates. The predicted class is the one with highest mean probability estimate across the trees.

The predicted class probabilities of an input sample are computed as the mean predicted class probabilities of the trees in the forest. The class probability of a single tree is the fraction of samples of the same class in a leaf.
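The averaging described above can be checked directly. This sketch fits a small forest on synthetic data (an assumption, standing in for the Titanic features), averages each tree's predict_proba output by hand, and compares the result with the forest's own predict_proba and predict.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 4 features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0, random_state=0
)
forest = RandomForestClassifier(n_estimators=25, max_depth=5, random_state=0)
forest.fit(X_demo, y_demo)

# Class probabilities from every individual tree: shape (n_trees, n_samples, n_classes).
per_tree = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])

# The forest's probability is the mean over trees.
mean_proba = per_tree.mean(axis=0)
matches_proba = np.allclose(mean_proba, forest.predict_proba(X_demo))

# The predicted class is the one with the highest mean probability.
manual_pred = forest.classes_[mean_proba.argmax(axis=1)]
matches_pred = np.array_equal(manual_pred, forest.predict(X_demo))

print(matches_proba, matches_pred)
```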