# Challenge: compare decision trees and random forests

Pick a dataset. It could be one you've worked with before or it could be a new one. Then build the best decision tree you can.

Now try to match that with the simplest random forest you can. For our purposes, measure simplicity by runtime, and compare the forest's runtime to that of the decision tree. This is an imperfect measure, but just go with it.
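As a rough sketch of what "runtime" means in this notebook, a wall-clock timer around a single call is enough. The helper name `time_call` is hypothetical and not part of the original cells, and one measurement like this is noisy, so treat the numbers as indicative only.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Example: time a cheap stand-in workload (summing a range of integers).
total, seconds = time_call(sum, range(1_000_000))
print(total, f"{seconds:.4f} seconds")
```

The same wrapper could be placed around a `cross_val_score` call so that only the model work is timed, not the imports.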

This notebook uses the [Titanic dataset](https://www.kaggle.com/c/titanic/data).

In [101]:
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

# import os
# print(os.listdir("../input"))

In [102]:
train_df = pd.read_csv('titanic_train.csv')
test_df = pd.read_csv('titanic_test.csv')
combine = [train_df, test_df]

In [103]:
train_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


### Variable Notes
pclass: A proxy for socio-economic status (SES)
- 1st = Upper
- 2nd = Middle
- 3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.



In [104]:
train_df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [105]:
test_df.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [106]:
train_df.info()
print('_'*40)
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null

In [107]:
# drop unnecessary columns, these columns won't be useful in analysis and prediction
train_df = train_df.drop(['PassengerId','Name','Ticket'], axis=1)
test_df = test_df.drop(['PassengerId', 'Name','Ticket'], axis=1)

In [108]:
# Cabin
# Most Cabin values are missing (687 of 891 in train_df, 327 of 418 in test_df), so the column is dropped
train_df.drop("Cabin",axis=1,inplace=True)
test_df.drop("Cabin",axis=1,inplace=True)
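The decision to drop `Cabin` follows from the null counts printed earlier: roughly 77% of the training rows and 78% of the test rows have no cabin recorded. A minimal sketch of checking this per column is shown here on a toy frame; the values and the 0.5 cutoff are assumptions for illustration, not something the notebook uses.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# isnull().mean() gives the fraction of missing values in each column.
missing_fraction = toy.isnull().mean()

# Hypothetical cutoff: drop columns with more than half of their values missing.
threshold = 0.5
to_drop = missing_fraction[missing_fraction > threshold].index.tolist()
reduced = toy.drop(columns=to_drop)

print(missing_fraction)
print("dropped:", to_drop)
```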

Some feature engineering:

In [109]:
# Family

# Instead of keeping two columns, Parch & SibSp,
# use a single binary column that records whether the passenger had any family member aboard,
# so the models can pick up whether travelling with family (parent, sibling, etc.) affects survival.
# Assigning the whole column at once avoids the SettingWithCopyWarning raised by chained indexing.
train_df['Family'] = ((train_df["Parch"] + train_df["SibSp"]) > 0).astype(int)
test_df['Family'] = ((test_df["Parch"] + test_df["SibSp"]) > 0).astype(int)

# drop Parch & SibSp
train_df = train_df.drop(['SibSp','Parch'], axis=1)
test_df    = test_df.drop(['SibSp','Parch'], axis=1)


In [110]:
# Age
# impute missing ages with random integers drawn between (mean - std) and (mean + std)

# get average, std, and number of NaN values in train_df
average_age_train = train_df["Age"].mean()
std_age_train = train_df["Age"].std()
count_nan_age_train = train_df["Age"].isnull().sum()

# get average, std, and number of NaN values in test_df
average_age_test = test_df["Age"].mean()
std_age_test = test_df["Age"].std()
count_nan_age_test = test_df["Age"].isnull().sum()

# generate random numbers between (mean - std) & (mean + std)
rand_1 = np.random.randint(average_age_train - std_age_train, 
                           average_age_train + std_age_train, 
                           size = count_nan_age_train)
rand_2 = np.random.randint(average_age_test - std_age_test, 
                           average_age_test + std_age_test, 
                           size = count_nan_age_test)



In [111]:
# fill NaN values in Age column with random values generated
# (.loc with a boolean mask assigns in place and avoids SettingWithCopyWarning)
train_df.loc[train_df["Age"].isnull(), "Age"] = rand_1
test_df.loc[test_df["Age"].isnull(), "Age"] = rand_2

# convert from float to int
train_df['Age'] = train_df['Age'].astype(int)
test_df['Age']    = test_df['Age'].astype(int)
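The age imputation can be sketched end to end on a toy Series. This is an illustration under assumptions rather than the notebook's exact code: the ages are made up, and a seeded `np.random.default_rng` generator is swapped in for `np.random.randint` so the fill is repeatable.

```python
import numpy as np
import pandas as pd

# Toy ages with gaps; the values are made up for illustration.
ages = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan, 54.0], name="Age")

mean, std = ages.mean(), ages.std()
missing = ages.isnull()

# Seeded generator (an addition, not in the notebook) so results are repeatable.
rng = np.random.default_rng(0)

# integers() excludes the upper bound, as np.random.randint does.
low, high = int(mean - std), int(mean + std)
fill = rng.integers(low, high, size=int(missing.sum()))

# Assign through .loc with the boolean mask, then convert to int.
imputed = ages.copy()
imputed.loc[missing] = fill
imputed = imputed.astype(int)

print(imputed.tolist())
```

Drawing from one standard deviation around the mean keeps the filled values plausible, but it does flatten the tails of the age distribution.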


In [112]:
print("train_df features missing values: ")
print(train_df.isnull().sum())
print("\ntest_df features missing values: ")
print(test_df.isnull().sum())

train_df features missing values: 
Survived    0
Pclass      0
Sex         0
Age         0
Fare        0
Embarked    2
Family      0
dtype: int64

test_df features missing values: 
Pclass      0
Sex         0
Age         0
Fare        1
Embarked    0
Family      0
dtype: int64


In [113]:
# create dummy variables for Pclass column
pclass_dummies_train  = pd.get_dummies(train_df['Pclass'])
pclass_dummies_train.columns = ['Class_1','Class_2','Class_3']


pclass_dummies_test  = pd.get_dummies(test_df['Pclass'])
pclass_dummies_test.columns = ['Class_1','Class_2','Class_3']


train_df.drop(['Pclass'],axis=1,inplace=True)
test_df.drop(['Pclass'],axis=1,inplace=True)

train_df = train_df.join(pclass_dummies_train)
test_df    = test_df.join(pclass_dummies_test)
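Renaming the dummy columns by hand works here because both frames contain all three classes, and `get_dummies` emits them in sorted order. A hedged alternative, shown on toy frames with made-up values, lets pandas name the columns and then aligns the test columns to the training columns.

```python
import pandas as pd

# Toy frames; the test frame is deliberately missing class 2.
toy_train = pd.DataFrame({"Pclass": [3, 1, 2, 3]})
toy_test = pd.DataFrame({"Pclass": [3, 1, 3]})

# prefix= derives the column names from the values (Class_1, Class_2, ...).
train_dummies = pd.get_dummies(toy_train["Pclass"], prefix="Class", dtype=int)
test_dummies = pd.get_dummies(toy_test["Pclass"], prefix="Class", dtype=int)

# Align test columns to the training columns; absent classes become all zeros.
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(train_dummies.columns.tolist())
print(test_dummies)
```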

In [114]:
# classify passengers as male, female, or child (age under 16)
def get_person(passenger):
    age,sex = passenger
    return 'child' if age < 16 else sex
    
train_df['Person'] = train_df[['Age','Sex']].apply(get_person,axis=1)
test_df['Person']    = test_df[['Age','Sex']].apply(get_person,axis=1)

# No need to use Sex column since we created Person column
train_df.drop(['Sex'],axis=1,inplace=True)
test_df.drop(['Sex'],axis=1,inplace=True)

# create dummy variables for Person column
person_dummies_train  = pd.get_dummies(train_df['Person'])
person_dummies_train.columns = ['Child','Female','Male']


person_dummies_test  = pd.get_dummies(test_df['Person'])
person_dummies_test.columns = ['Child','Female','Male']


train_df = train_df.join(person_dummies_train)
test_df = test_df.join(person_dummies_test)
train_df.drop(['Person'], axis=1, inplace=True)
test_df.drop(['Person'], axis=1, inplace=True)


In [120]:
train_df.drop(['Embarked'], axis=1, inplace=True)
test_df.drop(['Embarked'], axis=1, inplace=True)

In [121]:
train_df.head()

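| | Survived | Age | Fare | Family | Class_1 | Class_2 | Class_3 | Child | Female | Male |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 22 | 7.25 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 38 | 71.2833 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1 | 26 | 7.925 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 3 | 1 | 35 | 53.1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0 | 35 | 8.05 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |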


In [122]:
test_df.head()

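| | Age | Fare | Family | Class_1 | Class_2 | Class_3 | Child | Female | Male |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 34 | 7.8292 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 47 | 7.0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 2 | 62 | 9.6875 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 27 | 8.6625 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 4 | 22 | 12.2875 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |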


In [123]:
# define training and testing sets

X_train = train_df.drop("Survived",axis=1)
Y_train = train_df["Survived"]
X_test  = test_df

## Logistic Regression

In [149]:
# Testing Logistic Regression: 
import time  # needed for time.perf_counter() below
from sklearn import linear_model
lrm = linear_model.LogisticRegression()

#lrm.fit(X_train, Y_train)
start_time = time.perf_counter()

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
sk = StratifiedKFold(5, shuffle=True)
score = cross_val_score(lrm, X_train, Y_train, cv=sk)


print('score array:\n', score)
print('\nruntime:\n',time.perf_counter() - start_time, "seconds")
print('\nscore array mean:\n', np.mean(score))
print('\nscore array std dev:\n', np.std(score))


score array:
 [0.80446927 0.83798883 0.82022472 0.80337079 0.74576271]

runtime:
 0.03486447999966913 seconds

score array mean:
 0.8023632636082088

score array std dev:
 0.030975103502569268




## Decision Tree

In [138]:
# decision tree
from sklearn import tree
from sklearn.model_selection import cross_val_score
from IPython.display import Image
import pydotplus
import graphviz
import time


#set variable to time program
start_time = time.perf_counter()

decision_tree = tree.DecisionTreeClassifier(
    criterion='entropy',
    max_features=1,
    max_depth=12
)

#cross-validation splitter: 5 stratified folds, shuffled
cv = StratifiedKFold(5, shuffle=True)
scores_tree = cross_val_score(decision_tree,X_train,Y_train,cv=cv)

print('score array:\n', scores_tree)
print('\nruntime:\n',time.perf_counter() - start_time, "seconds")
print('\nscore array mean:\n', np.mean(scores_tree))
print('\nscore array std dev:\n', np.std(scores_tree))

score array:
 [0.81564246 0.79329609 0.73595506 0.79213483 0.72881356]

runtime:
 0.03511636800067208 seconds

score array mean:
 0.7731683988897033

score array std dev:
 0.03441243921529032


## Random Forest

In [142]:
start_time = time.perf_counter()

from sklearn import ensemble
rfc = ensemble.RandomForestClassifier(n_estimators = 100)

scores_rfc = cross_val_score(rfc,X_train,Y_train,cv=StratifiedKFold(5, shuffle=True))

print('score array:\n', scores_rfc)
print('\nruntime:\n',time.perf_counter() - start_time, "seconds")
print('\nscore array mean:\n', np.mean(scores_rfc))
print('\nscore std dev:\n', np.std(scores_rfc))

score array:
 [0.82122905 0.79888268 0.79213483 0.81460674 0.80225989]

runtime:
 0.5068463270017673 seconds

score array mean:
 0.8058226383765866

score std dev:
 0.010608772339327294
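The three runs above each build their own unseeded `StratifiedKFold`, so the models are scored on different splits, and the timers also cover whatever else is in the cell. A minimal sketch of a more like-for-like comparison is given here, assuming synthetic stand-in data from `make_classification` rather than the Titanic features. It uses scikit-learn's `cross_validate`, which reports per-fold fit times alongside the scores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the notebook itself uses X_train and Y_train.
X, y = make_classification(n_samples=300, n_features=9, random_state=0)

# One shared, seeded splitter so both models are scored on the same folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "tree": DecisionTreeClassifier(max_depth=12, random_state=0),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    out = cross_validate(model, X, y, cv=cv)
    results[name] = {
        "mean_score": out["test_score"].mean(),
        "std_score": out["test_score"].std(),
        "fit_seconds": out["fit_time"].sum(),  # total fit time over the folds
    }
    print(name, results[name])
```

With the real data, `X` and `y` would be replaced by `X_train` and `Y_train`; the scores on synthetic data say nothing about the Titanic results.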


Looking at type 1 and type 2 errors:

In [141]:
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X_tr, X_ts, y_tr, y_ts = train_test_split(X_train, Y_train)

rfc = RandomForestClassifier(n_estimators = 100)
rfc.fit(X_tr, y_tr)
Y_pred = rfc.predict(X_ts)

confusion_matrix(y_ts, Y_pred)

array([[115,  18],
       [ 21,  69]])

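In the matrix above, the type 2 error rate (false negatives among actual survivors) is 21/(21+69), about 0.23. To reduce it, consider adding more features, tuning hyperparameters with a grid search, or using boosted models.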
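The error rates can be read directly off the confusion matrix printed above. scikit-learn's `confusion_matrix` puts true labels on the rows and predictions on the columns, so for binary labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]`. A short worked version, using the printed numbers:

```python
# Confusion matrix as printed above: rows are true labels, columns are predictions.
matrix = [[115, 18],
          [21, 69]]

(tn, fp), (fn, tp) = matrix

false_positive_rate = fp / (fp + tn)  # type 1 error rate
false_negative_rate = fn / (fn + tp)  # type 2 error rate
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"type 1 (false positive) rate: {false_positive_rate:.3f}")
print(f"type 2 (false negative) rate: {false_negative_rate:.3f}")
print(f"accuracy: {accuracy:.3f}")
```

Because `train_test_split` and the forest are unseeded in the cell above, a rerun will give a somewhat different matrix.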

## Conclusion
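- Random Forest returns more accurate predictions than the single Decision Tree: mean cross-validated accuracy of about 0.806 versus 0.773 in the runs above

- RF scores show less variance across folds than a single Decision Tree: standard deviation of about 0.011 versus 0.034

- RF takes much longer to run: about 0.51 seconds versus 0.035 seconds, roughly 14 times slower with 100 trees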
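The challenge asks for the simplest forest that matches the tree, while the cells above only try `n_estimators = 100`. One hedged way to look for a smaller forest is to sweep the number of trees and stop at the first size whose mean score reaches the tree's. The sketch below assumes synthetic stand-in data and an arbitrary list of candidate sizes, so its outcome does not carry over to the Titanic features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the notebook itself uses X_train and Y_train.
X, y = make_classification(n_samples=300, n_features=9, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Baseline: mean cross-validated accuracy of a single tree.
tree = DecisionTreeClassifier(max_depth=12, random_state=0)
tree_score = cross_val_score(tree, X, y, cv=cv).mean()

# Try progressively larger forests; stop at the first one that matches the tree.
smallest = None
forest_scores = {}
for n in (1, 2, 5, 10, 25, 50, 100):  # candidate sizes are an arbitrary choice
    forest = RandomForestClassifier(n_estimators=n, random_state=0)
    forest_scores[n] = cross_val_score(forest, X, y, cv=cv).mean()
    if forest_scores[n] >= tree_score:
        smallest = n
        break

print("tree score:", tree_score)
print("forest scores:", forest_scores)
print("smallest matching forest:", smallest)
```

Fewer trees mean proportionally less fitting work, so a smaller matching forest would narrow the runtime gap noted in the last bullet.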