<a href="https://colab.research.google.com/github/matthewpecsok/4482_fall_2022/blob/main/tutorials/decision_tree_sklearn_titanic_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



decision tree titanic tutorial

Dr. Olivia Sheng
September 16, 2016

Converted to python by Steven Wang and Matthew Pecsok 5/2021



## table of contents

1.   Data Description
2.   Set up, data import and inspections
3.   Build decision trees
4.   Post-model-building data exploration
5.   Generate performance metrics
6.   Simple hold-out evaluation
7.   Regularization (tree model pruning)



# 1 Data Description
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people, such as women, children, and the upper class, were more likely to survive than others.

VARIABLE DESCRIPTIONS:

- PassengerID: Unique passenger identifier
- Survived: Survival (0 = No; 1 = Yes)
- Pclass: Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd). Pclass is a proxy for socio-economic status (SES): 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower
- Name: Name
- Sex: Sex
- Age: Age in years; fractional if the age is less than one (1). If the age is estimated, it is in the form xx.5
- SibSp: Number of siblings/spouses aboard
- Parch: Number of parents/children aboard
- Ticket: Ticket number
- Fare: Passenger fare
- Cabin: Cabin
- Embarked: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

# 2 Set up, data import and inspections

## load libraries

In [None]:
## Load packages 

import pandas as pd
import numpy as np
import sklearn
from sklearn import tree
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt



## import data

In [None]:
# read_csv has some defaults, we can just take the defaults here, but be aware they exist. 
titanic_raw = pd.read_csv("https://raw.githubusercontent.com/matthewpecsok/4482_fall_2022/main/data/titanic_cleaned.csv")
titanic = titanic_raw.copy()

# raw is the original unedited version of our data which can be useful for inspecting changes we've made 
# compared to the original unedited data

## get summary statistics of dataframe

In [None]:
titanic.info()

In [None]:
titanic.head()

In [None]:
titanic.describe(include='all')

In [None]:
# count null values (extremely important to identify nulls)

titanic.isnull().sum()

# no nulls, that's good news and almost never what happens in the real world.

In [None]:
titanic.Survived.value_counts()

In [None]:
round(titanic.Survived.value_counts() / len(titanic),2)

## transform character/string to categorical (factor in R)

In [None]:
# astype is a function in pandas that converts one type of data to another, e.g. string to int, or in this 
# case string to categorical
# https://pandas.pydata.org/docs/reference/api/pandas.Categorical.html

titanic = titanic.astype({'Survived': 'category',
                                          'Sex': 'category',
                                          'Pclass': 'category',
                                          'Cabin': 'category',
                                          'Embarked': 'category'})
titanic.dtypes

## dummy encode the data

sklearn tree models cannot handle strings/words. Categorical values must be converted to numeric values before fitting.

In [None]:
# extract the target column of survived. target aka y
# while R is happy to have the target in the dataframe with the X predictors sklearn prefers them separate
y_target = titanic.pop('Survived')

# use pandas get_dummies to one-hot-encode categorical values
# we would expect only numeric values left in our dataframe
# rename this df as encoded so we understand it's the encoded version
# of the original
titanic_encoded_X = pd.get_dummies(titanic)

titanic_encoded_X.head()
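
To make the encoding step concrete, here is a minimal sketch on a tiny made-up frame. The `toy` frame and its values are illustrative assumptions, not rows from the Titanic file. It shows that `pd.get_dummies` leaves numeric columns alone and expands a categorical column into one indicator column per level.

```python
import pandas as pd

# Hypothetical three-row frame: one numeric column, one categorical column.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 4.0],
    "Sex": pd.Categorical(["male", "female", "male"]),
})

# Numeric columns pass through unchanged; each categorical level
# becomes its own indicator column named <column>_<level>.
toy_encoded = pd.get_dummies(toy)

print(toy_encoded.columns.to_list())
print(toy_encoded)
```

The indicator columns may print as booleans or as 0/1 integers depending on the pandas version; sklearn accepts either.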

In [None]:
# lucky for us the binary target values are already numeric ie 0,1 instead of "yes","no" "true","false" etc
# this saves us a step of having to encode the series. 
y_target

# 3 build decision trees

In [None]:
# random state
# set random state for all models for reproducibility
# if this is NOT set then you may see variations each time you run the model
# for this reason reproducibility is desirable in homeworks
random_state = 42

In [None]:
# what is tree?
tree

# the sklearn.tree module, which contains DecisionTreeClassifier and helpers such as plot_tree and export_text

In [None]:
tree_model_1 = tree.DecisionTreeClassifier(random_state=random_state,max_leaf_nodes=11)
tree_model_1

## model 1 (all data)

### fit/train the model

In [None]:
# model 1 is a model trained on all the data. subsequent models will be variations of this model and should be compared to understand how these changes impact the model. 

In [None]:
# note: there is a lot going on behind the scenes here. fitting is a complex process.
# the first argument is a dataset of the predictors. the second is a series of the target or y variable. 
tree_model_1 = tree_model_1.fit(titanic_encoded_X,y_target) # this trains the model on the x and y data 

# check to see if the model is fitted
sklearn.utils.validation.check_is_fitted(tree_model_1) # raises NotFittedError if the model is not fitted, silent otherwise

### see a textual view of the model
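
A textual view can be produced with `tree.export_text`. The sketch below is self-contained and uses a made-up four-row data set; `X_toy`, `y_toy` and the feature names are hypothetical and only serve to show the shape of the output.

```python
from sklearn import tree

# Hypothetical toy data: two features, binary target (not the Titanic data).
X_toy = [[0, 10], [0, 20], [1, 10], [1, 20]]
y_toy = [0, 0, 1, 1]
feature_names = ["is_female", "fare"]

toy_model = tree.DecisionTreeClassifier(random_state=42).fit(X_toy, y_toy)

# export_text returns the fitted tree as an indented string of split rules,
# one line per node, with the predicted class shown at each leaf.
rules = tree.export_text(toy_model, feature_names=feature_names)
print(rules)
```

For model 1 in this notebook, the equivalent call would presumably be `print(tree.export_text(tree_model_1, feature_names=titanic_encoded_X.columns.to_list()))`, mirroring the call used for model 3 later on.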

### plot the tree

In [None]:
fig = plt.figure(figsize=(20,10))
_ = tree.plot_tree(tree_model_1,
                   feature_names=titanic_encoded_X.columns.to_list(), # make sure the feature names are in output
                   filled=True) # filled true color codes by the class. shading indicates proportion or quality of split

## model 2 (all data but with cabin removed)

In [None]:
# demonstrating how to drop all columns starting with "Cabin"
titanic_encoded_X.drop(titanic_encoded_X.columns[titanic_encoded_X.columns.str.startswith('Cabin')], axis=1, inplace=False)

In [None]:
# create a new model 2
tree_model_2 = tree.DecisionTreeClassifier(random_state=random_state,max_leaf_nodes=11)
tree_model_2

# intentionally drop the cabin column. pay attention to the decision to do this 
# if we drop a column should it increase or decrease model performance? Can you know this ahead of time?
titanic_encoded_X_no_cabin = titanic_encoded_X.drop(titanic_encoded_X.columns[titanic_encoded_X.columns.str.startswith('Cabin')], axis=1, inplace=False)


# note: there is a lot going on behind the scenes here. fitting is a complex process.
tree_model_2 = tree_model_2.fit(titanic_encoded_X_no_cabin,y_target) # this trains the model on the x and y data 

# check to see if the model is fitted
sklearn.utils.validation.check_is_fitted(tree_model_2) # raises NotFittedError if the model is not fitted, silent otherwise


### plot the tree

In [None]:
fig = plt.figure(figsize=(20,10))
_ = tree.plot_tree(tree_model_2,
                   feature_names=titanic_encoded_X_no_cabin.columns.to_list(), # names must match the columns model 2 was trained on
                   filled=True) # filled true color codes by the class. shading indicates proportion or quality of split

# 4 Post-model-building data exploration

In [None]:
# generate metrics for male and female passengers. in this notebook we will demonstrate doing what was done in R,
# but in this case we will use a function to simplify the code
# often when doing the same thing 2 or more times a function can reduce redundant code

In [None]:
titanic[titanic['Sex']=='male'] # demonstration of how to filter a dataframe

In [None]:
def metrics_by_gender(gender,df):
  # filter df by gender
  display("Dataframe subset of: "+gender)

  df = df[df['Sex']==gender]
  
  print(gender+": shape")
  display(df.shape)

  print(gender+": describe")
  display(df.describe())


In [None]:
# demonstration of a very simple loop that subsets the dataframe by gender and prints the new shape
for gender in ['male','female']:
  df = titanic[titanic['Sex']==gender]
  print(gender)
  print(df.shape)
  print()

In [None]:
# the same loop, but printing summary statistics for each subset instead of the shape
for gender in ['male','female']:
  df = titanic[titanic['Sex']==gender]
  print(gender)
  print(df.describe())
  print()

In [None]:
# the same loop on the raw data, printing counts by Survived and Pclass for each gender
for gender in ['male','female']:
  df = titanic_raw[titanic_raw['Sex']==gender]
  print(gender)
  #print(df.groupby('Survived').value_counts())
  print(df.groupby(['Survived','Pclass'])[['Pclass']].agg(['count']))
  print()

In [None]:
for gender in ['male','female']:
  df = titanic_raw[titanic_raw['Sex']==gender]
  print(gender)
  #print(df.groupby('Survived').value_counts())
  print(df.groupby(['Survived','Embarked'])[['Embarked']].agg(['count']))
  print()
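
As an alternative to looping over genders, the same kind of count table can be built with one `groupby`. This is a minimal sketch on a made-up six-row frame; the `toy` data is hypothetical, and only the column names `Sex` and `Survived` are borrowed from the notebook.

```python
import pandas as pd

# Hypothetical passengers, not rows from the Titanic file.
toy = pd.DataFrame({
    "Sex": ["male", "male", "female", "female", "female", "male"],
    "Survived": [0, 0, 1, 1, 0, 1],
})

# size() counts rows per (Sex, Survived) pair; unstack() pivots Survived
# into columns so each gender becomes one row of the table.
counts = toy.groupby(["Sex", "Survived"]).size().unstack(fill_value=0)
print(counts)
```

Grouping by more columns, for example adding `Pclass` to the list, would extend the table without adding another loop level.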

In [None]:
# demonstrating a nested for loop
# be careful of going much deeper than this in a loop 
# the code becomes very difficult to read
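import warnings  # np.warnings was removed in NumPy 1.24, so use the standard-library module instead
warnings.filterwarnings('ignore', category=UserWarning)   # ignore some ugly warnings (NumPy's VisibleDeprecationWarning is a UserWarning subclass; note this hides other UserWarnings too)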

for numeric_var in ['Age','Fare','Parch','SibSp']:
  for gender in ['male','female'] :
    df = titanic_raw[titanic_raw['Sex']==gender]
    df.boxplot(column=[numeric_var],by=['Survived']) # R boxplot(Age~Survived, data = titanic)
    plt.title( 'Boxplot of %s by Survived and Sex=%s' % (numeric_var,gender) )
    plt.suptitle('')
    plt.show()


# 5 Generate performance metrics

In [None]:
# predict() is a method of the fitted model. it takes a data set of predictors as its argument.
# Let's apply it to the whole data set that was used to train the model 
# to see the model's performance metrics in training data (i.e., not holdout evaluation)
# Take a look at the contents and shape of model_1_pred to understand the output of predict()

In [None]:
# generate predicted classes for the training data. the confusion matrix comparing y_true and y_predicted is built from these
model_1_pred = tree_model_1.predict(titanic_encoded_X)

print(model_1_pred)
print(model_1_pred.shape)
# pay attention to the predictions. how does the model choose 0 or 1 for the predictions? What is going on under the hood here? 



In [None]:
predicted_probs = tree_model_1.predict_proba(titanic_encoded_X)


In [None]:
pd.DataFrame(data={'predicted_class':model_1_pred,'predicted_probability':predicted_probs[:,0],'real_class':y_target}) 
# note: predicted_probs[:,0] is the predicted probability of class 0 (did not survive)
# so what's the relationship between the class and the probability?
# nothing more than: if P(class 0) > .5 then predict 0, else predict 1
# can you see how we might begin to quantify how far a given prediction is from the truth?
# when the model is uncertain we can expect the probability to be close to .5.
# in those cases, can you see the model getting the answer wrong? Is that surprising?
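
The link between `predict_proba` and `predict` can be checked directly. The sketch below uses a small made-up data set with overlapping classes (`X_toy` and `y_toy` are hypothetical) so that some leaves are impure and the probabilities are not all 0 or 1.

```python
import numpy as np
from sklearn import tree

# Hypothetical toy data: identical feature values carry mixed labels,
# so the tree cannot separate the classes perfectly.
X_toy = np.array([[1], [1], [1], [2], [2], [2], [3], [3]])
y_toy = np.array([0, 0, 1, 1, 1, 0, 1, 1])

toy_model = tree.DecisionTreeClassifier(random_state=42).fit(X_toy, y_toy)

probs = toy_model.predict_proba(X_toy)   # one column per class, in classes_ order
preds = toy_model.predict(X_toy)

# predict() returns the class whose probability column is largest.
preds_from_probs = toy_model.classes_[np.argmax(probs, axis=1)]

print(toy_model.classes_)
print(probs)
print(preds)
print((preds == preds_from_probs).all())
```

For a decision tree, each row of `probs` is the class proportion among the training rows in the leaf that the observation falls into, which is why the values move in coarse steps rather than continuously.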

In [None]:
print(len(model_1_pred))



In [None]:
model_1_cf = confusion_matrix(y_true=y_target,y_pred=model_1_pred)
model_1_cf
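
To read the matrix, it helps to know its layout: rows are true classes and columns are predicted classes. A minimal sketch with made-up labels (`y_true` and `y_pred` here are hypothetical, chosen so that every cell has a different count):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, not model output from this notebook.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

cm = confusion_matrix(y_true=y_true, y_pred=y_pred)
print(cm)

# For binary 0/1 labels, flattening the matrix row by row yields
# true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = cm.ravel()
print("tn:", tn, "fp:", fp, "fn:", fn, "tp:", tp)
```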

### model 1 & 2 performance metrics

In [None]:
model_2_pred = tree_model_2.predict(titanic_encoded_X_no_cabin)

print(model_2_pred)



In [None]:
print(len(model_2_pred))

In [None]:
model_2_cf = confusion_matrix(y_true=y_target,y_pred=model_2_pred)
model_2_cf

### confusion matrix comparison

In [None]:
print("model1")
print(model_1_cf)

print()

print("model2")
print(model_2_cf)

# does dropping cabin improve the model or make it worse?
# notice that the model has decreased in performance for one class and increased in another. 
# is this model over or underfitted? more analysis needed to be certain.


### recall, precision f1 etc

In [None]:
# performance of the tree_model_1
# be sure to compare these metrics across the models
# metrics themselves are more useful when comparing across models
print(metrics.classification_report(y_target,tree_model_1.predict(titanic_encoded_X)))

In [None]:
print(metrics.classification_report(y_target,tree_model_2.predict(titanic_encoded_X_no_cabin)))
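
The numbers in the classification report can be derived by hand from the confusion matrix. The sketch below does this for class 1 on made-up labels (hypothetical `y_true` and `y_pred`) and compares the results with sklearn's own metric functions.

```python
from sklearn import metrics

# Hypothetical labels, not model output from this notebook.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)            # of the predicted 1s, how many were truly 1
recall = tp / (tp + fn)               # of the true 1s, how many were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(precision, metrics.precision_score(y_true, y_pred))
print(recall, metrics.recall_score(y_true, y_pred))
print(f1, metrics.f1_score(y_true, y_pred))
print(accuracy, metrics.accuracy_score(y_true, y_pred))
```

The report prints these per class; the values above correspond to the row for class 1.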

# 6 Simple hold-out evaluation

In [None]:
# Examine the impacts of simple hold-out evaluation, the training set size, the feature selection and the pruning factor (ccp_alpha in sklearn)

# Only knowing the model's training performance is not sufficient. Let's try a simple hold-out evaluation. 

# Use train_test_split() in sklearn package to split titanic 70%-30% into a train set and a test set
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

# set random state to a value for train_test_split(). With the same value and input, the split is the same on every run.

In [None]:
X_sample = [0,0,0,0,0,0,0,0,0,0,0,0,1,1]
train_test_split(X_sample,test_size=.5,stratify=X_sample)
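
What `stratify` buys us is easiest to see by counting. The sketch below reuses the same imbalanced sample (twelve 0s and two 1s) and compares a stratified split with unstratified ones; the seeds and the variable names `train_half` and `test_half` are arbitrary choices for illustration.

```python
from sklearn.model_selection import train_test_split

# Same imbalanced sample as in the cell above: twelve 0s and two 1s.
X_sample = [0] * 12 + [1] * 2

# With stratify, each half keeps the 12:2 class ratio,
# so each half receives exactly one of the two 1s.
train_half, test_half = train_test_split(
    X_sample, test_size=.5, stratify=X_sample, random_state=42
)
print("stratified:", sum(train_half), sum(test_half))

# Without stratify, both 1s may land in the same half, depending on the shuffle.
for seed in range(5):
    tr, te = train_test_split(X_sample, test_size=.5, random_state=seed)
    print("seed", seed, "unstratified:", sum(tr), sum(te))
```

With a rare class, an unstratified split can leave the test set with no positive cases at all, which makes recall for that class impossible to estimate.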

In [None]:
# split the dataset into two main groups
# train will be used for training the model
# test will be used for evaluation of the model
# both of these are simply subsets of the original dataset

X_train, X_test, y_train, y_test = train_test_split(titanic_encoded_X,
                                                    y_target, 
                                                    test_size=.3, 
                                                    random_state=random_state,
                                                    stratify=y_target)
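
The effect of fixing `random_state` can be verified with a quick check: two calls with the same inputs and the same seed return identical splits. The arrays `X_demo` and `y_demo` below are made up for the demonstration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features, balanced binary target.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

split_a = train_test_split(X_demo, y_demo, test_size=.3, random_state=42, stratify=y_demo)
split_b = train_test_split(X_demo, y_demo, test_size=.3, random_state=42, stratify=y_demo)

# Each call returns [X_train, X_test, y_train, y_test]; compare them pairwise.
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
print(split_a[0].shape, split_a[1].shape)
```

Changing the seed, or leaving it unset, would generally produce a different partition and therefore slightly different metrics from run to run.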

In [None]:
y_test.value_counts()

In [None]:
y_train.value_counts()

In [None]:
X_train.describe(include='all')

In [None]:
X_test.describe(include='all')

In [None]:
print(y_train.value_counts())
print(round(y_train.value_counts(normalize=True),2))

In [None]:
print(y_test.value_counts())
print(round(y_test.value_counts(normalize=True),2))

## model 3 (simple hold out 70%/30% split)

### fit the model

In [None]:
model_3_simple_hold_out = tree.DecisionTreeClassifier(random_state=random_state,max_leaf_nodes=11)

# fit the model to the training data
model_3_simple_hold_out = model_3_simple_hold_out.fit(X_train, y_train)

# show what the trained model looks like
print(tree.export_text(model_3_simple_hold_out, feature_names=X_train.columns.to_list()))

In [None]:
fig = plt.figure(figsize=(20,10))
_ = tree.plot_tree(model_3_simple_hold_out,
                   feature_names=X_train.columns.to_list(),
                   filled=True)

### performance metrics of model 3

In [None]:
model_3_pred = model_3_simple_hold_out.predict(X_test)

print(metrics.classification_report(y_test,model_3_pred))
print(metrics.confusion_matrix(y_test,model_3_pred))

## model 4 (simple hold out 70%/30% split and cabin removed)

In [None]:
# another way to drop cabin columns
# notice there are multiple ways to drop columns.

# Drop Cabin
X_train_no_cabin = X_train.drop(list(X_train.filter(regex = '^Cabin')), axis=1, inplace=False)
X_test_no_cabin = X_test.drop(list(X_test.filter(regex = '^Cabin')), axis=1, inplace=False)

### fit

In [None]:
model_4 = tree.DecisionTreeClassifier(random_state=random_state,max_leaf_nodes=11)

model_4 = model_4.fit(X_train_no_cabin, y_train)
print(tree.export_text(model_4, feature_names=X_train_no_cabin.columns.to_list()))

### plot

In [None]:
fig = plt.figure(figsize=(20,10))
_ = tree.plot_tree(model_4,
                   feature_names=X_train_no_cabin.columns.to_list(), # names must match the columns model 4 was trained on
                   filled=True)

### performance metrics of model 4

In [None]:
# Test set performance metrics
print(metrics.classification_report(y_test,model_4.predict(X_test_no_cabin)))
print(metrics.confusion_matrix(y_test,model_4.predict(X_test_no_cabin)))

# 7 Regularization (tree model pruning)

## model 5 (pruned tree on simple hold out 70%/30%)

In [None]:
# check out the ccp_alpha level on the default tree sklearn gives us.
# notice it's set to 0.0 (no pruning)!
# newer sklearn versions only print non-default parameters, so read the value from get_params()
tree.DecisionTreeClassifier(random_state=random_state,max_leaf_nodes=11).get_params()['ccp_alpha']


In [None]:
tree.DecisionTreeClassifier(random_state=random_state,max_leaf_nodes=11,ccp_alpha=.005) # notice we have set a new hyper parameter ccp_alpha

In [None]:
model_5 = tree.DecisionTreeClassifier(random_state=random_state,max_leaf_nodes=11,ccp_alpha=.01)
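
To get a feel for what `ccp_alpha` does, the sketch below fits the same tree at several alpha values and counts the leaves. It uses synthetic data from `make_classification` rather than the Titanic data, and it omits `max_leaf_nodes` so the pruning effect is easier to see; the alpha values are arbitrary choices for illustration.

```python
from sklearn import tree
from sklearn.datasets import make_classification

# Synthetic, noisy binary classification data (an assumption for this demo).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=3, flip_y=0.1, random_state=42
)

leaf_counts = {}
for alpha in [0.0, 0.005, 0.01, 0.05]:
    # Larger ccp_alpha penalizes tree size more heavily, pruning more branches.
    model = tree.DecisionTreeClassifier(random_state=42, ccp_alpha=alpha).fit(X_syn, y_syn)
    leaf_counts[alpha] = model.get_n_leaves()

print(leaf_counts)
```

The leaf count should never increase as alpha grows, since each pruned tree is a subtree of the one obtained at a smaller alpha.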

### fit

In [None]:
model_5 = model_5.fit(X_train, y_train)
print(tree.export_text(model_5, feature_names=X_train.columns.to_list()))

### plot

In [None]:
fig = plt.figure(figsize=(20,10))
_ = tree.plot_tree(model_5,
                   feature_names=X_train.columns.to_list(),
                   filled=True)

### performance metrics of model 5

In [None]:
# Test set
print(metrics.classification_report(y_test,model_5.predict(X_test)))
print(metrics.confusion_matrix(y_test,model_5.predict(X_test)))

In [None]:
# Train set
print(metrics.classification_report(y_train,model_5.predict(X_train)))
print(metrics.confusion_matrix(y_train,model_5.predict(X_train)))

In [None]:
# when viewing the performance metrics above, which set did "better"? Why would the model perform better on the train set than on the test set? What term do we use when this occurs?
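
The train-versus-test gap asked about above can be reproduced in an exaggerated form. This is a rough sketch on synthetic, deliberately noisy data (not the Titanic data), using a tree with no size limit and no pruning; all names and parameter values are assumptions made for the demonstration.

```python
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data with 20% of labels flipped at random.
X_syn, y_syn = make_classification(
    n_samples=400, n_features=10, n_informative=3, flip_y=0.2, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=.3, random_state=42, stratify=y_syn
)

# No max_leaf_nodes and no ccp_alpha: the tree grows until every leaf is pure.
full_tree = tree.DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, full_tree.predict(X_tr))
test_acc = accuracy_score(y_te, full_tree.predict(X_te))
print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```

Repeating the run with `max_leaf_nodes` or `ccp_alpha` set, as in models 3 to 5, is one way to see how limiting tree size changes the gap.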