<a href="https://colab.research.google.com/github/matthewpecsok/6482/blob/main/tutorials/Decision_Tree_CV_Titanic_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Decision Tree Titanic CV Tutorial


1 Data description

2 Library Setup

3 Overall data inspection

4 Tree model building using sklearn package

5 Explanatory data exploration

6 Generate performance metrics

7 Simple hold-out evaluation


# 1 Data description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people, such as women, children, and the upper class, were more likely to survive than others.

VARIABLE DESCRIPTIONS:

PassengerID - Unique passenger identifier

Survived - Survival (0 = No; 1 = Yes)

Pclass - Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd). Pclass is a proxy for socio-economic status (SES): 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Name

Sex

Age - Age in years; fractional if the age is less than one (1). If the age is estimated, it is in the form xx.5

SibSp - Number of siblings/spouses aboard

Parch - Number of parents/children aboard

Ticket - Ticket number

Fare - Passenger fare

Cabin

Embarked - Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)


# 2 Library Setup

In [18]:
import pandas as pd
import numpy as np

import os

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn import tree

import matplotlib.pyplot as plt

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

from sklearn import metrics

from sklearn.model_selection import cross_validate


In [20]:
random_state = 42

# 3 Overall data inspection

In [2]:
titanic = pd.read_csv("https://raw.githubusercontent.com/matthewpecsok/4482_fall_2024/main/data/titanic_cleaned.csv")

In [3]:
type(titanic)


In [4]:
titanic.shape

(714, 9)

In [5]:
titanic.columns

Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin',
       'Embarked'],
      dtype='object')

The data is already a pandas DataFrame, so it can be inspected directly for exploratory data analysis.

In [6]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 714 entries, 0 to 713
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  714 non-null    int64  
 1   Pclass    714 non-null    int64  
 2   Sex       714 non-null    object 
 3   Age       714 non-null    float64
 4   SibSp     714 non-null    int64  
 5   Parch     714 non-null    int64  
 6   Fare      714 non-null    float64
 7   Cabin     714 non-null    object 
 8   Embarked  714 non-null    object 
dtypes: float64(2), int64(4), object(3)
memory usage: 50.3+ KB


In [7]:
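# keep only the target and the categorical columns (drop the numeric ones)
# Cabin is also dropped because it currently causes issues when splitting
# .copy() avoids pandas' SettingWithCopyWarning when columns are modified later
titanic = titanic[['Survived','Sex','Embarked','Pclass']].copy()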

In [8]:
titanic

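| | Survived | Sex | Embarked | Pclass |
|---|---|---|---|---|
| 0 | 0 | male | S | 3 |
| 1 | 1 | female | C | 1 |
| 2 | 1 | female | S | 3 |
| 3 | 1 | female | S | 1 |
| 4 | 0 | male | S | 3 |
| ... | ... | ... | ... | ... |
| 709 | 0 | female | Q | 3 |
| 710 | 0 | male | S | 2 |
| 711 | 1 | female | S | 1 |
| 712 | 1 | male | C | 1 |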


In [9]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 714 entries, 0 to 713
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Survived  714 non-null    int64 
 1   Sex       714 non-null    object
 2   Embarked  714 non-null    object
 3   Pclass    714 non-null    int64 
dtypes: int64(2), object(2)
memory usage: 22.4+ KB


In [10]:
titanic.describe(include='all')

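| | Survived | Sex | Embarked | Pclass |
|---|---|---|---|---|
| count | 714.0 | 714 | 714 | 714.0 |
| unique | | 2 | 4 | |
| top | | male | S | |
| freq | | 453 | 554 | |
| mean | 0.406162 | | | 2.236695 |
| std | 0.49146 | | | 0.83825 |
| min | 0.0 | | | 1.0 |
| 25% | 0.0 | | | 1.0 |
| 50% | 0.0 | | | 2.0 |
| 75% | 1.0 | | | 3.0 |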


## Dummy encoding the dataframe

In [11]:
titanic.head(2)

| | Survived | Sex | Embarked | Pclass |
|---|---|---|---|---|
| 0 | 0 | male | S | 3 |
| 1 | 1 | female | C | 1 |


Pop the target variable `Survived` (already coded 0 = No, 1 = Yes) off the dataframe, so that only the feature columns remain.

In [12]:
y = titanic.pop('Survived')
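
As a small illustration of what `pop` does, the sketch below uses a made-up three-row frame (`toy` and `toy_y` are hypothetical names, not part of the tutorial data). The column is returned as a Series and removed from the frame in place.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic frame
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Sex": ["male", "female", "female"],
    "Pclass": [3, 1, 3],
})

# pop returns the column as a Series and removes it from the frame in place
toy_y = toy.pop("Survived")

print(toy_y.tolist())      # target values
print(list(toy.columns))   # remaining feature columns
```

Because `pop` mutates the frame, re-running the same cell a second time raises a `KeyError`; reloading the data first avoids that.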

## 3.2 Encode the data

In [13]:
# convert Pclass to str so that get_dummies treats it as a categorical column
titanic['Pclass'] = titanic['Pclass'].astype(str)
titanic.dtypes


| | 0 |
|---|---|
| Sex | object |
| Embarked | object |
| Pclass | object |


In [14]:
titanic_enc = pd.get_dummies(titanic)

In [15]:
titanic_enc.dtypes

| | 0 |
|---|---|
| Sex_female | bool |
| Sex_male | bool |
| Embarked_C | bool |
| Embarked_Q | bool |
| Embarked_S | bool |
| Embarked_missing | bool |
| Pclass_1 | bool |
| Pclass_2 | bool |
| Pclass_3 | bool |


In [16]:
titanic_enc.head(2)

| | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S | Embarked_missing | Pclass_1 | Pclass_2 | Pclass_3 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | False | True | False | False | True | False | False | False | True |
| 1 | True | False | True | False | False | False | True | False | False |
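
The cast of `Pclass` to `str` matters because `pd.get_dummies` only expands object, string, and categorical columns by default and passes numeric columns through unchanged. The following is a minimal sketch on a made-up frame (`toy`, `enc_numeric`, and `enc_categorical` are hypothetical names), comparing the encoding before and after the cast.

```python
import pandas as pd

# Hypothetical toy data; not the tutorial's Titanic frame
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Pclass": [3, 1, 3],   # stored as integers
})

# Integer columns pass through unchanged; only the text column is expanded
enc_numeric = pd.get_dummies(toy)

# After casting to str, Pclass is expanded into one column per class value
toy["Pclass"] = toy["Pclass"].astype(str)
enc_categorical = pd.get_dummies(toy)

print(list(enc_numeric.columns))
print(list(enc_categorical.columns))
```

Leaving `Pclass` numeric would also work with a decision tree, but the tree would then treat the classes as ordered values rather than as separate categories.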


# 4 Build a simple model and use cross-validation to see how it performs across folds

In [21]:
tree_1 = tree.DecisionTreeClassifier(random_state=random_state,ccp_alpha=0.05)
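
`ccp_alpha` is the cost-complexity pruning parameter of `DecisionTreeClassifier`: the tree is grown first and then pruned, and larger values prune more aggressively. The sketch below is only an illustration on synthetic data from `make_classification` (an assumption, not the Titanic features); it compares node counts with and without pruning.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration; not the Titanic features)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)

# ccp_alpha=0.0 (the default) keeps the fully grown tree
full_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=0.0).fit(X_demo, y_demo)

# The same tree, pruned with the ccp_alpha value used in this tutorial
pruned_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=0.05).fit(X_demo, y_demo)

print("nodes without pruning:", full_tree.tree_.node_count)
print("nodes with ccp_alpha=0.05:", pruned_tree.tree_.node_count)
```

A heavily pruned tree tends to score similarly on training and test folds, which is worth keeping in mind when reading the cross-validation results.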

In [22]:
cross_validate(
    tree_1, titanic_enc, y, cv=5, scoring=['f1','accuracy','recall','precision'],return_train_score=True)

{'fit_time': array([0.04227972, 0.00494552, 0.03831482, 0.04797864, 0.01712012]),
 'score_time': array([0.04615951, 0.01529813, 0.04362941, 0.0480535 , 0.02426124]),
 'test_f1': array([0.73043478, 0.76785714, 0.71428571, 0.61386139, 0.73873874]),
 'train_f1': array([0.71100917, 0.70159453, 0.71526196, 0.73777778, 0.70909091]),
 'test_accuracy': array([0.78321678, 0.81818182, 0.77622378, 0.72727273, 0.79577465]),
 'train_accuracy': array([0.7793345 , 0.77057793, 0.78108581, 0.79334501, 0.77622378]),
 'test_recall': array([0.72413793, 0.74137931, 0.68965517, 0.53448276, 0.70689655]),
 'train_recall': array([0.66810345, 0.6637931 , 0.67672414, 0.71551724, 0.67241379]),
 'test_precision': array([0.73684211, 0.7962963 , 0.74074074, 0.72093023, 0.77358491]),
 'train_precision': array([0.75980392, 0.74396135, 0.75845411, 0.76146789, 0.75      ])}
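
`cross_validate` returns a dictionary with one array entry per fold for every key. When `cv` is an integer and the estimator is a classifier with a binary or multiclass target, scikit-learn uses an unshuffled `StratifiedKFold`. The sketch below, on synthetic data (an assumption, not the Titanic features), writes that splitter out explicitly and checks that both calls give the same scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (an assumption for illustration; not the Titanic features)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
clf = DecisionTreeClassifier(random_state=42, ccp_alpha=0.05)

# Integer cv with a classifier: stratified, unshuffled folds
res_int = cross_validate(clf, X_demo, y_demo, cv=5, scoring=["f1", "accuracy"])

# The same splitter, written out explicitly
splitter = StratifiedKFold(n_splits=5)
res_explicit = cross_validate(clf, X_demo, y_demo, cv=splitter, scoring=["f1", "accuracy"])

print(sorted(res_int.keys()))
print(np.allclose(res_int["test_accuracy"], res_explicit["test_accuracy"]))
```

Passing a splitter object is also the way to request shuffling, for example `StratifiedKFold(n_splits=5, shuffle=True, random_state=42)`.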

In [23]:
scores = cross_validate(
    tree_1, titanic_enc, y, cv=5, scoring=['f1','accuracy','recall','precision'],return_train_score=True)
pd.DataFrame(scores)

| | fit_time | score_time | test_f1 | train_f1 | test_accuracy | train_accuracy | test_recall | train_recall | test_precision | train_precision |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.005204 | 0.024764 | 0.730435 | 0.711009 | 0.783217 | 0.779335 | 0.724138 | 0.668103 | 0.736842 | 0.759804 |
| 1 | 0.005114 | 0.035339 | 0.767857 | 0.701595 | 0.818182 | 0.770578 | 0.741379 | 0.663793 | 0.796296 | 0.743961 |
| 2 | 0.010059 | 0.029521 | 0.714286 | 0.715262 | 0.776224 | 0.781086 | 0.689655 | 0.676724 | 0.740741 | 0.758454 |
| 3 | 0.005124 | 0.016202 | 0.613861 | 0.737778 | 0.727273 | 0.793345 | 0.534483 | 0.715517 | 0.72093 | 0.761468 |
| 4 | 0.006008 | 0.016168 | 0.738739 | 0.709091 | 0.795775 | 0.776224 | 0.706897 | 0.672414 | 0.773585 | 0.75 |
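
The `f1` column is not independent of the other two: F1 is the harmonic mean of precision and recall. As a quick worked check, the snippet below recomputes the fold 0 test F1 from the fold 0 test precision and recall reported above (the variable names are illustrative only).

```python
# Fold 0 test scores copied from the cross-validation results above
precision = 0.736842
recall = 0.724138
reported_f1 = 0.730435

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4), round(reported_f1, 4))
```

This also explains fold 3, where recall drops to 0.534 and F1 falls to 0.614 even though precision stays near 0.72.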


In [24]:
five_fold = pd.DataFrame(scores)


In [25]:
print("mean\n\n",five_fold.mean(axis=0))
print("\n\nstd\n\n",five_fold.std(axis=0))

mean

 fit_time           0.006302
score_time         0.024398
test_f1            0.713036
train_f1           0.714947
test_accuracy      0.780134
train_accuracy     0.780113
test_recall        0.679310
train_recall       0.679310
test_precision     0.753679
train_precision    0.754737
dtype: float64


std

 fit_time           0.002133
score_time         0.008381
test_f1            0.058749
train_f1           0.013688
test_accuracy      0.033583
train_accuracy     0.008407
test_recall        0.083224
train_recall       0.020806
test_precision     0.030561
train_precision    0.007472
dtype: float64
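
Note that pandas computes the sample standard deviation (`ddof=1`) by default, whereas NumPy's `np.std` defaults to the population version (`ddof=0`). The sketch below reuses the five `test_f1` values from the 5-fold run above to show the difference; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# test_f1 values of the five folds, copied from the cross_validate output above
test_f1 = pd.Series([0.73043478, 0.76785714, 0.71428571, 0.61386139, 0.73873874])

mean_f1 = test_f1.mean()
std_sample = test_f1.std()                    # pandas default: ddof=1
std_population = np.std(test_f1.to_numpy())   # NumPy default: ddof=0

print(f"mean={mean_f1:.6f} sample std={std_sample:.6f} population std={std_population:.6f}")
```

With only five folds the two conventions differ noticeably, so it helps to state which one is being reported.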


In [27]:
scores = cross_validate(
    tree_1, titanic_enc, y, cv=10, scoring=['f1','accuracy','recall','precision'],return_train_score=True)
pd.DataFrame(scores)

| | fit_time | score_time | test_f1 | train_f1 | test_accuracy | train_accuracy | test_recall | train_recall | test_precision | train_precision |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.004421 | 0.009805 | 0.786885 | 0.706122 | 0.819444 | 0.775701 | 0.827586 | 0.662835 | 0.75 | 0.755459 |
| 1 | 0.004055 | 0.01125 | 0.666667 | 0.720322 | 0.75 | 0.783489 | 0.62069 | 0.685824 | 0.72 | 0.758475 |
| 2 | 0.00327 | 0.009276 | 0.730769 | 0.713427 | 0.805556 | 0.777259 | 0.655172 | 0.681992 | 0.826087 | 0.747899 |
| 3 | 0.003339 | 0.009093 | 0.8 | 0.704684 | 0.833333 | 0.774143 | 0.827586 | 0.662835 | 0.774194 | 0.752174 |
| 4 | 0.003184 | 0.008772 | 0.666667 | 0.720648 | 0.732394 | 0.785381 | 0.655172 | 0.681992 | 0.678571 | 0.763948 |
| 5 | 0.003204 | 0.011909 | 0.75 | 0.711111 | 0.802817 | 0.777605 | 0.724138 | 0.67433 | 0.777778 | 0.752137 |
| 6 | 0.003249 | 0.009503 | 0.653846 | 0.721443 | 0.746479 | 0.783826 | 0.586207 | 0.689655 | 0.73913 | 0.756303 |
| 7 | 0.003258 | 0.008656 | 0.583333 | 0.727634 | 0.71831 | 0.786936 | 0.482759 | 0.701149 | 0.736842 | 0.756198 |
| 8 | 0.003045 | 0.008941 | 0.75 | 0.711111 | 0.802817 | 0.777605 | 0.724138 | 0.67433 | 0.777778 | 0.752137 |
| 9 | 0.003149 | 0.008683 | 0.727273 | 0.71371 | 0.788732 | 0.77916 | 0.689655 | 0.678161 | 0.769231 | 0.753191 |
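
With 714 rows, each test fold holds about 143 passengers under 5-fold CV but only about 71 under 10-fold CV, so per-fold scores tend to be noisier with more folds. One way to compare the two settings side by side is sketched below on synthetic data (an assumption, not the Titanic features; `rows` and `comparison` are hypothetical names).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (an assumption for illustration; not the Titanic features)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=42)
clf = DecisionTreeClassifier(random_state=42, ccp_alpha=0.05)

rows = {}
for k in (5, 10):
    res = pd.DataFrame(cross_validate(clf, X_demo, y_demo, cv=k, scoring=["f1", "accuracy"]))
    # Keep only the test-score columns and summarise them across the k folds
    test_cols = [c for c in res.columns if c.startswith("test_")]
    rows[k] = pd.concat([res[test_cols].mean().add_suffix("_mean"),
                         res[test_cols].std().add_suffix("_std")])

comparison = pd.DataFrame(rows).T   # one row per number of folds
comparison.index.name = "k"
print(comparison)
```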


# 5 Add the mean and standard deviation

Add the across-fold mean and standard deviation of each metric to the dataframe itself for easier reading.

In [30]:
scores = cross_validate(
    tree_1, titanic_enc, y, cv=10, scoring=['f1','accuracy','recall','precision'],return_train_score=True)
ten_fold = pd.DataFrame(scores)

summary_row = pd.DataFrame([ten_fold.mean(), ten_fold.std()], index=['Mean', 'Std'])
ten_fold = pd.concat([ten_fold, summary_row])
ten_fold





| | fit_time | score_time | test_f1 | train_f1 | test_accuracy | train_accuracy | test_recall | train_recall | test_precision | train_precision |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.005606 | 0.009826 | 0.786885 | 0.706122 | 0.819444 | 0.775701 | 0.827586 | 0.662835 | 0.75 | 0.755459 |
| 1 | 0.003395 | 0.010017 | 0.666667 | 0.720322 | 0.75 | 0.783489 | 0.62069 | 0.685824 | 0.72 | 0.758475 |
| 2 | 0.003079 | 0.008824 | 0.730769 | 0.713427 | 0.805556 | 0.777259 | 0.655172 | 0.681992 | 0.826087 | 0.747899 |
| 3 | 0.003367 | 0.00883 | 0.8 | 0.704684 | 0.833333 | 0.774143 | 0.827586 | 0.662835 | 0.774194 | 0.752174 |
| 4 | 0.003133 | 0.008442 | 0.666667 | 0.720648 | 0.732394 | 0.785381 | 0.655172 | 0.681992 | 0.678571 | 0.763948 |
| 5 | 0.00309 | 0.008729 | 0.75 | 0.711111 | 0.802817 | 0.777605 | 0.724138 | 0.67433 | 0.777778 | 0.752137 |
| 6 | 0.003072 | 0.010391 | 0.653846 | 0.721443 | 0.746479 | 0.783826 | 0.586207 | 0.689655 | 0.73913 | 0.756303 |
| 7 | 0.004039 | 0.011315 | 0.583333 | 0.727634 | 0.71831 | 0.786936 | 0.482759 | 0.701149 | 0.736842 | 0.756198 |
| 8 | 0.003203 | 0.012569 | 0.75 | 0.711111 | 0.802817 | 0.777605 | 0.724138 | 0.67433 | 0.777778 | 0.752137 |
| 9 | 0.003218 | 0.008926 | 0.727273 | 0.71371 | 0.788732 | 0.77916 | 0.689655 | 0.678161 | 0.769231 | 0.753191 |
