# Decision Tree
## Titanic data

This notebook fits a decision tree classifier to the Titanic data and then compares it with a random forest.



In [1]:
### load the data
import pandas as pd
df = pd.read_csv('data/titanic3.csv', usecols=['pclass', 'survived', 'sex', 'age'])
print(df.head())
print('\nDimensions of data frame:', df.shape)

   pclass  survived     sex      age
0       1         1  female  29.0000
1       1         1    male   0.9167
2       1         0  female   2.0000
3       1         0    male  30.0000
4       1         0  female  25.0000

Dimensions of data frame: (1309, 4)


In [2]:
# encode columns as integer category codes
df.survived = df.survived.astype('category').cat.codes
df.pclass = df.pclass.astype('category').cat.codes
df.sex = df.sex.astype('category').cat.codes
df.head()

   pclass  survived  sex      age
0       0         1    0  29.0000
1       0         1    1   0.9167
2       0         0    0   2.0000
3       0         0    1  30.0000
4       0         0    0  25.0000
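
Once the labels are replaced by integer codes, it helps to record which code stands for which label. The following is a minimal sketch on a small hypothetical frame (`toy`, not the Titanic file) of how the mapping can be read from the categorical before it is reduced to codes. The names `toy`, `sex_map`, and `pclass_map` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic columns
toy = pd.DataFrame({'pclass': [1, 3, 2, 1],
                    'sex': ['female', 'male', 'male', 'female']})

# Keep the categorical objects so their categories can be inspected
sex_cat = toy['sex'].astype('category')
pclass_cat = toy['pclass'].astype('category')

# Categories are stored in sorted order, and code i refers to categories[i]
sex_map = dict(enumerate(sex_cat.cat.categories))
pclass_map = dict(enumerate(pclass_cat.cat.categories))

print(sex_map)
print(pclass_map)
```

Read against the two `head()` outputs in this notebook, the same rule gives `sex` 0 for female and 1 for male, and `pclass` 0 for first class.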


In [3]:
# count missing values

df.isnull().sum()

pclass        0
survived      0
sex           0
age         263
dtype: int64

In [4]:
# fill missing ages with the column mean
age_mean = df['age'].mean()
df['age'] = df['age'].fillna(age_mean)
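
Filling with the mean of the full column lets rows that later land in the test set influence the value used for the training rows. One possible alternative, sketched here on hypothetical toy frames (`train` and `test` are assumptions, not the notebook's variables), computes the mean from the training rows only and applies it to both splits.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages with gaps; not the Titanic data
train = pd.DataFrame({'age': [20.0, np.nan, 40.0]})
test = pd.DataFrame({'age': [np.nan, 60.0]})

# Mean of the observed training ages only (NaN is skipped by default)
train_age_mean = train['age'].mean()

# Assign the filled column back to each frame
train['age'] = train['age'].fillna(train_age_mean)
test['age'] = test['age'].fillna(train_age_mean)

print(train_age_mean)
```

Applying this to the notebook would mean splitting before imputing, which may shift the reported scores slightly.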

In [5]:
# train test split
from sklearn.model_selection import train_test_split

X = df.loc[:, ['pclass', 'age', 'sex']]
y = df.survived

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print('train size:', X_train.shape)
print('test size:', X_test.shape)

train size: (1047, 3)
test size: (262, 3)


In [6]:
# train a decision tree with default hyperparameters
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [7]:
# make predictions

pred = clf.predict(X_test)

In [8]:
# evaluate
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print('accuracy score: ', accuracy_score(y_test, pred))
print('precision score: ', precision_score(y_test, pred))
print('recall score: ', recall_score(y_test, pred))
print('f1 score: ', f1_score(y_test, pred))

accuracy score:  0.7786259541984732
precision score:  0.7692307692307693
recall score:  0.6
f1 score:  0.6741573033707865


In [9]:
# confusion matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, pred)

array([[144,  18],
       [ 40,  60]])
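
The four scores printed by the evaluation cell can be derived by hand from this matrix, which may help when reading it. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels) and treats class 1 (survived) as the positive class.

```python
# Counts copied from the confusion matrix above
tn, fp = 144, 18   # true class 0: predicted 0, predicted 1
fn, tp = 40, 60    # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted survivors who survived
recall = tp / (tp + fn)                     # share of actual survivors who were predicted
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)
```

These values agree with the scores reported by `accuracy_score`, `precision_score`, `recall_score`, and `f1_score`.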

In [10]:
from sklearn.metrics import classification_report
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

           0       0.78      0.89      0.83       162
           1       0.77      0.60      0.67       100

    accuracy                           0.78       262
   macro avg       0.78      0.74      0.75       262
weighted avg       0.78      0.78      0.77       262
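
The `macro avg` and `weighted avg` rows differ only in how the per-class scores are combined. As a rough sketch, using counts from the confusion matrix printed earlier and two hypothetical helpers (`macro_avg` and `weighted_avg` are not scikit-learn functions):

```python
# Per-class support and scores derived from the confusion matrix counts
support = {0: 162, 1: 100}
precision = {0: 144 / (144 + 40), 1: 60 / (60 + 18)}
recall = {0: 144 / 162, 1: 60 / 100}

def macro_avg(scores):
    """Unweighted mean over classes: every class counts equally."""
    return sum(scores.values()) / len(scores)

def weighted_avg(scores, support):
    """Mean over classes weighted by the number of true samples per class."""
    total = sum(support.values())
    return sum(scores[c] * support[c] for c in scores) / total

print(round(macro_avg(precision), 2), round(weighted_avg(precision, support), 2))
print(round(macro_avg(recall), 2), round(weighted_avg(recall, support), 2))
```

Because class 0 has more test samples, the weighted average sits closer to the class 0 score than the macro average does.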



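The decision tree reaches about 78% accuracy on the test set, with precision of 0.77 and recall of 0.60 for the survived class. These results are similar to those of the Logistic Regression algorithm.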

### Random Forest

In [11]:
from sklearn.ensemble import RandomForestClassifier

# train
clf2 = RandomForestClassifier(max_depth=4, random_state=1234)
clf2.fit(X_train, y_train)

# predict
pred2 = clf2.predict(X_test)

# evaluate
print(classification_report(y_test, pred2))

              precision    recall  f1-score   support

           0       0.75      0.97      0.85       162
           1       0.91      0.48      0.63       100

    accuracy                           0.78       262
   macro avg       0.83      0.72      0.74       262
weighted avg       0.81      0.78      0.76       262



The random forest matches the decision tree's accuracy (0.78) but trades recall for precision on the survived class: precision rises from 0.77 to 0.91, while recall falls from 0.60 to 0.48.