## The Titanic Case - Prediction on Survival

## Decision Tree

#### Questions to think about
- Is a decision tree a supervised or an unsupervised learning method?
- How do we construct the tree (model build phase)?
- How do we use the model to make predictions (model use phase)?
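
For the second question, it can help to see what a single split is scored on. The sketch below computes Shannon entropy and the information gain of one candidate split, which is the quantity an entropy-based tree tries to maximize at each node. The helper names `entropy` and `information_gain` and the toy arrays are made up for illustration; they are not taken from the Titanic file or from scikit-learn.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, mask):
    """Entropy reduction from splitting `labels` into mask / ~mask."""
    labels = np.asarray(labels)
    mask = np.asarray(mask, dtype=bool)
    n = len(labels)
    left, right = labels[mask], labels[~mask]
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

# Toy data, invented for this example only
survived = np.array([1, 1, 1, 0, 0, 0, 0, 1])
is_male = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=bool)

parent_entropy = entropy(survived)
gain = information_gain(survived, is_male)
print(parent_entropy, gain)
```

A tree builder repeats this scoring for every candidate feature and threshold, keeps the split with the largest gain, and then recurses into the two child nodes.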

In [None]:
# import libraries used for data handling
import numpy as np 
import pandas as pd 

In [None]:
# load the cleaned training data set
train = pd.read_csv('train_clean.csv', index_col='PassengerId')

In [None]:
# view the first 5 rows of the train data frame
train.head()

In [None]:
# show column types and non-null counts
train.info()

## Build decision tree model

### Get the data ready for the modeling stage

In [None]:
# define the features
features = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex_male', 'Embarked_Q', 'Embarked_S']

# define the target variable
target = ['Survived']

In [None]:
# select the feature matrix X and the target y
X = train[features]
y = train[target]

### Split Data

In [None]:
# split the data into training (70%) and test (30%) sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=23)

### Fit the model

In [None]:
# import DecisionTreeClassifier
# https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
from sklearn.tree import DecisionTreeClassifier

# use entropy (information gain) as the split criterion
model = DecisionTreeClassifier(criterion='entropy')

# fit the model on the training data
model.fit(X_train, y_train)

In [None]:
# access the underlying tree structure of the fitted model
treeObj = model.tree_

In [None]:
# number of nodes in the tree
treeObj.node_count

In [None]:
# maximum depth of the tree
treeObj.max_depth

In [None]:
# feature importances, in the same order as `features`
model.feature_importances_
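
The raw importance array is easier to read when each value is paired with its feature name. A minimal sketch, assuming a small synthetic data set in place of `train_clean.csv`; the names `demo_X`, `demo_y` and `demo_model` are hypothetical, and the target is generated from `Sex_male` alone so the outcome is easy to check.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features (invented for illustration)
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    'Pclass': rng.integers(1, 4, size=200),
    'Age': rng.uniform(1, 80, size=200),
    'Sex_male': rng.integers(0, 2, size=200),
})
# Synthetic rule: the label depends only on Sex_male
demo_y = (demo_X['Sex_male'] == 0).astype(int)

demo_model = DecisionTreeClassifier(criterion='entropy', random_state=0)
demo_model.fit(demo_X, demo_y)

# Pair each importance with its column name and sort, largest first
importances = pd.Series(demo_model.feature_importances_, index=demo_X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

In the notebook itself, the same pattern would be `pd.Series(model.feature_importances_, index=features)`.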

In [None]:
from sklearn import tree
from matplotlib import pyplot as plt

In [None]:
fig = plt.figure(figsize=(200,100))
_ = tree.plot_tree(model,
                   feature_names = features, 
                   class_names = ['Not Survived','Survived'],
                   filled=True)
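# How to read each node:
# - If the condition at the top of a node is true, move left. Otherwise, move right.
# - 'samples': number of training passengers that reach the node.
# - 'value': number of those passengers who did not survive (first) and who survived (second).
# - 'class': majority class at the node, i.e. the predicted outcome for passengers that end there.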
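
A fully grown tree can be hard to read as a figure. One alternative is a plain-text dump of the rules with `sklearn.tree.export_text`. This is a minimal sketch on synthetic data; `demo_X`, `demo_y` and `demo_model` are hypothetical stand-ins for the notebook's `X_train`, `y_train` and `model`.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (invented for illustration)
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    'Age': rng.uniform(1, 80, size=200),
    'Sex_male': rng.integers(0, 2, size=200),
})
demo_y = (demo_X['Sex_male'] == 0).astype(int)

# max_depth keeps the tree small enough to read
demo_model = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)
demo_model.fit(demo_X, demo_y)

# One line per split or leaf, indented by depth
rules = export_text(demo_model, feature_names=list(demo_X.columns))
print(rules)
```

`plot_tree` also accepts a `max_depth` argument, which limits how many levels are drawn without changing the fitted model.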

### Test the model on hold out test data

In [None]:
# predict class labels for the test set
y_pre = model.predict(X_test)
y_pre

In [None]:
# predicted class probabilities; columns follow the order of model.classes_
y_pre_proba = model.predict_proba(X_test)
y_pre_proba
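
The two outputs above are linked: for a single-target classifier, `predict` returns the class whose `predict_proba` column is largest. The sketch below checks this on synthetic, deliberately noisy data with a shallow tree; all `demo_*` names and the 20% label noise are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 20% label noise (invented for illustration)
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    'Age': rng.uniform(1, 80, size=300),
    'Sex_male': rng.integers(0, 2, size=300),
})
noise = rng.random(300) < 0.2
demo_y = ((demo_X['Sex_male'] == 0) ^ noise).astype(int)

# A shallow tree has impure leaves, so probabilities are not just 0 or 1
demo_model = DecisionTreeClassifier(criterion='entropy', max_depth=2, random_state=0)
demo_model.fit(demo_X, demo_y)

proba = demo_model.predict_proba(demo_X)

# Map the most probable column back to its class label
labels_from_proba = demo_model.classes_[np.argmax(proba, axis=1)]
same = np.array_equal(labels_from_proba, demo_model.predict(demo_X))

print(demo_model.classes_)
print(proba[:5])
print(same)
```

Each probability is the class fraction among the training samples in the leaf a row falls into, so a tree grown without a depth limit tends to output only 0s and 1s.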

In [None]:
# import evaluation tools
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [None]:
# print the evaluation results on the test set
print(accuracy_score(y_test, y_pre))
print(confusion_matrix(y_test, y_pre))
print(classification_report(y_test, y_pre))
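
To connect the three printed results, the sketch below unpacks a binary confusion matrix and recomputes accuracy, precision and recall by hand. In scikit-learn, rows of the matrix are true classes and columns are predicted classes. The label arrays `y_true_demo` and `y_pred_demo` are made up for illustration.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration: 1 = survived, 0 = did not survive
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

cm = confusion_matrix(y_true_demo, y_pred_demo)

# For binary labels, ravel() yields the cells in this order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of those predicted to survive, how many did
recall = tp / (tp + fn)     # of those who survived, how many were found

print(cm)
print(accuracy, precision, recall)
```

The precision and recall values here correspond to the row for class `1` in `classification_report`.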

### Evaluate the cross-validation result

In [None]:
# import the cross-validation helper
from sklearn.model_selection import cross_val_score

In [None]:
# 10-fold cross-validation accuracy on the training set
score_cv = cross_val_score(model, X_train, y_train, cv=10)
score_cv.mean()
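
The mean alone hides how much the score varies from fold to fold. A minimal sketch that reports the mean together with the standard deviation, again on synthetic data; the `demo_*` names and the noise level are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 20% label noise (invented for illustration)
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    'Age': rng.uniform(1, 80, size=300),
    'Sex_male': rng.integers(0, 2, size=300),
})
noise = rng.random(300) < 0.2
demo_y = ((demo_X['Sex_male'] == 0) ^ noise).astype(int)

demo_model = DecisionTreeClassifier(criterion='entropy', random_state=0)

# One accuracy value per fold; the estimator is cloned and refit for each fold
scores = cross_val_score(demo_model, demo_X, demo_y, cv=10)

print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

For a classifier with an integer `cv`, scikit-learn uses stratified folds, so each fold keeps roughly the same class balance as the full training set.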

## Make predictions on unseen data

In [None]:
# load the test set
test = pd.read_csv('test.csv', index_col='PassengerId')