# DECISION TREES
Using the Titanic data, create a notebook named model.ipynb in your classification-exercises repository and complete the following:

In [80]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
import sklearn.metrics

from acquire import get_titanic_data
from prepare import prep_titanic
import env

from FUNctions import describe_data

1.) What is your baseline prediction? What is your baseline accuracy? 
- Remember: your baseline prediction for a classification problem is the most prevalent class in the training dataset (the mode). When you make that prediction for every row, what is your accuracy? This is your baseline accuracy.
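
As a minimal sketch of the baseline idea, using a small made-up list of labels (not the Titanic data) and only the standard library: predict the most common class for every row, then measure how often that prediction is right. The name `y_train_example` is hypothetical.

```python
from collections import Counter

# Hypothetical training labels (illustrative only, not the Titanic data)
y_train_example = [0, 0, 0, 1, 1, 0, 1, 0]

# The baseline prediction is the most frequent class (the mode)
baseline_class, baseline_count = Counter(y_train_example).most_common(1)[0]

# Predicting the mode for every row is correct exactly when the true label is the mode
baseline_accuracy = baseline_count / len(y_train_example)

print(f'baseline prediction: {baseline_class}')
print(f'baseline accuracy: {baseline_accuracy:.2%}')
```

With a pandas Series, the same figure can be obtained as `(y_train == y_train.mode()[0]).mean()`.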

In [81]:
titanic = get_titanic_data() 
#describe_data(titanic)
titanic = prep_titanic(titanic)
describe_data(titanic)

Using cached csv
The first three rows are: 
----------------------------------------------------------
   survived  pclass  sibsp  parch     fare  alone  Queenstown  Southampton  \
0         0       3      1      0   7.2500      0           0            1   
1         1       1      1      0  71.2833      0           0            0   
2         1       3      0      0   7.9250      1           0            1   

   male  
0     1  
1     0  
2     0  
----------------------------------------------------------
The data frame's shape is: 
-------------------------
(891, 9)
-------------------------
The data types and column names are: 
['Queenstown', 'Southampton', 'alone', 'fare', 'male', 'parch', 'pclass', 'sibsp', 'survived']
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   survived     891 non-null    int64  
 1   pclass       891 non-null   

In [82]:
# Since my data is already prepared due to a previous exercise, I can immediately split. 
train, test = train_test_split(titanic, test_size=.2, random_state=123, stratify=titanic.survived)
train, validate = train_test_split(train, test_size=.3, random_state=123, stratify=train.survived)

train.shape, validate.shape, test.shape
# columns line up, good to go. 

((498, 9), (214, 9), (179, 9))
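
The shapes above follow from the two `test_size` values. A rough sketch of the arithmetic, assuming (as I understand scikit-learn's behaviour) that the held-out count is rounded up and the remainder goes to the other split:

```python
import math

n_rows = 891  # rows in the prepared Titanic data frame

# First split: 20% held out as test (count rounded up, assumed behaviour)
n_test = math.ceil(n_rows * 0.2)
n_train_validate = n_rows - n_test

# Second split: 30% of the remainder held out as validate
n_validate = math.ceil(n_train_validate * 0.3)
n_train = n_train_validate - n_validate

print(n_train, n_validate, n_test)
print(f'{n_train / n_rows:.0%} / {n_validate / n_rows:.0%} / {n_test / n_rows:.0%}')
```

So the two calls together give roughly a 56% / 24% / 20% train / validate / test split.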

In [83]:
X_train, y_train = train.drop(columns='survived'), train.survived
X_validate, y_validate = validate.drop(columns='survived'), validate.survived
X_test, y_test = test.drop(columns='survived'), test.survived

In [84]:
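# Note: these counts come from the full dataset. Strictly, the baseline should be
# the mode of the training target (y_train); it is the same class here (0).
print(f'{titanic.survived.value_counts()}')
print('Based on this, the baseline is fatalities. Non-survivors.')
train['baseline'] = 0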

0    549
1    342
Name: survived, dtype: int64
Based on this, the baseline is fatalities. Non-survivors.


In [85]:
accuracy_score(train.survived, train.baseline)

0.6164658634538153

2.) Fit the decision tree classifier to your training sample and transform (i.e. make predictions on the training sample)

In [86]:
model1 = DecisionTreeClassifier()
model1.fit(X_train, y_train)

print(f'training score: {model1.score(X_train, y_train):.2%}')
print(f'validate score: {model1.score(X_validate, y_validate):.2%}')

training score: 94.58%
validate score: 75.70%


In [87]:
train['model1'] = model1.predict(X_train)
train.head()

     survived  pclass  sibsp  parch      fare  alone  Queenstown  Southampton  male  baseline  model1
583         0       1      0      0    40.125      1           0            0     1         0       0
165         1       3      0      2    20.525      0           0            1     1         0       1
50          0       3      4      1   39.6875      0           0            1     1         0       0
259         1       2      0      1      26.0      0           0            1     0         0       1
306         1       1      0      0  110.8833      1           0            0     0         0       1


In [88]:
y_pred = model1.predict(X_train)

3.) Evaluate your in-sample results using the model score, confusion matrix, and classification report.

In [89]:
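# using model score
# Note: this call scores the validate set (out-of-sample); the in-sample score is
# model1.score(X_train, y_train), reported as the training score above (94.58%).
model1.score(X_validate,y_validate)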

0.7570093457943925

In [90]:
# confusion matrix as proportions of all training rows (rows: actual, columns: predicted)
pd.crosstab(y_train, y_pred, normalize = True)

col_0            0         1
survived
0          0.61245  0.004016
1         0.050201  0.333333
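
With `normalize=True`, each cell of the crosstab is a count divided by the total number of rows. A quick check, using the raw counts that `confusion_matrix` reports for the same predictions further down in this notebook:

```python
# Raw counts for the training set: rows are actual, columns are predicted
counts = [[305, 2],
          [25, 166]]

total = sum(sum(row) for row in counts)  # number of training rows

# Dividing every cell by the total gives the normalized table
proportions = [[cell / total for cell in row] for row in counts]

for row in proportions:
    print([round(p, 6) for p in row])
```

The two diagonal cells (correct predictions) add up to the training accuracy.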


In [91]:
#classification report
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.92      0.99      0.96       307
           1       0.99      0.87      0.92       191

    accuracy                           0.95       498
   macro avg       0.96      0.93      0.94       498
weighted avg       0.95      0.95      0.95       498



4.) Compute: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [None]:
# The confusion matrix gives the counts needed for the TP, FP, TN, and FN rates

In [110]:
#accuracy
acc = sklearn.metrics.accuracy_score(y_train, y_pred)
print(f' The accuracy is: {acc:.2%}')

# Confusion matrix for TP, FP, TN, FN (rows: actual, columns: predicted)
cm = confusion_matrix(y_train, y_pred)
print(f' Confusion Matrix: \n {cm} \n ')

# Reuse the matrix computed above rather than computing it a second time
titanic_cm = cm

tn, fp, fn, tp = titanic_cm.ravel()
print("Number of true negatives  (tn) = ",tn)
print("Number of true positives  (tp) = ",tp)
print("Number of false negatives (fn) = ",fn)
print("Number of false positives (fp) = ",fp)

print("""
What the confusion matrix tells us now, in more detail, is that we got 305
of the people who did not survive right, and 166 of the people who did survive right.
But also that we killed 25 of the passengers in our model,
and brought 2 of them back to life through Necromancy!""")
#precision
pre = sklearn.metrics.precision_score(y_train,y_pred)
print(f' The precision is: {pre:.2%}')

#recall
rec = sklearn.metrics.recall_score(y_train,y_pred)
print(f' The recall rate is: {rec:.2%}')

#f1-score
f1 = sklearn.metrics.f1_score(y_train,y_pred)
print(f' The F1 score is: {f1:.2%}')

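#support
# Not computed in this cell; the classification report above lists it as 307 (class 0) and 191 (class 1).

# Per-class counts from the matrix (left commented out; uses cm, the array, not the confusion_matrix function)
#FP = cm.sum(axis=0) - np.diag(cm)
#FN = cm.sum(axis=1) - np.diag(cm)
#TP = np.diag(cm)
#TN = cm.sum() - (FP + FN + TP)
#
#print(FP, FN, TP, TN)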

 The accuracy is: 94.58%
 Confusion Matrix: 
 [[305   2]
 [ 25 166]] 
 
Number of true negatives  (tn) =  305
Number of true positives  (tp) =  166
Number of false negatives (fn) =  25
Number of false positives (fp) =  2

What the confusion matrix tells us now, in more detail, is that we got 305
of the people who did not survive right, and 166 of the people who did survive right.
But also that we killed 25 of the passengers in our model,
and brought 2 of them back to life through Necromancy!
 The precision is: 98.81%
 The recall rate is: 86.91%
 The F1 score is: 92.48%
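
The four rates and the support asked for in step 4 are not printed above. As a sketch, they can be derived directly from the four counts in the output. The variable names mirror the cell above; treating class 1 (survived) as the positive class is my assumption, in line with how the counts were unpacked.

```python
# Counts copied from the confusion matrix output above
tn, fp, fn, tp = 305, 2, 25, 166

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
true_positive_rate = tp / (tp + fn)    # also called recall or sensitivity
false_positive_rate = fp / (fp + tn)
true_negative_rate = tn / (tn + fp)    # also called specificity
false_negative_rate = fn / (fn + tp)
precision = tp / (tp + fp)
f1 = 2 * precision * true_positive_rate / (precision + true_positive_rate)

# Support is the number of actual observations in each class
support_negative = tn + fp
support_positive = tp + fn

metrics = {
    'accuracy': accuracy,
    'true positive rate (recall)': true_positive_rate,
    'false positive rate': false_positive_rate,
    'true negative rate': true_negative_rate,
    'false negative rate': false_negative_rate,
    'precision': precision,
    'f1-score': f1,
}
for name, value in metrics.items():
    print(f'{name}: {value:.2%}')
print(f'support: {support_negative} (class 0), {support_positive} (class 1)')
```

Each pair of rates that shares a denominator sums to 1: the true positive and false negative rates, and the true negative and false positive rates.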


5.) Run through steps 2-4 using a different max_depth value.

6.) Which model performs better on your in-sample data?

7.) Which model performs best on your out-of-sample data, the validate set?
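
One possible way to approach steps 5 to 7 is to fit one tree per `max_depth` value and compare train and validate accuracy side by side. The sketch below uses synthetic stand-in data from `make_classification` (not the Titanic data) so that it runs on its own; the depth range of 1 to 10 and the `results` structure are my own choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Titanic data), so the block is self-contained
X, y = make_classification(n_samples=600, n_features=8, random_state=123)
X_train, X_validate, y_train, y_validate = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y
)

results = []
for depth in range(1, 11):
    # random_state fixes tie-breaking so repeated runs give the same tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=123)
    tree.fit(X_train, y_train)
    results.append({
        'max_depth': depth,
        'train': tree.score(X_train, y_train),
        'validate': tree.score(X_validate, y_validate),
    })

for row in results:
    gap = row['train'] - row['validate']
    print(f"max_depth={row['max_depth']:2d}  train={row['train']:.2%}  "
          f"validate={row['validate']:.2%}  gap={gap:.2%}")

# Choose the depth with the best out-of-sample (validate) accuracy
best = max(results, key=lambda row: row['validate'])
print(f"best max_depth on validate: {best['max_depth']}")
```

A large gap between the train and validate scores, like the 94.58% versus 75.70% seen for `model1`, suggests the tree is overfitting; limiting `max_depth` usually narrows that gap.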