## Exercises

Using the Titanic data, create a notebook named model.ipynb in your classification-exercises repository and complete the following:

In [2]:
import numpy as np
import pandas as pd
import acquire
##############
# Note that this prepare is modified to reflect the specific work done in the Prepare Lesson
#############
import prepare
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.metrics import confusion_matrix, classification_report
import graphviz
from graphviz import Graph

In [3]:
# acquire the df
df = acquire.get_titanic_data()

In [4]:
# peek at the df
df.head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1
3,3,1,1,female,35.0,1,0,53.1,S,First,C,Southampton,0
4,4,0,3,male,35.0,0,0,8.05,S,Third,,Southampton,1


In [5]:
# clean the data
df = prepare.prep_titanic_data(df)

In [6]:
df.head()

Unnamed: 0,survived,pclass,sibsp,parch,fare,alone,sex_male,embark_town_Queenstown,embark_town_Southampton
0,0,3,1,0,7.25,0,1,0,1
1,1,1,1,0,71.2833,0,0,0,0
2,1,3,0,0,7.925,1,0,0,1
3,1,1,1,0,53.1,0,0,0,1
4,0,3,0,0,8.05,1,1,0,1


In [7]:
# split the data
train, validate, test = prepare.train_validate_test_split(df)

In [8]:
train.head()

Unnamed: 0,survived,pclass,sibsp,parch,fare,alone,sex_male,embark_town_Queenstown,embark_town_Southampton
583,0,1,0,0,40.125,1,1,0,0
165,1,3,0,2,20.525,0,1,0,1
50,0,3,4,1,39.6875,0,1,0,1
259,1,2,0,1,26.0,0,0,0,1
306,1,1,0,0,110.8833,1,0,0,0
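
The body of `prepare.train_validate_test_split` is not shown in this notebook. The sketch below is one plausible version, assuming a stratified two-stage split (20% held out for test, then 30% of the remainder for validate). The function name `split_train_validate_test`, the `seed` value, and the toy frame are all hypothetical. The toy frame's 891 rows and 549/342 class counts are assumptions chosen so that the train split comes out at 498 rows, the same total as the `value_counts()` output further down.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_train_validate_test(df, target, seed=123):
    """Hypothetical stand-in for prepare.train_validate_test_split."""
    # Stage 1: hold out 20% of the rows as the final test set.
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target]
    )
    # Stage 2: split the remaining rows 70/30 into train and validate.
    train, validate = train_test_split(
        train_validate,
        test_size=0.3,
        random_state=seed,
        stratify=train_validate[target],
    )
    return train, validate, test


# Toy data only: the row count and class balance are assumptions, not the real dataset.
demo = pd.DataFrame({"survived": [0] * 549 + [1] * 342, "fare": range(891)})
train_demo, validate_demo, test_demo = split_train_validate_test(demo, "survived")
print(len(train_demo), len(validate_demo), len(test_demo))  # 498 214 179
```

Stratifying on the target keeps the survived/deceased ratio roughly equal across the three splits, so the baseline accuracy is comparable between train and validate.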


In [9]:
# Goal: build a Decision Tree Classifier that predicts survival on the Titanic
# and performs better than the baseline

1. What is your baseline prediction? What is your baseline accuracy? Remember: your baseline prediction for a classification problem is the most prevalent class in the training dataset (the mode). When you predict that class for every observation, what is your accuracy? This is your baseline accuracy.

In [10]:
# obtain our mode
train.survived.value_counts()

0    307
1    191
Name: survived, dtype: int64

In [11]:
train['baseline_assumption_death'] = 0

In [18]:
print(f'Our baseline accuracy for nonsurvival in all cases on the Titanic Dataset is {(train.baseline_assumption_death == train.survived).mean():.3}')

Our baseline accuracy for nonsurvival in all cases on the Titanic Dataset is 0.616
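
As a cross-check, the same baseline can be read straight off the class counts without adding a column to `train`. This is a minimal sketch that hard-codes the counts from the `value_counts()` output above. The names `counts`, `baseline_class`, and `baseline_accuracy` are illustrative.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 307, 1: 191}, name="survived")

baseline_class = counts.idxmax()                 # most prevalent class (the mode)
baseline_accuracy = counts.max() / counts.sum()  # share of rows in that class

print(f"baseline class: {baseline_class}, baseline accuracy: {baseline_accuracy:.3f}")
# baseline class: 0, baseline accuracy: 0.616
```

On the real training frame, `train.survived.value_counts(normalize=True)` gives the same proportions directly.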


2. Fit the decision tree classifier to your training sample and transform (i.e. make predictions on the training sample)

In [19]:
# create the model
clf1 = DecisionTreeClassifier()

In [20]:
# remove baseline assumption from the train
train.drop(columns='baseline_assumption_death', inplace=True)

In [21]:
# split our X and y
X_train = train.drop(columns='survived')
y_train = train.survived

In [22]:
# fit the model
clf1.fit(X_train, y_train)

DecisionTreeClassifier()

3. Evaluate your in-sample results using the model score, confusion matrix, and classification report.

In [23]:
# use the model to predict
y_pred = clf1.predict(X_train)

In [25]:
# check out the values in the predictions
pd.Series(y_pred).value_counts()

0    330
1    168
dtype: int64

In [None]:
# model score: accuracy

In [29]:
accuracy = clf1.score(X_train, y_train)

In [30]:
accuracy

0.9457831325301205

In [31]:
# confusion matrix
conf = confusion_matrix(y_train, y_pred)

In [32]:
conf

array([[305,   2],
       [ 25, 166]])

In [39]:
# get the classification report
class_report = classification_report(y_train, y_pred, output_dict=True)

In [40]:
class_report

{'0': {'precision': 0.9242424242424242,
  'recall': 0.993485342019544,
  'f1-score': 0.9576138147566718,
  'support': 307},
 '1': {'precision': 0.9880952380952381,
  'recall': 0.8691099476439791,
  'f1-score': 0.9247910863509748,
  'support': 191},
 'accuracy': 0.9457831325301205,
 'macro avg': {'precision': 0.9561688311688312,
  'recall': 0.9312976448317616,
  'f1-score': 0.9412024505538233,
  'support': 498},
 'weighted avg': {'precision': 0.9487321580695075,
  'recall': 0.9457831325301205,
  'f1-score': 0.9450251779585028,
  'support': 498}}

In [81]:
pd.DataFrame(class_report).rename(columns={'0': 'deceased', '1': 'survived'}).T

Unnamed: 0,precision,recall,f1-score,support
deceased,0.924242,0.993485,0.957614,307.0
survived,0.988095,0.86911,0.924791,191.0
accuracy,0.945783,0.945783,0.945783,0.945783
macro avg,0.956169,0.931298,0.941202,498.0
weighted avg,0.948732,0.945783,0.945025,498.0


4. Compute: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [44]:
conf

array([[305,   2],
       [ 25, 166]])

In [45]:
# turn our confusion matrix into a dataframe for human legibility
conf_df = pd.DataFrame(conf, columns=['predict_death', 'predict_survive'], index=['actual_death', 'actual_survive'])

In [46]:
conf_df

Unnamed: 0,predict_death,predict_survive
actual_death,305,2
actual_survive,25,166


In [47]:
# make a key for reference
rubric_df = pd.DataFrame([['true negative', 'false positive'],['false negative', 'true positive']], columns=['predict_death', 'predict_survive'], index=['actual_death', 'actual_survive'])

In [48]:
rubric_df

Unnamed: 0,predict_death,predict_survive
actual_death,true negative,false positive
actual_survive,false negative,true positive


In [49]:
joined = pd.concat([conf_df, rubric_df], axis=1)

In [50]:
joined

Unnamed: 0,predict_death,predict_survive,predict_death.1,predict_survive.1
actual_death,305,2,true negative,false positive
actual_survive,25,166,false negative,true positive


In [51]:
# create a function to calculate these metrics
def get_metrics_binary(clf):
    '''
    get_metrics_binary takes in a fitted binary classifier (clf) and prints out metrics
    based on the global variables X_train and y_train.
    
    return: a classification report as a transposed DataFrame
    '''
    # predict inside the function so the metrics always belong to the classifier passed in
    y_pred = clf.predict(X_train)
    accuracy = clf.score(X_train, y_train)
    class_report = pd.DataFrame(classification_report(y_train, y_pred, output_dict=True)).T
    conf = confusion_matrix(y_train, y_pred)
    # rows of conf are actual classes, columns are predicted classes
    tpr = conf[1][1] / conf[1].sum()
    fpr = conf[0][1] / conf[0].sum()
    tnr = conf[0][0] / conf[0].sum()
    fnr = conf[1][0] / conf[1].sum()
    print(f'''
    The accuracy for our model is {accuracy:.4}
    The True Positive Rate is {tpr:.3}, The False Positive Rate is {fpr:.3},
    The True Negative Rate is {tnr:.3}, and the False Negative Rate is {fnr:.3}
    ''')
    return class_report
    

In [52]:
# call our function
report_df = get_metrics_binary(clf1)


    The accuracy for our model is 0.9458
    The True Positive Rate is 0.869, The False Positive Rate is 0.00651,
    The True Negative Rate is 0.993, and the False Negative Rate is 0.131
    


In [53]:
report_df

Unnamed: 0,precision,recall,f1-score,support
0,0.924242,0.993485,0.957614,307.0
1,0.988095,0.86911,0.924791,191.0
accuracy,0.945783,0.945783,0.945783,0.945783
macro avg,0.956169,0.931298,0.941202,498.0
weighted avg,0.948732,0.945783,0.945025,498.0
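
To make the link between the confusion matrix and the report explicit, the sketch below recomputes each requested metric by hand from the four cell counts shown above, treating survival (class 1) as the positive class. The variable names are illustrative and the matrix is hard-coded from the earlier output.

```python
import numpy as np

# Confusion matrix from the output above: rows are actual, columns are predicted.
conf_demo = np.array([[305, 2],
                      [25, 166]])

# ravel() flattens row by row, giving tn, fp, fn, tp for a binary problem.
tn, fp, fn, tp = conf_demo.ravel()

metrics = {
    "accuracy": (tp + tn) / conf_demo.sum(),
    "true_positive_rate": tp / (tp + fn),   # also called recall
    "false_positive_rate": fp / (fp + tn),
    "true_negative_rate": tn / (tn + fp),
    "false_negative_rate": fn / (fn + tp),
    "precision": tp / (tp + fp),
}
# f1 is the harmonic mean of precision and recall.
metrics["f1_score"] = (
    2 * metrics["precision"] * metrics["true_positive_rate"]
    / (metrics["precision"] + metrics["true_positive_rate"])
)
# Support is the number of actual observations in each class.
metrics["support_survived"] = tp + fn
metrics["support_deceased"] = tn + fp

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The precision, recall, and f1 values computed this way agree with the `1` row of the classification report (0.988095, 0.86911, 0.924791).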


5. Run through steps 2-4 using a different max_depth value.

In [55]:
# clf2
clf2 = DecisionTreeClassifier(max_depth=3)

In [56]:
# fit the model

In [57]:
clf2.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=3)

In [59]:
y_pred = clf2.predict(X_train)

6. Which model performs better on your in-sample data?

In [60]:
report_df = get_metrics_binary(clf2)


    The accuracy for our model is 0.8233
    The True Positive Rate is 0.702, The False Positive Rate is 0.101,
    The True Negative Rate is 0.899, and the False Negative Rate is 0.298
    


In [61]:
report_df

Unnamed: 0,precision,recall,f1-score,support
0,0.828829,0.899023,0.8625,307.0
1,0.812121,0.701571,0.752809,191.0
accuracy,0.823293,0.823293,0.823293,0.823293
macro avg,0.820475,0.800297,0.807654,498.0
weighted avg,0.822421,0.823293,0.82043,498.0


7. Which model performs best on your out-of-sample data, the validate set?

In [65]:
# get predictions for our validation sets
y_val_pred_1 = clf1.predict(validate.drop(columns='survived'))
y_val_pred_2 = clf2.predict(validate.drop(columns='survived'))

In [66]:
# get validation accuracy
accuracy_v_1 = clf1.score(validate.drop(columns='survived'), validate.survived)
accuracy_v_2 = clf2.score(validate.drop(columns='survived'), validate.survived)

In [67]:
accuracy_v_1

0.7523364485981309

In [68]:
accuracy_v_2

0.7850467289719626
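
The unrestricted tree (clf1) leads in-sample (0.946 vs 0.823), but the depth-3 tree (clf2) scores higher on validate (0.785 vs 0.752). The gap between clf1's train and validate accuracy suggests overfitting. A loop over several `max_depth` values makes this trade-off easier to see. The sketch below uses synthetic data from `make_classification` rather than the Titanic frame, so its numbers are illustrative only. All names in it are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; swap in X_train / validate for real use).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

results = []
for depth in (1, 2, 3, 5, 10, None):  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    results.append({
        "max_depth": depth,
        "train_accuracy": tree.score(X_tr, y_tr),
        "validate_accuracy": tree.score(X_val, y_val),
    })

for row in results:
    gap = row["train_accuracy"] - row["validate_accuracy"]
    print(f"max_depth={row['max_depth']}: train={row['train_accuracy']:.3f} "
          f"validate={row['validate_accuracy']:.3f} gap={gap:.3f}")
```

A reasonable rule of thumb is to prefer the depth with the highest validate accuracy among those whose train/validate gap stays small.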

In [69]:
dot_data = export_graphviz(clf2, feature_names=X_train.columns, rounded=True, filled=True, out_file=None)
graph = graphviz.Source(dot_data)

In [70]:
graph.render('titanic_model_2_tree', view=True)

'titanic_model_2_tree.pdf'

In [72]:
dot_data = export_graphviz(clf1, feature_names=X_train.columns, rounded=True, filled=True, out_file=None)
graph = graphviz.Source(dot_data)

In [73]:
graph.render('titanic_model_1_tree', view=True)

'titanic_model_1_tree.pdf'
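
Rendering with `graphviz` requires the Graphviz executables to be installed on the system. Where they are unavailable, scikit-learn's `export_text` prints the same split rules as plain text. The sketch below fits a small tree on a made-up eight-row frame whose columns mimic the prepared Titanic data. The values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented toy rows with Titanic-like column names (not real passenger data).
toy = pd.DataFrame({
    "pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "sex_male": [0, 1, 0, 1, 0, 1, 1, 0],
    "fare":     [80.0, 60.0, 26.0, 13.0, 7.9, 7.25, 8.05, 15.5],
    "survived": [1, 0, 1, 0, 1, 0, 0, 1],
})
X_toy = toy.drop(columns="survived")
y_toy = toy.survived

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# export_text returns the fitted tree's decision rules as an indented string.
rules = export_text(toy_tree, feature_names=list(X_toy.columns))
print(rules)
```

For the models in this notebook, the equivalent call would be `export_text(clf2, feature_names=list(X_train.columns))`.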