Model exercises

In [80]:
import pandas as pd
from acquire import get_telco_data, get_titantic_data
from prepare import train_validate_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt

## Using the Titanic dataset

In [47]:
df = get_titantic_data()

Using cached data


In [48]:
df.shape

(891, 13)

In [49]:
df = df.drop(columns = ['passenger_id','deck','embarked','class'])

In [50]:
df.shape

(891, 9)

In [51]:
df = df.dropna()
df.shape

(712, 9)

In [52]:
df = pd.get_dummies(df,columns = ['embark_town','sex'], drop_first=True)

In [53]:
train, validate, test = train_validate_test_split(df, target = 'survived')

In [54]:
train.shape, validate.shape, test.shape

((398, 10), (171, 10), (143, 10))
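The `train_validate_test_split` helper comes from a local `prepare` module that is not shown here. Below is a minimal sketch of what such a helper might look like. It assumes two stratified calls to scikit-learn's `train_test_split`, a 20% test split followed by a 30% validate split. Those proportions are consistent with the shapes above but are an assumption, as are the name `split_three_ways` and its parameters.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_three_ways(df, target, seed=123):
    """Hypothetical stand-in for prepare.train_validate_test_split."""
    # First carve off the test set, keeping class proportions (stratify).
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target]
    )
    # Then split the remainder into train and validate.
    train, validate = train_test_split(
        train_validate,
        test_size=0.3,
        random_state=seed,
        stratify=train_validate[target],
    )
    return train, validate, test


# Toy frame with the same row count as the cleaned data (values are made up).
demo = pd.DataFrame({'x': range(712), 'survived': [0, 1] * 356})
train, validate, test = split_three_ways(demo, target='survived')
print(train.shape, validate.shape, test.shape)
```

Stratifying on the target keeps the survived/not-survived ratio roughly equal across the three sets, which makes the baseline and model scores comparable between them.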

1. What is your baseline prediction? What is your baseline accuracy? Remember: your baseline prediction for a classification problem is the most prevalent class in the training dataset (the mode). When you make those predictions, what is your accuracy? This is your baseline accuracy.

In [56]:
train.survived.value_counts()

0    237
1    161
Name: survived, dtype: int64

In [57]:
# baseline is most common value - so 0, didn't survive
train['baseline'] = 0

In [58]:
print(f"Baseline accuracy is: {(train.survived==train.baseline).mean():.2%}")

Baseline accuracy is: 59.55%
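The cells above hard-code the baseline as 0 after reading the value counts. As a sketch of an alternative, the mode can be computed directly so the same code works if the majority class changes. The series below is rebuilt from the counts shown above (237 and 161) and is not the actual training frame.

```python
import pandas as pd

# Rebuild the class counts shown above (237 non-survivors, 161 survivors).
survived = pd.Series([0] * 237 + [1] * 161, name='survived')

# The baseline prediction is the most common class (the mode).
baseline_class = survived.mode()[0]

# Baseline accuracy is the share of rows that match that class.
baseline_accuracy = (survived == baseline_class).mean()
print(f"Baseline class: {baseline_class}, accuracy: {baseline_accuracy:.2%}")
```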


2. Fit the decision tree classifier to your training sample and transform (i.e. make predictions on the training sample)

In [60]:
# Based on exploration, pclass, sex, and embark_town look like the features of interest
selected_features = ['pclass','embark_town_Queenstown','embark_town_Southampton','sex_male']

In [61]:
X_train = train[selected_features]
y_train = train[['survived']]

In [63]:
clf = DecisionTreeClassifier(max_depth=3, random_state=123)

In [64]:
clf = clf.fit(X_train, y_train)

In [70]:
y_pred = clf.predict(X_train)

In [75]:
y_pred[0:20]

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0])

3. Evaluate your in-sample results using the model score, confusion matrix, and classification report.

In [79]:
print(f"Accuracy score on training set is: {accuracy_score(y_train.survived, y_pred):.2%}")

Accuracy score on training set is: 80.15%


In [85]:
labels = sorted(y_train.survived.unique())
pd.DataFrame(confusion_matrix(y_train, y_pred), index = labels, columns = labels)

     0   1
0  223  14
1   65  96


In [88]:
# rows are truth, columns are pred
tn, fp, fn, tp = confusion_matrix(y_train, y_pred).ravel()
(tn, fp, fn, tp)

(223, 14, 65, 96)

In [86]:
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.77      0.94      0.85       237
           1       0.87      0.60      0.71       161

    accuracy                           0.80       398
   macro avg       0.82      0.77      0.78       398
weighted avg       0.81      0.80      0.79       398



4. Compute: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.
- False positive rate: How likely is it we get a false positive when the actual value is negative? FP/(FP+TN)

In [89]:
print(f"False positive rate: {fp/(fp+tn):.2%}")

False positive rate: 5.91%


- False negative rate: How likely is it we get a false negative when the actual value is positive? FN/(FN+TP)

In [91]:
print(f"False negative rate: {fn/(fn+tp):.2%}")

False negative rate: 40.37%


- True positive rate: How likely is it we get a true positive when the actual value is positive? This is sensitivity/recall. TP/(TP+FN)

In [92]:
print(f"True positive rate: {tp/(tp+fn):.2%}")

True positive rate: 59.63%


- True negative rate: How likely is it we get a true negative when the actual value is negative? This is specificity/selectivity. TN/(TN+FP)

In [93]:
print(f"True negative rate: {tn/(fp+tn):.2%}")

True negative rate: 94.09%
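The exercise also asks for accuracy, precision, recall, f1-score, and support. As a sketch, these can be derived from the same four confusion-matrix counts and checked against the classification report. The counts below are copied from the depth-3 output above, and the variable names are illustrative.

```python
# Confusion-matrix counts from the depth-3 tree above.
tn, fp, fn, tp = 223, 14, 65, 96

# Share of all predictions that were correct.
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Of the passengers predicted to survive, how many actually did?
precision = tp / (tp + fp)

# Of the passengers who actually survived, how many did we catch?
recall = tp / (tp + fn)

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Support is the number of actual observations in each class.
support_pos = tp + fn
support_neg = tn + fp

print(f"Accuracy: {accuracy:.2%}")
print(f"Precision: {precision:.2%}")
print(f"Recall: {recall:.2%}")
print(f"F1 score: {f1:.2%}")
print(f"Support (positive): {support_pos}")
print(f"Support (negative): {support_neg}")
```

These figures are for the positive class (survived = 1), so they line up with the `1` row of the classification report.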


5. Run through steps 2-4 using a different max_depth value.

In [94]:
def decision_tree(train, d = 5):
    """Fit a decision tree with max_depth d on train and print in-sample metrics."""
    selected_features = ['pclass','embark_town_Queenstown','embark_town_Southampton','sex_male']
    X_train = train[selected_features]
    y_train = train[['survived']]
    clf = DecisionTreeClassifier(max_depth=d, random_state=123)
    clf = clf.fit(X_train, y_train)
    y_pred = clf.predict(X_train)
    print(f"Accuracy score on training set is: {accuracy_score(y_train.survived, y_pred):.2%}")
    print(classification_report(y_train, y_pred))

    # rows are truth, columns are pred
    tn, fp, fn, tp = confusion_matrix(y_train, y_pred).ravel()

    print(f"False positive rate: {fp/(fp+tn):.2%}")
    print(f"False negative rate: {fn/(fn+tp):.2%}")
    print(f"True positive rate: {tp/(tp+fn):.2%}")
    print(f"True negative rate: {tn/(fp+tn):.2%}")

In [100]:
for i in range(3,6,2):
    print(f'For decision tree with depth {i}:')
    decision_tree(train, d = i)

For decision tree with depth 3:
Accuracy score on training set is: 80.15%
              precision    recall  f1-score   support

           0       0.77      0.94      0.85       237
           1       0.87      0.60      0.71       161

    accuracy                           0.80       398
   macro avg       0.82      0.77      0.78       398
weighted avg       0.81      0.80      0.79       398

False positive rate: 5.91%
False negative rate: 40.37%
True positive rate: 59.63%
True negative rate: 94.09%
For decision tree with depth 5:
Accuracy score on training set is: 80.40%
              precision    recall  f1-score   support

           0       0.77      0.96      0.85       237
           1       0.90      0.58      0.70       161

    accuracy                           0.80       398
   macro avg       0.84      0.77      0.78       398
weighted avg       0.82      0.80      0.79       398

False positive rate: 4.22%
False negative rate: 42.24%
True positive rate: 57.76%
True negative rate: 95.78%

6. Which model performs better on your in-sample data?

### Between decision trees with depth of 3 and 5:
- Depth 5 is barely more accurate on the training set: 80.40% versus 80.15%.
- Performance is otherwise fairly even: depth 5 has a higher true negative rate (its false positive rate is 4.22% versus 5.91%) but a lower true positive rate (57.76% versus 59.63%).


7. Which model performs best on your out-of-sample data, the validate set?
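One way to approach this question is to fit each tree on the training set and score it on both train and validate. The sketch below shows the pattern under loud assumptions: the data is synthetic (a made-up survival rule plus noise, not the Titanic data), the split uses scikit-learn's `train_test_split` rather than the notebook's `prepare` helper, and the resulting numbers say nothing about the real models.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared frame (NOT the real Titanic data).
rng = np.random.default_rng(123)
n = 700
town = rng.integers(0, 3, n)  # 0, 1, 2 stand for three embarkation towns
demo = pd.DataFrame({
    'pclass': rng.integers(1, 4, n),
    'embark_town_Queenstown': (town == 1).astype(int),
    'embark_town_Southampton': (town == 2).astype(int),
    'sex_male': rng.integers(0, 2, n),
})
# Made-up survival probability, only so the tree has a signal to learn.
p = 0.75 - 0.5 * demo['sex_male'] - 0.1 * (demo['pclass'] - 1)
demo['survived'] = (rng.random(n) < p.to_numpy()).astype(int)

features = ['pclass', 'embark_town_Queenstown',
            'embark_town_Southampton', 'sex_male']
train, validate = train_test_split(
    demo, test_size=0.3, random_state=123, stratify=demo['survived']
)

# Fit on train only, then score on both sets for each depth.
scores = {}
for depth in (3, 5):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=123)
    clf.fit(train[features], train['survived'])
    scores[depth] = {
        'train': clf.score(train[features], train['survived']),
        'validate': clf.score(validate[features], validate['survived']),
    }

for depth, s in scores.items():
    print(f"depth {depth}: train {s['train']:.2%}, validate {s['validate']:.2%}")
```

A large gap between the train and validate scores for a given depth suggests overfitting. The model to prefer is the one with the better validate score, not the better train score.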