# Decision Tree Exercises

Using the Titanic data, create a notebook named decision_tree.ipynb in your classification-exercises repository and complete the following exercises:

In [1]:
import pandas as pd
import numpy as np

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

import acquire as ac
import prepare as prep

In [2]:
titanic = ac.get_titanic_data()
titanic = prep.impute_df(titanic,'age',strategy='median')
titanic = prep.prep_titanic(titanic)
train, val, test = prep.train_val_test(titanic,'survived')
train.shape, test.shape, val.shape

((623, 13), (134, 13), (134, 13))
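The `prep.train_val_test` helper is not shown in this notebook. As a rough sketch of what such a helper might do, assuming a 70/15/15 split stratified on the target (an assumption inferred only from the shapes above), two calls to scikit-learn's `train_test_split` reproduce the same row counts. The function name `three_way_split`, the seed, and the synthetic frame are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(df, target, seed=100):
    """Hypothetical stand-in for prep.train_val_test: 70/15/15, stratified."""
    # First carve off 30% of the rows, keeping class proportions intact.
    train_part, rest = train_test_split(
        df, test_size=0.3, random_state=seed, stratify=df[target]
    )
    # Then split that 30% evenly into validate and test.
    val_part, test_part = train_test_split(
        rest, test_size=0.5, random_state=seed, stratify=rest[target]
    )
    return train_part, val_part, test_part


# Synthetic frame with the same number of rows as the Titanic data (891).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "fare": rng.uniform(0, 100, size=891),
    "survived": rng.integers(0, 2, size=891),
})

train_demo, val_demo, test_demo = three_way_split(demo, "survived")
print(train_demo.shape, val_demo.shape, test_demo.shape)
```

Stratifying on the target keeps the share of each class roughly equal across the three subsets, which matters when one class is much rarer than the other.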

In [3]:
train.head(3)

,survived,age,sibsp,parch,fare,alone,sex_male,class_First,class_Second,class_Third,embark_town_Cherbourg,embark_town_Queenstown,embark_town_Southampton
410,0,28.0,0,0,7.8958,1,1,0,0,1,0,0,1
824,0,2.0,4,1,39.6875,0,1,0,0,1,0,0,1
11,1,58.0,0,0,26.55,1,0,1,0,0,0,0,1


In [4]:
x_train, y_train = prep.split_x_y(train, 'survived')
x_train.head(3)

,age,sibsp,parch,fare,alone,sex_male,class_First,class_Second,class_Third,embark_town_Cherbourg,embark_town_Queenstown,embark_town_Southampton
410,28.0,0,0,7.8958,1,1,0,0,1,0,0,1
824,2.0,4,1,39.6875,0,1,0,0,1,0,0,1
11,58.0,0,0,26.55,1,0,1,0,0,0,0,1


In [5]:
y_train.head(3)

410    0
824    0
11     1
Name: survived, dtype: int64

## 1. What is your baseline prediction? What is your baseline accuracy? Remember: your baseline prediction for a classification problem is the most prevalent class in the training dataset (the mode). When you make those predictions, what is your accuracy? This is your baseline accuracy.

In [6]:
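# A baseline prediction is a "model" that always guesses the most frequent class (the mode) of the target in the training data.
# Baseline accuracy is the accuracy of that baseline "model".
# Here the most frequent class is 0 (did not survive), so the baseline predicts 0 for every passenger
# and is correct 61.64% of the time (see the evaluate output in exercise 3).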
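The cell above describes the baseline without computing it. A minimal sketch of the calculation follows; it uses a stand-in Series (`y_demo`, a hypothetical name) built from the class counts that appear in the classification report later in this notebook, rather than the real `y_train`.

```python
import pandas as pd

# Stand-in for y_train, using the class counts reported in this notebook:
# 384 passengers who did not survive (0) and 239 who did (1).
y_demo = pd.Series([0] * 384 + [1] * 239, name="survived")

# The baseline prediction is the most frequent class (the mode).
baseline_prediction = y_demo.mode()[0]

# Baseline accuracy is the share of rows that match that single guess.
baseline_accuracy = (y_demo == baseline_prediction).mean()

print(f"Baseline prediction: {baseline_prediction}")
print(f"Baseline accuracy: {baseline_accuracy:.2%}")
```

With the real data, replacing `y_demo` with `y_train` should give the same 61.64% that `prep.evaluate` reports as the baseline accuracy.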

## 2. Fit the decision tree classifier to your training sample and transform (i.e., make predictions on the training sample).

In [7]:
clf = DecisionTreeClassifier(random_state=100)
clf.fit(x_train, y_train)

In [8]:
train_preds = clf.predict(x_train)
train_preds[:5]

array([0, 0, 1, 0, 0])

## 3. Evaluate your in-sample results using the model score, confusion matrix, and classification report.

In [9]:
# Work on a copy so the prediction column is not added to train itself.
results = train.copy()
results['model1'] = train_preds
prep.evaluate(results,'survived',model='model1',target=1)

Model accuracy is: 97.91%.
Baseline accuracy is: 61.64%.

Model recall is: 94.98%.
Baseline recall is: 0.0%.

Model precision is: 99.56%.
Baseline precision is: 0%.



In [10]:
clf.score(x_train,y_train)

0.9791332263242376

In [11]:
print(classification_report(y_train, train_preds))

              precision    recall  f1-score   support

           0       0.97      1.00      0.98       384
           1       1.00      0.95      0.97       239

    accuracy                           0.98       623
   macro avg       0.98      0.97      0.98       623
weighted avg       0.98      0.98      0.98       623



## 4. Compute: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [12]:
pd.crosstab(train.survived, train_preds)

col_0       0    1
survived
0         383    1
1          12  227


In [13]:
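# Treating survived == 1 as the positive class (rows are actual, columns are predicted):
# TP = 227, TN = 383, FP = 1, FN = 12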
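The exercise asks for the rates and scores, not only the four counts. As a sketch, each metric can be derived directly from the counts in the crosstab above, taking `survived == 1` as the positive class. The variable names are illustrative.

```python
# Counts read from the crosstab above, with survived == 1 as the positive class.
tp, tn, fp, fn = 227, 383, 1, 12

total = tp + tn + fp + fn
accuracy = (tp + tn) / total
recall = tp / (tp + fn)                # true positive rate
false_positive_rate = fp / (fp + tn)
true_negative_rate = tn / (tn + fp)
false_negative_rate = fn / (fn + tp)
precision = tp / (tp + fp)
f1_score = 2 * precision * recall / (precision + recall)
support_positive = tp + fn             # actual survivors in train
support_negative = tn + fp             # actual non-survivors in train

metrics = {
    "accuracy": accuracy,
    "true positive rate (recall)": recall,
    "false positive rate": false_positive_rate,
    "true negative rate": true_negative_rate,
    "false negative rate": false_negative_rate,
    "precision": precision,
    "f1-score": f1_score,
}
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
print(f"support: {support_positive} positive, {support_negative} negative")
```

The accuracy, recall, and precision computed this way agree with the `prep.evaluate` output and the classification report in exercise 3.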

## 5. Run through steps 2-4 using a different max_depth value.

In [14]:
clf_two = DecisionTreeClassifier(max_depth=5, random_state=100)
clf_two.fit(x_train, y_train)

In [15]:
model2 = clf_two.predict(x_train)
model2[:5]

array([0, 0, 1, 0, 0])

In [16]:
# Work on a copy so the prediction column is not added to train itself.
results = train.copy()
results['model2'] = model2
prep.evaluate(results,'survived',model='model2',target=1)

Model accuracy is: 85.07%.
Baseline accuracy is: 61.64%.

Model recall is: 65.27%.
Baseline recall is: 0.0%.

Model precision is: 93.98%.
Baseline precision is: 0%.



In [17]:
print(classification_report(y_train, model2))

              precision    recall  f1-score   support

           0       0.82      0.97      0.89       384
           1       0.94      0.65      0.77       239

    accuracy                           0.85       623
   macro avg       0.88      0.81      0.83       623
weighted avg       0.86      0.85      0.84       623



In [18]:
pd.crosstab(train.survived, model2)

col_0       0    1
survived
0         374   10
1          83  156


## 6. Which model performs better on your in-sample data?

In [19]:
# The first model, with no max_depth limit, performs better in-sample (97.91% vs. 85.07% accuracy).

## 7. Which model performs best on your out-of-sample data, the validate set?

In [20]:
x_val, y_val = prep.split_x_y(val,'survived')

In [21]:
m1_val = clf.predict(x_val)
m2_val = clf_two.predict(x_val)

In [22]:
clf.score(x_val, y_val)

0.8134328358208955

In [23]:
clf_two.score(x_val, y_val)

0.8134328358208955

In [24]:
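# The two models tie on the validate set: both score 81.34% accuracy.
# The max_depth=5 model generalizes more consistently, though: its accuracy drops from 85.07% (train)
# to 81.34% (validate), while the unlimited-depth model drops from 97.91% to 81.34%, a sign of overfitting.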
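Comparing only two depths makes it hard to see where overfitting starts. One way to explore this, sketched below on synthetic data rather than the Titanic data, is to loop over several `max_depth` values and compare train and validate accuracy for each. The dataset parameters, the depth list, and the variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data with some label noise (flip_y); not the Titanic data.
X_all, y_all = make_classification(
    n_samples=900, n_features=10, n_informative=4, flip_y=0.1, random_state=100
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_all, y_all, test_size=0.2, stratify=y_all, random_state=100
)

scores = []
for depth in [1, 2, 3, 5, 8, None]:  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=100)
    tree.fit(X_tr, y_tr)
    scores.append({
        "max_depth": depth,
        "train": tree.score(X_tr, y_tr),
        "validate": tree.score(X_va, y_va),
    })

for row in scores:
    gap = row["train"] - row["validate"]
    print(f"max_depth={row['max_depth']}: train={row['train']:.3f}, "
          f"validate={row['validate']:.3f}, gap={gap:.3f}")
```

A widening gap between train and validate accuracy as depth grows is the usual signal that the tree is memorizing the training rows.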

# Telco

## 1. Work through these same exercises using the Telco dataset.

In [25]:
telco = ac.get_telco_data()
telco = prep.prep_telco(telco)
telco.head(3)

,senior_citizen,tenure,monthly_charges,total_charges,gender_Male,partner_Yes,dependents_Yes,phone_service_Yes,multiple_lines_No phone service,multiple_lines_Yes,...,streaming_movies_Yes,paperless_billing_Yes,churn_Yes,contract_type_One year,contract_type_Two year,internet_service_type_Fiber optic,internet_service_type_None,payment_type_Credit card (automatic),payment_type_Electronic check,payment_type_Mailed check
0,0,9,65.6,593.3,0,1,1,1,0,0,...,0,1,0,1,0,0,0,0,0,1
1,0,9,59.9,542.4,1,0,0,1,0,1,...,1,0,0,0,0,0,0,0,0,1
2,0,4,73.9,280.85,1,0,0,1,0,0,...,0,1,1,0,0,1,0,0,1,0


In [26]:
train, val, test = prep.train_val_test(telco,'churn_Yes')
x_train, y_train = prep.split_x_y(train, 'churn_Yes')
x_val, y_val = prep.split_x_y(val, 'churn_Yes')

In [27]:
clf = DecisionTreeClassifier(random_state=100)
clf2 = DecisionTreeClassifier(max_depth=8, random_state=100)
clf.fit(x_train, y_train)
clf2.fit(x_train, y_train)

In [28]:
train.churn_Yes.value_counts()

0    3614
1    1308
Name: churn_Yes, dtype: int64

In [29]:
model1 = clf.predict(x_train)
model1[:5]

array([0, 1, 0, 0, 0], dtype=uint8)

In [30]:
model2 = clf2.predict(x_train)
model2[:5]

array([0, 1, 0, 0, 0], dtype=uint8)

In [31]:
# Work on a copy so the prediction column is not added to train itself.
results = train.copy()
results['model1'] = model1
prep.evaluate(results,'churn_Yes',model='model1',target=1)

Model accuracy is: 99.76%.
Baseline accuracy is: 73.43%.

Model recall is: 99.08%.
Baseline recall is: 0.0%.

Model precision is: 100.0%.
Baseline precision is: 0%.



In [32]:
# Work on a copy so the prediction column is not added to train itself.
results = train.copy()
results['model2'] = model2
prep.evaluate(results,'churn_Yes',model='model2',target=1)

Model accuracy is: 83.97%.
Baseline accuracy is: 73.43%.

Model recall is: 63.76%.
Baseline recall is: 0.0%.

Model precision is: 72.58%.
Baseline precision is: 0%.



In [33]:
print(classification_report(y_train, model1))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3614
           1       1.00      0.99      1.00      1308

    accuracy                           1.00      4922
   macro avg       1.00      1.00      1.00      4922
weighted avg       1.00      1.00      1.00      4922



In [34]:
# Pair each feature name with its importance, then sort from most to least important.
importance = {'cols': x_train.columns,
              'importance': clf.feature_importances_}

pd.DataFrame(importance).sort_values(by='importance', ascending=False)

,cols,importance
1,tenure,0.213253
3,total_charges,0.209517
2,monthly_charges,0.198517
25,internet_service_type_Fiber optic,0.107736
28,payment_type_Electronic check,0.022906
22,paperless_billing_Yes,0.022711
4,gender_Male,0.02038
5,partner_Yes,0.017859
11,online_security_Yes,0.017373
6,dependents_Yes,0.017234


In [35]:
clf.score(x_val, y_val)

0.7308056872037915

In [36]:
clf2.score(x_val, y_val)

0.7668246445497631

## 2. Experiment with this model on other datasets with a higher number of output classes.

In [37]:
iris = ac.get_iris_data()
iris = prep.prep_iris(iris)
iris.head(3)

,species,sepal_length,sepal_width,petal_length,petal_width
0,setosa,5.1,3.5,1.4,0.2
1,setosa,4.9,3.0,1.4,0.2
2,setosa,4.7,3.2,1.3,0.2


In [38]:
train, val, test = prep.train_val_test(iris,'species')
x_train, y_train = prep.split_x_y(train, 'species')
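# NOTE: this passes the full iris frame, not val, so the score in the last cell includes the training rows
# and is not a true out-of-sample estimate. For a validate-only score, use prep.split_x_y(val, 'species').
x_val, y_val = prep.split_x_y(iris, 'species')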

In [39]:
clf = DecisionTreeClassifier(max_depth=8, random_state=100)
clf.fit(x_train, y_train)

In [40]:
model1 = clf.predict(x_train)

In [41]:
results = train
results['model1'] = model1
prep.evaluate(results,'species',model='model1',target='setosa')

Model accuracy is: 100.0%.
Baseline accuracy is: 33.33%.

Model recall is: 100.0%.
Baseline recall is: 100.0%.

Model precision is: 100.0%.
Baseline precision is: 33.33%.



In [42]:
print(classification_report(y_train, model1))

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        35
  versicolor       1.00      1.00      1.00        35
   virginica       1.00      1.00      1.00        35

    accuracy                           1.00       105
   macro avg       1.00      1.00      1.00       105
weighted avg       1.00      1.00      1.00       105
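With three classes, precision and recall are computed one class at a time, treating the chosen class as positive and the other two as negative. That is presumably what `target='setosa'` selects in `prep.evaluate`, although the helper is not shown here. A small sketch with made-up labels (not the notebook's data) shows how the per-class values fall out of a multiclass confusion matrix.

```python
from sklearn.metrics import confusion_matrix

labels = ["setosa", "versicolor", "virginica"]

# Small made-up example, not taken from the notebook's data.
y_true = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
y_pred = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

# Rows are actual classes, columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)

per_class = {}
for i, label in enumerate(labels):
    tp = cm[i, i]
    fn = cm[i, :].sum() - tp  # actual `label`, predicted as something else
    fp = cm[:, i].sum() - tp  # predicted `label`, actually something else
    per_class[label] = {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

for label, values in per_class.items():
    print(f"{label}: precision={values['precision']:.2f}, recall={values['recall']:.2f}")
```

In this toy example one versicolor flower is predicted as virginica, which lowers versicolor's recall and virginica's precision while leaving setosa untouched.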



In [43]:
# Pair each feature name with its importance, then sort from most to least important.
importance = {'cols': x_train.columns,
              'importance': clf.feature_importances_}

pd.DataFrame(importance).sort_values(by='importance', ascending=False).head()

,cols,importance
2,petal_length,0.545645
3,petal_width,0.432586
1,sepal_width,0.021769
0,sepal_length,0.0


In [44]:
clf.score(x_val, y_val)

0.98