### Codio Activity 14.1: Decision Trees with `sklearn`

**Expected Time = 60 minutes**

**Total Points = 50**

This activity introduces the `DecisionTreeClassifier` from the `sklearn.tree` module.  You will build some basic models and explore the available hyperparameters.  Using the fitted models, you will then examine the decision boundaries determined by the estimator.

#### Index 

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)
- [Problem 5](#Problem-5)



In [15]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from sklearn import set_config

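# Pass display by keyword; a bare positional argument would be bound to assume_finite instead
set_config(display="diagram")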

### The Data

For this activity, you will again use the `penguins` data from seaborn.  You will focus on the two features that matter most for distinguishing between `Adelie` and `Gentoo`.

In [16]:
penguins = sns.load_dataset("penguins").dropna()

In [17]:
penguins.head()

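|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
| - | ------- | ------ | -------------- | ------------- | ----------------- | ----------- | --- |
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
| 5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |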


In [18]:
X = penguins.select_dtypes(["float"])
y = penguins.species

In [19]:
# sns.pairplot(data = penguins, hue = 'species')

[Back to top](#-Index)

### Problem 1

#### Fitting a model

To begin, build a `DecisionTreeClassifier` with the parameter `max_depth = 1`.  Fit the model on the data (`X` and `y`) and assign it to the variable `dtree` below.

**10 Points**



In [20]:
### GRADED

dtree = DecisionTreeClassifier(max_depth=1).fit(X, y)

# Answer check
print(dtree)

DecisionTreeClassifier(max_depth=1)
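
A depth-1 tree makes a single split. As a rough illustration of how such a split can be chosen, the sketch below scans the midpoints between consecutive values of one feature and keeps the threshold with the lowest weighted Gini impurity (Gini is the default criterion of `DecisionTreeClassifier`). The helper names `gini` and `best_split` and the ten measurements are hypothetical, made up for this example. The real estimator also searches across all features, so treat this as a simplified sketch rather than the library's implementation.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best single split on one feature."""
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    best = (None, float("inf"))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(xs, xs[1:]):
        if lo == hi:
            continue
        threshold = (lo + hi) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical flipper lengths (mm) and species, made up for illustration.
flipper = [181, 186, 195, 193, 190, 210, 215, 220, 230, 217]
species = ["Adelie"] * 5 + ["Gentoo"] * 5

threshold, impurity = best_split(flipper, species)
print(threshold, impurity)
```

On this toy data the two groups separate cleanly, so the chosen threshold falls halfway between the largest `Adelie` value and the smallest `Gentoo` value.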


[Back to top](#-Index)

### Problem 2

#### Examining the Decision

To examine a basic text representation of the fit tree, use the `export_text` function and set the argument `feature_names = list(X.columns)`.  

**10 Points**

In [21]:
### GRADED

depth_1 = export_text(dtree, feature_names=list(X.columns))

### ANSWER CHECK
print(depth_1)

|--- flipper_length_mm <= 206.50
|   |--- class: Adelie
|--- flipper_length_mm >  206.50
|   |--- class: Gentoo
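
The thresholds in this text are the positions of the decision boundaries, so it can be handy to pull them out programmatically. The sketch below does this with a regular expression applied to a copy of the output above. The helper `extract_thresholds` and the pattern are hypothetical, and they assume the text keeps the layout shown here, which is not guaranteed across `sklearn` versions.

```python
import re

# Text copied from the depth-1 output printed above.
tree_text = """\
|--- flipper_length_mm <= 206.50
|   |--- class: Adelie
|--- flipper_length_mm >  206.50
|   |--- class: Gentoo
"""

# Matches rule lines such as "|--- flipper_length_mm <= 206.50".
rule_pattern = re.compile(r"\|--- (\w+) (<=|>)\s+([\d.]+)")


def extract_thresholds(text):
    """Return the sorted unique (feature, threshold) pairs found in the text."""
    found = set()
    for line in text.splitlines():
        match = rule_pattern.search(line)
        if match:
            feature, _, value = match.groups()
            found.add((feature, float(value)))
    return sorted(found)


print(extract_thresholds(tree_text))
```

Each split appears twice in the text (once for `<=` and once for `>`), which is why the pairs are collected in a set before sorting.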



[Back to top](#-Index)

### Problem 3

#### Two Features

**10 Points**

Now, to make the boundaries simpler to plot, the data is subset to `flipper_length_mm` and `bill_length_mm`.  Below, fit the model and assign the text representation produced by `export_text()` to `tree2`.  Try replicating the image below using the information from the tree. (Vertical and horizontal lines represent the decision boundaries of the tree.)

<center>
    <img src = 'images/p3.png' />
</center>



In [22]:
X2 = X[["flipper_length_mm", "bill_length_mm"]]
# Use max depth = 2 to use both features
dtree = DecisionTreeClassifier(max_depth=2).fit(X2, y)
tree2 = export_text(dtree, feature_names=list(X2.columns))

### ANSWER CHECK
print(tree2)

|--- flipper_length_mm <= 206.50
|   |--- bill_length_mm <= 43.35
|   |   |--- class: Adelie
|   |--- bill_length_mm >  43.35
|   |   |--- class: Chinstrap
|--- flipper_length_mm >  206.50
|   |--- bill_length_mm <= 40.85
|   |   |--- class: Adelie
|   |--- bill_length_mm >  40.85
|   |   |--- class: Gentoo
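
One way to read this output is as nested `if` statements. The sketch below hand-codes the four leaves using the thresholds printed above. The function name `predict_depth_2` and the example measurements are hypothetical. If flipper length is on the horizontal axis of the plot (an assumption about the image), the first split is a vertical line at 206.50, and the two bill-length splits are horizontal segments at 43.35 to its left and 40.85 to its right.

```python
def predict_depth_2(flipper_length_mm, bill_length_mm):
    """Apply the depth-2 rules printed above to one penguin."""
    if flipper_length_mm <= 206.50:
        # Left of the flipper boundary: split on bill length at 43.35.
        return "Adelie" if bill_length_mm <= 43.35 else "Chinstrap"
    # Right of the flipper boundary: split on bill length at 40.85.
    return "Adelie" if bill_length_mm <= 40.85 else "Gentoo"


# Hypothetical measurements, one chosen to land in each leaf of the tree.
examples = [(190, 39.0), (195, 49.0), (210, 40.0), (215, 47.0)]
for flipper, bill in examples:
    print(flipper, bill, predict_depth_2(flipper, bill))
```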



[Back to top](#-Index)

### Problem 4

#### Evaluating the tree

**10 Points**

Again, the default metric of the classifier is accuracy.  Evaluate the accuracy of the fitted `DecisionTreeClassifier` and assign it as a float to `acc_depth_2` below.  As the image of the decision boundaries shows, a few points are misclassified.

In [23]:
### GRADED

acc_depth_2 = accuracy_score(y, dtree.predict(X2))

### ANSWER CHECK
print(acc_depth_2)

0.9519519519519519
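
Accuracy is simply the fraction of predictions that match the true labels. Note that the model here is scored on the same `X2` and `y` it was fit on, so this is a training accuracy. The sketch below computes the same quantity by hand on made-up labels; the helper `accuracy` and the label lists are hypothetical. It also checks that the printed score equals 317/333, which would correspond to 16 misclassified penguins if the cleaned data has 333 rows (an assumption, since the row count is not shown in this activity).

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Hypothetical labels, made up for illustration.
y_true = ["Adelie", "Adelie", "Chinstrap", "Gentoo", "Gentoo"]
y_pred = ["Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"]
print(accuracy(y_true, y_pred))  # 4 of the 5 labels match

# The score printed above agrees with 317 / 333 up to floating-point rounding.
print(abs(317 / 333 - 0.9519519519519519) < 1e-12)
```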


[Back to top](#-Index)

### Problem 5

#### A Deeper Tree

**10 Points**

Finally, consider a tree with `max_depth = 3`.  Print the results and use them to decide on a prediction for the following penguin:

| flipper_length_mm | bill_length_mm |
| ----------------- | -------------  |
| 209 | 41.2 |

Assign your results as a string `Adelie`, `Chinstrap`, or `Gentoo` to `prediction` below.

In [28]:
### GRADED
x = pd.DataFrame([[209, 41.2]], columns=list(X2.columns))
display(x)
dtree3 = DecisionTreeClassifier(max_depth=3).fit(X2, y)
# predict returns an array; take its single element so that prediction is a string
prediction = dtree3.predict(x)[0]

# Answer check
print(prediction)

|   | flipper_length_mm | bill_length_mm |
| - | ----------------- | -------------- |
| 0 | 209 | 41.2 |


Gentoo
