## <span style='color:#DB822E'>Activity 2: Visualizing a Decision Tree</span>

In Activity 01 we used a decision tree as a classifier. There are many different types of classifiers, but we chose a decision tree because it is easy to read and understand. Decision trees are one of the few models that are interpretable, meaning we can understand exactly why a classifier predicts an answer. In this lab we will add modules to the environment that will allow us to visualize the decision tree.

We will move on from the 🍎 and 🍊 data for now. In this activity we will use a well-known dataset used in beginner ML code projects called [Iris](https://en.wikipedia.org/wiki/Iris_flower_data_set). The data records 4 different features of each Iris flower, which we use to predict its species from one of three choices:

1. setosa
2. versicolor
3. virginica

This dataset is a bit more complex and interesting than the dataset used in Activity 01. This lab will emphasize the importance of test data to validate the model and deliver a cool graphical representation of the decision tree model. In this lab we will:

1. Import a data set
2. Train a classifier
3. Predict a label for a new flower
4. Visualize the decision tree

To learn more about this lab, you can watch [Visualize a Decision Tree Machine Learning Recipes 2](https://youtu.be/tNa99PG8hR8?si=IFuJcyxgLf8htxNz).


***

## Step 1 Configure Python

First, we need to further prepare our Python environment by importing two additional libraries required by this lab (pandas and NumPy).
- `Pandas` is a very popular Python library for working with data. Since we are importing data, we need this library.
- `NumPy` is an open-source Python library that facilitates efficient numerical operations on large quantities of data.

Let's get started with the lab by running the code cell just below 👇🏾 to load the Pandas and NumPy libraries. **Hint: highlight the cell and try Shift+Enter**


In [None]:
# Import Jupyter Display module for better data output
from IPython.display import display, Markdown, Image
import pandas as pd
import numpy as np

display(Markdown('<span style="color: #14B326">Done! On to Step 2</span>'))



***

## Step 2 - Load the datasets
scikit-learn is a great tool for learning ML because it provides many sample datasets (e.g. Iris) and utilities for downloading them. In this step we will import and load the data.

- `feature_names` is the variable for the 4 features included in the dataset

1. `sepal length (cm)`
2. `sepal width (cm)`
3. `petal length (cm)`
4. `petal width (cm)`

**Terminology:**

Sepal
: is (usually) the green leaf of the flower<br>

Petal
: is the non-green leaf/petal of the flower<br>

- `target_names` is the variable for the labels (the 3 different species of Iris flower)

1. `setosa`
2. `versicolor`
3. `virginica`

We then print the feature and target (label) names to confirm the datasets are loaded. Go ahead and try it now 👇🏾.

    
**Tip: This step may take a moment to run and complete. Remember a status indicator `[*]:`, indicates the code cell is still running.**

In [None]:
from sklearn.datasets import load_iris
from sklearn import tree

iris = load_iris()

display(Markdown("### <font color='#DB822E'>Feature Names:</font>"))
display(iris.feature_names)

display(Markdown("### <font color='#DB822E'>Label Names:</font>"))
for name in iris.target_names:
    display(name)

display(Markdown('<span style="color: #14B326">Done! On to Step 3.</span>'))
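
Beyond the names, it can help to check the shape of what was loaded. The cell below is a minimal sketch (not part of the original lab) that reloads Iris and prints the size of the feature array and the label array; the variable `label_names` is introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# data: one row per flower, one column per feature
print(iris.data.shape)          # (150, 4)

# target: one integer label per flower
print(iris.target.shape)        # (150,)

# the integer labels are positions in target_names
print(np.unique(iris.target))   # [0 1 2]

# convert to plain Python strings for a tidy printout
label_names = [str(name) for name in iris.target_names]
print(label_names)              # ['setosa', 'versicolor', 'virginica']
```

So each of the 150 examples has 4 feature values and exactly one label code.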


*** 

## Step 3 - Display the Iris Data

In Step 2 we imported the Iris data and started the validation process by displaying the feature and label names. Let's finish the validation process by displaying and exploring the data. Iris has 150 examples (records). You can display a single line from the data with `display(iris.data[0])`.

Feel free to uncomment that line in the cell below to display the features from any given line (0-149) of the Iris data. Or run the code cell as is to display the entire dataset.

In [None]:
# iris_data = []
# for i in range(len(iris.target)):
#     iris_data.append(
#         {"Example": i, "Label": iris.target[i], "Features": iris.data[i]}
#     )

# df = pd.DataFrame(iris_data)
# from IPython.display import HTML  # HTML was not imported in Step 1
# display(HTML(df.to_html(index=False)))

for i in range(len(iris.target)):
    display(f"Example {i}: label {iris.target[i]}: features {iris.data[i]}")

#display(iris.data[0])

display(Markdown('<span style="color: #14B326">Done! The data is displayed with numeric label codes (0, 1, 2) instead of the species names. In Step 4 we will display the data with the actual labels.</span>'))
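
If the long list of lines is hard to scan, a table can be easier to read. The cell below is one possible sketch using pandas; the column name `label` and the variable `df` are choices made here for illustration, not something the dataset defines.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# one column per feature, named after iris.feature_names
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# add the numeric label code as an extra column
df["label"] = iris.target

# first five examples, then the overall size of the table
print(df.head())
print(df.shape)   # (150, 5)
```

In a notebook, `display(df)` shows the same table with nicer formatting.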



***

## Step 4 - Print out entire dataset with label names

Displaying the data with the actual labels makes it more human-readable. You will also notice that the data is ordered so that a new species starts at lines 0, 50 and 100. This is important because in Step 5 we will reserve one test example for each species to use for model validation.

- Example 0 begins Setosa
- Example 50 begins Versicolor
- Example 100 begins Virginica

**Go for it!** 👇🏾

In [None]:

for i in range(len(iris.target)):
    display(f"Example {i}: label {iris.target_names[iris.target[i]]}({iris.target[i]}): features {iris.data[i]}")
    
display(Markdown('<span style="color: #14B326">Done! On to step 5!</span>'))
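
Rather than scrolling through 150 lines, the ordering can also be checked in code. This is a small sketch (the variable `first_idx` is made up for this example) that finds where each species starts and ends.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

first_idx = []
for code, name in enumerate(iris.target_names):
    # positions of every example that carries this label code
    idx = np.flatnonzero(iris.target == code)
    first_idx.append(int(idx[0]))
    print(f"{name}: examples {idx[0]} to {idx[-1]} ({len(idx)} total)")

print(first_idx)   # [0, 50, 100]
```

Each species has 50 examples, stored one block after another.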


***

## Step 5 - Prep the data by removing test data entries
This is supervised learning, which means we use a labeled set of data to train the model. After we train a model, we need to test it with data whose labels we already know. Therefore, we will remove three examples to create a test dataset; the rest of the dataset is used to train the model. In the previous step we learned that a new species starts at lines 0, 50 and 100.

In this code block we will remove examples 0, 50 and 100 and use them to create a new set of data for testing. To validate our work we display the test data.

Remember:

- `test_target` is the variable for the label array. `[0, 1, 2]` for setosa, versicolor and virginica.
- `test_data` is the array of features.



In [None]:

# iris = load_iris()
test_idx = [0, 50, 100]

# training data
train_target = np.delete(iris.target, test_idx)
train_data = np.delete(iris.data, test_idx, axis=0)

# testing data
test_target = iris.target[test_idx]
test_data = iris.data[test_idx]



# What to expect
display(Markdown("This will show the array of the test data. It will be relevant in the next section:"))

display(test_target)
display(test_data)


display(Markdown('<span style="color: #14B326">Done! Now that we have training and test data, we will train the model in step 6.</span>'))
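
To see what `np.delete` did, we can compare array shapes before and after. The sketch below repeats the split on a fresh copy of Iris and also shows why `axis=0` matters for the 2-D feature array; the name `flattened` is used only in this example.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
test_idx = [0, 50, 100]

# axis=0 removes whole rows (examples) from the 2-D feature array
train_data = np.delete(iris.data, test_idx, axis=0)
train_target = np.delete(iris.target, test_idx)
test_data = iris.data[test_idx]

print(train_data.shape)     # (147, 4)
print(train_target.shape)   # (147,)
print(test_data.shape)      # (3, 4)

# without axis, np.delete flattens the array and removes single values
flattened = np.delete(iris.data, test_idx)
print(flattened.shape)      # (597,)
```

The training set keeps 147 of the 150 examples, and the 3 removed rows become the test set.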


***

## Step 6 - Train the model. What does the model predict?

We now train the model with the new dataset, similar to what we did with the 🍎 and 🍊 data in Activity 01. We will:
1. Create a classifier model
2. Apply a fit algorithm
3. Train with our training data

Using the test data we set aside in Step 5, we then predict the labels and compare them with the labels we already know.
<br><br>


In [None]:
# create new classifier
clf = tree.DecisionTreeClassifier()
# train on training data
clf.fit(train_data, train_target)

# What the tree predicts
display(Markdown("This should match the output in the previous section:"))
display(clf.predict(test_data))


display(Markdown('<span style="color: #14B326">Done! On to Step 7!</span>'))
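
Comparing two small arrays by eye works for three examples, but a score is easier to read as the test set grows. The cell below is a hedged sketch that repeats the split and training and then uses `accuracy_score` from scikit-learn. Setting `random_state=0` is an assumption added here so repeated runs build the same tree, and `predicted_names` is a name made up for this example.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

iris = load_iris()
test_idx = [0, 50, 100]

# same split as in Step 5
train_data = np.delete(iris.data, test_idx, axis=0)
train_target = np.delete(iris.target, test_idx)
test_data = iris.data[test_idx]
test_target = iris.target[test_idx]

# random_state is an added assumption for repeatable results
clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(train_data, train_target)

predicted = clf.predict(test_data)

# fraction of test examples whose predicted label matches the known label
accuracy = accuracy_score(test_target, predicted)
print(accuracy)

# translate label codes back into species names
predicted_names = [str(iris.target_names[i]) for i in predicted]
print(predicted_names)
```

An accuracy of 1.0 would mean every test example was labeled correctly. With only three test examples this is a quick sanity check, not a thorough evaluation.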


***

## Step 7 - Visualize the Decision Tree

"A picture is worth a thousand words."  Decision tree models are a great start for people new to machine learning.  As mentioned earlier, this model is one of the few that can be visualized and interpreted.  The code block below uses `pydotplus`, which relies on Graphviz being installed, to render an image of the actual decision tree.  Go ahead and run the code now 👇🏾 and review the decision tree.

In [None]:
# Visualize
# from scikit decision tree tutorial: http://scikit-learn.org/stable/modules/tree.html
from io import StringIO  # standard library, no extra package needed
import pydotplus

dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data,
                     feature_names=iris.feature_names,
                     class_names=iris.target_names,
                     filled=True, rounded=True,
                     impurity=False)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())

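# make sure the output folder exists before writing the image
import os
os.makedirs("images", exist_ok=True)
graph.write_png("images/iris.png")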
display(Image(filename="images/iris.png"))

display(Markdown('<span style="color: #14B326">Done!</span>'))
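
If Graphviz is not available in your environment, the same tree can be read as plain text. The cell below is an alternative sketch, not the method used above: it trains a tree on the full Iris data (with `random_state=0` added as an assumption for repeatable output) and prints its rules with scikit-learn's `export_text`.

```python
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.tree import export_text

iris = load_iris()

# random_state is an added assumption so the printed tree is repeatable
clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(iris.data, iris.target)

# each indented line is one question the tree asks about a feature
rules = export_text(clf, feature_names=iris.feature_names)
print(rules)
```

Reading from the top, each `<=` line is a yes/no question, and each `class:` line is the label the tree predicts once the questions run out.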


## You have completed Activity 02.

[Start Activity 03](Activity03.ipynb)