# How to build Decision Trees
Authors: Patrick Wales-Dinan

<img src="./images/new_job.png" align="left" width=1000>

- (This image is courtesy of [Rajesh Brid](https://medium.com/@rajesh_brid).)


### Learning Objectives

- Understand what a decision tree is.
- Calculate Gini Impurity.
- Describe how decision trees use Gini Impurity to make decisions.
- Fit, generate predictions from, and evaluate decision tree models.

## What should we do today?

|$Y = $ Activity|$X_1 = $ Day|$X_2 = $ Weather|
|:---------:|:--------------:|:----------:|
|   Movies  |      Weekend     |   Rainy  |
|   Netflix |      Weekday     |   Sunny  |
|   Beach   |      Weekend     |   Sunny  |
|   Netflix |      Weekday     |   Rainy  |
|   Netflix |      Weekday     |   Rainy  |
|   Beach   |      Weekend     |   Sunny  |

<details><summary>It's a weekday. Based on our past behavior, what do you think we'll do today?</summary>

- Watch Netflix.
- In 100% of past cases where it's a weekday we've watched Netflix!

|$Y = $ Activity|$X_1 = $ Day|$X_2 = $ Weather|
|:---------:|:--------------:|:----------:|
|   Netflix  |      Weekday     |   Sunny  |
|   Netflix  |      Weekday     |   Rainy  |
|   Netflix  |      Weekday     |   Rainy  |

</details>

<details><summary>It's a weekend. Based on our past behavior, what do you think we'll do?</summary>

- Either go to the movies or go to the beach... but we can't say with certainty whether we'd go to the beach or go to the movies.
- Based on our past behavior, we go to the movies on 1/3 of weekend days and we go to the beach on 2/3 of weekend days. (You can think of these as the class probabilities that `.predict_proba()` returns.)
- If I **had** to make a guess here, I'd probably predict that we would go to the beach, but we may want to use additional information to be certain.

|$Y = $ Activity|$X_1 = $ Day|$X_2 = $ Weather|
|:---------:|:--------------:|:----------:|
|  Movies |      Weekend     |   Rainy  |
|  Beach  |      Weekend     |   Sunny  |
|  Beach  |      Weekend     |   Sunny  |

</details>

<details><summary>It's the weekend and the weather is sunny! Based on our past behavior, what do you think we'll do?</summary>

- Go to the beach.
- In 100% of past cases where the weather is sunny and where it's a weekend, we've gone to the beach.

|$Y = $ Activity|$X_1 = $ Day|$X_2 = $ Weather|
|:---------:|:--------------:|:----------:|
|  Beach  |      Weekend     |   Sunny  |
|  Beach  |      Weekend     |   Sunny  |

</details>
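The reasoning in the three questions above amounts to filtering our past observations and counting the outcomes that remain. Here is a minimal sketch in plain Python, assuming the six rows of the table are stored as tuples. The names `history` and `activity_frequencies` are illustrative and not part of the lesson's notebook.

```python
from collections import Counter

# Past behavior as (activity, day, weather) tuples, copied from the table above.
history = [
    ("Movies", "Weekend", "Rainy"),
    ("Netflix", "Weekday", "Sunny"),
    ("Beach", "Weekend", "Sunny"),
    ("Netflix", "Weekday", "Rainy"),
    ("Netflix", "Weekday", "Rainy"),
    ("Beach", "Weekend", "Sunny"),
]

def activity_frequencies(rows, day=None, weather=None):
    """Return the share of each activity among rows matching the given filters."""
    matches = [
        activity
        for activity, row_day, row_weather in rows
        if (day is None or row_day == day)
        and (weather is None or row_weather == weather)
    ]
    counts = Counter(matches)
    return {activity: n / len(matches) for activity, n in counts.items()}

print(activity_frequencies(history, day="Weekday"))
print(activity_frequencies(history, day="Weekend"))
print(activity_frequencies(history, day="Weekend", weather="Sunny"))
```

Each extra filter narrows the set of matching rows, which is the same thing a decision tree does each time it splits.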

# Decision Trees: Overview

A decision tree:
- takes a dataset consisting of $X$ and $Y$ data, 
- finds rules based on our $X$ data that partition (split) our data into smaller datasets such that
- by the bottom of the tree, the values $Y$ in each "leaf node" are as "pure" as possible.

We frequently see decision trees represented by a graph.

<img src="./images/decision_tree_1.png" alt="what_to_do" width="750"/>

- (This image was created using [Draw.io](https://www.draw.io/).)
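A fitted tree is just a set of nested rules, so it can also be written as ordinary `if` statements. The sketch below is one hand-written tree that is consistent with the six observations in the "What should we do today?" table. It assumes the tree splits on day first and weather second, and the function name `predict_activity` is illustrative.

```python
def predict_activity(day, weather):
    """Hand-written rules for one tree that fits the six past observations."""
    # Root node: split on the day of the week.
    if day == "Weekday":
        return "Netflix"   # leaf: every past weekday was Netflix
    # Internal node: weekend observations are split on the weather.
    if weather == "Sunny":
        return "Beach"     # leaf: every sunny weekend was the beach
    return "Movies"        # leaf: the one rainy weekend was the movies

print(predict_activity("Weekend", "Sunny"))
```

Fitting a decision tree means having the computer discover rules like these from the data instead of writing them by hand.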

### Terminology
Decision trees look like upside-down trees.
- What we see on top is known as the "root node," through which all of our observations are passed.
- At each internal split, our dataset is partitioned.
- A "parent" node is split into two or more "child" nodes.
- Each of the "leaf nodes" (colored blue) holds a subset of records that is as pure as possible.
    - In the example above, each leaf node is perfectly pure. Once we get to a leaf node, every observation in that leaf node has the exact same value of $Y$!
    - There are ways to quantify the idea of "purity" here so that we can let our computer do most of the tree-building (model-fitting) process... we'll come back to this later.

Decision trees are also called "**Classification and Regression Trees**," sometimes abbreviated "**CART**."
- [DecisionTreeClassifier Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
- [DecisionTreeRegressor Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)

## Purity in Decision Trees

When quantifying how "pure" a node is, we want to see what the distribution of $Y$ is in each node, then summarize this distribution with a number.

<img src="./images/decision_tree_1.png" alt="what_to_do" width="750"/>

- For continuous $Y$ (e.g. using a decision tree to predict income), the default option is mean squared error.
- This is the `criterion='squared_error'` argument in `DecisionTreeRegressor` (spelled `criterion='mse'` in scikit-learn releases before 1.0).

- For discrete $Y$, the default option is the Gini impurity. *(Bonus: This is not quite the same thing as the [Gini coefficient](https://en.wikipedia.org/wiki/Gini_coefficient).)*

$$
\begin{eqnarray*}
\text{Gini impurity} &=& 1 - \sum_{i=1}^{classes} P(\text{class i})^2 \\
\text{Gini impurity (2 classes)} &=& 1 - P(\text{class 1})^2 - P(\text{class 2})^2 \\
\text{Gini impurity (3 classes)} &=& 1 - P(\text{class 1})^2 - P(\text{class 2})^2 - P(\text{class 3})^2 \\
\end{eqnarray*}
$$

In [None]:
# Create our y variable from our "what should we do" dataframe.
y = ['Movies', 'Netflix', 'Beach', 'Netflix', 'Netflix', 'Beach']

In [None]:
# Define Gini function, called gini.
def gini(obs):

    # Create a list to store my squared class probabilities.
    probab_sum = []

    # Iterate through each distinct class in the observations.
    for observation_i in set(obs):

        # Calculate the observed probability (relative frequency) of class i.
        prob = obs.count(observation_i) / len(obs)
#         print(f'probability of {observation_i} is {prob}')

        # Square the probability and append it to probab_sum.
        probab_sum.append(prob ** 2)
#         print(f'the squared probabilities so far are {probab_sum}')

    return 1 - sum(probab_sum)

In [None]:
# Check to see if your Gini function is correct on the 
# "what should we do today?" data. (Should get 0.6111.)
gini(y)
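
As a check on the 0.6111 figure, the same calculation can be done by hand. The six observations in `y` contain one Movies, three Netflix, and two Beach:

$$
\begin{eqnarray*}
\text{Gini impurity} &=& 1 - P(\text{Movies})^2 - P(\text{Netflix})^2 - P(\text{Beach})^2 \\
&=& 1 - \left(\frac{1}{6}\right)^2 - \left(\frac{3}{6}\right)^2 - \left(\frac{2}{6}\right)^2 \\
&=& 1 - \frac{1}{36} - \frac{9}{36} - \frac{4}{36} \\
&=& \frac{22}{36} \approx 0.6111
\end{eqnarray*}
$$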

### Gini Practice

<details><summary>What is the Gini impurity of a node when every item is from the same class?</summary>
    
- Our Gini impurity is zero.

$$
\begin{eqnarray*}
\text{Gini impurity} &=& 1 - \sum_{i=1}^{classes} P(\text{class i})^2 \\
&=& 1 - P(\text{class 1})^2 \\
&=& 1 - 1^2 \\
&=& 1 - 1 \\
&=& 0
\end{eqnarray*}
$$
</details>

In [None]:
# What is Gini when every item is from the same class?
gini(['Netflix', 'Netflix', 'Netflix'])

<details><summary>What is the Gini impurity of a node when we have two classes, one with two items and the other with one item?</summary>
    
- Our Gini impurity is 0.4444.

$$
\begin{eqnarray*}
\text{Gini impurity} &=& 1 - \sum_{i=1}^{classes} P(\text{class i})^2 \\
&=& 1 - P(\text{class 1})^2 - P(\text{class 2})^2 \\
&=& 1 - \left(\frac{2}{3}\right)^2 - \left(\frac{1}{3}\right)^2 \\
&=& 1 - \frac{4}{9} - \frac{1}{9} \\
&=& \frac{4}{9} \approx 0.4444
\end{eqnarray*}
$$
</details>

In [None]:
# What is Gini when we have two classes, one with two items and the other with one item?
gini(['Movie', 'Beach', 'Beach'])

<details><summary>What is the Gini impurity of a node when we have three classes, each with two items?</summary>
    
- Our Gini impurity is 0.6667.

$$
\begin{eqnarray*}
\text{Gini impurity} &=& 1 - \sum_{i=1}^{classes} P(\text{class i})^2 \\
&=& 1 - P(\text{class 1})^2 - P(\text{class 2})^2 - P(\text{class 3})^2 \\
&=& 1 - \left(\frac{1}{3}\right)^2 - \left(\frac{1}{3}\right)^2 - \left(\frac{1}{3}\right)^2 \\
&=& 1 - \frac{1}{9} - \frac{1}{9} - \frac{1}{9} \\
&=& 1 - \frac{1}{3} \\
&=& \frac{2}{3}
\end{eqnarray*}
$$
</details>

In [None]:
# What is Gini when we have three classes, each with two items?
gini(['Netflix', 'Netflix', 'Beach', 'Beach', 'Movies', 'Movies'])

<details><summary>Summary of Gini Impurity Scores</summary>

- In the binary case, Gini impurity ranges from 0 to 0.5.
- If we have three classes, Gini impurity ranges from 0 to 0.6667.
- If we have $k$ classes, Gini impurity ranges from 0 to $1-\frac{1}{k}$.
- In all cases, a Gini impurity of 0 means maximum purity - all of our observations are from the same class!
</details>
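
The upper bound of $1-\frac{1}{k}$ is reached when all $k$ classes are equally likely. A short numerical check, written as a sketch that works on class probabilities directly (the helper name `gini_from_probabilities` is illustrative):

```python
def gini_from_probabilities(probabilities):
    """Gini impurity computed directly from a list of class probabilities."""
    return 1 - sum(p ** 2 for p in probabilities)

for k in range(2, 6):
    uniform = [1 / k] * k  # the most mixed node possible with k classes
    print(k, round(gini_from_probabilities(uniform), 4), round(1 - 1 / k, 4))
```

The two printed columns agree for every $k$, and the value approaches 1 as the number of classes grows.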

### So how does a decision tree use Gini to decide which variable to split on?

- At any node, consider the subset of our dataframe that exists at that node.
- Iterate through each variable that could potentially split the data.
- Calculate the Gini impurity for every possible split.
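- Select the split that causes the greatest decrease in Gini impurity from the parent node to its child nodes, where the children's impurities are averaged with weights proportional to the number of observations in each child.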

<details><summary>What is the decrease in Gini impurity if we split on $X_1$? (Weekend vs. Weekday)</summary>

- Answer: 0.389

<img src="./images/gini_decrease_4.png" alt="gini_decrease" width="500"/>
</details>

<details><summary>What is the decrease in Gini impurity if we instead split on $X_2$ first? (Sunny Day vs. Rainy Day)</summary>
    
- Answer: 0.167

<img src="./images/gini_decrease_3.png" alt="gini_decrease" width="500"/>
</details>

One consequence of this is that a decision tree is fit using a **greedy** algorithm. Simply put, a decision tree makes the best short-term decision by optimizing at each node individually. _This might mean that our tree isn't optimal in the long run!_
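
The split-selection steps and the two answers above can be reproduced with a short sketch on the "What should we do today?" data. The helper names `gini_impurity` and `gini_decrease` are illustrative. This version creates one child per distinct feature value, which for these two-valued features is the same as a binary split.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(labels, feature_values):
    """Parent impurity minus the size-weighted average impurity of the children."""
    n = len(labels)
    weighted_child_impurity = 0.0
    for value in set(feature_values):
        child = [y for y, x in zip(labels, feature_values) if x == value]
        weighted_child_impurity += len(child) / n * gini_impurity(child)
    return gini_impurity(labels) - weighted_child_impurity

activity = ["Movies", "Netflix", "Beach", "Netflix", "Netflix", "Beach"]
day = ["Weekend", "Weekday", "Weekend", "Weekday", "Weekday", "Weekend"]
weather = ["Rainy", "Sunny", "Sunny", "Rainy", "Rainy", "Sunny"]

candidates = {"Day": day, "Weather": weather}
decreases = {name: gini_decrease(activity, values) for name, values in candidates.items()}
best_feature = max(decreases, key=decreases.get)

print({name: round(d, 3) for name, d in decreases.items()})  # {'Day': 0.389, 'Weather': 0.167}
print(best_feature)  # Day
```

Picking `best_feature` at a single node, without looking ahead to later splits, is the greedy step.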

## Building a Decision Tree

In [None]:
# Import data.
import pandas as pd

# Read in Titanic data.
titanic = pd.read_csv('./titanic_clean.csv')

# Encode sex as an integer: 0 for male, 1 for female.
titanic['Sex'] = titanic['Sex'].map({'male':0,
                                     'female':1})

# Create embarked_S column.
titanic['Embarked_s'] = titanic['Embarked'].map({'S':1,
                                                 'C':0,
                                                 'Q':0})

# Create embarked_C column.
titanic['Embarked_c'] = titanic['Embarked'].map({'S':0,
                                                 'C':1,
                                                 'Q':0})

# Conduct train/test split.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(titanic.drop(['Survived','PassengerId','Name','Embarked'], axis=1),
                                                    titanic['Survived'],
                                                    test_size = 0.3,
                                                    random_state = 42)

In [None]:
# Check out first five rows of X_train.
X_train.head()

In [None]:
# Import model.
from sklearn.tree import DecisionTreeClassifier

In [None]:
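# Instantiate model.
dt = DecisionTreeClassifier(random_state=42)

In [None]:
# Fit model. (Assumes every column in X_train is numeric with no missing values.)
dt.fit(X_train, y_train)

In [None]:
# Evaluate model.
print(f'Score on training set: {dt.score(X_train, y_train)}')
print(f'Score on testing set: {dt.score(X_test, y_test)}')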

<details><summary>What conclusion would you make here?</summary>

- Our model is **very** overfit to the data.
</details>

When fitting a decision tree with the default settings, your model will keep growing until it perfectly (or nearly perfectly) predicts every observation in the training data!
- This is like playing a game of 20 questions, but instead calling it "Infinite Questions." You're always going to be able to win!

<details><summary>Intuitively, what might you try to do to solve this problem?</summary>
    
- As with all models, try to gather more data.
- As with all models, remove some features.
- Is there a way for us to stop our model from growing? (Yes!)
</details>
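
As a hedged illustration of stopping a tree from growing, the sketch below compares a default tree with one limited by the `max_depth` argument of `DecisionTreeClassifier`. It uses synthetic data from `make_classification` as a stand-in (an assumption for illustration, not the Titanic data), and the value `max_depth=3` is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration; not the Titanic data).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Default tree: allowed to grow until its leaves are pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

# Constrained tree: max_depth stops growth after three levels of splits.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)

for name, tree in [("full", full_tree), ("max_depth=3", shallow_tree)]:
    print(name, "depth:", tree.get_depth(),
          "train:", round(tree.score(X_tr, y_tr), 3),
          "test:", round(tree.score(X_te, y_te), 3))
```

Comparing the training and testing scores of the two trees shows how much of the default tree's training accuracy comes from memorizing the training set.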