%matplotlib inline
%config InlineBackend.figure_format='retina'
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image, display_html, display, Math, HTML;

# Decision Trees

Today we start our study of __classification__ methods.

Recall that in a classification problem, we have data tuples $(\mathbf{x}, y)$ in which the $\mathbf{x}$ are the features, and the $y$ values are __categorical__ data.  

We typically call the $y$ values "labels."

Some examples of classification tasks:
   * Predicting tumor cells as malignant or benign
   * Classifying credit card transactions as legitimate or fraudulent
   * Classifying secondary structures of a protein as alpha-helix, beta-sheet, or other
   * Categorizing news stories as finance, weather, entertainment, sports, etc.

The first classification method we will consider is called the __decision tree.__

It is a __very__ popular method, and has some nice properties as we will see.

## Decision Trees in Action

```{note}
This section and a number of following sections are based on slides by Tan, Steinbach, and Kumar (2004)
```

We will start by describing how a decision tree works.

Assume that a decision tree has already been built to solve the following classification problem:

<center>
<font color = "blue">Given an individual's Tax Refund Status, Marital Status, and Taxable Income, predict whether they will repay a loan.</font>
</center>

<center>
    
<img src="figs/L14-DT-Example-1.png" alt="Figure" width="800px">
    
</center>

We then step through the tree, making a decision at each node that takes us to another node in the tree.

Each decision examines a single feature in the item being classified.

<center>
    
<img src="figs/L14-DT-Example-2.png" alt="Figure" width="800px">
    
</center>

<center>
    
<img src="figs/L14-DT-Example-3.png" alt="Figure" width="800px">
    
</center>

<center>
    
<img src="figs/L14-DT-Example-4.png" alt="Figure" width="800px">
    
</center>

<center>
    
<img src="figs/L14-DT-Example-5.png" alt="Figure" width="800px">
    
</center>

<center>
    
<img src="figs/L14-DT-Example-6.png" alt="Figure" width="800px">
    
</center>

We conclude that, for this record, the predicted value of "Not Repay" is "No".
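
The walk through the tree can be sketched in a few lines of Python. The tree below is a hypothetical stand-in for the one in the figures: the feature names, the 80K income threshold, and the leaf labels are assumptions made for illustration, not values read from the slides. Internal nodes are dictionaries that test a single feature; leaves are plain strings holding the predicted label.

```python
def income_branch(value):
    # Assumed numeric test: compare taxable income against an 80K threshold.
    return "< 80K" if value < 80_000 else ">= 80K"

# Hypothetical subtree reused under two marital-status branches.
income_node = {
    "feature": "taxable_income",
    "route": income_branch,
    "branches": {"< 80K": "No", ">= 80K": "Yes"},
}

# Hypothetical tree (an assumption, not the exact tree in the figures).
tree = {
    "feature": "refund",
    "branches": {
        "Yes": "No",
        "No": {
            "feature": "marital_status",
            "branches": {
                "Married": "No",
                "Single": income_node,
                "Divorced": income_node,
            },
        },
    },
}

def classify(node, record):
    """Follow one branch per node until a leaf (a string) is reached."""
    path = []
    while isinstance(node, dict):
        value = record[node["feature"]]
        # Categorical features use the value itself as the branch key;
        # numeric features are first mapped to a branch by the node's test.
        key = node["route"](value) if "route" in node else value
        path.append((node["feature"], key))
        node = node["branches"][key]
    return node, path

record = {"refund": "No", "marital_status": "Married", "taxable_income": 80_000}
label, path = classify(tree, record)
print(label)  # No
print(path)   # [('refund', 'No'), ('marital_status', 'Married')]
```

Note that only two of the three features are examined for this record: classification stops as soon as a leaf is reached.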

Note also that decision trees can be used to predict numeric values, so they are used for regression as well.

The general term "Classification and Regression Tree" (CART) is sometimes used -- although this term also refers to a specific decision tree learning algorithm.
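
To make the regression case concrete, here is a minimal sketch of a one-split regression tree (a "stump") on a single numeric feature. It assumes that each leaf predicts the mean of the training targets that reach it and that the split is chosen to minimize the sum of squared errors. Both are common choices, but they are assumptions here rather than something the text above specifies. The function names and the toy data are hypothetical.

```python
def fit_regression_stump(xs, ys):
    """Try each midpoint between sorted feature values; keep the lowest-error split."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        # Sum of squared errors when each side predicts its own mean.
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean

def predict_stump(stump, x):
    """Return the numeric value stored at the leaf that x falls into."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

# Toy data (made up for illustration): two clearly separated groups.
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]

stump = fit_regression_stump(xs, ys)
print(stump)                     # (6.5, 6.0, 21.0)
print(predict_stump(stump, 2))   # 6.0
print(predict_stump(stump, 11))  # 21.0
```

The only structural difference from a classification tree is what the leaves store: a number instead of a class label.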

## Learning a Decision Tree

<center>
    
<img src="figs/L14-DT-Overview.png" alt="Figure" width="800px">
    
</center>

We've discussed how to apply a decision tree to data (lower portion of this figure).

But how does one train a decision tree?   What algorithm can we use?

A number of algorithms have been proposed for building decision trees:

* Hunt's algorithm (one of the earliest)
* CART
* ID3, C4.5
* etc

### Hunt's Algorithm

We build the tree node by node, starting from the root.

As we build the tree, we divide the training data up.

<div style = "float: left; width: 55%;">

Let $D_t$ be the set of training records that reach node $t$.
    
 * If $D_t$ contains records that all belong to a single class $y_t$, then $t$ is a leaf node labeled as $y_t$.
 * If $D_t$ is an empty set, then $t$ is a leaf node labeled by the default class $y_d$.
 * If $D_t$ contains records that belong to more than one class, use an attribute to split $D_t$ into smaller subsets, and assign that splitting rule to node $t$.
    
Recursively apply the above procedure until a stopping criterion is met.
</div>
    
<img src="figs/L14-DT-Data-Example.png" alt="Figure" width="40%" style="float: right;">
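
The three cases of Hunt's algorithm map directly onto a short recursive function. The sketch below is a simplified illustration under several assumptions that the description above leaves open: attributes are categorical, the splitting attribute is simply the first one not yet used (a placeholder, not a real selection rule), the default class passed to child nodes is the majority class of the parent, and recursion stops when no attributes remain. The function name `hunt` and the toy records are hypothetical.

```python
from collections import Counter

def hunt(records, attributes, domains, default):
    """records: list of (features_dict, label); attributes: names still unused."""
    # Case: D_t is empty -> leaf labeled with the default class y_d.
    if not records:
        return default
    labels = [label for _, label in records]
    # Case: all records belong to a single class y_t -> leaf labeled y_t.
    if len(set(labels)) == 1:
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    # Assumed stopping criterion: nothing left to split on -> majority leaf.
    if not attributes:
        return majority
    # Case: more than one class -> split D_t on an attribute and recurse.
    attr, rest = attributes[0], attributes[1:]
    branches = {}
    for value in domains[attr]:
        subset = [(x, y) for x, y in records if x[attr] == value]
        # Children inherit the parent's majority class as their default.
        branches[value] = hunt(subset, rest, domains, majority)
    return {"attribute": attr, "branches": branches}

# Toy training set (made up for illustration).
domains = {
    "refund": ["Yes", "No"],
    "marital_status": ["Single", "Married", "Divorced"],
}
records = [
    ({"refund": "Yes", "marital_status": "Single"}, "No"),
    ({"refund": "No", "marital_status": "Married"}, "No"),
    ({"refund": "No", "marital_status": "Single"}, "Yes"),
    ({"refund": "No", "marital_status": "Married"}, "No"),
]

tree = hunt(records, ["refund", "marital_status"], domains, default="No")
print(tree)
```

In this toy run, no training record with `refund = "No"` is divorced, so the `"Divorced"` branch receives an empty $D_t$ and becomes a leaf labeled with the default class. Iterating over each attribute's full domain, rather than only the values seen in $D_t$, is what makes the empty-set case reachable.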
    

