# Decision Tree

## Table of Contents
- [1 - Fundamentals](#1)
    - [1.1 - Characteristics](#1.1)
    - [1.2 - Measuring purity](#1.2)
    - [1.3 - Information Gain](#1.3)
    - [1.4 - Decision Tree Learning](#1.4)
    - [1.5 - One-hot encoding for categorical features](#1.5)
    - [1.6 - Continuous valued features](#1.6)
    - [1.7 - Regression Trees](#1.7)
- [2 - Example](#2)  

<a name='1'></a>
# 1 - Fundamentals

<a name='1.1'></a>
## 1.1 - Characteristics

A decision tree is a popular machine learning model for classification and regression tasks. It is built by recursively splitting the training data.

Advantages:
- Easy to understand, interpret and visualize
- Works with categorical and numeric values
- Non-linear relationships between variables do not affect the accuracy of the tree
- Feature importance can be displayed 
- Requires minimal preparation or data cleaning before use

Disadvantages:
- Highly sensitive to small changes in the data (unstable)
- If the training data is imbalanced, the tree can be biased toward the majority class
- Prone to overfitting, in which case the tree generalizes poorly to previously unseen data
- Cannot extrapolate beyond the range of values seen during training

<a name='1.2'></a>
## 1.2 - Measuring purity

### Entropy function

The entropy function measures the impurity of a set of examples (other impurity measures exist, such as the Gini function). The definitions below use a binary example: cat versus not cat.

Let $p_{1}$ be the fraction of examples in a set that are cats, and $p_{0}$ the fraction that are not cats:

$p_{0} = 1 - p_{1}$

The Entropy function looks like this:
<img src="images/entropy_function.jpg" style="width:200px;height:200px;">
<caption><center><font><b>Figure 1</b>: Entropy function</font></center></caption>
    
and the general equation to calculate the entropy is this: 
    
$$H(p_{1}) = -p_{1}\log_{2}(p_{1}) - p_{0}\log_{2}(p_{0}) = -p_{1}\log_{2}(p_{1}) - (1-p_{1})\log_{2}(1-p_{1})$$
    

Note: by convention, $0 \log_{2}(0) = 0$, so a set that contains only one class has an entropy of 0.
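
The entropy formula can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function name `entropy` is an assumption made here, and the edge cases apply the convention that $0 \log_{2}(0) = 0$.

```python
import math

def entropy(p1: float) -> float:
    """Binary entropy H(p1) in bits; p1 is the fraction of positive examples (cats)."""
    # Convention: 0 * log2(0) = 0, so a pure set has zero entropy.
    if p1 <= 0.0 or p1 >= 1.0:
        return 0.0
    p0 = 1.0 - p1
    return -p1 * math.log2(p1) - p0 * math.log2(p0)

print(entropy(0.5))  # 1.0, the most impure case
print(entropy(1.0))  # 0.0, a pure set
```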
    
### Gini function

    
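The Gini function (Gini impurity) is an alternative impurity measure. For the binary case it is commonly defined as:

$$G(p_{1}) = 1 - p_{1}^{2} - p_{0}^{2} = 2p_{1}(1-p_{1})$$

Like entropy, it is 0 for a pure set and largest at $p_{1} = 0.5$, where it equals 0.5.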
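
A minimal Python sketch of the binary Gini impurity, assuming the common definition $1 - p_{1}^{2} - p_{0}^{2}$. The function name `gini` is hypothetical and chosen only for this illustration.

```python
def gini(p1: float) -> float:
    """Binary Gini impurity: 1 - p1^2 - p0^2, which equals 2 * p1 * (1 - p1)."""
    p0 = 1.0 - p1
    return 1.0 - p1 ** 2 - p0 ** 2

print(gini(0.5))  # 0.5, the most impure case
print(gini(1.0))  # 0.0, a pure set
```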


<a name='1.3'></a>
## 1.3 - Information Gain

When building a decision tree, we choose the feature to split on at each node by asking which feature reduces entropy the most (equivalently, which one reduces impurity or maximizes purity the most). In decision tree learning, this reduction in entropy is called information gain.

<img src="images/information_gain.jpg" style="width:200px;height:200px;">
<caption><center><font><b>Figure 2</b>: Information Gain</font></center></caption>
    
    
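General formula for computing information gain, where $w^{left}$ and $w^{right}$ are the fractions of the node's examples that go to the left and right branches: $H(p_{1}^{root}) - (w^{left} H(p_{1}^{left}) + w^{right} H(p_{1}^{right}))$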
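
The information gain formula can be sketched in Python as follows. This is an illustration under stated assumptions: labels are lists of 0/1 values (1 = cat), the names `entropy`, `fraction_positive`, and `information_gain` are hypothetical, and the example counts are made up rather than taken from the figures.

```python
import math

def entropy(p1):
    """Binary entropy in bits, with the convention 0 * log2(0) = 0."""
    if p1 <= 0.0 or p1 >= 1.0:
        return 0.0
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

def fraction_positive(labels):
    """Fraction of labels equal to 1; an empty branch is treated as pure."""
    return sum(labels) / len(labels) if labels else 0.0

def information_gain(root, left, right):
    """H(root) minus the weighted average entropy of the two branches."""
    w_left = len(left) / len(root)
    w_right = len(right) / len(root)
    weighted = (w_left * entropy(fraction_positive(left))
                + w_right * entropy(fraction_positive(right)))
    return entropy(fraction_positive(root)) - weighted

# Hypothetical split: 5 cats and 5 dogs at the root,
# 4 cats + 1 dog go left, 1 cat + 4 dogs go right.
left = [1, 1, 1, 1, 0]
right = [1, 0, 0, 0, 0]
gain = information_gain(left + right, left, right)
print(round(gain, 3))  # 0.278
```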
    
<img src="images/information_gain_example.jpg" style="width:200px;height:200px;">
<caption><center><font><b>Figure 3</b>: Information Gain Example</font></center></caption>

<a name='1.4'></a>
## 1.4 - Decision Tree Learning

1. Start with all examples at the root node
2. Calculate information gain for all possible features, and pick the one with the highest information gain
3. Split dataset according to selected feature, and create left and right branches of the tree
4. Keep repeating the splitting process until a stopping criterion is met:
 - When a node is 100% a single class
 - When splitting a node would result in the tree exceeding a maximum depth
 - When the information gain from additional splits is less than a threshold
 - When the number of examples in a node is below a threshold
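
The steps above can be sketched as a short recursive function. This is a minimal illustration, not a production implementation: it assumes binary (0/1) features and labels, represents the tree as nested dictionaries, and uses hypothetical names and defaults (`build_tree`, `max_depth`, `min_gain`, `min_samples`). The toy dataset is invented for the example.

```python
import math

def entropy(labels):
    """Binary entropy (bits) of a list of 0/1 labels; pure or empty sets give 0."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    if p1 == 0.0 or p1 == 1.0:
        return 0.0
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

def split(X, indices, feature):
    """Step 3: examples with feature value 1 go left, value 0 go right."""
    left = [i for i in indices if X[i][feature] == 1]
    right = [i for i in indices if X[i][feature] == 0]
    return left, right

def information_gain(X, y, indices, feature):
    """Entropy at the node minus the weighted entropy of its two branches."""
    left, right = split(X, indices, feature)
    w_left, w_right = len(left) / len(indices), len(right) / len(indices)
    return (entropy([y[i] for i in indices])
            - w_left * entropy([y[i] for i in left])
            - w_right * entropy([y[i] for i in right]))

def build_tree(X, y, indices, depth=0, max_depth=3, min_gain=1e-9, min_samples=2):
    labels = [y[i] for i in indices]
    majority = int(2 * sum(labels) >= len(labels))  # ties predict class 1
    # Stopping criteria: pure node, maximum depth reached, or too few examples.
    if entropy(labels) == 0.0 or depth >= max_depth or len(indices) < min_samples:
        return {"leaf": majority}
    # Step 2: pick the feature with the highest information gain.
    gains = [information_gain(X, y, indices, f) for f in range(len(X[0]))]
    best = max(range(len(gains)), key=gains.__getitem__)
    # Stopping criterion: the best split gains less than the threshold.
    if gains[best] < min_gain:
        return {"leaf": majority}
    left, right = split(X, indices, best)
    return {"feature": best,
            "left": build_tree(X, y, left, depth + 1, max_depth, min_gain, min_samples),
            "right": build_tree(X, y, right, depth + 1, max_depth, min_gain, min_samples)}

def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's class."""
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["feature"]] == 1 else tree["right"]
    return tree["leaf"]

# Invented toy data: three binary features per animal, label 1 = cat, 0 = dog.
X = [[1, 1, 1], [0, 0, 1], [0, 1, 0], [1, 0, 1], [0, 1, 0], [1, 0, 0]]
y = [1, 1, 0, 1, 0, 0]
tree = build_tree(X, y, list(range(len(X))))  # Step 1: all examples at the root
print(tree)
```

In this toy dataset the third feature separates the classes perfectly, so the tree stops after a single split because both child nodes are pure.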

<a name='2'></a>
# 2 - Example

Let's go through some examples for different datasets of cats and dogs:
    
<img src="images/entropy_examples.jpg" style="width:200px;height:200px;">
<caption><center><font><b>Figure 4</b>: Entropy examples</font></center></caption>
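
As a worked illustration of the entropy formula, consider three hypothetical sets of six animals. The counts are assumptions chosen for illustration and are not read from the figure.

$$p_{1} = \tfrac{3}{6}: \quad H(0.5) = -0.5\log_{2}(0.5) - 0.5\log_{2}(0.5) = 1$$

$$p_{1} = \tfrac{5}{6}: \quad H\left(\tfrac{5}{6}\right) = -\tfrac{5}{6}\log_{2}\left(\tfrac{5}{6}\right) - \tfrac{1}{6}\log_{2}\left(\tfrac{1}{6}\right) \approx 0.65$$

$$p_{1} = \tfrac{6}{6}: \quad H(1) = 0$$

The evenly mixed set is the most impure, the mostly-cat set is less impure, and the all-cat set is pure.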