# Decision tree
A decision tree is a hierarchical tree structure used to make predictions. It consists of a root node, branches, internal (decision) nodes, and leaf nodes.

Decision nodes: evaluate a feature and split the data based on its value

Leaf nodes: hold the possible outputs (predictions) of the model

<img src="https://i0.wp.com/why-change.com/wp-content/uploads/2021/11/Decision-Tree-elements-2.png?resize=715%2C450&ssl=1">

## Measuring purity
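Entropy ($H(p)$): a measure of the impurity of a set of data. For binary classification it ranges from 0 (all examples belong to one class) to 1 (an even mix of both classes). Higher entropy means the data set is less homogeneous

$p_n$: the fraction of examples in the set that belong to the $n$th class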

For binary classification, with $p_1$ the fraction of examples in class 1,
$$H(p_1) = -p_0\log_2(p_0)-p_1\log_2(p_1)$$
where
$$p_0 = 1 - p_1$$

<img src="https://miro.medium.com/v2/resize:fit:1400/1*pLl6EiI4KRyf3ClAgFrqHA.png" width=500>
Note: by convention, $0\log_2(0) = 0$ when computing entropy
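As an illustration, the entropy of a set can be computed directly from its class labels. The sketch below generalizes the binary formula to any number of classes by summing $p_n\log_2(1/p_n)$ over the classes present; the function name `entropy_from_labels` is an assumption made for this example. With more than two classes the value can exceed 1 (for example, 2 for four equally likely classes).

```python
import math
from collections import Counter

def entropy_from_labels(labels):
    """Entropy (base 2) of a list of class labels; an empty list gives 0."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    # Only classes that actually occur are counted, so the 0 * log2(0) case never arises.
    return sum((c / total) * math.log2(total / c) for c in counts.values())

print(entropy_from_labels([1, 1, 0, 0]))          # evenly mixed, two classes -> 1.0
print(entropy_from_labels([1, 1, 1, 1]))          # pure set -> 0.0
print(entropy_from_labels(["a", "b", "c", "d"]))  # four equally likely classes -> 2.0
```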


## Training
Decisions to make when training a tree:
* Which feature to split on at each node to maximize purity (ideally leaving only one class in each child node)
* When to stop splitting (based on purity, tree depth, number of training examples in a node, etc.)

## Information gain
Information gain measures the reduction in entropy produced by a split. At each node, we choose the split that reduces entropy the most (the highest information gain), which gives the purest child nodes


$$\text{Information Gain} = H(p_1^\text{node})- \left(w^{\text{left}}H\left(p_1^\text{left}\right) + w^{\text{right}}H\left(p_1^\text{right}\right)\right),$$

$H(p_1^\text{node})$: the entropy at the node being split (the parent node)

$w^{\text{left}}, w^{\text{right}}$: the fraction of the parent's training examples that go to the left and right child nodes
$$w = \frac{\text{number of training examples in the child node}}{\text{number of training examples in the parent node}}$$

$H\left(p_1^\text{left}\right), H\left(p_1^\text{right}\right)$: the entropy at the left and right child nodes


## Stopping criteria
We can stop splitting when
* A node is 100% pure
* The maximum depth of the tree is reached
* The information gain from a split is too small (no significant improvement to the model)
* The number of examples in a node is below a threshold
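The four criteria above can be combined into a single check. This is a minimal sketch: the function name `should_stop` and the default values for `max_depth`, `min_gain`, and `min_examples` are assumptions chosen for illustration, not recommended settings.

```python
def should_stop(labels, depth, best_gain, max_depth=3, min_gain=0.01, min_examples=2):
    """Return True if any stopping criterion applies to the current node."""
    is_pure = len(set(labels)) <= 1          # node is 100% pure (or empty)
    return (
        is_pure
        or depth >= max_depth                # maximum depth reached
        or best_gain < min_gain              # best split barely helps
        or len(labels) < min_examples        # too few examples left
    )

print(should_stop([1, 1, 1], depth=0, best_gain=0.5))     # pure node -> True
print(should_stop([1, 0, 1, 0], depth=1, best_gain=0.3))  # no criterion met -> False
```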

## Building decision tree
Steps:
1. Start with all training examples at the root node
2. Calculate the information gain for each feature and pick the one with the highest information gain as the splitting feature
3. Split the training examples based on the chosen feature
4. Repeat steps 2 and 3 at each child node until a stopping criterion is met

This is a recursive algorithm: a large decision tree is built from smaller subtrees, and the stopping criteria serve as the base case
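The recursion can be sketched as follows for binary features and binary labels. This is one possible implementation under stated assumptions, not a definitive one: the nested-dict tree representation, the helper names, and the stopping rules used (pure node, `max_depth`, no useful split) are choices made for this example. The data is the same as the training set in the Code section.

```python
import math

def entropy(p):
    """Binary entropy of the fraction p of positive examples."""
    if p == 0 or p == 1:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def node_entropy(y, indices):
    """Entropy of the labels at a node (0 for an empty node)."""
    if not indices:
        return 0.0
    return entropy(sum(y[i] for i in indices) / len(indices))

def info_gain(X, y, indices, feature):
    """Information gain of splitting a node on a binary feature, plus the two children."""
    left = [i for i in indices if X[i][feature] == 1]
    right = [i for i in indices if X[i][feature] != 1]
    weighted = (len(left) * node_entropy(y, left)
                + len(right) * node_entropy(y, right)) / len(indices)
    return node_entropy(y, indices) - weighted, left, right

def build_tree(X, y, indices, depth=0, max_depth=2):
    """Recursively build a tree stored as nested dicts."""
    positives = sum(y[i] for i in indices)
    majority = 1 if 2 * positives >= len(indices) else 0
    # Base case: the node is pure or the maximum depth is reached.
    if positives in (0, len(indices)) or depth == max_depth:
        return {"leaf": majority}
    # Step 2: pick the feature with the highest information gain.
    best = max(range(len(X[0])), key=lambda f: info_gain(X, y, indices, f)[0])
    gain, left, right = info_gain(X, y, indices, best)
    # Base case: no split separates the examples.
    if gain <= 0 or not left or not right:
        return {"leaf": majority}
    # Steps 3 and 4: split, then repeat on each child.
    return {"feature": best,
            "left": build_tree(X, y, left, depth + 1, max_depth),
            "right": build_tree(X, y, right, depth + 1, max_depth)}

def predict(tree, x):
    """Follow the splits from the root down to a leaf."""
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["feature"]] == 1 else tree["right"]
    return tree["leaf"]

X = [[1, 1, 1], [0, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 1],
     [1, 1, 0], [0, 0, 0], [1, 1, 0], [0, 1, 0], [0, 1, 0]]
y = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]

tree = build_tree(X, y, list(range(len(X))))
print(tree["feature"])                 # the root splits on feature 0
print([predict(tree, x) for x in X])   # predictions on the training examples
```

Each recursive call works on a smaller set of example indices, so the recursion ends once a base case is reached.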

## One hot encoding
One hot encoding handles features that can take on more than two values. It converts a categorical (e.g. textual) feature into binary features, which can also be fed as inputs to other models

One hot encoding: if a feature can take on $k$ different values, we replace it with $k$ binary features (0 or 1), exactly one of which is 1 for each example
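A minimal sketch of the idea in plain Python. The feature name `ear_shape` and its values are hypothetical, and the columns are ordered alphabetically only to make the output predictable.

```python
def one_hot_encode(values):
    """Replace a list of categorical values with k binary columns, one per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

ear_shape = ["pointy", "floppy", "oval", "pointy"]  # hypothetical feature with k = 3 values
categories, encoded = one_hot_encode(ear_shape)
print(categories)  # ['floppy', 'oval', 'pointy']
print(encoded)     # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```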

## Continuous valued features
Some features take values from a continuous range. To split on such a feature, we need to pick a threshold. Sort the training examples by the feature's value, take the midpoint between each pair of adjacent values as a candidate threshold, and calculate the information gain of splitting there. Split at the candidate threshold with the highest information gain
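The threshold search can be sketched as below for binary labels. The function names and the example data are assumptions for illustration; examples with a value less than or equal to the threshold go left, the rest go right.

```python
import math

def label_entropy(labels):
    """Binary entropy of a non-empty list of 0/1 labels."""
    p = sum(labels) / len(labels)
    if p == 0 or p == 1:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def best_threshold(values, labels):
    """Return (threshold, information gain) for the best midpoint split."""
    n = len(values)
    parent = label_entropy(labels)
    candidates = sorted(set(values))
    best = (None, 0.0)
    # Try the midpoint between each pair of adjacent distinct values.
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        weighted = (len(left) * label_entropy(left)
                    + len(right) * label_entropy(right)) / n
        gain = parent - weighted
        if gain > best[1]:
            best = (threshold, gain)
    return best

weights = [8, 9, 10, 12, 15, 17]   # hypothetical continuous feature
labels = [1, 1, 1, 0, 0, 0]
threshold, gain = best_threshold(weights, labels)
print(threshold, gain)  # 11.0 1.0
```

Because each candidate lies strictly between two distinct values, both sides of every candidate split are non-empty.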

# Regression tree
Instead of predicting a class, a regression tree predicts a number as output.

Splitting: instead of aiming to reduce the entropy, each split aims to reduce the variance of the target values as much as possible
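Variance reduction plays the same role here that information gain plays for classification. A minimal sketch, assuming population variance (dividing by the number of examples) and made-up target values:

```python
def variance(values):
    """Population variance of a list of numbers (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(parent, left, right):
    """Parent variance minus the weighted variance of the two children."""
    w_left = len(left) / len(parent)
    w_right = len(right) / len(parent)
    return variance(parent) - (w_left * variance(left) + w_right * variance(right))

targets = [1.0, 2.0, 9.0, 10.0]  # hypothetical target values at a node
print(variance_reduction(targets, [1.0, 2.0], [9.0, 10.0]))  # 16.0
```

The split with the largest variance reduction is chosen, analogous to picking the highest information gain.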

# Code

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [6]:
# Create training data
X_train = np.array([[1, 1, 1],
                    [0, 0, 1],
                    [0, 1, 0],
                    [1, 0, 1],
                    [1, 1, 1],
                    [1, 1, 0],
                    [0, 0, 0],
                    [1, 1, 0],
                    [0, 1, 0],
                    [0, 1, 0]])

y_train = np.array([1, 1, 0, 0, 1, 1, 0, 1, 0, 0])

In [18]:
# Calculate binary entropy
def entropy(p):
    # a pure node (p == 0 or p == 1) has entropy 0; this also avoids log2(0)
    if p == 0 or p == 1:
        return 0

    e = - p * np.log2(p) - (1 - p) * np.log2(1 - p)
    return e

entropy(0.5)

1.0

In [19]:
# split training examples on a binary feature
def split(X, feature_index):
    left = []
    right = []

    for i, x in enumerate(X):  # enumerate yields (index, row) pairs
        if x[feature_index] == 1:
            left.append(i)  # examples are tracked by their index
        else:
            right.append(i)
    return left, right
          
split(X_train, 0)

([0, 3, 4, 5, 7], [1, 2, 6, 8, 9])

In [32]:
# calculate weighted entropy
def weighted_entropy(X, y, left, right):
    w_l = len(left) / len(X)
    w_r = len(right) / len(X)
    # fraction of class 1 on each side; an empty side has weight 0, so use p = 0 to avoid dividing by zero
    p_l = sum(y[left]) / len(left) if len(left) > 0 else 0
    p_r = sum(y[right]) / len(right) if len(right) > 0 else 0

    w_entropy = w_l * entropy(p_l) + w_r * entropy(p_r)
    return w_entropy

In [33]:
# Test
left_indices, right_indices = split(X_train, 0)
weighted_entropy(X_train, y_train, left_indices, right_indices)

0.7219280948873623

In [34]:
def info_gain(X, y, left, right):
    # entropy at the node before the split minus the weighted entropy after it
    entropy_reduction = entropy(sum(y)/len(y)) - weighted_entropy(X, y, left, right)

    return entropy_reduction

In [27]:
# Test
info_gain(X_train, y_train, left_indices, right_indices)

0.2780719051126377

# Tree ensembles
Issue: a single decision tree is highly sensitive to small changes in the data

Tree ensembles: multiple decision trees are used for prediction. Each tree makes its own prediction and votes on the final result
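The voting step can be sketched as a majority vote over the individual predictions. The function name `majority_vote` is an assumption for this example; with binary labels, using an odd number of trees avoids ties.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common prediction among the trees."""
    return Counter(predictions).most_common(1)[0][0]

tree_predictions = [1, 0, 1]  # hypothetical outputs of three trees for one example
print(majority_vote(tree_predictions))  # 1
```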

## Sampling with replacement

## Random forest algorithm
For a training set with $m$ training examples, use sampling with replacement to create $B$ new training sets, each with $m$ examples. Build a decision tree on each training set, then combine the $B$ trees into a tree ensemble
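Creating the $B$ training sets can be sketched as follows. The function name `bootstrap_sample`, the toy data, and the value of `B` are assumptions for illustration. Because examples are drawn with replacement, some appear several times in a sample and others not at all.

```python
import random

def bootstrap_sample(X, y, rng):
    """Draw len(X) examples with replacement, keeping features and labels aligned."""
    m = len(X)
    indices = [rng.randrange(m) for _ in range(m)]
    return [X[i] for i in indices], [y[i] for i in indices]

rng = random.Random(0)  # fixed seed so the run is repeatable
X = [[1, 1, 1], [0, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 0]]
y = [1, 1, 0, 0, 1]

B = 3  # number of trees in the ensemble
training_sets = [bootstrap_sample(X, y, rng) for _ in range(B)]
for X_b, y_b in training_sets:
    print(X_b, y_b)  # one tree would be trained on each of these sets
```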

During training, when choosing a feature to split on at each node, if $n$ features are available, pick a random subset of $k < n$ features (e.g. $k = \sqrt n$) and select the feature with the highest information gain from within that subset

Because the trees differ from one another, a change in any single training example has less impact on the ensemble's prediction
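The random feature subset used at each node can be sketched as below. The function name is an assumption, and $k$ is taken as $\sqrt n$ rounded down (at least 1), which is one common choice rather than a requirement.

```python
import math
import random

def random_feature_subset(n_features, rng):
    """Pick k = floor(sqrt(n)) distinct feature indices at random (at least one)."""
    k = max(1, int(math.sqrt(n_features)))
    return rng.sample(range(n_features), k)

rng = random.Random(0)  # fixed seed so the run is repeatable
subset = random_feature_subset(16, rng)
print(subset)  # 4 distinct indices between 0 and 15
```

At each node, information gain would then be evaluated only for the features in `subset`.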

## XGBoost
Boosting changes the sampling step: when drawing training examples with replacement for the next tree, instead of picking all examples with equal probability, we increase the probability of picking examples that the previous trees misclassified. This focuses each new tree on the examples the ensemble so far handles poorly
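The weighted sampling idea can be sketched as below. This only illustrates the intuition described above and is not how the XGBoost library works internally; the function name `boosted_sample` and the `boost` factor are hypothetical.

```python
import random

def boosted_sample(X, y, misclassified, rng, boost=3.0):
    """Draw len(X) examples with replacement, favoring misclassified ones."""
    # Misclassified examples get a larger sampling weight than the rest.
    weights = [boost if wrong else 1.0 for wrong in misclassified]
    indices = rng.choices(range(len(X)), weights=weights, k=len(X))
    return [X[i] for i in indices], [y[i] for i in indices]

rng = random.Random(0)  # fixed seed so the run is repeatable
X = [[i] for i in range(200)]
y = [i % 2 for i in range(200)]
misclassified = [i < 20 for i in range(200)]  # pretend the first 20 were misclassified

X_next, y_next = boosted_sample(X, y, misclassified, rng, boost=10.0)
hard_count = sum(1 for row in X_next if row[0] < 20)
print(hard_count)  # well above the 20 expected under uniform sampling
```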

XGBoost
* Is fast to implement
* Automatically chooses splitting and stopping criteria
* Uses regularization to prevent overfitting

# Decision tree vs Neural network
* Decision trees: perform well on structured data (spreadsheet format), but not on unstructured data (e.g. images, audio, text). They are also faster to train than neural networks
* Neural networks: perform well on all types of data, support transfer learning, and can be strung together into larger systems, but are generally slower to train