Decision Trees (Non-parametric Model)

1) Decision Tree Basics and Algorithm
![decision_trees](decision_trees.png)
* Tree Terminology:
    * Nodes
        * Root
        * Leaves
    * Edges
    * Trees are a recursive structure - we can pick any node, pretend it's the root, and treat it and everything below it as a *subtree*
* Decision Tree split logic (binary):
    * For a categorical variable, choose either value or not value (e.g. sunny or not sunny)
    * For a continuous variable, choose a threshold and split on $\leq$ or $>$ that value (e.g. temperature $\leq$ 75 or $>$ 75)
* Building Trees:
    * if every item in the dataset is in the same class or there is no feature left to split the data:
        * return a leaf node with the class label
    * else:
        * find the best feature and value to split the data
        * create a node
        * split the dataset using the best split from above
        * for each split:
            * call BuildTree() and add the result as a child of the node
        * return node
* Decision Tree Classification Splitting Steps:
    1. calculate the information gain for every possible split
    2. select the split that has the highest information gain
* Decision Tree Regression Splitting Steps:
![decision_tree_reg_1](dt_regression_1.png)
![decision_tree_reg_2](dt_regression_2.png)
    * responses are real values rather than classes, so we can't use cross-entropy or the Gini index
    * can average the values at each leaf node to predict a continuous value
    * can also fit a linear regression on the data in each leaf node instead of predicting a constant (a "model tree")
    * choose the best splits by minimizing the residual sum of squares (RSS) around the mean value of each resulting region
    * when to stop? best practice is to grow a very large tree and prune it back by minimizing a penalized RSS (cost-complexity pruning)
        * equation: $\sum_{m=1}^{|T|}\sum_{x_i \in R_m}(y_i-\hat{y}_{R_m})^2 + \alpha |T|$
            * let $|T| =$ # of terminal nodes of the subtree $T$, $R_m$ = the region of the $m$-th terminal node, and $\hat{y}_{R_m}$ = the mean response in $R_m$
            * then, for any penalty term $\alpha \geq 0$, there is a subtree $T$ which minimizes the equation above
            * cross-validate as usual to choose $\alpha$ and its corresponding tree
            * $\alpha$ plays the same role as $\lambda$ in Lasso/Ridge: larger values give smaller trees and lower variance
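
The BuildTree steps and the classification splitting rule above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes numeric features only, uses `<=` threshold splits, and the names `best_split`, `build_tree`, `predict` and the toy weather-style data are made up for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (gain, feature, threshold) with the highest information gain, or None."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # everything landed on one side: not a real split
            children = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            gain = entropy(labels) - children
            if best is None or gain > best[0]:
                best = (gain, feature, threshold)
    return best


def build_tree(rows, labels):
    """Recursive BuildTree: leaves are class labels, internal nodes are dicts."""
    if len(set(labels)) == 1:
        return labels[0]  # every item is in the same class
    split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # nothing left to split on
    _, feature, threshold = split
    node = {"feature": feature, "threshold": threshold}
    for side, keep in (("left", lambda v: v <= threshold), ("right", lambda v: v > threshold)):
        subset = [(row, y) for row, y in zip(rows, labels) if keep(row[feature])]
        node[side] = build_tree([row for row, _ in subset], [y for _, y in subset])
    return node


def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's class label."""
    while isinstance(tree, dict):
        side = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree


# Made-up toy data: (temperature, humidity) -> play?
rows = [(85, 85), (80, 90), (83, 78), (70, 96), (68, 80), (65, 70)]
labels = ["no", "no", "yes", "yes", "yes", "no"]
tree = build_tree(rows, labels)
```

Because nothing stops the recursion early, this unpruned tree fits its training data exactly, which is the overfitting behaviour that pruning is meant to control.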

2) Measure of how good the split is:
![measure_split](https://static.commonlounge.com/fp/original/Gy8ohommQMaImi6o8tOF_wHwyHdHBss5IsrY1507278630)
1. **Entropy/Cross-entropy** - measure of the amount of disorder in a set
![entropy](entropy.png)
    * equation: $H(X) = \sum_{i=1}^n P(x_i)I(x_i) = -\sum_{i=1}^n P(x_i)log_b P(x_i)$
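        * $P(x_i)$ - the proportion of the group that belongs to class $x_i$
        * $I(x_i) = -log_b P(x_i)$ - the information content of class $x_i$ (with $b = 2$, entropy is measured in bits)
    * low entropy: if a set has all the same labels, it is perfectly ordered (entropy of 0)
    * high entropy: if a set has an even mix of labels, it is maximally disordered (entropy of 1 for two classes with $b = 2$)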
    * numerically similar to Gini index
2. **Information Gain** - decides which feature to split on at each step in building the tree
    * equation: $I_G(X,a)=H(X)-H(X|a)$
        * $I_G(X,a)$ - Information Gain
        * $H(X)$ - Entropy (parent)
        * $H(X|a)$ - Weighted Average of Entropy (children), weighted by the fraction of samples in each child
    * objective: to create splits that minimize entropy on each side of the split
        * if a split falls on the boundary between classes, the children are purer and the split has more predictive power
    * compare entropy of parent and children nodes:
        * measure the entropy of the parent
        * measure the entropy of each child of the proposed split
        * maximize entropy(parent) - weighted_mean(entropy(children))
    * splitting examples:
        * Some Information Gain
            ![some_info](split_1.png)
            * Parent's entropy = $-(\frac{1}{2}log_2\frac{1}{2})-(\frac{1}{2}log_2\frac{1}{2})=1$
            * Left child's entropy = $-(\frac{2}{3}log_2\frac{2}{3})-(\frac{1}{3}log_2\frac{1}{3})=0.918$
            * Right child's entropy = $-(1log_21)=0$
            * Information gain from splitting: $X = 1 - (\frac{3}{4}* .918 + \frac{1}{4}* 0) = 0.311$
        * No Information Gain
            ![no_info](split_2.png)
            * Parent's entropy = $-(\frac{1}{2}log_2\frac{1}{2})-(\frac{1}{2}log_2\frac{1}{2})=1$
            * Left child's entropy = $-(\frac{1}{2}log_2\frac{1}{2})-(\frac{1}{2}log_2\frac{1}{2})=1$
            * Right child's entropy = $-(\frac{1}{2}log_2\frac{1}{2})-(\frac{1}{2}log_2\frac{1}{2})=1$
            * Information gain from splitting: $Z = 1 - (\frac{1}{2}* 1 + \frac{1}{2}* 1) = 0$
        * Max Information Gain (Perfect Split)
            ![max_info](split_3.png)
            * Parent's entropy = $-(\frac{1}{2}log_2\frac{1}{2})-(\frac{1}{2}log_2\frac{1}{2})=1$
            * Left child's entropy = $-(1log_21)=0$
            * Right child's entropy = $-(1log_21)=0$
            * Information gain from splitting: $Y = 1 - (\frac{1}{2}* 0 + \frac{1}{2}* 0) = 1$
3. **Gini Index/Impurity** - measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset
    * equation: $G(X) = \sum_{i=1}^n P(x_i)(1-P(x_i)) = 1 -\sum_{i=1}^n P(x_i)^2$
    * Steps for Gini Impurity:
        1. Take random element from the set
        2. Label it randomly according to the distribution of labels in the set
        3. What is the probability that it is labeled incorrectly?
    * small gini: if the class proportions are all close to 0 or 1
    * large gini: if the classes are more mixed, this takes on a larger value (maximum of 0.5 for two classes)
    * numerically similar to *cross-entropy*
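
The three measures above are short enough to write out directly. The sketch below recomputes the worked splitting examples; the class counts (a parent with two items of each class) are inferred from the fractions in those examples, and the function names are made up for illustration.

```python
import math
from collections import Counter


def class_proportions(labels):
    """P(x_i) for every class that appears in `labels`."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def entropy(labels):
    """H(X) = -sum P(x_i) * log2 P(x_i)."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def gini(labels):
    """G(X) = 1 - sum P(x_i)^2."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def information_gain(parent, children, impurity=entropy):
    """Parent impurity minus the size-weighted average impurity of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


parent = ["a", "a", "b", "b"]
some_gain = information_gain(parent, [["a", "a", "b"], ["b"]])
no_gain = information_gain(parent, [["a", "b"], ["a", "b"]])
max_gain = information_gain(parent, [["a", "a"], ["b", "b"]])
print(round(some_gain, 3), round(no_gain, 3), round(max_gain, 3))
# prints: 0.311 0.0 1.0
```

Passing `impurity=gini` to `information_gain` gives the Gini-based version of the same comparison.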

3) Optimizing Trees / Hyperparameters
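* **Pruning** - decreases variance of the model by limiting the size of the tree, either by stopping the decision tree algorithm early (pre-pruning) or by cutting back a fully grown tree (post-pruning)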
    * (-) Decision trees have high variance (Prone to overfitting)
    * **Pre-pruning**
        * **Leaf Size** - stop when there are only a few data points at a node
        * **Depth** - stop when a tree gets too deep
        * **Class Mix** - stop when some percent of the data points at a node are the same class
        * **Error Reduction/Threshold** - stop when the information gain of the best split is too small
    * **Post-pruning**
        * **cut off** - build the full tree, then cut off some leaves / merge two leaves up into their parent, thus creating a new leaf
    * Pruning pseudocode
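
One possible way to sketch the post-pruning step is reduced-error pruning: working bottom-up, a node is merged into a single leaf whenever that does not increase the error on held-out data. The notes do not say which criterion decides a merge, so the validation-error rule here is an assumption, as are the dict-based tree layout, the `majority` field (the majority training class stored at each node), and the toy data.

```python
def predict(tree, row):
    """Walk from the root to a leaf; leaves are plain class labels."""
    while isinstance(tree, dict):
        side = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree


def errors(tree, rows, labels):
    """Number of misclassified rows."""
    return sum(predict(tree, row) != y for row, y in zip(rows, labels))


def prune(tree, rows, labels):
    """Return a pruned copy of `tree` using held-out (rows, labels)."""
    if not isinstance(tree, dict):
        return tree  # already a leaf
    f, t = tree["feature"], tree["threshold"]
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    pruned = dict(tree)
    # Prune the children first, each with the validation rows that reach it.
    pruned["left"] = prune(tree["left"], [r for r, _ in left], [y for _, y in left])
    pruned["right"] = prune(tree["right"], [r for r, _ in right], [y for _, y in right])
    # Merge this node into one leaf if that does not hurt validation error.
    leaf = tree["majority"]
    if errors(leaf, rows, labels) <= errors(pruned, rows, labels):
        return leaf
    return pruned


# Hypothetical fully grown tree: features are (temperature, humidity).
full_tree = {
    "feature": 0, "threshold": 75, "majority": "yes",
    "left": "yes",
    "right": {
        "feature": 1, "threshold": 80, "majority": "no",
        "left": "yes",
        "right": "no",
    },
}
val_rows = [(70, 90), (72, 60), (80, 70), (85, 95), (90, 75)]
val_labels = ["yes", "yes", "no", "no", "no"]
pruned_tree = prune(full_tree, val_rows, val_labels)
```

In this example the humidity split under the right branch misclassifies two validation rows, so it is merged into a single "no" leaf, while the root split is kept. Note that a node reached by no validation rows is also merged, since both error counts are zero.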

4) Advantages/Disadvantages and Intuitions
* Advantages:
    * (+) easily interpretable
    * (+) can model complex, non-linear phenomena
        * no assumption that the structure of the model is fixed
        * feature interactions
    * (+) computationally cheap to predict
    * (+) can handle irrelevant features, missing values, and outliers well
    * (+) very extensible (e.g. ties to bagging, random forest, boosting)
* Disadvantages:
    * (-) computationally expensive to train
    * (-) greedy algorithm (gets stuck in local optima)
    * (-) very high variance (super easy to overfit)
* Intuitions for Decision Trees:
    * Always prune to avoid overfitting
    * Most likely will extend into using bagging, random forests, and boosting
    * Gini or cross-entropy are both fine and not that different
    * Sometimes fully splitting categorical features is preferred, but generally binary splits are fine
    * Sklearn library for Decision Trees:
        * (-) Historically did not support missing values (native NaN handling was added to the tree estimators in version 1.3)
        * Gini index is default, entropy is an option
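        * Pre-pruning is well supported (max_depth, min_samples_split, min_samples_leaf, max_leaf_nodes); cost-complexity post-pruning is available via ccp_alpha
        * Does binary splits on numeric inputs (you need to encode categorical variables numerically first, e.g. one-hot)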
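
To make the "fully splitting vs. binary splits" intuition concrete, the sketch below compares the information gain of splitting a categorical feature into one branch per value against the best "value or not value" binary split. The function names and the tiny outlook/play dataset are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())


def gain(parent, groups):
    """Information gain of splitting `parent` into the given label groups."""
    weighted = sum(len(g) / len(parent) * entropy(g) for g in groups)
    return entropy(parent) - weighted


def full_split_gain(values, labels):
    """One child per distinct category value."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    return gain(labels, list(groups.values()))


def best_binary_split(values, labels):
    """Best 'value vs. not value' split; returns (value, gain) or None."""
    best = None
    for v in sorted(set(values)):
        match = [y for x, y in zip(values, labels) if x == v]
        rest = [y for x, y in zip(values, labels) if x != v]
        if not rest:
            continue  # only one category present: nothing to split
        g = gain(labels, [match, rest])
        if best is None or g > best[1]:
            best = (v, g)
    return best


# Made-up toy data.
outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]
play = ["no", "no", "yes", "yes", "no", "yes"]
full_gain = full_split_gain(outlook, play)
binary_value, binary_gain = best_binary_split(outlook, play)
```

Here the full three-way split gains about 0.667 bits while the best single binary split gains about 0.459; a tree limited to binary splits can recover the difference with a second split further down.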

5) Decision Tree Variants
1. **Iterative Dichotomiser 3 (ID3)**
    * designed for **only** categorical features
    * splits categorical features completely
    * uses entropy and information gain to pick the best split
2. **Classification and Regression Tree (CART)**
    * handles both categorical and continuous data
    * always uses binary splits
    * uses gini impurity to pick the best split for classification (and squared error / RSS for regression)
3. **C4.5**
    * successor to ID3 that also handles continuous data
    * implements pruning to reduce overfitting
4. **C5.0**
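
For the regression half of CART, the split search and the penalized objective from section 1 can be sketched as follows. This is an illustration under simplifying assumptions: a single numeric feature, a tree of at most two leaves, and made-up data and function names.

```python
def rss(values):
    """Residual sum of squares of `values` around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_rss_split(xs, ys):
    """Return (threshold, total_rss) for the `x <= threshold` split with the lowest RSS."""
    best = None
    # Skip the largest x: splitting there would leave the right side empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        total = rss(left) + rss(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


def penalized_cost(leaves, alpha):
    """RSS summed over the leaves plus alpha times the number of leaves |T|."""
    return sum(rss(leaf) for leaf in leaves) + alpha * len(leaves)


# Made-up toy data with an obvious jump between x = 3 and x = 10.
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
threshold, split_rss = best_rss_split(xs, ys)
left = [y for x, y in zip(xs, ys) if x <= threshold]
right = [y for x, y in zip(xs, ys) if x > threshold]

for alpha in (10, 400):
    two_leaves = penalized_cost([left, right], alpha)
    one_leaf = penalized_cost([ys], alpha)
    print(alpha, "keep split" if two_leaves < one_leaf else "prune to one leaf")
# prints:
# 10 keep split
# 400 prune to one leaf
```

Each leaf predicts the mean of its responses (6.0 and 21.0 here). The loop shows the role of $\alpha$: a small penalty keeps the split, while a large enough penalty makes the single-leaf tree cheaper.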