# Decision Tree

## Summary

Decision trees are a class of supervised ML algorithms used for both classification and regression. There are several variants of decision trees, of which CART (classification and regression trees) is the most popular. Other variants include ID3 and C4.5.

Advantages - Easy to interpret (you can almost think of a tree as a bunch of if/else statements)

Disadvantages - Prone to overfitting, for which bagging/boosting methods are a popular solution
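To make the if/else picture concrete, here is a minimal Python sketch of what a small classification tree amounts to at prediction time. The feature names (`x1`, `x2`), the thresholds and the class labels are made up for illustration and do not come from any trained model.

```python
def predict(sample):
    """Hand-written stand-in for a tiny trained tree (all values hypothetical)."""
    if sample["x1"] < 5:        # root node: test on feature x1
        if sample["x2"] < 2:    # internal node: test on feature x2
            return "C1"         # leaf
        return "C2"             # leaf
    return "C2"                 # leaf


prediction = predict({"x1": 3, "x2": 1})
```

Training a tree is then the problem of choosing which tests to put at which nodes.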



## How does it work ? Training

Think of it as a tree-based algorithm where, at every node, a decision is made using a feature.

During training, the model uses Gini gain or entropy gain to decide which feature to split on at every single node.

CART, for example, uses Gini gain, whereas ID3 and C4.5 use entropy gain.


The goal is to keep each split as "pure" as possible. For example, in theory, if there are 4 classes, we would like 4 leaf nodes, each capturing all the sample points of one class.


## Wait a minute, what are gini gain and entropy gain ?

Gini gain and entropy gain are two alternative cost functions used for decision trees. 


## Gini index

The formula for the Gini index is

$1 - \sum_{i}{p_{i}}^{2}$

or equivalently,

$\sum_{i}p_{i}(1-p_{i})$

(the two are equal because $\sum_{i}p_{i} = 1$), where the sum is across the estimated probabilities of all classes within a group.
The Gini indices of all groups are then combined, each weighted by the fraction of samples that falls in that group, to get the overall Gini index for the split.
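The two forms of the formula can be checked against each other with a short Python sketch. The function names `gini_index` and `gini_index_alt` are hypothetical helpers introduced here, and class probabilities are estimated as simple label frequencies within one group.

```python
from collections import Counter


def gini_index(labels):
    """Gini index of one group: 1 - sum_i p_i^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_index_alt(labels):
    """Equivalent form: sum_i p_i * (1 - p_i)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((count / n) * (1 - count / n) for count in Counter(labels).values())


group = ["C1", "C1", "C2", "C3"]
difference = abs(gini_index(group) - gini_index_alt(group))
```

Here `difference` should be zero up to floating point rounding.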

The Gini index is one estimate of how effective the split at a node was.
For example, suppose that, based on a condition learnt at a node, the data is split into two groups.

Say all points with feature X1 < 5 go to one node (group G1), and all with X1 >= 5 go to another node (group G2). Assume there are only two classes in the ground truth (GT), C1 and C2.

We get a perfect split if G1 and G2 contain completely different classes.

## How do we mathematically formulate this ?

If pC1G1 is the estimated fraction of points in group G1 that belong to class C1 (just the count of all points in G1 belonging to C1 divided by the total number of samples in G1) and pC2G1 is the estimated fraction of points in group G1 that belong to class C2, then

Gini index of group G1: GiniG1 = $1 - ({pC1G1}^{2} + {pC2G1}^{2})$

Similarly, Gini index of group G2: GiniG2 = $1 - ({pC1G2}^{2} + {pC2G2}^{2})$

Overall Gini index = $wG1 \cdot GiniG1 + wG2 \cdot GiniG2$, where wG1 and wG2 are the fractions of all samples at the node that fall in G1 and G2 (so wG1 + wG2 = 1)
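The X1 < 5 example can be sketched in Python as follows. The data points are invented for illustration, the helper names are hypothetical, and the sketch assumes the common convention of weighting each group's Gini index by its share of the samples when combining groups.

```python
from collections import Counter


def gini_index(labels):
    """Gini index of one group: 1 - sum_i p_i^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_on_threshold(points, threshold):
    """Send points with x1 < threshold to G1, the rest to G2."""
    g1 = [label for x1, label in points if x1 < threshold]
    g2 = [label for x1, label in points if x1 >= threshold]
    return g1, g2


def weighted_gini(groups):
    """Combine group Gini indices, weighting each by its share of samples."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini_index(g) for g in groups if g)


# Made-up (x1, class) pairs
points = [(1, "C1"), (2, "C1"), (4, "C1"), (6, "C2"), (7, "C2"), (9, "C1")]
g1, g2 = split_on_threshold(points, 5)
gini_g1 = gini_index(g1)            # G1 holds only C1, so it is pure
gini_g2 = gini_index(g2)            # G2 holds two C2 and one C1
overall = weighted_gini([g1, g2])
```

In this toy data G1 is pure while G2 is mixed, so the overall value lies between 0 and the two-class maximum of 0.5.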

## So what happens if split is completely perfect or completely imperfect ?

For a completely perfect split, pC1G1 = 1, pC2G2 = 1, pC1G2 = 0, pC2G1 = 0

Therefore, GiniG1 = 1 - (1 + 0) = 0 and, likewise, GiniG2 = 0

Therefore, the Gini index for a perfect split is 0


For a completely imperfect split, all 4 probabilities will be 0.5

GiniG1 = $1 - ({0.5}^{2} + {0.5}^{2})$ = 1 - 0.5 = 0.5
Similarly, GiniG2 = 0.5

Weighting each group by its share of the samples (the weights sum to 1), the overall Gini index is 0.5, the maximum for two classes

This extends to the multiclass situation and to more groups

If we have 2 groups and 4 classes,
a completely imperfect split will give each class a probability of 0.25 in each group

Therefore, GiniG1 = $1 - (4*{0.25}^{2})$ = 1 - 0.25 = 0.75
Similarly, GiniG2 = 0.75
Weighting each group by its share of the samples, the overall Gini index is 0.75
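The arithmetic for these cases is easy to check by evaluating $1 - \sum_{i}{p_{i}}^{2}$ directly on a vector of class probabilities. The helper `gini_from_probs` below is a hypothetical name used only for this check.

```python
def gini_from_probs(probs):
    """Gini index of one group given its class probabilities."""
    return 1.0 - sum(p * p for p in probs)


perfect = gini_from_probs([1.0, 0.0])            # one class only
two_class_worst = gini_from_probs([0.5, 0.5])    # two equally likely classes
four_class_worst = gini_from_probs([0.25] * 4)   # four equally likely classes
```

For k equally likely classes the value is 1 - 1/k, which gives 0.5 for two classes and 0.75 for four.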

## So what does this mean when used in training ?

We define Gini gain as the Gini index before splitting minus the Gini index after splitting

At each node, the feature whose split gives the lowest Gini index (or, equivalently, the highest Gini gain) is chosen; the first such choice becomes the root node
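A minimal sketch of feature selection by Gini gain is shown below. It assumes categorical features (one group per feature value), weights each group by its share of the samples, and uses an invented toy data set; `gini_gain` is a hypothetical helper, not a library function.

```python
from collections import Counter, defaultdict


def gini_index(labels):
    """Gini index of one group: 1 - sum_i p_i^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_gain(rows, labels, feature):
    """Gini index before the split minus the weighted Gini index after it."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[feature]].append(label)
    n = len(labels)
    after = sum(len(g) / n * gini_index(g) for g in groups.values())
    return gini_index(labels) - after


# Made-up data: "outlook" separates the classes, "windy" does not
rows = [
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
]
labels = ["play", "play", "stay", "stay"]

gains = {f: gini_gain(rows, labels, f) for f in ["outlook", "windy"]}
best_feature = max(gains, key=gains.get)
```

On this data `outlook` yields pure groups and is selected, while `windy` leaves both groups as mixed as the parent.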


## Ok. Features are selected based on highest gini gain, but how are thresholds selected for continuous features ? Are all possible thresholds tried out ?

TBA


## Information gain

An alternative to Gini gain

The formula for entropy, using the same notation as above, is the usual Shannon entropy

Entropy = $-\sum_{i}{p_{i}\log_{2}p_{i}}$

where the sum again is over all classes in the group

In a perfect split, entropy = 0, as every $p_{i}$ will be 1 or 0 (taking $0\log_{2}0 = 0$)
For a non-perfect split, entropy will be a positive number

Similar to Gini gain, in training we compute the entropy before splitting and the weighted entropy across all groups after splitting, and subtract to get the information gain

At each step, we split on the feature that maximizes the information gain (entropy gain)
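Entropy and information gain can be sketched in Python in the same way as the Gini quantities. The helper names are hypothetical, probabilities are estimated as label frequencies, and each group is weighted by its share of the samples.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of the class labels in one group."""
    n = len(labels)
    if n == 0:
        return 0.0
    # Only classes that occur are counted, so log2(0) never arises
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


def information_gain(parent, groups):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(parent)
    after = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(parent) - after


parent = ["C1", "C1", "C2", "C2"]
gain_perfect = information_gain(parent, [["C1", "C1"], ["C2", "C2"]])
gain_useless = information_gain(parent, [["C1", "C2"], ["C1", "C2"]])
```

A split into pure groups recovers the full parent entropy of 1 bit, while a split that leaves both groups as mixed as the parent gains nothing.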

## Nice properties of information gain

1) Information gain is always non-negative (which means that if a decision tree uses entropy, a split is guaranteed not to make things worse at any step, at least on the training data)

How ?

Assume the split is done using feature/input variable Xi

Information gain = entropy(before split) - entropy(after split)

= entropy($p_{Y}(D)$) - $\sum_{j}(\text{fraction of points in group } j \text{ after splitting by } Xi) \cdot \text{entropy}(\text{group } j)$

Here entropy($p_{Y}(D)$) simply means that the entropy is a function of the distribution of label Y in the training data D, so

Information gain = entropy($p_{Y}(D)$) - $\sum_{j}(\text{fraction of points in group } j) \cdot \text{entropy}(p_{Y}(\text{subset of } D \text{ belonging to group } j))$

which can be equivalently written in terms of conditional entropy as

Information gain = entropy($p_{Y}(D)$) - entropy($p_{Y}(D) \mid p_{Xi}(D)$) [see here][5]

Term 2 is the entropy after the split, which is a function of the probability distribution of Y over D given that D is split according to split variable Xi. Xi can take m possible values, giving rise to m possible groups after splitting (these m groups are represented by index j in the equation above)

The entropy before the split is the entropy of the distribution of Y over training data D;
the entropy after the split is the entropy of the distribution of Y over training data D, conditional on doing a split based on feature Xi

The second term is therefore a conditional entropy term, and the information gain is the mutual information between Y and Xi, which can be written as a relative entropy (KL divergence) and so cannot be negative


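Writing this as
information gain = Entropy(Q) - Entropy(Q|P)
where Q is the distribution of the label before the split, and Q|P is its distribution after the split on feature Xi (whose distribution is P),

Entropy(Q|P) = $\sum_{i}p_{i}\,\text{Entropy}(Q \mid P = i)$

where $p_{i}$ is the fraction of points that falls in group i of the split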


Expanding this further, assume P and Q are two features whose distributions have n and m elements respectively, and we split jointly on both P and Q
This creates a table T, where the count in cell i,j (Tij) is the number of records that survive both split Pi and split Qj. Let pij = Tij/|D|

Entropy(Q|P) is the entropy of a record surviving split Q conditional on it already having survived split P

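How is the conditional entropy Entropy(Q|P) defined ?
Given a split P first, and then a split Q on top of P, with row totals $p_{i} = \sum_{j}p_{ij}$ and column totals $q_{j} = \sum_{i}p_{ij}$,

Entropy(Q|P) = $-\sum_{i}\sum_{j}p_{ij}\log_{2}\frac{p_{ij}}{p_{i}}$

= $-\sum_{j}q_{j}\log_{2}q_{j}$ + $\sum_{i}\sum_{j}p_{ij}\log_{2}\frac{p_{i}q_{j}}{p_{ij}}$

The first term is Entropy(Q). By Jensen's inequality (log is concave), the second term is at most $\log_{2}\sum_{i}\sum_{j}p_{i}q_{j} = \log_{2}1 = 0$, so Entropy(Q|P) <= Entropy(Q) and the information gain is non-negative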
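The non-negativity property can also be checked numerically. The sketch below draws random labels and random group assignments (both invented purely for the check) and records the smallest information gain seen; the helper names are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of the class labels in one group."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


def information_gain(labels, group_ids):
    """Entropy before the split minus the weighted entropy after it."""
    groups = {}
    for label, gid in zip(labels, group_ids):
        groups.setdefault(gid, []).append(label)
    n = len(labels)
    after = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - after


rng = random.Random(0)  # fixed seed so the check is repeatable
gains = []
for _ in range(1000):
    n = rng.randint(1, 30)
    labels = [rng.choice("abc") for _ in range(n)]       # 3 classes
    group_ids = [rng.randrange(4) for _ in range(n)]     # up to 4 groups
    gains.append(information_gain(labels, group_ids))

min_gain = min(gains)
```

Up to floating point rounding, `min_gain` should never fall below zero, even though the group assignments carry no real information about the labels.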




## Ok. Features are selected based on highest entropy gain, but how are thresholds selected for continuous features ? Are all possible thresholds tried out ?

TBA

## Gini gain vs Entropy gain

Some decision tree variants, such as CART (classification and regression trees), use Gini gain. Other variants, such as ID3 and C4.5, use entropy gain

1) Pros of Gini gain - entropy gain needs a log computation, which is computationally more expensive than Gini gain
2) Pros of entropy gain - symmetric
3) Practically, Gini gain favors larger partitions, entropy gain smaller partitions
4) Entropy has theoretically better underpinnings - it is non-negative (as is the Gini index), and the resulting information gain is symmetric if you switch the target variable and the split variable

## general properties of impurity functions

Both entropy and the Gini index are what we call impurity functions, and we've seen that for both, in the best case (complete separation of classes into groups), they attain their lowest value of zero

In the worst case, we have a uniform distribution of classes within the groups, which gives the maximum value of the Gini index/entropy

This is what we want from any impurity function - a non-negative value, which is 0 for a perfect split and is maximal for the worst-case split of a uniform distribution
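These properties can be illustrated with a small Python check that evaluates both impurity functions on a pure distribution, a uniform distribution, and many random distributions. The choice of 4 classes and all helper names are assumptions made for the illustration.

```python
import math
import random


def gini(probs):
    """Gini index from class probabilities."""
    return 1.0 - sum(p * p for p in probs)


def entropy(probs):
    """Shannon entropy (base 2) from class probabilities, with 0*log(0) = 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


def random_distribution(rng, k):
    """Random probability vector over k classes."""
    weights = [rng.random() for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


k = 4
pure = [1.0] + [0.0] * (k - 1)
uniform = [1.0 / k] * k

rng = random.Random(0)
samples = [random_distribution(rng, k) for _ in range(1000)]
max_gini_seen = max(gini(p) for p in samples)
max_entropy_seen = max(entropy(p) for p in samples)
```

Neither `max_gini_seen` nor `max_entropy_seen` should exceed the value at the uniform distribution, which is 1 - 1/k for the Gini index and $\log_{2}k$ for entropy.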

## Basic training pseudocode

The basic training process involves deciding the following aspects
1) Selection of attribute splits (splitting criteria)
2) Decision of when to stop splitting (stopping criteria)
3) Assignment of a label to each terminal node
4) Pruning the tree if necessary

## Pseudocode

1) Create root node R

2) If the stopping criteria are already met, label the root node with the most common label yi in data set D, and exit

3) If not, for each input feature Xi,
find the test T which partitions data D into D1, D2, ..., Dk in such a way that information gain/Gini gain is maximized, and keep the best test across all features

4) For each partition of the data, repeat steps 1 to 3
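The steps above can be sketched as a short recursive Python function. This is only an illustration under several assumptions: features are categorical (one child per feature value), Gini gain is the splitting criterion, the stopping criteria are a pure node, no remaining features, no positive gain, or a depth limit, and pruning is left out. All names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def gini_index(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def partition(rows, labels, feature):
    """Group rows and labels by the value of one categorical feature."""
    groups = defaultdict(lambda: ([], []))
    for row, label in zip(rows, labels):
        group_rows, group_labels = groups[row[feature]]
        group_rows.append(row)
        group_labels.append(label)
    return groups


def gini_gain(rows, labels, feature):
    n = len(labels)
    after = sum(len(gl) / n * gini_index(gl)
                for _, gl in partition(rows, labels, feature).values())
    return gini_index(labels) - after


def build_tree(rows, labels, features, max_depth=3):
    majority = Counter(labels).most_common(1)[0][0]
    # Step 2: stopping criteria, label the node with the most common class
    if len(set(labels)) == 1 or not features or max_depth == 0:
        return {"label": majority}
    # Step 3: pick the feature whose split maximizes Gini gain
    best = max(features, key=lambda f: gini_gain(rows, labels, f))
    if gini_gain(rows, labels, best) <= 0:
        return {"label": majority}
    remaining = [f for f in features if f != best]
    # Step 4: repeat on each partition of the data
    children = {
        value: build_tree(gr, gl, remaining, max_depth - 1)
        for value, (gr, gl) in partition(rows, labels, best).items()
    }
    return {"feature": best, "children": children, "default": majority}


def predict(tree, row):
    while "label" not in tree:
        child = tree["children"].get(row[tree["feature"]])
        if child is None:            # feature value unseen in training
            return tree["default"]
        tree = child
    return tree["label"]


# Made-up data: "play" only when it is sunny and not windy
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "stay", "stay", "stay"]
tree = build_tree(rows, labels, ["outlook", "windy"])
predictions = [predict(tree, row) for row in rows]
```

On this toy data the tree needs two levels of splits to reproduce the training labels, and unseen feature values fall back to the majority label stored at the node.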

## References

[1]: https://machinelearningmastery.com/implement-decision-tree-algorithm-scratch-python/
[2]: https://www.analyticssteps.com/blogs/what-gini-index-and-information-gain-decision-trees  
[3]: https://sites.math.washington.edu/~morrow/336_15/papers/lev.pdf 
[4]: https://en.wikipedia.org/wiki/Information_gain_in_decision_trees
[5]: https://machinelearningmastery.com/information-gain-and-mutual-information/




