# Decision Tree

## Summary

Decision trees are a class of supervised ML algorithms used for both classification and regression. There are several variants of decision trees, of which CART (classification and regression trees) is the most popular. Other variants are ID3 and C4.5.

Advantages - Easy to interpret (almost think of it as a bunch of if/else statements)
Disadvantages - Prone to overfitting, for which bagging/boosting methods are a popular remedy



## How does it work ? Training

Think of it as a tree-based algorithm where, at every node, a decision is made using one of the features.
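To make the "bunch of if/else" picture concrete, here is a hand-written tree of depth two. The feature names `weight` and `height` and both thresholds are made up for illustration; a trained tree would learn them from data.

```python
def predict_obese(weight, height):
    """A hand-written decision tree; each `if` is one internal node."""
    if weight < 100:        # root node: test on weight
        if height < 160:    # second-level node: test on height
            return 1        # leaf
        return 0            # leaf
    return 1                # leaf


print(predict_obese(weight=110, height=180))  # prints 1
print(predict_obese(weight=70, height=175))   # prints 0
```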

During training, the feature to use at every single node is decided using Gini gain or entropy gain.

CART, for example, uses Gini gain, whereas ID3 and C4.5 use entropy gain.


The goal is to keep each split as "pure" as possible to achieve classification. For example, in theory, if there are 4 classes, we would like 4 leaf nodes, each leaf node capturing all the sample points of that class


Note that this is a non-parametric model, so there is no gradient descent or any similar procedure to find the values of parameters

## Wait a minute, what are gini gain and entropy gain ?

Gini gain and entropy gain are two alternative cost functions used for decision trees. 


## Gini index

The formula for the Gini index is $1 - \sum_{i}{p_{i}}^{2}$ or, equivalently, $\sum_{i}p_{i}(1-p_{i})$

where the sum is over the estimated probabilities of all classes within a group.
The Gini indices of the groups are then combined, each weighted by the fraction of samples in its group, to get the overall Gini index for the split
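A minimal sketch of the per-group formula, checking that the two forms agree. The function names and the toy label list are illustrative and not part of the notebook code further down.

```python
from collections import Counter


def gini_index(labels):
    """Gini index 1 - sum_i p_i^2 of a list of class labels."""
    n = len(labels)
    probs = [count / n for count in Counter(labels).values()]
    return 1.0 - sum(p * p for p in probs)


def gini_index_alt(labels):
    """Equivalent form sum_i p_i * (1 - p_i)."""
    n = len(labels)
    probs = [count / n for count in Counter(labels).values()]
    return sum(p * (1.0 - p) for p in probs)


group = ["C1", "C1", "C1", "C2"]   # p_C1 = 0.75, p_C2 = 0.25
print(gini_index(group))       # 1 - (0.5625 + 0.0625) = 0.375
print(gini_index_alt(group))   # 0.75*0.25 + 0.25*0.75 = 0.375
```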

The Gini index gives one estimate of how effective the split at a node was.
For example, suppose the condition learnt at a node splits the data into two groups

Say all points with feature X1 < 5 go to one node (group G1), and all with X1 >= 5 go to another node (group G2). Assume the ground truth (GT) has only two classes, C1 and C2

We get a perfect split if G1 and G2 contain completely different classes

## How do we mathematically formulate this ?

If pC1G1 is the estimated fraction of points in group G1 belonging to class C1 (just the count of all points in G1 belonging to C1 divided by the total number of samples in G1) and pC2G1 is the estimated fraction of points in group G1 belonging to class C2, then

Gini index of group G1: GiniG1 = 1 - $({pC1G1}^{2} + {pC2G1}^{2})$

Similarly, Gini index of group G2: GiniG2 = 1 - $({pC1G2}^{2} + {pC2G2}^{2})$


Overall Gini index = $\frac{|G1|}{|D|}$ GiniG1 + $\frac{|G2|}{|D|}$ GiniG2

where |G1| and |G2| are the number of samples in each group and |D| = |G1| + |G2|. Each group is weighted by its share of the samples, so a tiny group cannot dominate the score.

## So what happens if split is completely perfect or completely imperfect ?

For a completely perfect split, pC1G1 = 1, pC2G2 = 1, pC1G2 = 0, pC2G1 = 0

Therefore, GiniG1 = 1 - (1 + 0) = 0
GiniG2 = 0

Therefore, the Gini index for a perfect split is 0


For a completely imperfect split, all 4 probabilities will be 0.5

GiniG1 = 1 - $({0.5}^{2} + {0.5}^{2})$ = 1 - 0.5 = 0.5
Similarly, GiniG2 = 0.5

With two equally sized groups, Gini index overall = 0.5 * 0.5 + 0.5 * 0.5 = 0.5

This extends to the multiclass situation and to more groups

If we have 2 groups and 4 classes,
a completely imperfect split will have a probability of 0.25 for every class in each group

Therefore, GiniG1 = 1 - $(4*{0.25}^{2})$ = 0.75
Similarly, GiniG2 = 0.75
With two equally sized groups, overall Gini index = 0.5 * 0.75 + 0.5 * 0.75 = 0.75
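The extreme cases can be checked numerically. This sketch assumes the split score weights each group by its share of the samples (as the `information_gain` code later in these notes does); `split_gini` and the toy groups are illustrative names.

```python
from collections import Counter


def gini_index(labels):
    """Gini index 1 - sum_i p_i^2 of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_gini(groups):
    """Gini index of a split: per-group Gini weighted by group size."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini_index(g) for g in groups)


perfect = [["C1"] * 4, ["C2"] * 4]
worst_two_class = [["C1", "C2"] * 2, ["C1", "C2"] * 2]
worst_four_class = [["C1", "C2", "C3", "C4"], ["C1", "C2", "C3", "C4"]]

print(split_gini(perfect))           # 0.0
print(split_gini(worst_two_class))   # 0.5
print(split_gini(worst_four_class))  # 0.75
```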

## So what does this mean when used in training ?

We define Gini gain as the Gini index before splitting minus the Gini index after splitting

The feature whose split gives the lowest Gini index (or, equivalently, the highest Gini gain) is used at the root node, and the same rule is applied again at every node below it
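As a rough illustration of picking a feature by Gini gain, the sketch below compares two hypothetical candidate splits of the same parent node. `feature_a` and `feature_b` are placeholder names, and the after-split index is assumed to be size-weighted.

```python
from collections import Counter


def gini_index(labels):
    """Gini index 1 - sum_i p_i^2 of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_gain(parent, groups):
    """Gini index before the split minus the size-weighted index after it."""
    after = sum(len(g) / len(parent) * gini_index(g) for g in groups)
    return gini_index(parent) - after


parent = ["C1"] * 4 + ["C2"] * 4
candidate_splits = {
    "feature_a": [["C1"] * 4, ["C2"] * 4],                              # pure groups
    "feature_b": [["C1", "C1", "C1", "C2"], ["C1", "C2", "C2", "C2"]],  # mixed groups
}
gains = {name: gini_gain(parent, groups) for name, groups in candidate_splits.items()}
print(gains)                      # {'feature_a': 0.5, 'feature_b': 0.125}
print(max(gains, key=gains.get))  # feature_a
```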


## Ok. Features are selected based on highest gini gain, but how are thresholds selected for continuous features ? Are all possible thresholds tried out ?

Yes kind of - look at implementation below
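A small standalone sketch of the exhaustive threshold search, assuming a `x < t` test and size-weighted Gini gain. `best_threshold` and the toy data are illustrative; the notebook implementation further down does the same thing with pandas.

```python
from collections import Counter


def gini_index(labels):
    """Gini index 1 - sum_i p_i^2 of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_threshold(values, labels):
    """Try every distinct value except the smallest as a `x < t` threshold."""
    parent = gini_index(labels)
    n = len(labels)
    best_t, best_gain = None, 0.0
    for t in sorted(set(values))[1:]:
        left = [y for x, y in zip(values, labels) if x < t]
        right = [y for x, y in zip(values, labels) if x >= t]
        after = len(left) / n * gini_index(left) + len(right) / n * gini_index(right)
        gain = parent - after
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain


x1 = [1, 2, 3, 6, 7, 8]
y = ["C1", "C1", "C1", "C2", "C2", "C2"]
print(best_threshold(x1, y))  # (6, 0.5)
```

Skipping the smallest value avoids a split whose left group is empty.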


## Information gain

An alternative to Gini gain

The formula for entropy, using the same notation as above, is the usual Shannon entropy formulation

Entropy = $-\sum_{i}p_{i}\log_{2}p_{i}$

where the sum again is over all classes in the group

In a perfect split, Entropy = 0, as p_i will be 1 or 0
For a non perfect split, entropy will be a positive number
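A minimal entropy sketch in plain Python. It uses the equivalent form p * log2(1/p), and classes with zero count never appear in the `Counter`, so there is no log(0) to guard against. The function name and toy groups are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits: sum_i p_i * log2(1 / p_i) = -sum_i p_i * log2(p_i)."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


print(entropy(["C1"] * 4))                # 0.0 (pure group)
print(entropy(["C1", "C2"] * 2))          # 1.0 (two classes, 50/50)
print(entropy(["C1", "C2", "C3", "C4"]))  # 2.0 (four classes, uniform)
```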

Similar to Gini gain, in training we compute the entropy across all groups before and after splitting, and subtract to get the information gain

We choose the feature to split on based on which feature maximizes the entropy gain at a given step
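The selection step can be sketched for categorical features as follows. The feature names and toy data are hypothetical, and each distinct feature value is assumed to form its own group.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy before the split minus size-weighted entropy of the groups."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    n = len(labels)
    after = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - after


labels = ["yes", "yes", "no", "no"]
features = {
    "feature_a": ["x", "x", "y", "y"],  # separates the classes perfectly
    "feature_b": ["x", "y", "x", "y"],  # tells us nothing about the label
}
gains = {name: information_gain(vals, labels) for name, vals in features.items()}
print(gains)                      # {'feature_a': 1.0, 'feature_b': 0.0}
print(max(gains, key=gains.get))  # feature_a
```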

## Nice properties of information gain

1) Information gain is always non-negative (which means that if a decision tree uses entropy, a split is guaranteed not to make things worse at any step, at least on the training data)

How ?

Assume the split is done using feature/input variable Xi

Information gain = entropy(before split) - entropy(after split)

= entropy($p_{Y}(D)$) - $\sum_{j}(\text{fraction of points in group } j \text{ after splitting by } X_{i}) \cdot \text{entropy}(\text{group } j)$

Here entropy($p_{Y}(D)$) simply means that entropy is a function of the distribution of label Y in the training data D

Information gain = entropy($p_{Y}(D)$) - $\sum_{j}(\text{fraction of points in group } j) \cdot \text{entropy}(p_{Y}(D_{j}))$

where $D_{j}$ is the subset of D that belongs to group j. This can be equivalently written in terms of conditional entropy as

Information gain = entropy($p_{Y}(D)$) - entropy($p_{Y}(D) \mid p_{X_{i}}(D)$) [see here][5]

Term 2 is the entropy after the split, which is a function of the probability distribution of Y over D, given that D is split according to split variable Xi. Xi can take m possible values, giving rise to m possible groups after splitting (these m groups are represented by index j in the equation above)

Entropy before the split is the entropy of the distribution of Y over training data D;
entropy after the split is the entropy of the distribution of Y over D conditional on splitting on feature $X_{i}$

The second term is a conditional entropy term. The difference of the two terms is the mutual information between Y and $X_{i}$, which is a relative entropy (KL divergence) and therefore non-negative


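Writing this as
information gain = Entropy(Q) - Entropy(Q|P)
where Q is the distribution of the label before the split, and Q|P is its distribution after splitting on feature Xi (whose distribution is P)

Expanding this further, assume P and Q take n and m distinct values respectively, and we split jointly on both P and Q
This creates a table T, where the count in cell i,j (Tij) represents the data which survives split Pi and split Qj. Let $p_{ij}$ = Tij/|D|, with marginals $p_{i} = \sum_{j}p_{ij}$ and $q_{j} = \sum_{i}p_{ij}$

How is conditional entropy Entropy(Q|P) defined ?
Given a split P first, and then a split Q on top of P, Entropy(Q|P) is the entropy of a record surviving split Q conditional on it already having survived split P

Therefore, Entropy(Q|P) = $-\sum_{i,j}p_{ij}\log_{2}\frac{p_{ij}}{p_{i}}$

Since Entropy(Q) = $-\sum_{j}q_{j}\log_{2}q_{j}$ = $-\sum_{i,j}p_{ij}\log_{2}q_{j}$, subtracting gives

information gain = $\sum_{i,j}p_{ij}\log_{2}\frac{p_{ij}}{p_{i}q_{j}}$

This is the relative entropy between the joint distribution and the product of its marginals, which is never negative and is zero only when P and Q are independent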
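The non-negativity argument can be sanity-checked numerically. The sketch below takes a contingency table of counts (rows are values of the split variable, columns are labels; the numbers are made up) and computes the information gain two ways: as Entropy(Q) - Entropy(Q|P), and as the relative entropy between the joint distribution and the product of its marginals. Both should agree and be non-negative.

```python
import math


def information_gain_two_ways(table):
    """table[i][j] = number of records with split value i and label j."""
    total = sum(sum(row) for row in table)
    p = [[c / total for c in row] for row in table]
    p_i = [sum(row) for row in p]          # marginal of the split variable (rows)
    q_j = [sum(col) for col in zip(*p)]    # marginal of the label (columns)

    h_q = sum(q * math.log2(1 / q) for q in q_j if q > 0)
    h_q_given_p = sum(
        pij * math.log2(p_i[i] / pij)
        for i, row in enumerate(p)
        for pij in row
        if pij > 0
    )
    kl = sum(
        pij * math.log2(pij / (p_i[i] * q_j[j]))
        for i, row in enumerate(p)
        for j, pij in enumerate(row)
        if pij > 0
    )
    return h_q - h_q_given_p, kl


gain, kl = information_gain_two_ways([[30, 10], [5, 55]])
print(gain, kl)  # the two values agree up to floating-point error

# Independent rows and columns: the gain is zero up to floating-point error.
print(information_gain_two_ways([[10, 10], [20, 20]]))
```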




## Ok. Features are selected based on highest entropy gain, but how are thresholds selected for continuous features ? Are all possible thresholds tried out ?

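Same procedure as for Gini gain: the distinct sorted values of the feature are tried as candidate thresholds, and the one with the highest gain is kept (see the implementation below)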

## Gini gain vs Entropy gain

Some decision tree variants such as CART (classification and regression trees) use Gini gain. Other variants such as ID3 and C4.5 use entropy gain

1) Pros of Gini gain - entropy gain needs a log computation, which is computationally more expensive
2) Pros of entropy gain - symmetric
3) Practically, Gini gain favors larger partitions, entropy gain smaller partitions
4) Entropy has theoretically better underpinnings - it is non-negative (as is the Gini index), and information gain is symmetric if you switch the target variable and split variable

## general properties of impurity functions

Both entropy and the Gini index are what we call impurity functions, and we've seen that for both, in the best case (complete separation of classes in groups), they attain their lowest value of zero

In the worst case, we have a uniform distribution of classes in groups, which gives the maximum value of gini index/entropy

This is something we want for any impurity function - a non negative value, which is 0 for a perfect split, and is maximum for the worst case split of a uniform distribution
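These properties can be eyeballed for the two-class case by sweeping the class probability p. This is only an illustrative sketch; here both functions take a probability vector rather than a list of labels.

```python
import math


def gini(probs):
    """Gini index of a probability vector."""
    return 1.0 - sum(p * p for p in probs)


def entropy(probs):
    """Shannon entropy in bits of a probability vector (zero entries skipped)."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)


# Both are 0 at p = 0 and p = 1 (pure group) and peak at p = 0.5 (uniform).
for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    dist = (p, 1.0 - p)
    print(f"p={p:.1f}  gini={gini(dist):.3f}  entropy={entropy(dist):.3f}")
```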

## Basic training pseudocode

The basic training process involves deciding the following aspects
1) Selection of attribute splits (splitting criteria)
2) Decision of when to stop splitting (stopping criteria)
3) Assignment of label to each terminal node
4) Pruning tree if necessary

## Pseudocode

1) Create root node R

2) If stopping criteria already reached, label root node with most common label yi in data set D, exit

3) If not, for each input feature Xi,
find tests T which partition data D into D1, D2..Dk in such a way that information gain/Gini gain is maximized, and keep the best feature and test

4) For each partition of the data, repeat 1 to 3
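The steps above can be sketched as a compact recursive function. This is a simplified stand-in, not the notebook implementation further down: it assumes numeric features only, binary `x < t` tests, Gini gain as the criterion, and `max_depth` plus node purity as the only stopping criteria. All names are illustrative.

```python
from collections import Counter


def gini_index(labels):
    """Gini index 1 - sum_i p_i^2 of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (gain, feature_index, threshold) of the best `x < t` test, or None."""
    n, parent, best = len(rows), gini_index(labels), None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows})[1:]:
            left = [y for r, y in zip(rows, labels) if r[f] < t]
            right = [y for r, y in zip(rows, labels) if r[f] >= t]
            after = (len(left) * gini_index(left) + len(right) * gini_index(right)) / n
            gain = parent - after
            if best is None or gain > best[0]:
                best = (gain, f, t)
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    # Step 2: stopping criteria reached -> leaf labelled with the most common class
    can_split = depth < max_depth and len(set(labels)) > 1
    split = best_split(rows, labels) if can_split else None
    if split is None or split[0] <= 0:
        return Counter(labels).most_common(1)[0][0]
    # Step 3: best test over all features
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] < t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] >= t]
    # Step 4: repeat on each partition
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }


rows = [(1, 10), (2, 20), (3, 10), (6, 20), (7, 10), (8, 20)]
labels = ["C1", "C1", "C1", "C2", "C2", "C2"]
print(build_tree(rows, labels))
# {'feature': 0, 'threshold': 6, 'left': 'C1', 'right': 'C2'}
```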

## Code from scratch

### data Preparation

In [2]:
import pandas as pd
import numpy as np

data = pd.read_csv("500_Person_Gender_Height_Weight_Index.csv")
data.head()

Unnamed: 0,Gender,Height,Weight,Index
0,Male,174,96,4
1,Male,189,87,2
2,Female,185,110,4
3,Female,195,104,3
4,Male,149,61,3


We want to predict whether a person is obese or not.
We say a person with weight index 4 or higher is obese. That is our label

In [4]:
data['obese'] = (data.Index >= 4).astype('int')
data.drop('Index', axis = 1, inplace = True)

Let's say we define a particular rule in the tree: a person with weight >= 100 kg is obese.
Then, if either of the two subtrees thus obtained contains only obese or only non-obese people, its impurity (either Gini or entropy) is 0





In [6]:
print(
  " Misclassified when cutting at 100kg:",
  data.loc[(data['Weight']>=100) & (data['obese']==0),:].shape[0], "\n",
  "Misclassified when cutting at 80kg:",
  data.loc[(data['Weight']>=80) & (data['obese']==0),:].shape[0]
)

## This means both the splitting at 100 and splitting at 80 results in impure partitions


 Misclassified when cutting at 100kg: 18 
 Misclassified when cutting at 80kg: 63


Let's now calculate impurity using gini index and entropy which are two alternative methods

In [7]:
def gini_impurity(y):
  '''
  Given a Pandas Series, it calculates the Gini Impurity. 
  y: variable with which calculate Gini Impurity.
  '''
  if isinstance(y, pd.Series):
    p = y.value_counts()/y.shape[0]
    gini = 1-np.sum(p**2)
    return(gini)

  else:
    raise TypeError('Object must be a Pandas Series.')

gini_impurity(data.Gender) 

0.4998

In [9]:
def entropy(y):
  '''
  Given a Pandas Series, it calculates the entropy. 
  y: variable with which calculate entropy.
  '''
  if isinstance(y, pd.Series):
    a = y.value_counts()/y.shape[0]
    entropy = np.sum(-a*np.log2(a+1e-9)) # the 1e-9 guards against log2(0)
    return(entropy)

  else:
    raise TypeError('Object must be a Pandas Series.')

entropy(data.Gender)  

0.9997114388674198

Let's now define information gain for any split as entropy before splitting - entropy after splitting.

(We can also use gini index before splitting - gini afterwards, but here we are using entropy)

For regression, we instead use variance before splitting - variance after splitting

In [10]:
def variance(y):
  '''
  Function to help calculate the variance avoiding nan.
  y: variable to calculate variance to. It should be a Pandas Series.
  '''
  if(len(y) == 1):
    return 0
  else:
    return y.var()

def information_gain(y, mask, func=entropy):
  '''
  It returns the Information Gain of a variable given a loss function.
  y: target variable.
  mask: split choice.
  func: function to be used to calculate Information Gain in case of classification.
  '''
  
  a = sum(mask)
  b = mask.shape[0] - a
  
  if(a == 0 or b ==0): 
    ig = 0
  
  else:
    # ~mask selects the rows on the other side of the split
    if y.dtypes != 'O':
      ig = variance(y) - (a/(a+b)* variance(y[mask])) - (b/(a+b)*variance(y[~mask]))
    else:
      ig = func(y)-a/(a+b)*func(y[mask])-b/(a+b)*func(y[~mask])
  
  return ig

In [16]:
information_gain(data['obese'], data['Gender']=='Male')

-0.0002808244603327431

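Why is it negative? Information gain computed from entropy cannot be negative (see above), and this is not a rounding error. `obese` is stored as integers, so `y.dtypes != 'O'` is true and `information_gain` takes the variance branch instead of calling `entropy`. `Series.var()` defaults to the sample variance (ddof=1), and the weighted reduction in sample variance is not guaranteed to be non-negative; with the population variance (ddof=0) it would be. The "information gain" values reported for `obese` in this notebook are therefore variance reductions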
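The effect of the variance estimator can be reproduced in isolation. This sketch uses the standard-library `statistics` module instead of pandas: `statistics.variance` divides by n-1 (like the pandas default) and `statistics.pvariance` divides by n. The helper name and the toy data are illustrative.

```python
import statistics


def variance_reduction(y, mask, var_func):
    """Variance before the split minus size-weighted variance after it."""
    left = [v for v, m in zip(y, mask) if m]
    right = [v for v, m in zip(y, mask) if not m]
    n = len(y)
    return var_func(y) - len(left) / n * var_func(left) - len(right) / n * var_func(right)


# A 0/1 target and a split that carries no information about it.
y = [0, 1, 0, 1, 0, 1, 0, 1]
mask = [True, True, True, True, False, False, False, False]

sample = variance_reduction(y, mask, statistics.variance)       # divides by n - 1
population = variance_reduction(y, mask, statistics.pvariance)  # divides by n
print(sample)      # slightly negative
print(population)  # zero: an uninformative split gains nothing
```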

Ok. So how do we now choose the split which maximizes information gain?

If the feature is numeric, we try every distinct value of the feature as a threshold t.
For each threshold we build a boolean mask (f < t or not), compute the information gain of that mask, and keep the threshold that maximizes it

For categorical features, we have to compute information gain for all possible subsets of the categories
If the number of categories is too large, this leads to a combinatorial explosion, so we typically restrict to <= 20 categories
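To get a feel for the combinatorial explosion, the sketch below counts the candidates for k categories. There are 2^k - 2 non-empty proper subsets; since a subset and its complement describe the same binary partition, only 2^(k-1) - 1 of them are distinct splits. The function names are illustrative.

```python
def n_candidate_subsets(k):
    """Non-empty proper subsets of k categories."""
    return 2 ** k - 2


def n_distinct_partitions(k):
    """Distinct two-way partitions (a subset and its complement count once)."""
    return 2 ** (k - 1) - 1


for k in (4, 10, 20, 30):
    print(k, n_candidate_subsets(k), n_distinct_partitions(k))
# 4 14 7
# 10 1022 511
# 20 1048574 524287
# 30 1073741822 536870911
```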



In [17]:
import itertools

def categorical_options(a):
  '''
  Creates all possible combinations from a Pandas Series.
  a: Pandas Series from where to get all possible combinations. 
  '''
  a = a.unique()

  opciones = []
  for L in range(0, len(a)+1):
      for subset in itertools.combinations(a, L):
          subset = list(subset)
          opciones.append(subset)

  return opciones[1:-1]

In [21]:
a = pd.DataFrame({'value' : ['a']*5 + ['b']*6 + ['c']*7 + ['d']*8})

In [23]:
categorical_options(a.value)

[['a'],
 ['b'],
 ['c'],
 ['d'],
 ['a', 'b'],
 ['a', 'c'],
 ['a', 'd'],
 ['b', 'c'],
 ['b', 'd'],
 ['c', 'd'],
 ['a', 'b', 'c'],
 ['a', 'b', 'd'],
 ['a', 'c', 'd'],
 ['b', 'c', 'd']]

In [24]:
def max_information_gain_split(x, y, func=entropy):
  '''
  Given a predictor & target variable, returns the best split, the error and the type of variable based on a selected cost function.
  x: predictor variable as Pandas Series.
  y: target variable as Pandas Series.
  func: function to be used to calculate the best split.
  '''

  split_value = []
  ig = [] 

  numeric_variable = True if x.dtypes != 'O' else False

  # Create options according to variable type
  if numeric_variable:
    options = x.sort_values().unique()[1:]
  else: 
    options = categorical_options(x)

  # Calculate ig for all values
  for val in options:
    mask =   x < val if numeric_variable else x.isin(val)
    val_ig = information_gain(y, mask, func)
    # Append results
    ig.append(val_ig)
    split_value.append(val)

  # Check if there are more than 1 results if not, return False
  if len(ig) == 0:
    return(None,None,None, False)

  else:
  # Get results with highest IG
    best_ig = max(ig)
    best_ig_index = ig.index(best_ig)
    best_split = split_value[best_ig_index]
    return(best_ig,best_split,numeric_variable, True)


weight_ig, weight_slpit, _, _ = max_information_gain_split(data['Weight'], data['obese'],)  


print(
  "The best split for Weight is when the variable is less than ",
  weight_slpit,"\nInformation Gain for that split is:", weight_ig
)

The best split for Weight is when the variable is less than  103 
Information Gain for that split is: 0.10625190497954848


This is for one feature; we repeat it for each feature to see which feature gives the best split

In [25]:
data.drop('obese', axis= 1).apply(max_information_gain_split, y = data['obese'])

Unnamed: 0,Gender,Height,Weight
0,-0.000281,0.019684,0.106252
1,[Male],174,103
2,False,True,True
3,True,True,True


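The variable with the highest Information Gain is Weight (row 0 of the output; rows 1 to 3 hold the split value, whether the feature is numeric, and whether a valid split was found). Therefore, it will be the variable that we use first to do the split. In addition, we also have the value on which the split must be performed: 103.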

With this, we already have the first split, which would generate two dataframes. If we apply this recursively, we will end up creating the entire decision tree (coded in Python from scratch)

### Complete implementation - Hyperparameters

max_depth: maximum depth of the tree. If we set it to None, the tree will grow until all the leaves are pure or the min_samples_split limit has been reached.
min_samples_split: the minimum number of observations a node must have for it to be split further.
min_information_gain: the minimum Information Gain a split must achieve for the tree to continue growing.
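A minimal sketch of how these three hyperparameters act as stopping criteria, assuming the same comparisons as the training code later in these notes (depth strictly below `max_depth`, sample count strictly above `min_samples_split`, gain at least `min_information_gain`). The helper names are illustrative.

```python
def should_try_split(n_samples, depth, max_depth=None, min_samples_split=None):
    """True if the depth and sample-count limits still allow a split."""
    depth_ok = max_depth is None or depth < max_depth
    samples_ok = min_samples_split is None or n_samples > min_samples_split
    return depth_ok and samples_ok


def accept_split(information_gain, min_information_gain=1e-20):
    """True if a split was found and its gain reaches the minimum."""
    return information_gain is not None and information_gain >= min_information_gain


print(should_try_split(n_samples=500, depth=0, max_depth=5, min_samples_split=20))  # True
print(should_try_split(n_samples=15, depth=2, max_depth=5, min_samples_split=20))   # False
print(accept_split(0.106, min_information_gain=1e-5))                               # True
print(accept_split(None))                                                           # False
```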

In [26]:
def get_best_split(y, data):
  '''
  Given a data, select the best split and return the variable, the value, the variable type and the information gain.
  y: name of the target variable
  data: dataframe where to find the best split.
  '''
  masks = data.drop(y, axis= 1).apply(max_information_gain_split, y = data[y])
  if sum(masks.loc[3,:]) == 0:
    return(None, None, None, None)

  else:
    # Get only masks that can be splitted
    masks = masks.loc[:,masks.loc[3,:]]

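    # Get the results for split with highest IG.
    # Row 0 holds the information gain of each column; max(masks) would
    # compare column names and return the alphabetically last one instead.
    split_variable = masks.loc[0, :].astype(float).idxmax()
    #split_valid = masks[split_variable][]
    split_value = masks[split_variable][1] 
    split_ig = masks[split_variable][0]
    split_numeric = masks[split_variable][2]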

    return(split_variable, split_value, split_ig, split_numeric)

In [28]:
def make_split(variable, value, data, is_numeric):
  '''
  Given a data and a split conditions, do the split.
  variable: variable with which make the split.
  value: value of the variable to make the split.
  data: data to be splitted.
  is_numeric: boolean considering if the variable to be splitted is numeric or not.
  '''
  if is_numeric:
    data_1 = data[data[variable] < value]
    data_2 = data[(data[variable] < value) == False]

  else:
    data_1 = data[data[variable].isin(value)]
    data_2 = data[(data[variable].isin(value)) == False]

  return(data_1,data_2)

def make_prediction(data, target_factor):
  '''
  Given the target variable, make a prediction.
  data: pandas series for target variable
  target_factor: boolean considering if the variable is a factor or not
  '''

  # Make predictions
  if target_factor:
    pred = data.value_counts().idxmax()
  else:
    pred = data.mean()

  return pred

## training code

In [29]:
def train_tree(data,y, target_factor, max_depth = None,min_samples_split = None, min_information_gain = 1e-20, counter=0, max_categories = 20):
  '''
  Trains a Decision Tree
  data: Data to be used to train the Decision Tree
  y: target variable column name
  target_factor: boolean to consider if target variable is factor or numeric.
  max_depth: maximum depth to stop splitting.
  min_samples_split: minimum number of observations to make a split.
  min_information_gain: minimum ig gain to consider a split to be valid.
  max_categories: maximum number of different values accepted for categorical values. High number of values will slow down learning process.
  '''

  # Check that max_categories is fulfilled
  if counter==0:
    types = data.dtypes
    check_columns = types[types == "object"].index
    for column in check_columns:
      var_length = len(data[column].value_counts()) 
      if var_length > max_categories:
        raise ValueError('The variable ' + column + ' has '+ str(var_length) + ' unique values, which is more than the accepted ones: ' +  str(max_categories))

  # Check for depth conditions
  if max_depth == None:
    depth_cond = True

  else:
    if counter < max_depth:
      depth_cond = True

    else:
      depth_cond = False

  # Check for sample conditions
  if min_samples_split == None:
    sample_cond = True

  else:
    if data.shape[0] > min_samples_split:
      sample_cond = True

    else:
      sample_cond = False

  # Check for ig condition
  if depth_cond & sample_cond:

    var,val,ig,var_type = get_best_split(y, data)

    # If ig condition is fulfilled, make split 
    if ig is not None and ig >= min_information_gain:

      counter += 1

      left,right = make_split(var, val, data,var_type)

      # Instantiate sub-tree
      split_type = "<=" if var_type else "in"
      question =   "{} {}  {}".format(var,split_type,val)
      # question = "\n" + counter*" " + "|->" + var + " " + split_type + " " + str(val) 
      subtree = {question: []}


      # Find answers (recursion)
      yes_answer = train_tree(left,y, target_factor, max_depth,min_samples_split,min_information_gain, counter)

      no_answer = train_tree(right,y, target_factor, max_depth,min_samples_split,min_information_gain, counter)

      if yes_answer == no_answer:
        subtree = yes_answer

      else:
        subtree[question].append(yes_answer)
        subtree[question].append(no_answer)

    # If it doesn't match IG condition, make prediction
    else:
      pred = make_prediction(data[y],target_factor)
      return pred

   # Drop dataset if doesn't match depth or sample conditions
  else:
    pred = make_prediction(data[y],target_factor)
    return pred

  return subtree


max_depth = 5
min_samples_split = 20
min_information_gain  = 1e-5


decisiones = train_tree(data,'obese',True, max_depth,min_samples_split,min_information_gain)


decisiones

{'Weight <=  103': [{'Weight <=  74': [0,
    {'Weight <=  84': [{'Weight <=  75': [1, 0]},
      {'Weight <=  98': [1, 0]}]}]},
  1]}

## Predict using our decision tree in Python

In [30]:
def clasificar_datos(observacion, arbol):
  question = list(arbol.keys())[0] 

  if question.split()[1] == '<=':

    if observacion[question.split()[0]] <= float(question.split()[2]):
      answer = arbol[question][0]
    else:
      answer = arbol[question][1]

  else:

    if observacion[question.split()[0]] in (question.split()[2]):
      answer = arbol[question][0]
    else:
      answer = arbol[question][1]

  # If the answer is not a dictionary
  if not isinstance(answer, dict):
    return answer
  else:
    residual_tree = answer
    return clasificar_datos(observacion, answer)

In `train_tree`, we first ensure that both min_samples_split and max_depth are fulfilled.
If they are, we get the best split and obtain the Information Gain. If either condition is not fulfilled, we make the prediction.
We then check that the Information Gain passes the minimum amount set by min_information_gain.
If it does, we make the split and save the decision. If it does not, we make the prediction.

Note that the question strings are built with `<=` and `clasificar_datos` compares with `<=`, while the splits themselves are made with `<`, so an observation whose value equals a threshold is routed differently at training and at prediction time.
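As a usage example, the tree printed above can be walked for a few hypothetical observations. The iterative `classify` helper below is a simplified stand-in for `clasificar_datos` that only handles numeric `<=` questions; the observations are made up.

```python
# The tree printed by the training cell above, copied verbatim.
tree = {'Weight <=  103': [{'Weight <=  74': [0,
    {'Weight <=  84': [{'Weight <=  75': [1, 0]},
      {'Weight <=  98': [1, 0]}]}]},
  1]}


def classify(observation, node):
    """Walk a nested-dict tree whose questions look like 'feature <= value'."""
    while isinstance(node, dict):
        question = next(iter(node))                # each node has a single question
        feature, _, value = question.split()
        yes_branch, no_branch = node[question]
        node = yes_branch if observation[feature] <= float(value) else no_branch
    return node                                    # a leaf holds the predicted label


print(classify({"Weight": 110}, tree))  # 1
print(classify({"Weight": 60}, tree))   # 0
print(classify({"Weight": 90}, tree))   # 1
```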

## References

[1]: https://machinelearningmastery.com/implement-decision-tree-algorithm-scratch-python/
[2]: https://www.analyticssteps.com/blogs/what-gini-index-and-information-gain-decision-trees  
[3]: https://sites.math.washington.edu/~morrow/336_15/papers/lev.pdf 
[4]: https://en.wikipedia.org/wiki/Information_gain_in_decision_trees
[5]: https://machinelearningmastery.com/information-gain-and-mutual-information/




1) https://machinelearningmastery.com/implement-decision-tree-algorithm-scratch-python/
2) https://www.analyticssteps.com/blogs/what-gini-index-and-information-gain-decision-trees
3) https://sites.math.washington.edu/~morrow/336_15/papers/lev.pdf
4) https://en.wikipedia.org/wiki/Information_gain_in_decision_trees
5) https://machinelearningmastery.com/information-gain-and-mutual-information/
6) https://anderfernandez.com/en/blog/code-decision-tree-python-from-scratch/
