## Session 02: Decision Trees

In [1]:
from collections import Counter

import numpy as np
import pandas as pd

**From Wikipedia**

_Decision tree learning uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It is one of the predictive modelling approaches used in statistics, data mining and machine learning. Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees._

We first define a very basic interface for a decision tree.

```
class DecisionTree(object):
    def __init__(self, split_scorer):
        ...

    def fit(self, data, target_label):
        ...

    def predict(self, row):
        ...
```

The `split_scorer` is a callback that decides which feature to split on next.

Now that we have a basic API defined, we can write a few scorer functions, starting with a helper that partitions the data by the values of a feature.

In [2]:
def split(df, feature):
    """Returns a list of non-overlapping dataframes, one per distinct value of `feature`,
        the union of which is the original
    """
    return [x for _, x in df.groupby(feature)]
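To see what `split` returns, here is a small sketch on a made-up dataframe. The `toy` frame and its `colour` and `label` columns are illustrative assumptions, not part of the session's data, and the helper is repeated so the block runs on its own.

```python
import pandas as pd


def split(df, feature):
    # One sub-dataframe per distinct value of `feature`
    return [x for _, x in df.groupby(feature)]


# Hypothetical toy data: one categorical feature and a binary label
toy = pd.DataFrame({
    "colour": ["red", "red", "blue", "green"],
    "label": [1, 0, 1, 0],
})

parts = split(toy, "colour")

# Map each feature value to the number of rows that ended up in its part
sizes = {part["colour"].iloc[0]: len(part) for part in parts}
print(sizes)
```

Every row lands in exactly one part, so the part sizes add up to the size of the original frame.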

**Gini Impurity**

$G = \sum_{i=1}^{N} p_i (1 - p_i)$

Equivalent form:

$G = 1 - \sum_{i=1}^{N} p_i^2$

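The Gini index is a measure of node purity: a small value indicates that a node contains predominantly observations from a single class. Here $p_i$ is the fraction of observations in the node that belong to class $i$, and $N$ is the number of classes.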
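As a short check of the two forms, the equivalence follows from the class proportions summing to one:

$\sum_{i=1}^{N} p_i (1 - p_i) = \sum_{i=1}^{N} p_i - \sum_{i=1}^{N} p_i^2 = 1 - \sum_{i=1}^{N} p_i^2$

Two small cases illustrate the scale. A node with two equally frequent classes gives $G = 1 - (0.5^2 + 0.5^2) = 0.5$, while a node containing a single class gives $G = 1 - 1^2 = 0$.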

In [3]:
def gini_impurity(target):
    c = Counter(target)
    frequencies = np.array(list(c.values()))
    n = frequencies.sum()
    pi = frequencies / n
    return 1 - sum(pi**2)

def gini_impurity_scorer(df, target_label):
    """Given a dataframe with the target label, determine the feature that gives the best split
        using the Gini impurity as a measure
    """
    features = df.drop(target_label, axis=1).columns
    n = len(df)

    best_feature = None
    gini_min = np.inf

    for feature in features:
        df_split = split(df, feature)
        # Weight each child's impurity by its share of the rows
        gini_after = sum(len(sub_df) / n * gini_impurity(sub_df[target_label]) for sub_df in df_split)
        # Keep the feature with the lowest weighted impurity after the split
        if gini_after < gini_min:
            gini_min = gini_after
            best_feature = feature

    return best_feature

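**Information Gain**

$H = -\sum_{i=1}^{N} p_i \log_2 p_i$

Information gain = entropy(parent) - weighted average entropy(children), where each child is weighted by its share of the parent's observations:

$IG = H(\text{parent}) - \sum_{j} \frac{n_j}{n} H(\text{child}_j)$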
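A minimal worked example of this definition, using made-up labels: one split that separates the classes perfectly and one that does not separate them at all. The helper name `entropy_bits` and the label lists are assumptions made for illustration only.

```python
import numpy as np


def entropy_bits(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy_bits(c) for c in children)
    return entropy_bits(parent) - weighted


parent = ["yes", "yes", "no", "no"]

# A split that isolates each class removes all uncertainty
perfect_gain = gain(parent, [["yes", "yes"], ["no", "no"]])

# A split whose children look just like the parent removes none
useless_gain = gain(parent, [["yes", "no"], ["yes", "no"]])

print(perfect_gain, useless_gain)
```

The parent has one bit of entropy, so the perfect split gains one bit and the uninformative split gains nothing.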

In [4]:
def entropy(target):
    c = Counter(target)
    frequencies = np.array(list(c.values()))
    n = frequencies.sum()
    pi = frequencies / n
    return -np.sum(pi*np.log2(pi))

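def weighted_entropy(targets):
    """Average entropy of the child nodes, weighted by the share of rows in each child"""
    n = sum(len(t) for t in targets)
    weights = np.array([len(t)/n for t in targets])
    entropies = [entropy(target) for target in targets]
    # entropy() is already non-negative, so the weighted average takes no extra minus sign
    return np.dot(weights, entropies)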

def information_gain(df, target_label):
    """Given a dataframe with the target label, determine the feature that gives the best split
        using the Information Gain as a measure
    """
    target = df[target_label]
    features = df.drop(target_label, axis=1).columns
    
    parent_entropy = entropy(target)
    
    best_feature = None
    ig_max = -np.inf
    
    for feature in features:
        df_split = split(df, feature)
        targets = [df_subset[target_label] for df_subset in df_split]
        child_entropy = weighted_entropy(targets)
        info_gain = parent_entropy - child_entropy
        if info_gain > ig_max:
            ig_max = info_gain
            best_feature = feature
            
    return best_feature

In [5]:
df = pd.DataFrame({"x": [1,1,1,1,2,2,2,1,1,1,3], "y": [2,2,2,2,3,3,1,2,2,2,4]})

Now that we have a few scorers available, we can sketch out an initial implementation of a decision tree:

In [6]:
class DecisionTree(object):
    def __init__(self, split_scorer):
        self.scorer = split_scorer
        self.tree = {}
        
    def _majority(self, data, target_label):
        return data[target_label].value_counts().idxmax()

    def _fit(self, data, target_label):
        features = data.drop(target_label, axis=1).columns.tolist()
        
        # Stopping criterion is when we have only a single class label left
        if len(set(data[target_label])) == 1: return data[target_label].iloc[0]

        # Or we don't have any features left to split on, in which case we will just predict majority:
        if not features: return self._majority(data, target_label)
        
        # You could also put in a condition here for the depth of the tree as a stopping criterion
        
        # Decide which is the best feature to split on using the split scorer
        best_feature = self.scorer(data, target_label)
        
        # Start a (sub) tree
        current_node = {best_feature: {}}

        # For every value that this feature can take, we fit a sub-tree to that subset of data 
        for val in set(data[best_feature]):
            data_subset = data[data[best_feature] == val].drop(best_feature, axis=1)

            # Recursively fit trees for each split value of this feature
            current_node[best_feature][val] = self._fit(data_subset, target_label)

        return current_node
                        
    def fit(self, data, target_label):
        self.tree = self._fit(data, target_label)

    def predict(self, row):
        ...
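`predict` is left unimplemented above. One possible way to fill it in, assuming the nested-dict layout that `_fit` builds (`{feature: {value: subtree_or_label}}`), is to walk the dictionary until a leaf is reached. The function `predict_row`, its `default` argument, and `toy_tree` are hypothetical names used for this sketch, not part of the session's code.

```python
def predict_row(tree, row, default=None):
    """Follow the branches of a nested-dict tree until a class label is reached.

    `row` can be a dict or a pandas Series indexed by feature name.
    `default` is returned when the row has a feature value that the tree
    never saw during fitting (an assumption of this sketch).
    """
    node = tree
    while isinstance(node, dict):
        # Each internal node has exactly one key: the feature it splits on
        feature = next(iter(node))
        branches = node[feature]
        value = row[feature]
        if value not in branches:
            return default
        node = branches[value]
    return node


# Hand-written tree in the same shape as the output of `_fit`
toy_tree = {"safety": {"low": "unacc",
                       "high": {"persons": {"2": "unacc", "4": "acc"}}}}

print(predict_row(toy_tree, {"safety": "high", "persons": "4"}))
print(predict_row(toy_tree, {"safety": "med", "persons": "4"}, default="unknown"))
```

Inside the class, the same loop could start from `self.tree`. Returning a fixed default for unseen values is only one option; falling back to the majority class of the training subset at that node is another.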

In [7]:
cars = pd.read_csv('car.csv', names=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'evaluation'])

In [None]:
cars.head()

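|   | buying | maint | doors | persons | lug_boot | safety | evaluation |
|---|--------|-------|-------|---------|----------|--------|------------|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |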


In [None]:
dt = DecisionTree(gini_impurity_scorer)
dt.fit(cars, 'evaluation')

In [None]:
dt.tree