# Week 5 Discussion: Decision Tree

## Objectives

This week's discussion centers on training a decision tree model. During training, at each node the tree chooses a feature and a threshold that split the data into two subsets. The aim is to choose splits that minimize the label entropy in the resulting child nodes. By applying this splitting process recursively, we try to make the labels within each node as uniform as possible. At each leaf node, the most prevalent label is used as the prediction. This discussion covers:

* Revisiting the core principles of learning decision trees.
* Importing a dataset from `scikit-learn`.
* Generating predictions using the trained tree.
* Measuring performance by accuracy.

## Sources

This discussion on fitting decision trees is based on the following references:
<br>
https://youtu.be/jVh5NA9ERDA?si=-QQP_ctg8TY3IkMR

https://youtu.be/Bqi7EFFvNOg?si=WMZMJoggVBBzRMWd

https://github.com/patrickloeber/MLfromscratch/blob/master/mlfromscratch/decision_tree.py

## Fitting Decision Trees: A Review

We aim to predict whether a person will go for a walk based on two factors: the duration of the walk and whether it's raining. We have the following dataset available:

| Rain | Time | Walk |
| --- | ----------- | --- |
| 1 | 30 | No |
| 1 | 15 | No |
| 1 | 5 | No |
| 0 | 10 | No |
| 0 | 5 | No |
| 0 | 15 | Yes |
| 0 | 20 | Yes |
| 0 | 25 | Yes |
| 0 | 30 | Yes |
| 0 | 35 | Yes |

Here are some basic observations about the data:

* There are 10 data points.
* The dataset consists of 2 features.
* The output is binary, indicating whether the person goes for a walk or not.

Our objective is to fit a tree to this dataset. We begin with all data points at the root node and aim to identify the optimal feature/split point combination.

When deciding between feature and split value combinations, we need a criterion that measures the quality of a split. A commonly used criterion is information gain, which measures the reduction in entropy (uncertainty) achieved by splitting the data on a particular feature and threshold. We want to maximize the information gain.

For each feature, we iterate over possible split values and calculate the information gain for each split. We then choose the feature and split value combination that maximizes information gain. This process helps us find the most effective way to partition the data into subsets at each node of the decision tree.

We know the formula for information gain is:
<br>
<br>
$$
\text{IG} = H(\text{parent node}) - \frac{\text{\# points in left child node} \cdot H(\text{left child node}) + \text{\# points in right child node} \cdot H(\text{right child node})}{\text{\# points in parent node}},
$$

where $H(\cdot)$ denotes the entropy of the node, computed as:
<br>
<br>
$$
H(\text{node}) = -\sum_{l} p(l) \log_2 p(l),
$$

where $l$ denotes the labels of the data points inside the node and $p(l)$ denotes the empirical probability (relative frequency) of that label within the node.
For instance, the entropy at the root node in our illustration is determined by the data points it contains (all of them) as:
<br>
<br>
$$
H(\text{root}) = - \left( \frac{5}{10} \log_2 \frac{5}{10} + \frac{5}{10} \log_2 \frac{5}{10} \right) = 1.
$$

***Note: the base of the logarithm matches the number of classes here (2), so the entropy lies between 0 and 1. The code below always uses base 2.***


As mentioned earlier, each split leads to potentially different entropies in the left and right child nodes, resulting in varying levels of information gain. Our aim is to identify the split that yields the highest information gain in the training process.
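To make the formula concrete, here is a minimal sketch that evaluates the information gain of every candidate split on the toy walk dataset. The helper names `node_entropy` and `info_gain` and the encoding of `Yes`/`No` as 1/0 are assumptions made for this illustration; splits send a point left when its feature value is less than or equal to the threshold.

```python
import numpy as np

# Toy dataset from the table above: columns are Rain, Time.
# Labels are encoded as 1 = "Yes", 0 = "No" (an assumption for this sketch).
X = np.array([
    [1, 30], [1, 15], [1, 5], [0, 10], [0, 5],
    [0, 15], [0, 20], [0, 25], [0, 30], [0, 35],
])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])


def node_entropy(labels):
    """Base-2 entropy of an array of non-negative integer labels."""
    ps = np.bincount(labels) / len(labels)
    return -sum(p * np.log2(p) for p in ps if p > 0)


def info_gain(labels, column, threshold):
    """Information gain of the split `column <= threshold`."""
    left, right = labels[column <= threshold], labels[column > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0  # a split that leaves one side empty gains nothing
    weighted = (len(left) * node_entropy(left)
                + len(right) * node_entropy(right)) / len(labels)
    return node_entropy(labels) - weighted


# Evaluate every (feature, threshold) pair, using observed values as thresholds.
gains = {
    (name, int(t)): info_gain(y, X[:, j], t)
    for j, name in enumerate(["Rain", "Time"])
    for t in np.unique(X[:, j])
}
for (name, t), g in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name} <= {t}: gain = {g:.4f}")
```

On this data, `Rain <= 0` and `Time <= 10` tie for the largest gain (about 0.396), so which of the two ends up at the root depends on the order in which the features are visited.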


The algorithm outline is as follows:

* Start at the top node. At each node, select the best split based on the highest information gain.
* Iterate over all features and thresholds.
* Save the best split feature and split value at each node.
* Build the tree recursively.
* Implement stopping criteria, such as maximum depth, minimum number of samples in a node, or achievement of minimum entropy (0).
* When a leaf node is reached, store the most common class label as the prediction.

For prediction according to this scheme:

* Traverse the tree recursively.
* At each node, examine the split feature of the test data point: proceed to the left child if $x[\text{feature\_idx}] \leq \text{threshold}$, and to the right child otherwise.
* Once a leaf node is reached, return the value associated with that leaf.

## Coding it up!

To code it up, let's begin by crafting a utility function that calculates the entropy within an array of labels:

In [1]:
import numpy as np

def entropy(y):
    hist = np.bincount(y)
    ps = hist / len(y)
    return -np.sum([p * np.log2(p) for p in ps if p > 0])

The function `entropy` outlined above operates on the labels $y$ and performs the following steps:

* It generates a histogram of labels with `np.bincount`, which counts how often each integer from 0 up to the largest label occurs. For instance, an array like $[1,2,2,2,2]$ would yield $[0,1,4]$.
* The histogram of labels is then divided by the length of the label array to obtain relative frequencies.
* It calculates the entropy using the formula described earlier, skipping terms with $p=0$ so that the logarithm of 0 is never taken. This matches the convention that $p \log_2 (p)$ is defined as $0$ when $p=0$.
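As a quick sanity check, the sketch below repeats the `entropy` function and evaluates it on a few small label arrays. The example arrays are made up for illustration. Note that `np.bincount` only accepts non-negative integers, so labels are assumed to be encoded that way.

```python
import numpy as np


def entropy(y):
    hist = np.bincount(y)
    ps = hist / len(y)
    return -np.sum([p * np.log2(p) for p in ps if p > 0])


# bincount reports a count for every integer from 0 to max(y), including 0.
counts = np.bincount([1, 2, 2, 2, 2])
print(counts)  # counts for labels 0, 1, 2

pure = entropy(np.array([0, 0, 0, 0]))      # a single label: no uncertainty
balanced = entropy(np.array([0, 1, 0, 1]))  # two equally likely labels
skewed = entropy(np.array([1, 2, 2, 2, 2]))  # one label dominates

print(pure, balanced, skewed)
```

A pure node has entropy 0, a balanced binary node has entropy 1, and anything in between falls between those two values.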

The foundational component of our tree structure is the class `Node`, as defined below:

In [2]:
class Node:
    def __init__(
        self, feature=None, threshold=None, left=None, right=None, value=None
    ):
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right
        self.value = value

    def is_leaf_node(self):
        return self.value is not None

The `__init__()` function of the `Node` class accepts five arguments:

* `feature`: Represents the feature on which the node splits. Applicable to non-leaf nodes.

* `threshold`: Denotes the threshold for the node's split. Applicable to non-leaf nodes.

* `left`: Refers to the left child of the node. Applicable to non-leaf nodes.

* `right`: Denotes the right child of the node. Applicable to non-leaf nodes.

*  `value`: Represents the node's value used for prediction. Applicable to leaf nodes.

The function `is_leaf_node()` returns `True` if the instance attribute `value` is set (that is, not `None`), indicating that the node is a leaf node.
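The short sketch below shows how leaf and internal nodes differ in practice. The feature index and threshold are arbitrary illustrative values. It also shows why the check is `value is not None` rather than a plain truthiness test: a leaf that predicts class `0` must still count as a leaf.

```python
class Node:
    def __init__(
        self, feature=None, threshold=None, left=None, right=None, value=None
    ):
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right
        self.value = value

    def is_leaf_node(self):
        return self.value is not None


# Two leaves, one of which predicts class 0.
leaf_zero = Node(value=0)
leaf_one = Node(value=1)

# An internal node that splits on feature 0 at threshold 0.5 (illustrative values).
internal = Node(feature=0, threshold=0.5, left=leaf_zero, right=leaf_one)

print(leaf_zero.is_leaf_node())  # leaf, even though its value is falsy
print(internal.is_leaf_node())   # not a leaf: value was never set
```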

We will construct the `DecisionTree` class gradually, one function at a time. To enhance understanding, we'll initially define the functions outside the class and later assemble them within the class structure.

The `DecisionTree` class `__init__` function is as follows:

In [3]:
def __init__(self, min_samples_split=2, max_depth=100, n_feats=None):
    self.min_samples_split = min_samples_split
    self.max_depth = max_depth
    self.n_feats = n_feats
    self.root = None

In the `__init__()` function of the `DecisionTree` class, there are three arguments:

* `min_samples_split`: Represents the minimum number of samples in a node required for further splitting.

* `max_depth`: Denotes the maximum depth allowed for the tree.

* `n_feats`: Indicates the number of features to consider when searching for the best split at each node. If left as `None`, all features in the dataset are used.

The `root` attribute of the class is initialized to `None` and will be assigned later during the tree fitting process.

The function `fit` is used for fitting the decision tree:

In [4]:
def fit(self, X, y):
    self.n_feats = X.shape[1] if not self.n_feats else min(self.n_feats, X.shape[1])
    self.root = self._grow_tree(X, y)

The `fit()` function takes two arguments, the input features `X` and the output labels `y`, and does the following:

* Sets the `n_feats` attribute to the number of columns in `X` if it was not specified, and otherwise caps it at that number.
* Calls the `_grow_tree()` function and assigns the output to the `root` attribute. We will see the function `_grow_tree()` below.
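The conditional expression in `fit()` can be easier to read in isolation. The sketch below wraps the same expression in a standalone helper. The name `resolve_n_feats` is hypothetical and exists only for this illustration.

```python
def resolve_n_feats(n_feats, n_columns):
    """Same expression as in fit(): default to all columns, never exceed them."""
    return n_columns if not n_feats else min(n_feats, n_columns)


default_case = resolve_n_feats(None, 30)  # not specified: use all 30 columns
subset_case = resolve_n_feats(5, 30)      # specified and valid: keep 5
capped_case = resolve_n_feats(50, 30)     # larger than the data: capped at 30

print(default_case, subset_case, capped_case)
```

Because the test is `not n_feats`, passing `0` behaves like `None` and also falls back to all columns.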

Next, we will look at the `_grow_tree()` function:

In [5]:
def _grow_tree(self, X, y, depth=0):
    n_samples, n_features = X.shape
    n_labels = len(np.unique(y))

    # stopping criteria
    if (
        depth >= self.max_depth
        or n_labels == 1
        or n_samples < self.min_samples_split
    ):
        leaf_value = self._most_common_label(y)
        return Node(value=leaf_value)

    feat_idxs = np.random.choice(n_features, self.n_feats, replace=False)

    # greedily select the best split according to information gain
    best_feat, best_thresh = self._best_criteria(X, y, feat_idxs)

    # grow the children that result from the split
    left_idxs, right_idxs = self._split(X[:, best_feat], best_thresh)
    left = self._grow_tree(X[left_idxs, :], y[left_idxs], depth + 1)
    right = self._grow_tree(X[right_idxs, :], y[right_idxs], depth + 1)
    return Node(best_feat, best_thresh, left, right)

The `_grow_tree()` function takes in 3 inputs:

* `X`: Represents the input features of the training data within the node to be split.

* `y`: Denotes the labels of the training data within the node to be split.

* `depth`: Indicates the depth within the tree. It ensures that the tree is fitted only up to the specified `max_depth`.

The `_grow_tree()` function performs the following tasks:

* Line 2 computes the dimensions of the input data.
* Line 3 calculates the number of unique labels present in the input data.
* Line 6 checks the stopping criteria:
  - Line 7: If the depth of the tree has reached `max_depth`.
  - Line 8: If there is only one unique label present in the node (i.e., entropy equals 0).
  - Line 9: If the number of data points in the node is less than `min_samples_split`.
* Line 11: If any of the conditions in the if statement on line 6 evaluates to True:
  - We have reached a leaf node.
  - The most common label in the node is computed and assigned as the leaf label.
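* Line 14: A subset of `n_feats` features is randomly chosen for splitting. With the default setting, all features are selected and only their order is shuffled. Choosing a proper subset is particularly useful for random forests, where feature selection is randomized at each node.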
* Line 17: The best split is determined by calling the `_best_criteria` function.
* Line 20: The data point indexes that belong to the left and right nodes after the split are computed.
* Lines 21 and 22: The `_grow_tree()` function is recursively called on the data in the left and right nodes.
* Line 23: The subtrees built in the left and right nodes are used to construct the overall tree, with the current node as the root. The result is then returned.
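The feature sampling step on line 14 can be tried on its own. The sketch below uses a made-up feature count of 5 to show that sampling without replacement returns distinct indexes, and that asking for all features simply yields a shuffled ordering of them.

```python
import numpy as np

n_features = 5  # illustrative number of columns

# Asking for every feature: a random permutation of 0..4.
all_feats = np.random.choice(n_features, n_features, replace=False)

# Asking for fewer features: a random subset with no repeats.
subset = np.random.choice(n_features, 2, replace=False)

print(all_feats)
print(subset)
```

Since the output is random, it changes from run to run unless a seed is set, for example with `np.random.seed`.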

Next, we examine the `_best_criteria()` function, which identifies the best feature and splitting point. It accepts the input data along with the list of feature indexes designated for splitting in this node.

In [6]:
def _best_criteria(self, X, y, feat_idxs):
    best_gain = -1
    split_idx, split_thresh = None, None
    for feat_idx in feat_idxs:
        X_column = X[:, feat_idx]
        thresholds = np.unique(X_column) # simple and exhaustive
        for threshold in thresholds:
            gain = self._information_gain(y, X_column, threshold)
            if gain > best_gain:
                best_gain = gain
                split_idx = feat_idx
                split_thresh = threshold

    return split_idx, split_thresh

In summary, the `_best_criteria()` function executes the following steps:

* Line 5: Retrieves the column on which we are evaluating splits and stores it in the variable `X_column`.
* Line 6: Computes all the unique values present in `X_column`, which will serve as our splitting thresholds.
* Line 7: In a loop over different thresholds:
  - Line 8: Calculates the gain as the information gain by calling the `_information_gain()` function.
  - Lines 9 to 12: If the current gain is greater than the best gain observed so far, stores it as the best gain and records the corresponding feature index and threshold.
* Line 14: Returns the best feature index and threshold found.

Now let's examine the `_information_gain()` function. This function accepts the data labels `y`, the column we are evaluating for splitting `X_column`, and the threshold under consideration `split_thresh`. Its purpose is to compute the information gain if we were to choose that split.

In [7]:
def _information_gain(self, y, X_column, split_thresh):
    # parent loss
    parent_entropy = entropy(y)

    # generate split
    left_idxs, right_idxs = self._split(X_column, split_thresh)

    if len(left_idxs) == 0 or len(right_idxs) == 0:
        return 0

    # compute the weighted avg. of the loss for the children
    n = len(y)
    n_l, n_r = len(left_idxs), len(right_idxs)
    e_l, e_r = entropy(y[left_idxs]), entropy(y[right_idxs])
    child_entropy = (n_l / n) * e_l + (n_r / n) * e_r

    # information gain is difference in loss before vs. after split
    ig = parent_entropy - child_entropy
    return ig

In summary, the `_information_gain()` function executes the following steps:

* Line 3: Computes the entropy in the parent.
* Line 6: Calculates the indexes of data points in the left and right nodes if the split under consideration is implemented.
* Lines 8 and 9: Returns 0 if either child node ends up without any data points after the split.
* Lines 12 to 15: Computes the weighted average of child entropies based on the discussed formula.
* Lines 18 and 19: Computes the difference between the parent entropy and the child entropy and returns it as the information gain.

Now, let's direct our attention to the `_split()` function, which determines the distribution of data samples into the right or left child nodes.

In [8]:
def _split(self, X_column, split_thresh):
    left_idxs = np.argwhere(X_column <= split_thresh).flatten()
    right_idxs = np.argwhere(X_column > split_thresh).flatten()
    return left_idxs, right_idxs

The `_split()` function takes in the column we are evaluating for splitting, as well as the threshold. It returns two index arrays: the indexes where the value in `X_column` is less than or equal to `split_thresh` (left child), and the indexes where it is greater than `split_thresh` (right child).
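To see the split in action, the sketch below applies the same logic as a standalone function (named `split` here, without `self`) to the `Time` column of the toy walk dataset. It also shows that using the largest value as the threshold leaves the right child empty, which is the case that `_information_gain()` guards against by returning 0.

```python
import numpy as np


def split(X_column, split_thresh):
    left_idxs = np.argwhere(X_column <= split_thresh).flatten()
    right_idxs = np.argwhere(X_column > split_thresh).flatten()
    return left_idxs, right_idxs


# Time column of the toy dataset, in table order.
time = np.array([30, 15, 5, 10, 5, 15, 20, 25, 30, 35])

left, right = split(time, 10)
print(left)   # positions of rows with Time <= 10
print(right)  # positions of rows with Time > 10

# The maximum value sends every row to the left child.
left_all, right_none = split(time, time.max())
print(len(left_all), len(right_none))
```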

To obtain the most common label value, we utilize the `_most_common_label()` function. Let's delve into how this function operates:

In [9]:
from collections import Counter

def _most_common_label(self, y):
    counter = Counter(y)
    most_common = counter.most_common(1)[0][0]
    return most_common

This function counts how many times each label occurs in `y`, then selects the most frequent label and returns it.
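The following sketch shows what `Counter.most_common(1)` returns on small, made-up label arrays, including what happens on a tie. In current Python versions, elements with equal counts are ordered by first occurrence, so the label seen first wins.

```python
from collections import Counter

import numpy as np

# A clear majority: label 1 appears three times, label 0 twice.
y = np.array([1, 0, 1, 1, 0])
pairs = Counter(y).most_common(1)  # list with one (label, count) pair
majority = pairs[0][0]
print(pairs, majority)

# A tie: both labels appear twice, and 0 is encountered first.
tie_winner = Counter([0, 1, 1, 0]).most_common(1)[0][0]
print(tie_winner)
```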

We've completed the functions responsible for fitting the tree. Finally, let's explore how we perform predictions:

In [10]:
def predict(self, X):
    return np.array([self._traverse_tree(x, self.root) for x in X])

def _traverse_tree(self, x, node):
    if node.is_leaf_node():
        return node.value

    if x[node.feature] <= node.threshold:
        return self._traverse_tree(x, node.left)
    return self._traverse_tree(x, node.right)

During prediction, we invoke the `predict()` function, which accepts the testing data. As shown in line 2, it iterates through all the data points in the test data `X`, calling the `_traverse_tree()` function on each one.

The `_traverse_tree()` function, defined on line 4, is a recursive function that takes a single test data point and the current node (initially the root). If the current node is a leaf node, it returns the value of the leaf node as the prediction. If not, based on the splitting criteria of that node, it recursively calls the `_traverse_tree()` function on either the left or the right child node.
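The traversal can be tried out on a small hand-built tree. The sketch below is an illustration under assumptions: the tree is constructed manually for the toy walk dataset (split on `Rain <= 0`, then on `Time <= 10`), labels are encoded as 1 for `Yes` and 0 for `No`, and `traverse_tree` is a standalone version of the method. A fitted tree may choose a different but equally good structure.

```python
import numpy as np


class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right
        self.value = value

    def is_leaf_node(self):
        return self.value is not None


def traverse_tree(x, node):
    """Follow the splits from `node` down to a leaf and return its value."""
    if node.is_leaf_node():
        return node.value
    if x[node.feature] <= node.threshold:
        return traverse_tree(x, node.left)
    return traverse_tree(x, node.right)


# Hand-built tree: feature 0 is Rain, feature 1 is Time.
no_rain = Node(feature=1, threshold=10, left=Node(value=0), right=Node(value=1))
root = Node(feature=0, threshold=0, left=no_rain, right=Node(value=0))

X = np.array([
    [1, 30], [1, 15], [1, 5], [0, 10], [0, 5],
    [0, 15], [0, 20], [0, 25], [0, 30], [0, 35],
])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

preds = np.array([traverse_tree(x, root) for x in X])
print(preds)
```

This particular tree reproduces every label in the toy table: rainy days lead to `No`, and dry days lead to `Yes` only when the walk takes longer than 10 minutes.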

Putting everything we developed together in a single class, we have:

In [11]:
from collections import Counter

import numpy as np


def entropy(y):
    hist = np.bincount(y)
    ps = hist / len(y)
    return -np.sum([p * np.log2(p) for p in ps if p > 0])


class Node:
    def __init__(
        self, feature=None, threshold=None, left=None, right=None, *, value=None
    ):
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right
        self.value = value

    def is_leaf_node(self):
        return self.value is not None


class DecisionTree:
    def __init__(self, min_samples_split=2, max_depth=100, n_feats=None):
        self.min_samples_split = min_samples_split
        self.max_depth = max_depth
        self.n_feats = n_feats
        self.root = None

    def fit(self, X, y):
        self.n_feats = X.shape[1] if not self.n_feats else min(self.n_feats, X.shape[1])
        self.root = self._grow_tree(X, y)

    def predict(self, X):
        return np.array([self._traverse_tree(x, self.root) for x in X])

    def _grow_tree(self, X, y, depth=0):
        n_samples, n_features = X.shape
        n_labels = len(np.unique(y))

        # stopping criteria
        if (
            depth >= self.max_depth
            or n_labels == 1
            or n_samples < self.min_samples_split
        ):
            leaf_value = self._most_common_label(y)
            return Node(value=leaf_value)

        feat_idxs = np.random.choice(n_features, self.n_feats, replace=False)

        # greedily select the best split according to information gain
        best_feat, best_thresh = self._best_criteria(X, y, feat_idxs)

        # grow the children that result from the split
        left_idxs, right_idxs = self._split(X[:, best_feat], best_thresh)
        left = self._grow_tree(X[left_idxs, :], y[left_idxs], depth + 1)
        right = self._grow_tree(X[right_idxs, :], y[right_idxs], depth + 1)
        return Node(best_feat, best_thresh, left, right)

    def _best_criteria(self, X, y, feat_idxs):
        best_gain = -1
        split_idx, split_thresh = None, None
        for feat_idx in feat_idxs:
            X_column = X[:, feat_idx]
            thresholds = np.unique(X_column)
            for threshold in thresholds:
                gain = self._information_gain(y, X_column, threshold)

                if gain > best_gain:
                    best_gain = gain
                    split_idx = feat_idx
                    split_thresh = threshold

        return split_idx, split_thresh

    def _information_gain(self, y, X_column, split_thresh):
        # parent loss
        parent_entropy = entropy(y)

        # generate split
        left_idxs, right_idxs = self._split(X_column, split_thresh)

        if len(left_idxs) == 0 or len(right_idxs) == 0:
            return 0

        # compute the weighted avg. of the loss for the children
        n = len(y)
        n_l, n_r = len(left_idxs), len(right_idxs)
        e_l, e_r = entropy(y[left_idxs]), entropy(y[right_idxs])
        child_entropy = (n_l / n) * e_l + (n_r / n) * e_r

        # information gain is difference in loss before vs. after split
        ig = parent_entropy - child_entropy
        return ig

    def _split(self, X_column, split_thresh):
        left_idxs = np.argwhere(X_column <= split_thresh).flatten()
        right_idxs = np.argwhere(X_column > split_thresh).flatten()
        return left_idxs, right_idxs

    def _traverse_tree(self, x, node):
        if node.is_leaf_node():
            return node.value

        if x[node.feature] <= node.threshold:
            return self._traverse_tree(x, node.left)
        return self._traverse_tree(x, node.right)

    def _most_common_label(self, y):
        counter = Counter(y)
        most_common = counter.most_common(1)[0][0]
        return most_common

## Loading a Dataset and Testing Our Implementation

Let us first import the breast cancer dataset and `train_test_split` function from the `scikit-learn` library:

In [12]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

Now we prepare the data:

In [13]:
data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234)

We will be assessing our model with accuracy. Let's define it:

In [14]:
def accuracy(y_true, y_pred):
    accuracy = np.sum(y_true == y_pred) / len(y_true)
    return accuracy

Finally, we initiate a `DecisionTree` instance, fit it, use it for prediction, and obtain the accuracy:

In [15]:
clf = DecisionTree(max_depth=10)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
acc = accuracy(y_test, y_pred)

print("Accuracy:", acc)

Accuracy: 0.9210526315789473


As you can see, our implementation reaches about 92% accuracy on the held-out test set!
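For context, it can be useful to compare against a library implementation. The sketch below trains scikit-learn's `DecisionTreeClassifier` with the entropy criterion on the same train/test split. This comparison is an addition for illustration: the variable names `sk_clf` and `sk_acc` and the choice of `random_state=0` are assumptions, and the library's split search differs in its details, so the accuracy need not match ours exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same data and split as above.
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=1234
)

# Entropy-based tree with the same depth limit as our DecisionTree.
sk_clf = DecisionTreeClassifier(criterion="entropy", max_depth=10, random_state=0)
sk_clf.fit(X_train, y_train)

sk_acc = accuracy_score(y_test, sk_clf.predict(X_test))
print("scikit-learn accuracy:", sk_acc)
```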

## What We Have Learned

In this discussion, we did the following:

* Revisited the core principles of learning decision trees.
* Imported a dataset from `scikit-learn`.
* Generated predictions using the trained tree.
* Measured performance by accuracy.

Hope you have enjoyed this lesson!
