## Decision Trees

Decision trees are versatile and interpretable machine learning models used for both classification and regression tasks. They work by recursively partitioning the input space into regions, assigning a label or value to each region based on the majority class or average target value of the training data within that region. In this section, we'll explore the mathematical formulation of decision trees and delve into two commonly used criteria for splitting nodes: entropy and Gini impurity.

### Mathematical Formulation

A decision tree is a rooted tree, binary in algorithms such as CART, in which each internal node tests a feature, each branch corresponds to one outcome of that test, and each leaf node holds a class label (in classification) or a predicted value (in regression). We denote the decision tree as $T$.
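The structure described above can be sketched as a small data type plus a traversal routine. This is a minimal illustration, not a reference implementation: the names `Node` and `predict_one`, and the convention that a sample goes left when `x[feature] <= threshold`, are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Union


@dataclass
class Node:
    """One node of a binary decision tree (hypothetical layout)."""
    feature: Optional[int] = None             # index of the feature tested here
    threshold: Optional[float] = None         # go left if x[feature] <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: Optional[Union[int, float]] = None  # set only on leaf nodes

    def is_leaf(self) -> bool:
        return self.value is not None


def predict_one(node: Node, x: Sequence[float]) -> Union[int, float]:
    """Walk from the root to a leaf and return the leaf's stored value."""
    while not node.is_leaf():
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value


# A hand-built tree with two internal nodes and three leaves.
tree = Node(
    feature=0, threshold=2.5,
    left=Node(value=0),
    right=Node(feature=1, threshold=1.0, left=Node(value=1), right=Node(value=2)),
)

print(predict_one(tree, [1.0, 5.0]))
print(predict_one(tree, [3.0, 0.5]))
print(predict_one(tree, [3.0, 2.0]))
```

Each prediction follows exactly one root-to-leaf path, so its cost is proportional to the depth of the tree.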

### Entropy Criterion

Entropy is a measure of impurity in a set of samples. For a classification problem with $K$ distinct classes, the entropy of a set $S$ with respect to class distribution $p_1, p_2, \ldots, p_K$ is defined as:

$$
H(S) = -\sum_{i=1}^{K} p_i \log_2(p_i)
$$

where $p_i$ is the proportion of samples in class $i$ within set $S$.

The information gain (IG) is used to evaluate the quality of a split based on entropy. Given a set $S$ of samples and a split $V$ of $S$ into subsets $S_1, S_2, \ldots, S_V$, the information gain is defined as:

$$
IG(S, V) = H(S) - \sum_{v=1}^{V} \frac{|S_v|}{|S|} H(S_v)
$$

The decision tree algorithm aims to maximize the information gain when choosing the best split at each node.
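The two formulas above translate almost directly into code. The sketch below assumes labels are given as plain Python lists or NumPy arrays and that every subset is non-empty; the function names `entropy` and `information_gain` are chosen for this example.

```python
import numpy as np


def entropy(labels):
    """H(S) = -sum_i p_i * log2(p_i), over the classes present in `labels`."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def information_gain(parent, subsets):
    """IG = H(parent) minus the size-weighted entropy of the subsets."""
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted


parent = [0, 0, 1, 1]
perfect = information_gain(parent, [[0, 0], [1, 1]])  # each subset is pure
useless = information_gain(parent, [[0, 1], [0, 1]])  # subsets mirror the parent
print(perfect, useless)
```

A split that separates the classes completely recovers the full parent entropy as gain, while a split whose subsets have the same class mix as the parent gains nothing.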

### Gini Impurity Criterion

Gini impurity is another measure of impurity. For a classification problem with $K$ classes, the Gini impurity of a set $S$ is defined as:

$$
Gini(S) = 1 - \sum_{i=1}^{K} p_i^2
$$

where $p_i$ is the proportion of samples in class $i$ within set $S$.

Similar to entropy, the Gini impurity can be used to compute the impurity of a split. Given a set $S$ of samples and a split $V$ into subsets $S_1, S_2, \ldots, S_V$, the Gini impurity of the split is defined as:

$$
Gini(S, V) = \sum_{v=1}^{V} \frac{|S_v|}{|S|} Gini(S_v)
$$

The decision tree algorithm seeks to minimize the Gini impurity when selecting the best split.
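As a rough illustration of minimizing the split impurity, the sketch below scans candidate thresholds on a single numeric feature and keeps the one with the lowest weighted Gini impurity. Taking midpoints between consecutive distinct values as candidates is a common convention assumed here, and the helper names are hypothetical.

```python
import numpy as np


def gini(labels):
    """Gini(S) = 1 - sum_i p_i^2; an empty set is treated as pure."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))


def split_gini(subsets):
    """Size-weighted Gini impurity of a split into the given subsets."""
    n = sum(len(s) for s in subsets)
    return sum(len(s) / n * gini(s) for s in subsets)


def best_threshold(x, y):
    """Return (threshold, weighted Gini) of the best binary split on x."""
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    values = np.unique(x)
    candidates = (values[:-1] + values[1:]) / 2  # midpoints between neighbors
    best = (None, float("inf"))
    for t in candidates:
        score = split_gini([y[x <= t], y[x > t]])
        if score < best[1]:
            best = (float(t), score)
    return best


x = [1.0, 2.0, 3.0, 4.0]
y = [0, 0, 1, 1]
threshold, score = best_threshold(x, y)
print(threshold, score)
```

A full tree learner would repeat this search over every feature at every node and recurse on the resulting subsets.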

Both criteria are widely used in decision tree algorithms: CART (Classification and Regression Trees) typically uses Gini impurity, while ID3 and C4.5 use entropy-based measures (C4.5 uses the gain ratio, a normalized form of information gain). These criteria guide the construction of the tree by greedily selecting, at each node, the feature and split that maximize information gain or minimize impurity.

Decision trees provide interpretable models and are a fundamental building block for ensemble methods like random forests and gradient boosting.


## Derivation of Gini Impurity

The Gini impurity is a measure of impurity or disorder used in decision trees for classification problems. It is the probability of misclassifying a randomly selected sample from a set when that sample is labeled at random according to the class distribution of the set. Let's derive the formula for Gini impurity step by step.

### Step 1: Notation

For a classification problem with $K$ distinct classes, let $p_i$ denote the proportion of samples in set $S$ that belong to class $i$, so that $\sum_{i=1}^{K} p_i = 1$.

### Step 2: Define Gini Impurity

The Gini impurity for a set $S$, denoted as $Gini(S)$, is defined as:

$$
Gini(S) = 1 - \sum_{i=1}^{K} p_i^2
$$

### Step 3: Interpretation

The Gini impurity is the probability of misclassifying a randomly selected sample from set $S$ when its label is drawn at random from the class distribution of $S$. The two terms read as follows:

- $1$ is the total probability over all outcomes (correct and incorrect classifications together).
- $\sum_{i=1}^{K} p_i^2$ is the probability that the sample is classified correctly: it belongs to class $i$ with probability $p_i$ and, independently, is assigned label $i$ with probability $p_i$, which contributes $p_i^2$ for each class.

### Step 4: Derivation

Let's derive $Gini(S)$ explicitly:

$$
\begin{align*}
Gini(S) &= 1 - \sum_{i=1}^{K} p_i^2 \\
&= 1 - (p_1^2 + p_2^2 + \ldots + p_K^2) \\
&= 1 - p_1^2 - p_2^2 - \ldots - p_K^2
\end{align*}
$$
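The expansion above only rewrites the definition. A short derivation from the misclassification probability may make the formula easier to motivate. It assumes the labeling model described earlier: a sample of class $i$ is drawn with probability $p_i$ and is then assigned a label drawn independently from the same distribution, so it is mislabeled with probability $1 - p_i$.

$$
\begin{align*}
Gini(S) &= \sum_{i=1}^{K} p_i (1 - p_i) \\
&= \sum_{i=1}^{K} p_i - \sum_{i=1}^{K} p_i^2 \\
&= 1 - \sum_{i=1}^{K} p_i^2
\end{align*}
$$

The last step uses $\sum_{i=1}^{K} p_i = 1$.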

### Step 5: Interpretation (Revisited)

- $1$ is the total probability over all outcomes.
- $p_1^2$ is the probability that a sample belongs to class $1$ and is also assigned label $1$, that is, a correct classification as class $1$.
- $p_2^2$ is the corresponding probability for class $2$, and so on.
- Subtracting the sum of these terms from $1$ leaves the probability of misclassification.

### Step 6: Conclusion

The Gini impurity ($Gini(S)$) quantifies the impurity or disorder in a set $S$ by measuring the probability of misclassification when a randomly selected sample from that set is labeled according to the set's class distribution. A lower Gini impurity indicates a purer set, while a higher Gini impurity implies more mixing of classes within the set. It ranges from $0$ for a pure set to $1 - \frac{1}{K}$ for a uniform class distribution.


## Derivation of Entropy Criterion

The Entropy criterion is used in decision trees for classification problems. It measures the impurity or disorder within a set and quantifies the uncertainty associated with class labels. Let's derive the formula for Entropy step by step.

### Step 1: Notation

For a classification problem with $K$ distinct classes, let $p_i$ denote the proportion of samples in set $S$ that belong to class $i$, so that $\sum_{i=1}^{K} p_i = 1$.

### Step 2: Define Entropy

The entropy of a set $S$, denoted as $Entropy(S)$ (and written $H(S)$ in the information gain formula), is defined as:

$$
Entropy(S) = - \sum_{i=1}^{K} p_i \log_2(p_i)
$$

### Step 3: Interpretation

The Entropy criterion measures the average amount of information, in bits, needed to identify the class of a randomly selected sample from set $S$. The parts of the formula read as follows:

- Each term $p_i \log_2(p_i)$ is at most zero because $0 \le p_i \le 1$, so the negative sign in front of the summation makes Entropy non-negative.
- $-\log_2(p_i)$ is the information content (surprisal) of observing class $i$. Weighting it by $p_i$ and summing over classes gives the expected information content. Terms with $p_i = 0$ are taken to be $0$.

### Step 4: Derivation

Let's derive $Entropy(S)$ explicitly:

$$
\begin{align*}
Entropy(S) &= - \sum_{i=1}^{K} p_i \log_2(p_i) \\
&= - (p_1 \log_2(p_1) + p_2 \log_2(p_2) + \ldots + p_K \log_2(p_K))
\end{align*}
$$
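A short worked example may help here. The class proportions below are illustrative values chosen for this example, not data from the notebook. For a two-class set with $p_1 = \frac{1}{4}$ and $p_2 = \frac{3}{4}$:

$$
\begin{align*}
Entropy(S) &= -\left(\tfrac{1}{4} \log_2 \tfrac{1}{4} + \tfrac{3}{4} \log_2 \tfrac{3}{4}\right) \\
&= \tfrac{1}{4} \cdot 2 + \tfrac{3}{4} \cdot \log_2 \tfrac{4}{3} \\
&\approx 0.5 + 0.75 \cdot 0.415 \\
&\approx 0.811
\end{align*}
$$

For comparison, a balanced set with $p_1 = p_2 = \frac{1}{2}$ has entropy $1$, and a pure set has entropy $0$.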

### Step 5: Interpretation (Revisited)

- The negative sign makes entropy non-negative, since every term $p_i \log_2(p_i)$ is at most zero. The purer a set is, the smaller its entropy.
- $-p_1 \log_2(p_1)$ is the contribution of class $1$: the information content $-\log_2(p_1)$ of observing class $1$, weighted by its probability $p_1$.
- $-p_2 \log_2(p_2)$ is the corresponding contribution of class $2$, and so on.
- The summation is the expected information content across all classes.

### Step 6: Conclusion

The Entropy criterion ($Entropy(S)$) quantifies the impurity or disorder in a set $S$ by measuring the average information needed to identify the class of a randomly selected sample from that set. Lower entropy indicates a purer set, while higher entropy implies more mixing of classes within the set. Entropy ranges from $0$ for a pure set to $\log_2 K$ for a uniform class distribution.
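To see how the two impurity measures behave side by side, the sketch below evaluates both for a two-class set as the proportion $p$ of one class varies. The function names and the sampled values of $p$ are choices made for this illustration.

```python
import numpy as np


def binary_gini(p):
    """Gini impurity of a two-class set with class proportions (p, 1 - p)."""
    return 1.0 - (p ** 2 + (1.0 - p) ** 2)


def binary_entropy(p):
    """Entropy in bits of a two-class set; zero-probability terms are dropped."""
    q = np.array([p, 1.0 - p])
    q = q[q > 0]
    return float(-np.sum(q * np.log2(q))) + 0.0  # + 0.0 turns -0.0 into 0.0


for p in (0.0, 0.1, 0.25, 0.5):
    print(f"p={p:.2f}  gini={binary_gini(p):.3f}  entropy={binary_entropy(p):.3f}")
```

Both measures are zero for a pure set and largest for a balanced one; they differ in scale, with a two-class maximum of $0.5$ for Gini impurity and $1$ bit for entropy.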


## Derivation of Entropy Criterion from Information Gain

The Entropy criterion is used in decision trees for classification problems to measure impurity or disorder within a set. It is closely tied to Information Gain, which quantifies the reduction in uncertainty about a class label when a set is split based on a specific attribute. Let's work from Information Gain to the Entropy splitting criterion step by step.

### Step 1: Notation

For a classification problem with $K$ distinct classes, let $p_i$ denote the proportion of samples in set $S$ that belong to class $i$, so that $\sum_{i=1}^{K} p_i = 1$.

### Step 2: Define Information Gain

The Information Gain (IG) when splitting set $S$ into subsets $S_1, S_2, \ldots, S_m$ based on an attribute $A$ is defined as:

$$
IG(S, A) = H(S) - \sum_{i=1}^{m} \frac{|S_i|}{|S|} \cdot H(S_i)
$$

Where:
- $H(S)$ is the entropy of set $S$.
- $H(S_i)$ is the entropy of subset $S_i$.
- $|S|$ is the total number of samples in set $S$.
- $|S_i|$ is the number of samples in subset $S_i$.

### Step 3: Define Entropy

The entropy of a set $S$, denoted as $Entropy(S)$ and written $H(S)$ in the Information Gain formula, is defined as:

$$
Entropy(S) = - \sum_{i=1}^{K} p_i \log_2(p_i)
$$

### Step 4: Derivation of Information Gain

Let's derive $IG(S, A)$ using the definition of entropy:

$$
\begin{align*}
IG(S, A) &= H(S) - \sum_{i=1}^{m} \frac{|S_i|}{|S|} \cdot H(S_i) \\
&= -\sum_{i=1}^{K} p_i \log_2(p_i) - \sum_{i=1}^{m} \frac{|S_i|}{|S|} \left(-\sum_{j=1}^{K_i} p_{ij} \log_2(p_{ij})\right)
\end{align*}
$$

Where:
- $K$ is the total number of classes.
- $K_i$ is the number of classes in subset $S_i$.
- $p_i$ is the proportion of samples in class $i$ within set $S$.
- $p_{ij}$ is the proportion of samples in class $j$ within subset $S_i$.

### Step 5: Further Derivation

Simplify the expression:

$$
\begin{align*}
IG(S, A) &= -\sum_{i=1}^{K} p_i \log_2(p_i) + \sum_{i=1}^{m} \frac{|S_i|}{|S|} \sum_{j=1}^{K_i} p_{ij} \log_2(p_{ij}) \\
&= -\sum_{i=1}^{K} p_i \log_2(p_i) + \sum_{i=1}^{m} \sum_{j=1}^{K_i} \frac{|S_i|}{|S|} \, p_{ij} \log_2(p_{ij})
\end{align*}
$$

### Step 6: Interpreting Information Gain

The Information Gain $IG(S, A)$ quantifies the reduction in uncertainty about the class labels achieved by splitting set $S$ based on attribute $A$. A higher $IG$ indicates a better attribute for the split, as it reduces the entropy (uncertainty) within subsets $S_1, S_2, \ldots, S_m$.
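The attribute comparison described here can be sketched by grouping the labels on each attribute's values and computing the gain per attribute. The toy labels, the attribute names `A` and `B`, and the helper names are all hypothetical and serve only to illustrate the computation.

```python
import numpy as np


def entropy(labels):
    """H(S) in bits, computed over the classes present in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def information_gain_by_attribute(attribute, labels):
    """IG(S, A): split S by the distinct values of attribute A."""
    attribute, labels = np.asarray(attribute), np.asarray(labels)
    weighted = 0.0
    for value in np.unique(attribute):
        subset = labels[attribute == value]  # the subset S_i for this value
        weighted += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - weighted


labels = np.array([1, 1, 0, 0, 1, 0])
attributes = {
    "A": np.array(["x", "x", "y", "y", "x", "y"]),  # separates the classes
    "B": np.array(["u", "v", "u", "v", "u", "v"]),  # nearly uninformative
}

gains = {name: information_gain_by_attribute(col, labels)
         for name, col in attributes.items()}
best = max(gains, key=gains.get)
print(gains, best)
```

Attribute `A` yields pure subsets and therefore the full parent entropy as gain, so it would be chosen for the split.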

### Step 7: Entropy Criterion

Now, the Entropy criterion for decision tree splitting can be defined as:

$$
EntropyCriterion(S, A) = - \sum_{i=1}^{K} p_i \log_2(p_i) - IG(S, A)
$$

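Where $A$ is the attribute being considered for the split. Substituting the definition of $IG(S, A)$ shows that this criterion equals the weighted average entropy of the subsets, $\sum_{i=1}^{m} \frac{|S_i|}{|S|} H(S_i)$, so minimizing it is equivalent to maximizing the Information Gain.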


In [None]:
import numpy as np

def gini_impurity(y):
    # Class proportions p_i; np.bincount requires non-negative integer labels.
    y = np.asarray(y)
    probabilities = np.bincount(y) / y.shape[0]
    gini = 1 - np.sum(probabilities**2)
    print(f"probabilities: {probabilities}")
    print(f"probabilities squared: {probabilities**2}")
    print(f"gini impurity: {gini}")
    return gini

def entropy_impurity(y):
    y = np.asarray(y)
    probabilities = np.bincount(y) / y.shape[0]
    # Drop absent classes: log2(0) is undefined and 0 * log2(0) is taken as 0.
    # (np.log2(..., where=...) without out= leaves those entries uninitialized.)
    nonzero = probabilities[probabilities > 0]
    log_probabilities = np.log2(nonzero)
    entropy = -np.sum(nonzero * log_probabilities) + 0.0  # + 0.0 avoids printing -0.0
    print(f"probabilities: {probabilities}")
    print(f"log probabilities: {log_probabilities}")
    print(f"p*log_2 p: {nonzero * log_probabilities}")
    print(f"entropy impurity: {entropy}")
    return entropy

# Balanced, imbalanced, and pure label sets.
for y in ([0, 0, 1, 1], [0, 0, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1]):
    gini_impurity(y)
    print("===================")
    entropy_impurity(y)
    print()

Imagine an experiment with $k$ possible output categories. Category $j$ has a probability of occurrence $p(j|t)$ (where $j=1,\ldots,k$ and $t$ denotes the tree node under consideration).

Run the experiment twice, independently, and make these observations:

1. The probability of obtaining two identical outputs of category $j$ is $p^2(j|t)$.
2. The probability of obtaining two identical outputs, independently of their category, is: $\sum_{j=1}^{k} p^2(j|t)$.
3. The probability of obtaining two different outputs is thus: $1 - \sum_{j=1}^{k} p^2(j|t)$.

That's it: the Gini impurity is simply the probability of obtaining two different outputs, which is an "impurity measure".

https://stats.stackexchange.com/questions/308885/a-simple-clear-explanation-of-the-gini-impurity
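The two-draw interpretation above can be checked numerically. The sketch below uses illustrative class probabilities and a fixed random seed, both chosen for this example, and compares the closed-form Gini impurity with the observed fraction of independent draw pairs that differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative class probabilities p(j|t) for k = 3 categories.
p = np.array([0.2, 0.3, 0.5])

# Closed form: 1 - sum_j p(j|t)^2
gini = 1.0 - np.sum(p ** 2)

# Simulate two independent draws per trial and count how often they differ.
n_trials = 200_000
first = rng.choice(len(p), size=n_trials, p=p)
second = rng.choice(len(p), size=n_trials, p=p)
estimate = np.mean(first != second)

print(f"closed form: {gini:.4f}")
print(f"simulated:   {estimate:.4f}")
```

With this many trials the simulated fraction should land close to the closed-form value, up to sampling noise.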

In [2]:
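import numpy as np

from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy data with one categorical (string) column. NumPy stores mixed rows
# with a single dtype, so every entry of X becomes a string.
X = np.array([[1, 2, "good"], [4, 5, "bad"]])
y = np.array([1, 0])
print(X[0, :])

# DecisionTreeClassifier needs numeric features. Calling clf.fit(X, y) on the
# raw array raises: ValueError: could not convert string to float: 'good'
# Cast the numeric columns to float and encode the categorical column as integers.
X_num = X[:, :2].astype(float)
X_cat = OrdinalEncoder().fit_transform(X[:, [2]])
X_encoded = np.hstack([X_num, X_cat])

clf = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
clf.fit(X_encoded, y)

['1' '2' 'good']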