# ML, Data Analysis
### Machine learning: Decision tree, node splitting

**Node splitting** is a fundamental mechanism in **decision trees**, which can be trained for both **classification** and **regression**. 
<br>To build a decision tree, we must decide, at every node except the leaves, which feature to split the data on so that the child nodes receive the purest (most homogeneous) subsets. 
<br>Here, purity is measured by a criterion such as *Gini impurity*, *information gain* (entropy), or *variance reduction* (for regression trees). The measure is computed on the labels (values) of each subset. 
<br>We talked about Gini impurity in an earlier post. Assume a node of the tree receives the dataset $X$. Its Gini impurity is computed by:
<div style="margin-top:4px"></div>
$\large Gini(X)=1-\sum_{i=1}^K p_i^2$
<div style="margin-bottom:4px"></div>
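where $p_i$ is the fraction of samples in $X$ with label (class) $i$, and $K$ is the number of classes.
<br>Assume we choose feature $A$ with threshold $t$ for splitting the dataset $X$. We then get two subsets, $X_{left}$ (samples with $A \le t$) and $X_{right}$ (the remaining samples), and the weighted Gini impurity of the split is: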
<div style="margin-top:4px"></div>
$\large Gini_{split}(A,t)=\frac{|X_{left}|}{|X|}Gini(X_{left})+\frac{|X_{right}|}{|X|}Gini(X_{right})$
<div style="margin-bottom:4px"></div>
where $|.|$ returns the number of samples in its argument.
<br>For node splitting, we look for the feature $A^*$ and threshold $t^*$ that minimize the weighted Gini impurity defined above:
<div style="margin-top:4px"></div>
$\large (A^*,t^*)=\arg\min_{(A,t)} Gini_{split}(A,t)$
<div style="margin-bottom:4px"></div>
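
As a small worked example of the two formulas above, the sketch below evaluates $Gini(X)$ and $Gini_{split}$ for a hypothetical node. The class names `"a"` and `"b"`, the counts, and the helper names `gini` and `gini_split` are assumptions chosen for illustration; exact fractions are used so the arithmetic is easy to check by hand.

```python
from collections import Counter
from fractions import Fraction

def gini(labels):
    """Gini impurity 1 - sum(p_i^2) of a list of class labels, as an exact fraction."""
    n = len(labels)
    return 1 - sum(Fraction(c, n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Size-weighted Gini impurity of the two children of a split."""
    n = len(left) + len(right)
    return Fraction(len(left), n) * gini(left) + Fraction(len(right), n) * gini(right)

# Hypothetical node: 6 samples of class "a" and 4 of class "b"
left = ["a"] * 4                  # pure child
right = ["a"] * 2 + ["b"] * 4     # mixed child
parent = left + right

print(gini(parent))               # 12/25 (= 0.48)
print(gini(left), gini(right))    # 0 4/9
print(gini_split(left, right))    # 4/15 (about 0.267)
```

The split lowers the impurity from 0.48 to about 0.267, which is why a candidate like this one would be preferred over leaving the node unsplit.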

**Hint**: We stop splitting at a node when:
 - The maximum depth requested for the decision tree is reached.
 - The number of samples in the node reaches the minimum allowed.
 - The node is pure: $Gini(X)=0$
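
The stopping conditions listed above could be checked with a small predicate like the one sketched here. The function name `should_stop`, the parameter names `max_depth` and `min_samples`, their default values, and the choice to stop when a node has fewer than `min_samples` samples are all assumptions made for illustration.

```python
from collections import Counter

def should_stop(labels, depth, max_depth=3, min_samples=2):
    """Return True if a node holding `labels` at `depth` should become a leaf.

    max_depth and min_samples are hypothetical hyperparameter names.
    """
    if depth >= max_depth:            # maximum depth reached
        return True
    if len(labels) < min_samples:     # too few samples to split further
        return True
    if len(Counter(labels)) <= 1:     # node is pure, so Gini(X) = 0
        return True
    return False

print(should_stop(["No", "No", "No"], depth=1))      # True: pure node
print(should_stop(["Yes", "No", "Yes"], depth=1))    # False: mixed, can still split
print(should_stop(["Yes", "No", "Yes"], depth=3))    # True: depth limit reached
```
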
<hr>

In the following, we use Gini impurity to split a toy dataset: we want to find the feature and threshold that produce the minimum weighted Gini impurity for the split. We assume that all features are numerical, and we do node splitting for classification.

<hr>
https://github.com/ostad-ai/Machine-Learning
<br> Explanation: https://www.pinterest.com/HamedShahHosseini/Machine-Learning/background-knowledge

In [1]:
# Import the required modules
import numpy as np
import pandas as pd

In [2]:
# Calculate the Gini Impurity for a numpy array
def gini_impurity(y):
    classes, counts = np.unique(y, return_counts=True)
    probabilities = counts / len(y)
    return 1 - np.sum(probabilities ** 2)

In [3]:
# Find the best binary split for data matrix X with labels y
# X holds one sample per row and one feature per column
# Samples with feature value <= threshold go left, the rest go right
# Candidate thresholds are the distinct observed values of each feature
# Returns the best feature index, the best threshold, and the weighted Gini of that split
def find_best_split(X, y):
    best_gini = float('inf')
    best_feature, best_threshold = None, None
    
    for feature_idx in range(X.shape[1]):
        thresholds = np.unique(X[:, feature_idx])
        for threshold in thresholds:
            left_mask = X[:, feature_idx] <= threshold
            right_mask = ~left_mask
            
            if np.sum(left_mask) == 0 or np.sum(right_mask) == 0:
                continue
            
            gini_left = gini_impurity(y[left_mask])
            gini_right = gini_impurity(y[right_mask])
            weighted_gini = (len(y[left_mask]) * gini_left
                             + len(y[right_mask]) * gini_right) / len(y)
            
            if weighted_gini < best_gini:
                best_gini = weighted_gini
                best_feature = feature_idx
                best_threshold = threshold
                
    return best_feature, best_threshold, best_gini
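
The function above uses the observed feature values themselves as candidate thresholds. A common variant, shown here only as a sketch and not used in this notebook, places each candidate halfway between two consecutive distinct values. The helper name `midpoint_thresholds` and the sample values are assumptions for illustration.

```python
def midpoint_thresholds(values):
    """Midpoints between consecutive distinct values of one numerical feature."""
    distinct = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]

ages = [20, 25, 30, 35, 40]            # illustrative values only
print(midpoint_thresholds(ages))       # [22.5, 27.5, 32.5, 37.5]
```

With the `<=` rule, a midpoint sends the same training samples left and right as the observed value just below it, so the weighted Gini values do not change. The difference only shows up for unseen values that fall between two observed ones.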

In [4]:
# Define the toy dataset
data = {
    'Age':       [20, 25, 30, 35, 40, 45, 50, 55, 60, 65],  # Age 
    'Income ($k)': [25, 70, 32, 80, 36, 90, 40, 95, 45, 100],  # Income
    'Buys':      ['No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes']
}

df = pd.DataFrame(data)
display(df)

,Age,Income ($k),Buys
0,20,25,No
1,25,70,No
2,30,32,No
3,35,80,Yes
4,40,36,Yes
5,45,90,Yes
6,50,40,No
7,55,95,No
8,60,45,Yes
9,65,100,Yes


In [5]:
# Get the best feature and best threshold for the toy dataset

# Extract data matrix (X) and labels (y)
X = df[['Age', 'Income ($k)']].values
y = df['Buys'].values

# Run the split-finding function
best_feature_idx, best_threshold, best_gini = find_best_split(X, y)

# Map feature index to name
feature_names = ['Age', 'Income ($k)']
best_feature = feature_names[best_feature_idx]
print(f'--- Having two features {feature_names} ----')
print(f"Best split feature: {best_feature}")
print(f"Best threshold: {best_threshold}")
print(f"Best Gini Impurity: {best_gini:.4f}")

--- Having two features ['Age', 'Income ($k)'] ----
Best split feature: Age
Best threshold: 30
Best Gini Impurity: 0.2857


In [6]:
# Let's see the data split by the best feature and threshold
display(df[df[best_feature]<=best_threshold])
display(df[df[best_feature]>best_threshold])

,Age,Income ($k),Buys
0,20,25,No
1,25,70,No
2,30,32,No


,Age,Income ($k),Buys
3,35,80,Yes
4,40,36,Yes
5,45,90,Yes
6,50,40,No
7,55,95,No
8,60,45,Yes
9,65,100,Yes
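
To illustrate how such a split might be used, the sketch below turns it into a one-split tree (a decision stump) that predicts the majority label of each child. This goes one step beyond what the notebook computes; the helper names `majority` and `predict` are assumptions, and the data and threshold are copied from the cells above.

```python
from collections import Counter

# Toy dataset from the notebook (Age and Buys columns)
ages = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
buys = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "Yes", "Yes"]
threshold = 30  # best threshold on Age reported above

def majority(labels):
    """Most frequent label in a list."""
    return Counter(labels).most_common(1)[0][0]

# Each child predicts its own majority label
left_label = majority([b for a, b in zip(ages, buys) if a <= threshold])
right_label = majority([b for a, b in zip(ages, buys) if a > threshold])

def predict(age):
    """Predict with the one-split tree."""
    return left_label if age <= threshold else right_label

accuracy = sum(predict(a) == b for a, b in zip(ages, buys)) / len(buys)
print(left_label, right_label, accuracy)   # No Yes 0.8
```

The left child is pure, so all of its samples are predicted correctly. The two misclassified samples are the "No" rows in the right child, which a further split of that child could try to separate.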


In [7]:
# Extra
# Let's see what the Gini impurity would be if we had only the "Income ($k)" feature
X = df[['Income ($k)']].values
y = df['Buys'].values

# Run the split-finding function
best_feature_idx, best_threshold, best_gini = find_best_split(X, y)

# Map feature index to name
feature_names = ['Income ($k)']
best_feature = feature_names[best_feature_idx]
print(f'---- When we have only feature: {feature_names[0]} ----')
print(f"Best threshold: {best_threshold}")
print(f"Best Gini Impurity: {best_gini:.4f}")

---- When we have only feature: Income ($k) ----
Best threshold: 32
Best Gini Impurity: 0.3750
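
The two results above can be checked by hand from the class counts in each child. The sketch below does that with exact fractions and also reports how much each split lowers the impurity of the full dataset (5 "Yes", 5 "No"). The helper name `gini_from_counts` is an assumption for illustration; the counts are read off the toy dataset.

```python
from fractions import Fraction

def gini_from_counts(counts):
    """Gini impurity from a list of per-class sample counts."""
    n = sum(counts)
    return 1 - sum(Fraction(c, n) ** 2 for c in counts)

parent = gini_from_counts([5, 5])   # whole dataset: 5 "Yes", 5 "No"

# Age <= 30: left has 3 "No"; right has 2 "No" and 5 "Yes"
age_split = (Fraction(3, 10) * gini_from_counts([3])
             + Fraction(7, 10) * gini_from_counts([2, 5]))

# Income <= 32: left has 2 "No"; right has 3 "No" and 5 "Yes"
income_split = (Fraction(2, 10) * gini_from_counts([2])
                + Fraction(8, 10) * gini_from_counts([3, 5]))

print(f"Age:    split={float(age_split):.4f}  decrease={float(parent - age_split):.4f}")
print(f"Income: split={float(income_split):.4f}  decrease={float(parent - income_split):.4f}")
# Age:    split=0.2857  decrease=0.2143
# Income: split=0.3750  decrease=0.1250
```

Both splits isolate a pure "No" group on the left, but the Age split isolates three samples instead of two, which leaves a less mixed right child and a lower weighted Gini.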
