Problem 1: Decision Tree for Classification (with Gini Index)
Problem Statement
You are tasked with implementing a decision tree classifier from scratch using NumPy, using the Gini index as the splitting criterion. The decision tree will classify synthetic 2D data points into two classes based on feature thresholds. The implementation will include tree construction, prediction, and evaluation using accuracy.
Mathematical Definition:

Gini Index for a node:
$$\text{Gini} = 1 - \sum_{i=1}^c p_i^2$$
where $p_i$ is the proportion of class $i$ in the node, and $c$ is the number of classes.
Split Criterion: Choose the feature and threshold that minimize the weighted Gini index of child nodes:
$$\text{Gini}_{\text{split}} = \frac{n_{\text{left}}}{n} \text{Gini}_{\text{left}} + \frac{n_{\text{right}}}{n} \text{Gini}_{\text{right}}$$

Prediction: Assign the majority class of the leaf node.
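
As a quick numeric check of the definitions above, the sketch below computes the Gini index of a small node and the weighted Gini index of one candidate split. The labels, the split, and the helper name `gini` are made up for illustration and are not part of the required class.

```python
import numpy as np

def gini(labels):
    """Gini index 1 - sum(p_i^2) for a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Hypothetical node: 3 samples of class 0 and 3 samples of class 1.
parent = np.array([0, 0, 0, 1, 1, 1])

# Hypothetical split: the first 4 samples go left, the last 2 go right.
left, right = parent[:4], parent[4:]

n = len(parent)
gini_split = len(left) / n * gini(left) + len(right) / n * gini(right)

print(f"{gini(parent):.3f}")  # 0.500, the maximum for two classes
print(f"{gini(left):.3f}")    # 0.375, from p = (3/4, 1/4)
print(f"{gini(right):.3f}")   # 0.000, a pure node
print(f"{gini_split:.3f}")    # 0.250 = (4/6) * 0.375 + (2/6) * 0
```

The split lowers the impurity from 0.5 to 0.25, so it would be preferred over any candidate with a higher weighted Gini index.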

Requirements

Implement a DecisionTreeClassifier class with methods for:

fit: Build the tree using recursive splitting.
predict: Classify new data points.


Use the Gini index to select splits.
Handle binary classification with 2D synthetic data.
Evaluate accuracy on a test set.

Constraints

Use only NumPy for data manipulation and tree logic.
No scikit-learn or other ML libraries.
Max tree depth of 3 to prevent overfitting.
Handle batch inputs for prediction.
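
To make "batch inputs" concrete: prediction should accept an array of shape (n_samples, 2) and return one label per row. A minimal sketch, using a single hypothetical threshold rule (feature 0, threshold 5.0, both chosen only for illustration) in place of a full tree:

```python
import numpy as np

def predict_stump(X, feature=0, threshold=5.0):
    """Label each row 0 if X[row, feature] <= threshold, else 1 (illustrative rule)."""
    X = np.atleast_2d(X)  # treat a single point as a batch of one
    return (X[:, feature] > threshold).astype(int)

batch = np.array([[1.0, 9.0],
                  [6.0, 2.0],
                  [5.0, 5.0]])
print(predict_stump(batch))       # [0 1 0]
print(predict_stump([7.0, 1.0]))  # [1]
```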

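import numpy as np
np.random.seed(42)
# Purpose: Seed NumPy's global random generator.
# Theory: A fixed seed makes the synthetic data and the train/test split reproducible.

X = np.random.rand(100, 2) * 10
# Purpose: Generate 100 synthetic 2D points.
# Theory: np.random.rand samples uniformly from [0, 1), so each feature lies in [0, 10).

y = (X[:, 0] + X[:, 1] > 10).astype(int)
# Purpose: Label each point 1 if its two features sum to more than 10, else 0.
# Theory: The true boundary is the diagonal line x0 + x1 = 10, which axis-aligned splits can only approximate.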

# Define the Decision Tree Classifier
class DecisionTreeClassifier:
    # Purpose: Define a decision tree classifier for binary classification.
    # Theory: Decision trees recursively split the feature space into regions based on feature thresholds, using Gini index to optimize splits.
    
    def __init__(self, max_depth=3):
        # Purpose: Initialize the decision tree with a maximum depth.
        # Theory: max_depth limits tree growth to prevent overfitting. A depth of 3 balances complexity and generalization for this small dataset.
        
        self.max_depth = max_depth
        # Purpose: Store the maximum depth as an instance variable.
        # Theory: Used to control recursion during tree building.
        
        self.tree = None
        # Purpose: Initialize the tree structure as None.
        # Theory: The tree will be built during fit, represented as a dictionary of nodes.
    
    def gini_index(self, y):
        # Purpose: Compute the Gini index for a set of labels.
        # Theory: Gini = 1 - Σ(p_i^2), where p_i is the proportion of class i. Measures node impurity, with 0 indicating a pure node.
        
        classes, counts = np.unique(y, return_counts=True)
        # Purpose: Count occurrences of each class in y.
        # Theory: np.unique returns unique classes and their counts, enabling proportion calculation.
        
        probs = counts / len(y)
        # Purpose: Compute class proportions.
        # Theory: Proportions are used to calculate Gini index.
        
        return 1 - np.sum(probs ** 2)
        # Purpose: Calculate and return the Gini index.
        # Theory: Lower Gini indicates purer nodes (e.g., Gini = 0 for single-class nodes).
    
    def best_split(self, X, y):
        # Purpose: Find the best feature and threshold to split the data.
        # Theory: Evaluates all possible splits to minimize weighted Gini index of child nodes.
        
        m, n = X.shape
        # Purpose: Get number of samples (m) and features (n).
        # Theory: n bounds the loop over features; m normalizes the weighted Gini index of each split.
        
        if len(np.unique(y)) == 1:
            return None, None
        # Purpose: Return None if node is pure (single class).
        # Theory: No split needed if all samples belong to one class.
        
        best_gini = float('inf')
        best_feature = None
        best_threshold = None
        # Purpose: Initialize variables to track the best split.
        # Theory: Tracks the split with the lowest weighted Gini index.
        
        for feature in range(n):
            # Purpose: Iterate over each feature.
            # Theory: Each feature is tested for possible thresholds.
            
            thresholds = np.unique(X[:, feature])
            # Purpose: Get unique values of the feature as potential thresholds.
            # Theory: Only unique values need testing to avoid redundant splits.
            
            for threshold in thresholds:
                # Purpose: Test each threshold for the feature.
                # Theory: Splits data into left (≤ threshold) and right (> threshold) subsets.
                
                left_idx = X[:, feature] <= threshold
                right_idx = ~left_idx
                # Purpose: Create boolean masks for left and right child nodes.
                # Theory: Masks partition data based on the threshold.
                
                if np.sum(left_idx) == 0 or np.sum(right_idx) == 0:
                    continue
                # Purpose: Skip splits that produce empty nodes.
                # Theory: Each child needs at least one sample. The largest feature value always sends every sample left, so it is skipped here.
                
                gini_left = self.gini_index(y[left_idx])
                gini_right = self.gini_index(y[right_idx])
                # Purpose: Compute Gini index for left and right child nodes.
                # Theory: Measures impurity of each child node.
                
                weighted_gini = (np.sum(left_idx) * gini_left + np.sum(right_idx) * gini_right) / m
                # Purpose: Calculate weighted Gini index for the split.
                # Theory: Weights by number of samples in each child node.
                
                if weighted_gini < best_gini:
                    best_gini = weighted_gini
                    best_feature = feature
                    best_threshold = threshold
                # Purpose: Update best split if weighted Gini is lower.
                # Theory: Tracks the split that minimizes impurity.
        
        return best_feature, best_threshold
        # Purpose: Return the best feature index and threshold.
        # Theory: Used to construct the tree node or stop if no valid split is found.
    
    def build_tree(self, X, y, depth=0):
        # Purpose: Recursively build the decision tree.
        # Theory: Splits data at each node until stopping criteria (depth, purity) are met.
        
        n_samples = len(y)
        # Purpose: Get number of samples at current node.
        # Theory: Used to check stopping conditions.
        
        if depth >= self.max_depth or len(np.unique(y)) == 1 or n_samples < 2:
            # Purpose: Stop recursion if max depth reached, node is pure, or too few samples.
            # Theory: Prevents overfitting and ensures valid leaf nodes.
            
            majority_class = np.bincount(y).argmax()
            # Purpose: Determine majority class for leaf node.
            # Theory: Leaf nodes predict the most common class. np.bincount requires non-negative integer labels; ties go to the smaller label.
            
            return {'leaf': True, 'class': majority_class}
            # Purpose: Return a leaf node with the predicted class.
            # Theory: Leaf nodes store the class for predictions.
        
        feature, threshold = self.best_split(X, y)
        # Purpose: Find the best split for the current node.
        # Theory: Uses Gini index to select feature and threshold.
        
        if feature is None:
            majority_class = np.bincount(y).argmax()
            return {'leaf': True, 'class': majority_class}
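        # Purpose: Create a leaf if no valid split is found.
        # Theory: best_split returns None only when the node is pure or every candidate threshold leaves one child empty
        # (e.g., all samples share identical feature values). It does not require a split to reduce impurity.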
        
        left_idx = X[:, feature] <= threshold
        right_idx = ~left_idx
        # Purpose: Split data into left and right child nodes.
        # Theory: Partitions samples based on the threshold.
        
        left_tree = self.build_tree(X[left_idx], y[left_idx], depth + 1)
        right_tree = self.build_tree(X[right_idx], y[right_idx], depth + 1)
        # Purpose: Recursively build left and right subtrees.
        # Theory: Continues splitting until stopping criteria are met.
        
        return {
            'leaf': False,
            'feature': feature,
            'threshold': threshold,
            'left': left_tree,
            'right': right_tree
        }
        # Purpose: Return an internal node with split details.
        # Theory: Stores feature, threshold, and child nodes for traversal.
    
    def fit(self, X, y):
        # Purpose: Train the decision tree on input data and labels.
        # Theory: Builds the tree by recursively splitting data using Gini index.
        
        self.tree = self.build_tree(X, y)
        # Purpose: Build and store the tree.
        # Theory: Starts recursive construction from the root node.
        
        return self
        # Purpose: Return self for method chaining.
        # Theory: Follows scikit-learn API convention.
    
    def predict_single(self, x, node):
        # Purpose: Predict the class for a single sample by traversing the tree.
        # Theory: Recursively navigates to a leaf node based on feature thresholds.
        
        if node['leaf']:
            return node['class']
        # Purpose: Return the class if at a leaf node.
        # Theory: Leaf nodes store the majority class.
        
        if x[node['feature']] <= node['threshold']:
            return self.predict_single(x, node['left'])
        else:
            return self.predict_single(x, node['right'])
        # Purpose: Recursively traverse left or right child based on threshold.
        # Theory: Follows the decision path to a leaf.
    
    def predict(self, X):
        # Purpose: Predict classes for a batch of samples.
        # Theory: Applies predict_single to each sample, handling batch inputs.
        
        return np.array([self.predict_single(x, self.tree) for x in X])
        # Purpose: Return predictions as a NumPy array.
        # Theory: Ensures compatibility with evaluation metrics.

# Split data into train and test sets
train_idx = np.random.choice(100, 80, replace=False)
test_idx = np.setdiff1d(np.arange(100), train_idx)
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
# Purpose: Split data into 80% train and 20% test sets.
# Theory: Sampling indices without replacement keeps the two sets disjoint, so test accuracy measures generalization to unseen points.

# Train and evaluate the model
model = DecisionTreeClassifier(max_depth=3)
# Purpose: Initialize the decision tree with max depth of 3.
# Theory: Limits complexity to prevent overfitting on small dataset.

model.fit(X_train, y_train)
# Purpose: Train the model on training data.
# Theory: Builds the tree using recursive Gini-based splitting.

predictions = model.predict(X_test)
# Purpose: Generate predictions for test data.
# Theory: Traverses the tree for each test sample to assign classes.

accuracy = np.mean(predictions == y_test)
# Purpose: Compute accuracy as the proportion of correct predictions.
# Theory: Accuracy = (correct predictions) / (total samples), a standard classification metric.

print(f"Test Accuracy: {accuracy:.4f}")
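# Purpose: Print the test accuracy.
# Theory: The classes are separable along the diagonal x0 + x1 = 10, but a depth-3 tree of axis-aligned splits
# can only approximate that boundary with a staircase, so accuracy should be high but is not guaranteed to be perfect.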
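
Because the tree is stored as nested dictionaries, it can be inspected by walking the same keys that build_tree writes ('leaf', 'class', 'feature', 'threshold', 'left', 'right'). The sketch below is one possible way to render such a tree as text. The helper name `format_tree` and the hand-written `example_tree` are assumptions for illustration; the example is not the fitted model.

```python
def format_tree(node, indent=""):
    """Return a list of text lines describing a nested-dict decision tree."""
    if node["leaf"]:
        return [f"{indent}predict class {node['class']}"]
    lines = [f"{indent}if x[{node['feature']}] <= {node['threshold']:.2f}:"]
    lines += format_tree(node["left"], indent + "    ")
    lines.append(f"{indent}else:")
    lines += format_tree(node["right"], indent + "    ")
    return lines

# Hand-written example tree using the same node layout as build_tree.
example_tree = {
    "leaf": False, "feature": 0, "threshold": 5.0,
    "left": {"leaf": True, "class": 0},
    "right": {
        "leaf": False, "feature": 1, "threshold": 4.0,
        "left": {"leaf": True, "class": 0},
        "right": {"leaf": True, "class": 1},
    },
}

print("\n".join(format_tree(example_tree)))
```

With the model trained above, the same helper could be called as `print("\n".join(format_tree(model.tree)))` to see which thresholds the staircase approximation actually uses.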