## Coding Question: Implement a Decision Stump

A decision stump is a decision tree with a maximum depth of 1. It consists of a single split on one feature and two leaf predictions.
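To make the definition concrete, here is a minimal sketch of what a fitted stump does at prediction time. The helper name `stump_predict` and the convention that values less than or equal to the threshold go to the left leaf are assumptions made for illustration, not part of the problem statement.

```python
import numpy as np

def stump_predict(X, feature, threshold, left_label, right_label):
    """Apply a single-split rule (hypothetical helper for illustration).

    Rows with X[:, feature] <= threshold receive left_label,
    all other rows receive right_label.
    """
    X = np.asarray(X)
    return np.where(X[:, feature] <= threshold, left_label, right_label)

X_demo = np.array([[1.0], [2.5], [3.5], [4.0]])
print(stump_predict(X_demo, feature=0, threshold=3.0, left_label=0, right_label=1))
# [0 0 1 1]
```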

Your task is to implement a decision stump classifier from scratch. The stump should find the best feature and threshold to split on, using classification accuracy as the split metric.

### Requirements:

Implement a DecisionStump class with the following methods:

#### 1. fit(X, y):

Input:

- X: a 2D NumPy array of shape (n_samples, n_features)

- y: a 1D NumPy array of shape (n_samples,), containing class labels (0 or 1).

Function:

- Find the best feature and threshold to split the data such that classification accuracy is maximized.

- Store the chosen feature, threshold, and predictions for the left and right child.

predict(X):

- Input: X: a 2D NumPy array of shape (n_samples, n_features)

- Output: predictions (0 or 1) for each row, based on the learned stump.

#### 2. Use classification accuracy to evaluate potential splits:

Accuracy = Number of correct predictions / Total samples
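As a quick illustration of the metric, the sketch below computes accuracy with NumPy. The function name `accuracy` is an assumption chosen for this example.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Three of the four predictions match the labels.
print(accuracy([0, 1, 0, 1], [0, 1, 1, 1]))  # 0.75
```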

#### 3. Assume binary classification (y ∈ {0, 1}).

In [None]:
### Clarification Questions
1. Input
   - Are the features categorical or continuous?
   - How should candidate thresholds be chosen? For simplicity, use the observed feature values?
2. Output
3. Loss function
   - Which split metric should be used: Gini impurity or accuracy?
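One possible shape for the requested class is sketched below. It is not a reference solution. It assumes continuous features, uses midpoints between consecutive distinct values as candidate thresholds, labels each leaf by majority vote (a 50/50 tie predicts 0), keeps the first split encountered when accuracies tie, and sends values less than or equal to the threshold to the left leaf. The attribute names (`feature_`, `threshold_`, `left_label_`, `right_label_`) are also assumptions.

```python
import numpy as np

class DecisionStump:
    """Depth-1 classifier for 0/1 labels; the split is chosen by training accuracy."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=int)

        # Fallback when no valid split exists (every feature is constant):
        # both leaves predict the majority class.
        majority = int(y.mean() > 0.5)
        self.feature_, self.threshold_ = 0, np.inf
        self.left_label_ = self.right_label_ = majority
        best_acc = -1.0

        for j in range(X.shape[1]):
            values = np.unique(X[:, j])  # sorted distinct values
            # Assumed threshold strategy: midpoints between neighbours.
            for t in (values[:-1] + values[1:]) / 2.0:
                left = X[:, j] <= t
                left_label = int(y[left].mean() > 0.5)
                right_label = int(y[~left].mean() > 0.5)
                acc = np.mean(np.where(left, left_label, right_label) == y)
                if acc > best_acc:  # strict '>' keeps the first best split
                    best_acc = acc
                    self.feature_, self.threshold_ = j, float(t)
                    self.left_label_, self.right_label_ = left_label, right_label
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        go_left = X[:, self.feature_] <= self.threshold_
        return np.where(go_left, self.left_label_, self.right_label_)

X_train = np.array([[2.5], [3.5], [1.0], [4.0]])
y_train = np.array([0, 1, 0, 1])
stump = DecisionStump().fit(X_train, y_train)
print(stump.feature_, stump.threshold_)            # 0 3.0
print(stump.predict(np.array([[2.0], [3.9]])))     # [0 1]
```

Storing the leaf labels during `fit` keeps `predict` to a single vectorised comparison.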

## Find the Best Gini-Based Split for a Binary Decision Tree
Difficulty: Medium
Topic: Machine Learning

Implement a function that scans every feature and threshold in a small data set, then returns the split that minimises the weighted Gini impurity. Your implementation should support binary class labels (0 or 1) and handle ties gracefully.

You will write one function:

find_best_split(X: np.ndarray, y: np.ndarray) -> tuple[int, float]
- X is an n×d NumPy array of numeric features.
- y is a length-n NumPy array of 0/1 labels.

The function returns (best_feature_index, best_threshold) for the split with the lowest weighted Gini impurity.
If several splits share the same impurity, return the first that you encounter while scanning features and thresholds.
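For reference, the quantity being minimised can be written out as follows. The symbols are introduced here for illustration: $p$ is the fraction of samples labelled 1 in a subset $S$, and $S_L$, $S_R$ are the left and right subsets of sizes $n_L$ and $n_R$, with $n = n_L + n_R$.

$$
G(S) = 1 - p^2 - (1 - p)^2,
\qquad
G_{\text{split}} = \frac{n_L}{n}\, G(S_L) + \frac{n_R}{n}\, G(S_R)
$$

As a worked case, suppose a split sends labels $(0, 0, 1)$ to the left and $(1, 1)$ to the right:

$$
G(S_L) = 1 - \left(\tfrac{1}{3}\right)^2 - \left(\tfrac{2}{3}\right)^2 = \tfrac{4}{9},
\qquad
G(S_R) = 1 - 1^2 - 0^2 = 0,
\qquad
G_{\text{split}} = \tfrac{3}{5} \cdot \tfrac{4}{9} + \tfrac{2}{5} \cdot 0 = \tfrac{4}{15} \approx 0.267
$$

A perfectly separating split has $G_{\text{split}} = 0$. Updating the best split only on a strictly lower impurity is one simple way to satisfy the tie rule, since later splits with equal impurity never replace the first.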

Example:

Input:

In [19]:
import numpy as np
from typing import Tuple

def gini(y_subset: np.ndarray) -> float:
    """Gini impurity of a set of 0/1 labels; an empty set counts as pure."""
    if y_subset.size == 0:
        return 0.0
    p = y_subset.mean()  # fraction of samples labelled 1
    return 1.0 - (p ** 2 + (1 - p) ** 2)

def find_best_split(X: np.ndarray, y: np.ndarray) -> Tuple[int, float]:
    n_samples, n_features = X.shape
    best_gini = float('inf')
    best_feature_index = -1
    best_threshold = float('inf')

    for i in range(n_features):
        # Candidate thresholds are the observed feature values
        # (np.unique returns them sorted and de-duplicated).
        for val in np.unique(X[:, i]):
            y_left = y[X[:, i] <= val]
            y_right = y[X[:, i] > val]
            weighted_gini = (len(y_left) * gini(y_left) + len(y_right) * gini(y_right)) / n_samples
            # Strict '<' keeps the first split encountered when impurities tie.
            if weighted_gini < best_gini:
                best_gini = weighted_gini
                best_feature_index = i
                best_threshold = float(val)  # plain Python float for clean printing

    return best_feature_index, best_threshold

In [17]:
import numpy as np
X = np.array([[2.5],[3.5],[1.0],[4.0]])
y = np.array([0,1,0,1])
print(find_best_split(X, y))

# Output: (0, 2.5)

(0, 2.5)


In [26]:
X = np.array([
    [2.5, 1.0],
    [1.0, 3.5],
    [3.0, 2.0],
    [2.0, 2.5],
    [0.5, 1.5]
])
y = np.array([1, 0, 1, 0, 0])

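# Note: the output below, (0, 2.25), is the midpoint of the observed values 2.0 and 2.5,
# so it was produced by the midpoint-threshold accuracy version of find_best_split
# (cell In [23], further down), not by the Gini version above.
# The Gini version, which only tries observed values, returns (0, 2.0) on this data.
print(find_best_split(X, y))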

(0, 2.25)


In [23]:
# Split metric: accuracy
import numpy as np
from typing import Tuple

def find_best_split(X: np.ndarray, y: np.ndarray) -> Tuple[int, float]:
    n_samples, n_features = X.shape
    best_accuracy = -1.0  # below any achievable accuracy, so the first valid split always wins
    best_feature_index, best_threshold = -1, float('inf')

    for fea in range(n_features):
        # Clarify with the interviewer: how to choose candidate thresholds.
        # Here: midpoints between consecutive distinct values (np.unique already sorts).
        # Alternative: the observed values themselves, thresholds = np.unique(X[:, fea])
        values = np.unique(X[:, fea])
        thresholds = (values[:-1] + values[1:]) / 2
        for thresh in thresholds:
            left_mask = X[:, fea] <= thresh
            right_mask = ~left_mask

            # Skip degenerate splits that leave one side empty.
            if not left_mask.any() or not right_mask.any():
                continue

            # Clarify with the interviewer: how to label each leaf.
            # Here: majority class; a 50/50 tie predicts 0 because of the strict '>'.
            pred_left = int(y[left_mask].mean() > 0.5)
            pred_right = int(y[right_mask].mean() > 0.5)
            y_pred = np.where(left_mask, pred_left, pred_right)

            acc = np.mean(y_pred == y)
            if acc > best_accuracy:  # strict '>' keeps the first best split on ties
                best_accuracy = acc
                best_feature_index = fea
                best_threshold = float(thresh)

    return best_feature_index, best_threshold

In [25]:
X = np.array([[2.5],[3.5],[1.0],[4.0]])
y = np.array([0,1,0,1])
print(find_best_split(X, y))

(0, 3.0)
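The Gini version earlier returned 2.5 for this same data, while the accuracy version returns 3.0. The difference comes from the threshold strategy rather than the metric. The short sketch below lists the candidates under each strategy and checks that both thresholds partition the training rows identically; the variable names are chosen for illustration.

```python
import numpy as np

x = np.array([2.5, 3.5, 1.0, 4.0])
values = np.unique(x)                         # sorted distinct values

observed = values                             # candidates in the Gini version
midpoints = (values[:-1] + values[1:]) / 2    # candidates in the accuracy version

print(observed.tolist())    # [1.0, 2.5, 3.5, 4.0]
print(midpoints.tolist())   # [1.75, 3.0, 3.75]

# Both thresholds put {1.0, 2.5} on the left and {3.5, 4.0} on the right.
same_partition = np.array_equal(x <= 2.5, x <= 3.0)
print(same_partition)       # True
```

The two thresholds only disagree on unseen values strictly between 2.5 and 3.0, which the midpoint rule sends left and the observed-value rule sends right.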
