#### 1.3 Representation
*Figure placeholder: algorithm diagram from the slides (Alg_rep.png).*
The CART algorithm predicts a single outcome for binary classification by recursively partitioning the dataset using decision rules based on the input features. The algorithm takes four inputs: the training dataset $S$, the feature subset $A$, and the two hyperparameters `max_depth` and `min_samples_split`. When building the tree, if none of the stopping conditions is met, we first compute the Gini impurity of the current dataset, denoted $\text{Gini}(S)$. Then, for each feature $\mathbf{x}_i$ in the current feature subset $A$, we sort its unique values and generate the sequence of midpoints between consecutive sorted values. Each midpoint serves as a candidate threshold $\theta$ that splits the current dataset into two subsets: $S_1$ contains the data points with $x_i \leq \theta$, and $S_2$ contains the data points with $x_i > \theta$, where $x_i$ is a data point's value for feature $\mathbf{x}_i$. Next, we calculate the Gini impurity of both subsets, as well as their weighted Gini impurity, denoted $\text{Gini}(S_1,S_2,S)$. The gain of a split is the difference between the node Gini impurity $\text{Gini}(S)$ and the weighted Gini impurity, and we choose the feature $\mathbf{x}_j$ and threshold $\theta$ that maximize this gain. We then split the current dataset on the chosen pair: $S_1$ contains the data points with $x_j \leq \theta$, and $S_2$ contains the data points with $x_j > \theta$. Using this strategy, we recursively build the tree until one of the stopping conditions is met.

Since all features in the Heart Disease dataset are encoded as numeric values, this algorithm is sufficient for handling its 13 numeric features. However, if ordinal and categorical features are not yet encoded numerically, the algorithm will not work without further modification.
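
The threshold generation and split step described above can be sketched as follows. This is a minimal illustration, not the class implementation further down: the helper names `candidate_thresholds` and `split_on_threshold` and the toy array are made up for this example, and the last column is assumed to hold the label.

```python
import numpy as np

def candidate_thresholds(values):
    """Midpoints between consecutive sorted unique values of one feature."""
    sorted_values = np.unique(values)  # np.unique returns the unique values sorted
    return (sorted_values[1:] + sorted_values[:-1]) / 2

def split_on_threshold(data, feature_index, threshold):
    """Return S1 (feature <= threshold) and S2 (feature > threshold)."""
    mask = data[:, feature_index] <= threshold
    return data[mask], data[~mask]

# Toy dataset (hypothetical): one feature column followed by the label column
toy = np.array([[1.0, 0], [2.0, 0], [2.0, 1], [4.0, 1]])
thresholds = candidate_thresholds(toy[:, 0])  # midpoints of 1, 2, 4
s1, s2 = split_on_threshold(toy, 0, thresholds[0])
print(thresholds)
print(len(s1), len(s2))
```

Because the thresholds are midpoints between distinct values, every candidate split leaves at least one sample on each side.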

We use five stopping conditions to terminate the recursion while building the tree. First, if all samples in the current subset $S$ are labeled 1, a leaf node labeled 1 is returned. Second, if all samples are labeled 0, a leaf node labeled 0 is returned. These two conditions stop the recursion when the node becomes pure, meaning all samples in the subset belong to the same class and further splitting is unnecessary. Third, if the feature subset $A$ is empty, the recursion stops and a leaf node is returned with its value set to the majority label of the current subset $S$. This condition acts as a safeguard against further splits when no features are left to evaluate, so tree construction terminates gracefully even if the dataset cannot be split further in a meaningful way. Fourth, if the maximum tree depth is reached, the recursion terminates and a leaf node is returned with its value equal to the majority class label in $S$. The maximum tree depth is determined by the hyperparameter `max_depth`. Finally, if the size of the current subset, $|S|$, is smaller than the minimum number of samples required to split (`min_samples_split`), a leaf node is returned with its value set to the majority class label in $S$. In the last three conditions, assigning the majority class label ensures that the algorithm not only provides a reasonable prediction when further splitting is restricted, but also works for multi-class classification.
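
As a rough sketch of how these five checks could be ordered, the hypothetical helper below returns the name of the first condition that holds. The function names, the reason strings, and the argument list are assumptions made for illustration; the sketch also assumes a non-empty label array.

```python
import numpy as np

def stopping_reason(labels, feature_subset, depth, max_depth, min_samples_split):
    """Return the first stopping condition that applies, or None to keep splitting."""
    if np.all(labels == 1):
        return "pure_1"              # condition 1: every sample is labeled 1
    if np.all(labels == 0):
        return "pure_0"              # condition 2: every sample is labeled 0
    if len(feature_subset) == 0:
        return "no_features"         # condition 3: nothing left to split on
    if depth >= max_depth:
        return "max_depth"           # condition 4: depth limit reached
    if len(labels) < min_samples_split:
        return "min_samples_split"   # condition 5: too few samples to split
    return None

def majority_label(labels):
    """Label used for the leaf in conditions 3 to 5."""
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]

labels = np.array([0, 1, 1])
print(stopping_reason(labels, [0, 1], depth=3, max_depth=3, min_samples_split=2))
print(majority_label(labels))
```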

#### 1.4 Loss

CART can use three impurity measures as its loss: Gini impurity, entropy, and misclassification error. This project uses Gini impurity since it is the default criterion in scikit-learn's `DecisionTreeClassifier`. We calculate the loss recursively while building the tree. The parent Gini impurity, or node Gini impurity, is calculated as $$\text{Gini}(S)=1-\sum^K_{k=1}p_k^2\text{,}$$ where $K$ is the number of unique classes in the entire dataset, and $p_k$ is the proportion of samples in the subset $S$ that belong to class $k$. In a binary classification problem, $$\text{Gini}(S)=1-\sum_{k=1}^2[P(y=k|S)]^2=1-(p_1^2+p_2^2)\text{,}$$ which simplifies to $2p_1(1-p_1)$ because $p_2=1-p_1$.
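
A minimal sketch of the node Gini impurity, assuming the labels arrive as a one-dimensional NumPy array (the function name `gini` is chosen for this example only):

```python
import numpy as np

def gini(labels):
    """Gini(S) = 1 - sum_k p_k^2, with p_k the class proportions in S."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

balanced = gini(np.array([0, 0, 1, 1]))  # p_1 = p_2 = 0.5, so 1 - (0.25 + 0.25) = 0.5
pure = gini(np.array([1, 1, 1]))         # a single class, so 1 - 1 = 0.0
print(balanced, pure)
```

A balanced binary node gives the maximum impurity of 0.5, and a pure node gives 0.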

The weighted Gini impurity is calculated as $$\text{Gini}(S_1,S_2,S)=\frac{|S_1|}{|S|}\cdot\text{Gini}(S_1)+\frac{|S_2|}{|S|}\cdot\text{Gini}(S_2)\text{,}$$ where $\text{Gini}(S_1)$ and $\text{Gini}(S_2)$ are the Gini impurities of the two split subsets $S_1$ and $S_2$, and $|\cdot|$ denotes cardinality, that is, the number of samples in a dataset.

Then, the gain from a split is measured as the difference between the parent Gini impurity and the weighted Gini impurity:
$$\text{Gain}(S_1,S_2,S)=\text{Gini}(S)-\text{Gini}(S_1,S_2,S)\text{.}$$
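
To make the formulas concrete, here is a small worked example under the assumption that a parent node with labels $\{0, 0, 1\}$ is split into $S_1=\{0,0\}$ and $S_2=\{1\}$. The helper names are illustrative and not part of the class defined below.

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_gini(left_labels, right_labels):
    """Gini(S1, S2, S): child impurities weighted by their share of the samples."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + (len(right_labels) / n) * gini(right_labels)

def gini_gain(left_labels, right_labels):
    """Gain(S1, S2, S) = Gini(S) - Gini(S1, S2, S)."""
    parent = np.concatenate([left_labels, right_labels])
    return gini(parent) - weighted_gini(left_labels, right_labels)

s1 = np.array([0, 0])
s2 = np.array([1])
gain = gini_gain(s1, s2)
# Parent: 1 - ((2/3)^2 + (1/3)^2) = 4/9; both children are pure, so the gain is 4/9
print(round(gain, 5))
```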

#### 1.5 Optimizer: Pruning

Pruning is used to optimize the decision tree classifier. In scikit-learn, `DecisionTreeClassifier` uses the cost-complexity pruning parameter `ccp_alpha` to control the trade-off between the complexity of the tree and the loss. Pruning improves generalization by cutting off unnecessary branches, thus preventing overfitting. During the pruning process, the algorithm computes a cost-complexity measure for every subtree: the loss of the subtree plus a penalty equal to `ccp_alpha` times the number of leaves in the subtree.
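
For reference, scikit-learn exposes this trade-off through `cost_complexity_pruning_path` and the `ccp_alpha` argument. The sketch below uses the Iris dataset purely as a stand-in (an assumption, not the Heart Disease data used in this project) and clips each alpha at zero as a precaution against floating-point noise.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset for illustration only

# Candidate alphas and the total leaf impurity of the pruned tree at each alpha
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

leaf_counts = []
for alpha in ccp_alphas:
    alpha = max(float(alpha), 0.0)  # ccp_alpha must be non-negative
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    leaf_counts.append(pruned.get_n_leaves())
    print(f"alpha={alpha:.4f} leaves={leaf_counts[-1]}")
```

Larger values of `ccp_alpha` penalize leaves more heavily, so the fitted tree keeps fewer of them.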

The pruning process works from top to bottom: it starts at the root and recursively visits each subtree and its child subtrees to perform this optimization, as shown in the graph below [6]. In the end, it should simplify the tree and reduce overfitting. The global cost function is $R_\alpha(T) = R(T) + \alpha \cdot |T|$, where $T$ is the subtree, $R(T)$ is the loss of the tree, and $|T|$ is the number of terminal nodes. The local pruning rule states that the subtree $T_t$ rooted at node $t$ is pruned if $\alpha > \frac{R(t)-R(T_t)}{|T_t|-1}$, where $R(t)$ is the loss of node $t$ once it is collapsed into a leaf. Multiplying both sides by $|T_t|-1$ gives the equivalent form: prune $T_t$ if $(|T_t|-1) \cdot \alpha > R(t) - R(T_t)$. We iterate through different values of $\alpha$ to find the value that best improves the performance of the classifier.
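
The local pruning rule can be checked numerically with a small sketch. The function names and the loss values are hypothetical; the sketch assumes the subtree has at least two leaves so that $|T_t|-1$ is positive.

```python
def effective_alpha(node_loss, subtree_loss, subtree_leaves):
    """Threshold (R(t) - R(T_t)) / (|T_t| - 1) above which pruning T_t pays off."""
    return (node_loss - subtree_loss) / (subtree_leaves - 1)

def should_prune(node_loss, subtree_loss, subtree_leaves, alpha):
    """Prune T_t if (|T_t| - 1) * alpha > R(t) - R(T_t)."""
    return (subtree_leaves - 1) * alpha > node_loss - subtree_loss

# Hypothetical numbers: R(t) = 0.10 as a leaf, R(T_t) = 0.04 with 3 leaves
threshold = effective_alpha(0.10, 0.04, 3)       # (0.10 - 0.04) / 2, about 0.03
prune_large = should_prune(0.10, 0.04, 3, 0.05)  # 0.10 > 0.06, so prune
prune_small = should_prune(0.10, 0.04, 3, 0.01)  # 0.02 > 0.06 is false, so keep
print(threshold, prune_large, prune_small)
```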

We also use 5-fold cross-validation to evaluate model performance. We first split the entire dataset into 80% training data and 20% test data, then split the training data into 5 folds to ensure robustness. In iteration $i$ of the cross-validation process, the $i$-th fold serves as the validation set and the remaining folds form the training set. The model is trained and validated across all folds to obtain reliable performance measures.
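
The evaluation protocol above could look roughly like this. The sketch assumes the Iris dataset and scikit-learn's own `DecisionTreeClassifier` as stand-ins for the project's data and the custom classifier, and the random seeds are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # stand-in dataset for illustration only

# 80% training data, 20% held-out test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 5-fold cross-validation on the training portion only
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in kf.split(X_train):
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_train[train_idx], y_train[train_idx])
    fold_scores.append(model.score(X_train[val_idx], y_train[val_idx]))

print(f"mean validation accuracy: {np.mean(fold_scores):.3f}")
```

The test split is never touched inside the loop, so it stays available for a final, unbiased evaluation.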

# Fun fact

Our team name is UC Providence because we all graduated from the University of California, though from different campuses. Yixun graduated from UC Santa Barbara, David graduated from UC Irvine, and Liang graduated from UC San Diego. Interestingly, we didn't know we had all gone to UC schools until our first meeting, and that's how we came up with the name.

In [1]:
# Dependencies
import numpy as np
from sklearn.datasets import load_iris
from graphviz import Digraph
from sklearn import tree
import matplotlib.pyplot as plt
import random
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, KFold

In [2]:
class Node:
    '''
    Constructor for the Node class
    '''
    def __init__(self, left=None, right=None, label=None, feature=None, threshold=None, parent_gini=None, node_gini=None, num_samples=None, class_counts=None):
        self.left = left # to a left node
        self.right = right # to a right node
        self.label = label
        self.feature = feature
        self.threshold = threshold
        self.parent_gini = parent_gini
        self.node_gini = node_gini
        self.num_samples = num_samples # data points in this node
        self.class_counts = class_counts

    def is_leaf(self):
        '''
        check if the node is a leaf node
        '''
        return self.label is not None

In [3]:
class CART:
    '''
    Decision Tree classifier by UC Providence
    '''
    def __init__(self, max_depth=None, min_samples_split=2, ccp_alpha=0.01, random_state=0):
        if max_depth is None:
            self.max_depth = 20
        else:
            self.max_depth = max_depth # set the max depth
        self.min_samples_split = min_samples_split
        self.tree = None
        self.ccp_alpha = ccp_alpha # for pruning
        self.random_state = random_state
        if random_state is not None: # make random state deterministic
            np.random.seed(random_state)
            random.seed(random_state)
        
    def fit(self, data):
        '''
        build the tree based on the data
        '''
        self.tree = self._build_tree(data, depth=0)
    
    def prune(self, ccp_alpha=0):
        '''
        prune the tree based on the ccp_alpha
        '''
        self._prune_tree(self.tree, ccp_alpha)
    
    def predict(self, data):
        '''
        Helper function to predict the data
        '''
        X = data[:, :-1] # get the features
        return np.array([self._predict_row(self.tree, row) for row in X])

    def loss(self, data):
        '''
        Helper function to calculate the loss
        '''
        preds = self.predict(data)
        true_labels = data[:, -1] # last column is the label
        return np.sum(preds != true_labels) / len(true_labels)

    def accuracy(self, data):
        '''
        Helper function to calculate the accuracy
        '''
        return 1 - self.loss(data)
    
    def _gini_for_node(self, data):
        '''
        Get the gini index for a node
        params data: the data in the node
        return: the gini index
        '''
        labels = data[:, -1] # get the last column which is the label
        _, counts = np.unique(labels, return_counts=True)
        probs = counts / len(labels)
        parent_gini = 1 - np.sum(probs ** 2) # calculate the gini index
        return parent_gini

    def _gini_for_split(self, data, left, right):
        '''
        Get the gini index for a split
        params data: the data in the node
        params left: the left split
        params right: the right split
        return: the gini index
        '''
        # calc the total size
        total_size = len(data)
        left_size = len(left)
        right_size = len(right)
        # calc the gini index for the left and right
        left_gini = self._gini_for_node(left)
        right_gini = self._gini_for_node(right)
        # calc the weighted gini index
        weighted_gini = (left_size / total_size) * left_gini + (right_size / total_size) * right_gini
        return weighted_gini

    def _split(self, data, feature_index, threshold):
        '''
        Split the data based on the feature and threshold
        params data: the data
        params feature_index: the feature to split on
        params threshold: the threshold to split on
        return: the left and right split
        '''
        left = data[data[:, feature_index] <= threshold]
        right = data[data[:, feature_index] > threshold]
        return left, right

    def _find_best_split(self, data):
        '''
        Find the best split for the data, traverse through each column and each average value of the values in the column to find the best split.
        params data: the dataset
        return: the best gain and the best split
        '''
        best_gain = float("-inf")
        best_split = None
        best_split_list = [] # for ties
        parent_gini = self._gini_for_node(data) # calc the gini index for the parent node
        n_features = data.shape[1] - 1
        for feature in range(n_features): # traverse through each feature
            unique_values = np.unique(data[:, feature])
            sorted_values = np.sort(unique_values)
            thresholds = (sorted_values[1:] + sorted_values[:-1]) / 2 # get the average of the values

            # A feature with a single candidate threshold is treated as binary;
            # anything with more thresholds is continuous or ordinal
            split_type = "continuous" if len(thresholds) > 1 else "binary"
            for threshold in thresholds:
                left, right = self._split(data, feature, threshold)
                if len(left) == 0 or len(right) == 0:
                    continue # skip if the split is empty
                weighted_gini = self._gini_for_split(data, left, right) # calc the weighted gini index
                gain = parent_gini - weighted_gini
                candidate = {
                    "feature": feature,
                    "threshold": threshold,
                    "gini_for_split": weighted_gini,
                    "parent_gini": parent_gini,
                    "gain": gain,
                    "left": left,
                    "right": right,
                    "type": split_type
                }
                if gain > best_gain: # strictly better: restart the list of best splits
                    best_gain = gain
                    best_split_list = [candidate]
                elif np.isclose(gain, best_gain): # if tied, then append to the list
                    best_gain = gain
                    best_split_list.append(candidate)
        if not best_split_list:
            # no valid split exists, e.g. every feature is constant in this node
            return best_gain, None
        if len(best_split_list) > 1:
            # several splits tie for the best gain: sort by feature index and keep the last one
            best_split = sorted(best_split_list, key=lambda x: x["feature"])[-1]
        else: # else we just get the only best split
            best_split = best_split_list[0]
        return best_gain, best_split

    def _majority_class(self, data):
        '''
        Get the majority class in the data
        '''
        labels = data[:, -1] # get the labels
        unique_labels, counts = np.unique(labels, return_counts=True)
        return unique_labels[np.argmax(counts)]

    def _build_tree(self, data, depth=0):
        '''
        Build the tree recursively
        params data: the data
        params depth: the depth of the tree
        return: the node and its attributes
        '''
        labels = data[:, -1] # label is the last column
        num_samples = len(labels)
        parent_gini = self._gini_for_node(data)

        # Stopping conditions
        # Having a pure node
        if len(np.unique(labels)) == 1:
            return Node(label=labels[0], parent_gini=parent_gini, num_samples=num_samples)
        # Max depth reached
        if self.max_depth is not None and depth >= self.max_depth:
            return Node(label=self._majority_class(data), parent_gini=parent_gini, num_samples=num_samples)
        # Minimum samples split reached
        if num_samples < self.min_samples_split:
            return Node(label=self._majority_class(data), parent_gini=parent_gini, num_samples=num_samples)
        # No split found, or the best split does not reduce the impurity
        best_gain, best_split = self._find_best_split(data)
        if best_split is None or best_gain <= 0:
            return Node(label=self._majority_class(data), parent_gini=parent_gini, num_samples=num_samples)

        # Left and right subsets produced by the best split
        remaining_left = best_split["left"]
        remaining_right = best_split["right"]
        # Recursion
        left_tree = self._build_tree(remaining_left, depth + 1)
        right_tree = self._build_tree(remaining_right, depth + 1)
        return Node(
            left=left_tree,
            right=right_tree,
            feature=best_split["feature"],
            threshold=best_split["threshold"], 
            parent_gini=parent_gini,
            num_samples=num_samples
        )

    def _predict_row(self, node, row):
        '''
        recursively predict the row
        params node: the node
        params row: the row
        return: the prediction
        '''
        if node.is_leaf():
            return node.label
        else:
            if row[node.feature] <= node.threshold:
                return self._predict_row(node.left, row)
            else:
                return self._predict_row(node.right, row)
    
    def _count_leaves(self, node):
        '''Helper function to count the number of leaves in a subtree'''
        if node.is_leaf():
            return 1
        else:
            return self._count_leaves(node.left) + self._count_leaves(node.right)

In [69]:
# Tests for Node
def test_Node_1():
    # Check if the node is a leaf node
    leaf_node = Node(label=1)
    assert leaf_node.is_leaf(), "Leaf node should be a leaf."
    # Check if the node is a decision node
    decision_node = Node(left="LeftNode", right="RightNode", feature=2, threshold=0.5)
    assert not decision_node.is_leaf(), "Decision node should not be a leaf."
    print("Node tests 1 passed.")

def test_Node_2():
    # Check if the node is a leaf node
    leaf_node = Node(label="hellow")
    assert leaf_node.is_leaf(), "Leaf node should be a leaf."
    # Check if the node is a decision node
    decision_node = Node(left="LeftNode", right="RightNode", feature=2, threshold=0.5)
    assert not decision_node.is_leaf(), "Decision node should not be a leaf."
    print("Node tests 2 passed.")

def test_fit_1():
    data = np.array([[1, 0], [2, 1], [3, 0], [4, 1]])
    cart = CART(max_depth=1)
    cart.fit(data)
    assert cart.tree is not None, "Tree should not be None after fitting."
    print("test fit 1 passed")

def test_fit_2():
    data = np.array([]).reshape(0, 8)
    cart = CART()
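    try:
        cart.fit(data)
    except Exception:
        pass # expected: fitting empty data raises an error
    else:
        # a bare except would also swallow this AssertionError, so check it in the else branch
        assert False, "Fitting empty data should raise an error."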
    print("test fit 2 passed")

def test_loss_1():
    data = np.array([[1, 0], [2, 1], [3, 0], [4, 1]])
    cart = CART(max_depth=4)
    cart.fit(data)
    loss = cart.loss(data)
    assert loss == 0, "Loss calculation is incorrect."
    print("test loss 1 passed")

def test_loss_2():
    train_data = np.array([[1, 0], [2, 1]])
    test_data = np.array([[3, 0], [4, 1]])
    cart = CART(max_depth=1)
    cart.fit(train_data)
    loss = cart.loss(test_data)
    assert loss == 0.5, "Loss calculation is incorrect."
    print("test loss 2 passed")

def test_accuracy_1():
    data = np.array([[1, 0], [2, 1], [3, 0], [4, 1]])
    cart = CART(max_depth=4)
    cart.fit(data)
    acc = cart.accuracy(data)
    assert acc == 1, "Accuracy calculation is incorrect."
    print("test accuracy 1 passed")

def test_accuracy_2():
    train_data = np.array([[1, 0], [2, 1]])
    test_data = np.array([[3, 0], [4, 1]])
    cart = CART(max_depth=1)
    cart.fit(train_data)
    acc = cart.accuracy(test_data)
    assert acc == 0.5, "Accuracy calculation is incorrect."
    print("test accuracy 2 passed")

# Tests for loss and accuracy
def test_loss_acc():
    tree = Node(left=Node(label=1), right=Node(label=0), feature=0, threshold=1.5)
    cart = CART()
    cart.tree = tree
    data = np.array([
        [1, 2, 1],
        [2, 3, 0],
        [0.5, 1, 1],
        [3, 4, 0]
    ])
    # Check if loss = 0 and accuracy = 1
    assert cart.loss(data) == 0, "Loss calculation is incorrect."
    assert cart.accuracy(data) == 1, "Accuracy calculation is incorrect."
    print("Loss and accuracy tests passed.")

def test_predict_1():
    # Check if a single row prediction is correct
    tree = Node(left=Node(label=1), right=Node(label=0), feature=0, threshold=1.5)
    cart = CART()
    row = np.array([1, 2])  # Expected: left -> 1
    pred = cart._predict_row(tree, row)
    assert pred == 1, "Prediction for single row is incorrect."
    # Check predictions for a dataset
    cart.tree = tree
    data = np.array([
        [1, 2],
        [2, 3]
    ])
    preds = cart.predict(data)
    assert np.array_equal(preds, [1, 0]), "Batch predictions are incorrect."
    print("Predict test 1 passed.")

def test_predict_2():
    train_data = np.array([[1, 0],
                           [2, 1]])
    test_data = np.array([[6], [101]])
    cart = CART(max_depth=2)
    cart.fit(train_data)
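    try:
        cart.predict(test_data)
    except Exception:
        pass # expected: a mismatched number of features raises an error
    else:
        # a bare except would also swallow this AssertionError, so check it in the else branch
        assert False, "Predictions should raise an error if test data has different number of features."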
    print("Predict test 2 passed.")

# Tests for _find_best_split
def test_find_best_split():
    data = np.array([
        [1, 2.5, 0],
        [2, 3.5, 1],
        [1.5, 2, 0]
    ])
    cart = CART()
    best_gain, best_split = cart._find_best_split(data) # returns (gain, split)
    # Check a valid split is found and the split is correct
    # Should not split on the target but split on one of the continuous features
    assert best_split is not None, "Best split should not be None."
    assert best_split["type"] == "continuous", "Best split should be continuous."
    assert best_split["feature"] != 2, "Best split should not be on the target."
    data = np.array([
        [1, 2.5, 0],
        [1, 2.5, 0],
        [1, 2.5, 0]
    ])
    cart = CART()
    best_gain, best_split = cart._find_best_split(data)
    # Check that no split should be found if all features are the same
    assert best_split is None, "Best split should be None."
    print("Find best split tests passed.")

# Tests for Gini impurity calculations
def test_gini_1():
    data = np.array([
        [1, 2.5, 0],
        [2, 3.5, 1]
    ])
    cart = CART()
    gini = cart._gini_for_node(data)
    assert abs(gini - 0.5) < 1e-6, "Gini impurity for node is incorrect."
    left = data[:1]
    right = data[1:]
    gini = cart._gini_for_split(data, left, right)
    assert abs(gini - 0) < 1e-6, "Gini impurity for split is incorrect."
    print("Gini test 1 passed.")

def test_gini_2():
    data = np.array([
        [1, 0], [2, 0]
    ])
    cart = CART()
    gini = cart._gini_for_node(data)
    assert gini == 0, "Gini impurity for node is incorrect."
    left = data[:1]
    right = data[1:]
    gini = cart._gini_for_split(data, left, right)
    assert abs(gini) < 1e-9, "Gini impurity for split is incorrect."
    print("Gini test 2 passed.")

def test_split_1():
    data = np.array([
        [1, 2.5, 0],
        [1.5, 2, 0],
        [2, 3.5, 1]
    ])
    cart = CART()
    left, right = cart._split(data, 0, 1.25)
    # Check if the split is correct
    assert len(left) == 1, "Left split should have 1 row."
    assert len(right) == 2, "Right split should have 2 rows."
    print("Split test 1 passed.")

def test_split_2():
    data = np.array([
        [1, 2.5, 0],
        [1.5, 2, 0],
        [2, 3.5, 1]
    ])
    cart = CART()
    left, right = cart._split(data, 1, -1)
    # Check if the split is correct
    assert len(left) == 0, "Left split should have 0 rows."
    assert len(right) == 3, "Right split should have 3 row."
    print("Split test 2 passed.")

def test_best_split_1():
    data = np.array([
        [1, 2.5, 0],
        [1.5, 2, 0],
        [2, 3.5, 1]
    ])
    cart = CART()
    best_gain, best_split = cart._find_best_split(data)
    best_gain = round(best_gain, 5)
    assert best_gain == 0.44444, "Best gain is incorrect."
    assert best_split["feature"] == 1, "Best split feature is incorrect."
    assert best_split["threshold"] == 3, "Best split threshold is incorrect."
    print("Best split test 1 passed.")

def test_best_split_2():
    data = np.array([
        [2, 2, 0],
        [2.5, 2.5, 0],
        [3.5, 3.5, 0]
    ])
    cart = CART()
    best_gain, best_split = cart._find_best_split(data)
    best_gain = round(best_gain, 5)
    assert best_gain == 0, "Best gain is incorrect."
    assert best_split["feature"] == 1, "Best split feature is incorrect."
    assert best_split["threshold"] == 3, "Best split threshold is incorrect."
    print("Best split test 2 passed.")

def test_majority_class_1():
    data = np.array([
        [1, 2.5, 0],
        [1.5, 2, 0],
        [2, 3.5, 1]
    ])
    cart = CART()
    majority = cart._majority_class(data)
    assert majority == 0, "Majority class is incorrect."
    print("Majority class test 1 passed.")

def test_majority_class_2():
    data = np.array([
        [1, 2.5, 1],
        [1.5, 2, 1],
        [2, 3.5, 1]
    ])
    cart = CART()
    majority = cart._majority_class(data)
    assert majority == 1, "Majority class is incorrect."
    print("Majority class test 2 passed.")

def test_build_tree_1():
    data = np.array([
        [1, 2.5, 0],
        [1.5, 2, 0],
        [2, 3.5, 1],
    ])
    cart = CART(max_depth=2)
    tree = cart._build_tree(data)
    # Check if the tree is built correctly
    assert tree.feature == 1, "Root feature is incorrect."
    assert tree.threshold == 3, "Root threshold is incorrect."
    assert tree.left.label == 0, "Left leaf label is incorrect."
    assert tree.right.label == 1, "Right leaf label is incorrect."
    print("Build tree test 1 passed.")

def test_build_tree_2():
    data = np.array([
        [1, 2.5, 0],
        [2, 3.5, 0],
        [1.5, 2, 0]
    ])
    cart = CART(max_depth=2)
    tree = cart._build_tree(data)
    # Check if the tree is built correctly
    assert tree.label == 0, "Root label is incorrect."
    print("Build tree test 2 passed.")

In [70]:
# Below two functions are testing the Node to take numerical and string labels, and they should be leaf nodes
# Edge case: numerical and string labels
test_Node_1()
test_Node_2()

# The two functions below test the fit function: whether fit builds a tree from a small dataset and fails on an empty dataset
# Edge case: small and empty dataset
test_fit_1()
test_fit_2()

# Below functions are testing the loss function.
# Edge case: loss function with 0 loss and 0.5 loss and multi-dimensional data
test_loss_1()
test_loss_2()
test_loss_acc() # this test both loss and accuracy for multi-dimensional data

# Below functions are testing the accuracy function.
# Edge case: accuracy function with 1 accuracy and 0.5 accuracy
test_accuracy_1()
test_accuracy_2()

# testing for predict
# Edge case: single row prediction and batch prediction and unseen data
test_predict_1()
test_predict_2()

# The functions below test the _gini_for_node and _gini_for_split functions.
# Edge case: gini for node and split with 0 gini and multi-dimensional data
test_gini_1()
test_gini_2()

# Below function is testing the _split function.
# Edge case: split function with 1 row and 0 row, with a threshold being negative or below every value
test_split_1()
test_split_2()

# Below function is testing the _find_best_split function.
# Edge case: best split with 0 gain and 0.44444 gain, and there should be a tie, choose later feature.
test_best_split_1()
test_best_split_2()

# The functions below test the majority class
# Edge case: multiple class vs a single class
test_majority_class_1()
test_majority_class_2()

# Below functions are for testing build_tree
# Edge case: build a simple tree with depth 2 and a tree with all the same features
# also tested if the data is not in order
test_build_tree_1()
test_build_tree_2()

# Below functions are for pruning the tree
def test_prune_1():
    pass
def test_prune_2():
    pass

Node tests 1 passed.
Node tests 2 passed.
test fit 1 passed
test fit 2 passed
test loss 1 passed
test loss 2 passed
Loss and accuracy tests passed.
test accuracy 1 passed
test accuracy 2 passed
Predict test 1 passed.
Predict test 2 passed.
Gini test 1 passed.
Gini test 2 passed.
Split test 1 passed.
Split test 2 passed.
Best split test 1 passed.
Best split test 2 passed.
Majority class test 1 passed.
Majority class test 2 passed.
Build tree test 1 passed.
Build tree test 2 passed.


We do not unit test `prune` because it is only a thin wrapper that calls `_prune_tree`, so testing that function directly is enough. `_predict_row` and `_count_leaves` are not tested either, because they are simple recursive functions that return a leaf's label or count the number of leaves.

6. meanxai, 2023. [MXML-2-09] Decision Trees [9/11] - CART, Cost Complexity Pruning (CCP). YouTube. Available at: https://www.youtube.com/watch?v=my3ljAS5UUM&t=845s (Accessed: 12 December 2024)