# **CART**

CART is not the same as ID3.

CART typically performs binary splits. For continuous features, it finds the optimal threshold to split the data, creating binary splits. It also handle categorical features natively. 

CART uses Gini impurity for classification problems. For pruning, it includes a cost-complexity pruning method that allows for post-pruning to improve generalization and avoid overfitting. 

## **Representation** NOTE: MODIFY THIS TO ALIGN WITH THE ACTUAL CODE!!!!!!!!

Note: entropy or training error can also be used as the splitting criterion, but it's more common to use Gini impurity.

$$\text{CART}(S, A)$$
$\text{Inputs: training set }S\text{, feature subset }A\subseteq [d]$<br>
$\textbf{if }\text{all examples in }S\text{ are labeled by }1\text{, return a leaf }1$<br>
$\textbf{if }\text{all examples in }S\text{ are labeled by }0\text{, return a leaf }0$<br>
$\textbf{if }A=\emptyset\text{, return a leaf whose value }=\text{ majority of labels in }S$<br>
$\textbf{else}\text{:}$<br>
$\quad \textbf{foreach}\text{ feature }i\text{ in }A\text{:}$<br>
$\quad\quad\textbf{if }x_i\text{ is continuous: }$<br>
$\quad\quad\quad\text{Generate a sequence of thresholds }\theta \text{ based on the sorted values of }x_i$<br>
$\quad\quad\quad\textbf{foreach}\text{ threshold }\theta\text{ in the sequence:}$<br>
$\quad\quad\quad\quad S_1 = \{(\textbf{x},y)\in S:x_i\le\theta\}$<br>
$\quad\quad\quad\quad S_2 = \{(\textbf{x},y)\in S:x_i>\theta\}$<br>
$\quad\quad\quad\quad \text{Gini}(S,i)=\frac{|S_1|}{|S|}\cdot\text{Gini}(S_1)+\frac{|S_2|}{|S|}\cdot\text{Gini}(S_2)$<br>
$\quad\quad\textbf{if }x_i\text{ is categorical: }$<br>
$\quad\quad\quad\text{Let categories }= \text{ unique values of }x_i$<br>
$\quad\quad\quad\textbf{foreach}\text{ binary split }(a,b)\text{ of categories:}$<br>
$\quad\quad\quad\quad S_1 = \{(\textbf{x},y)\in S:x_i\in a\}$<br>
$\quad\quad\quad\quad S_2 = \{(\textbf{x},y)\in S:x_i\in b\}$<br>
$\quad\quad\quad\quad \text{Gini}(S,i)=\frac{|S_1|}{|S|}\cdot\text{Gini}(S_1)+\frac{|S_2|}{|S|}\cdot\text{Gini}(S_2)$<br>
$\quad \text{Let } j = \text{argmin}_{i\in A}\text{Gini}(S,i)$<br>
$\quad \textbf{if }j \text{ is continuous:}$<br>
$\quad\quad S_1 = \{(\textbf{x},y)\in S:x_j\le\theta\}$<br>
$\quad\quad S_2 = \{(\textbf{x},y)\in S:x_j>\theta\}$<br>
$\quad \textbf{if }j \text{ is categorical:}$<br>
$\quad\quad S_1 = \{(\textbf{x},y)\in S:x_j\in a\}$<br>
$\quad\quad S_2 = \{(\textbf{x},y)\in S:x_j\in b\}$<br>
$\quad\text{Let }T_1\text{ be the tree returned by CART}(S_1,A\backslash\{j\})$<br>
$\quad\text{Let }T_2\text{ be the tree returned by CART}(S_2,A\backslash\{j\})$<br>
$\quad\text{Return the tree}$


## **Loss**
$$\text{Gini}(S)=1-\sum^K_{k=1}p_k^2$$
We split two subsets namely $S_1$ and $S_2$, and we have the Gini impurity of $S_1$ and $S_2$ as:
$$\text{Gini}(S_1)=1-\sum^K_{k=1}p_k^2$$
$$\text{Gini}(S_2)=1-\sum^K_{k=1}p_k^2$$
Where:
- $p_k$ is the proportion of samples in the subset that belong to class $k$.
- $K$ is the number of unique labels in the dataset.

In a binary classification problem, $K=2$ and we have only $p_1$ and $p_2$. Then:
$$Gini(S)=1-(p_1^2+p_2^2)$$
Letting $a=p_1$, we have:
$$\text{Gini}(S)=1-(a^2+(1-a)^2)=2a(1-a)$$
After splitting, the weighted Gini impurity is calculated as:
$$\text{Gini}(S,i)=\frac{|S_1|}{|S|}\cdot\text{Gini}(S_1)+\frac{|S_2|}{|S|}\cdot\text{Gini}(S_2)$$

## **Optimizer**

We will use pruning to avoid overfitting. Pruning in CART is done using cost-complexity pruning, where a penalty term is added.

The cost-complexity function for a tree $T$ is:
$$R_\alpha(T)=R(T)+\alpha\cdot|T|$$
Where:

- $R(T)$ is the impurity or misclassification cost of the tree on $S$.
- $|T|$ is the number of leaves in the tree.
- $\alpha > 0$ is a regularization parameter.

We can use metrics like Gini impurity, entropy, or training error to calculate $R(T)$. For any given tree, $R(T)$ is the sum of the misclassification costs of all leaves in the tree. For each leaf $t$ in $T$, the misclassification cost $R(t)$ is defined as:
$$R(t)=\frac{1}{N_t}\sum_{i\in t}\ell(y_i,\hat y_{i,t})$$
Where:
-  $N_t$ is the number of samples in leaf $t$
- $\ell(y_i,\hat y_{i,t})$ is the loss for one sample, where $y_i$ is the true label and $\hat y_{i,t}$ is the predicted label for leaf $t$.

For simplicity, we will use the 0-1 loss as the loss function here.
$$
\ell(y_i, \hat{y}_t) = \begin{cases} 
1, & \text{if } y_i \neq \hat{y}_{i,t} \\ 
0, & \text{if } y_i = \hat{y}_{i,t} 
\end{cases}
$$

Then, $R(T)$ is defined as:
$$R(T)=\sum_{t\in\text{leaves}(T)}R(t)$$
Where $\text{leaves}(T)$ is the set of all leaves in $T$.

We will minimize $R_\alpha(T)$ to find the one that best balances between complexity and loss and use cross-validation to select the best $\alpha$.


## **Coding**
### EDA
- No missing values in the dataset, and this is a balanced dataset.
- `sex` and `exange` are binary.
    - `exange`: 0 means no exercise-induced angina and 1 means exercise-induced angina.
- `cp`, `fbs`, `restecg`, `slop`, `ca` and `thal` are ordinal. But there is no need to use the ordinal encoder since they are represented with numbers.
    - `fbs` is binary, 0 means normal and 1 links to diabetes.
    - `restecg`: 0 is the best (normal) and 2 is the worst (linked to severe cardiovascular issues).
    - `cp`: hard to define the order, need to pay attention to this variable.
    - `slope`: 0 is the best and 2 is the worst.
    - `ca`: 0 is the best and 4 is the worst.
    - `thal`: 1 is the best and 3 is the worst.
- **Note: later use correlation matrix...**


### Using scikit-learn

In [86]:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer 
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer 
from sklearn.model_selection import KFold, train_test_split, GridSearchCV, ParameterGrid
from sklearn.metrics import confusion_matrix, classification_report, recall_score
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score

In [87]:
data = pd.read_csv("heart.csv")
X = data.drop(columns=["target"])
y = data["target"]

In [88]:
# Preprocessor
cat_ftrs = ["sex", "exang"]
num_ftrs = ["age", "trestbps", "chol", "thalach", "oldpeak", "cp", "fbs", "restecg", "slope", "ca", "thal"]

categorical_transformer = Pipeline(steps=[
    ("onehot", OneHotEncoder(sparse_output=False, handle_unknown="ignore")),
    ("scaler", StandardScaler())
])
numerical_transformer = Pipeline(steps=[
    ("scaler", StandardScaler())
])
preprocessor = ColumnTransformer([
    ("cat", categorical_transformer, cat_ftrs),
    ("num", numerical_transformer, num_ftrs)
])

In [89]:
# ML pipeline
def MLpipe_kfold(X, y, random_states, preprocessor, ML_algo, param_grid, n_splits=5):
    test_scores = []
    best_models = []
    for i, random_state in enumerate(random_states):
        X_other, X_test, y_other, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
        pipe = make_pipeline(preprocessor, ML_algo)
        grid = GridSearchCV(pipe, param_grid=param_grid, cv=kf, n_jobs=-1, return_train_score=True, 
                            verbose=True, scoring="accuracy")
        grid.fit(X_other, y_other)
        results = pd.DataFrame(grid.cv_results_)
        best_models.append(grid)
        y_test_pred = best_models[-1].predict(X_test)
        test_accuracy = accuracy_score(y_test, y_test_pred)
        test_scores.append(test_accuracy)
    return test_scores, best_models

In [90]:
from sklearn.tree import DecisionTreeClassifier

random_states = [2060]
ML_algo = DecisionTreeClassifier(random_state=2060, criterion="gini")
param_grid = {
    "decisiontreeclassifier__max_depth": [None, 3, 5, 10, 20],
    "decisiontreeclassifier__min_samples_split": [2, 5, 10, 20],
    "decisiontreeclassifier__min_samples_leaf": [1, 2, 5, 10],
    "decisiontreeclassifier__max_leaf_nodes": [5, 10]
}
test_scores, best_models = MLpipe_kfold(X, y, random_states, preprocessor, ML_algo, param_grid, n_splits=10)
print("Average Testing Accuracy:", np.mean(test_scores))

Fitting 10 folds for each of 160 candidates, totalling 1600 fits
Average Testing Accuracy: 0.819672131147541


  _data = np.array(data, dtype=dtype, copy=copy,


In [91]:
import pandas as pd
import numpy as np
import copy
import math

df = pd.read_csv("heart.csv")
data.head() # 303 rows x 13 features, variable target is the target variable

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


### Implementation
#### Splitting

In [26]:
def train_test_split(data, test_size=0.4, random_state=2060):
    if random_state is not None:
        np.random.seed(random_state)
    n_samples = data.shape[0]
    indices = np.random.permutation(n_samples) # shuffling
    test_size = int(n_samples * test_size)
    test_indices = indices[:test_size]
    train_indices = indices[test_size:]
    train_data = data[train_indices]
    test_data = data[test_indices]
    return train_data, test_data

#### CART

In [68]:
class Node:
    """
    Represents a node in the decision tree. Each node can be one of the following:
    - Decision node: splits the data.
    - Leaf node: predicts the label.

    Parameters:
    - left: none for leaf nodes.
    - right: none for leaf nodes.
    - label: none for decision nodes.
    - feature: none for leaf nodes.
    - threshold: none for leaf nodes or categorical splits.
    """
    def __init__(self, left=None, right=None, label=None, feature=None, threshold=None):
        self.left = left
        self.right = right
        self.label = label
        self.feature = feature
        self.threshold = threshold

    def is_leaf(self):
        """
        Check if a node is a leaf node.
        """
        return self.label is not None

In [None]:
class CART:
    def __init__(self, max_depth=10, min_samples_split=10):
        """
        Parameters:
        - max_depth: int
            The maximum depth of the tree.
        - min_samples_split: int
            The minimum number of samples required to split an internal node.
        - tree: Node
            The node of the decision tree.
        """
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.tree = None
    
    def fit(self, data):
        """
        Fit the decision tree to the training data.

        Parameters:
        - data: np.ndarray
            The training dataset where the last column is the target variable.
        """
        self.tree = self._build_tree(data, depth=0)

    def _build_tree(self, data, depth):
        """
        Recursively build the tree.

        Parameters:
        - data: np.ndarray
            The training dataset at the current node, where the last column is the target variable.
        - depth: int
            The current depth of the tree.
        
        Returns:
        - Node:
            The node of the decision tree.
        """
        labels = data[:, -1]
        # Stopping conditions
        # All labels are the same
        if len(np.unique(labels)) == 1:
            return Node(label=labels[0])
        # Max depth or minimum split size is reached
        # TODO: check if we need to add the = case?
        if depth > self.max_depth or len(data) < self.min_samples_split:
            major = np.bincount(labels.astype(int)).argmax()
            return Node(label=major) # return the majority class

        # Find the best split
        best_split = self._find_best_split(data)
        # No valid split
        if not best_split:
            major = np.bincount(labels.astype(int)).argmax()
            return Node(label=major) # return the majority class
        # Remove the splitted categorical feature
        if best_split["type"] == "categorical":
            remaining_data_left = np.delete(best_split["left"], best_split["feature"], axis=1)
            remaining_data_right = np.delete(best_split["right"], best_split["feature"], axis=1)
        else:
            remaining_data_left = best_split["left"]
            remaining_data_right = best_split["right"]
        # Recursion
        left_tree = self._build_tree(remaining_data_left, depth + 1)
        right_tree = self._build_tree(remaining_data_right, depth + 1)
        return Node(
            left=left_tree,
            right=right_tree,
            feature=best_split["feature"],
            threshold=best_split["threshold"]
        )
        
    def _find_best_split(self, data):
        """
        Find the best split for the current data.

        Parameters:
        - data: np.ndarray
            The training dataset at the current node.
        
        Returns:
        - A dictionary containing the best split information.
        """
        best_gini = float("inf") # use positive infinity since we will minimize it
        best_split = None
        n_features = data.shape[1] - 1 # last column is the target
        for feature in range(n_features):
            unique_values = np.unique(data[:, feature])
            sorted_values = np.sort(unique_values)
            if len(unique_values) > 2:
                thresholds = (sorted_values[1:] + sorted_values[:-1]) / 2 # midpoints
                for threshold in thresholds:
                    left, right = self._split_continuous(data, feature, threshold)
                    if len(left) == 0 or len(right) == 0:
                        continue
                    gini = self._gini_for_split(data, left, right)
                    if gini < best_gini:
                        best_gini = gini
                        best_split = {
                            "feature": feature,
                            "threshold": threshold,
                            "left": left,
                            "right": right,
                            "type": "continuous"
                        }
            else:
                left, right = self._split_categorical(data, feature)
                if len(left) == 0 or len(right) == 0:
                    continue
                gini = self._gini_for_split(data, left, right)
                if gini < best_gini:
                    best_gini = gini
                    best_split = {
                        "feature": feature,
                        "threshold": threshold,
                        "left": left,
                        "right": right,
                        "type": "categorical"
                    }
        return best_split
    
    # TODO: add a line to handle cases when len(data) = 0 in Gini?

    def _gini_for_node(self, data):
        """
        Calculate the Gini impurity for a node.
        """
        labels = data[:, -1] # the last column
        unique_labels, counts = np.unique(labels, return_counts=True)
        probs = counts / len(data) # two probabilities of being in two classes
        gini = 1 - np.sum(probs ** 2)
        return gini

    def _gini_for_split(self, data, left, right):
        """
        Calculate the Gini impurity for a split.
        """
        total_size = len(data)
        left_size = len(left)
        right_size = len(right)
        gini_left = gini_for_node(left)
        gini_right = gini_for_node(right)
        gini = (left_size / total_size) * gini_left + (right_size / total_size) * gini_right
        return gini

    def _split_continuous(self, data, feature_index, threshold):
        """
        Split the data based on a continuous feature.

        Parameters:
        - data: np.ndarray
            Dataset to split.
        - feature_index: int
            Index of the feature to split.
        
        Returns:
        - Two subsets of the data.
        """
        left = data[data[:, feature_index] <= threshold]
        right = data[data[:, feature_index] > threshold]
        return left, right

    def _split_categorical(self, data, feature_index):
        """
        Split the data based on a categorical feature.

        Parameters:
        - data: np.ndarray
            Dataset to split.
        - feature_index: int
            Index of the feature to split.

        Returns:
        - Two subsets of the data.
        """
        values = np.unique(data[:, feature_index])
        threshold = np.mean(values)
        left = data[data[:, feature_index] <= threshold]
        right = data[data[:, feature_index] > threshold]
        return left, right

    def _predict_row(self, node, row):
        """
        Predict the label for a single row.
        """
        if node.is_leaf():
            return node.label
        # Categorical split
        if node.threshold is None:
            return self._predict_row(node.left, row) if row[node.feature] == 0 else self._predict_row(node.right, row)
        else:  
            return self._predict_row(node.left, row) if row[node.feature] <= node.threshold else self._predict_row(node.right, row)
        
    def predict(self, test_data):
        """
        Predict the labels for the test data.
        """
        return np.array([self._predict_row(self.tree, row) for row in test_data])
    
    def loss(self, data):
        """
        Calculate the loss for the test data.
        """
        preds = self.predict(data[:, :-1])
        true_labels = data[:, -1]
        return np.sum(preds != true_labels) / len(true_labels)
    
    def accuracy(self, data):
        """
        Calculate the accuracy for the test data.
        """
        return 1 - self.loss(data)
    
    def visualize(self):
        """
        Visualize the decision tree.
        """
        if self.tree is None:
            print("Empty tree.")
        else:
            print("--- START PRINT TREE ---")
            self._visualize_tree(self.tree)
            print("--- END PRINT TREE ---")

    def _visualize_tree(self, node, depth=0):
        """
        Recursively visualize the decision tree.
        """
        indent = "  " * depth
        if node.is_leaf():
            print(f"{indent}Predict -> {node.label}")
        else:
            if node.threshold is None:
                print(f"{indent}split attribute = {node.feature}; categorical")
            else:
                print(f"{indent}split attribute = {node.feature}; threshold = {node.threshold:.3f}")
            print(f"{indent}Left:")
            self._visualize_tree(node.left, depth + 1)
            print(f"{indent}Right:")
            self._visualize_tree(node.right, depth + 1)

## Unit Tests

In [None]:
# Tests for Gini calculations and splitting
data = np.array([
    [10, 0, 0],
    [20, 1, 0],
    [30, 0, 1],
    [40, 1, 1],
    [50, 0, 0]
])


--- START PRINT TREE ---
split attribute = 0; threshold = 25.000
Left:
  Predict -> 0
Right:
  split attribute = 0; threshold = 45.000
  Left:
    Predict -> 1
  Right:
    Predict -> 0
--- END PRINT TREE ---


## Main

In [None]:
# Load and split the data
data = np.loadtxt("heart.csv", delimiter=",", skiprows=1)
train_data, test_data = train_test_split(data, test_size=0.4, random_state=2060)

In [85]:
model = CART(max_depth=10, min_samples_split=10)
model.fit(train_data)
train_accuracy = model.accuracy(train_data)
test_accuracy = model.accuracy(test_data)
print(f"Training accuracy: {train_accuracy:.3f}")
print(f"Testing accuracy: {test_accuracy:.3f}")

model.visualize()

Training accuracy: 0.934
Testing accuracy: 0.793
--- START PRINT TREE ---
split attribute = 11; threshold = 0.500
Left:
  split attribute = 12; threshold = 2.500
  Left:
    split attribute = 9; threshold = 2.700
    Left:
      split attribute = 7; threshold = 92.500
      Left:
        Predict -> 0.0
      Right:
        split attribute = 3; threshold = 158.000
        Left:
          split attribute = 9; threshold = 1.700
          Left:
            Predict -> 1.0
          Right:
            Predict -> 1
        Right:
          Predict -> 1
    Right:
      Predict -> 0
  Right:
    split attribute = 7; threshold = 143.500
    Left:
      split attribute = 9; threshold = 0.250
      Left:
        Predict -> 1.0
      Right:
        Predict -> 0.0
    Right:
      split attribute = 2; threshold = 0.500
      Left:
        Predict -> 0
      Right:
        split attribute = 0; threshold = 39.000
        Left:
          Predict -> 0.0
        Right:
          split attribute = 0; thr