## DATA2060 Final Project

Model: **CART for classification**

Team Members:
- Muxin Fu
- Yixiao Zhang
- Jingming Xu
- Mingrui Chen

### 0. Introduction

#### 0.1 Overview  
The Classification and Regression Tree (CART) algorithm is a nonparametric supervised learning method that builds a binary decision tree for classification tasks. At each step, the algorithm selects a feature and threshold that create two child nodes with lower class impurity, using criteria such as Gini impurity or entropy. Through this recursive partitioning, CART represents the classifier as a set of piecewise-constant regions, where each leaf corresponds to a predicted class label. Because the sequence of splits directly mirrors the decision-making process, CART offers a transparent and intuitive model structure.

#### 0.2 Advantages  
CART offers several notable strengths that contribute to its widespread use as a baseline classifier.  
First, the model is highly interpretable: each internal node corresponds to a clear “if–then” condition based on a single feature, allowing the entire decision path to be easily traced and communicated. This transparency is particularly valuable in settings where model explanations are required.  

Second, CART is able to capture nonlinear relationships and feature interactions without relying on explicit transformations or parametric assumptions. Its recursive splitting procedure enables the model to adapt flexibly to irregular or complex decision boundaries, providing expressive power beyond that of linear models.  

Moreover, CART requires minimal preprocessing. It can accommodate both numerical and categorical variables, is invariant to monotonic transformations of individual features (so rescaling or standardization is unnecessary), and implicitly performs feature selection by choosing splits only on informative variables. These characteristics make CART convenient to implement and reliable across a wide range of practical applications.

#### 0.3 Disadvantages  
Despite its advantages, CART also presents several limitations that must be considered.  
Most importantly, the model is prone to overfitting when allowed to grow without constraints. As emphasized in the bias–complexity trade-off discussed in the course reading, increasing model flexibility reduces approximation error but raises estimation error, causing deep, unpruned trees to exhibit high variance and poor generalization.  

CART also tends to be unstable: small perturbations in the training data can alter early splits, resulting in substantially different tree structures. This sensitivity undermines the model’s reliability, especially in contexts requiring stable predictions.  

Finally, because CART relies exclusively on axis-aligned splits, it may need many successive partitions to approximate diagonal or curved decision boundaries, leading to unnecessarily deep and complex trees. These shortcomings motivate the use of pruning techniques and more advanced ensemble methods, such as Random Forests and Gradient Boosting, which address variance and stability issues more effectively.


### 1. Representation

### 2. Loss

In the classification setting, the loss is a **measure of impurity**. CART minimizes impurity, and the loss is defined per split. **Gini impurity** and **entropy** are the two standard choices.

To compute the loss, we need:
* an impurity measure, and
* a split loss based on the chosen impurity measure.

In scikit-learn, the impurity measure is set by the **criterion** parameter: *{"gini", "entropy", "log_loss"}, default="gini"*.



#### 2.1 **Impurity Function**

For a $K$-class classification problem, consider node $i$ containing a subset of samples

$$S_i = \{(x_j, y_j)\}_{j \in \mathcal{I}_i}, \qquad N_i = |S_i|.$$

The number of samples in node $i$ that belong to class $k$ is

$$n_{i,k} = \sum_{j \in \mathcal{I}_i} \mathbf{1}(y_j = k).$$

The class proportion of class $k$ in node $i$ is

$$p_{i,k} = \frac{n_{i,k}}{N_i}, \qquad k = 1, \dots, K.$$

These proportions are nonnegative and sum to one:

$$\sum_{k=1}^K p_{i,k} = 1, \qquad p_{i,k} \ge 0 \quad \text{for } k = 1, \dots, K.$$

##### 2.1.1 **Gini**


- The Gini impurity of node $i$ is:   
$$G_i = 1 - \sum_{k=1}^K p_{i,k}^2.$$

##### 2.1.2 **Entropy**

- The entropy impurity of node $i$ is
$$H_i = - \sum_{k=1}^K p_{i,k} \log p_{i,k},$$

- By convention, $0 \log 0 = 0$, so classes that are absent from the node contribute nothing to the sum.
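
The two impurity definitions above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the project's implementation: the helper names `class_proportions`, `gini`, and `entropy` are hypothetical, the natural logarithm is assumed, and the label list is assumed to be non-empty.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Return the class proportions p_{i,k} for the labels in one node."""
    n = len(labels)  # assumes a non-empty node
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    """Entropy impurity: -sum_k p_k log p_k (natural log assumed).

    Counter only reports classes that are present, so every p is positive
    and the 0 log 0 = 0 convention never has to be handled explicitly.
    """
    return -sum(p * math.log(p) for p in class_proportions(labels))

node = ["a", "a", "b", "b"]   # toy node with two balanced classes
print(gini(node))             # 0.5
print(entropy(node))          # log(2), about 0.693
```

Both functions return 0 for a pure node and reach their maximum when the classes are evenly mixed.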

#### 2.2 **Split Loss**

Given a candidate split $\theta$ applied at node $i$, the dataset $S_i$ is partitioned into a left subset $S_i^{\text{left}}(\theta)$ and a right subset $S_i^{\text{right}}(\theta)$:

$$
S_i^{\text{left}}(\theta) = \{(x_j, y_j) \in S_i \mid x_{j, f} \le t\},
$$

$$
S_i^{\text{right}}(\theta) = S_i \setminus S_i^{\text{left}}(\theta),
$$

where $\theta = (f, t)$ denotes the split feature index $f$ and the threshold value $t$.

Let the number of samples in the left and right subsets be

$$
N_i^{\text{left}} = |S_i^{\text{left}}(\theta)|, \qquad 
N_i^{\text{right}} = |S_i^{\text{right}}(\theta)|.
$$

Their corresponding class proportions are computed in the same way as in Section 2.1.


##### 2.2.1 **Weighted Child Impurity**

Given an impurity function $C(\cdot)$ (e.g., Gini or entropy), the **split loss** at node $i$ for candidate split $\theta$ is defined as the weighted sum of the left and right child impurities:

$$
L(S_i, \theta) 
= 
\frac{N_i^{\text{left}}}{N_i} 
\, C\!\left(S_i^{\text{left}}(\theta)\right)
\;+\;
\frac{N_i^{\text{right}}}{N_i}
\, C\!\left(S_i^{\text{right}}(\theta)\right).
$$

Here:

- $C\!\left(S_i^{\text{left}}(\theta)\right)$ is the impurity (Gini or entropy) of the left child node.
- $C\!\left(S_i^{\text{right}}(\theta)\right)$ is the impurity of the right child node.
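
As a rough sketch of the weighted child impurity for a single candidate split $\theta = (f, t)$, the snippet below partitions a toy dataset and evaluates $L(S_i, \theta)$ with Gini impurity. The function name `split_loss`, the list-of-rows data layout, and the choice to raise an error when a child is empty are assumptions made for this illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_loss(X, y, f, t, impurity=gini):
    """Weighted child impurity L(S_i, theta) for theta = (f, t)."""
    # Left child: samples with x_{j,f} <= t; right child: the rest.
    left = [label for row, label in zip(X, y) if row[f] <= t]
    right = [label for row, label in zip(X, y) if row[f] > t]
    if not left or not right:
        # Illustration-only choice: reject splits that leave a child empty.
        raise ValueError("split leaves one child empty")
    n = len(y)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)

# Toy data: one feature, two classes.
X = [[1.0], [2.0], [3.0], [4.0]]
y = ["a", "a", "b", "b"]
print(split_loss(X, y, f=0, t=2.0))  # 0.0: both children are pure
print(split_loss(X, y, f=0, t=1.0))  # about 0.333: the right child is mixed
```

The threshold $t = 2$ separates the two classes perfectly, so its split loss is lower than that of $t = 1$.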


##### 2.2.2 **Optimal Split Selection**

The optimal split parameter is chosen by minimizing the split loss:

$$
\theta^{*} = \arg\min_{\theta} \; L(S_i, \theta).
$$

Section 3 (Optimizer) explains how this minimization is carried out in practice.

- ***Reference***: scikit-learn mathematical formulation https://scikit-learn.org/stable/modules/tree.html#tree-mathematical-formulation

### 3. Optimizer

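#### 3.1 What Is Optimized in CART

CART performs **greedy, recursive partitioning**: at each node, it selects the split that maximizes the impurity gain or, equivalently, minimizes the weighted impurity of the two child nodes.

The optimizer is therefore a **greedy search algorithm** that finds

$$
\arg\min_{(f,t)} \; \frac{|S_{\text{left}}|}{|S|} \, \text{Impurity}(S_{\text{left}}) + \frac{|S_{\text{right}}|}{|S|} \, \text{Impurity}(S_{\text{right}})
$$

where $f$ is the feature index and $t$ is the threshold. The search is greedy because each split is chosen to be optimal for the current node only; the resulting tree is not guaranteed to be globally optimal.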


##### 3.1.1 Objective Function


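CART minimizes an **impurity measure** (loss function) such as:
- Gini index:
$$ G(S) = 1 - \sum_{k=1}^{K} p_k^2 $$
- Entropy:
$$ H(S) = - \sum_{k=1}^{K} p_k \log p_k $$

where $p_k$ is the proportion of class $k$ among the samples in $S$.

At each node, the gain of a candidate split $(f, t)$ is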
$$
\text{Gain}(S, f, t) = \text{Impurity}(S) 
- \frac{|S_{\text{left}}|}{|S|} \, \text{Impurity}(S_{\text{left}}) 
- \frac{|S_{\text{right}}|}{|S|} \, \text{Impurity}(S_{\text{right}})
$$

The algorithm chooses the feature $f^*$ and threshold $t^*$ that maximize this gain.
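
As a sanity check on the gain formula, consider a hypothetical node (numbers chosen purely for illustration) with 4 samples of class A and 4 of class B, split so that the left child holds 3 A and 1 B and the right child holds 1 A and 3 B. Using Gini impurity:

$$
\text{Impurity}(S) = 1 - \left(\tfrac{1}{2}\right)^2 - \left(\tfrac{1}{2}\right)^2 = 0.5,
\qquad
\text{Impurity}(S_{\text{left}}) = \text{Impurity}(S_{\text{right}}) = 1 - \left(\tfrac{3}{4}\right)^2 - \left(\tfrac{1}{4}\right)^2 = 0.375,
$$

$$
\text{Gain} = 0.5 - \tfrac{4}{8}(0.375) - \tfrac{4}{8}(0.375) = 0.125.
$$

Since $\text{Impurity}(S)$ does not depend on $(f, t)$, maximizing the gain selects the same split as minimizing the weighted child impurity.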

##### 3.1.2 Pseudo-code

```python
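# Runnable version of the greedy split search.
# Assumes X is a 2-D NumPy array (samples x features) and y a 1-D label array.
import numpy as np

def gini(y):
    """Gini impurity of the label array y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y, impurity=gini):
    """Return the (feature, threshold) pair with the largest impurity gain."""
    n_samples, n_features = X.shape
    parent_impurity = impurity(y)
    best_gain, best_feature, best_threshold = 0.0, None, None

    for f in range(n_features):
        # Candidate thresholds: the distinct observed values of feature f.
        for t in np.unique(X[:, f]):
            left = X[:, f] <= t
            n_left = left.sum()
            if n_left == 0 or n_left == n_samples:
                continue  # skip splits that leave one child empty
            gain = (parent_impurity
                    - n_left / n_samples * impurity(y[left])
                    - (n_samples - n_left) / n_samples * impurity(y[~left]))
            if gain > best_gain:
                best_gain, best_feature, best_threshold = gain, f, t

    # (None, None) means no split improved on the parent impurity.
    return best_feature, best_threshold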

