# Advanced Tree Learning Concepts: Mathematical Foundations

## Table of Contents
1. [Gradient-based One-Side Sampling (GOSS)](#goss)
2. [Dropouts meet Multiple Additive Regression Trees (DART)](#dart)
3. [Leaf-wise vs Level-wise Growth](#leaf-vs-level)
4. [Feature Bundling and Optimization](#feature-bundling)
5. [Mathematical Foundations](#mathematical-foundations)

---

## Gradient-based One-Side Sampling (GOSS) {#goss}

### Intuitive Concept
GOSS is a sampling technique that intelligently selects training samples based on their gradients. The key insight is that **samples with larger gradients contribute more to learning**, while samples with small gradients are already well-fitted.

Think of it like a teacher focusing more attention on students who are struggling (large gradients) rather than those who already understand the material (small gradients).

### Mathematical Formulation

Let $\mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$ be our training dataset.

For each sample $i$, we have gradient $g_i$ and Hessian $h_i$:
- $g_i = \frac{\partial L(y_i, F(\mathbf{x}_i))}{\partial F(\mathbf{x}_i)}$
- $h_i = \frac{\partial^2 L(y_i, F(\mathbf{x}_i))}{\partial F(\mathbf{x}_i)^2}$

**GOSS Algorithm:**

1. **Sort samples by gradient magnitude**: $|g_1| \geq |g_2| \geq \ldots \geq |g_n|$

2. **Select the top $a \cdot n$ samples**: $A = \{i : i \leq a \cdot n\}$ (large gradients)

3. **Sample uniformly from the rest**: draw $b \cdot n$ indices without replacement from $\{a \cdot n + 1, \ldots, n\}$ to form $B$ (small gradients)

4. **Compute weighted information gain**:
   $$\text{Gain}_{GOSS} = \frac{1}{|A| + \frac{1-a}{b}|B|} \left[ \frac{(\sum_{i \in A} g_i + \frac{1-a}{b}\sum_{i \in B} g_i)^2}{\sum_{i \in A} h_i + \frac{1-a}{b}\sum_{i \in B} h_i} \right]$$

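Since $|A| = a n$ and $|B| = b n$, the normalizer $|A| + \frac{1-a}{b}|B|$ equals $n$. The factor $\frac{1-a}{b}$ **amplifies** the contribution of the sampled small-gradient points so that weighted sums over $B$ estimate the sums over all $(1-a) n$ small-gradient samples, compensating for under-sampling.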
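The four steps above can be sketched in NumPy. This is a minimal illustration, not a library implementation: the function names `goss_sample` and `weighted_score` are hypothetical, and set sizes are obtained by truncating $a \cdot n$ and $b \cdot n$ to integers.

```python
import numpy as np

def goss_sample(grad, a=0.2, b=0.1, rng=None):
    """Return (indices, weights) for one GOSS draw.

    a: fraction of samples kept for having the largest |gradient|.
    b: fraction of samples drawn uniformly from the remainder.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = grad.shape[0]
    n_top = int(a * n)
    n_rand = int(b * n)

    # Step 1: order samples by descending gradient magnitude.
    order = np.argsort(-np.abs(grad))
    top_idx = order[:n_top]                                   # set A
    rest = order[n_top:]
    # Step 3: uniform sample without replacement from the rest.
    rand_idx = rng.choice(rest, size=n_rand, replace=False)   # set B

    idx = np.concatenate([top_idx, rand_idx])
    weights = np.ones(idx.shape[0])
    weights[n_top:] = (1.0 - a) / b                           # amplify set B
    return idx, weights


def weighted_score(grad, hess, idx, weights):
    """(sum w*g)^2 / (sum w*h): the bracketed term of the GOSS gain."""
    g = np.sum(weights * grad[idx])
    h = np.sum(weights * hess[idx])
    return g * g / h
```

With `a=0.2` and `b=0.1`, only 30% of the rows are touched per tree, yet the weights sum back to $n$, which is what keeps the weighted sums on the same scale as the full-data sums.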

### Why GOSS Works

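**Claim (GOSS Approximation):** Because $B$ is a uniform sample re-weighted by $\frac{1-a}{b}$, the weighted gradient and Hessian sums are unbiased estimates of the full-data sums. The gain is a nonlinear function of those sums, so it is only approximately unbiased:

$$\mathbb{E}[\text{Gain}_{GOSS}] \approx \text{Gain}_{Full}$$

The approximation tightens as $n$ grows. Relative to uniform random sampling at the same budget, the variance factor is approximately $a + b(1-a)^2$; this figure is a heuristic and is not derived here.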

---

## Dropouts meet Multiple Additive Regression Trees (DART) {#dart}

### Intuitive Concept
DART addresses **overfitting in boosting** by applying dropout to trees. Instead of always fitting each new tree against the full ensemble of previous trees, DART randomly "drops out" some previous trees while the new tree is trained.

This is like studying for an exam by sometimes ignoring your previous notes to avoid over-relying on them.

### Mathematical Framework

Standard gradient boosting:
$$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \gamma_m h_m(\mathbf{x})$$

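**DART modification:** at iteration $m$, a random subset $\mathcal{D}_m \subseteq \{1, \ldots, m-1\}$ of the existing trees is dropped, and the new tree is fit against the reduced ensemble:
$$\hat{F}_{m-1}(\mathbf{x}) = \sum_{k \in \mathcal{K}_m} \gamma_k h_k(\mathbf{x})$$

Where:
- $\mathcal{K}_m = \{1, 2, \ldots, m-1\} \setminus \mathcal{D}_m$ is the set of **non-dropped trees**
- $\gamma_k$ is the current weight of tree $k$

### DART Algorithm

1. **Dropout Selection**: For each iteration $m$, drop each existing tree independently with probability $p_{drop}$:
   $$\mathcal{D}_m = \{k \in \{1, \ldots, m-1\} : u_k < p_{drop}\}, \quad u_k \sim \text{Uniform}(0, 1)$$

2. **New Tree Training**: Train $h_m$ on residuals from the reduced ensemble (for squared-error loss; for a general loss, on the negative gradients evaluated at $\hat{F}_{m-1}$):
   $$r_i^{(m)} = y_i - \sum_{k \in \mathcal{K}_m} \gamma_k h_k(\mathbf{x}_i)$$

3. **Weight Normalization**: The new tree and the dropped trees both try to close the same gap, so adding them back unscaled would overshoot. With $d = |\mathcal{D}_m|$ dropped trees, scale the new tree by $\frac{1}{d+1}$ and the dropped trees by $\frac{d}{d+1}$:
   $$F_m(\mathbf{x}) = \sum_{k \in \mathcal{K}_m} \gamma_k h_k(\mathbf{x}) + \frac{d}{d+1} \sum_{k \in \mathcal{D}_m} \gamma_k h_k(\mathbf{x}) + \frac{1}{d+1} h_m(\mathbf{x})$$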
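One DART iteration can be sketched as follows for squared-error loss. The sketch assumes the normalization from the original DART formulation (with $d$ trees dropped, the new tree is scaled by $1/(d+1)$ and the dropped trees by $d/(d+1)$) and forces one drop when the random draw selects none. The names `dart_step` and `fit_tree` are hypothetical; `fit_tree` stands in for any weak learner that maps residuals to predictions.

```python
import numpy as np

def dart_step(tree_preds, y, fit_tree, p_drop=0.1, rng=None):
    """One DART iteration for squared-error loss.

    tree_preds: list of per-tree prediction arrays (already weighted).
    fit_tree:   hypothetical callable mapping residuals -> prediction array.
    Returns the updated list of weighted per-tree predictions.
    """
    rng = np.random.default_rng() if rng is None else rng
    m = len(tree_preds)

    # Drop each existing tree independently with probability p_drop;
    # if nothing was selected, drop one tree at random (when any exist).
    dropped = rng.random(m) < p_drop
    if m > 0 and not dropped.any():
        dropped[rng.integers(m)] = True
    n_dropped = int(dropped.sum())

    # Residuals of the reduced ensemble (kept trees only).
    kept_sum = sum(
        (p for p, d in zip(tree_preds, dropped) if not d),
        np.zeros_like(y, dtype=float),
    )
    new_pred = fit_tree(y - kept_sum)

    # Rescale: dropped trees by d/(d+1), new tree by 1/(d+1).
    updated = [
        p * n_dropped / (n_dropped + 1) if d else p
        for p, d in zip(tree_preds, dropped)
    ]
    updated.append(new_pred / (n_dropped + 1))
    return updated
```

The rescaling is what keeps the ensemble from overshooting: if both the dropped trees and the new tree fully explain the same residual, their rescaled contributions add up to exactly one copy of it.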

### Theoretical Justification

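**Regularization Effect**: When each new tree is fit, only a fraction $1-p_{drop}$ of the existing trees is active on average, which acts as implicit regularization. As a rough heuristic (not a proven identity):

$$\mathbb{E}[\text{Model Complexity}] \approx (1-p_{drop}) \cdot \text{Standard Complexity}$$

**Variance Reduction**: A similar heuristic bound on the prediction variance, stated without proof:
$$\text{Var}[F_{DART}(\mathbf{x})] \leq (1-p_{drop}^2) \cdot \text{Var}[F_{GBM}(\mathbf{x})]$$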

---

## Leaf-wise vs Level-wise Growth {#leaf-vs-level}

### Visual Representation

```
Level-wise (Breadth-First):
       Root
      /    \
    N1      N2      ← Level 1: Split both nodes
   /  \    /  \
  N3  N4  N5  N6    ← Level 2: Split all nodes

Leaf-wise (Best-First):
       Root
      /    \
    N1      N2
   /  \      
  N3  N4     
 /  \        
N5  N6       ← Only split the most beneficial leaf
```

### Mathematical Analysis

**Information Gain Comparison:**

For level-wise splitting, we maximize:
$$\Delta_{\text{level}} = \sum_{i \in \text{Level } L} \max_{\text{split } s_i} \text{Gain}(s_i)$$

For leaf-wise splitting:
$$\Delta_{\text{leaf}} = \max_{\text{leaf } j} \max_{\text{split } s_j} \text{Gain}(s_j)$$

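**Observation**: $\Delta_{\text{leaf}} \geq \max_{s_i} \text{Gain}(s_i)$ for every current leaf $i$, so each leaf-wise split is the greedy (locally optimal) choice. $\Delta_{\text{level}}$ sums the gains of many splits and is not directly comparable to a single leaf-wise split; the fair comparison fixes the number of leaves, where leaf-wise growth tends to reach a lower training loss than level-wise growth.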
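The difference between the two growth orders is easiest to see with a fixed split budget. The sketch below uses a hypothetical table of best-split gains per node (node `i` has children `2i+1` and `2i+2`); the numbers are made up for illustration, and `leaf_wise` and `level_wise` are illustrative names.

```python
import heapq
from collections import deque

def children(node):
    return 2 * node + 1, 2 * node + 2

def leaf_wise(gains, n_splits):
    """Split the frontier leaf with the largest gain, n_splits times."""
    heap = [(-gains[0], 0)]            # max-heap via negated gains
    order = []
    while heap and len(order) < n_splits:
        _, node = heapq.heappop(heap)
        order.append(node)
        for child in children(node):
            if child in gains:         # child has a candidate split
                heapq.heappush(heap, (-gains[child], child))
    return order

def level_wise(gains, n_splits):
    """Split nodes in breadth-first order, n_splits times."""
    queue = deque([0])
    order = []
    while queue and len(order) < n_splits:
        node = queue.popleft()
        order.append(node)
        queue.extend(c for c in children(node) if c in gains)
    return order

# Hypothetical best-split gain for each node of a small candidate tree.
gains = {0: 10.0, 1: 6.0, 2: 1.0, 3: 5.0, 4: 0.5, 5: 0.2, 6: 0.1}

leaf_order = leaf_wise(gains, 3)       # [0, 1, 3]
level_order = level_wise(gains, 3)     # [0, 1, 2]
leaf_total = sum(gains[i] for i in leaf_order)     # 21.0
level_total = sum(gains[i] for i in level_order)   # 17.0
```

With three splits, the best-first order reaches node 3 (gain 5.0) while the breadth-first order spends its third split on node 2 (gain 1.0). The same mechanism is why leaf-wise trees can become deep and unbalanced.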

### Computational Complexity

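| Method | Split-finding Time | Nodes in Memory | Overfitting Risk |
|--------|--------------------|-----------------|------------------|
| Level-wise | $O(n \cdot d)$ per level, $O(L \cdot n \cdot d)$ for depth $L$ | $O(2^L)$ | Lower |
| Leaf-wise | $O(n_j \cdot d)$ per split, at most $O(k \cdot n \cdot d)$ for $k$ leaves | $O(k)$ | Higher |

Where $L$ is depth, $k$ is number of leaves, $n$ is samples, $n_j \leq n$ is the number of samples in the leaf being split, and $d$ is features. Each sample sits in exactly one node per level, so one level-wise pass touches every sample once per feature.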

### Regularization for Leaf-wise Growth

To prevent overfitting, leaf-wise methods use:

1. **Minimum samples per leaf**: $\text{samples}_{\text{leaf}} \geq \lambda_1$
2. **Maximum depth**: $\text{depth} \leq \lambda_2$  
3. **Minimum gain threshold**: $\text{Gain} \geq \lambda_3$

---

## Feature Bundling and Optimization {#feature-bundling}

### Exclusive Feature Bundling (EFB)

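**Core Insight**: In sparse, high-dimensional data many features are (nearly) mutually exclusive, meaning they rarely take nonzero values on the same sample, so they can be bundled into a single feature.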

**Mathematical Formulation:**

Features $f_i$ and $f_j$ are bundled if:
$$\text{Conflict}(f_i, f_j) = \frac{|\{k : f_i(\mathbf{x}_k) \neq 0 \text{ and } f_j(\mathbf{x}_k) \neq 0\}|}{n} < \theta$$

**Bundle Construction Algorithm:**
1. Build conflict graph $G = (V, E)$ where $V$ are features, $E$ are conflicts
2. Find graph coloring (NP-hard, use greedy approximation)
3. Each color class becomes a bundle

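**Bundle Merging:**
For features $f_1, f_2, \ldots, f_k$ in a bundle, each feature's values are shifted by an additive offset so that the features occupy disjoint value ranges:
$$f_{\text{bundle}}(\mathbf{x}) = \begin{cases} \text{offset}_i + f_i(\mathbf{x}) & \text{if } f_i(\mathbf{x}) \neq 0 \\ 0 & \text{if } f_i(\mathbf{x}) = 0 \text{ for all } i \end{cases}$$

Where $\text{offset}_i = \sum_{j < i} B_j$, with $B_j$ the number of bins of $f_j$, ensures no value collision.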
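The conflict test, bundle construction, and merging can be sketched together. This is a simplified illustration: it assigns each feature to the first compatible bundle in column order rather than running a full graph-coloring heuristic, it assumes integer-binned features where bin 0 means "zero", and it shifts nonzero bins by an additive offset so that value ranges do not overlap. All function names are hypothetical.

```python
import numpy as np

def conflict_rate(fi, fj):
    """Fraction of rows where both features are nonzero."""
    return np.mean((fi != 0) & (fj != 0))

def greedy_bundles(X, theta=0.05):
    """Put each column into the first bundle it conflicts with below theta."""
    bundles = []
    for j in range(X.shape[1]):
        for bundle in bundles:
            if all(conflict_rate(X[:, j], X[:, i]) < theta for i in bundle):
                bundle.append(j)
                break
        else:
            bundles.append([j])        # no compatible bundle: start a new one
    return bundles

def merge_bundle(X_binned, bundle, n_bins):
    """Merge integer-binned columns into one column with disjoint ranges.

    Nonzero bins of each feature are shifted by the total bin count of
    the features that precede it in the bundle.
    """
    merged = np.zeros(X_binned.shape[0], dtype=np.int64)
    offset = 0
    for j in bundle:
        col = X_binned[:, j]
        nonzero = col != 0
        # On a conflict, the later feature overwrites the earlier one.
        merged[nonzero] = col[nonzero] + offset
        offset += n_bins[j]
    return merged

# Toy data: columns 0-2 are mutually exclusive, column 3 overlaps 0 and 1.
X = np.array([
    [1, 0, 0, 1],
    [0, 2, 0, 1],
    [0, 0, 3, 0],
    [2, 0, 0, 0],
])
n_bins = [2, 2, 3, 1]                  # nonzero bins per feature (assumed)

bundles = greedy_bundles(X, theta=0.1)             # [[0, 1, 2], [3]]
merged = merge_bundle(X, bundles[0], n_bins)       # [1, 4, 7, 2]
```

Three sparse columns collapse into one, and every merged value still identifies both the source feature and its original bin.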

---

## Mathematical Foundations {#mathematical-foundations}

### Gradient Boosting Framework

The general objective function:
$$\mathcal{L}(\Theta) = \sum_{i=1}^n L(y_i, F(\mathbf{x}_i)) + \sum_{m=1}^M \Omega(h_m)$$

Where:
- $L(\cdot, \cdot)$ is the loss function
- $\Omega(\cdot)$ is regularization
- $F(\mathbf{x}) = \sum_{m=1}^M \gamma_m h_m(\mathbf{x})$

### Second-order Approximation

Taylor expansion of loss around current prediction:
$$L(y_i, F_{m-1}(\mathbf{x}_i) + h_m(\mathbf{x}_i)) \approx L(y_i, F_{m-1}(\mathbf{x}_i)) + g_i h_m(\mathbf{x}_i) + \frac{1}{2}h_i h_m^2(\mathbf{x}_i)$$

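With the tree penalty $\Omega(h) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$ for a tree with $T$ leaves and leaf weights $w_j$ (this $\gamma$ is a per-leaf penalty, distinct from the step size $\gamma_m$), minimizing the approximation over the samples $I_j$ that fall in leaf $j$ gives the optimal leaf weight:
$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$

And the information gain from splitting a node with samples $I$ into children $I_L$ and $I_R$:
$$\text{Gain} = \frac{1}{2}\left[\frac{(\sum_{i \in I_L} g_i)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i + \lambda}\right] - \gamma$$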
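A small numeric check makes the two formulas concrete. The sketch assumes squared-error loss $L = \frac{1}{2}(y - F)^2$, for which $g_i = F(\mathbf{x}_i) - y_i$ and $h_i = 1$; the function names are hypothetical, and `lam` and `gamma` correspond to $\lambda$ and the per-split penalty $\gamma$ in the gain formula.

```python
import numpy as np

def leaf_weight(g, h, lam=1.0):
    """Optimal leaf weight: -sum(g) / (sum(h) + lambda)."""
    return -g.sum() / (h.sum() + lam)

def score(g, h, lam):
    """Structure score of one node: sum(g)^2 / (sum(h) + lambda)."""
    return g.sum() ** 2 / (h.sum() + lam)

def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    """Gain of splitting a node into left_mask and its complement."""
    right_mask = ~left_mask
    return 0.5 * (
        score(g[left_mask], h[left_mask], lam)
        + score(g[right_mask], h[right_mask], lam)
        - score(g, h, lam)
    ) - gamma

# Squared-error example with current prediction F = 0.
y = np.array([1.0, 1.0, 3.0, 3.0])
g = 0.0 - y                            # g_i = F(x_i) - y_i
h = np.ones_like(y)                    # h_i = 1
left = np.array([True, True, False, False])

w_left = leaf_weight(g[left], h[left], lam=0.0)      # 1.0
w_right = leaf_weight(g[~left], h[~left], lam=0.0)   # 3.0
gain = split_gain(g, h, left, lam=0.0, gamma=0.0)    # 2.0
```

With $\lambda = 0$ the leaf weights are the per-leaf means of $y$, and the gain of 2.0 equals the drop in $\frac{1}{2}\sum_i (y_i - w)^2$ from the single-leaf fit ($w = 2$, loss 2) to the perfect two-leaf fit (loss 0).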

### Convergence Analysis

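**Theorem (Boosting Convergence)**: This is the classical AdaBoost-style bound for binary classification. Under conditions:
1. $\|h_m\|_\infty \leq M$ (bounded weak learners)
2. Edge condition: every weak learner has weighted error $\varepsilon_m \leq \frac{1}{2} - \rho$ for some edge $\rho > 0$ (written $\rho$ to avoid confusion with the step size $\gamma_m$)

The training error decreases exponentially:
$$\text{Error}_m \leq \exp(-2\rho^2 m)$$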

### Bias-Variance Decomposition

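For ensemble $F(\mathbf{x}) = \sum_{m=1}^M \gamma_m h_m(\mathbf{x})$ and data generated as $y = f^*(\mathbf{x}) + \varepsilon$ with noise variance $\sigma^2$, the expected squared error (taken over training sets and noise) decomposes as:

$$\mathbb{E}[(y - F(\mathbf{x}))^2] = \text{Bias}^2[F(\mathbf{x})] + \text{Var}[F(\mathbf{x})] + \sigma^2$$

Where:
- **Bias**: $\mathbb{E}[F(\mathbf{x})] - f^*(\mathbf{x})$
- **Variance**: $\mathbb{E}[(F(\mathbf{x}) - \mathbb{E}[F(\mathbf{x})])^2]$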

**Key Insights:**
- More trees → Lower bias, Higher variance
- DART and regularization → Control variance
- GOSS → Reduce computation while keeping the gain estimate approximately unbiased

---

## Implementation Considerations

### Memory Optimization
- **Histogram-based splitting**: $O(\text{bins})$ candidate thresholds per feature instead of $O(n)$ for split finding, after one $O(n)$ pass to build the histogram
- **Feature bundling**: Reduces feature space dimensionality
- **Gradient quantization**: Compress gradients to save memory
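Histogram-based split finding for a single feature can be sketched as below. The sketch assumes quantile bin edges and the second-order gain with an L2 term `lam`; `best_histogram_split` is a hypothetical name, and real libraries differ in how they choose bins and handle missing values.

```python
import numpy as np

def best_histogram_split(x, g, h, n_bins=16, lam=1.0):
    """Best threshold for one feature using gradient/Hessian histograms."""
    # Interior quantile edges are the candidate thresholds.
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, x, side="right")   # values in 0..n_bins-1

    # One O(n) pass to accumulate per-bin gradient and Hessian sums.
    g_hist = np.bincount(bins, weights=g, minlength=n_bins)
    h_hist = np.bincount(bins, weights=h, minlength=n_bins)

    # O(n_bins) scan over thresholds using prefix sums.
    g_left = np.cumsum(g_hist)[:-1]
    h_left = np.cumsum(h_hist)[:-1]
    g_tot, h_tot = g_hist.sum(), h_hist.sum()
    gain = 0.5 * (
        g_left ** 2 / (h_left + lam)
        + (g_tot - g_left) ** 2 / (h_tot - h_left + lam)
        - g_tot ** 2 / (h_tot + lam)
    )
    best = int(np.argmax(gain))
    # Samples with x < edges[best] go to the left child.
    return edges[best], gain[best]

# Toy data: the target steps from 0 to 1 at x = 0.5.
x = np.linspace(0.0, 1.0, 200)
y = (x >= 0.5).astype(float)
g = 0.0 - y                            # squared-error gradients at F = 0
h = np.ones_like(y)
threshold, gain = best_histogram_split(x, g, h)
```

Only `n_bins - 1` thresholds are evaluated regardless of how many distinct values the feature has, which is the source of the memory and time savings.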

### Parallel Optimization
- **Data parallelism**: Split data across workers
- **Feature parallelism**: Split features across workers  
- **Hybrid approach**: Combine both for optimal scaling

### Hyperparameter Sensitivity

| Parameter | Effect | Typical Range |
|-----------|--------|---------------|
| `learning_rate` | Bias-variance tradeoff | [0.01, 0.3] |
| `max_depth` | Model complexity | [3, 10] |
| `min_samples_leaf` | Overfitting control | [1, 100] |
| `goss_top_rate` | Sample efficiency | [0.1, 0.5] |
| `dart_drop_rate` | Regularization strength | [0.01, 0.2] |

Taken together, GOSS and feature bundling cut the data and feature cost of each iteration, leaf-wise growth spends the split budget where the gain is largest, and DART plus the leaf-level constraints keep overfitting in check.