# Guide to Gradient Boosting Algorithms

## Mathematical Foundation

### The Core Principle

Gradient boosting constructs a strong predictor by sequentially adding weak learners that minimize a differentiable loss function. The ensemble model takes the form:

$$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \nu \cdot h_m(\mathbf{x})$$

where:
- $F_m(\mathbf{x})$ is the ensemble prediction after $m$ iterations
- $h_m(\mathbf{x})$ is the $m$-th weak learner (typically a decision tree)
- $\nu$ is the learning rate (shrinkage parameter)

### Loss Function Framework

For a dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$, we minimize the total loss:

$$\mathcal{L}_{total} = \sum_{i=1}^n L(y_i, F(\mathbf{x}_i))$$

The key insight is that each new weak learner should fit the **negative gradients** of the loss function:

$$r_{im} = -\frac{\partial L(y_i, F_{m-1}(\mathbf{x}_i))}{\partial F_{m-1}(\mathbf{x}_i)}$$

### Common Loss Functions

| Loss Function | $L(y, F)$ | First Derivative | Second Derivative |
|---------------|-----------|------------------|-------------------|
| **MSE** | $\frac{1}{2}(y - F)^2$ | $F - y$ | $1$ |
| **MAE** | $\|y - F\|$ | $\text{sign}(F - y)$ | $0$ (undefined at $F=y$) |
| **LogLoss** | $-[y \log(p) + (1-y) \log(1-p)]$ | $p - y$ | $p(1-p)$ |
| **Huber** | $\begin{cases} \frac{1}{2}(y-F)^2 & \text{if } \|y-F\| \leq \delta \\ \delta\|y-F\| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}$ | $\begin{cases} F - y & \text{if } \|y-F\| \leq \delta \\ \delta \cdot \text{sign}(F-y) & \text{otherwise} \end{cases}$ | $\begin{cases} 1 & \text{if } \|y-F\| \leq \delta \\ 0 & \text{otherwise} \end{cases}$ |

where $p = \frac{1}{1 + e^{-F}}$ for LogLoss, and every derivative in the table is taken with respect to the raw score $F$.
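The table entries translate directly into small functions. The sketch below is illustrative only: the function names are assumptions, and the finite-difference check at the end simply confirms the log-loss gradient at one point.

```python
import math

def sigmoid(f):
    """Logistic function mapping a raw score F to a probability p."""
    return 1.0 / (1.0 + math.exp(-f))

def mse_grad_hess(y, f):
    """Squared error 0.5 * (y - F)^2: gradient F - y, Hessian 1."""
    return f - y, 1.0

def logloss_grad_hess(y, f):
    """Log loss on the raw score F: gradient p - y, Hessian p * (1 - p)."""
    p = sigmoid(f)
    return p - y, p * (1.0 - p)

def huber_grad_hess(y, f, delta=1.0):
    """Huber loss: quadratic when |y - F| <= delta, linear otherwise."""
    r = f - y
    if abs(r) <= delta:
        return r, 1.0
    return delta * math.copysign(1.0, r), 0.0

def logloss(y, f):
    """Log loss value, used only for the numerical check below."""
    p = sigmoid(f)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Central finite difference of the log loss at a single point.
y, f, eps = 1.0, 0.3, 1e-6
numeric = (logloss(y, f + eps) - logloss(y, f - eps)) / (2 * eps)
analytic, _ = logloss_grad_hess(y, f)
```

Comparing `numeric` with `analytic` is a cheap sanity check worth repeating for any custom loss before plugging it into a boosting loop.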

### Basic Algorithm Structure

```
ALGORITHM: Gradient Boosting
INPUT: Training data {(x_i, y_i)}, loss function L, number of iterations M
OUTPUT: Ensemble model F_M(x)

1. Initialize: F_0(x) = arg min_γ Σ L(y_i, γ)

2. FOR m = 1 to M:
   a) Compute pseudo-residuals:
      r_im = -∂L(y_i, F_{m-1}(x_i)) / ∂F_{m-1}(x_i)
   
   b) Train weak learner h_m:
      h_m = TRAIN_TREE({(x_i, r_im)})
   
   c) Find optimal step size:
      γ_m = arg min_γ Σ L(y_i, F_{m-1}(x_i) + γ·h_m(x_i))
   
   d) Update ensemble:
      F_m(x) = F_{m-1}(x) + ν·γ_m·h_m(x)

3. RETURN F_M(x)
```
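A minimal runnable sketch of the loop above, under simplifying assumptions: squared-error loss, a single numeric feature with at least two distinct values, and depth-1 trees (stumps) as weak learners. The helper names `fit_stump` and `gradient_boost` are hypothetical. For squared error with mean-valued leaves, the line search in step 2c returns $\gamma_m = 1$, so it is omitted.

```python
def fit_stump(xs, targets):
    """Fit a depth-1 regression tree on one feature by exhaustive search."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for pos in range(1, len(order)):
        lo, hi = xs[order[pos - 1]], xs[order[pos]]
        if lo == hi:
            continue  # no threshold separates equal values
        left = [targets[i] for i in order[:pos]]
        right = [targets[i] for i in order[pos:]]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((t - lmean) ** 2 for t in left)
               + sum((t - rmean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, (lo + hi) / 2, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x <= threshold else rmean

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    f0 = sum(ys) / len(ys)              # step 1: best constant for squared error
    preds = [f0] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]   # step 2a: negative gradient
        stump = fit_stump(xs, residuals)                 # step 2b: fit weak learner
        stumps.append(stump)
        # step 2d: shrunken update (gamma_m = 1 for squared error, see lead-in)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + learning_rate * sum(s(x) for s in stumps)

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = gradient_boost(xs, ys)
mse = sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
```

With a small learning rate each round removes only a fraction of the remaining residual, which is why many rounds are needed even on this toy data.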

---

## First-Order vs Second-Order Methods

### First-Order Methods (Traditional Gradient Boosting)

**Mathematical Basis:**
Uses only the first derivative of the loss function. Each tree is trained to predict the negative gradients:

$$h_m(\mathbf{x}_i) \approx -g_i = -\frac{\partial L(y_i, F_{m-1}(\mathbf{x}_i))}{\partial F_{m-1}(\mathbf{x}_i)}$$

**Advantages:**
- Simple and robust implementation
- Works with any differentiable loss function
- Lower computational cost per iteration

**Disadvantages:**
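- Slower convergence: steps ignore curvature and rely on the learning rate or a line search
- May require more iterations to converge
- Leaf values are not optimal for the loss unless a separate line search is run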

```
ALGORITHM: First-Order Gradient Boosting
FOR each iteration m:
    1. gradients[i] = -∂L(y[i], current_prediction[i]) / ∂F
    2. tree = TRAIN_TREE(features, gradients)
    3. predictions += learning_rate * tree.predict(features)
```

### Second-Order Methods (Newton's Method)

**Mathematical Basis:**
Uses both first and second derivatives (Hessian) for more principled optimization. Based on Newton's method for optimization:

$$F_m(\mathbf{x}_i) = F_{m-1}(\mathbf{x}_i) - \frac{g_i}{h_i}$$

where:
- $g_i = \frac{\partial L(y_i, F_{m-1}(\mathbf{x}_i))}{\partial F_{m-1}(\mathbf{x}_i)}$ (first derivative)
- $h_i = \frac{\partial^2 L(y_i, F_{m-1}(\mathbf{x}_i))}{\partial F_{m-1}(\mathbf{x}_i)^2}$ (second derivative; the sample index $i$ distinguishes it from the weak learner $h_m$)

**Information Gain Formula:**
For split evaluation, the gain becomes:

$$\text{Gain} = \frac{1}{2}\left[\frac{(\sum_{i \in I_L} g_i)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i + \lambda}\right]$$

**Optimal Leaf Weight:**
The optimal prediction for a leaf containing samples $I$ is:

$$w^* = -\frac{\sum_{i \in I} g_i}{\sum_{i \in I} h_i + \lambda}$$
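A small worked example of the two formulas, assuming squared-error loss (so $g_i = F - y_i$ and $h_i = 1$) and a current prediction of $F = 0$. The function names are illustrative.

```python
def leaf_weight(G, H, lam=1.0):
    """Newton-step leaf value w* = -G / (H + lambda)."""
    return -G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam=1.0):
    """Second-order gain of splitting a node into left and right children."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R))

# Targets [1, 1, 3, 3]; the candidate split separates the first two from the last two.
ys = [1.0, 1.0, 3.0, 3.0]
g = [0.0 - y for y in ys]      # gradients at F = 0: [-1, -1, -3, -3]
h = [1.0] * len(ys)
gain = split_gain(sum(g[:2]), sum(h[:2]), sum(g[2:]), sum(h[2:]), lam=0.0)
w_left = leaf_weight(sum(g[:2]), sum(h[:2]), lam=0.0)
w_right = leaf_weight(sum(g[2:]), sum(h[2:]), lam=0.0)
```

With $\lambda = 0$ the leaf weights come out as the group means (1 and 3), which matches the intuition that a Newton step on squared error lands on the mean residual in one move.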

**Advantages:**
- Faster convergence near the optimum (Newton steps are locally quadratic for smooth, strongly convex losses)
- More principled split selection
- Natural regularization through Hessian
- Better numerical properties

**Disadvantages:**
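- Higher computational cost per iteration (Hessians must be computed and accumulated alongside gradients)
- More complex implementation
- Requires a twice-differentiable loss with a positive second derivative; MAE, for example, has $h_i = 0$ almost everywhere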

```
ALGORITHM: Second-Order Gradient Boosting
FOR each iteration m:
    1. gradients[i] = ∂L(y[i], current_prediction[i]) / ∂F
    2. hessians[i] = ∂²L(y[i], current_prediction[i]) / ∂F²
    3. tree = TRAIN_TREE_WITH_NEWTON(features, gradients, hessians)
       // Uses both g_i and h_i for split finding and leaf weights
    4. predictions += learning_rate * tree.predict(features)
```

### Convergence Comparison

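The convergence behavior differs in the classical setting of smooth convex optimization. Tree-based boosting only approximates these steps, so treat the rates as intuition rather than guarantees:

- **First-order:** sublinear $\mathcal{O}(1/k)$ error after $k$ steps for smooth convex losses; linear (geometric) convergence requires strong convexity
- **Second-order:** locally quadratic convergence near the optimum, meaning the error is roughly squared at each step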
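A one-parameter illustration of the difference, not a statement about full tree ensembles: fit a single raw score $F$ under log loss to labels that are 90% positive, once with unit-step gradient descent and once with Newton steps. The function name and step counts are illustrative assumptions.

```python
import math

def sigmoid(f):
    return 1.0 / (1.0 + math.exp(-f))

def fit_constant(pos_rate, steps, second_order):
    """Fit one raw score F to labels with the given positive rate, from F = 0."""
    f = 0.0
    for _ in range(steps):
        p = sigmoid(f)
        g = p - pos_rate          # mean log-loss gradient
        h = p * (1.0 - p)         # mean log-loss Hessian
        # Newton step divides by curvature; the first-order step uses step size 1.
        f -= g / h if second_order else g
    return f

target = math.log(0.9 / 0.1)      # exact minimizer: log-odds of the positive rate
newton_error = abs(fit_constant(0.9, 5, second_order=True) - target)
gradient_error = abs(fit_constant(0.9, 5, second_order=False) - target)
```

After five steps the Newton iterate is essentially at the optimum while the first-order iterate is still far away, because the small Hessian near $p = 0.9$ tells Newton's method to take larger steps.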

---

## Tree Building Strategies

### The Split-Finding Problem

At each node, we need to find the optimal split that maximizes information gain. For feature $j$ and threshold $\theta$:

$$\text{Gain}(j, \theta) = G(\text{parent}) - \frac{|I_L|}{|I|} G(I_L) - \frac{|I_R|}{|I|} G(I_R)$$

where $G(I)$ is an impurity measure for sample set $I$.

### Exact Method (Pre-sorted Algorithm)

**Core Idea:** Evaluate ALL possible split points between consecutive feature values.

**Process:**
1. Sort all samples by each feature value
2. For each feature, try all $n-1$ possible splits
3. Maintain running sums of gradients/Hessians for efficient gain calculation
4. Select split with maximum gain across all features

**Time Complexity:** $\mathcal{O}(n \times d \times \log n)$ for the one-time pre-sort, then $\mathcal{O}(n \times d)$ per tree level for the scan

**Space Complexity:** $\mathcal{O}(n \times d)$ for pre-sorted arrays

```
ALGORITHM: Exact Split Finding
INPUT: Feature values X[:, j], gradients g, hessians h

1. sorted_indices = SORT_BY_FEATURE(X[:, j])
   sorted_values = X[sorted_indices, j]
2. sorted_gradients = g[sorted_indices]
3. sorted_hessians = h[sorted_indices]

4. best_gain = -∞
5. G_left = 0, H_left = 0
6. G_total = SUM(sorted_gradients)
7. H_total = SUM(sorted_hessians)

8. FOR i = 0 to n-2:
   a) G_left += sorted_gradients[i]
   b) H_left += sorted_hessians[i]
   c) IF sorted_values[i] == sorted_values[i+1]: CONTINUE
   
   d) G_right = G_total - G_left
   e) H_right = H_total - H_left
   
   f) gain = CALCULATE_GAIN(G_left, H_left, G_right, H_right)
   g) IF gain > best_gain:
        best_gain = gain
        best_threshold = (sorted_values[i] + sorted_values[i+1]) / 2

9. RETURN best_threshold, best_gain
```
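The scan above can be sketched in plain Python as follows. This is an illustrative version for a single feature: the function name is an assumption, and unlike the pseudocode it starts from `best_gain = 0.0`, so it only reports splits that improve on not splitting.

```python
def exact_best_split(values, g, h, lam=1.0):
    """Scan every boundary between distinct sorted feature values.

    Returns (threshold, gain), or (None, 0.0) if no split has positive gain.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    G_total, H_total = sum(g), sum(h)
    parent = G_total ** 2 / (H_total + lam)
    G_left = H_left = 0.0
    best_threshold, best_gain = None, 0.0
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        G_left += g[i]                      # running sums avoid re-summing per split
        H_left += h[i]
        if values[i] == values[nxt]:
            continue                        # cannot split between identical values
        G_right, H_right = G_total - G_left, H_total - H_left
        gain = 0.5 * (G_left ** 2 / (H_left + lam)
                      + G_right ** 2 / (H_right + lam) - parent)
        if gain > best_gain:
            best_threshold, best_gain = (values[i] + values[nxt]) / 2, gain
    return best_threshold, best_gain

values = [3.0, 1.0, 4.0, 2.0]
g = [-3.0, -1.0, -3.0, -1.0]    # squared-error gradients for y = [3, 1, 3, 1] at F = 0
h = [1.0, 1.0, 1.0, 1.0]
threshold, gain = exact_best_split(values, g, h, lam=0.0)
```

On this toy input the best threshold falls midway between the feature values 2 and 3, where the two target groups separate.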

### Histogram Method (Binning Algorithm)

**Core Idea:** Pre-discretize continuous features into bins, only evaluate splits at bin boundaries.

**Mathematical Framework:**
Instead of $n-1$ possible splits, we evaluate only $B-1$ splits where $B$ is the number of bins (typically 255).

**Binning Strategies:**
1. **Equal-width binning:** $\text{bin}_k = \left[x_{\min} + \frac{k(x_{\max} - x_{\min})}{B},\ x_{\min} + \frac{(k+1)(x_{\max} - x_{\min})}{B}\right]$ for $k = 0, \dots, B-1$
2. **Quantile binning:** Bins contain equal number of samples
3. **Feature-specific binning:** Different strategies per feature type

**Histogram Construction:**
For each bin $b$, maintain accumulated statistics:

$$H_b^{(g)} = \sum_{i: x_i \in \text{bin}_b} g_i, \quad H_b^{(h)} = \sum_{i: x_i \in \text{bin}_b} h_i$$

**Time Complexity:** $\mathcal{O}(n \times d)$ per tree level to build the histograms, plus $\mathcal{O}(B \times d)$ to scan them for the best split

**Space Complexity:** $\mathcal{O}(B \times d)$ for histograms

```
ALGORITHM: Histogram-Based Split Finding
INPUT: Feature values X[:, j], gradients g, hessians h, bin_boundaries

1. n_bins = LENGTH(bin_boundaries) + 1
2. hist_gradients = ZEROS(n_bins)
3. hist_hessians = ZEROS(n_bins)

4. FOR each sample i:
   a) bin_idx = FIND_BIN(X[i, j], bin_boundaries)
   b) hist_gradients[bin_idx] += g[i]
   c) hist_hessians[bin_idx] += h[i]

5. best_gain = -∞
6. G_left = 0, H_left = 0
7. G_total = SUM(hist_gradients)
8. H_total = SUM(hist_hessians)

9. FOR k = 0 to n_bins-2:
   a) G_left += hist_gradients[k]
   b) H_left += hist_hessians[k]
   c) G_right = G_total - G_left
   d) H_right = H_total - H_left
   
   e) gain = CALCULATE_GAIN(G_left, H_left, G_right, H_right)
   f) IF gain > best_gain:
        best_gain = gain
        best_threshold = bin_boundaries[k]

10. RETURN best_threshold, best_gain
```
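A compact sketch of quantile binning followed by the histogram sweep, for one feature. The helper names are assumptions, and the guard that skips boundaries with an empty side is an addition not present in the pseudocode.

```python
import bisect

def quantile_edges(values, n_bins):
    """Interior bin boundaries at (approximately) equal-count quantiles."""
    ordered = sorted(values)
    edges = [ordered[(k * len(ordered)) // n_bins] for k in range(1, n_bins)]
    return sorted(set(edges))       # drop duplicates caused by repeated values

def histogram_best_split(values, g, h, edges, lam=1.0):
    """Accumulate g and h per bin, then sweep the bin boundaries."""
    n_bins = len(edges) + 1
    hist_g, hist_h = [0.0] * n_bins, [0.0] * n_bins
    for v, gi, hi in zip(values, g, h):
        b = bisect.bisect_left(edges, v)    # bin b holds (edges[b-1], edges[b]]
        hist_g[b] += gi
        hist_h[b] += hi
    G_total, H_total = sum(hist_g), sum(hist_h)
    parent = G_total ** 2 / (H_total + lam)
    G_left = H_left = 0.0
    best_threshold, best_gain = None, 0.0
    for k in range(n_bins - 1):
        G_left += hist_g[k]
        H_left += hist_h[k]
        G_right, H_right = G_total - G_left, H_total - H_left
        if H_left <= 0.0 or H_right <= 0.0:
            continue                        # one side is empty: not a real split
        gain = 0.5 * (G_left ** 2 / (H_left + lam)
                      + G_right ** 2 / (H_right + lam) - parent)
        if gain > best_gain:
            best_threshold, best_gain = edges[k], gain   # predicate: x <= edges[k]
    return best_threshold, best_gain

values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
g = [-1.0] * 5 + [-3.0] * 3     # squared-error gradients at F = 0
h = [1.0] * 8
edges = quantile_edges(values, n_bins=4)
threshold, gain = histogram_best_split(values, g, h, edges, lam=0.0)
```

Only three candidate thresholds are evaluated here instead of seven; the sweep finds the true optimum because one of the bin edges happens to coincide with it, which is not guaranteed in general.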

### Performance Trade-offs

| Method | Accuracy | Speed | Memory | Best Use Case |
|--------|----------|-------|---------|---------------|
| **Exact** | Optimal | Slow | High | Small datasets (<100K), maximum accuracy |
| **Histogram** | Near-optimal (splits limited to bin edges) | Fast | Low | Large datasets (>100K), production systems |

---

## Split Families: Axis-Aligned, Oblique, Hybrid, and More

Real-world boosting systems differ as much by **how they split** as by their loss and sampling. This section catalogs the main split families, their math, algorithms, and trade-offs—so you can choose intentionally (or mix them).

### Taxonomy at a Glance

| Family                                | Predicate form                                                         | Typical training          | Strengths                            | Caveats                                     |
| ------------------------------------- | ---------------------------------------------------------------------- | ------------------------- | ------------------------------------ | ------------------------------------------- |
| **Axis-Aligned (Axial)**              | $x_j \le \theta$                                                       | exact / histogram         | Fast, simple, stable                 | Needs more depth to capture interactions    |
| **Oblique (Linear)**                  | $\mathbf{w}^\top \mathbf{x} \le \theta$                                | greedy local optimization | Shallow trees, captures interactions | Costly; needs regularization                |
| **Sparse Oblique**                    | $\mathbf{w}$ sparse                                                    | L1/feature gating         | Better interpretability; speed       | Tuning sparsity                             |
| **Random Projection / Rotated**       | $\tilde{\mathbf{x}} = R\mathbf{x}$, then axial                         | random / PCA / LDA        | Cheap interaction capture            | Adds randomness; weaker control             |
| **Quadratic / Kernelized**            | $\mathbf{x}^\top Q \mathbf{x} + \mathbf{w}^\top \mathbf{x} \le \theta$ | local second-order fit    | Very expressive                      | Overfits; expensive                         |
| **Categorical (Multiway / 1-v-Rest)** | $x_j \in S$                                                            | exact / greedy set search | Native categorical handling          | Set search is exponential; needs heuristics |
| **Constrained (Monotone/Rule)**       | split + constraint filter                                              | feasibility checks        | Enforces domain structure            | May skip “best” split                       |

### Axis-Aligned (Axial) Splits

Same framework as in the **Exact** and **Histogram** methods above. For second-order boosting, the **gain** for a split $(j,\theta)$ is:

$$
\text{Gain} = \tfrac{1}{2}\!\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{G^2}{H+\lambda}\right] - \gamma
$$

with the usual accumulated gradient $G_\bullet$ and Hessian $H_\bullet$, and regularizers $(\lambda,\gamma)$: $\lambda$ shrinks leaf weights and $\gamma$ is a fixed penalty per added leaf. The exact and histogram pseudocode above applies unchanged.

**When to use:** default for large-scale, high-dimensional data; pairs well with **histogram binning** and **GOSS**.

### Oblique (Linear) Splits

Replace a single-feature threshold by a **hyperplane**:

$$
\mathbf{w}^\top \mathbf{x} \le \theta \quad \Rightarrow \quad \text{left},\ \text{else right}.
$$

If you **fix** $(\mathbf{w},\theta)$, you can reuse the same gain formula by treating $z_i=\mathbf{w}^\top \mathbf{x}_i$ as a pseudo-feature and splitting at $\theta$. The challenge is to **learn $(\mathbf{w},\theta)$** at each node efficiently.
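To make the pseudo-feature idea concrete, the sketch below projects each row onto a fixed $\mathbf{w}$ and reuses an ordinary one-dimensional threshold sweep. The data, the function names, and the choice $\mathbf{w} = (1, 1)$ are illustrative assumptions; learning $\mathbf{w}$ is not shown.

```python
def best_threshold(z, g, h, lam=1.0):
    """Exact second-order threshold sweep on a 1-D feature z: (theta, gain)."""
    order = sorted(range(len(z)), key=lambda i: z[i])
    G, H = sum(g), sum(h)
    G_left = H_left = 0.0
    best = (None, 0.0)
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        G_left += g[i]
        H_left += h[i]
        if z[i] == z[nxt]:
            continue
        gain = 0.5 * (G_left ** 2 / (H_left + lam)
                      + (G - G_left) ** 2 / (H - H_left + lam)
                      - G ** 2 / (H + lam))
        if gain > best[1]:
            best = ((z[i] + z[nxt]) / 2, gain)
    return best

def oblique_split(rows, w, g, h, lam=1.0):
    """Treat z_i = w . x_i as a pseudo-feature and sweep thresholds on it."""
    z = [sum(wj * xj for wj, xj in zip(w, row)) for row in rows]
    return best_threshold(z, g, h, lam)

# Targets depend on x1 + x2, so no single-feature threshold separates them.
rows = [(0, 0), (1, 0), (0, 1), (2, 0), (0, 2), (1, 1)]
g = [-1.0, -1.0, -1.0, -3.0, -3.0, -3.0]
h = [1.0] * 6
theta, oblique_gain = oblique_split(rows, (1.0, 1.0), g, h, lam=0.0)
_, axial_gain_x1 = oblique_split(rows, (1.0, 0.0), g, h, lam=0.0)
_, axial_gain_x2 = oblique_split(rows, (0.0, 1.0), g, h, lam=0.0)
```

An axial split is just the special case where $\mathbf{w}$ is a unit vector, which is why the same function serves for both comparisons.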

#### Common Training Recipes

1. **Gradient-Weighted Logistic Split (fast and robust)**
   Fit a linear classifier to route samples left/right, using the **sign of the first-order signal** (the sign of the residual for first-order GBDT, or $\operatorname{sign}(g_i)$ for second-order) as binary labels and its magnitude as the sample weight.

   * Objective (at node $\mathcal{I}$):

     $$
     \min_{\mathbf{w},b}\ \sum_{i\in \mathcal{I}} \underbrace{\alpha_i}_{\text{|grad| or Hessian-weight}} \cdot \log\!\big(1+\exp(-s_i(\mathbf{w}^\top \mathbf{x}_i + b))\big) + \lambda_1\|\mathbf{w}\|_1+\tfrac{\lambda_2}{2}\|\mathbf{w}\|_2^2
     $$

     with $s_i\in\{-1,+1\}$ indicating the preferred side (e.g., the sign of the gradient; written $s_i$ to keep it distinct from the target $y_i$), and $\alpha_i$ the per-sample weight (e.g., $|g_i|$ or $h_i$).
   * After optimizing $(\mathbf{w},b)$, set $\theta=-b$. Evaluate **Newton gain** as usual over the routing induced by $\mathbf{w}^\top \mathbf{x}\le\theta$.

2. **Gain-Direct Optimization (coordinate/line search)**
   Optimize $\mathbf{w}$ to **maximize** the second-order gain directly. In practice:

   * Start with $\mathbf{w}$ from a quick linear fit (recipe 1).
   * Do a few **coordinate updates**: tweak one coefficient, recompute the projected $z_i$, re-bin $z$ (histogram), sweep thresholds, keep the best. Repeat for $K$ passes over a small **feature subset**.

3. **Fisher/LDA or SVM Warm-Starts**
   Use LDA (if class labels available) or a linear SVM trained to separate high-loss vs low-loss samples (e.g., top-|g| vs others). Then fine-tune with #2.

#### Oblique Split (Histogram) – Pseudocode

```
INPUT: Node indices I, data X, grads g, hess h
PARAMS: feature_subsample m, max_iter K, bins B, λ1, λ2 (hyperplane fit), λ (leaf regularizer)

1. J = SAMPLE_FEATURES(d, m)                    // feature gating for sparsity/speed
2. w = 0; b = 0                                  // initialize hyperplane
3. // Warm-start via gradient-weighted logistic
4. (w, b) = FIT_LOGISTIC(X[I, J], y=sign(g[I]), weights=abs(g[I]),
                         l1=λ1, l2=λ2, iters=small)

5. FOR iter = 1..K:
      // Project and histogram once per iter. The intercept b only warm-starts
      // the fit; the sweep below picks θ directly on z = w·x.
      z = X[I, J] @ w
      (hist_g, hist_h, edges) = BUILD_HISTOGRAM(z, g[I], h[I], B)
      (θ_idx, gain) = SWEEP_BINS_FOR_BEST_GAIN(hist_g, hist_h, λ=λ)
      // Optional: small coordinate refinement
      FOR j in J (few coords only):
           Δ = LINE_SEARCH_ON_COORD(j)          // tweak w_j, reuse hist where possible
           if Δ improves gain: update w_j, z, gain accordingly

6. θ = edges[θ_idx]
7. RETURN (w, θ, gain)
```

**Complexity Tips**

* Use **binning on z** (the 1-D projection) to keep the sweep at $O(B)$.
* Subsample features per node (e.g., $m=\min(32,\sqrt{d})$).
* Enforce **sparsity** with $L_1$ or a hard $k$-nonzero cap for $\mathbf{w}$.

**Why/When**

* Great on **tabular** with interactions and moderate $d$.
* Often reduces **depth** and **tree count**.

### Sparse Oblique (L1 / Gated)

Same as oblique, but explicitly **limit nonzeros in $\mathbf{w}$**:

* Add $\lambda_1\|\mathbf{w}\|_1$ or constrain $\|\mathbf{w}\|_0 \le k$.
* Or apply a **two-stage gate**: rank features by one-step axial gain; keep top-$k$; fit oblique only on those.

**Practical defaults:** $k\in[4,16]$, $\lambda_1$ such that 80–90% of nodes produce $\le k$ nonzeros.

### Random Projection / Rotation Forest Splits

Form randomized features $\tilde{\mathbf{x}} = R \mathbf{x}$ and do **standard axial** splits in the rotated space.

* **R choices:** sparse random matrix, block-PCA at node, or class-aware LDA (if labels).
* **Lightweight obliques:** captures interactions **without** iterative $\mathbf{w}$ training.
* Works nicely with **histograms** and **GOSS**.

**Pseudocode:**

```
R = SAMPLE_RANDOM_ROTATION(d, m)        // e.g., sparse ±1, normalized
Xtilde = X[I] @ R
best_axial_on_Xtilde()
```
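One possible reading of `SAMPLE_RANDOM_ROTATION` is a sparse sign matrix, sketched below with plain lists. The function names, the two-nonzeros-per-column default, and the unit-norm scaling are assumptions made for illustration.

```python
import random

def sparse_sign_projection(d, m, nonzeros_per_column=2, seed=0):
    """Build a d x m matrix whose columns hold a few +/-1 entries, unit-normalized."""
    rng = random.Random(seed)
    scale = 1.0 / nonzeros_per_column ** 0.5
    R = [[0.0] * m for _ in range(d)]
    for col in range(m):
        # Pick which original features feed this projected feature.
        for row in rng.sample(range(d), nonzeros_per_column):
            R[row][col] = rng.choice((-1.0, 1.0)) * scale
    return R

def rotate(rows, R):
    """Compute X @ R for a list-of-rows X."""
    m = len(R[0])
    return [[sum(x[j] * R[j][c] for j in range(len(x))) for c in range(m)]
            for x in rows]

R = sparse_sign_projection(d=4, m=3)
X_tilde = rotate([[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]], R)
```

Each column of `X_tilde` can then be handed to any axial split finder; the resulting split is oblique in the original feature space.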

### Quadratic / Kernelized Splits (Advanced)

Predicate: $\mathbf{x}^\top Q \mathbf{x} + \mathbf{w}^\top \mathbf{x} \le \theta$.
Fit locally via ridge/logistic on a **quadratic feature map** $\phi(\mathbf{x}) = [x_1,\dots,x_d, x_1^2, x_1x_2,\dots]$ with strong regularization. Very expressive, but generally overkill for boosting (prefer putting capacity in **leaf models** instead; see below).

### Categorical Splits

For a categorical feature with categories $\mathcal{C}$:

* **Binary subset split**: learn a subset $S\subset \mathcal{C}$ and split on $x\in S$ (one-vs-rest is the special case $|S|=1$).
  Heuristics: order categories by **target statistic** (mean residual/gradient), then scan contiguous subsets of that order (LightGBM trick).

* **Multiway**: split into $k$ children (one per category or grouped). Often replaced by **binary** for depth control.

**Ordered scan (binary)**:

1. Compute category score $s(c)=\frac{\sum_{i:x_i=c} g_i}{\sum_{i:x_i=c} h_i+\epsilon}$.
2. Sort categories by $s(c)$.
3. Sweep a prefix $S$ to maximize Newton gain.
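The three steps above can be sketched as follows. The function name and the toy data are illustrative; only proper, non-empty prefixes are considered so both children receive samples.

```python
def best_categorical_split(cats, g, h, lam=1.0, eps=1e-9):
    """Order categories by G_c / (H_c + eps), then sweep prefixes of that order."""
    stats = {}
    for c, gi, hi in zip(cats, g, h):
        G_c, H_c = stats.get(c, (0.0, 0.0))
        stats[c] = (G_c + gi, H_c + hi)
    ordered = sorted(stats, key=lambda c: stats[c][0] / (stats[c][1] + eps))
    G_total = sum(G for G, _ in stats.values())
    H_total = sum(H for _, H in stats.values())
    parent = G_total ** 2 / (H_total + lam)
    G_left = H_left = 0.0
    best_subset, best_gain = None, 0.0
    for k in range(len(ordered) - 1):
        G_c, H_c = stats[ordered[k]]
        G_left += G_c
        H_left += H_c
        G_right, H_right = G_total - G_left, H_total - H_left
        gain = 0.5 * (G_left ** 2 / (H_left + lam)
                      + G_right ** 2 / (H_right + lam) - parent)
        if gain > best_gain:
            best_subset, best_gain = set(ordered[:k + 1]), gain
    return best_subset, best_gain

cats = ["red", "blue", "green", "red", "blue", "green"]
g = [-3.0, -1.0, -3.0, -3.0, -1.0, -3.0]
h = [1.0] * 6
subset, gain = best_categorical_split(cats, g, h, lam=0.0)
```

Sorting reduces the search from $2^{|\mathcal{C}|-1}-1$ subsets to $|\mathcal{C}|-1$ prefixes; here the two categories with similar gradient statistics end up on the same side.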

### Constrained Splits (Monotonicity, Custom Rules)

When some features must be **monotone** ($+/-$) w.r.t. the prediction, filter split candidates to **preserve feasibility**:

* During candidate evaluation, **reject** splits whose **optimal leaf weights** would violate monotone constraints.
* Implement by checking, for a split on an increasing-constrained feature, that $w_L^* \le w_R^*$ (reversed for decreasing), and by passing the resulting bounds down so that descendant leaves stay within them.
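A minimal sketch of the per-split feasibility check, assuming Newton leaf weights and a split of the form $x_j \le \theta$. The function names and the `direction` encoding are assumptions, and bound propagation to deeper nodes is deliberately left out.

```python
def leaf_weight(G, H, lam=1.0):
    """Newton-step leaf value w* = -G / (H + lambda)."""
    return -G / (H + lam)

def satisfies_monotone(G_L, H_L, G_R, H_R, direction, lam=1.0):
    """Check a candidate split against a monotone constraint.

    direction: +1 for increasing, -1 for decreasing, 0 for unconstrained.
    For a split x_j <= theta, increasing means w_left <= w_right.
    """
    if direction == 0:
        return True
    w_left = leaf_weight(G_L, H_L, lam)
    w_right = leaf_weight(G_R, H_R, lam)
    return direction * (w_right - w_left) >= 0

# Left child would predict 1.0, right child 3.0 (lam = 0).
ok_increasing = satisfies_monotone(-2.0, 2.0, -6.0, 2.0, direction=+1, lam=0.0)
ok_decreasing = satisfies_monotone(-2.0, 2.0, -6.0, 2.0, direction=-1, lam=0.0)
```

A split finder would call this before accepting a candidate and skip it when the check fails, which is how a constrained search can end up passing over the highest-gain split.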

### Integrating Split Families into Boosting

You can **mix** families per node with a simple, time-budgeted contender set.

```
ALGORITHM: Mixed Split Search at a Node
INPUT: I, time_budget T

1. C = []
2. // Axial contenders (fast)
3. C += TOP_K_AXIAL_SPLITS(k_axial)

4. // Oblique contenders (time-bounded)
5. if T allows:
      C += OBLIQUE_SPLIT(feature_subsample=m, iters=K, bins=B)

6. // Random projection contender (very cheap)
7. C += AXIAL_ON_RANDOM_ROTATION(m)

8. // Categorical specialized
9. C += BEST_CATEGORICAL_SPLITS()

10. RETURN argmax_{c in C} Gain(c)
```

**Leaf Models vs Splits:**
Before moving to **quadratic/complex splits**, consider **linear/GLM leaves** (piecewise-linear trees; LightGBM exposes this as `linear_tree`). They often yield similar accuracy with simpler splits.

### Practical Defaults & Tips

* **Start axial**; enable **sparse oblique** only at nodes with $|I|\ge 2{,}000$ (or depth $\le 3$).
* **Feature subsample for oblique**: $m=\min(32,\sqrt{d})$.
* **Binning**: the global per-feature bins do not apply to a projection, so re-bin the **projected z** at each node ($B=63$ or $127$ is usually enough).
* **Regularize** obliques: $\lambda_2$ tied to parent $H$ (e.g., $\lambda_2 = c \cdot \frac{H}{|I|}$ with $c\in[0.1,1]$); $\lambda_1$ tuned to hit desired sparsity.
* **Early abort** oblique search if the best oblique gain $< (1+\epsilon)$ times the best axial gain (e.g., $\epsilon=0.02$).
* **GOSS compatibility**: multiply the per-sample weight $\alpha_i$ of the logistic/linear fit by the GOSS amplification weight of each sample.

### Evaluation: When Do Obliques Pay Off?

* **Low depth budgets** (e.g., max\_depth $\le 4$): obliques recover accuracy lost by shallower trees.
* **Moderate $d$** (10–200) with interactions and weak linear trends.
* **Strong collinearity**: obliques reduce “feature ping-pong” across levels.

Expected impact (rough rules of thumb rather than benchmark results; measure on your own data):

* Depth ↓ 25–40% for same validation loss.
* Trees ↓ 10–25% for same test error.
* Per-node compute ↑ 1.5–5× (mitigated by feature gating and binning).

### Minimal API Hooks (Implementation Notes)

* **Splitter interface** should accept $(X_I, g_I, h_I, bins)$ and return `(predicate, gain)`, where `predicate` can encode:

  * Axial: `(type="axial", j, θ)`
  * Oblique: `(type="oblique", w_sparse, θ)`
  * Categorical: `(type="cat", j, subset S)`
* **Histogram reuse**: axial uses precomputed per-feature hist; oblique builds **1-D hist over z** only.
* **Regularization/penalties**: expose $(\lambda_1,\lambda_2,\gamma)$ and a **max\_nonzeros** for oblique.

---

## Advanced Sampling: GOSS

### Motivation and Intuition

**Key Insight:** Data instances contribute unequally to the learning process. Instances with larger gradient magnitudes are more "informative" because they represent harder-to-predict cases.

**Mathematical Intuition:**
In the information gain formula:

$$\text{Gain} \propto \frac{(\sum g_i)^2}{\sum h_i + \lambda}$$

Samples with larger $|g_i|$ contribute more to the numerator, thus having greater impact on split decisions.

### GOSS Algorithm

**Goal:** Maintain most informative samples while reducing dataset size through intelligent sampling.

**Sampling Strategy:**
1. **Set A (Top-$a$):** Keep the $a \times n$ samples with the largest gradient magnitudes
2. **Set B (Random-$b$):** Randomly sample $b \times n$ instances from the remaining $(1-a) \times n$
3. **Amplification:** Multiply gradients and Hessians in Set B by the factor $\frac{1-a}{b}$

**Theoretical Justification:**
To maintain unbiased estimation:

$$\mathbb{E}\left[\sum_{i \in A \cup B} \tilde{g}_i\right] = \sum_{i=1}^n g_i$$

where $\tilde{g}_i$ are the amplified gradients.

**Amplification Factor Derivation:**
- The remainder holds $(1-a) \times n$ instances
- We sample $b \times n$ of them, so each one is kept with probability $\frac{b}{1-a}$
- Amplification needed: $\frac{1-a}{b}$, the inverse of that probability, to restore the expected sum over the remainder

```
ALGORITHM: GOSS (Gradient-based One-Side Sampling)
INPUT: Dataset D, gradients {g_i}, parameters a, b
OUTPUT: Sampled dataset D_sampled with weights

1. n = |D|
2. sorted_indices = SORT_BY_GRADIENT_MAGNITUDE(g, descending=True)

3. // Select top-a fraction (Set A)
4. n_top = ⌊a × n⌋
5. A = sorted_indices[0:n_top]

6. // Randomly sample b × n instances from the remainder (Set B)
7. remaining = sorted_indices[n_top:]
8. n_other = ⌊b × n⌋
9. B = RANDOM_SAMPLE(remaining, n_other)

10. // Combine sets
11. selected_indices = A ∪ B
12. amplification_factor = (1 - a) / b

13. // Create weights
14. weights = ONES(|selected_indices|)
15. weights[|A|:] = amplification_factor  // Amplify Set B

16. RETURN D[selected_indices], weights
```
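A standard-library sketch of the sampling step. It assumes $a$ and $b$ are both fractions of the full dataset, so Set B holds $b \times n$ instances and the amplification factor is $\frac{1-a}{b}$; the function name and the seed handling are illustrative.

```python
import random

def goss_sample(g, a=0.2, b=0.1, seed=0):
    """Return (indices, weights) for GOSS.

    The top a*n instances by |g| get weight 1; b*n instances drawn at random
    from the rest get weight (1 - a) / b.
    """
    n = len(g)
    rng = random.Random(seed)
    order = sorted(range(n), key=lambda i: abs(g[i]), reverse=True)
    n_top, n_other = int(a * n), int(b * n)
    top, rest = order[:n_top], order[n_top:]
    sampled = rng.sample(rest, n_other)      # without replacement
    amplify = (1.0 - a) / b
    return top + sampled, [1.0] * n_top + [amplify] * n_other

data_rng = random.Random(42)
g = [data_rng.gauss(0.0, 1.0) for _ in range(1000)]
indices, weights = goss_sample(g)
# Weighted gradient sum over the sample: an estimate of sum(g) over all rows.
estimate = sum(w * g[i] for i, w in zip(indices, weights))
```

The returned weights should multiply both gradients and Hessians when histograms are built, so that the gain is computed on the reweighted statistics.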

### Information Gain with GOSS

The modified gain calculation becomes:

$$\text{Gain}_{\text{GOSS}} = \frac{1}{2}\left[\frac{\left(\sum_{i \in A_L} g_i + \frac{1-a}{b}\sum_{i \in B_L} g_i\right)^2}{\sum_{i \in A_L} h_i + \frac{1-a}{b}\sum_{i \in B_L} h_i + \lambda} + \text{(right side)} - \text{(parent)}\right]$$

### GOSS Performance Analysis

**Computational Complexity:**
- Standard: $\mathcal{O}(n \times d)$ per tree
- GOSS: $\mathcal{O}((a + b) \times n \times d)$ per tree
- Typical speedup: roughly $\frac{1}{a+b}$ on histogram construction (about 3× at the defaults below), usually with little accuracy loss

**Parameter Guidelines:**
- $a = 0.2$ (keep top 20% of gradients)
- $b = 0.1$ (sample a further 10% of the dataset from the remainder)
- Effective dataset size: $0.2 + 0.1 = 0.3$ (70% reduction)

---

## Regularization: DART

### Problem with Sequential Boosting

Traditional gradient boosting suffers from **over-specialization**: later trees tend to correct mistakes of specific earlier trees, leading to overfitting.

**Visualization of the Problem:**
```
Standard Boosting Dependencies:
Tree1 → Tree2 → Tree3 → Tree4 → Tree5
  ↓      ↓      ↓      ↓      ↓
Each tree sees ALL previous predictions (deterministic context)
```

### DART: Dropout Regularization for Trees

**Core Idea:** Randomly "dropout" some previous trees during each iteration's training, forcing new trees to work with different ensemble contexts.

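**Mathematical Formulation:**

Standard GBDT prediction:
$$F_m(\mathbf{x}) = \sum_{k=1}^m \nu \cdot T_k(\mathbf{x})$$

DART keeps a separate weight $\alpha_k$ for every tree, so that $F_m(\mathbf{x}) = \sum_{k=1}^m \alpha_k \cdot T_k(\mathbf{x})$, and fits the new tree $T_m$ against the ensemble with the dropped trees removed:
$$\tilde{F}_{m-1}(\mathbf{x}) = \sum_{k \in \mathcal{K}_m} \alpha_k \cdot T_k(\mathbf{x})$$

where $\mathcal{D}_m \subseteq \{1, 2, ..., m-1\}$ is the set of dropped trees and $\mathcal{K}_m = \{1, 2, ..., m-1\} \setminus \mathcal{D}_m$ is the set of **non-dropped** trees.

**Normalization Factor:**
$T_m$ is fit to close the gap left by the dropped trees, so adding it at full weight alongside them would overshoot. With $k = |\mathcal{D}_m|$ dropped trees, the original DART scheme shrinks both the new tree and the dropped trees:

$$\alpha_m = \frac{1}{k+1}, \qquad \alpha_j \leftarrow \frac{k}{k+1} \cdot \alpha_j \quad \text{for } j \in \mathcal{D}_m$$

When no tree is dropped ($k = 0$), the step is an ordinary boosting update with $\alpha_m = \nu$.

**Normalized Prediction:**
$$F_m(\mathbf{x}) = \tilde{F}_{m-1}(\mathbf{x}) + \frac{k}{k+1} \sum_{j \in \mathcal{D}_m} \alpha_j \cdot T_j(\mathbf{x}) + \frac{1}{k+1} \cdot T_m(\mathbf{x})$$

where the $\alpha_j$ on the right-hand side are the weights before rescaling.

### DART Algorithm

```
ALGORITHM: DART (Dropouts meet Multiple Additive Regression Trees)
INPUT: Dataset D, loss L, iterations M, learning rate ν, drop_rate, max_drop, skip_drop

1. trees = [], weights = []
2. F_0(x) = initial_prediction

3. FOR m = 1 to M:
   a) // Select trees to drop
   IF trees is non-empty AND RANDOM() > skip_drop:
      dropped = RANDOM_SUBSET(trees, drop_rate, max_drop)
   ELSE:
      dropped = []
   k = |dropped|
   
   b) // Prediction of the ensemble without the dropped trees
   F_partial(x) = F_0(x) + Σ_{j ∉ dropped} weights[j] × trees[j](x)
   
   c) // Train new tree on the residuals of the partial ensemble
   residuals = -∂L(y, F_partial(x)) / ∂F
   T_m = TRAIN_TREE(D, residuals)
   
   d) // Rescale dropped trees and add the new tree
   IF k > 0:
      FOR j in dropped: weights[j] = weights[j] × k / (k + 1)
      new_weight = 1 / (k + 1)
   ELSE:
      new_weight = ν
   trees.APPEND(T_m); weights.APPEND(new_weight)

4. RETURN trees, weights
```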
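The bookkeeping that distinguishes DART from plain boosting is the per-tree weight update. The sketch below assumes the normalization from the original DART paper (new tree weighted $\frac{1}{k+1}$, each dropped tree scaled by $\frac{k}{k+1}$ when $k$ trees are dropped); libraries may fold the learning rate into these factors differently. Function and parameter names are illustrative, and tree training itself is not shown.

```python
import random

def select_dropped(n_trees, drop_rate, max_drop, skip_drop, rng):
    """Choose indices of trees to drop for this iteration (possibly none)."""
    if n_trees == 0 or rng.random() < skip_drop:
        return []
    dropped = [j for j in range(n_trees) if rng.random() < drop_rate]
    if not dropped:
        dropped = [rng.randrange(n_trees)]   # drop at least one tree
    if len(dropped) > max_drop:
        dropped = rng.sample(dropped, max_drop)
    return dropped

def dart_rescale(weights, dropped, learning_rate=1.0):
    """Return the weight list after adding one new tree.

    With k dropped trees, each dropped tree is scaled by k / (k + 1) and the
    new tree enters at 1 / (k + 1). With k = 0 this is a plain boosting step.
    """
    k = len(dropped)
    if k == 0:
        return list(weights) + [learning_rate]
    updated = list(weights)
    for j in dropped:
        updated[j] *= k / (k + 1)
    return updated + [1.0 / (k + 1)]

rng = random.Random(0)
none_dropped = select_dropped(10, drop_rate=0.5, max_drop=3, skip_drop=1.0, rng=rng)
capped = select_dropped(10, drop_rate=1.0, max_drop=3, skip_drop=0.0, rng=rng)
new_weights = dart_rescale([1.0, 1.0, 1.0], dropped=[0, 2])
```

Because earlier trees are rescaled after the fact, cached predictions have to be refreshed for every dropped tree, which is the main source of DART's extra training cost.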

### DART Hyperparameters

**Key Parameters:**

- **drop_rate** $\in [0.1, 0.5]$: Probability of dropping each tree
- **max_drop** $\in [1, 50]$: Maximum number of trees to drop per iteration
- **skip_drop** $\in [0.3, 0.7]$: Probability of skipping dropout entirely
- **normalize_type**: How the new and dropped trees are rescaled (XGBoost offers `tree` and `forest`)

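**Parameter Interaction:**
As a rough guide, the expected number of trees dropped per iteration is:

$$\mathbb{E}[\text{dropped}] \approx (1 - \text{skip\_drop}) \times \min(\text{drop\_rate} \times \text{n\_trees},\ \text{max\_drop})$$

so regularization strengthens with a higher `drop_rate` or `max_drop` and weakens with a higher `skip_drop`.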

### DART vs Standard Boosting

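**Advantages:**
- Reduces overfitting through ensemble diversification
- Can generalize better when later trees would otherwise over-specialize
- Adds regularization beyond shrinkage alone
- Accuracy gains are dataset-dependent; compare against standard boosting on a validation set

**Disadvantages:**
- Slower training: predictions must be recomputed without the dropped trees at every iteration
- More hyperparameters to tune
- Slightly more complex to implement
- Can be less stable during early training, and early stopping is less reliable because earlier trees are rescaled after they are added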

```
Dependency Comparison:

Standard Boosting:
T1 → T2 → T3 → T4 → T5 (each sees all previous)

DART with Dropout:
T1   T2 → T3   T4 → T5 (randomized dependencies)
```

---

## Post-Tree Pruning

### Motivation and Theory

**The Overfitting Problem:**
Decision trees in gradient boosting can grow too deep, memorizing training data patterns that don't generalize. Post-pruning addresses this by removing branches that provide minimal improvement to validation performance.

**Mathematical Foundation:**
For a subtree rooted at node $t$, define the **complexity cost**:

$$C_\alpha(T_t) = \sum_{\ell \in \text{leaves}(T_t)} N_\ell \cdot \text{Error}(\ell) + \alpha \cdot |\text{leaves}(T_t)|$$

where:
- $N_\ell$ = number of samples in leaf $\ell$
- $\text{Error}(\ell)$ = impurity measure for leaf $\ell$
- $\alpha$ = complexity parameter (regularization strength)
- $|\text{leaves}(T_t)|$ = number of leaves in subtree $T_t$

**Pruning Decision Rule:**
Prune subtree $T_t$ if replacing it with a single leaf reduces complexity cost:

$$C_\alpha(\text{leaf}) < C_\alpha(T_t)$$

### Types of Pruning

#### 1. Minimal Cost-Complexity Pruning (Weakest Link)

**Core Idea:** Find the subtree that provides the least improvement per additional leaf.

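For each internal node $t$, compute the **improvement per leaf**, writing $R(t) = N_t \cdot \text{Error}(t)$ for the total error if $t$ were collapsed to a leaf and $R(T_t) = \sum_{\ell \in \text{leaves}(T_t)} N_\ell \cdot \text{Error}(\ell)$ for the total error of its subtree:

$$g(t) = \frac{R(t) - R(T_t)}{|\text{leaves}(T_t)| - 1}$$

$g(t)$ is the value of $\alpha$ at which collapsing $t$ costs exactly as much as keeping its subtree.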

**Algorithm:**
```
ALGORITHM: Minimal Cost-Complexity Pruning
INPUT: Tree T, validation set V
OUTPUT: Optimally pruned tree T*

1. // Build sequence of nested trees
2. tree_sequence = []
3. alpha_sequence = []
4. current_tree = COPY(T)

5. WHILE |leaves(current_tree)| > 1:
   a) // Find weakest link (smallest g(t))
   min_alpha = ∞
   weakest_node = null
   
   FOR each internal node t in current_tree:
      g_t = CALCULATE_IMPROVEMENT_PER_LEAF(t)
      IF g_t < min_alpha:
         min_alpha = g_t
         weakest_node = t
   
   b) // Store current state
   tree_sequence.APPEND(COPY(current_tree))
   alpha_sequence.APPEND(min_alpha)
   
   c) // Prune weakest link
   current_tree = PRUNE_SUBTREE(current_tree, weakest_node)

6. // Add the root-only tree, then select the best tree using the validation set
   tree_sequence.APPEND(COPY(current_tree))
7. best_score = -∞
8. best_tree = null

9. FOR each tree in tree_sequence:
   validation_score = EVALUATE(tree, V)
   IF validation_score > best_score:
      best_score = validation_score
      best_tree = tree

10. RETURN best_tree
```
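The weakest-link search at the heart of this procedure can be sketched on a nested-dict tree. The representation is an assumption made for illustration: every node stores `"error"`, the total error it would have as a leaf, and internal nodes additionally store `"left"` and `"right"`.

```python
def subtree_stats(node):
    """Return (total leaf error, number of leaves) for the subtree at node."""
    if "left" not in node:
        return node["error"], 1
    r_left, n_left = subtree_stats(node["left"])
    r_right, n_right = subtree_stats(node["right"])
    return r_left + r_right, n_left + n_right

def weakest_link(node, best=None):
    """Find the internal node with the smallest improvement per leaf g(t)."""
    if "left" not in node:
        return best
    r_subtree, n_leaves = subtree_stats(node)
    g_t = (node["error"] - r_subtree) / (n_leaves - 1)
    if best is None or g_t < best[0]:
        best = (g_t, node)
    best = weakest_link(node["left"], best)
    return weakest_link(node["right"], best)

tree = {
    "error": 10.0,
    "left": {"error": 4.0, "left": {"error": 1.5}, "right": {"error": 2.0}},
    "right": {"error": 2.0},
}
g_min, node = weakest_link(tree)
```

Here the left child buys only 0.5 units of error reduction for its extra leaf, against 2.25 per extra leaf at the root, so it is the first subtree to be collapsed as $\alpha$ grows.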

#### 2. Reduced Error Pruning

**Core Idea:** Prune nodes where replacing the subtree with a leaf doesn't increase validation error.

**Validation-Based Decision:**
For each internal node $t$:
1. Compute validation accuracy with current subtree: $A_{\text{subtree}}$
2. Compute validation accuracy if replaced by leaf: $A_{\text{leaf}}$
3. If $A_{\text{leaf}} \geq A_{\text{subtree}}$, prune the subtree

```
ALGORITHM: Reduced Error Pruning
INPUT: Tree T, validation set V
OUTPUT: Pruned tree T'

1. T' = COPY(T)
2. changed = True

3. WHILE changed:
   changed = False
   
   FOR each internal node t in T' (bottom-up):
      V_t = validation samples routed to node t
      
      // Current subtree performance on V_t
      pred_subtree = PREDICT_WITH_SUBTREE(V_t, t)
      error_subtree = ERROR(V_t.labels, pred_subtree)
      
      // Leaf replacement performance on the same samples
      // (majority class for classification trees, w* for boosting trees)
      leaf_prediction = LEAF_VALUE(training_samples_at_node_t)
      pred_leaf = [leaf_prediction] * |V_t|
      error_leaf = ERROR(V_t.labels, pred_leaf)
      
      IF error_leaf ≤ error_subtree:
         T' = REPLACE_SUBTREE_WITH_LEAF(T', t)
         changed = True

4. RETURN T'
```

#### 3. Critical Value Pruning

**Core Idea:** Prune nodes where the split provides less improvement than a threshold.

**Mathematical Criterion:**
For node $t$ with split creating children $t_L$ and $t_R$:

$$\Delta_{\text{impurity}} = I(t) - \frac{N_{t_L}}{N_t} I(t_L) - \frac{N_{t_R}}{N_t} I(t_R)$$

Prune if $\Delta_{\text{impurity}} < \tau$ (critical value threshold).

**For Gradient Boosting Trees:**
Using the gain formula:

$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G^2}{H + \lambda}\right]$$

Prune if $\text{Gain} < \tau_{\text{gain}}$.

```
ALGORITHM: Critical Value Pruning
INPUT: Tree T, threshold τ
OUTPUT: Pruned tree T'

1. T' = COPY(T)

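2. FOR each internal node t in T' (bottom-up):
   
   a) gain = CALCULATE_SPLIT_GAIN(t)
   
   b) // Collapse only when both children are already leaves, so a weak
      // split is kept if a strong split survives beneath it
      IF gain < τ AND both children of t are leaves:
      T' = REPLACE_SUBTREE_WITH_LEAF(T', t)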

3. RETURN T'
```
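A sketch of gain-threshold pruning on a nested-dict tree, under two assumptions that go beyond the pseudocode: each node stores its accumulated `G`, `H` and split `gain`, and a split is collapsed only when both of its children are already leaves. The collapsed node keeps `G` and `H`, so its Newton leaf value can be recomputed.

```python
def is_leaf(node):
    return "left" not in node

def leaf_value(node, lam=1.0):
    """Newton leaf value w* = -G / (H + lambda) from stored statistics."""
    return -node["G"] / (node["H"] + lam)

def prune_by_gain(node, tau):
    """Return a pruned copy: bottom-up, collapse low-gain splits over two leaves."""
    if is_leaf(node):
        return node
    left = prune_by_gain(node["left"], tau)
    right = prune_by_gain(node["right"], tau)
    if is_leaf(left) and is_leaf(right) and node["gain"] < tau:
        return {"G": node["G"], "H": node["H"]}      # node becomes a leaf
    return {**node, "left": left, "right": right}

def count_leaves(node):
    if is_leaf(node):
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])

tree = {
    "G": -8.0, "H": 4.0, "gain": 2.0,
    "left": {"G": -2.0, "H": 2.0, "gain": 0.05,
             "left": {"G": -0.9, "H": 1.0}, "right": {"G": -1.1, "H": 1.0}},
    "right": {"G": -6.0, "H": 2.0},
}
pruned = prune_by_gain(tree, tau=0.1)
```

Because the function builds new dicts instead of mutating, the original tree stays available for comparing several thresholds.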

### Gradient Boosting Specific Considerations

#### Optimal Leaf Values After Pruning

When a subtree is replaced by a leaf, the optimal leaf value for second-order methods is:

$$w^* = -\frac{\sum_{i \in \text{leaf}} g_i}{\sum_{i \in \text{leaf}} h_i + \lambda}$$

For first-order methods with squared-error loss, use the mean of residuals:

$$w^* = \frac{1}{|\text{leaf}|} \sum_{i \in \text{leaf}} r_i$$

#### Integration with Boosting Loop

**Strategy 1: Post-hoc Pruning**
```
FOR each boosting iteration m:
1. tree = BUILD_FULL_TREE(gradients, hessians)
2. pruned_tree = PRUNE_TREE(tree, validation_set)
3. ensemble.ADD(pruned_tree)
```

**Strategy 2: Online Pruning**
```
FOR each boosting iteration m:
1. tree = BUILD_TREE_WITH_EARLY_STOPPING(gradients, hessians)
2. // Tree is naturally pruned during construction
3. ensemble.ADD(tree)
```

### Pruning Metrics and Evaluation

#### Complexity Measures

**Tree Size Metrics:**
- Number of leaves: $|\text{leaves}(T)|$
- Tree depth: $\max_{\ell \in \text{leaves}} \text{depth}(\ell)$
- Total nodes: $|\text{nodes}(T)|$

**Model Complexity:**
For ensemble of $M$ trees:

$$\text{Complexity} = \sum_{m=1}^M |\text{leaves}(T_m)|$$

#### Performance Evaluation

**Bias-Variance Trade-off:**
Pruning typically:
- **Increases bias:** Simpler trees make more assumptions
- **Decreases variance:** Less sensitive to training data variations
- **Reduces overfitting:** Better generalization to unseen data

**Validation Strategy:**
```
ALGORITHM: Pruning Validation
INPUT: Ensemble E, validation set V, test set T

1. // Measure complexity vs performance
2. complexity_levels = [0.1, 0.2, 0.5, 1.0, 2.0, 5.0]
3. results = []

4. FOR each α in complexity_levels:
   a) pruned_ensemble = []
   
   FOR each tree in E:
      pruned_tree = PRUNE_WITH_ALPHA(tree, α)
      pruned_ensemble.ADD(pruned_tree)
   
   b) val_score = EVALUATE(pruned_ensemble, V)
   c) test_score = EVALUATE(pruned_ensemble, T)
   d) complexity = MEASURE_COMPLEXITY(pruned_ensemble)
   
   e) results.ADD({
        alpha: α,
        validation_score: val_score,
        test_score: test_score,
        complexity: complexity
      })

5. // Select optimal α on validation data only; test_score is for final reporting
6. best = the entry in results with the highest validation_score
7. RETURN best.alpha
```

### Advanced Pruning Techniques

#### 1. Ensemble-Aware Pruning

**Problem:** Individual tree pruning ignores ensemble interactions.

**Solution:** Prune considering the entire ensemble's performance.

```
ALGORITHM: Ensemble-Aware Pruning
INPUT: Ensemble E = {T_1, T_2, ..., T_M}, validation set V

1. FOR m = 1 to M:
   a) ensemble_minus_m = E \ {T_m}
   b) candidates = GENERATE_PRUNED_VERSIONS(T_m)
   
   c) best_score = EVALUATE(E, V)  // Baseline: keep T_m unless a candidate beats it
   d) best_tree = T_m
   
   FOR each candidate in candidates:
      temp_ensemble = ensemble_minus_m ∪ {candidate}
      score = EVALUATE(temp_ensemble, V)
      
      IF score > best_score:
         best_score = score
         best_tree = candidate
   
   e) E[m] = best_tree

2. RETURN E
```

#### 2. Progressive Pruning

**Core Idea:** Gradually increase pruning strength as ensemble grows.

**Rationale:** Early trees need more complexity to capture main patterns; later trees can be simpler for fine-tuning.

```
ALGORITHM: Progressive Pruning
INPUT: Training data D, iterations M

1. ensemble = []
2. base_threshold = 0.01

3. FOR m = 1 to M:
   a) // Adaptive pruning threshold
   threshold = base_threshold * (1 + m/M)
   
   b) tree = BUILD_TREE(D, gradients, hessians)
   c) pruned_tree = PRUNE_WITH_THRESHOLD(tree, threshold)
   
   d) ensemble.ADD(pruned_tree)
   e) UPDATE_PREDICTIONS(D, pruned_tree)

4. RETURN ensemble
```

### Practical Guidelines

#### Hyperparameter Selection

**Pruning Strength:**
- **Light pruning:** $\alpha \in [0.001, 0.01]$ - minimal complexity reduction
- **Moderate pruning:** $\alpha \in [0.01, 0.1]$ - balanced approach
- **Heavy pruning:** $\alpha \in [0.1, 1.0]$ - aggressive simplification

**Validation Strategy:**
- Use a separate validation set (distinct from both training and test data)
- Cross-validate pruning parameters
- Monitor both accuracy and model size

#### When to Apply Pruning

**Indicators for Pruning:**
- Large gap between training and validation performance
- Trees with many leaves but poor validation scores
- Memory or inference speed constraints
- Need for model interpretability

**Pruning Decision Framework:**
```
Dataset Size?
├─ Small (<1K): Light pruning (preserve capacity)
├─ Medium (1K-100K): Moderate pruning
└─ Large (>100K): Heavy pruning acceptable

Overfitting Severity?
├─ High: Aggressive pruning + early stopping
├─ Medium: Standard cost-complexity pruning
└─ Low: Minimal or no pruning

Deployment Constraints?
├─ Mobile/Edge: Heavy pruning for efficiency
├─ Server: Balance accuracy vs complexity
└─ Research: Minimal pruning for max performance
```

#### Performance Impact

**Expected Improvements (indicative ranges; actual gains depend on the dataset and model):**
- **Model size:** 30-70% reduction in tree complexity
- **Inference speed:** 20-50% faster predictions
- **Generalization:** 1-5% improvement in test accuracy (when overfitting)
- **Memory usage:** Proportional to complexity reduction

---

## Feature Importance in Gradient Boosting

### Motivation and Theory

**Why Feature Importance Matters:**
Understanding which features contribute most to predictions is crucial for:
- Model interpretability and debugging
- Feature selection and dimensionality reduction
- Business insights and decision-making
- Regulatory compliance and model auditing

**Challenge in Ensemble Methods:**
Unlike linear models where coefficients directly indicate importance, tree ensembles require aggregating importance across multiple trees and splits.

### Types of Feature Importance

#### 1. Split-Based Importance (Gain/Impurity)

**Mathematical Foundation:**
For each split in tree $T_m$ at node $t$ using feature $j$, the importance contribution is:

$I_{j,t}^{(m)} = p_t \cdot \Delta I_t$

where:
- $p_t = \frac{N_t}{N}$ is the proportion of samples reaching node $t$
- $\Delta I_t$ is the impurity reduction from the split

**For Gradient Boosting (Second-Order):**
The gain-based importance for feature $j$ in tree $m$ is:

$I_j^{(m)} = \sum_{t: \text{split on } j} \frac{N_t}{N} \cdot \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G^2}{H + \lambda}\right]$

**Ensemble Aggregation:**
Total importance for feature $j$ across all trees:

$I_j^{\text{total}} = \sum_{m=1}^M I_j^{(m)}$

**Normalized Importance:**
$I_j^{\text{norm}} = \frac{I_j^{\text{total}}}{\sum_{k=1}^d I_k^{\text{total}}}$

```
ALGORITHM: Split-Based Feature Importance
INPUT: Ensemble E = {T_1, T_2, ..., T_M}, features d
OUTPUT: Feature importance vector I

1. importance = ZEROS(d)

2. FOR m = 1 to M:
   FOR each internal node t in T_m:
      j = FEATURE_USED_AT_SPLIT(t)
      sample_weight = NUMBER_OF_SAMPLES(t) / TOTAL_SAMPLES
      gain = SPLIT_GAIN(t)  // From gradient boosting gain formula
      
      importance[j] += sample_weight * gain

3. // Normalize to sum to 1
4. total_importance = SUM(importance)
5. IF total_importance > 0:
   importance = importance / total_importance

6. RETURN importance
```
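
The aggregation above can be sketched in Python. The nested-dict tree layout used here (`feature`, `n_samples`, `gain`, `left`, `right`) is a hypothetical representation chosen for illustration, not the storage format of any particular library:

```python
def split_importance(trees, n_features, total_samples):
    """Sum sample-weighted split gains per feature, then normalize to sum to 1."""
    importance = [0.0] * n_features

    def visit(node):
        if "feature" not in node:  # leaf: no split to credit
            return
        weight = node["n_samples"] / total_samples
        importance[node["feature"]] += weight * node["gain"]
        visit(node["left"])
        visit(node["right"])

    for root in trees:
        visit(root)

    total = sum(importance)
    return [v / total for v in importance] if total > 0 else importance


leaf = {"value": 0.0}
trees = [
    {"feature": 0, "n_samples": 100, "gain": 8.0,
     "left": {"feature": 1, "n_samples": 50, "gain": 2.0,
              "left": leaf, "right": leaf},
     "right": leaf},
    {"feature": 1, "n_samples": 100, "gain": 1.0, "left": leaf, "right": leaf},
]
importance = split_importance(trees, n_features=3, total_samples=100)
```

In this toy ensemble feature 0 collects a weighted gain of 8, feature 1 collects 2 (half of 2 plus all of 1), and feature 2 is never used, so the normalized vector is 0.8, 0.2, 0.0.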

#### 2. Frequency-Based Importance (Split Count)

**Core Idea:** Count how often each feature is used for splits across all trees.

**Mathematical Formulation:**
$I_j^{\text{freq}} = \frac{1}{M} \sum_{m=1}^M C_j^{(m)}$

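where $C_j^{(m)}$ is the count of splits using feature $j$ in tree $m$. This is the average number of splits on feature $j$ per tree; the algorithm below instead divides by the total (weighted) split count so that the scores sum to 1. Both versions rank features identically.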

**Weighted by Tree Depth:**
A more sophisticated version weights each split by its depth, so that splits nearer the root count more:

$I_j^{\text{weighted-freq}} = \sum_{m=1}^M \sum_{t: \text{split on } j} \frac{1}{2^{\text{depth}(t)}}$

```
ALGORITHM: Frequency-Based Feature Importance
INPUT: Ensemble E, optional depth_weighting

1. frequency = ZEROS(d)
2. total_splits = 0

3. FOR m = 1 to M:
   FOR each internal node t in T_m:
      j = FEATURE_USED_AT_SPLIT(t)
      
      IF depth_weighting:
         weight = 1.0 / (2^DEPTH(t))
      ELSE:
         weight = 1.0
      
      frequency[j] += weight
      total_splits += weight

4. // Normalize
5. IF total_splits > 0:
   frequency = frequency / total_splits

6. RETURN frequency
```
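
A minimal Python sketch of the counting scheme, assuming a hypothetical nested-dict tree layout in which internal nodes carry a `feature` key and the root sits at depth 0:

```python
def frequency_importance(trees, n_features, depth_weighting=False):
    """Count splits per feature, optionally weighting each by 1 / 2**depth."""
    frequency = [0.0] * n_features

    def visit(node, depth):
        if "feature" not in node:  # leaf
            return
        weight = 1.0 / (2 ** depth) if depth_weighting else 1.0
        frequency[node["feature"]] += weight
        visit(node["left"], depth + 1)
        visit(node["right"], depth + 1)

    for root in trees:
        visit(root, 0)

    total = sum(frequency)
    return [v / total for v in frequency] if total > 0 else frequency


leaf = {"value": 0.0}
tree_a = {"feature": 0,
          "left": {"feature": 1, "left": leaf, "right": leaf},
          "right": {"feature": 1, "left": leaf, "right": leaf}}
tree_b = {"feature": 0, "left": leaf, "right": leaf}

plain = frequency_importance([tree_a, tree_b], n_features=2)
weighted = frequency_importance([tree_a, tree_b], n_features=2, depth_weighting=True)
```

Both features are split on twice, so the plain counts tie. With depth weighting, the two root splits on feature 0 outweigh the two depth-1 splits on feature 1.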

#### 3. Permutation Importance

**Core Principle:** Measure the increase in prediction error when a feature's values are randomly permuted.

**Mathematical Definition:**
For feature $j$, permutation importance is:

$I_j^{\text{perm}} = \frac{1}{K} \sum_{k=1}^K \left[\text{Error}(\mathbf{y}, \hat{\mathbf{y}}^{(k)}_{\pi_j}) - \text{Error}(\mathbf{y}, \hat{\mathbf{y}})\right]$

where:
- $\hat{\mathbf{y}}^{(k)}_{\pi_j}$ are predictions with feature $j$ permuted in the $k$-th trial
- $K$ is the number of permutation trials (typically 5-10)

**Advantages:**
- Model-agnostic (works with any ML algorithm)
- Captures feature interactions
- Based on actual prediction performance

**Disadvantages:**
- Computationally expensive (requires recomputing predictions)
- Can be unreliable with highly correlated features

```
ALGORITHM: Permutation Feature Importance
INPUT: Model M, dataset (X, y), metric ERROR, trials K
OUTPUT: Permutation importance vector

1. baseline_predictions = M.PREDICT(X)
2. baseline_error = ERROR(y, baseline_predictions)
3. importance = ZEROS(d)

4. FOR j = 1 to d:  // Each feature
   trial_errors = []
   
   FOR k = 1 to K:  // Multiple trials
      X_permuted = COPY(X)
      X_permuted[:, j] = RANDOM_PERMUTATION(X[:, j])
      
      permuted_predictions = M.PREDICT(X_permuted)
      permuted_error = ERROR(y, permuted_predictions)
      
      trial_errors.APPEND(permuted_error)
   
   // Average importance across trials
   importance[j] = MEAN(trial_errors) - baseline_error

5. RETURN importance
```
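
The procedure can be sketched with the standard library alone. Here the model is any callable mapping a row to a prediction; the toy `predict` function and the synthetic data are assumptions made for illustration:

```python
import random


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, error_fn, n_trials=5, seed=0):
    """Mean increase in error when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = error_fn(y, [predict(row) for row in X])
    importance = []
    for j in range(len(X[0])):
        trial_errors = []
        for _ in range(n_trials):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild each row with only column j replaced by the shuffled value
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            trial_errors.append(error_fn(y, [predict(row) for row in X_perm]))
        importance.append(sum(trial_errors) / n_trials - baseline)
    return importance


def predict(row):
    """Toy model that depends on feature 0 only."""
    return 2.0 * row[0]


X = [[float(i), float((7 * i) % 5)] for i in range(20)]
y = [2.0 * row[0] for row in X]
scores = permutation_importance(predict, X, y, mean_squared_error)
```

Because the toy model ignores feature 1, shuffling that column leaves the error unchanged and its score is exactly zero, while shuffling feature 0 raises the error.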

#### 4. SHAP (SHapley Additive exPlanations) Values

**Theoretical Foundation:**
SHAP values are based on cooperative game theory, satisfying four axioms:
- **Efficiency:** $\sum_{j=1}^d \phi_j = f(\mathbf{x}) - \mathbb{E}[f(\mathbf{X})]$
- **Symmetry:** If features contribute equally, they have equal SHAP values
- **Dummy:** Features that don't affect output have zero SHAP value
- **Additivity:** For ensemble $f = f_1 + f_2$, $\phi_j^f = \phi_j^{f_1} + \phi_j^{f_2}$

**Mathematical Definition:**
The SHAP value for feature $j$ is:

$\phi_j = \sum_{S \subseteq \mathcal{F} \setminus \{j\}} \frac{|S|!(d-|S|-1)!}{d!} [f(S \cup \{j\}) - f(S)]$

where $\mathcal{F}$ is the set of all features and $S$ represents feature subsets.
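
For a handful of features the definition can be evaluated by brute force. The sketch below assumes one particular value function: $f(S)$ keeps the instance's values for features in $S$ and substitutes a fixed baseline vector elsewhere, so $f(\text{baseline})$ stands in for the expectation in the efficiency axiom. The helper names and the toy model are illustrative:

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating every subset S of the other features."""
    d = len(x)

    def f(subset):
        # Features in `subset` take the instance value, the rest the baseline
        return model([x[j] if j in subset else baseline[j] for j in range(d)])

    phi = []
    for j in range(d):
        others = [k for k in range(d) if k != j]
        total = 0.0
        for size in range(d):
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for S in combinations(others, size):
                total += weight * (f(set(S) | {j}) - f(set(S)))
        phi.append(total)
    return phi


def model(v):
    """Toy model: an additive term plus a pairwise interaction."""
    return 3.0 * v[0] + v[1] * v[2]


x = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)
```

The additive term is credited entirely to feature 0, and the interaction term $x_1 x_2 = 8$ is split evenly between the two features involved, so the values sum to $f(\mathbf{x}) - f(\text{baseline}) = 11$. The enumeration costs $O(2^d)$ model evaluations per feature, which is what motivates tree-specific algorithms.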

**TreeSHAP for Gradient Boosting:**
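For tree ensembles, TreeSHAP computes Shapley values in polynomial time, $O(M L D^2)$ for $M$ trees with at most $L$ leaves and depth $D$, instead of enumerating all $2^d$ feature subsets. By the additivity axiom, the ensemble attribution is the sum of per-tree attributions: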

$\phi_j = \sum_{m=1}^M \phi_j^{(m)}$

where $\phi_j^{(m)}$ is the SHAP value for feature $j$ in tree $m$.

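**Tree Path Calculation (simplified):**
For a tree with path $P$ from root to leaf, a single-path heuristic compares, at each split on feature $j$, the branch the instance follows (the hot path) with the branch it does not follow (the cold path), scaled by the fraction of background samples that reach that branch ($\text{zero\_frac}$):

$\phi_j^{\text{tree}} = \sum_{t \in P: \text{split on } j} \frac{\text{one\_hot}(j, t)}{\text{zero\_frac}(t)} \cdot [\text{leaf\_value}(\text{hot\_path}) - \text{leaf\_value}(\text{cold\_path})]$

Here $\text{one\_hot}(j, t)$ is read as an indicator that the instance takes the hot branch at node $t$. This expression and the pseudocode that follows convey the path-tracking idea only. Full TreeSHAP additionally weights every feature subset along each path, so this simplification does not in general reproduce exact Shapley values.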

```
ALGORITHM: TreeSHAP Feature Importance
INPUT: Tree T, instance x, background dataset D
OUTPUT: SHAP values φ

1. // Initialize path tracking
2. path = []  // Stores (feature, threshold, zero_fraction)
3. shap_values = ZEROS(d)

4. FUNCTION RECURSIVE_SHAP(node, path, zero_frac):
   IF node is leaf:
      // Distribute leaf value across path features
      FOR each (feature_j, threshold, z_frac) in path:
         weight = zero_frac / z_frac
         shap_values[feature_j] += weight * node.value
      RETURN
   
   // Get feature split and zero fraction
   feature_j = node.split_feature
   threshold = node.split_threshold
   
   // Calculate zero fraction (how often background data goes left)
   zero_frac_left = COUNT(D[:, feature_j] <= threshold) / |D|
   
   // Add to path
   path.APPEND((feature_j, threshold, zero_frac_left))
   
   // Recurse on children
   IF x[feature_j] <= threshold:
      RECURSIVE_SHAP(node.left, path, zero_frac * zero_frac_left)
   ELSE:
      RECURSIVE_SHAP(node.right, path, zero_frac * (1 - zero_frac_left))
   
   // Remove from path
   path.POP()

5. // Start recursion
6. RECURSIVE_SHAP(T.root, [], 1.0)

7. RETURN shap_values
```

### Gradient Boosting Specific Considerations

#### Learning Rate Effects

**Issue:** Learning rate affects the magnitude of tree contributions but not relative importance.

**Solution:** When aggregating across trees, importance should be scale-invariant:

$I_j^{\text{adjusted}} = \sum_{m=1}^M \frac{I_j^{(m)}}{\max_k I_k^{(m)}}$

#### Tree Depth and Importance

**Shallow Trees:** Feature importance tends to be more concentrated on fewer features.
**Deep Trees:** Importance is more distributed, capturing complex interactions.

**Depth-Adjusted Importance:**
Weight importance by the depth at which features appear:

$I_j^{\text{depth-adj}} = \sum_{m=1}^M \sum_{t: \text{split on } j} w_{\text{depth}(t)} \cdot \text{gain}(t)$

where $w_k = \frac{1}{\sqrt{k + 1}}$ is the weight for a split at depth $k$ (with the root at depth 0, so $w_0 = 1$), which reduces importance for deeper splits.

#### Handling Missing Values

**Approach 1:** Exclude samples with missing values from importance calculation.
**Approach 2:** Use default direction importance:

For missing value handling in XGBoost-style algorithms:

$I_j^{\text{missing}} = \sum_{\text{splits on } j} p_{\text{missing}} \cdot \text{gain} \cdot \mathbf{1}[\text{default direction chosen}]$

### Ensemble-Level Importance Aggregation

#### Simple Average
$I_j = \frac{1}{M} \sum_{m=1}^M I_j^{(m)}$

#### Weighted by Tree Performance
Weight trees by their individual contribution to ensemble performance:

$I_j = \frac{\sum_{m=1}^M w_m \cdot I_j^{(m)}}{\sum_{m=1}^M w_m}$

where $w_m$ could be:
- Tree's contribution to overall loss reduction
- Tree's validation performance
- Inverse of tree's training error

#### Time-Decay Weighting (for DART)
In DART, later trees may be more representative of the final model:

$I_j = \sum_{m=1}^M e^{-\alpha(M-m)} \cdot I_j^{(m)}$

where $\alpha$ controls decay rate.

### Comparative Analysis of Methods

| Method | Computational Cost | Interpretability | Captures Interactions | Robust to Correlation |
|--------|-------------------|------------------|---------------------|----------------------|
| **Split-based** | Low | High | Limited | No |
| **Frequency** | Low | High | No | No |
| **Permutation** | High | Medium | Yes | Partially |
| **SHAP** | Medium | Very High | Yes | Partially |

### Practical Implementation Guidelines

#### Method Selection Framework

```
Use Case Decision Tree:

Model Debugging?
└─ Split-based importance (fast, intuitive)

Feature Selection?
├─ Permutation importance (performance-based)
└─ SHAP with feature clustering

Regulatory/Explanation?
└─ SHAP values (theoretically grounded)

Quick Overview?
└─ Frequency-based (fastest)
```

#### Stability and Reliability

**Cross-Validation Importance:**
```
ALGORITHM: Stable Feature Importance
INPUT: Dataset D, CV folds K

1. importance_matrix = ZEROS(K, d)

2. FOR fold k = 1 to K:
   train_k, val_k = CV_SPLIT(D, k)
   model_k = TRAIN(train_k)
   importance_matrix[k, :] = CALCULATE_IMPORTANCE(model_k, val_k)

3. // Aggregate statistics
4. mean_importance = MEAN(importance_matrix, axis=0)
5. std_importance = STD(importance_matrix, axis=0)
6. ci_lower = PERCENTILE(importance_matrix, 5, axis=0)
7. ci_upper = PERCENTILE(importance_matrix, 95, axis=0)

8. RETURN mean_importance, std_importance, ci_lower, ci_upper
```

#### Visualization and Interpretation

**Importance Ranking Plot:**
- Bar chart with error bars for uncertainty
- Separate plots for different importance types
- Include feature names and importance values

**Feature Interaction Analysis:**
For SHAP values, analyze feature interactions:

$\phi_{ij} = \sum_{S \subseteq \mathcal{F} \setminus \{i,j\}} \frac{|S|!(d-|S|-2)!}{2(d-1)!} \cdot \Delta_{ij}(S)$

where $\Delta_{ij}(S) = f(S \cup \{i,j\}) - f(S \cup \{i\}) - f(S \cup \{j\}) + f(S)$

### Common Pitfalls and Solutions

#### 1. Correlation Bias
**Problem:** Correlated features may have artificially split importance.
**Solution:** Use feature clustering or permutation importance with groups:

```
ALGORITHM: Group Permutation Importance
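INPUT: Model, data (X, y), trials K, correlated feature groups G = {G_1, G_2, ..., G_k}

baseline_error = EVALUATE(model, X, y)

FOR each group G_i:
   permuted_error = 0
   FOR trial = 1 to K:
      X_perm = COPY(X)
      π = RANDOM_PERMUTATION(row indices)  // One shuffle shared by the whole group
      FOR feature j in G_i:
         X_perm[:, j] = X[π, j]  // Keeps within-group correlation intact
      
      error = EVALUATE(model, X_perm, y)
      permuted_error += error
   
   group_importance[G_i] = permuted_error / K - baseline_error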
```
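
A standard-library sketch of group permutation follows. It applies one row shuffle to every column in a group, so the permuted rows stay internally consistent within the group. The group names, toy model, and data are assumptions for illustration:

```python
import random


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def group_permutation_importance(predict, X, y, error_fn, groups,
                                 n_trials=5, seed=0):
    """Error increase when all columns of a group are shuffled together."""
    rng = random.Random(seed)
    n = len(X)
    baseline = error_fn(y, [predict(row) for row in X])
    result = {}
    for name, columns in groups.items():
        total = 0.0
        for _ in range(n_trials):
            order = list(range(n))
            rng.shuffle(order)  # one row shuffle shared by the whole group
            X_perm = [list(row) for row in X]
            for i, src in enumerate(order):
                for j in columns:
                    X_perm[i][j] = X[src][j]
            total += error_fn(y, [predict(row) for row in X_perm])
        result[name] = total / n_trials - baseline
    return result


def predict(row):
    """Toy model using two perfectly correlated features and ignoring the third."""
    return row[0] + row[1]


X = [[float(i), float(i), float((3 * i) % 7)] for i in range(15)]
y = [predict(row) for row in X]
groups = {"correlated": [0, 1], "unused": [2]}
group_scores = group_permutation_importance(
    predict, X, y, mean_absolute_error, groups
)
```

Permuting the correlated pair together measures their joint contribution, which avoids the case where shuffling one copy alone produces unrealistic rows.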

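#### 2. Scale and Cardinality Bias
**Problem:** Tree splits depend only on the ordering of feature values, so rescaling a feature does not change its split-based importance. The practical bias is different: gain- and count-based importance tends to favor continuous or high-cardinality features, which offer more candidate split points.
**Solution:** Cross-check split-based rankings against permutation importance computed on held-out data. Standardizing features is not required for tree ensembles.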

#### 3. Sample Size Effects
**Problem:** Importance can be unstable with small datasets.
**Solution:** Use bootstrap confidence intervals:

```
ALGORITHM: Bootstrap Importance Confidence
INPUT: Model M, data (X, y), bootstrap samples B

FOR b = 1 to B:
   X_boot, y_boot = BOOTSTRAP_SAMPLE(X, y)
   importance_boot[b] = CALCULATE_IMPORTANCE(M, X_boot, y_boot)

confidence_interval = PERCENTILE(importance_boot, [2.5, 97.5], axis=0)
```
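
The bootstrap loop can be sketched as follows. The `importance_fn` argument stands in for `CALCULATE_IMPORTANCE` with the model already bound. The toy importance used here (error increase when a column is zeroed) is a placeholder assumption, not a recommended measure:

```python
import random


def percentile(values, q):
    """Linear-interpolation percentile, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def bootstrap_importance_ci(importance_fn, X, y, n_boot=200, seed=0):
    """95% percentile interval of each feature's importance over resamples."""
    rng = random.Random(seed)
    n = len(X)
    samples = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        samples.append(importance_fn([X[i] for i in idx], [y[i] for i in idx]))
    return [
        (percentile([s[j] for s in samples], 2.5),
         percentile([s[j] for s in samples], 97.5))
        for j in range(len(samples[0]))
    ]


def zeroing_importance(X_b, y_b):
    """Placeholder importance: MSE increase when a column is set to zero."""
    def mse(rows):
        return sum((t - 2.0 * r[0]) ** 2 for t, r in zip(y_b, rows)) / len(y_b)
    base = mse(X_b)
    scores = []
    for j in range(len(X_b[0])):
        zeroed = [r[:j] + [0.0] + r[j + 1:] for r in X_b]
        scores.append(mse(zeroed) - base)
    return scores


X = [[float(i), float(i % 3)] for i in range(1, 21)]
y = [2.0 * row[0] for row in X]
ci = bootstrap_importance_ci(zeroing_importance, X, y, n_boot=100)
```

An interval that excludes zero, as for feature 0 here, suggests the importance is stable under resampling. An interval that collapses onto zero, as for feature 1, indicates the feature is not used.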

### Advanced Topics

#### Dynamic Importance Over Training
Track how feature importance evolves during boosting:

$I_j^{(1:m)} = \sum_{k=1}^m I_j^{(k)}$

This reveals:
- Which features are learned first
- When diminishing returns occur
- Optimal stopping points for feature subsets
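
A short sketch of the running sum, assuming per-tree importance vectors are available as a list of lists (one inner list per tree, one entry per feature). The function name and the numbers are illustrative:

```python
from itertools import accumulate


def cumulative_importance(per_tree_importance):
    """Running sum I_j^(1:m) for every feature j and every prefix length m."""
    n_features = len(per_tree_importance[0])
    columns = [[tree[j] for tree in per_tree_importance]
               for j in range(n_features)]
    return [list(accumulate(col)) for col in columns]


# Rows are trees in boosting order, columns are features.
per_tree = [[0.6, 0.1], [0.3, 0.2], [0.1, 0.4]]
curves = cumulative_importance(per_tree)
```

Plotting each curve against the iteration index shows where a feature's contribution flattens. In this example feature 0 gains most of its importance in the first tree, while feature 1 gains most of its importance in the last.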

#### Conditional Importance
Importance of feature $j$ given feature $i$ is already in the model:

$I_{j|i} = I(\text{model with } i,j) - I(\text{model with } i)$

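Here $I(\cdot)$ is read as a model-level performance measure (for example, validation accuracy), so $I_{j|i}$ is the gain from adding feature $j$ to a model that already uses $i$. Values near zero indicate that $j$ is redundant given $i$; large values indicate complementary information.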

---

## Implementation Guidelines

### Algorithm Selection Framework

```
DECISION TREE: Choosing the Right Configuration

Dataset Size?
├─ Small (<10K): Use exact split finding
└─ Large (>10K): Use histogram method

Overfitting Risk?
├─ High: Enable DART regularization
└─ Low: Standard gradient boosting

Training Speed Priority?
├─ High: Enable GOSS sampling
└─ Low: Use full dataset

Accuracy Requirements?
├─ Maximum: Second-order + exact splits
└─ Balanced: Second-order + histogram + GOSS
```

### Hyperparameter Guidelines

**Core Parameters:**
```
learning_rate ∈ [0.01, 0.3]
├─ 0.01-0.05: Conservative, needs more trees
├─ 0.1: Good default
└─ 0.2-0.3: Aggressive, fewer trees needed

n_estimators ∈ [50, 3000]
├─ Depends inversely on learning_rate
└─ Use early stopping for optimal count

max_depth ∈ [3, 10]
├─ 3-6: Good for most problems
└─ 7-10: Deep trees, risk overfitting
```

**Advanced Parameters:**
```
GOSS Configuration:
├─ top_rate: 0.2 (keep 20% top gradients)
├─ other_rate: 0.1 (sample 10% others)
└─ Effective dataset: 30% of original

DART Configuration:
├─ drop_rate: 0.1 (moderate regularization)
├─ max_drop: min(50, n_trees/4)
└─ skip_drop: 0.5 (skip dropout 50% of time)

Histogram Configuration:
├─ max_bins: 255 (LightGBM default)
└─ binning_method: quantile (balanced bins)
```

### Performance Optimization Strategies

**Memory Optimization:**
```
Data Layout:
├─ Use float32 instead of float64 (halves memory)
├─ Ensure C-contiguous arrays
└─ Consider memory-mapped files for huge datasets

Tree Storage:
├─ Compress leaf values
├─ Use compact tree representations
└─ Implement tree pruning for minimal accuracy loss
```

**Computational Optimization:**
```
Parallelization:
├─ Feature-level: Evaluate splits in parallel
├─ Data-level: Batch processing for large datasets
└─ NUMA-aware: Pin threads to CPU cores

Algorithmic:
├─ Early stopping: Monitor validation loss
├─ Feature selection: Remove irrelevant features
└─ Smart initialization: Better than naive mean
```

### Convergence and Stopping Criteria

**Early Stopping Implementation:**
```
ALGORITHM: Early Stopping
patience = 50
best_score = -∞
wait_count = 0

FOR each iteration:
    TRAIN_ONE_ROUND()  // Add the next tree to the ensemble
    validation_score = EVALUATE(validation_set)  // Higher is better; negate a loss
    
    IF validation_score > best_score:
        best_score = validation_score
        wait_count = 0
        SAVE_MODEL()
    ELSE:
        wait_count += 1
        
    IF wait_count >= patience:
        BREAK  // Stop training
        
LOAD_BEST_MODEL()
```
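
The patience logic can be sketched in isolation by replaying a list of validation scores, assuming a higher-is-better metric. The function name and the score sequence are illustrative:

```python
def early_stopping_round(validation_scores, patience=50):
    """Return (best_iteration, best_score, stopped_at) for a higher-is-better metric."""
    best_score = float("-inf")
    best_iteration = -1
    wait = 0
    for iteration, score in enumerate(validation_scores):
        if score > best_score:
            # Strict improvement resets the patience counter
            best_score, best_iteration, wait = score, iteration, 0
        else:
            wait += 1
        if wait >= patience:
            return best_iteration, best_score, iteration
    return best_iteration, best_score, len(validation_scores) - 1


scores = [0.70, 0.74, 0.78, 0.77, 0.78, 0.76, 0.75]
result = early_stopping_round(scores, patience=3)
```

With `patience=3`, training stops at iteration 5 and the model from iteration 2 is restored. The tie at iteration 4 does not reset the counter because the comparison is strict. To monitor a loss instead, pass the negated loss values.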

### Debugging and Monitoring

**Key Metrics to Track:**
- Training vs validation loss curves
- Feature importance evolution
- Gradient magnitude distribution (for GOSS)
- Tree depth and leaf statistics
- Memory usage and training time per iteration

**Common Issues and Solutions:**
```
Problem: Overfitting
├─ Increase regularization (DART, L1/L2)
├─ Reduce learning_rate
├─ Limit max_depth
└─ Enable early stopping

Problem: Slow convergence
├─ Increase learning_rate
├─ Use second-order methods
├─ Check feature preprocessing
└─ Verify loss function choice

Problem: Memory issues
├─ Enable GOSS sampling
├─ Use histogram method
├─ Reduce max_bins
└─ Implement batch processing
```