## Theorem: Ensemble methods are no worse than the average single model

**Statement:** For any averaging ensemble of predictors, the expected squared error of the ensemble at a given input is no greater than the average expected squared error of the individual predictors, with equality only when all predictors make the same prediction (almost surely).

---

## Setup and Notation

Let:
- $f^*(x)$ = true target function
- $\hat{f}_i(x)$ = prediction from model $i$, where $i = 1, 2, \ldots, M$
- $\hat{f}_{\text{ens}}(x) = \frac{1}{M}\sum_{i=1}^M \hat{f}_i(x)$ = ensemble prediction (simple averaging)
- $\mathbb{E}[\cdot]$ = expectation over training sets
- $\text{Var}[\cdot]$ = variance
- $\text{Cov}[\cdot, \cdot]$ = covariance

---

## Proof Part 1: Basic Variance Reduction

### Individual Model Error
For any individual model $i$:
$$\mathbb{E}[(\hat{f}_i(x) - f^*(x))^2] = \text{MSE}_i$$

### Ensemble Error
For the ensemble:
$$\mathbb{E}[(\hat{f}_{\text{ens}}(x) - f^*(x))^2] = \mathbb{E}\left[\left(\frac{1}{M}\sum_{i=1}^M \hat{f}_i(x) - f^*(x)\right)^2\right]$$

Expanding:
$$= \mathbb{E}\left[\left(\frac{1}{M}\sum_{i=1}^M (\hat{f}_i(x) - f^*(x))\right)^2\right]$$

$$= \frac{1}{M^2} \mathbb{E}\left[\left(\sum_{i=1}^M (\hat{f}_i(x) - f^*(x))\right)^2\right]$$

### Step-by-Step Expansion

**Step 1:** Let $e_i(x) = \hat{f}_i(x) - f^*(x)$ (error of model $i$)

$\mathbb{E}[(\hat{f}_{\text{ens}}(x) - f^*(x))^2] = \mathbb{E}\left[\left(\frac{1}{M}\sum_{i=1}^M e_i(x)\right)^2\right]$

**Step 2:** Expand the square using $(a + b + c + \ldots)^2 = a^2 + b^2 + c^2 + \ldots + 2ab + 2ac + 2bc + \ldots$

$= \frac{1}{M^2} \mathbb{E}\left[\left(\sum_{i=1}^M e_i(x)\right)^2\right]$

$= \frac{1}{M^2} \mathbb{E}\left[\sum_{i=1}^M e_i(x)^2 + \sum_{i=1}^M \sum_{\substack{j=1 \\ j \neq i}}^M e_i(x) e_j(x)\right]$

**Step 3:** Separate diagonal and off-diagonal terms

$= \frac{1}{M^2} \mathbb{E}\left[\sum_{i=1}^M e_i(x)^2 + \sum_{i \neq j} e_i(x) e_j(x)\right]$

**Step 4:** Apply linearity of expectation

$= \frac{1}{M^2} \left[\sum_{i=1}^M \mathbb{E}[e_i(x)^2] + \sum_{i \neq j} \mathbb{E}[e_i(x) e_j(x)]\right]$

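**Step 5:** Recognize terms. Write $b_i = \mathbb{E}[e_i(x)] = \mathbb{E}[\hat{f}_i(x)] - f^*(x)$ for the bias of model $i$.
- $\mathbb{E}[e_i(x)^2] = \mathbb{E}[(\hat{f}_i(x) - f^*(x))^2] = \text{MSE}_i$
- $\mathbb{E}[e_i(x) e_j(x)] = \text{Cov}[e_i(x), e_j(x)] + b_i b_j$ (the product of means $b_i b_j$ vanishes only when the errors are zero-mean)

$= \frac{1}{M^2} \left[\sum_{i=1}^M \text{MSE}_i + \sum_{i \neq j} \left(\text{Cov}[e_i(x), e_j(x)] + b_i b_j\right)\right]$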
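The expansion in Part 1 can be checked numerically. The sketch below draws synthetic error vectors and confirms that the ensemble MSE equals $\frac{1}{M^2}$ times the sum of all second moments $\mathbb{E}[e_i e_j]$, that each second moment splits into a covariance plus a product of means, and that the ensemble MSE does not exceed the average individual MSE. The biases, covariance matrix, and sample size are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
M, R = 4, 10_000                      # models, simulated training sets (assumed)

# Hypothetical error distribution: nonzero biases and correlated noise.
biases = np.array([0.5, -0.2, 0.1, 0.3])
cov = 0.4 * np.ones((M, M)) + 0.6 * np.eye(M)
E = rng.multivariate_normal(biases, cov, size=R)   # E[r, i] = e_i on draw r

# Left-hand side: empirical MSE of the averaged prediction.
ens_mse = np.mean(E.mean(axis=1) ** 2)

# Right-hand side: (1/M^2) * sum over all pairs (i, j) of E[e_i e_j].
second_moments = E.T @ E / R          # entry (i, j) estimates E[e_i e_j]
expansion = second_moments.sum() / M**2

# Each second moment is a covariance plus a product of means.
emp_cov = np.cov(E, rowvar=False, bias=True)
emp_mean = E.mean(axis=0)
split = emp_cov + np.outer(emp_mean, emp_mean)

# Average individual MSE is the mean of the diagonal second moments.
avg_individual_mse = np.trace(second_moments) / M

print(ens_mse, expansion, avg_individual_mse)
```

The first two printed values agree up to floating-point error because the identity holds exactly for empirical averages, not only in expectation.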

---

## Proof Part 2: Bias-Variance Decomposition for Ensembles

### Individual Bias-Variance Decomposition
For model $i$:
$$\text{MSE}_i = \text{Bias}_i^2 + \text{Var}_i + \sigma^2$$

where:
- $\text{Bias}_i^2 = (\mathbb{E}[\hat{f}_i(x)] - f^*(x))^2$
- $\text{Var}_i = \mathbb{E}[(\hat{f}_i(x) - \mathbb{E}[\hat{f}_i(x)])^2]$
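- $\sigma^2$ = irreducible noise variance, present when the error is measured against noisy observations $y = f^*(x) + \varepsilon$; it is zero when the error is measured against $f^*(x)$ directly, as in Part 1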

### Ensemble Bias-Variance Decomposition
Let $\mu_i = \mathbb{E}[\hat{f}_i(x)]$ and $\mu_{\text{ens}} = \frac{1}{M}\sum_{i=1}^M \mu_i$.

**Ensemble Bias:**
$$\text{Bias}_{\text{ens}}^2 = (\mu_{\text{ens}} - f^*(x))^2 = \left(\frac{1}{M}\sum_{i=1}^M \mu_i - f^*(x)\right)^2$$

**Ensemble Variance:**
$$\text{Var}_{\text{ens}} = \text{Var}\left[\frac{1}{M}\sum_{i=1}^M \hat{f}_i(x)\right]$$

Using variance properties:
$$\text{Var}_{\text{ens}} = \frac{1}{M^2} \text{Var}\left[\sum_{i=1}^M \hat{f}_i(x)\right]$$

$$= \frac{1}{M^2} \left[\sum_{i=1}^M \text{Var}[\hat{f}_i(x)] + \sum_{i \neq j} \text{Cov}[\hat{f}_i(x), \hat{f}_j(x)]\right]$$

---

## Proof Part 3: The Key Result

### Case 1: Homogeneous Models (Equal Performance)
Assume all models have:
- Equal bias: $\text{Bias}_i = B$ for all $i$
- Equal variance: $\text{Var}_i = V$ for all $i$
- Pairwise correlation: $\text{Corr}[\hat{f}_i(x), \hat{f}_j(x)] = \rho$ for all $i \neq j$

Then: $\text{Cov}[\hat{f}_i(x), \hat{f}_j(x)] = \rho V$

**Ensemble Bias:**
$$\text{Bias}_{\text{ens}}^2 = B^2$$
(Averaging doesn't change bias if all models have the same bias)

**Ensemble Variance:**
$$\text{Var}_{\text{ens}} = \frac{1}{M^2}[M \cdot V + M(M-1) \cdot \rho V]$$

$$= \frac{V}{M^2}[M + M(M-1)\rho]$$

$$= \frac{V}{M}[1 + (M-1)\rho]$$

$$= \frac{V(1-\rho)}{M} + \rho V$$

### The Fundamental Result
$$\boxed{\text{Var}_{\text{ens}} = \rho V + \frac{(1-\rho)V}{M}}$$

**Interpretation:**
- **First term** $\rho V$: Variance floor due to correlation, which no number of models can remove
- **Second term** $\frac{(1-\rho)V}{M}$: Variance reduction from averaging

### Comparison with Individual Models
Individual model MSE: $\text{MSE}_{\text{individual}} = B^2 + V + \sigma^2$

Ensemble MSE: $\text{MSE}_{\text{ensemble}} = B^2 + \rho V + \frac{(1-\rho)V}{M} + \sigma^2$

**Variance reduction factor:**
$$\frac{\text{Var}_{\text{ens}}}{\text{Var}_{\text{individual}}} = \rho + \frac{(1-\rho)}{M} \leq 1$$

**Equality holds** if and only if $M = 1$ or $\rho = 1$ (perfectly correlated models).
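As a sanity check on the homogeneous result, the sketch below evaluates the closed form and compares it with the variance computed directly as $\mathbf{w}^\top \Sigma \mathbf{w}$ for uniform weights and an equicorrelated covariance matrix. The function names and the values $V = 2.0$, $\rho = 0.3$ are illustrative assumptions.

```python
import numpy as np

def ensemble_variance(V, rho, M):
    """Closed form: rho*V + (1 - rho)*V/M for M equicorrelated models."""
    return rho * V + (1.0 - rho) * V / M

def ensemble_variance_from_cov(V, rho, M):
    """Same quantity computed directly as w^T Sigma w with uniform weights."""
    sigma = V * ((1.0 - rho) * np.eye(M) + rho * np.ones((M, M)))
    w = np.full(M, 1.0 / M)
    return float(w @ sigma @ w)

# Variance shrinks toward the floor rho*V as M grows.
for M in (1, 2, 5, 20, 100):
    print(M, round(ensemble_variance(2.0, 0.3, M), 4))
```

Both functions agree for any valid $(V, \rho, M)$; the matrix version is useful because it extends unchanged to non-uniform weights or unequal variances.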

---

## Proof Part 4: General Case (Heterogeneous Models)

For models with different performance characteristics:

Let $\bar{V} = \frac{1}{M}\sum_{i=1}^M V_i$ and $\bar{\rho} = \frac{2}{M(M-1)}\sum_{i<j} \rho_{ij}$

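**Exact variance and bounds:**
Writing $\rho_{ij}$ for the correlation between models $i$ and $j$, the ensemble variance from Part 2 becomes

$$\text{Var}_{\text{ens}} = \frac{\bar{V}}{M} + \frac{1}{M^2}\sum_{i \neq j} \rho_{ij}\sqrt{V_i V_j}$$

Since $\rho_{ij} \leq 1$ and $\sqrt{V_i V_j} \leq \frac{V_i + V_j}{2}$, the cross term is at most $\frac{(M-1)\bar{V}}{M}$, so

$$\text{Var}_{\text{ens}} \leq \bar{V}$$

When all individual variances are equal ($V_i = V$), the exact expression reduces to the homogeneous formula with $\rho$ replaced by $\bar{\rho}$: $\text{Var}_{\text{ens}} = \bar{\rho}V + \frac{(1-\bar{\rho})V}{M}$. When the variances differ, $\bar{\rho}\bar{V} + \frac{(1-\bar{\rho})\bar{V}}{M}$ is only an approximation; it is a guaranteed upper bound in special cases such as a common correlation $\rho_{ij} = \rho \geq 0$.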
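For heterogeneous models the ensemble variance can always be computed exactly from the covariance matrix of the predictions. The sketch below does that and, for comparison, evaluates the homogeneous formula at the averaged quantities $\bar{V}$ and $\bar{\rho}$. The two agree when the variances are equal and generally differ otherwise. Both covariance matrices are made-up examples.

```python
import numpy as np

def ensemble_variance(sigma):
    """Exact variance of the uniform average of M predictors with covariance sigma."""
    M = sigma.shape[0]
    return float(sigma.sum() / M**2)

def averaged_formula(sigma):
    """Homogeneous formula evaluated at the mean variance and mean pairwise correlation."""
    M = sigma.shape[0]
    std = np.sqrt(np.diag(sigma))
    corr = sigma / np.outer(std, std)
    off_diag = ~np.eye(M, dtype=bool)
    rho_bar = corr[off_diag].mean()
    v_bar = np.diag(sigma).mean()
    return float(rho_bar * v_bar + (1.0 - rho_bar) * v_bar / M)

# Hypothetical covariance matrices, chosen only for illustration.
equal_var = np.array([[1.0, 0.2, 0.5],
                      [0.2, 1.0, 0.8],
                      [0.5, 0.8, 1.0]])
unequal_var = np.array([[4.0, 1.0, 0.2],
                        [1.0, 1.0, 0.1],
                        [0.2, 0.1, 0.25]])

for name, sigma in (("equal", equal_var), ("unequal", unequal_var)):
    print(name, ensemble_variance(sigma), averaged_formula(sigma), np.diag(sigma).mean())
```

In both cases the exact ensemble variance stays below the average individual variance printed in the last column.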

---

## Proof Part 5: Optimality Conditions

### When is Ensemble Optimal?

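**Theorem:** Among weighted ensembles $\hat{f}_{\text{ens}} = \sum_{i=1}^M w_i \hat{f}_i$ with $\sum_{i=1}^M w_i = 1$, the ensemble variance is minimized by:

$$\mathbf{w}^* = \frac{\Sigma^{-1}\mathbf{1}}{\mathbf{1}^\top \Sigma^{-1}\mathbf{1}}$$

where $\Sigma$ is the covariance matrix with entries $\Sigma_{ii} = \sigma_i^2 = \text{Var}[\hat{f}_i]$ and $\Sigma_{ij} = \sigma_{ij} = \text{Cov}[\hat{f}_i, \hat{f}_j]$, assumed invertible. These weights also minimize the MSE when all models share the same bias. For $M = 2$ the formula reduces to:

$$w_1 = \frac{\sigma_2^2 - \sigma_{12}}{\sigma_1^2 + \sigma_2^2 - 2\sigma_{12}}, \qquad w_2 = 1 - w_1$$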

**Special Case - Equal Weights:**
When all models have equal variance and equal pairwise correlation, uniform weighting ($w_i = \frac{1}{M}$) is optimal.
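A minimal sketch of variance-minimizing weights, using the standard closed form $\mathbf{w} = \Sigma^{-1}\mathbf{1} / (\mathbf{1}^\top \Sigma^{-1}\mathbf{1})$ for minimizing $\mathbf{w}^\top \Sigma \mathbf{w}$ subject to $\sum_i w_i = 1$. It assumes $\Sigma$ is known and invertible; in practice it would have to be estimated, and the matrices below are invented for illustration.

```python
import numpy as np

def min_variance_weights(sigma):
    """Weights minimizing w^T Sigma w subject to sum(w) == 1."""
    ones = np.ones(sigma.shape[0])
    z = np.linalg.solve(sigma, ones)   # Sigma^{-1} 1 without forming the inverse
    return z / z.sum()

def weighted_variance(sigma, w):
    """Variance of the weighted ensemble with weights w."""
    return float(w @ sigma @ w)

# Equal variances and equal correlations: uniform weights come out.
equicorr = 1.5 * (0.6 * np.eye(4) + 0.4 * np.ones((4, 4)))
print(min_variance_weights(equicorr))

# Two models: compare against (s2^2 - s12) / (s1^2 + s2^2 - 2*s12).
two = np.array([[1.0, 0.3],
                [0.3, 4.0]])
w_two = min_variance_weights(two)
w1_closed = (4.0 - 0.3) / (1.0 + 4.0 - 2 * 0.3)
print(w_two, w1_closed)
```

Note that nothing here constrains the weights to be non-negative; strongly correlated models can receive negative weights, which may be undesirable when $\Sigma$ is only a noisy estimate.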

---

## Numerical Example

Consider 3 models with:
- Individual variance: $V = 1.0$
- Correlation: $\rho = 0.6$

**Individual model variance:** $V = 1.0$

**Ensemble variance:** 
$$\text{Var}_{\text{ens}} = 0.6 \times 1.0 + \frac{(1-0.6) \times 1.0}{3} = 0.6 + 0.133 = 0.733$$

**Variance ratio:** $\frac{0.733}{1.0} = 73.3\%$ of original variance

**Improvement:** $1 - 0.733 = 26.7\%$ variance reduction

---

## Key Insights

### 1. Correlation is the Enemy
$$\lim_{\rho \to 1} \text{Var}_{\text{ens}} = V$$
(No improvement when models are perfectly correlated)

$$\lim_{\rho \to 0} \text{Var}_{\text{ens}} = \frac{V}{M}$$
(Variance falls as $1/M$ when models are uncorrelated; negatively correlated models would reduce it further)

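### 2. More Models Help, Up to a Floor
$$\lim_{M \to \infty} \text{Var}_{\text{ens}} = \rho V$$
(Adding models cannot push the variance below $\rho V$, so a large $M$ pays off most when $\rho$ is small)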

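### 3. Diversity Can Offset Lower Accuracy
Three identical models with 90% accuracy give an ensemble with exactly 90% accuracy. Three models with 80% accuracy and fully independent errors reach $0.8^3 + 3(0.8)^2(0.2) = 89.6\%$ under majority vote, nearly closing the gap, and five such models reach about 94.2%. Diversity pays off only when the individual models are accurate enough and enough of them are combined.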
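A back-of-the-envelope check of the accuracy-versus-diversity trade-off for classification. The sketch assumes binary labels, majority voting, and fully independent errors, which is an idealization that real ensembles rarely meet; the function name is hypothetical.

```python
from math import comb

def majority_vote_accuracy(p, M):
    """P(majority of M independent voters is correct), each correct w.p. p; M odd."""
    k_min = M // 2 + 1
    return sum(comb(M, k) * p**k * (1 - p) ** (M - k) for k in range(k_min, M + 1))

# Compare independent 80%-accurate voters against a single 90% model.
for M in (1, 3, 5, 7):
    print(M, round(majority_vote_accuracy(0.8, M), 4))
```

Under these assumptions three independent 80% voters reach 89.6%, just short of a single 90% model, and five reach about 94.2%. Identical models gain nothing from voting, so their ensemble accuracy stays at the individual accuracy.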

---

## Practical Implications

### Ensemble Design Principles

1. **Maximize Diversity:** Use different:
   - Algorithms (tree + SVM + neural network)
   - Training data (bagging, cross-validation folds)
   - Features (random subspaces)
   - Hyperparameters

2. **Balance Accuracy and Diversity:**
   - Don't include terrible models (negative correlation with truth)
   - Don't include identical models (correlation = 1)

3. **Optimal Ensemble Size:**
   - Diminishing returns as $M$ increases
   - Computational cost grows linearly with $M$
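   - As a rough rule of thumb, much of the attainable reduction is reached by around $M = 5$ to $20$, since the reducible term $\frac{(1-\rho)V}{M}$ is already down to 20% of its single-model value at $M = 5$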

### Real-World Validation

This theory explains why:
- **Random Forest** works (different trees via bootstrapping + feature sampling)
- **Bagging** reduces overfitting (reduces variance)
- **Boosting** can work (sequential models focus on different errors)
- **Model stacking** is effective (different algorithm types)

---

## Conclusion

**Mathematical Guarantee:** An averaging ensemble is **provably no worse** than the average of its individual models in expected squared error. In the homogeneous case the variance strictly decreases whenever $M > 1$ and the models are not perfectly correlated. The ensemble can still be worse than the single best model.

**The ensemble advantage comes from the fundamental statistical principle:**
$$\boxed{\text{Averaging reduces variance while preserving bias}}$$

This is not just an empirical observation: it follows from basic properties of expectation and variance under the assumptions stated above.

---

# Mathematical Theory of Bagging and Boosting

## Part I: Bootstrap Aggregating (Bagging) - Complete Mathematical Analysis

### 1.1 Bootstrap Sampling Theory

**Definition:** Given dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$, a bootstrap sample $\mathcal{D}_b$ is created by sampling $n$ points **with replacement** from $\mathcal{D}$.

**Fundamental Question:** What's the probability that any particular sample $(x_i, y_i)$ appears in bootstrap sample $\mathcal{D}_b$?

### 1.2 The 63.2% Rule - Mathematical Proof

**Theorem:** The probability that sample $i$ appears at least once in a bootstrap sample is approximately $1 - \frac{1}{e} \approx 0.632$.

**Proof:**

Let $X_{i,j}$ be the indicator random variable:
$$X_{i,j} = \begin{cases} 
1 & \text{if sample } i \text{ is selected in draw } j \\
0 & \text{otherwise}
\end{cases}$$

The probability that sample $i$ is **not** selected in a given draw $j$:
$$P(X_{i,j} = 0) = \frac{n-1}{n}$$

Because the $n$ draws are independent, the probability that sample $i$ is **not** selected in any of them:
$$P(\text{sample } i \text{ not in bootstrap}) = \left(\frac{n-1}{n}\right)^n$$

Therefore, the probability that sample $i$ **is** selected at least once:
$$P(\text{sample } i \text{ in bootstrap}) = 1 - \left(\frac{n-1}{n}\right)^n$$

**Taking the limit as $n \to \infty$:**
$$\lim_{n \to \infty} \left(\frac{n-1}{n}\right)^n = \lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^n = \frac{1}{e}$$

Therefore, for large $n$:
$$\boxed{P(\text{sample in bootstrap}) \approx 1 - \frac{1}{e} \approx 0.632}$$

**Corollary:** About 36.8% of samples are **out-of-bag** (OOB) for each bootstrap sample.
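The 63.2% figure is easy to confirm both from the exact finite-$n$ expression and by simulation. The sketch below is illustrative only; the function names, the seed, and the choice of $n$ and number of resamples are assumptions.

```python
import numpy as np

def inclusion_probability(n):
    """Exact P(a given sample appears at least once) for a bootstrap of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n

def simulated_inclusion_fraction(n, n_bootstraps, seed=0):
    """Average fraction of distinct original samples present in a bootstrap sample."""
    rng = np.random.default_rng(seed)
    fractions = np.empty(n_bootstraps)
    for b in range(n_bootstraps):
        idx = rng.integers(0, n, size=n)        # n draws with replacement
        fractions[b] = np.unique(idx).size / n  # share of samples that are in-bag
    return fractions.mean()

for n in (5, 50, 500, 5000):
    print(n, round(inclusion_probability(n), 4))
print("limit:", round(1 - np.exp(-1), 4))
print("simulated (n=1000):", round(simulated_inclusion_fraction(1000, 200), 4))
```

The exact probability is above the limit for small $n$ and approaches $1 - 1/e$ from above as $n$ grows, so the out-of-bag share is slightly below 36.8% on small datasets.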

### 1.3 Variance Reduction Mechanism in Bagging

**Setup:**
- True function: $f^*(x)$
- Individual predictor: $\hat{f}_b(x)$ trained on bootstrap sample $b$
- Bagging predictor: $\hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^B \hat{f}_b(x)$

**Key Insight:** Bootstrap samples create **diversity** in the training sets, leading to different predictors.

### 1.4 Detailed Variance Analysis

**Individual Model Bias and Variance:**

For any bootstrap-trained model $b$:
$$\mathbb{E}_{\mathcal{D}}[\hat{f}_b(x)] = \mu(x) \quad \text{(expectation over all possible datasets)}$$
$$\text{Var}_{\mathcal{D}}[\hat{f}_b(x)] = \sigma^2(x)$$

**Bagging Bias:**
$$\mathbb{E}_{\mathcal{D}}[\hat{f}_{\text{bag}}(x)] = \mathbb{E}_{\mathcal{D}}\left[\frac{1}{B} \sum_{b=1}^B \hat{f}_b(x)\right] = \frac{1}{B} \sum_{b=1}^B \mathbb{E}_{\mathcal{D}}[\hat{f}_b(x)] = \mu(x)$$

**Key Result 1:** The bagged predictor has exactly the same bias as a single bootstrap-trained predictor.

**Bagging Variance:**

The variance of the bagged predictor is:
$$\text{Var}_{\mathcal{D}}[\hat{f}_{\text{bag}}(x)] = \text{Var}_{\mathcal{D}}\left[\frac{1}{B} \sum_{b=1}^B \hat{f}_b(x)\right]$$

If bootstrap predictors were completely independent:
$$= \frac{1}{B^2} \sum_{b=1}^B \text{Var}_{\mathcal{D}}[\hat{f}_b(x)] = \frac{1}{B^2} \cdot B \cdot \sigma^2(x) = \frac{\sigma^2(x)}{B}$$

**However,** bootstrap samples are **not** independent because they're drawn from the same original dataset.

### 1.5 Accounting for Bootstrap Correlation

Let $\rho_{\text{boot}}$ be the correlation between predictions from different bootstrap samples.

**More accurate variance formula:**
$$\text{Var}_{\mathcal{D}}[\hat{f}_{\text{bag}}(x)] = \frac{\sigma^2(x)}{B} + \rho_{\text{boot}} \sigma^2(x) \left(1 - \frac{1}{B}\right)$$

As $B \to \infty$:
$$\lim_{B \to \infty} \text{Var}_{\mathcal{D}}[\hat{f}_{\text{bag}}(x)] = \rho_{\text{boot}} \sigma^2(x)$$

**Key Result 2:** Bagging can only reduce variance down to the floor $\rho_{\text{boot}} \sigma^2(x)$ set by the correlation between bootstrap predictors.
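Regrouping the variance formula of this subsection makes the floor explicit. This is only algebra on the expression already given and introduces no new assumptions:

$$\frac{\sigma^2(x)}{B} + \rho_{\text{boot}}\,\sigma^2(x)\left(1 - \frac{1}{B}\right) = \rho_{\text{boot}}\,\sigma^2(x) + \frac{(1 - \rho_{\text{boot}})\,\sigma^2(x)}{B}$$

The first term does not depend on $B$, while the second shrinks as $1/B$. For example, with $\rho_{\text{boot}} = 0.5$ and $B = 10$ the bagged variance is $0.5\,\sigma^2(x) + 0.05\,\sigma^2(x) = 0.55\,\sigma^2(x)$, already close to the floor of $0.5\,\sigma^2(x)$.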

### 1.6 Why Bagging Works Best for High-Variance Models

**Claim:** Bagging provides the most benefit for high-variance predictors. Because it leaves bias unchanged, it suits base models whose bias is already low.

**Proof by MSE decomposition:**

Individual model MSE:
$$\text{MSE}_{\text{individual}} = \text{Bias}^2 + \sigma^2(x) + \text{noise}^2$$

Bagging MSE (in the limit $B \to \infty$):
$$\text{MSE}_{\text{bag}} = \text{Bias}^2 + \rho_{\text{boot}} \sigma^2(x) + \text{noise}^2$$

**Improvement:**
$$\text{MSE}_{\text{individual}} - \text{MSE}_{\text{bag}} = (1 - \rho_{\text{boot}}) \sigma^2(x)$$

**Conclusion:** For a fixed $\rho_{\text{boot}}$, the improvement is proportional to the original variance $\sigma^2(x)$, so high-variance base models gain the most.
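The variance reduction can be observed in a small simulation. The sketch below uses a 1-nearest-neighbor regressor as a stand-in for a high-variance, low-bias base model and compares the variance of its prediction at one query point, across many simulated datasets, with the variance of the bagged prediction. The data-generating function, noise level, query point, and all sizes are assumptions made for illustration.

```python
import numpy as np

def one_nn_predict(x_train, y_train, x0):
    """Predict at x0 with the label of the nearest training point."""
    return y_train[np.argmin(np.abs(x_train - x0))]

def bagged_predict(x_train, y_train, x0, n_bags, rng):
    """Average the 1-NN prediction over n_bags bootstrap resamples."""
    n = len(x_train)
    preds = np.empty(n_bags)
    for b in range(n_bags):
        idx = rng.integers(0, n, size=n)   # n indices drawn with replacement
        preds[b] = one_nn_predict(x_train[idx], y_train[idx], x0)
    return preds.mean()

def compare_variances(n=50, n_bags=25, n_datasets=300, noise=1.0, x0=0.5, seed=0):
    """Variance over datasets of the single and bagged predictions at x0."""
    rng = np.random.default_rng(seed)
    single = np.empty(n_datasets)
    bagged = np.empty(n_datasets)
    for r in range(n_datasets):
        x = rng.uniform(0.0, 1.0, size=n)
        y = np.sin(2 * np.pi * x) + rng.normal(0.0, noise, size=n)
        single[r] = one_nn_predict(x, y, x0)
        bagged[r] = bagged_predict(x, y, x0, n_bags, rng)
    return single.var(), bagged.var()

var_single, var_bag = compare_variances()
print("single:", var_single, "bagged:", var_bag)
```

The bagged variance comes out clearly lower but well above $\frac{1}{B}$ times the single-model variance, which is consistent with the bootstrap predictors being correlated.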

---

## Part II: Boosting - Sequential Error Correction Theory

### 2.1 AdaBoost Mathematical Framework

**Core Principle:** Train weak learners sequentially, where each learner focuses on the mistakes of previous learners.

**Algorithm Setup:**
- Training data: $\{(x_i, y_i)\}_{i=1}^n$ where $y_i \in \{-1, +1\}$
- Sample weights: $w_i^{(t)}$ at iteration $t$
- Weak learner: $h_t: \mathcal{X} \to \{-1, +1\}$
- Final classifier: $H(x) = \text{sign}\left(\sum_{t=1}^T \alpha_t h_t(x)\right)$

### 2.2 AdaBoost Algorithm with Mathematical Justification

**Initialization:** $w_i^{(1)} = \frac{1}{n}$ for all $i$

**For $t = 1, 2, \ldots, T$:**

**Step 1:** Train weak learner $h_t$ on weighted data
$$h_t = \arg\min_h \sum_{i=1}^n w_i^{(t)} \mathbb{1}(h(x_i) \neq y_i)$$

**Step 2:** Compute weighted error
$$\epsilon_t = \frac{\sum_{i=1}^n w_i^{(t)} \mathbb{1}(h_t(x_i) \neq y_i)}{\sum_{i=1}^n w_i^{(t)}}$$

**Step 3:** Compute classifier weight
$$\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$$

**Step 4:** Update sample weights
$$w_i^{(t+1)} = w_i^{(t)} \exp(-\alpha_t y_i h_t(x_i))$$
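The four steps can be sketched in NumPy with decision stumps as weak learners. Everything below is an illustrative assumption rather than a reference implementation: the stump search is a brute-force loop, the dataset is synthetic, and the weights are renormalized after each round, which leaves $\epsilon_t$ unchanged because Step 2 already divides by the weight total.

```python
import numpy as np

def stump_predict(X, stump):
    """Stump = (feature, threshold, polarity); predicts polarity on the left side."""
    feature, threshold, polarity = stump
    return np.where(X[:, feature] <= threshold, polarity, -polarity)

def fit_stump(X, y, w):
    """Step 1: exhaustive search for the stump with the lowest weighted error."""
    best_err, best_stump = np.inf, None
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                err = w[stump_predict(X, stump) != y].sum()
                if err < best_err:
                    best_err, best_stump = err, stump
    return best_stump, best_err

def adaboost(X, y, T):
    n = len(y)
    w = np.full(n, 1.0 / n)                             # initialization
    stumps, alphas, epsilons = [], [], []
    for _ in range(T):
        stump, err = fit_stump(X, y, w)
        eps = np.clip(err / w.sum(), 1e-12, 1 - 1e-12)  # Step 2 (clipped to avoid log(0))
        alpha = 0.5 * np.log((1 - eps) / eps)           # Step 3
        w = w * np.exp(-alpha * y * stump_predict(X, stump))  # Step 4
        w = w / w.sum()                                 # renormalize for stability
        stumps.append(stump); alphas.append(alpha); epsilons.append(eps)
    return stumps, alphas, epsilons

def predict(X, stumps, alphas):
    F = sum(a * stump_predict(X, s) for s, a in zip(stumps, alphas))
    return np.sign(F)

# Synthetic problem: points inside the unit circle are labeled +1.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where((X ** 2).sum(axis=1) < 1.0, 1, -1)
stumps, alphas, epsilons = adaboost(X, y, T=20)
train_error = np.mean(predict(X, stumps, alphas) != y)
print("training error:", train_error)
```

No single stump can separate a circular class boundary, so this toy problem shows the weighted combination doing something that the individual weak learners cannot.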

### 2.3 Why These Formulas? - The Exponential Loss Perspective

**Key Insight:** AdaBoost can be viewed as **coordinate descent** on the exponential loss function.

**Exponential Loss:** $L(y, f(x)) = e^{-yf(x)}$

**Objective:** Minimize
$$\mathcal{L}(F) = \sum_{i=1}^n e^{-y_i F(x_i)}$$
where $F(x) = \sum_{t=1}^T \alpha_t h_t(x)$.

### 2.4 Derivation of AdaBoost Updates

**At iteration $t$, we have:** $F_{t-1}(x) = \sum_{s=1}^{t-1} \alpha_s h_s(x)$

**Goal:** Find $\alpha_t$ and $h_t$ to minimize:
$$\mathcal{L}(F_{t-1} + \alpha_t h_t) = \sum_{i=1}^n e^{-y_i(F_{t-1}(x_i) + \alpha_t h_t(x_i))}$$

$$= \sum_{i=1}^n e^{-y_i F_{t-1}(x_i)} e^{-y_i \alpha_t h_t(x_i)}$$

**Let $w_i^{(t)} = e^{-y_i F_{t-1}(x_i)}$ (these are the sample weights!)**

$$\mathcal{L}(F_{t-1} + \alpha_t h_t) = \sum_{i=1}^n w_i^{(t)} e^{-y_i \alpha_t h_t(x_i)}$$

### 2.5 Optimal $\alpha_t$ Derivation

**Separate correct and incorrect predictions:**
$$\mathcal{L} = \sum_{i: h_t(x_i) = y_i} w_i^{(t)} e^{-\alpha_t} + \sum_{i: h_t(x_i) \neq y_i} w_i^{(t)} e^{\alpha_t}$$

**Let:**
- $W_{\text{correct}} = \sum_{i: h_t(x_i) = y_i} w_i^{(t)}$
- $W_{\text{incorrect}} = \sum_{i: h_t(x_i) \neq y_i} w_i^{(t)}$

$$\mathcal{L} = W_{\text{correct}} e^{-\alpha_t} + W_{\text{incorrect}} e^{\alpha_t}$$

**Taking derivative and setting to zero:**
$$\frac{d\mathcal{L}}{d\alpha_t} = -W_{\text{correct}} e^{-\alpha_t} + W_{\text{incorrect}} e^{\alpha_t} = 0$$

**Solving:**
$$W_{\text{correct}} e^{-\alpha_t} = W_{\text{incorrect}} e^{\alpha_t}$$
$$\frac{W_{\text{correct}}}{W_{\text{incorrect}}} = e^{2\alpha_t}$$
$$\alpha_t = \frac{1}{2} \ln\left(\frac{W_{\text{correct}}}{W_{\text{incorrect}}}\right)$$

**Since $\epsilon_t = \frac{W_{\text{incorrect}}}{W_{\text{correct}} + W_{\text{incorrect}}}$:**

$$\alpha_t = \frac{1}{2} \ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$$

**This derives the AdaBoost $\alpha_t$ formula!**
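The closed-form minimizer can be checked against a brute-force grid search over $\alpha_t$. The weight totals below are arbitrary illustrative values, and the function names are hypothetical.

```python
import numpy as np

def exp_loss(alpha, w_correct, w_incorrect):
    """Exponential loss as a function of alpha for fixed weight totals."""
    return w_correct * np.exp(-alpha) + w_incorrect * np.exp(alpha)

def closed_form_alpha(w_correct, w_incorrect):
    """alpha = 0.5 * ln(W_correct / W_incorrect)."""
    return 0.5 * np.log(w_correct / w_incorrect)

w_correct, w_incorrect = 0.7, 0.3          # illustrative weight totals
grid = np.linspace(-2.0, 2.0, 40_001)      # grid step of 1e-4
alpha_grid = grid[np.argmin(exp_loss(grid, w_correct, w_incorrect))]
alpha_star = closed_form_alpha(w_correct, w_incorrect)
print(alpha_grid, alpha_star, exp_loss(alpha_star, w_correct, w_incorrect))
```

The loss at the minimizer equals $2\sqrt{W_{\text{correct}} W_{\text{incorrect}}}$, which is strictly less than the total weight $W_{\text{correct}} + W_{\text{incorrect}}$ whenever the two totals differ.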

### 2.6 AdaBoost Convergence Theory

**Training Error Bound:**

**Theorem:** The training error of AdaBoost after $T$ rounds is bounded by:
$$\text{Error}_{\text{train}} \leq \prod_{t=1}^T 2\sqrt{\epsilon_t(1-\epsilon_t)}$$

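**Proof:**
Since $\mathbb{1}(H(x_i) \neq y_i) \leq e^{-y_i F(x_i)}$ for every example, the training error rate satisfies:
$$\text{Error}_{\text{train}} = \frac{1}{n}\sum_{i=1}^n \mathbb{1}(H(x_i) \neq y_i) \leq \frac{1}{n}\sum_{i=1}^n e^{-y_i F(x_i)}$$

Let $\tilde{w}_i^{(t)}$ denote the sample weights normalized to sum to 1 at every round, so that $\tilde{w}_i^{(1)} = \frac{1}{n}$ and $\tilde{w}_i^{(t+1)} = \tilde{w}_i^{(t)} e^{-y_i \alpha_t h_t(x_i)} / Z_t$. Unrolling this recursion and summing over $i$ gives:
$$\frac{1}{n}\sum_{i=1}^n e^{-y_i F(x_i)} = \prod_{t=1}^T Z_t$$

where $Z_t = \sum_{i=1}^n \tilde{w}_i^{(t)} e^{-y_i \alpha_t h_t(x_i)}$ is the normalization factor.

**Computing $Z_t$:**
$$Z_t = e^{-\alpha_t} \sum_{i: h_t(x_i)=y_i} \tilde{w}_i^{(t)} + e^{\alpha_t} \sum_{i: h_t(x_i) \neq y_i} \tilde{w}_i^{(t)}$$
$$= e^{-\alpha_t}(1-\epsilon_t) + e^{\alpha_t}\epsilon_t$$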

**Substituting $\alpha_t = \frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$:**
$$Z_t = 2\sqrt{\epsilon_t(1-\epsilon_t)}$$
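The substitution can be spelled out. From the formula for $\alpha_t$ we have $e^{\alpha_t} = \sqrt{\frac{1-\epsilon_t}{\epsilon_t}}$ and $e^{-\alpha_t} = \sqrt{\frac{\epsilon_t}{1-\epsilon_t}}$, so:

$$Z_t = (1-\epsilon_t)\sqrt{\frac{\epsilon_t}{1-\epsilon_t}} + \epsilon_t\sqrt{\frac{1-\epsilon_t}{\epsilon_t}} = \sqrt{\epsilon_t(1-\epsilon_t)} + \sqrt{\epsilon_t(1-\epsilon_t)} = 2\sqrt{\epsilon_t(1-\epsilon_t)}$$

As an illustrative value, $\epsilon_t = 0.3$ gives $Z_t = 2\sqrt{0.21} \approx 0.917$, so a round with 30% weighted error shrinks the bound by roughly 8%.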

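**Key Insight:** Write $\epsilon_t = \frac{1}{2} - \gamma_t$ with $\gamma_t > 0$. Then $Z_t = \sqrt{1 - 4\gamma_t^2} \leq e^{-2\gamma_t^2} < 1$, so the training error is at most $\exp\left(-2\sum_{t=1}^T \gamma_t^2\right)$. When every $\gamma_t \geq \gamma > 0$, the training error decreases exponentially in $T$.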

### 2.7 Gradient Boosting - Functional Gradient Descent

**Generalization:** Instead of exponential loss, minimize any differentiable loss $L(y, f(x))$.

**Framework:**
$$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$$

where $h_m$ is fit so that, at each training point, it approximates the negative gradient of the loss:
$$h_m(x_i) \approx -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F=F_{m-1}}$$

### 2.8 Gradient Boosting for Regression (MSE Loss)

**Loss function:** $L(y, f(x)) = \frac{1}{2}(y - f(x))^2$

**Gradient:** $\frac{\partial L}{\partial f} = -(y - f(x)) = -r$ (negative residual)

**Algorithm:**
1. Initialize: $F_0(x) = \bar{y}$
2. For $m = 1$ to $M$:
   - Compute residuals: $r_{i,m} = y_i - F_{m-1}(x_i)$
   - Fit tree $h_m$ to residuals: $h_m = \arg\min_h \sum_i (r_{i,m} - h(x_i))^2$
   - Update: $F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$

**This is exactly fitting to residuals!**
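The residual-fitting loop can be sketched with regression stumps (depth-1 trees) on one-dimensional inputs. This is a minimal illustration under stated assumptions: a constant step size `gamma` stands in for $\gamma_m$, the stump search is brute force, and the data are synthetic.

```python
import numpy as np

def fit_regression_stump(x, r):
    """Least-squares stump on 1-D inputs: returns (threshold, left_value, right_value)."""
    best_sse, best = np.inf, None
    for thr in np.unique(x)[:-1]:          # every split leaves both sides non-empty
        left = x <= thr
        c_left, c_right = r[left].mean(), r[~left].mean()
        sse = np.sum((r[left] - c_left) ** 2) + np.sum((r[~left] - c_right) ** 2)
        if sse < best_sse:
            best_sse, best = sse, (thr, c_left, c_right)
    return best

def stump_predict(x, stump):
    thr, c_left, c_right = stump
    return np.where(x <= thr, c_left, c_right)

def gradient_boost(x, y, n_rounds, gamma=0.1):
    """Boost stumps on residuals; returns F_0, the stumps, and the training MSE per round."""
    F = np.full_like(y, y.mean())                    # F_0 = mean of y
    stumps, mse_history = [], [np.mean((y - F) ** 2)]
    for _ in range(n_rounds):
        residuals = y - F                            # negative gradient of the MSE loss
        stump = fit_regression_stump(x, residuals)
        F = F + gamma * stump_predict(x, stump)      # F_m = F_{m-1} + gamma * h_m
        stumps.append(stump)
        mse_history.append(np.mean((y - F) ** 2))
    return y.mean(), stumps, mse_history

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=100)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=100)
f0, stumps, mse_history = gradient_boost(x, y, n_rounds=50)
print(mse_history[0], mse_history[-1])
```

With a least-squares stump and a step size between 0 and 2, each round cannot increase the training MSE. A small `gamma` slows that decrease, which is the usual shrinkage trade-off between training speed and generalization.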

### 2.9 Why Boosting Reduces Bias

**Claim:** Under the weak learning assumption, boosting can drive the training error to zero, which removes the bias that a single weak learner has on the training data.

**Weak Learning Assumption:** Each $h_t$ has weighted error $\epsilon_t \leq \frac{1}{2} - \gamma$ for some $\gamma > 0$.

**Proof Sketch:**
1. Each weak learner makes progress (however small) toward the correct answer
2. Sequential combination allows complex decision boundaries
3. Training error decreases exponentially (proven above)
4. With enough rounds, the combined classifier fits the training data perfectly (training error → 0); zero training error is used here as a proxy for low bias and does not by itself guarantee low test error

**Trade-off:** Reducing bias may increase variance (overfitting risk).

---

## Part III: Bagging vs Boosting - Theoretical Comparison

### 3.1 Bias-Variance Trade-offs

| Method | Bias Effect | Variance Effect | Best For |
|--------|-------------|-----------------|----------|
| **Bagging** | Preserves bias | Reduces variance | High-variance, low-bias models |
| **Boosting** | Reduces bias | May increase variance | High-bias, low-variance models |

### 3.2 Mathematical Justification

**Bagging MSE (as $B \to \infty$, with $\rho = \rho_{\text{boot}}$):**
$$\text{MSE}_{\text{bag}} = \text{Bias}_{\text{individual}}^2 + \rho \cdot \text{Var}_{\text{individual}} + \sigma^2$$

**Boosting MSE (simplified):**
$$\text{MSE}_{\text{boost}} = \text{Bias}_{\text{reduced}}^2 + \text{Var}_{\text{increased}} + \sigma^2$$

### 3.3 Convergence Properties

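**Bagging:** Converges as $B \to \infty$ to the **average** over all bootstrap predictors, that is, the expectation of $\hat{f}_b(x)$ over bootstrap resamples of the training set.

**Boosting:** Training error converges to zero under the weak learning assumption, and the margins of the training examples typically keep growing in later rounds.

**AdaBoost and margins:** Generalization bounds for AdaBoost can be stated in terms of the distribution of normalized training margins $y_i F(x_i) / \sum_t \alpha_t$, which helps explain why test error can keep improving after the training error reaches zero. AdaBoost tends to increase these margins but is not guaranteed to attain the largest possible minimum margin.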

---

## Part IV: Practical Implications

### 4.1 When to Use Which Method

**Use Bagging when:**
- Base model has high variance (e.g., deep decision trees)
- Want to parallelize training
- Need stable, robust predictions
- Have noisy data

**Use Boosting when:**
- Base model has high bias (e.g., shallow trees, linear models)
- Can afford sequential training
- Need maximum predictive accuracy
- Data is relatively clean

### 4.2 Hyperparameter Effects

**Bagging:**
- More estimators → Lower variance (diminishing returns)
- Deeper base models → Lower bias but higher individual variance, which averaging then reduces

**Boosting:**
- More rounds → Lower bias but higher overfitting risk
- Lower learning rate → Slower convergence but better generalization
- Shallow base models → Better weak learners for boosting

---

## Conclusion

The mathematical theory shows that:

1. **Bagging works** by creating diverse predictors through bootstrap sampling, reducing variance while preserving bias
2. **Boosting works** by sequential error correction, focusing on difficult examples to reduce bias
3. Both methods have **theoretical guarantees** under appropriate assumptions
4. The choice between them depends on the **bias-variance characteristics** of your base model

These aren't just empirical techniques: they are **mathematically principled** approaches to ensemble learning whose guarantees hold under the assumptions stated in each section.