In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import sys
sys.path.append('../../pyutils')
import metrics
import utils

# Introduction

Classic ensemble methods:
- Bagging / Random Forests
- Boosting
- Stacking

Ensemble learning can be divided into 2 tasks:
- Build multiple base learners
- Combine them into one predictor

# Boosting and Regularization Paths

## Penalized Regression

Let $\mathcal{T} = \{ T_k \}$ be the set of all possible regression trees on the training data. We can construct the following predictor:
$$f(x) = \sum_k \alpha_k T_k(x)$$

The coefficients $\alpha_k$ can be estimated by regularized least squares:

$$\min_\alpha \sum_{i=1}^N \left( y_i - \sum_k \alpha_k T_k(x_i) \right) ^2 + \lambda J(\alpha)$$
with $J(\alpha)$ a penalty function (e.g. ridge or lasso).  

Using lasso, we get a sparse solution, with only a small fraction of all possible trees in the model.  
Given the huge number of $T_k$, solving this problem directly is intractable. A feasible forward stagewise strategy exists that approximates the lasso:

1. Initialize $\alpha_k = 0$
2. For $m=1 \to M$:
    $$(\beta^*, k^*) \leftarrow \arg \min_{\beta, k} \sum_{i=1}^N \left( y_i - \sum_l \alpha_l T_l(x_i) - \beta T_k(x_i) \right)^2$$
    $$\alpha_{k^*} \leftarrow \alpha_{k^*} + \epsilon \cdot \text{sign}(\beta^*)$$
3. Output $f_M(x) = \sum_k \alpha_k T_k(x)$

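$M$ is inversely related to $\lambda$: more iterations correspond to a weaker penalty, since the $L_1$ norm of $\alpha$ is at most $\epsilon M$.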
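
The forward stagewise steps above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: it assumes the dictionary is small enough to be stored as a matrix `T` with `T[i, k]` $= T_k(x_i)$ (the full set of trees would never fit), and the function name, the synthetic data and the parameter values are all made up for the example. For squared error, the best $\beta$ for a fixed $k$ has a closed form, so the search over $(\beta, k)$ reduces to an `argmax`.

```python
import numpy as np

def forward_stagewise(T, y, eps=0.01, n_steps=300):
    """Incremental forward stagewise regression on a basis matrix T.

    T[i, k] holds T_k(x_i). Returns the coefficient vector alpha.
    """
    T = np.asarray(T, dtype=float)
    alpha = np.zeros(T.shape[1])
    residual = np.asarray(y, dtype=float).copy()
    sq_norms = np.sum(T ** 2, axis=0)
    sq_norms[sq_norms == 0] = np.inf  # all-zero columns can never be selected
    for _ in range(n_steps):
        inner = T.T @ residual
        # For a fixed k, the optimal beta is inner[k] / sq_norms[k], and it
        # lowers the residual sum of squares by inner[k]**2 / sq_norms[k].
        gain = inner ** 2 / sq_norms
        k_star = int(np.argmax(gain))
        beta_star = inner[k_star] / sq_norms[k_star]
        if beta_star == 0.0:
            break  # residual is orthogonal to every basis function
        step = eps * np.sign(beta_star)
        alpha[k_star] += step
        residual -= step * T[:, k_star]
    return alpha

# Synthetic check: only basis functions 3 and 10 carry signal.
rng = np.random.default_rng(0)
T = rng.normal(size=(200, 50))
y = 2.0 * T[:, 3] - 1.5 * T[:, 10] + 0.1 * rng.normal(size=200)
alpha = forward_stagewise(T, y, eps=0.01, n_steps=300)
print("selected basis functions:", np.flatnonzero(alpha))
print("L1 norm of alpha:", np.abs(alpha).sum())
```

Each step moves one coefficient by $\epsilon$, so the $L_1$ norm of `alpha` can never exceed `eps * n_steps`, which is the sense in which $M$ plays the role of an inverse penalty.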

## The Bet on Sparsity Principle

Boosting with shrinkage approximates an $L_1$-penalized model, although the $L_2$ penalty is much faster to compute.  
The superior performance of boosting over models such as SVMs may be due to its implicit use of an $L_1$ rather than an $L_2$ penalty, since $L_1$ is better suited to sparse situations.  

Take for example a dataset of $10{,}000$ data points, with millions of trees.  
If the true population coefficients of the trees arose from a Gaussian distribution, ridge regression is the better predictor: this is the dense scenario.  
If only a few coefficients are nonzero, lasso is better: this is the sparse scenario.  
With the $L_2$ penalty, the model performs poorly in both scenarios; even in the dense scenario, there is too little data to estimate all the coefficients correctly.  
With the $L_1$ penalty, the model may perform well in the sparse scenario.  
This leads to the bet on sparsity principle: use a procedure that does well in sparse problems, since no procedure does well in dense problems.  


Larger training sets allow coefficients to be estimated with smaller standard errors. In situations with a small noise-to-signal ratio (NSR), we can identify more non-zero coefficients than in situations with a larger NSR.  
Increasing the size of the dictionary $\mathcal{T}$ may lead to a sparser representation, but the search problem becomes more difficult, and may lead to higher variance.

# Learning Ensembles

Importance sampled learning ensembles - Friedman, J. and Popescu, B. (2003) - [PDF](https://pdfs.semanticscholar.org/966f/fe536f84efd15c1379dad9adffe90b20676f.pdf)

Let's consider building models of the form:
$$f(x) = \alpha_0 + \sum_{k} \alpha_k T_k(x)$$
with $\mathcal{T}$ a dictionary of basis functions, typically trees.

A specific approach breaks the process into 2 steps:
- Find a finite dictionary $\mathcal{T}_L = \{ T_1(x), \ldots, T_M(x) \}$ from the training data
- A lasso path is fit to the model with $\mathcal{T}_L$:
    $$\alpha(\lambda) = \arg \min_{\alpha} \sum_{i=1}^N L(y_i, \alpha_0 + \sum_{m=1}^M \alpha_m T_m(x_i)) + \lambda \sum_{m=1}^M |\alpha_m|$$
    
This approach saves a lot of computation time, both at training and at prediction, if the number of selected trees is small.
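
The post-processing step can be sketched as a lasso path over a fixed dictionary. The code below is only a sketch under several assumptions: the loss is squared error, the objective is scaled as $\frac{1}{2N} \sum_i (\cdot)^2 + \lambda \sum_m |\alpha_m|$ (so its $\lambda$ differs from the one above by a constant factor), the solver is a plain cyclic coordinate descent, and the dictionary is a made-up set of indicator functions rather than real trees. The names `lasso_path` and `soft_threshold` are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_path(T, y, lams, n_sweeps=200, tol=1e-8):
    """Lasso path for (1/2N) * RSS + lam * ||alpha||_1 with a free intercept.

    Returns (lams sorted in decreasing order, intercepts, coefficient matrix).
    """
    n, m = T.shape
    col_means = T.mean(axis=0)
    Tc = T - col_means              # centering handles the unpenalized alpha_0
    z = np.sum(Tc ** 2, axis=0) / n
    alpha = np.zeros(m)
    r = y - y.mean()                # residual of the centered problem
    lams = np.sort(np.asarray(lams, dtype=float))[::-1]
    intercepts, coefs = [], []
    for lam in lams:                # warm start from the previous solution
        for _ in range(n_sweeps):
            max_change = 0.0
            for j in range(m):
                if z[j] == 0:
                    continue        # constant column, nothing to fit
                rho = Tc[:, j] @ r / n + z[j] * alpha[j]
                new = soft_threshold(rho, lam) / z[j]
                if new != alpha[j]:
                    r -= Tc[:, j] * (new - alpha[j])
                    max_change = max(max_change, abs(new - alpha[j]))
                    alpha[j] = new
            if max_change < tol:
                break
        coefs.append(alpha.copy())
        intercepts.append(y.mean() - col_means @ alpha)
    return lams, np.array(intercepts), np.array(coefs)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 3))
thresholds = np.linspace(-0.8, 0.8, 9)
# Hypothetical dictionary: one indicator per (feature, threshold) pair, 27 columns.
T = np.column_stack([(X[:, j] > t).astype(float)
                     for j in range(3) for t in thresholds])
y = 1.0 + 2.0 * T[:, 4] - 3.0 * T[:, 15] + 0.1 * rng.normal(size=300)
lam_max = np.max(np.abs((T - T.mean(axis=0)).T @ (y - y.mean()))) / len(y)
lams, alpha0, alpha = lasso_path(T, y, lam_max * np.array([1.0, 0.3, 0.1, 0.03, 0.01]))
for lam, a in zip(lams, alpha):
    print(f"lambda={lam:.4f}  nonzero coefficients={np.count_nonzero(a)}")
```

Only the basis functions with non-zero $\alpha_m$ need to be stored and evaluated afterwards, which is where the savings come from.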

## Learning a Good Ensemble


In order to select a good $\mathcal{T}_L$, we can use a measure of lack of relevance:
$$Q(\gamma) = \min_{c_0, c_1} \sum_{i=1}^N L(y_i, c_0 + c_1 b(x_i;\gamma))$$
with $\gamma$ the whole set of parameters of the basis function $b(x;\gamma)$. For a tree, it would be every splitting variable, split-point, and terminal-node value.  
If only one basis function were to be selected, it would be $\gamma^* = \arg \min_{\gamma} Q(\gamma)$.

Introducing randomness in the selection of $\gamma$ would produce less optimal values $Q(\gamma) \geq Q(\gamma^*)$. The width of the resulting sampling scheme is measured by:
$$\sigma = E[Q(\gamma) - Q(\gamma^*)]$$
- $\sigma$ too narrow suggests that the $b(x;\gamma_m)$ all look alike and are similar to $b(x;\gamma^*)$.
- $\sigma$ too wide implies that many of the $b(x;\gamma_m)$ are poor.

ISLE ensemble generation uses sub-sampling to introduce randomness:
1. Initialize $f_0(x)$:
    $$f_0(x) = \arg \min_c \sum_{i=1}^N L(y_i, c)$$
2. For $m=1 \to M$:
    $$\gamma_m \leftarrow \arg \min_\gamma \sum_{i \in S_m(\mu)} L(y_i, f_{m-1}(x_i) + b(x_i;\gamma))$$
    $$f_m(x) = f_{m-1}(x) + \nu b(x;\gamma_m)$$  
    
where $S_m(\mu)$ refers to a subsample of size $N \cdot \mu$ ($\mu \in (0,1)$), and $\nu \in [0,1]$ is a shrinkage parameter.
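
The generation loop can be sketched as follows, assuming squared-error loss (so that minimizing over $\gamma$ amounts to fitting the base learner to the current residuals) and regression stumps as basis functions instead of full trees. The helper names, the default values of `mu` and `nu`, and the synthetic data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def fit_stump(X, r):
    """Least-squares stump: returns (feature, threshold, left_value, right_value).

    Assumes at least one feature takes two distinct values.
    """
    best, best_sse = None, np.inf
    for j in range(X.shape[1]):
        xs = np.unique(X[:, j])
        for t in (xs[:-1] + xs[1:]) / 2:      # midpoints between sorted values
            left = X[:, j] <= t
            cl, cr = r[left].mean(), r[~left].mean()
            sse = np.sum((r[left] - cl) ** 2) + np.sum((r[~left] - cr) ** 2)
            if sse < best_sse:
                best, best_sse = (j, t, cl, cr), sse
    return best

def stump_predict(stump, X):
    j, t, cl, cr = stump
    return np.where(X[:, j] <= t, cl, cr)

def isle_generate(X, y, n_learners=50, mu=0.5, nu=0.1, seed=0):
    """Return f_0 and the list of basis functions b(x; gamma_m)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    f0 = y.mean()                             # argmin_c under squared error
    f = np.full(n, f0)
    ensemble = []
    for _ in range(n_learners):
        # S_m(mu): subsample of size N * mu drawn without replacement
        idx = rng.choice(n, size=max(2, int(n * mu)), replace=False)
        stump = fit_stump(X[idx], y[idx] - f[idx])
        ensemble.append(stump)
        f += nu * stump_predict(stump, X)     # shrunken update of f_m
    return f0, ensemble

def basis_matrix(ensemble, X):
    """Column m holds b(x_i; gamma_m), ready for lasso post-processing."""
    return np.column_stack([stump_predict(s, X) for s in ensemble])

rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 2))
y = np.where(X[:, 0] > 0.5, 2.0, -1.0) + 0.1 * rng.normal(size=200)
f0, ensemble = isle_generate(X, y)
T = basis_matrix(ensemble, X)
print(T.shape, ensemble[0])
```

Smaller `mu` makes the basis functions more diverse (wider $\sigma$), while `nu` controls how much each new basis function is steered away from the previous ones.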

## Rule Ensembles

Predictive learning via rule ensembles - Friedman, J. and Popescu, B. (2008) - [PDF](https://arxiv.org/pdf/0811.1679.pdf)

We can convert a tree into a set of rules. Usually some rules can be removed while still spanning the same tree.  
For each tree $T_m$, we construct its mini-ensemble of rules $\mathcal{T}^m_\text{RULE}$, and combine them all into a larger ensemble:
$$\mathcal{T}_\text{RULE} = \bigcup_{m=1}^M \mathcal{T}^m_\text{RULE}$$

This is treated as any other ensemble and post-processed via lasso or other procedures.
Advantages:
- The space of models is enlarged
- Rules are easier to interpret than trees.
- $\mathcal{T}_\text{RULE}$ can be extended with each $X_j$ for example, which also allows the model to capture linear functions.
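
A small sketch of the tree-to-rules conversion, under assumptions chosen for the example: a tree is written by hand as nested tuples `(feature, threshold, left, right)` with `None` for leaves, and one rule is produced for every non-root node as the conjunction of the split conditions on the path leading to it. The representation and the function names are hypothetical and do not come from any library.

```python
import numpy as np

# Hypothetical tree: split on x0 at 0.5, then the left branch splits on x1 at 0.3.
tree = (0, 0.5, (1, 0.3, None, None), None)

def extract_rules(node, conditions=()):
    """Return one rule per non-root node, as a tuple of (feature, op, threshold)."""
    rules = []
    if conditions:                       # the root has no condition, hence no rule
        rules.append(conditions)
    if node is not None:                 # internal node: recurse into both children
        j, t, left, right = node
        rules += extract_rules(left, conditions + ((j, "<=", t),))
        rules += extract_rules(right, conditions + ((j, ">", t),))
    return rules

def apply_rule(rule, X):
    """Evaluate a rule as a 0/1 feature: 1 where every condition holds."""
    mask = np.ones(X.shape[0], dtype=bool)
    for j, op, t in rule:
        mask &= (X[:, j] <= t) if op == "<=" else (X[:, j] > t)
    return mask.astype(float)

def rule_features(trees, X, add_linear=True):
    """Union of the rules of all trees, optionally extended with each X_j."""
    rules = [r for tr in trees for r in extract_rules(tr)]
    cols = [apply_rule(r, X) for r in rules]
    if add_linear:
        cols += [X[:, j] for j in range(X.shape[1])]
    return rules, np.column_stack(cols)

X = np.array([[0.2, 0.1],
              [0.2, 0.9],
              [0.8, 0.1]])
rules, F = rule_features([tree], X)
for r in rules:
    print(" and ".join(f"x{j} {op} {t}" for j, op, t in r))
print(F)
```

The matrix `F` plays the same role as the basis matrix of any other ensemble, so it can be passed to a lasso fit to select a small set of rules and linear terms.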