In [1]:
import numpy as np
import matplotlib.pyplot as plt

import sys
sys.path.append('../../pyutils')
import metrics
import utils

# Adaboost.M1

The idea of boosting is to combine many weak classifiers into a strong one.  
A weak classifier is one that is only slightly better than random guessing.

Let's define the error rate:

$$\bar{\text{err}} = \frac{1}{N} \sum_{i=1}^N I(y_i \neq G(x_i))$$
AdaBoost combines $M$ weak classifiers:
$$G(x) = \text{sign} \left( \sum_{m=1}^M \alpha_m G_m(x) \right) $$

$\alpha$ is the vector of classifier contributions. They are learned, such that a better model has a higher $\alpha_m$.  

The classifiers are trained one by one, each on weighted examples $w_i$. At first all examples have the same weight; then, at each iteration, the weights of the misclassified examples increase, so the other examples count relatively less.  

Algorithm $10.1$ page $339$
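
A minimal sketch of a single reweighting step, assuming labels in $\{-1, +1\}$ and a toy weak classifier whose predictions are made up for illustration. It only shows how the error rate, $\alpha_m$ and the weights interact; it is not a full implementation.

```python
import numpy as np

# Toy setup (assumed for illustration): 5 examples, and a weak classifier
# that misclassifies only the last one.
y = np.array([1, 1, -1, -1, 1])
preds = np.array([1, 1, -1, -1, -1])
w = np.ones(len(y)) / len(y)          # uniform initial weights

miss = preds != y
err = np.sum(w * miss) / np.sum(w)    # weighted error rate
alpha = np.log((1 - err) / err)       # contribution of this classifier

w_new = w * np.exp(alpha * miss)      # only the misclassified weights grow
w_new /= w_new.sum()                  # normalization, for readability only

print('err:', err, 'alpha:', alpha)
print('new weights:', w_new)
```

After the update, the misclassified example carries half of the total weight, so the next weak classifier cannot ignore it.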

In [2]:
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from copy import deepcopy

X, y = load_digits(return_X_y=True)
# binary target: 1 for digits 0-4, 0 for digits 5-9
y = (y < 5).astype(np.int32)
print(X.shape)
print(y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2,
                                                    random_state=15)


logreg = LogisticRegression(solver='liblinear')
logreg.fit(X_train, y_train)
print('train acc:', np.mean(y_train == logreg.predict(X_train)))
print('test acc:', np.mean(y_test == logreg.predict(X_test)))


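class AdaboostM1:
    """AdaBoost.M1 with labels in {0, 1} (Algorithm 10.1 uses {-1, +1})."""

    def __init__(self, model, M):
        self.model = model           # weak learner; its fit() must accept sample_weight
        self.M = M                   # number of boosting rounds
        self.mods = []
        self.alpha = np.empty(M)

    def fit(self, X, y):
        N = len(X)
        w = np.ones(N) / N           # start from uniform weights

        for m in range(self.M):
            clf = deepcopy(self.model)
            clf.fit(X, y, sample_weight=w)
            self.mods.append(clf)
            preds = clf.predict(X)
            # weighted error rate of the m-th classifier
            err = np.sum(w * (preds != y)) / np.sum(w)
            # classifier weight; assumes 0 < err < 1
            self.alpha[m] = np.log((1 - err) / err)
            # scale up the weights of the misclassified examples
            w = w * np.exp(self.alpha[m] * (preds != y))

    def predict(self, X):
        # weighted vote: with {0, 1} labels, rounding the alpha-weighted
        # average plays the role of the sign in the formula above
        preds = np.zeros(len(X))
        for m in range(self.M):
            preds += self.alpha[m] * self.mods[m].predict(X)
        preds = np.round(preds / np.sum(self.alpha)).astype(np.int32)
        return preds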
            
        
 
mod = DecisionTreeClassifier(max_depth=1)
clf = AdaboostM1(mod, 500)
clf.fit(X_train, y_train)
print('train acc:', np.mean(y_train == clf.predict(X_train)))
print('test acc:', np.mean(y_test == clf.predict(X_test)))

(1797, 64)
(1797,)
train acc: 0.9102296450939458
test acc: 0.8916666666666667
train acc: 0.9617258176757133
test acc: 0.9111111111111111


# Boosting Fits an Additive Model

Boosting is just a special case of additive models:
    
$$f(x) = \sum_{m=1}^M \beta_m b(x;\gamma_m)$$

$b(x;\gamma_m)$ are simple functions of argument $x$ and parameters $\gamma_m$. For boosting, each basis function is a weak classifier.

These models are trained by fitting $\beta$ and $\gamma$ minimizing a loss function over the dataset:

$$\min_{\{\beta_m, \gamma_m\}_1^M} \sum_{i=1}^N L\left(y_i, \sum_{m=1}^M \beta_m b(x_i;\gamma_m)\right)$$

for any loss function $L(y, f(x))$ such as squared-error or negative log-likelihood.

# Forward Stagewise Additive Modeling

This algorithm finds an approximate solution by solving a sequence of simpler problems. It starts with an empty model and adds one new basis function at a time, fitting it without modifying the parameters of the previous ones.

Each step is a much simpler problem to optimize:

$$\min_{\beta_m, \gamma_m} \sum_{i=1}^N L(y_i, f_{m-1}(x_i) + \beta_m b(x_i;\gamma_m))$$
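
A minimal sketch of forward stagewise fitting under squared error. As an assumption made to keep the example short, the basis function $b(x;\gamma)$ is simply the coordinate $x_\gamma$ (not a tree), and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + 0.1 * rng.normal(size=200)

def fit_stagewise(X, y, M):
    """Greedy fit of f(x) = sum_m beta_m * x[gamma_m] under squared error."""
    f = np.zeros(len(y))              # f_0 = 0: the empty model
    terms = []                        # list of (gamma_m, beta_m)
    for _ in range(M):
        r = y - f                     # residuals of the current model
        # best beta for every candidate gamma (one-dimensional least squares)
        betas = X.T @ r / np.sum(X ** 2, axis=0)
        losses = [np.sum((r - b * X[:, j]) ** 2) for j, b in enumerate(betas)]
        j = int(np.argmin(losses))
        terms.append((j, betas[j]))   # earlier terms are never revisited
        f = f + betas[j] * X[:, j]
    return terms, f

terms, f = fit_stagewise(X, y, M=10)
print(terms[:3])
print('training MSE:', np.mean((y - f) ** 2))
```

With squared error, each step only needs the residuals of the current model, which is what makes the stagewise problem so much cheaper than the joint one.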

# Exponential Loss and Adaboost

AdaBoost.M1 is equivalent to forward stagewise additive modeling using the exponential loss:

$$L(y, f(x)) = \exp (-yf(x))$$

At step $m$, the problem is:

$$(\beta_m, G_m) = \arg \min_{\beta, G} \sum_{i=1}^N \exp [-y_i(f_{m-1}(x_i) + \beta G(x_i))]$$
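
For a fixed classifier $G$, the minimization over $\beta$ has the closed form $\beta = \frac{1}{2} \log \frac{1 - \text{err}}{\text{err}}$, with $\text{err}$ the weighted error of $G$ under the weights $w_i = \exp(-y_i f_{m-1}(x_i))$; this is half of the $\alpha_m$ used by AdaBoost.M1. The sketch below checks this numerically against a grid search. The labels, predictions and weights are random stand-ins, assumed only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
y = rng.choice([-1, 1], size=N)
G = np.where(rng.random(N) < 0.7, y, -y)   # weak classifier, right about 70% of the time
w = rng.random(N)                          # stands in for exp(-y_i f_{m-1}(x_i))

def objective(beta):
    """Weighted exponential loss as a function of beta, for fixed G."""
    return np.sum(w * np.exp(-beta * y * G))

err = np.sum(w * (G != y)) / np.sum(w)     # weighted error of G
beta_closed = 0.5 * np.log((1 - err) / err)

betas = np.linspace(-2, 3, 5001)
beta_grid = betas[np.argmin([objective(b) for b in betas])]
print('closed form:', beta_closed, 'grid search:', beta_grid)
```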

# Why Exponential Loss ?

The principal attraction is computational: additive modeling with exponential loss leads to a simple modular reweighting algorithm. Its population minimizer is also meaningful:

$$f^*(x) = \arg \min_{f(x)} E_{y|x} \exp(-yf(x)) = \frac{1}{2} \log \frac{P(y=1|x)}{P(y=-1|x)}$$

Thus AdaBoost estimates one-half of the log-odds of $P(y=1|x)$, which justifies using its sign as the classification rule.  

Another loss is the deviance loss:

$$l(y, f(x)) = \log (1 + e^{-2yf(x)})$$

At the population level, both criteria lead to the same solution $f^*(x)$, but this is not true for finite datasets.
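
A quick numerical check of the population result, assuming a fixed point $x$ where $P(y=1|x) = 0.8$ (an arbitrary value chosen for illustration). Both expected losses are evaluated on a grid of values of $f$.

```python
import numpy as np

p = 0.8                                   # assumed P(y = 1 | x) at some fixed x
f = np.linspace(-3, 3, 60001)

# expected losses over y given x
exp_risk = p * np.exp(-f) + (1 - p) * np.exp(f)
dev_risk = p * np.log1p(np.exp(-2 * f)) + (1 - p) * np.log1p(np.exp(2 * f))

half_log_odds = 0.5 * np.log(p / (1 - p))
f_exp = f[np.argmin(exp_risk)]
f_dev = f[np.argmin(dev_risk)]
print(half_log_odds, f_exp, f_dev)
```

Both minimizers land on one-half of the log-odds, up to the grid resolution.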

# Loss Functions and Robustness

## Robust Loss functions for classification

Deviance and exponential loss are both monotone decreasing functions of the margin $yf(x)$.  
With $G(x) = \text{sign}(f(x))$, observations with positive margin are classified correctly, and those with negative margin are misclassified.  
Any loss criterion should penalize negative margins more heavily than positive ones.  

The difference between deviance and exponential loss is how much they penalize negative margins. The penalty for deviance increases linearly, whereas the one for exponential loss increases exponentially. In noisy settings, with mislabeled examples in the training data, the deviance gives better results.  

Squared error increases quadratically when $yf(x) > 1$, so it penalizes correctly classified examples more as their certainty increases. This makes squared error a poor choice of loss function for classification.  
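
To make the comparison concrete, the sketch below evaluates each loss at a few margins $yf(x)$ (values chosen arbitrarily). Squared error is written as $(1 - yf(x))^2$, which equals $(y - f(x))^2$ when $y \in \{-1, +1\}$.

```python
import numpy as np

margin = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])   # values of y * f(x)

misclassification = (margin < 0).astype(float)
exponential = np.exp(-margin)
deviance = np.log1p(np.exp(-2 * margin))
squared = (1 - margin) ** 2      # (y - f)^2 = (1 - y f)^2 for y in {-1, +1}

for name, vals in [('misclassification', misclassification),
                   ('exponential', exponential),
                   ('deviance', deviance),
                   ('squared', squared)]:
    print(f'{name:18s}', np.round(vals, 3))
```

The exponential loss dominates the deviance at large negative margins, and squared error starts increasing again once the margin exceeds $1$.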

The problem generalizes to K-class classification:
$$G(x) = \arg \max_{k} p_k(x)$$
with $p_k(x)$ the probability that $x$ belongs to class $k$:
$$p_k(x) = \frac{e^{f_k(x)}}{\sum_{l=1}^K e^{f_l(x)}}$$

We can use the K-class multinomial deviance loss function:

$$L(y, p(x)) = -\sum_{k=1}^K I(y = k) \log p_k(x)$$
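
A small sketch of the class probabilities and of the multinomial deviance, written as a loss to minimize (the negative log-probability of the true class). The scores are made up, and classes are indexed from $0$ in the code rather than from $1$.

```python
import numpy as np

def softmax(f):
    """Class probabilities p_k from scores f_k (one row per observation)."""
    z = f - f.max(axis=1, keepdims=True)     # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def multinomial_deviance(y, f):
    """Negative log-probability of the true class, one value per observation."""
    p = softmax(f)
    return -np.log(p[np.arange(len(y)), y])

f = np.array([[2.0, 0.5, -1.0],
              [0.0, 0.0, 0.0]])
y = np.array([0, 2])                          # true classes, indexed from 0

p = softmax(f)
G = p.argmax(axis=1)                          # predicted classes
loss = multinomial_deviance(y, f)
print(G, np.round(loss, 4))
```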

## Robust Loss functions for regression

For regression, the squared error $L(y, f(x)) = (y - f(x))^2$ and the absolute loss $L(y, f(x)) = |y - f(x)|$ have as population solutions the conditional mean and the conditional median, which coincide for symmetric error distributions, but the fits differ on finite datasets.  
Squared error loss places much more emphasis on observations with large residuals, which makes it far less robust to outliers. Absolute loss performs much better in these situations.  

Another solution to resist outliers is the Huber loss:
$$
L(y, f(x)) = 
\begin{cases}
    (y - f(x))^2 & \text{if } |y - f(x)| \leq \delta \\
    2 \delta |y - f(x)| - \delta^2 & \text{otherwise}
\end{cases}
$$
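
A short sketch of the Huber loss as defined above, compared with squared and absolute error on a few residuals. The residuals and $\delta = 1$ are assumed for illustration.

```python
import numpy as np

def huber(r, delta):
    """Huber loss of residuals r: quadratic near zero, linear in the tails."""
    a = np.abs(r)
    return np.where(a <= delta, r ** 2, 2 * delta * a - delta ** 2)

r = np.array([0.5, 1.0, 2.0, 10.0])   # residuals y - f(x)
delta = 1.0

h = huber(r, delta)
print('squared :', r ** 2)
print('absolute:', np.abs(r))
print('huber   :', h)
```

For the outlying residual of $10$, the Huber loss grows linearly, like the absolute loss, instead of quadratically.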

# Boosting Trees

A tree can be expressed as:

$$T(x;\theta) = \sum_{j=1}^J \gamma_j I(x \in R_j)$$

The parameters are found by minimizing the empirical risk:

$$\hat{\theta} = \arg \min_\theta \sum_{j=1}^J \sum_{x_i \in R_j} L(y_i, \gamma_j)$$  

The boosted tree model is a sum of such trees:
$$f_M(x) = \sum_{m=1}^M T(x;\theta_m)$$

With a forward stagewise procedure, one must solve at each step:
$$\hat{\theta}_m = \arg \min_{\theta_m} \sum_{i=1}^N L(y_i, f_{m-1}(x_i) + T(x_i;\theta_m))$$
For squared error, we simply need to fit a new regression tree to the current residuals $y_i - f_{m-1}(x_i)$.  

For binary classification with exponential loss, we get the following criterion for each tree, with $w_i^{(m)} = \exp(-y_i f_{m-1}(x_i))$:
$$\hat{\theta}_m = \arg \min_{\theta_m} \sum_{i=1}^N w_i^{(m)} \exp (-y_i T(x_i;\theta_m))$$
This criterion can be implemented by modifying the splitting criterion of the classical tree-growing algorithms.  

Other losses, such as the absolute error, the Huber loss, or the deviance, give more robust models, but there are no simple fast algorithms to optimize them. 

# Numerical Optimization via Gradient Boosting

Let's define the loss as:
$$L(f) = \sum_{i=1}^N L(y_i, f(x_i))$$

The goal is to minimize $L(f)$ with respect to $f$, with $f$ a sum of trees.  

Let's say the parameters of $f$ are the values of $f$ at each point in the training set:
$$f = \{ f(x_1), f(x_2), \text{...}, f(x_N) \}^T$$

Numerical optimization writes the solution $f_M$ as a sum of component vectors, or steps:
$$f_M = \sum_{m=0}^M h_m, \quad h_m \in \mathbb{R}^N$$  

Using steepest descent, we define $h_m = -\rho_m g_m$ with $\rho_m \in \mathbb{R}$ the step size, and $g_m \in \mathbb{R}^N$ the gradient of $L$.  

$$g_{im} = \frac{\partial L(y_i, f_{m-1}(x_i))}{\partial f_{m-1}(x_i)}$$
$$\rho_m = \arg \min_{\rho} L(f_{m-1} - \rho g_m)$$
The current solution is updated:
$$f_m = f_{m-1} - \rho_m g_m$$  

The process is repeated $M$ times. This is a greedy strategy.
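
A minimal sketch of steepest descent on the vector of fitted values, assuming the Huber loss with $\delta = 1$, made-up targets, and a crude grid line search for $\rho_m$. The parameters here are the $N$ values $f(x_i)$ themselves; no tree is involved.

```python
import numpy as np

def loss(f, y, delta=1.0):
    """Total Huber loss of the fitted values f."""
    a = np.abs(y - f)
    return np.sum(np.where(a <= delta, a ** 2, 2 * delta * a - delta ** 2))

def gradient(f, y, delta=1.0):
    """Gradient of the total loss with respect to each f(x_i)."""
    r = y - f
    return -2 * np.where(np.abs(r) <= delta, r, delta * np.sign(r))

y = np.array([0.3, -1.2, 4.0, 0.8, -7.5])
f = np.full(len(y), np.median(y))       # f_0: a constant vector
steps = np.linspace(0, 5, 501)          # candidate step sizes rho

history = [loss(f, y)]
for m in range(5):
    g = gradient(f, y)
    rho = steps[np.argmin([loss(f - s * g, y) for s in steps])]  # line search
    f = f - rho * g                     # f_m = f_{m-1} - rho_m g_m
    history.append(loss(f, y))

print(np.round(history, 4))
```

Each component $f(x_i)$ moves freely toward its own $y_i$, so the training loss drops quickly, but nothing is learned about points outside the training set.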

## Gradient boosting

This process is great for minimizing the loss on the training data, but our goal is generalization.  
A solution is to build a tree $T(x;\theta_m)$ at each iteration, as close as possible to the negative gradient. Using squared error to measure closeness, we get the criterion:
$$\hat{\theta}_m = \arg \min_\theta \sum_{i=1}^N (-g_{im} - T(x_i;\theta))^2$$

Gradient Boosting Regression:

1. Initialize:

$$f_0(x) = \arg \min_\gamma \sum_{i=1}^N L(y_i, \gamma)$$

2. For $m=1$ to $M$:

    $$r_{im} = - \frac{\partial L(y_i, f_{m-1}(x_i))}{\partial f_{m-1}(x_i)}$$
    
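    Fit a regression tree to the targets $r_{im}$, giving terminal regions $R_{jm}$, $j = 1, \dots, J_m$. Compute the leaf values $\gamma_{jm} = \arg \min_\gamma \sum_{x_i \in R_{jm}} L(y_i, f_{m-1}(x_i) + \gamma)$, then update $f_m(x)$: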
    
    $$f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm})$$
    
3. Output $\hat{f}(x) = f_M(x)$

The algorithm is generic: changing the loss $L$ only changes the gradient and the leaf values.  
For K-class classification, we need to build $K$ trees at each iteration.  
Two hyperparameters are the number of iterations $M$, and the size of each tree $J_m$.
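
Step 2 only needs the negative gradient of the chosen loss. The sketch below lists it for the three regression losses defined in these notes (squared error without a factor $\frac{1}{2}$, absolute error, and Huber with an assumed $\delta = 1$), and checks each formula against a finite-difference estimate on made-up values.

```python
import numpy as np

delta = 1.0   # assumed Huber threshold

losses = {
    'squared':  lambda y, f: (y - f) ** 2,
    'absolute': lambda y, f: np.abs(y - f),
    'huber':    lambda y, f: np.where(np.abs(y - f) <= delta, (y - f) ** 2,
                                      2 * delta * np.abs(y - f) - delta ** 2),
}
# pseudo-residuals r_im = -dL/df for each loss
neg_gradients = {
    'squared':  lambda y, f: 2 * (y - f),
    'absolute': lambda y, f: np.sign(y - f),
    'huber':    lambda y, f: 2 * np.where(np.abs(y - f) <= delta, y - f,
                                          delta * np.sign(y - f)),
}

y = np.array([1.0, -0.5, 3.0, 0.2])
f = np.array([0.4, 0.6, -1.0, 0.0])
eps = 1e-6

numeric = {}
for name, L in losses.items():
    # central finite difference of the loss with respect to f
    numeric[name] = -(L(y, f + eps) - L(y, f - eps)) / (2 * eps)
    print(name, neg_gradients[name](y, f), np.round(numeric[name], 4))
```

For squared error the pseudo-residuals are proportional to the ordinary residuals; for absolute error they are only their signs.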

# Right-Sized Trees for Boosting

Each new tree could be built using the usual procedure: grow a very large tree, then prune it. This procedure assumes that the tree being built is the last one, which is a poor assumption for all trees but the final one. It results in trees that are much too large at each iteration.  
One solution is to restrict all trees to the same size $J$, a hyperparameter to be fixed.  

The interaction level is limited by $J$. With $J=2$, only main effects are possible; $J=3$ allows at most two-variable interactions, and so on.  
The interaction level is unknown, but low in general. Experience shows that $4 \leq J \leq 8$ works well in practice.

# Regularization

Another hyperparameter to be fixed is $M$. Each iteration reduces the risk on the training set, but may lead to overfitting. We can find the optimal $M^*$ by monitoring the risk on a validation set.  

Another regularization technique is shrinkage, which scales the contribution of each tree by a factor $v \in [0,1]$:
$$f_m(x) = f_{m-1}(x) + v \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm})$$  
Smaller values of $v$ cause more shrinkage. In practice, setting $v$ very small $(< 0.1)$ and choosing $M$ as above works well.  

We can also use subsampling, similar to bagging. At each iteration, we sample a fraction of the training dataset and perform the iteration on this sample. It usually produces a more accurate model.
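
A minimal sketch of shrinkage combined with the choice of $M$ on a validation set. It assumes squared-error loss, one-dimensional synthetic data, and stumps ($J = 2$) as base learners; the helper names are hypothetical and subsampling is not included.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=150)
y = np.sin(x) + 0.3 * rng.normal(size=150)
x_tr, y_tr, x_va, y_va = x[:100], y[:100], x[100:], y[100:]

def fit_stump(x, r):
    """Least-squares stump on 1-D inputs: (threshold, left value, right value)."""
    best_sse, best = np.inf, None
    for s in np.unique(x)[:-1]:           # keeps both sides non-empty
        left = x <= s
        cl, cr = r[left].mean(), r[~left].mean()
        sse = np.sum((r[left] - cl) ** 2) + np.sum((r[~left] - cr) ** 2)
        if sse < best_sse:
            best_sse, best = sse, (s, cl, cr)
    return best

def predict_stump(stump, x):
    s, cl, cr = stump
    return np.where(x <= s, cl, cr)

def boost(v, M):
    """Boosting with shrinkage v; returns the validation MSE after each step."""
    f_tr = np.full(len(x_tr), y_tr.mean())
    f_va = np.full(len(x_va), y_tr.mean())
    val_mse = []
    for _ in range(M):
        stump = fit_stump(x_tr, y_tr - f_tr)      # fit the current residuals
        f_tr += v * predict_stump(stump, x_tr)    # shrunken update
        f_va += v * predict_stump(stump, x_va)
        val_mse.append(np.mean((y_va - f_va) ** 2))
    return np.array(val_mse)

results = {}
for v in (1.0, 0.1):
    results[v] = boost(v, M=100)
    best_M = results[v].argmin() + 1              # M* on the validation set
    print(f'v={v}: best M = {best_M}, validation MSE = {results[v].min():.4f}')
```

The validation curve gives $M^*$ directly as the iteration with the lowest validation risk; a smaller $v$ typically moves $M^*$ to a larger value.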

# Interpretation

## Relative importance of Predictor Variables

For a single tree $T$, a measure of relevance of feature $X_l$ is:
$$\mathcal{I}_l^2(T) = \sum_{t=1}^{J-1} \hat{i}_t^2 I(v(t) = l)$$ 

The sum runs over the $J-1$ internal nodes of the tree. $v(t)$ is the splitting variable at node $t$, and $\hat{i}_t^2$ is the improvement in squared-error fit obtained by the split, compared with a constant fit over the whole region.  
For tree boosting, we simply average over all the trees:
$$\mathcal{I}_l^2 = \frac{1}{M} \sum_{m=1}^M \mathcal{I}_l^2(T_m)$$  
All values are relative: we can set the highest to $100$ and scale all the others accordingly.
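
A small sketch of the computation, assuming each tree is summarized by a hypothetical list of (split variable, squared-error improvement) pairs, one per internal node. The numbers are made up. Since the formulas give squared relevances, the square root is taken before scaling.

```python
import numpy as np

# Hypothetical split records for M = 2 trees with J = 3 leaves each:
# (split variable v(t), improvement i_t^2) for each internal node.
trees = [
    [(0, 12.0), (2, 3.0)],
    [(0, 8.0), (1, 1.0)],
]
n_features = 3

def tree_importance(splits, n_features):
    """Squared relevance of each feature in a single tree."""
    imp = np.zeros(n_features)
    for var, improvement in splits:
        imp[var] += improvement
    return imp

# average the squared relevances over the trees
sq_importance = np.mean([tree_importance(t, n_features) for t in trees], axis=0)
importance = np.sqrt(sq_importance)
relative = 100 * importance / importance.max()   # largest value scaled to 100
print(np.round(relative, 1))
```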

## Partial Dependence Plots

We can plot the dependence of $f$ on a subset of variables $X_S$, by taking a marginal average over the other variables $X_C$:
$$f_S(X_S) = E_{X_C} f(X_S,X_C)$$
We can thus produce several partial dependence plots, using several sets $S$.  
They can be estimated by:
$$\bar{f}_S(X_S) = \frac{1}{N} \sum_{i=1}^N f(X_S, x_{iC})$$
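
A minimal sketch of the estimator for a single variable. The function `f` below is a stand-in for a fitted model and the data are synthetic; both are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))

def f(X):
    """Stand-in for a fitted model (assumed for illustration)."""
    return 2.0 * X[:, 0] + X[:, 1] * X[:, 2]

def partial_dependence(f, X, S, grid):
    """Average of f over the observed x_iC, for each grid value of column S."""
    out = np.empty(len(grid))
    for k, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, S] = value           # fix X_S, keep every observed x_iC
        out[k] = f(X_mod).mean()
    return out

grid = np.linspace(-2, 2, 5)
pd0 = partial_dependence(f, X, S=0, grid=grid)
print(np.round(pd0, 3))
```

Here the partial dependence on the first variable is linear with slope $2$, shifted by the average of the interaction term over the data.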

## Gradient boosting (Regression)

Least Absolute Error Tree Boosting Regression algorithm:

1. set $F_0(x) = \text{median} \{ y_i \}_1^N$

2. For $m=1$ to $M$:
    $$\hat{y}_i = \text{sign}(y_i - F_{m-1}(x_i))$$
    $$\{ R_{jm} \}_1^J \text{ tree with J terminal nodes trained on } \{\hat{y}_i, x_i \}_1^N$$
    $$\gamma_{jm} = \text{median} \{ y_i - F_{m-1}(x_i) : x_i \in R_{jm} \}$$
    $$F_m(x) = F_{m-1}(x) + \sum_{j=1}^J \gamma_{jm} 1(x \in R_{jm})$$
    
3. Output $\hat{F}(x) = F_M(x)$  

An alternative approach is to build a tree $T(x)$ that directly minimizes the LAE loss:
$$T_m = \arg \min_{T} \sum_{i=1}^N |y_i - F_{m-1}(x_i) - T(x_i)|$$
$$F_m(x) = F_{m-1}(x) + T_m(x)$$  

The first solution is much faster because it uses the squared error loss to build the trees.

# Gradient Boosting Paper

[Greedy Function Approximation: A Gradient boosting Machine](https://statweb.stanford.edu/~jhf/ftp/trebst.pdf)

In [3]:
from copy import deepcopy
# Note: load_boston was removed in scikit-learn 1.2; this cell needs an older version.
from sklearn.datasets import load_boston

class DTNode:
    
    def __init__(self, X, y, val):
        self.X = X
        self.y = y
        self.val = val
        self.cut = None
        self.subs = None
        
    def pred(self, x):
        if self.cut is None:
            return self.val
        elif x[self.cut[0]] <= self.cut[1]:
            return self.subs[0].pred(x)
        else:
            return self.subs[1].pred(x)
        
    def split(self, j, s, eval_fn):
        if self.cut is not None:
            raise Exception('already cut')
        
        leftp = self.X[:,j] <= s
        rightp = self.X[:,j] > s
        
        X_left, y_left = self.X[leftp], self.y[leftp]
        X_right, y_right = self.X[rightp], self.y[rightp]
        left = DTNode(X_left, y_left, eval_fn(X_left, y_left))
        right = DTNode(X_right, y_right, eval_fn(X_right, y_right))
        self.cut = (j, s)
        self.subs = (left, right)
        
    def update_vals(self, X, y, eval_fn):
        if self.cut is None:
            self.val = eval_fn(X, y)
            return
        
        p1 = X[:,self.cut[0]] <= self.cut[1]
        p2 = X[:,self.cut[0]] > self.cut[1]
        self.subs[0].update_vals(X[p1], y[p1], eval_fn)
        self.subs[1].update_vals(X[p2], y[p2], eval_fn)

        
        

def get_best_cut(node, j, val_fn, err_fn, min_leaf_size):
    """Return the best threshold on feature j for this node, and its error."""
    X = node.X
    y = node.y
    best_s = None
    best_err = float('inf')

    for i in range(len(X) - 1):
        # candidate threshold: midpoint between two consecutive rows
        # (rows are taken in data order, not sorted by feature j)
        s = (X[i,j] + X[i+1,j])/2
        leftp = X[:,j] <= s
        X_left, y_left = X[leftp], y[leftp]
        X_right, y_right = X[~leftp], y[~leftp]
        if len(y_left) < min_leaf_size or len(y_right) < min_leaf_size:
            continue

        # each side predicts a single constant; score the split by total error
        preds_left = np.ones(len(y_left)) * val_fn(X_left, y_left)
        preds_right = np.ones(len(y_right)) * val_fn(X_right, y_right)
        err = err_fn(y_left, preds_left) + err_fn(y_right, preds_right)

        if err < best_err:
            best_err = err
            best_s = s

    return best_s, best_err
        
        
    
    
        

def split_tree(node, val_fn, err_fn, max_size, size = None):
    if size is None:
        size = [1]
    if size[0] >= max_size:
        return
    
    best_j = None
    best_s = None
    best_err = float('inf')
    
    for j in range(node.X.shape[1]):
            
        s, err = get_best_cut(node, j, val_fn, err_fn, min_leaf_size=3)
        if err < best_err:
            best_s = s
            best_j = j
            best_err = err
    
    if best_j is None:
        return
    
    node.split(best_j, best_s, val_fn)
    size[0] += 1
    split_tree(node.subs[0], val_fn, err_fn, max_size, size)
    split_tree(node.subs[1], val_fn, err_fn, max_size, size)
    
    
    
def build_tree(X, y, val_fn, err_fn, max_size):
    root = DTNode(X, y, val_fn(X, y))
    split_tree(root, val_fn, err_fn, max_size)
    return root

def val_avg(X, y):
    return np.mean(y)

def err_mse(y, preds):
    return np.sum((y - preds)**2)
    
    
class TreeRegressor:
    
    def __init__(self, max_size=4):
        self.max_size = max_size
    
    def fit(self, X, y):
        self.root = build_tree(X, y, val_avg, err_mse,
                              self.max_size)
    
    def predict(self, X):
        y = np.empty(len(X))
        for i in range(len(X)):
            y[i] = self.root.pred(X[i])
        return y
    
    
        
X, y = load_boston().data, load_boston().target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                   random_state=15)

clf = TreeRegressor(max_size=4)
clf.fit(X_train, y_train)
print('train_error:', np.mean((y_train - clf.predict(X_train))**2))
print('test_error:', np.mean((y_test - clf.predict(X_test))**2))

train_error: 23.811758966794812
test_error: 33.320464682729074


In [4]:
class ConstModel:
    def __init__(self, val):
        self.root = DTNode(None, None, val)

def update_median(_, resi):
    return np.median(resi)
        
class LADTreeBost:
    
    def __init__(self, J, M):
        self.J = J
        self.M = M
        
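    def fit(self, X, y):
        
        self.mods = []
        # F_0: the median of the targets
        self.mods.append(ConstModel(np.median(y)))
        resid = y - np.median(y)
        
        for m in range(self.M):
            # negative gradient of the absolute loss: sign of the residuals
            yhat = np.sign(resid)
            tree = TreeRegressor(max_size=self.J)
            tree.fit(X,yhat)
            # replace each leaf value by the median residual in its region
            tree.root.update_vals(X, resid, update_median)
            resid -= tree.predict(X) 
            self.mods.append(tree)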
        
    def predict(self, X):
        y = np.empty(len(X))
        for i in range(len(X)):
            y[i] = self.get_pred(X[i])
        return y
    
    def get_pred(self, x):
        y = 0
        for m in self.mods:
            y += m.root.pred(x)
        return y
    

X, y = load_boston().data, load_boston().target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                   random_state=15)


clf = LADTreeBost(J=4, M=10)
clf.fit(X_train, y_train)
print('train_error:', np.mean((y_train - clf.predict(X_train))**2))
print('test_error:', np.mean((y_test - clf.predict(X_test))**2))

train_error: 16.5096349009901
test_error: 18.787406556372538
