In [1]:
import sys
sys.path.append('..')

import numpy as np
import matplotlib.pyplot as plt

import metrics
import utils

# AdaBoost.M1

The idea of boosting is to combine many weak classifiers into a strong one.  
A weak classifier is one that is only slightly better than random guessing.

Let's define the error rate:

$$\bar{\text{err}} = \frac{1}{N} \sum_{i=1}^N I(y_i \neq G(x_i))$$
With labels coded as $y \in \{-1, 1\}$, AdaBoost combines $M$ weak classifiers through a weighted vote:
$$G(x) = \text{sign} \left( \sum_{m=1}^M \alpha_m G_m(x) \right) $$

The coefficients $\alpha_1, \dots, \alpha_M$ weight the contribution of each classifier. They are learned by the algorithm, so that a more accurate classifier gets a higher $\alpha_m$.  

The classifiers are trained one after another on weighted examples, with weights $w_i$. All examples start with the same weight; at each iteration the weights of the misclassified examples are increased, so the relative weight of the correctly classified ones decreases.  

Algorithm $10.1$ page $339$
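
As a small illustration of one reweighting round, here is a sketch on made-up labels and predictions (the arrays `y` and `pred` are assumptions for the example, not data from this notebook). It computes the weighted error, the coefficient $\alpha_m$, and the updated weights:

```python
import numpy as np

# Toy labels and weak-classifier predictions in {-1, +1} (made-up values).
y = np.array([1, 1, -1, -1, 1])
pred = np.array([1, -1, -1, -1, 1])   # a single mistake, at index 1

w = np.ones(len(y)) / len(y)          # uniform initial weights
miss = pred != y

err = np.sum(w * miss) / np.sum(w)    # weighted error rate
alpha = np.log((1 - err) / err)       # contribution of this classifier
w_new = w * np.exp(alpha * miss)      # only the misclassified weight grows
w_new = w_new / w_new.sum()           # normalization, only for readability

print(err, alpha)
print(w_new)
```

After the update the misclassified example carries half of the total weight, so the next weak classifier is pushed to get it right.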

In [2]:
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from copy import deepcopy

# Load the data once and build a binary target: 1 for digits 0-4, 0 for digits 5-9
X, y = load_digits(return_X_y=True)
y = (y < 5).astype(np.int32)
print(X.shape)
print(y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2,
                                                    random_state=15)


logreg = LogisticRegression(solver='liblinear')
logreg.fit(X_train, y_train)
print('train acc:', np.mean(y_train == logreg.predict(X_train)))
print('test acc:', np.mean(y_test == logreg.predict(X_test)))


class AdaboostM1:
    
    def __init__(self, model, M):
        self.model = model
        self.M = M
        self.mods = []
        self.alpha = np.empty(M)
        
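    def fit(self, X, y):
        N = len(X)
        w = np.ones(N) / N  # start from uniform example weights
        self.mods = []      # reset so that fit can be called more than once

        for m in range(self.M):
            # fit a fresh copy of the base model on the weighted examples
            clf = deepcopy(self.model)
            clf.fit(X, y, sample_weight=w)
            self.mods.append(clf)
            miss = clf.predict(X) != y
            # weighted error rate; assumes 0 < err < 1, otherwise alpha is infinite
            err = np.sum(w * miss) / np.sum(w)
            self.alpha[m] = np.log((1 - err) / err)
            # increase the weights of the misclassified examples
            w = w * np.exp(self.alpha[m] * miss)

    def predict(self, X):
        # labels are coded in {0, 1}, so thresholding the alpha-weighted average
        # vote at 0.5 plays the role of the sign in G(x) (for a positive total alpha)
        preds = np.zeros(len(X))
        for m in range(self.M):
            preds += self.alpha[m] * self.mods[m].predict(X)
        preds = np.round(preds / np.sum(self.alpha)).astype(np.int32)
        return preds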
            
        
 
mod = DecisionTreeClassifier(max_depth=1)
clf = AdaboostM1(mod, 500)
clf.fit(X_train, y_train)
print('train acc:', np.mean(y_train == clf.predict(X_train)))
print('test acc:', np.mean(y_test == clf.predict(X_test)))

(1797, 64)
(1797,)
train acc: 0.9102296450939458
test acc: 0.8916666666666667
train acc: 0.9617258176757133
test acc: 0.9111111111111111


# Boosting Fits an Additive Model

Boosting is just a special case of additive models:
    
$$f(x) = \sum_{m=1}^M \beta_m b(x;\gamma_m)$$

$b(x;\gamma_m)$ are simple functions of argument $x$ and parameters $\gamma_m$. For boosting, each basis function is a weak classifier.

These models are fit by choosing the $\beta_m$ and $\gamma_m$ that minimize a loss function summed over the training set:

$$\min_{\{\beta_m, \gamma_m\}_1^M} \sum_{i=1}^N L\left(y_i, \sum_{m=1}^M \beta_m b(x_i;\gamma_m)\right)$$

for any loss function $L(y, f(x))$ such as squared-error or negative log-likelihood.

# Forward Stagewise Additive Modeling

This algorithm finds an approximate solution by solving a sequence of simpler problems. It starts with an empty model and adds one new basis function at a time, fitting it without modifying the parameters of the previous ones.

Each step is a much simpler optimization problem:

$$\min_{\beta_m, \gamma_m} \sum_{i=1}^N L(y_i, f_{m-1}(x_i) + \beta_m b(x_i;\gamma_m))$$
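
A minimal sketch of the stagewise procedure, assuming squared-error loss and one-dimensional regression stumps as basis functions (both are choices made for this example; `fit_stump` and `stump_predict` are hypothetical helpers). With squared error, each step reduces to fitting the current residuals, and $\beta_m$ is absorbed into the stump values:

```python
import numpy as np

def fit_stump(x, r):
    """Least-squares stump on 1-D inputs: returns (threshold, left_value, right_value)."""
    best = None
    for t in np.unique(x)[:-1]:                 # skip the max so the right side is never empty
        left, right = r[x <= t], r[x > t]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]

def stump_predict(stump, x):
    t, left_value, right_value = stump
    return np.where(x <= t, left_value, right_value)

# Synthetic data, made up for the example
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = np.sin(x) + 0.1 * rng.normal(size=200)

f = np.zeros_like(y)                 # f_0 = 0: the empty model
losses = []
for m in range(20):
    stump = fit_stump(x, y - f)      # new basis function, fit to the residuals
    f = f + stump_predict(stump, x)  # earlier terms are left untouched
    losses.append(np.mean((y - f) ** 2))

print(losses[0], losses[-1])
```

The training loss never increases from one step to the next, because each new stump is the best least-squares stump for the current residuals.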

# Exponential Loss and Adaboost

AdaBoost.M1 is equivalent to forward stagewise additive modeling using the exponential loss:

$$L(y, f(x)) = \exp (-yf(x))$$

At each step, the problem to solve is:

$$(\beta_m, G_m) = \arg \min_{\beta, G} \sum_{i=1}^N \exp [-y_i(f_{m-1}(x_i) + \beta G(x_i))]$$
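
The link with the reweighting scheme of AdaBoost.M1 can be made explicit with a short derivation. This is a sketch of the standard argument; the symbols $w_i^{(m)}$ and $\text{err}_m$ are notation introduced here. The factor

$$w_i^{(m)} = \exp(-y_i f_{m-1}(x_i))$$

depends neither on $\beta$ nor on $G$, so the objective can be written as:

$$\sum_{i=1}^N w_i^{(m)} \exp(-\beta y_i G(x_i)) = e^{-\beta} \sum_{y_i = G(x_i)} w_i^{(m)} + e^{\beta} \sum_{y_i \neq G(x_i)} w_i^{(m)}$$

For any $\beta > 0$, the best classifier is the one with the smallest weighted error:

$$G_m = \arg \min_G \sum_{i=1}^N w_i^{(m)} I(y_i \neq G(x_i))$$

Setting the derivative with respect to $\beta$ to zero gives:

$$\beta_m = \frac{1}{2} \log \frac{1 - \text{err}_m}{\text{err}_m}, \qquad \text{err}_m = \frac{\sum_{i=1}^N w_i^{(m)} I(y_i \neq G_m(x_i))}{\sum_{i=1}^N w_i^{(m)}}$$

Since $f_m(x) = f_{m-1}(x) + \beta_m G_m(x)$ and $-y_i G_m(x_i) = 2 I(y_i \neq G_m(x_i)) - 1$, the weights for the next step are:

$$w_i^{(m+1)} = w_i^{(m)} e^{-\beta_m y_i G_m(x_i)} = w_i^{(m)} e^{\alpha_m I(y_i \neq G_m(x_i))} e^{-\beta_m}, \qquad \alpha_m = 2 \beta_m$$

The factor $e^{-\beta_m}$ multiplies every weight equally and has no effect, which leaves the weight update of AdaBoost.M1.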

# Why Exponential Loss?

The principal attraction is computational: additive modeling with exponential loss leads to a simple modular reweighting algorithm.

The exponential loss also has a simple population minimizer:

$$f^*(x) = \arg \min_{f(x)} E_{y|x} \exp(-yf(x)) = \frac{1}{2} \log \frac{P(y=1|x)}{P(y=-1|x)}$$

Thus AdaBoost estimates one-half of the log-odds of $P(y=1|x)$, which justifies using its sign as the classification rule.  

An alternative is the binomial deviance (negative log-likelihood) loss:

$$l(y, f(x)) = \log (1 + e^{-2yf(x)})$$

At the population level, both criteria lead to the same solution $f^*(x)$, but this is not true for finite datasets.
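
The population result can be checked numerically at a single point $x$. The sketch below assumes a made-up value $P(y=1|x) = 0.8$ and minimizes both expected losses over a grid of candidate values for $f(x)$:

```python
import numpy as np

p = 0.8                               # assumed value of P(y = 1 | x)
f = np.linspace(-3, 3, 60001)         # candidate values of f(x), step 1e-4

# Expected losses over y in {-1, +1} at a fixed x
exp_risk = p * np.exp(-f) + (1 - p) * np.exp(f)
dev_risk = p * np.log(1 + np.exp(-2 * f)) + (1 - p) * np.log(1 + np.exp(2 * f))

f_exp = f[np.argmin(exp_risk)]
f_dev = f[np.argmin(dev_risk)]
half_log_odds = 0.5 * np.log(p / (1 - p))

print(f_exp, f_dev, half_log_odds)
```

Both minimizers agree with one-half of the log-odds up to the grid resolution.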

# Loss Functions and Robustness

## Robust Loss functions for classification

Deviance and exponential loss are both monotone decreasing functions of the margin $yf(x)$.  
With $G(x) = \text{sign}(f(x))$, observations with a positive margin are classified correctly, and those with a negative margin are misclassified.  
Any reasonable loss criterion should penalize negative margins more heavily than positive ones.  

The difference between deviance and exponential loss is how heavily they penalize large negative margins. The penalty grows linearly for the deviance, whereas it grows exponentially for the exponential loss. In noisy settings, for example with mislabeled training data, the deviance is therefore more robust.  

With $y \in \{-1, 1\}$, squared error is $(y - f(x))^2 = (1 - yf(x))^2$, which increases quadratically once $yf(x) > 1$. It therefore penalizes examples that are classified correctly with increasing certainty, which makes it a poor choice of loss for classification.  
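
To make the comparison concrete, the following sketch evaluates each loss on a few margin values (the grid of margins is arbitrary, chosen for the example):

```python
import numpy as np

margin = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])   # values of y * f(x)

losses = {
    "misclassification": (margin < 0).astype(float),
    "exponential": np.exp(-margin),
    "deviance": np.log(1 + np.exp(-2 * margin)),
    "squared error": (1 - margin) ** 2,   # (y - f)^2 with y in {-1, +1}
}

print(f"{'margin':>18}", margin)
for name, values in losses.items():
    print(f"{name:>18}", np.round(values, 3))
```

On the negative side the exponential loss grows much faster than the deviance, and on the positive side squared error is the only one that starts increasing again.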

The problem generalizes to K-class classification:
$$G(x) = \arg \max_{k} p_k(x)$$
with $p_k(x)$ the probability that $x$ belongs to class $k$:
$$p_k(x) = \frac{e^{f_k(x)}}{\sum_{l=1}^K e^{f_l(x)}}$$

We can use the K-class multinomial deviance loss function, the negative log-probability of the true class:

$$L(y, p(x)) = -\sum_{k=1}^K I(y = k) \log p_k(x)$$
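
A short sketch of the K-class quantities, on made-up scores $f_k(x)$ for three examples. Classes are indexed from 0 in the code, and `softmax` and `multinomial_deviance` are hypothetical helper names. The deviance is computed as the negative log-probability of the true class:

```python
import numpy as np

def softmax(f):
    """Row-wise class probabilities p_k(x) from scores f_k(x)."""
    z = f - f.max(axis=1, keepdims=True)      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def multinomial_deviance(y, f):
    """Negative log-probability of the true class, one value per example."""
    p = softmax(f)
    return -np.log(p[np.arange(len(y)), y])

# Made-up scores for 3 examples and K = 3 classes
f = np.array([[2.0, 0.5, -1.0],
              [0.0, 0.0, 0.0],
              [-1.0, 3.0, 0.5]])
y = np.array([0, 2, 2])                       # true classes

p = softmax(f)
G = p.argmax(axis=1)                          # predicted classes
loss = multinomial_deviance(y, f)

print(G)
print(np.round(loss, 3))
```

The third example is confidently assigned to the wrong class and receives the largest loss, while the second, with uniform probabilities, has loss $\log 3$.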

## Robust Loss functions for regression

For regression, squared error $L(y, f(x)) = (y - f(x))^2$ and absolute loss $L(y, f(x)) = |y - f(x)|$ have population minimizers $E(y|x)$ and $\text{median}(y|x)$ respectively. These coincide for symmetric error distributions, but the fitted models differ on finite datasets.  
Squared error loss places much more emphasis on observations with large residuals, which makes it far less robust to outliers. Absolute loss performs much better in these situations.  
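
The effect of a single outlier can be seen on the simplest possible model, a constant prediction. This sketch uses made-up targets and a grid search over the constant:

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # the last value is an outlier
c = np.linspace(0, 100, 10001)               # candidate constant predictions

sq_risk = ((y[:, None] - c[None, :]) ** 2).mean(axis=0)
abs_risk = np.abs(y[:, None] - c[None, :]).mean(axis=0)

best_sq = c[np.argmin(sq_risk)]     # lands on the mean of y
best_abs = c[np.argmin(abs_risk)]   # lands on the median of y

print(best_sq, y.mean())
print(best_abs, np.median(y))
```

The squared-error fit is dragged toward the outlier, while the absolute-loss fit stays with the bulk of the data.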

Another loss that resists outliers is the Huber loss, which is quadratic for small residuals and linear for large ones:
$$
L(y, f(x)) = 
\begin{cases}
    (y - f(x))^2 & \text{if } |y - f(x)| \leq \delta \\
    2 \delta |y - f(x)| - \delta^2 & \text{otherwise}
\end{cases}
$$
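
A direct transcription of this definition, as a sketch (the function name `huber` and the default `delta=1.0` are choices made for the example):

```python
import numpy as np

def huber(r, delta=1.0):
    """Huber loss of residuals r = y - f(x), following the two cases above."""
    abs_r = np.abs(r)
    return np.where(abs_r <= delta, r ** 2, 2 * delta * abs_r - delta ** 2)

residuals = np.array([0.1, 0.5, 1.0, 3.0, 10.0])

print(huber(residuals))        # quadratic up to delta, then linear
print(residuals ** 2)          # squared error, for comparison
```

The two branches agree at $|y - f(x)| = \delta$, so the loss is continuous, and a residual of 10 costs 19 instead of 100 under squared error.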

# Boosting Trees

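A tree partitions the input space into disjoint regions $R_j$, $j = 1, \dots, J$, and predicts a constant $\gamma_j$ in each of them:

$$T(x;\theta) = \sum_{j=1}^J \gamma_j I(x \in R_j)$$

The parameters $\theta = \{R_j, \gamma_j\}_1^J$ are found by minimizing the empirical risk:

$$\hat{\theta} = \arg \min_\theta \sum_{j=1}^J \sum_{x_i \in R_j} L(y_i, \gamma_j)$$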
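
As an illustration of this representation, the sketch below assumes the regions are already known (three intervals on a one-dimensional input, made up for the example) and only estimates the constants. For squared-error loss, the optimal $\gamma_j$ is the mean of the targets falling in $R_j$:

```python
import numpy as np

# Assumed regions on a 1-D input: R_0 = (-inf, 0], R_1 = (0, 2], R_2 = (2, inf)
edges = np.array([0.0, 2.0])

# Synthetic piecewise-constant data with noise
rng = np.random.default_rng(1)
x = rng.uniform(-2, 4, 300)
y = np.where(x <= 0, -1.0, np.where(x <= 2, 0.5, 2.0)) + 0.1 * rng.normal(size=300)

region = np.digitize(x, edges, right=True)                    # region index of each x_i
gamma = np.array([y[region == j].mean() for j in range(3)])   # squared-error optimum per region

def tree_predict(x_new):
    """T(x; theta) = sum_j gamma_j * I(x in R_j)."""
    return gamma[np.digitize(x_new, edges, right=True)]

print(np.round(gamma, 2))
print(tree_predict(np.array([-1.0, 1.0, 3.0])))
```

Finding the regions themselves is the hard part of the optimization; a tree-growing algorithm searches for them greedily.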

The boosted tree model is a sum of such trees:
$$f_M(x) = \sum_{m=1}^M T(x;\theta_m)$$

With a forward stagewise procedure, one must solve at each step:
$$\hat{\theta}_m = \arg \min_{\theta_m} \sum_{i=1}^N L(y_i, f_{m-1}(x_i) + T(x_i; \theta_m))$$
For squared-error loss, this simply amounts to fitting a new regression tree to the current residuals $y_i - f_{m-1}(x_i)$.  

For binary classification with exponential loss, we get the following criterion for each tree, with weights $w_i^{(m)} = \exp(-y_i f_{m-1}(x_i))$:
$$\hat{\theta}_m = \arg \min_{\theta_m} \sum_{i=1}^N w_i^{(m)} \exp (-y_i T(x_i;\theta_m))$$
This criterion can be implemented by modifying the splitting criterion of the classical tree-growing algorithms.  
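
A sketch of what such a weighted splitting criterion could look like for the smallest possible tree, a one-dimensional stump with outputs in $\{-1, +1\}$. The helper `fit_exp_stump`, the data, and both weight vectors are made up for the example; a real tree-growing algorithm would apply the same idea recursively:

```python
import numpy as np

def fit_exp_stump(x, y, w):
    """Search 1-D stumps with outputs in {-1, +1} for the smallest weighted exponential loss."""
    best = None
    for t in np.unique(x):
        for s in (1, -1):
            pred = np.where(x <= t, -s, s)           # T(x): -s on the left, +s on the right
            loss = np.sum(w * np.exp(-y * pred))     # sum_i w_i exp(-y_i T(x_i))
            if best is None or loss < best[0]:
                best = (loss, t, s)
    return best

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([-1, 1, -1, 1])

w_a = np.array([0.4, 0.3, 0.2, 0.1])   # made-up weights for one boosting round
w_b = np.array([0.1, 0.1, 0.7, 0.1])   # made-up weights where the example x = 3 dominates

_, t_a, s_a = fit_exp_stump(x, y, w_a)
_, t_b, s_b = fit_exp_stump(x, y, w_b)

print(t_a, s_a)
print(t_b, s_b)
```

The data are the same in both calls; only the weights change, and the selected split moves from $x \leq 1$ to $x \leq 3$ so that the heavily weighted example is classified correctly.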

Using other losses such as the absolute error, the Huber loss, or the deviance gives more robust models, but there are no simple, fast algorithms to optimize them.

# Numerical Optimization via Gradient Boosting