In [1]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_text

## The AdaBoost algorithm
AdaBoost is a simple yet powerful algorithm for binary classification. It trains a sequence of weak learners, usually decision stumps, each of which only needs to predict slightly better than random guessing. The weak learners are then combined into a strong classifier by weighting each one's prediction according to its accuracy.

- Initialize the weight of every sample $j$ as $w_j = 1/N$
- For every training iteration $i$:
  - Fit a stump $T_i(x)$, i.e. a depth-1 decision tree that splits on a single feature, using the current sample weights
  - Compute the weighted misclassification error $err_i = \frac{\sum_j w_j \, \mathbb{1}[T_i(x_j) \neq y_j]}{\sum_j w_j}$
  - Compute $\alpha_i = \log(\frac{1-err_i}{err_i})$
  - Update the weights of the misclassified samples: $w_j \leftarrow w_j \exp(\alpha_i)$ for every misclassified sample $j$
- Prediction = $\mathrm{sign}(\sum_i \alpha_i T_i(x))$

Let's go through the algorithm step by step. But first let's create some mock data.

In [2]:
data = {
    "x_1": [1,2,3,4,2,6,7,8,9,10],
    "x_2": [1,2,0,0,2,0,2,2,2,2],
    "y": [-1,-1,-1,-1,1,1,1,1,1,1]
}

df = pd.DataFrame(data)
df

Unnamed: 0,x_1,x_2,y
0,1,1,-1
1,2,2,-1
2,3,0,-1
3,4,0,-1
4,2,2,1
5,6,0,1
6,7,2,1
7,8,2,1
8,9,2,1
9,10,2,1


As you've probably noticed, instead of 0, we used -1 to represent negative labels. Just hold on to that for now. We'll come back to this later.

## Initialize sample weights
- AdaBoost assigns a weight to each observation. After each iteration, the samples that the current stump misclassified receive larger weights, which tells the next stump to pay more attention to them.
- We start by giving every sample the same weight, 1/N.

In [3]:
# Initialize sample weights
df['weight'] = 1/len(df) 
df

Unnamed: 0,x_1,x_2,y,weight
0,1,1,-1,0.1
1,2,2,-1,0.1
2,3,0,-1,0.1
3,4,0,-1,0.1
4,2,2,1,0.1
5,6,0,1,0.1
6,7,2,1,0.1
7,8,2,1,0.1
8,9,2,1,0.1
9,10,2,1,0.1


## Fit stumps
- Fit a decision stump, which is just a decision tree with a single split (a root node and 2 leaves). When fitting the stump we pass in the sample weights, so that the tree places more emphasis on observations with larger weights.
- This is very important: if the sample weights never changed, every stump would be fit to the same weighted data and would learn the same decision boundary.
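To see why the weights matter, here is a minimal sketch that fits the same stump on the mock data with uniform weights (twice) and then with the weight of the sample at index 4 raised to 0.9, the value it receives after the first weight update later in this walkthrough. The helper name `root_split` is made up for this illustration, and `random_state=0` is only there to keep the run reproducible.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Mock data from this walkthrough
X = np.array([[1, 1], [2, 2], [3, 0], [4, 0], [2, 2],
              [6, 0], [7, 2], [8, 2], [9, 2], [10, 2]])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1, 1, 1])

def root_split(weights):
    """Fit a stump and return (feature index, threshold) of its only split."""
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=weights)
    return int(stump.tree_.feature[0]), float(stump.tree_.threshold[0])

uniform = np.full(len(y), 0.1)
reweighted = uniform.copy()
reweighted[4] = 0.9  # emphasize the sample at index 4

uniform_split = root_split(uniform)
uniform_split_again = root_split(uniform)
reweighted_split = root_split(reweighted)
print(uniform_split, uniform_split_again, reweighted_split)
```

On this data, the two uniform fits should pick the same split on `x_1`, while raising a single weight should be enough to move the split to `x_2`.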

In [4]:
# Fit a stump
stump_1 = DecisionTreeClassifier(max_depth=1)
stump_1.fit(df[['x_1', 'x_2']], df['y'], sample_weight=df['weight'])

# Visualize stump
print(export_text(stump_1))

|--- feature_0 <= 5.00
|   |--- class: -1
|--- feature_0 >  5.00
|   |--- class: 1



In [5]:
# Use the weak learner to make predictions
df['t_1_pred'] = stump_1.predict(df[['x_1', 'x_2']])

# Identify misclassified observations
df['t_1_pred_mis'] = (df['t_1_pred'] != df['y']).astype(int)

df

Unnamed: 0,x_1,x_2,y,weight,t_1_pred,t_1_pred_mis
0,1,1,-1,0.1,-1,0
1,2,2,-1,0.1,-1,0
2,3,0,-1,0.1,-1,0
3,4,0,-1,0.1,-1,0
4,2,2,1,0.1,-1,1
5,6,0,1,0.1,1,0
6,7,2,1,0.1,1,0
7,8,2,1,0.1,1,0
8,9,2,1,0.1,1,0
9,10,2,1,0.1,1,0


## Compute weighted error and alpha
- In each iteration we compute the weighted misclassification error and alpha, which are then used to update the sample weights.
- Alpha can be understood as the weight of the decision stump: how much say that stump has in the final prediction.
- From the formula $\alpha = \log(\frac{1-err}{err})$ we can see that alpha decreases as the error grows: the larger the error a stump makes, the less say it has in the final decision.
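A quick numeric check makes the relationship concrete. The sketch below evaluates the formula at a few error rates; `stump_alpha` is a hypothetical helper written for this illustration, not part of any library.

```python
import numpy as np

def stump_alpha(error):
    """alpha = log((1 - err) / err), for a weighted error strictly between 0 and 1."""
    return np.log((1 - error) / error)

# Alpha shrinks as the error grows and changes sign at err = 0.5
for err in [0.1, 0.3, 0.5, 0.7]:
    print(f"err={err:.1f} -> alpha={stump_alpha(err):+.3f}")
```

An error of 0.5 (random guessing) gives an alpha of 0, so that stump has no say. Errors below 0.5 give a positive alpha, and errors above 0.5 give a negative one, which effectively flips the stump's vote. An error of exactly 0 makes the formula undefined, so practical code usually has to guard against it.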

In [6]:
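def compute_error_and_alpha(tree='t_1'):
    # Note: reads the notebook-level DataFrame `df`, which must already
    # contain the `weight` and `{tree}_pred_mis` columns
    # Weighted error = total weight of misclassified samples / total weight
    error = (df['weight']*df[f'{tree}_pred_mis']).sum() / df['weight'].sum()
    
    # Compute alpha based on error (undefined if the error is exactly 0)
    alpha = np.log((1-error)/error)
    
    return error, alpha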
    

In [7]:
error_1, alpha_1 = compute_error_and_alpha(tree='t_1')
print(error_1, alpha_1)

0.1 2.1972245773362196


## Update sample weights
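- Using alpha, the sample weights are updated so that misclassified observations get larger weights.
- Specifically, only the weights of misclassified observations change: each is multiplied by $\exp(\alpha)$. The weights of correctly classified observations stay as they are. Because the weights are not renormalized afterwards, the error computation divides by the sum of all weights.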
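As a worked example with the numbers from the first stump (a weighted error of 0.1 and one misclassified sample with weight 0.1), the update amounts to multiplying the weight by $(1-err)/err$:

$$w \leftarrow w \cdot \exp(\alpha_1) = 0.1 \cdot \exp\left(\log\frac{1-0.1}{0.1}\right) = 0.1 \cdot 9 = 0.9$$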

In [8]:
# Only update weights of misclassified observations
df['weight'] = np.where(df['t_1_pred_mis'].values==1, df['weight']*np.exp(alpha_1), df['weight'].values)
df

Unnamed: 0,x_1,x_2,y,weight,t_1_pred,t_1_pred_mis
0,1,1,-1,0.1,-1,0
1,2,2,-1,0.1,-1,0
2,3,0,-1,0.1,-1,0
3,4,0,-1,0.1,-1,0
4,2,2,1,0.9,-1,1
5,6,0,1,0.1,1,0
6,7,2,1,0.1,1,0
7,8,2,1,0.1,1,0
8,9,2,1,0.1,1,0
9,10,2,1,0.1,1,0


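## Repeat the above!
The helper below bundles one full iteration (fit a stump, flag the misclassified samples, compute alpha, update the weights) so that each additional round is a single call.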

In [9]:
def update_weight(df, iter=2):
    # Fit a decision stump
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(df[['x_1', 'x_2']], df['y'], sample_weight=df['weight'])
    df[f't_{iter}_pred'] = stump.predict(df[['x_1', 'x_2']])
    
    # Identify misclassified observations
    df[f't_{iter}_pred_mis'] = (df[f't_{iter}_pred'] != df['y']).astype(int) # used as a mask
    
    # Compute error and alpha
    error, alpha = compute_error_and_alpha(tree=f't_{iter}')
    
    # Update sample weights
    df['weight'] = np.where(df[f't_{iter}_pred_mis'].values==1, df['weight']*np.exp(alpha), df['weight'].values)
    
    return df, stump, alpha

In [10]:
df, stump_2, alpha_2 = update_weight(df, iter=2)

In [11]:
df

Unnamed: 0,x_1,x_2,y,weight,t_1_pred,t_1_pred_mis,t_2_pred,t_2_pred_mis
0,1,1,-1,0.1,-1,0,-1,0
1,2,2,-1,0.8,-1,0,1,1
2,3,0,-1,0.1,-1,0,-1,0
3,4,0,-1,0.1,-1,0,-1,0
4,2,2,1,0.9,-1,1,1,0
5,6,0,1,0.8,1,0,-1,1
6,7,2,1,0.1,1,0,1,0
7,8,2,1,0.1,1,0,1,0
8,9,2,1,0.1,1,0,1,0
9,10,2,1,0.1,1,0,1,0


## Inference
After 2 training iterations, we end up with 2 decision stumps. Now how do we use them to make predictions?

We'll ensemble these decision stumps to make our final predictions. Specifically, we'll take the sum of the predictions made by each weak classifier weighted by their corresponding importance, i.e. $\alpha$, and then take the sign of the sum as the final prediction.

$$\mathrm{sign}\left(\sum_i \alpha_i T_i(x)\right)$$

Remember at the beginning we used -1 to represent negative samples instead of 0? This is why: by using -1 for negative labels, the sign function can be used directly to convert the weighted sum of weak classifiers' outputs into a final prediction, without any additional calculations.
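If your labels come as 0/1, they can be remapped before training. The sketch below shows one way to do that and then votes by hand for a single sample. The alpha values and stump votes are made-up numbers for illustration, not results from this notebook.

```python
import numpy as np

# Map 0/1 labels to -1/+1 so that the sign of a weighted sum is a valid label
y01 = np.array([0, 0, 1, 1])
y_pm = 2 * y01 - 1

# Hypothetical ensemble of three stumps voting on a single sample
alphas = np.array([2.0, 0.5, 1.0])   # say of each stump (illustrative values)
votes = np.array([-1, 1, 1])         # each stump's prediction in {-1, +1}

score = float(np.sum(alphas * votes))  # weighted sum of the votes
final = 1 if score > 0 else -1         # sign of the sum is the prediction
print(y_pm, score, final)
```

Here the single high-alpha stump outvotes the two weaker ones, so the final prediction is negative even though two of the three stumps voted positive.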

In [12]:
stumps = [stump_1, stump_2]
alphas = [alpha_1, alpha_2]

# Initialize prediction array with zeros
pred = np.zeros(len(df))

# Iterate over the stumps to add weighted predictions to the prediction array
for s, a in zip(stumps, alphas):
    pred += a * s.predict(df[['x_1', 'x_2']])

# Take sign of the prediction array as the final prediction
pred = np.where(pred > 0, 1, -1)

In [13]:
pred

array([-1, -1, -1, -1, -1,  1,  1,  1,  1,  1])
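
The whole procedure can be condensed into two small functions. This is a minimal sketch of the steps walked through above, not a replacement for a library implementation: the names `fit_adaboost` and `predict_adaboost`, the `eps` clipping of the error (to avoid a division by zero when a stump is perfect), and `random_state=0` are assumptions added here.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_adaboost(X, y, n_rounds=2, eps=1e-10):
    """Fit `n_rounds` stumps on labels in {-1, +1}; return (stumps, alphas)."""
    w = np.full(len(y), 1 / len(y))           # equal initial weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X, y, sample_weight=w)
        miss = stump.predict(X) != y           # boolean mask of mistakes
        err = np.sum(w[miss]) / np.sum(w)      # weighted error
        err = np.clip(err, eps, 1 - eps)       # assumption: guard against err of 0 or 1
        alpha = np.log((1 - err) / err)
        w = np.where(miss, w * np.exp(alpha), w)  # boost only the mistakes
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def predict_adaboost(stumps, alphas, X):
    """Sign of the alpha-weighted sum of stump votes (a zero sum maps to -1)."""
    score = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.where(score > 0, 1, -1)

# Mock data from this walkthrough
X = np.array([[1, 1], [2, 2], [3, 0], [4, 0], [2, 2],
              [6, 0], [7, 2], [8, 2], [9, 2], [10, 2]])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1, 1, 1])

stumps, alphas = fit_adaboost(X, y, n_rounds=2)
pred = predict_adaboost(stumps, alphas, X)
print(pred)
```

With two rounds on the mock data this should reproduce the predictions shown above. Increasing `n_rounds` adds more stumps without changing the rest of the code.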