# Ensemble methods. Boosting


## AdaBoost

AdaBoost consists of the following steps:
* initialize the weights to $w_{n}^{(1)}=\frac{1}{N}$, where $N$ is the number of datapoints,
* repeat while $\varepsilon_{t}<\frac{1}{2}$ and the maximum number of iterations $T$ has not been reached:
  * train a classifier on $\{S,w^{(t)}\}$ and get a hypothesis $h_{t}(x_{n})$ for datapoints $x_{n}$,
  * compute the error $\varepsilon_{t}=\sum_{n=1}^{N}w_{n}^{(t)}I(y_{n}\neq h_{t}(x_{n}))$,
  * set $\alpha_{t}=\log(\frac{1-\varepsilon_{t}}{\varepsilon_{t}})$,
  * update the weights $w_{n}^{(t+1)}=\frac{w_{n}^{(t)}\exp(\alpha_{t}I(y_{n}\neq h_{t}(x_{n})))}{Z_{t}}$, where $Z_{t}$ is a normalization constant,
* output $f(x)=\text{sign}(\sum_{t=1}^{T}\alpha_{t}h_{t}(x))$, where the hypotheses take values in $\{-1,+1\}$.
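
The steps above can be traced by hand for a single round. The following is a minimal sketch, not part of the original example, assuming four datapoints of which the weak learner misclassifies only the last one; the `miss` vector is made up for illustration.

```python
import numpy as np

# Hypothetical toy setup: N = 4 datapoints, uniform initial weights 1/N.
weights = np.full(4, 1 / 4)

# Assumed outcome of the weak learner: 1 marks a misclassified datapoint.
miss = np.array([0, 0, 0, 1])

# Weighted error: total weight of the misclassified datapoints.
epsilon = np.dot(weights, miss)

# Classifier weight alpha = log((1 - epsilon) / epsilon).
alpha = np.log((1 - epsilon) / epsilon)

# Scale the misclassified weights by exp(alpha), then normalize by Z.
unnormalized = weights * np.exp(alpha * miss)
new_weights = unnormalized / unnormalized.sum()

print(epsilon)      # 0.25
print(alpha)        # log(3), about 1.0986
print(new_weights)  # about [1/6, 1/6, 1/6, 1/2]
```

The single misclassified datapoint ends up with half of the total weight, so the next weak learner is pushed to classify it correctly.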
  
Example taken from Marsland, Machine Learning: https://seat.massey.ac.nz/personal/s.r.marsland/MLBook.html.


First, we need to import libraries:

In [None]:
import numpy as np
from sklearn.tree import DecisionTreeClassifier

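We need a bigger data set for this example, so let's implement a data generation function. Features are drawn uniformly from $[0,1)$ and labels are drawn uniformly at random, independently of the features: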

In [None]:
def generate_data(sample_number, feature_number, label_number):
    data_set = np.random.random_sample((sample_number, feature_number))
    labels = np.random.choice(label_number, sample_number)
    return data_set, labels
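
Because the generator draws from NumPy's global random state, every run produces different data. If reproducibility matters, one option is a seeded variant. This is a sketch only: `generate_data_seeded` and its `seed` parameter are hypothetical names that do not appear in the original example.

```python
import numpy as np

def generate_data_seeded(sample_number, feature_number, label_number, seed=0):
    # Hypothetical seeded variant of generate_data (not in the original notebook).
    rng = np.random.default_rng(seed)
    # Features: uniform in [0, 1), shape (sample_number, feature_number).
    data_set = rng.random((sample_number, feature_number))
    # Labels: integers in {0, ..., label_number - 1}, independent of the features.
    labels = rng.integers(0, label_number, size=sample_number)
    return data_set, labels

toy_x, toy_y = generate_data_seeded(5, 2, 2)
print(toy_x.shape)  # (5, 2)
print(toy_y.shape)  # (5,)
```

Calling the function twice with the same `seed` returns identical arrays.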

Variables that are used by the classifier:

In [None]:
labels = 2
dimension = 2
test_set_size = 1000
train_set_size = 5000
train_set, train_labels = generate_data(train_set_size, dimension, labels)
test_set, test_labels = generate_data(test_set_size, dimension, labels)

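Weights initialization. There is one weight per sample of the set the weak learners are fitted on, which in this notebook is `test_set`, so $N$ equals `test_set_size`: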

In [None]:
number_of_iterations = 10
weights = np.ones((test_set_size,)) / test_set_size

In [None]:
def train_model(classifier, weights):
    return classifier.fit(X=test_set, y=test_labels, sample_weight=weights)

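Misclassification indicator $I(y_{n}\neq h_{t}(x_{n}))$, used in the error and in the weight update: 0 - the prediction is correct, so the weight is not changed; 1 - the prediction is wrong, so the weight is multiplied by $\exp(\alpha_{t})$.

In [None]:
def calculate_accuracy_vector(predicted, labels):
    # Despite the name, this marks errors: 1 where predicted != label, else 0.
    return (np.asarray(predicted) != np.asarray(labels)).astype(int)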

Calculate the error rate $\varepsilon_{t}=\sum_{n=1}^{N}w_{n}^{(t)}I(y_{n}\neq h_{t}(x_{n}))$:

In [None]:
def calculate_error(weights, model):
    predicted = model.predict(test_set)
    # Weighted error: sum of the weights of the misclassified samples.
    return np.dot(weights, calculate_accuracy_vector(predicted, test_labels))
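
To see how the weighted error differs from a plain misclassification rate, here is a small sketch with made-up labels, predictions, and weights. All values are illustrative assumptions, not taken from the data set above.

```python
import numpy as np

# Hypothetical labels and predictions for five datapoints.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 0])

# Indicator I(y != h(x)): 1 where the prediction is wrong.
miss = (y_true != y_pred).astype(int)

# Uniform weights reproduce the plain misclassification rate (2 of 5 wrong).
uniform = np.full(5, 1 / 5)
uniform_error = np.dot(uniform, miss)

# Non-uniform weights emphasize some datapoints: the two errors now carry
# a total weight of 0.5 + 0.2.
skewed = np.array([0.1, 0.1, 0.5, 0.1, 0.2])
skewed_error = np.dot(skewed, miss)

print(uniform_error)  # about 0.4
print(skewed_error)   # about 0.7
```

The same predictions give a much larger error once the weight is concentrated on the datapoints that were misclassified.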

Calculate the $\alpha_{t}=\log(\frac{1-\varepsilon_{t}}{\varepsilon_{t}})$:

In [None]:
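def set_alpha(error_rate):
    # Clip to avoid division by zero or log(0) when the error is exactly 0 or 1.
    error_rate = np.clip(error_rate, 1e-10, 1 - 1e-10)
    return np.log((1 - error_rate) / error_rate)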
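
To get a feeling for the formula, the sketch below evaluates $\alpha_{t}$ for a few illustrative error rates (the values in `error_rates` are assumptions chosen for the example). The lower the error, the larger the classifier's say in the final vote; at $\varepsilon_{t}=\frac{1}{2}$ the classifier gets no say at all.

```python
import numpy as np

# Illustrative error rates; 0.5 corresponds to a weak learner at chance level.
error_rates = [0.1, 0.25, 0.4, 0.5]

# alpha = log((1 - epsilon) / epsilon) for each error rate.
alpha_table = {eps: float(np.log((1 - eps) / eps)) for eps in error_rates}

for eps, alpha_value in alpha_table.items():
    print(f"epsilon={eps:.2f} -> alpha={alpha_value:.4f}")
```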

Calculate the new weights $w_{n}^{(t+1)}=\frac{w_{n}^{(t)}\exp(\alpha_{t}I(y_{n}\neq h_{t}(x_{n})))}{Z_{t}}$:

In [None]:
def set_new_weights(old_weights, alpha, model):
    # 1 for misclassified samples, 0 for correctly classified ones.
    misclassified = calculate_accuracy_vector(model.predict(test_set), test_labels)
    new_weights = old_weights * np.exp(np.multiply(alpha, misclassified))
    # Z_t normalizes the weights so that they sum to 1.
    Zt = np.sum(new_weights)
    return new_weights / Zt
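
A useful sanity check on this update: with $\alpha_{t}=\log(\frac{1-\varepsilon_{t}}{\varepsilon_{t}})$, the misclassified datapoints carry a total unnormalized weight of $\varepsilon_{t}\cdot\frac{1-\varepsilon_{t}}{\varepsilon_{t}}=1-\varepsilon_{t}$, the same as the correctly classified ones, so after normalization each group holds half of the total weight. The sketch below checks this numerically; the weights and the error pattern are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical normalized weights and misclassification pattern for 10 datapoints.
weights = rng.random(10)
weights /= weights.sum()
miss = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])

# Weighted error and the resulting classifier weight.
epsilon = np.dot(weights, miss)
alpha = np.log((1 - epsilon) / epsilon)

# Weight update followed by normalization.
new_weights = weights * np.exp(alpha * miss)
new_weights /= new_weights.sum()

# Total weight on the misclassified datapoints after the update.
misclassified_mass = np.dot(new_weights, miss)
print(misclassified_mass)  # approximately 0.5
```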

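Now it's time to run the boosting loop and check the weights. Each round fits a fresh decision stump, so that `classifiers` holds one distinct model per round instead of repeated references to a single refitted object:

In [None]:
alphas = []
classifiers = []
for iteration in range(number_of_iterations):
    # A new stump per round; refitting one shared object would overwrite earlier rounds.
    classifier = DecisionTreeClassifier(max_depth=1, random_state=1)
    model = train_model(classifier, weights)
    error_rate = calculate_error(weights, model)
    # Stop once the weak learner is no better than chance.
    if error_rate >= 0.5:
        break
    alpha = set_alpha(error_rate)
    weights = set_new_weights(weights, alpha, model)
    alphas.append(alpha)
    classifiers.append(model)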

print(weights)

We need to generate a new data point to validate the model:

In [None]:
validate_x, validate_label = generate_data(1, dimension, labels)

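Calculate the predicted label $f(x)=\text{sign}(\sum_{t=1}^{T}\alpha_{t}h_{t}(x))$. The sign is only meaningful for hypotheses in $\{-1,+1\}$, so the $\{0,1\}$ outputs of the classifiers are mapped to $\{-1,+1\}$ first:

In [None]:
def get_prediction(x):
    # Map each classifier's {0, 1} output to {-1, +1}; shape (T, n_samples).
    votes = np.array([2 * model.predict(x) - 1 for model in classifiers])
    # Weighted vote per sample: -1 stands for label 0, +1 for label 1.
    return np.sign(np.dot(alphas, votes))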

And test it for the validation data:

In [None]:
prediction = get_prediction(validate_x)

print(prediction)
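
To make the weighted vote concrete, here is a small sketch with three hypothetical weak classifiers and made-up $\alpha_{t}$ values; none of these numbers come from the run above, and `toy_alphas` and `toy_votes` are illustrative names. It shows how one confident classifier can outvote two weaker ones.

```python
import numpy as np

# Hypothetical classifier weights alpha_t for T = 3 rounds.
toy_alphas = np.array([1.2, 0.4, 0.5])

# Made-up predictions in {-1, +1}: one row per classifier, one column per datapoint.
toy_votes = np.array([
    [+1, -1, +1, -1],
    [-1, -1, +1, +1],
    [-1, +1, +1, +1],
])

# Weighted sum over the classifiers for every datapoint.
scores = np.dot(toy_alphas, toy_votes)

# Final ensemble prediction: the sign of the weighted sum.
toy_prediction = np.sign(scores)

print(scores)          # about 0.3, -1.1, 2.1, -0.3
print(toy_prediction)  # signs: +1, -1, +1, -1
```

For the first datapoint two of the three classifiers vote $-1$, yet the ensemble predicts $+1$ because the first classifier's weight (1.2) exceeds the combined weight of the other two (0.9).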