# Building Adaboost

### Introduction

In the last lesson, we saw the main components of adaboost.  We saw that:

1. We make predictions through a set of classifiers taking a weighted vote for each observation.

> $H(x) = sign\bigg(\alpha_1 h_1(x) +\alpha_2 h_2(x) +\alpha_3 h_3(x) \bigg) $

2. We train each classifier by giving more weight to observations that were previously misclassified.

Then we saw that these two features interact, as 

1. The value $\alpha$ for a classifier is determined by a weighted accuracy score, and 
2. The weight of each observation is partially determined by the value of $\alpha$, with even more weight assigned to observations misclassified by generally accurate estimators (those with a large $\alpha$). 

In this lesson, we'll build out an adaboost classifier from scratch.

### Loading our Data

In [12]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer


cancer = load_breast_cancer()

X = pd.DataFrame(cancer['data'], columns = cancer['feature_names'])

bool_y = pd.Series(cancer['target'] == 0).astype('int')

> Now let's convert the y data to -1 for a negative observation (benign), and +1 for a positive observation (malignant).

In [111]:
import numpy as np
y = np.where(bool_y == 0, -1, 1)
y[:40]

array([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1, -1, -1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1, -1,  1,  1])

In [112]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, test_size = .2)

In [113]:
X[:2]

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |


### Beginning our Algorithm

The first step is that we take a weighted sum of the classifiers.

> $H(x) = sign\bigg(\alpha_1 h_1(x) +\alpha_2 h_2(x) +\alpha_3 h_3(x) \bigg) $

So, we can train a set of classifiers.

In [165]:
from sklearn.tree import DecisionTreeClassifier

dtcs = [DecisionTreeClassifier(random_state = i, max_depth = 1, max_features = .1).fit(X_train, y_train) for i in range(3)]

In [166]:
errors = [1 - dtc.score(X_train, y_train) for dtc in dtcs]
errors

[0.07912087912087917, 0.08131868131868136, 0.10989010989010994]

In [167]:
import numpy as np
def alpha(error_t):
    return .5*np.log((1 - error_t)/error_t)

In [168]:
alphas = [alpha(error) for error in errors]
alphas[:10]

[1.2271759907330135, 1.2122817599402658, 1.0459320308391964]

Now we won't go too far into the formula for calculating alpha.  But the main component to see is that the smaller our error is, the larger alpha becomes.

$\alpha_t = \frac{1}{2}\ln \frac{1 - \epsilon_t}{\epsilon_t}$
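
To get a feel for the formula, here is a minimal sketch that evaluates it at a few made-up error rates. The helper name `alpha_from_error` is illustrative (it mirrors the `alpha` function above but uses `math.log`, so it runs without NumPy). An error of 0.5, no better than a coin flip, gives an alpha of zero, and an error above 0.5 gives a negative alpha.

```python
import math

def alpha_from_error(error_t):
    # Same formula as above: alpha = 0.5 * ln((1 - error) / error)
    return 0.5 * math.log((1 - error_t) / error_t)

# Hypothetical error rates, chosen only for illustration
for error_t in [0.05, 0.25, 0.5, 0.75]:
    print(error_t, round(alpha_from_error(error_t), 3))
```

The smaller the error, the larger the vote that classifier receives in the weighted sum.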

We'll add in weighing the observations in the next section.  But for now, with our three models trained and our alphas calculated for each one, we can perform our weighted prediction:

> $H(x) = sign\bigg(\alpha_1 h_1(x) +\alpha_2 h_2(x) +\alpha_3 h_3(x) \bigg) $

In [80]:
def predict(dtcs, alphas, X):
    preds = np.vstack([alpha*dtc.predict(X) for dtc, alpha in zip(dtcs, alphas)])
    cum_preds = preds.sum(axis = 0)
    return np.where(cum_preds > 0, 1, -1)

In [85]:
y_hat = predict(dtcs, alphas, X_test)
y_hat

array([ 1, -1, -1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,
        1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1, -1, -1,  1,  1,
       -1, -1,  1, -1,  1, -1, -1, -1,  1,  1,  1, -1,  1, -1,  1,  1, -1,
       -1,  1,  1,  1, -1,  1,  1,  1, -1,  1, -1, -1, -1, -1,  1,  1, -1,
        1,  1, -1,  1, -1,  1,  1,  1,  1,  1,  1,  1, -1,  1, -1, -1,  1,
        1, -1, -1,  1,  1, -1,  1,  1, -1,  1,  1,  1,  1, -1,  1,  1,  1,
        1, -1,  1,  1, -1,  1, -1, -1,  1,  1,  1, -1])
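
To see what the weighting buys us over a plain majority vote, here is a small sketch with made-up alphas and predictions (the names `alphas_demo`, `preds_demo`, `votes`, and `majority` are illustrative and not part of the lesson's code). On the first observation, the one classifier with a large alpha outvotes the two weaker classifiers that disagree with it.

```python
import numpy as np

# Hypothetical alphas and predictions, made up for illustration
alphas_demo = np.array([1.2, 0.5, 0.4])
preds_demo = np.array([
    [ 1,  1, -1],   # h_1 on three observations
    [-1,  1, -1],   # h_2 on the same observations
    [-1, -1,  1],   # h_3 on the same observations
])

# Weighted vote: sum alpha_t * h_t(x) over classifiers, then take the sign
scores = alphas_demo @ preds_demo
votes = np.where(scores > 0, 1, -1)

# Unweighted majority vote, for comparison
majority = np.sign(preds_demo.sum(axis=0))

print(votes)     # weighted vote per observation
print(majority)  # plain majority vote per observation
```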

### Weighing Observations

So in the last section, we coded our weighted sum of three different estimators.  But we did not train each estimator according to the adaboost procedure.  Remember, with adaboost we train each estimator successively, weighting samples more heavily if they were previously misclassified.  Let's begin training our classifiers with weights assigned to our observations.

For our first classifier, we'll assign the same weight to each sample.

In [108]:
import numpy as np
weight_value = 1/y_train.shape[0]

w_t = np.full(y_train.shape[0], weight_value)
w_t[:4]

array([0.0021978, 0.0021978, 0.0021978, 0.0021978])

We assign an initial weight of $w_1 = \frac{1}{n}$, so that every weight is the same and the weights sum to one.  Then, we pass these weights to the classifier as we train it.

In [145]:
dtc = DecisionTreeClassifier(max_depth = 2, random_state = 1).fit(X_train, y_train, sample_weight = w_t)

Then we make predictions, and find the error for our classifier, but we weigh the error by the sample weight. 

In [146]:
y_hat = dtc.predict(X_train)

In [147]:
(y_hat != y_train).mean()

0.046153846153846156

In [148]:
correct_incorrect = (y_hat != y_train).astype('int')
(w_t*correct_incorrect).sum()/y_hat.shape[0]

0.00010143702451394758

In [149]:
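def error(y_hat, y_actual, w_t):
    # 1 where the prediction is wrong, 0 where it is right
    correct_incorrect = (y_hat != y_actual).astype('int')
    # weighted share of mistakes, divided by the number of samples as in the cell above
    return (w_t*correct_incorrect).sum()/y_hat.shape[0]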

In [150]:
error_t1 = error(y_hat, y_train, w_t)
error_t1

0.00010143702451394758

And remember, from the error, we can calculate alpha.  The lower the error, the higher the value of alpha.

In [169]:
alpha(error_t1)

4.59798547900444
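
In the textbook formulation of adaboost, where the weights are kept normalized to sum to one, the weighted error is usually taken to be just the total weight on the misclassified observations, with no further division by the number of samples. Dividing again by n, as the cells above do, shrinks the error and so inflates alpha. Here is a minimal sketch of the textbook version on made-up arrays (the names `weighted_error`, `y_demo`, `y_hat_demo`, `uniform`, and `skewed` are illustrative and not part of the lesson's code):

```python
import numpy as np

def weighted_error(y_hat, y_actual, w):
    # Total weight on the misclassified observations.
    # Assumes w is non-negative and already sums to one.
    return float(w[y_hat != y_actual].sum())

# Made-up labels and predictions with a single mistake at index 1
y_demo = np.array([1, 1, -1, -1, 1])
y_hat_demo = np.array([1, -1, -1, -1, 1])

uniform = np.full(5, 1 / 5)                    # every observation counts equally
skewed = np.array([0.1, 0.6, 0.1, 0.1, 0.1])   # the missed observation carries most weight

print(weighted_error(y_hat_demo, y_demo, uniform))
print(weighted_error(y_hat_demo, y_demo, skewed))
```

With uniform weights this reduces to the ordinary misclassification rate, while with skewed weights the same single mistake costs far more.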

### Finding our New Weights

So far, we've completed a cycle of training and then finding the corresponding alpha value for our decision tree classifier.  But we still haven't covered how to update our weights.

Remember that we want to assign a higher weight to those that were classified incorrectly, and a lower weight to those that were classified correctly.

This is the formula we'll use.

$w_t = w_{t - 1}*e^{-\alpha*y_i*(h_{t-1})} $

Let's break this formula down.  For now, let's remove the $\alpha$ term, so that we have:

$w_t = w_{t - 1}*e^{-y_i*h(x)_{t-1}} $

Now the $y*h(x)_{t-1}$ term acts like an indicator of whether the previous classifier was right.  
> * When we incorrectly classify an observation it equals $y*h(x)_{t-1} = -1*1 = -1$ or $1*-1 = -1$, and
> * When we correctly classify an observation, it equals $1*1 = 1$ or $-1*-1 = 1$.

So when an observation was previously classified correctly, we multiply its weight by $e^{-1} \approx 0.37$, giving $w_{t-1}*e^{-1}$, and when it was classified incorrectly, we multiply its weight by $e^{1} \approx 2.72$, giving $w_{t-1}*e^{1}$. 

Finally, in the final version we add in the $\alpha$ term $w_t = w_{t - 1}*e^{-\alpha*y_i*(h_{t-1})} $.  We can think of this as accentuating our weighting effect based on the size of the $\alpha$ term, the importance of the classifier.

So observations correctly classified by a more accurate classifier are decreased in weight further, and those incorrectly classified by it are increased in weight even more.

In [170]:
import pandas as pd
df = pd.DataFrame({'obs +': ['w --', 'w -'],
              'obs -': ['w ++ ', 'w +']}, 
             index = ['model +', 'model -'])
df

| | obs + | obs - |
|---|---|---|
| model + | w -- | w ++ |
| model - | w - | w + |


In [None]:
# w = w*np.exp(-alpha*y_train*y_hat)

Finally, to ensure our weights add up to one, we simply divide each weight by the sum of all the weights. 

In [171]:
# w = w/w.sum()
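
Here is a small worked sketch of one update followed by normalization, on made-up arrays (the `_demo` names and `w_new` are illustrative). It assumes the weighted error is the plain sum of the misclassified weights. Under that assumption the misclassified observations end up holding half of the total weight after the update, which is what forces the next classifier to pay attention to them.

```python
import numpy as np

# Made-up labels and predictions with a single mistake at index 1
y_demo = np.array([1, 1, -1, -1, 1])
y_hat_demo = np.array([1, -1, -1, -1, 1])
w_demo = np.full(5, 0.2)                                   # uniform starting weights

error_demo = w_demo[y_hat_demo != y_demo].sum()            # about 0.2
alpha_demo = 0.5 * np.log((1 - error_demo) / error_demo)   # about ln(2)

# y * y_hat is +1 when correct and -1 when wrong, so correct weights
# shrink by exp(-alpha) and wrong weights grow by exp(alpha)
w_new = w_demo * np.exp(-alpha_demo * y_demo * y_hat_demo)

# Divide by the total so the weights sum to one again
w_new = w_new / w_new.sum()
print(w_new)
```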

### Complete Many Times

Ok, now let's loop through this procedure multiple times and see how we do.

In [172]:
import numpy as np

ws = []
alphas = []
y_hats = []
errors = []
dtcs = []
w = np.ones(y_train.shape[0])/y_train.shape[0]

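for i in range(30):
    # fit a shallow tree using the current observation weights
    dtc = DecisionTreeClassifier(max_depth = 2).fit(X_train, y_train,
                                                    sample_weight = w)
    y_hat = dtc.predict(X_train)
    # weighted error, divided by the number of samples as in error() above
    error_t = w[y_hat != y_train].sum()/y_train.shape[0]
    # named alpha_t so the alpha() function defined earlier is not overwritten
    alpha_t = .5*np.log((1 - error_t)/error_t)
    # raise the weights of misclassified observations, lower the rest
    w = w*np.exp(-alpha_t*y_train*y_hat)
    # renormalize so the weights sum to one
    w = w/w.sum()
    ws.append(w)
    alphas.append(alpha_t)
    y_hats.append(y_hat)
    errors.append(error_t)
    dtcs.append(dtc)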

In [173]:
alphas_arr = np.array(alphas)
tree_preds = np.array(y_hats) 

In [174]:
tree_preds.shape

(30, 455)

In [175]:
errors[:3]

[0.00010143702451394758, 1.6422224503675357e-06, 1.6439245142949285e-05]
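
As a cross-check on the procedure, here is a self-contained sketch of the same loop that does not depend on scikit-learn. It is an illustration under stated assumptions, not the lesson's implementation: it swaps the decision trees for hand-written one-split stumps, takes the weighted error as the plain sum of the misclassified weights, and runs on synthetic data. All of the names (`fit_stump`, `stump_predict`, `adaboost_fit`, `adaboost_predict`, `X_demo`, `y_demo`) are hypothetical.

```python
import numpy as np

def fit_stump(X, y, w):
    # Exhaustively pick the (feature, threshold, sign) with the lowest weighted error
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best[1:]

def stump_predict(stump, X):
    j, thr, sign = stump
    return np.where(X[:, j] <= thr, sign, -sign)

def adaboost_fit(X, y, n_rounds=10):
    w = np.full(len(y), 1 / len(y))            # start with uniform weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = fit_stump(X, y, w)
        y_hat = stump_predict(stump, X)
        # weighted error, clipped so the log below stays finite
        error_t = np.clip(w[y_hat != y].sum(), 1e-10, 1 - 1e-10)
        alpha_t = 0.5 * np.log((1 - error_t) / error_t)
        w = w * np.exp(-alpha_t * y * y_hat)   # up-weight mistakes, down-weight hits
        w = w / w.sum()                        # renormalize to sum to one
        stumps.append(stump)
        alphas.append(alpha_t)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    scores = sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))
    return np.where(scores > 0, 1, -1)

# Synthetic data, made up for illustration: the label is +1 inside an interval,
# which a single one-split stump cannot represent
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(200, 1))
y_demo = np.where(np.abs(X_demo[:, 0]) < 0.5, 1, -1)

stumps, alphas_demo = adaboost_fit(X_demo, y_demo, n_rounds=10)
train_acc = (adaboost_predict(stumps, alphas_demo, X_demo) == y_demo).mean()
print(train_acc)
```

The point of the toy data is that no single stump can separate an interval from its two sides, while a weighted vote of several stumps can.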

In [176]:
def predict(dtrs, alphas, X):
    preds = np.vstack([alpha*dtr.predict(X) for dtr, alpha in zip(dtrs, alphas)])
    cum_preds = preds.sum(axis = 0)
    return np.where(cum_preds > 0, 1, -1)

In [177]:
predictions = predict(dtcs, alphas, X_test)
predictions[:3]

array([-1, -1, -1])

In [178]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, predictions)

0.9912280701754386

In [179]:
from sklearn.metrics import precision_score, recall_score

precision_score(y_test, predictions), recall_score(y_test, predictions)

(1.0, 0.9761904761904762)

In [185]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(min_samples_leaf = 7, max_features = 'log2', 
                             random_state = 1, n_estimators = 20).fit(X_train, y_train)
rfc_predictions = rfc.predict(X_test)

precision_score(y_test, rfc_predictions), recall_score(y_test, rfc_predictions)

(0.9512195121951219, 0.9285714285714286)

### Wrapping Up

We can see in the above that we were quite successful in our adaboost procedure.  The main new component that we learned was how to update our weights.

$w_t = w_{t - 1}*e^{-\alpha*y_i*(h_{t-1})} $

We saw that we do this by using $y_i*h_{t-1}$, to toggle our update between:

* $e^{\alpha}$ when an observation is **incorrectly** classified, and 
* $\frac{1}{e^\alpha}$ when an observation is **correctly** classified

So this leads each successive classifier to provide more weight to observations that were previously classified incorrectly, and especially by influential classifiers.

After going through one cycle, we then trained thirty successive decision trees to build a model that outperformed our random forest on both precision and recall for this test split.