### Boosting

Boosting can be summed up fairly quickly: it essentially boils down to starting with a weak model and improving on it by fitting a new model to its errors.

A common way of doing this is to create a dummy model (which just guesses the mean) and then fit another simple model to the dummy model's errors. Generally speaking, it's best to use a slow-learning model: small step sizes let us narrow in on a more effective model without over-adjusting to the errors. One way to achieve this is to repeatedly fit a decision tree that has only 1 split to the current errors and add a scaled-down version of its predictions to the running model, or we can just use Sklearn's built-in gradient boosting estimator (here, `GradientBoostingClassifier`)! =)
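To make the idea concrete, here is a minimal hand-rolled sketch of that loop for a regression problem. It is not how Sklearn implements boosting internally; the synthetic data, the `learning_rate` of 0.1, and the 100 rounds are all assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption, for illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1  # small steps so no single tree over-corrects
n_rounds = 100       # number of boosting iterations

# Step 0: the dummy model predicts the mean of y for every sample
prediction = np.full_like(y, y.mean())
mse_history = [np.mean((y - prediction) ** 2)]

stumps = []
for _ in range(n_rounds):
    residuals = y - prediction                  # errors of the current model
    stump = DecisionTreeRegressor(max_depth=1)  # a tree with only 1 split
    stump.fit(X, residuals)                     # learn to predict the errors
    prediction += learning_rate * stump.predict(X)  # take a small corrective step
    stumps.append(stump)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"MSE of the mean-only model: {mse_history[0]:.4f}")
print(f"MSE after {n_rounds} rounds: {mse_history[-1]:.4f}")
```

Each stump on its own is a poor model, but because every round targets whatever error is left over, the training error of the combined model shrinks as rounds are added.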

This process is repeated for as many iterations as the developer chooses. Please see the code below for a quick example.

In [1]:
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd
from sklearn.model_selection import train_test_split

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import confusion_matrix

In [2]:
#import, clean, and preprocess data

salaries = pd.read_csv('data/adult.data', header=None)

X_train, X_test, y_train, y_test = train_test_split(salaries.drop(14, axis=1), salaries[14],
                                                   random_state=2)

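# one-hot encode the categorical columns; columns not listed here are dropped (ColumnTransformer's default)
ct = ColumnTransformer([("ohe", OneHotEncoder(handle_unknown='ignore'), [1, 3, 5, 6, 7, 8, 9, 13])])

# fit the encoder on the training data only, then apply it to both splits
X_train_dums = ct.fit_transform(X_train)
X_test_dums = ct.transform(X_test)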

In [5]:
# instantiate the gradient booster and run the default 100 iterations

gb = GradientBoostingClassifier(verbose=1, learning_rate=.2, random_state=1)

gb.fit(X_train_dums, y_train)

      Iter       Train Loss   Remaining Time 
         1           1.0046            2.79s
         2           0.9399            2.72s
         3           0.8975            2.48s
         4           0.8671            2.33s
         5           0.8443            2.23s
         6           0.8283            2.18s
         7           0.8152            2.13s
         8           0.8040            2.07s
         9           0.7972            2.02s
        10           0.7899            2.02s
        20           0.7490            1.66s
        30           0.7284            1.47s
        40           0.7173            1.24s
        50           0.7097            1.04s
        60           0.7047            0.82s
        70           0.7009            0.60s
        80           0.6972            0.39s
        90           0.6942            0.20s
       100           0.6918            0.00s


GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.2, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=1, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=1,
                           warm_start=False)

As you can see, the training loss decreases with each iteration, which indicates that every added tree improves the model's fit to the training data.
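The training loss only tells part of the story, so it can be worth checking held-out data as well. The sketch below is one way to do that, assuming a synthetic dataset from `make_classification` as a stand-in for the census data so that it runs on its own. It reads the per-iteration training loss from `train_score_`, recomputes the loss on the test split after each stage with `staged_predict_proba`, and finishes with a confusion matrix.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the census data (an assumption, for illustration only)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

gb = GradientBoostingClassifier(learning_rate=0.2, random_state=1)
gb.fit(X_train, y_train)

# training loss recorded after each boosting iteration
train_loss = gb.train_score_

# log loss on the held-out split after each boosting stage
test_loss = [log_loss(y_test, proba) for proba in gb.staged_predict_proba(X_test)]

print(f"train loss: first {train_loss[0]:.4f}, last {train_loss[-1]:.4f}")
print(f"test loss:  first {test_loss[0]:.4f}, last {test_loss[-1]:.4f}")

# rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, gb.predict(X_test))
print(cm)
```

Unlike the training loss, the held-out loss is not guaranteed to keep falling; if it flattens out or starts rising, that is usually a hint to use fewer iterations or a smaller learning rate.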