# Gradient boosting - handwritten number image recognition

## Import required libraries

In [92]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from xgboost import XGBClassifier

## Importing the data

In [93]:
digits = load_digits()
X = digits.data
y = digits.target

## Fitting and evaluating the model - Gradient boosting classifier

For this model, we will use the Gradient Boosting classifier, a supervised learning model that belongs to the family of ensemble methods. First, we need a loss function to minimize. The specific function depends on the task: whether it is a classification or a regression problem and, for classification, whether it is binary or multiclass.

Gradient Boosting uses a weak learner (a decision tree in our example) as its base model. The ensemble is built additively: at each step, a new base model is fitted to the negative gradient of the loss with respect to the current predictions and then added to the ensemble, so the procedure as a whole performs gradient descent on the loss.
This process continues until a fixed number of base models has been added, or until the loss reaches an acceptable level or no longer improves on a validation dataset.
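The additive process described above can be sketched by hand for the simplest case. This is a minimal illustration, not scikit-learn's implementation: it assumes a regression task with squared-error loss (where the negative gradient is proportional to the residual), and the toy data and the names `X_toy`, `y_toy`, `learning_rate` and `n_rounds` are made up for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: a noisy sine curve.
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction (the mean minimizes squared error).
prediction = np.full_like(y_toy, y_toy.mean())
mse_per_round = [np.mean((y_toy - prediction) ** 2)]

for _ in range(n_rounds):
    # For squared error, the negative gradient is proportional to the residual.
    residuals = y_toy - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_toy, residuals)
    # Add a shrunken copy of the new tree's output to the ensemble prediction.
    prediction += learning_rate * tree.predict(X_toy)
    mse_per_round.append(np.mean((y_toy - prediction) ** 2))

print(f"Training MSE: {mse_per_round[0]:.4f} -> {mse_per_round[-1]:.4f}")
```

Each round fits a shallow tree to what the current ensemble still gets wrong, which is why the training error shrinks step by step. For classification, the same loop applies with a different loss and gradient.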

In [119]:
gb = GradientBoostingClassifier(max_features="sqrt", random_state=12)

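The code above creates an instance of the Gradient Boosting classifier. The `random_state` parameter fixes the seed of the random number generator, so the output is reproducible rather than varying each time the code is run. Setting `max_features="sqrt"` limits each split to a random subset of the features whose size is the square root of the total number of features.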

In [120]:
param_grid = {
    'learning_rate': [0.05, 0.1, 0.15, 0.2],
    'n_estimators': [40, 60, 80, 100],
    'subsample': [0.6, 0.8, 1.0],
    'min_samples_split': [20, 40, 60, 80, 100],
    'max_depth': [3, 4, 5],
}

`param_grid` maps each hyperparameter name to the list of values that the cross-validated grid search will try.

In [121]:
grid_search = GridSearchCV(estimator=gb, param_grid=param_grid, n_jobs=-1, cv=3, scoring='accuracy', verbose=1)
grid_search.fit(X, y)

Fitting 3 folds for each of 720 candidates, totalling 2160 fits


`GridSearchCV` is a class that performs an exhaustive search over a specified hyperparameter grid for an estimator to determine which combination performs best. In this case, we have 720 candidate combinations, and each one is evaluated across 3 different folds of the data.
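The candidate count is the product of the number of values per hyperparameter (4 × 4 × 3 × 5 × 3). As a quick check, scikit-learn's `ParameterGrid` can enumerate the combinations; the dictionary below repeats the grid values from above under the illustrative name `grid`.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid searched above, repeated so the snippet is self-contained.
grid = {
    'learning_rate': [0.05, 0.1, 0.15, 0.2],
    'n_estimators': [40, 60, 80, 100],
    'subsample': [0.6, 0.8, 1.0],
    'min_samples_split': [20, 40, 60, 80, 100],
    'max_depth': [3, 4, 5],
}

n_candidates = len(ParameterGrid(grid))
n_folds = 3
print(n_candidates, n_candidates * n_folds)  # 720 2160
```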

We pass six parameters to `GridSearchCV`, four of which relate directly to the search itself:
* `cv` specifies the number of folds to be used in cross-validation. Because `gb` is a classifier and `cv` is an integer, scikit-learn uses stratified k-fold (`StratifiedKFold`) here, which preserves the class proportions in each fold.
* `param_grid` is a dictionary of the hyperparameters and the values the grid search will iterate over; in this case, the pre-determined values in our `param_grid` above.
* `estimator` is the machine learning estimator whose hyperparameters we want to tune. In this example, we use the `gb` instance, but throughout this notebook we will adapt this to use different models.
* `scoring` is the metric used to evaluate the performance of each model fit; in this case, we use the `accuracy` metric.

The last two parameters concern the configuration and computation of the grid search:
* `verbose` determines how much information is printed about the progress of the search. With a value of one, we receive the number of candidates, the number of folds and the total number of fits. Higher values provide more detail, such as a notification on each completed fit, its runtime and its score.
* `n_jobs` specifies the number of jobs to run in parallel; a value of `-1` uses all of the available processors.
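Beyond the single best result, the fitted search object stores the score of every candidate in its `cv_results_` attribute. The sketch below shows one way to inspect it. It uses a deliberately tiny grid and only 10 estimators so that it runs quickly, which is an assumption made for illustration and not the grid used in this notebook; the names `X_demo`, `y_demo`, `small_grid` and `small_search` are likewise illustrative.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = load_digits(return_X_y=True)

# A tiny illustrative grid: 2 x 2 = 4 candidates.
small_grid = {'learning_rate': [0.1, 0.2], 'max_depth': [2, 3]}

small_search = GridSearchCV(
    estimator=GradientBoostingClassifier(
        n_estimators=10, max_features="sqrt", random_state=12
    ),
    param_grid=small_grid,
    cv=3,
    scoring='accuracy',
)
small_search.fit(X_demo, y_demo)

# One row per candidate, with the mean and spread of its fold scores.
results = pd.DataFrame(small_search.cv_results_)
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[columns].sort_values('rank_test_score'))
```

Looking at `std_test_score` alongside the mean helps judge whether the gap between the top candidates is larger than the fold-to-fold variation.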

In [122]:
print(f"Best n_estimators: {grid_search.best_params_['n_estimators']}")
print(f"Best learning_rate: {grid_search.best_params_['learning_rate']}")
print(f"Best subsample: {grid_search.best_params_['subsample']}")
print(f"Best min_samples_split: {grid_search.best_params_['min_samples_split']}")
print(f"Best max_depth: {grid_search.best_params_['max_depth']}")
print(f"Best accuracy score: {round(grid_search.best_score_, 5) * 100}%")

Best n_estimators: 100
Best learning_rate: 0.2
Best subsample: 0.6
Best min_samples_split: 20
Best max_depth: 5
Best accuracy score: 95.326%
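The score above is a cross-validated accuracy computed on the same data used to pick the hyperparameters. As a complementary check, the sketch below refits a classifier with the reported best values on a training split and scores it on a held-out split, using the `train_test_split`, `accuracy_score` and `confusion_matrix` utilities imported earlier. The 75/25 split, the `stratify` option and the variable names are assumptions for illustration. Because the hyperparameters were chosen using all of the data, this estimate may still be slightly optimistic; a stricter workflow would run the grid search on the training split only.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X_all, y_all = load_digits(return_X_y=True)

# Hold out 25% of the images, keeping class proportions equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.25, stratify=y_all, random_state=12
)

# Hyperparameter values copied from the grid search output above.
gb_best = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.2,
    subsample=0.6,
    min_samples_split=20,
    max_depth=5,
    max_features="sqrt",
    random_state=12,
)
gb_best.fit(X_train, y_train)
y_pred = gb_best.predict(X_test)

test_accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: true digit, columns: predicted digit

print(f"Held-out accuracy: {test_accuracy:.4f}")
print(cm)
```

The off-diagonal entries of the confusion matrix show which digits are mistaken for which, something a single accuracy figure cannot reveal.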


## Fitting and evaluating the model - AdaBoost

AdaBoost, also known as Adaptive Boosting, is a boosting algorithm that predates gradient boosting and can be interpreted as a special case of it. It differs from general gradient boosting in a few ways. First, it corresponds to minimizing one specific loss function, the exponential loss. Because this loss grows exponentially with the size of the error, it penalizes instances that are misclassified and lie far on the wrong side of the decision boundary much more heavily than most losses used in gradient boosting. This can make the algorithm sensitive to outliers.

**Exponential Loss function**

$$
L(y, f(x)) = e^{-y \cdot f(x)}
$$
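To get a feel for how quickly this penalty grows, the short sketch below evaluates the loss at a few values of the margin $y \cdot f(x)$. It assumes the binary convention in which labels are $-1$ or $+1$, so a negative margin means a misclassified instance; the margin values are made up for illustration.

```python
import numpy as np

# Margin y * f(x): negative = wrong side of the boundary, positive = correct side.
margins = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
losses = np.exp(-margins)

for margin, loss in zip(margins, losses):
    print(f"margin {margin:+.1f} -> loss {loss:.3f}")
# margin -2.0 -> loss 7.389
# margin -1.0 -> loss 2.718
# margin +0.0 -> loss 1.000
# margin +1.0 -> loss 0.368
# margin +2.0 -> loss 0.135
```

An instance misclassified with margin $-2$ costs roughly 55 times more than one classified correctly with margin $+2$, which is why a few badly misclassified outliers can dominate the fit.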


Another notable difference is that the weak learners typically used in AdaBoost are decision stumps, that is, one-level decision trees (this is the default base estimator of `AdaBoostClassifier`). These are combined in an additive model: at each iteration, the training instances misclassified by the previous learner receive more weight. This means the later weak learners focus on the instances that are more difficult to classify.

This boosting process continues until the maximum number of iterations is reached or the algorithm achieves the desired level of accuracy. This adaptive approach allows AdaBoost to iteratively improve its performance by emphasizing the previously misclassified instances.
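The reweighting step can be illustrated with a single round of the classic binary (discrete) AdaBoost update. This is a sketch under stated assumptions: labels are $-1$ or $+1$, the stump predictions in `y_stump` are hypothetical, and scikit-learn's multiclass implementation uses a generalized variant (SAMME) rather than exactly these formulas.

```python
import numpy as np

y_true = np.array([1, 1, -1, -1, 1])
# Hypothetical predictions from one decision stump; it misclassifies index 1.
y_stump = np.array([1, -1, -1, -1, 1])

# Every instance starts with equal weight.
weights = np.full(len(y_true), 1 / len(y_true))

# Weighted error of the stump and the resulting weight of the stump itself.
miss = y_true != y_stump
error = weights[miss].sum()                # 0.2
alpha = 0.5 * np.log((1 - error) / error)  # 0.5 * ln(4) = ln(2)

# Increase the weight of mistakes, decrease the weight of correct instances.
weights = weights * np.exp(-alpha * y_true * y_stump)
weights /= weights.sum()

print(f"alpha = {alpha:.4f}")
print(weights)  # [0.125 0.5   0.125 0.125 0.125]
```

After one round, the single misclassified instance carries half of the total weight, so the next stump is strongly encouraged to classify it correctly.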

In [129]:
abc = AdaBoostClassifier(random_state=13)

In [130]:
param_grid_1 = {
    'learning_rate': [0.01, 0.05, 0.1, 0.15, 0.2, 0.25],
    'n_estimators': np.linspace(20, 200, 10, dtype=int)
}

In [131]:
grid_search_abc = GridSearchCV(estimator=abc, param_grid=param_grid_1, cv=3, scoring='accuracy')
grid_search_abc.fit(X, y)

In [132]:
print(f"Best n_estimators: {grid_search_abc.best_params_['n_estimators']}")
print(f"Best learning_rate: {grid_search_abc.best_params_['learning_rate']}")
print(f"Best Accuracy: {round(grid_search_abc.best_score_, 4) * 100}%")

Best n_estimators: 120
Best learning_rate: 0.01
Best Accuracy: 69.62%


We can see how the differences between AdaBoost and Gradient Boosting have led to a significant difference in performance. This is likely down to a few key reasons:

1. Different weak learners - decision stumps constrain the individual learners' complexity and their ability to perform well as a base model. This multiclass classification problem may require more complex decision boundaries, which multi-level decision trees are more likely to capture.
2. Different loss functions - AdaBoost is tied to the exponential loss, whereas Gradient Boosting can use a range of differentiable loss functions. This makes it more flexible, which may contribute to the better performance.

While AdaBoost does not excel in this multiclass classification scenario, it is frequently used in binary classification problems.
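One way to probe the first explanation is to give AdaBoost deeper trees and compare the cross-validated accuracy against the default stumps. The sketch below is an illustrative experiment, not part of the original analysis: the depths, the 50 estimators and the variable names are assumptions. The base learner is passed positionally because its keyword name differs between scikit-learn versions (`base_estimator` in older releases, `estimator` in newer ones).

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_cmp, y_cmp = load_digits(return_X_y=True)

scores = {}
for depth in (1, 3):  # depth 1 is a decision stump
    base_tree = DecisionTreeClassifier(max_depth=depth, random_state=13)
    # Base learner passed positionally for compatibility across versions.
    model = AdaBoostClassifier(base_tree, n_estimators=50, random_state=13)
    fold_scores = cross_val_score(model, X_cmp, y_cmp, cv=3, scoring='accuracy')
    scores[depth] = fold_scores.mean()

for depth, score in scores.items():
    print(f"max_depth={depth}: mean CV accuracy = {score:.4f}")
```

If the deeper base learner closes most of the gap to Gradient Boosting, that would support the weak-learner explanation over the loss-function one.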

## Fitting and evaluating the model - XGBoost

The XGBoost classifier is a more advanced implementation of gradient boosting whose core is written in C++. It adds explicit L1 and L2 regularization terms to its objective, which scikit-learn's gradient boosting classifier does not offer; these can help prevent overfitting and improve model generalization. It also provides a wider range of hyperparameters that can be tuned to optimize model performance.
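For reference, the regularized objective can be written as follows, as a sketch based on the formulation in the XGBoost documentation. Here $l$ is the training loss, $f_k$ is the $k$-th tree, $T$ is its number of leaves and $w_j$ are its leaf weights. The penalty strengths $\alpha$ and $\lambda$ correspond to the `reg_alpha` and `reg_lambda` hyperparameters tuned below, and $\gamma$ corresponds to the `gamma` hyperparameter, which this notebook leaves at its default.

$$
\text{Obj} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|
$$

Larger values of $\alpha$ and $\lambda$ shrink the leaf weights towards zero, which makes each individual tree more conservative.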

In [134]:
xgb = XGBClassifier(random_state=13)

In [135]:
param_grid_2 = {
    'eta': [0.05, 0.1, 0.15, 0.2],
    'n_estimators':  np.linspace(10, 200, 20, dtype=int),
    'reg_alpha': [0, 0.1, 0.5, 1],  # L1 regularization term
    'reg_lambda': [0, 0.1, 0.5, 1],  # L2 regularization term
    'subsample': [0.6, 0.8, 1.0],
    'max_depth': [3, 4, 5, 6],
}

In [138]:
grid_search_xgb = GridSearchCV(estimator=xgb, param_grid=param_grid_2, cv=3, scoring='accuracy', n_jobs=-1)
grid_search_xgb.fit(X, y)

In [144]:
best_n_estimators = grid_search_xgb.best_params_['n_estimators']
best_learning_rate = grid_search_xgb.best_params_['eta']
best_reg_alpha = grid_search_xgb.best_params_['reg_alpha']
best_reg_lambda = grid_search_xgb.best_params_['reg_lambda']
best_subsample = grid_search_xgb.best_params_['subsample']
best_max_depth = grid_search_xgb.best_params_['max_depth']
best_accuracy = round(grid_search_xgb.best_score_ * 100, 4)

output = f"""
PARAMETERS
----------------------------------------
Best n_estimators: {best_n_estimators}
Best learning_rate: {best_learning_rate}
Best reg_alpha: {best_reg_alpha}
Best reg_lambda: {best_reg_lambda}
Best subsample: {best_subsample}
Best max_depth: {best_max_depth}

----------------------------------------
METRIC
----------------------------------------
Accuracy: {best_accuracy} %
"""

print(output)


PARAMETERS
----------------------------------------
Best n_estimators: 130
Best learning_rate: 0.15
Best reg_alpha: 0
Best reg_lambda: 0.1
Best subsample: 0.6
Best max_depth: 5

----------------------------------------
METRIC
----------------------------------------
Accuracy: 93.1553 %

