
# Advanced Boosting Algorithms

When working with specific classifier families (such as logistic regression or
neural networks), it's very easy to include an L1 or L2 penalty, but it's not so
easy with other estimators. For this reason, a common regularization technique
(also implemented by scikit-learn) is downsampling the training dataset.
Selecting P < N random data points allows the estimators to reduce the
variance and prevent overfitting.
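
As a minimal sketch of this idea, the subsample parameter of GradientBoostingClassifier makes every tree fit on a random fraction of the training points. The Wine dataset and the ratio 0.7 are assumptions chosen only for illustration; they are not values taken from the text.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, Y = load_wine(return_X_y=True)

# subsample=0.7 is an illustrative value: each tree is trained on a
# random 70% of the training points (P < N)
gbc_sub = GradientBoostingClassifier(n_estimators=50,
                                     subsample=0.7,
                                     random_state=1000)

score_sub = np.mean(cross_val_score(gbc_sub, X, Y, cv=5))
print('Subsampled CV accuracy: {:.3f}'.format(score_sub))
```

Comparing this score with the one obtained with subsample=1.0 is a quick way to see whether the variance reduction is worth the loss of training data.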

Alternatively, it's possible to employ a random feature selection (for gradient
tree boosting only) as in a random forest; picking a fraction of the total number
of features increases the uncertainty and avoids over-specialization. Of course,
the main drawback to these techniques is a loss of accuracy (proportional to the
downsampling/feature selection ratio) that must be analyzed in order to find the
most appropriate trade-off.
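
The trade-off can be explored with a small loop over the max_features parameter, which in scikit-learn accepts a fraction of the total number of features. This is only a sketch: the Wine dataset and the three fractions are assumptions, and the dictionary name feature_scores is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, Y = load_wine(return_X_y=True)

feature_scores = {}

# Illustrative fractions; 1.0 means that every split can use all features
for frac in (0.25, 0.5, 1.0):
    gbc_mf = GradientBoostingClassifier(n_estimators=50,
                                        max_features=frac,
                                        random_state=1000)
    feature_scores[frac] = np.mean(cross_val_score(gbc_mf, X, Y, cv=5))

for frac, score in feature_scores.items():
    print('max_features={:.2f} -> CV accuracy: {:.3f}'.format(frac, score))
```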

# Gradient tree boosting with scikit-learn

In this example, we want to employ a gradient tree boosting classifier (class
GradientBoostingClassifier) and check the impact of the maximum tree depth
(parameter max_depth) on performance. Considering the previous example, we start
by setting n_estimators=50 and learning_rate=0.8.

In [1]:
import numpy as np
import joblib
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
scores_md = []
eta = 0.8

In [3]:
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

wine = load_wine()
X, Y = wine["data"], wine["target"]
ss = StandardScaler()
Xs = ss.fit_transform(X)

In [4]:
for md in range(2, 13):
    # Evaluate one classifier per maximum tree depth (2 to 12)
    gbc = GradientBoostingClassifier(n_estimators=50,
                                     learning_rate=eta,
                                     max_depth=md,
                                     random_state=1000)
    scores_md.append(np.mean(
        cross_val_score(gbc, X, Y,
                        n_jobs=joblib.cpu_count(), cv=10)))


In [5]:
gbc

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.8, loss='deviance', max_depth=12,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=50,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=1000, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

As explained in the first section, the maximum depth of a decision tree is strictly
related to the possibility of interaction among features. This can be a positive
or negative aspect when the trees are employed in an ensemble. A very high
interaction level can create over-complex separation surfaces and increase the
overall variance. In other cases, a limited interaction results in a higher bias.
With this particular (and simple) dataset, the gradient boosting algorithm
achieves better performance when the max depth is two (consider that the root has
a depth equal to zero), and this is partially confirmed by both the feature importance
analysis and dimensionality reductions.
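
The link between depth and feature interaction can be checked directly on the fitted trees. The following sketch (the helper max_features_per_tree is a hypothetical name, and the Wine dataset is assumed) counts how many distinct features a single tree splits on: with max_depth=1 every tree can use one feature at most, so no interaction is possible, while deeper trees can combine several features along a path.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier

X, Y = load_wine(return_X_y=True)


def max_features_per_tree(max_depth):
    """Largest number of distinct features used by any single tree."""
    gbc = GradientBoostingClassifier(n_estimators=50,
                                     learning_rate=0.8,
                                     max_depth=max_depth,
                                     random_state=1000)
    gbc.fit(X, Y)
    counts = []
    # estimators_ has one regression tree per boosting stage and class
    for tree in gbc.estimators_.ravel():
        split_features = tree.tree_.feature
        # Leaf nodes are marked with a negative feature index
        counts.append(np.unique(split_features[split_features >= 0]).size)
    return max(counts)


features_depth_1 = max_features_per_tree(1)
features_depth_2 = max_features_per_tree(2)
print(features_depth_1, features_depth_2)
```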

In [6]:
import numpy as np

scores_eta = []

# Evaluate 100 learning rates between 0.01 and 1.0 with max_depth=2
for eta in np.linspace(0.01, 1.0, 100):
    gbc = GradientBoostingClassifier(n_estimators=50,
                                     learning_rate=eta,
                                     max_depth=2,
                                     random_state=1000)
    scores_eta.append(
        np.mean(cross_val_score(gbc, X, Y,
                                n_jobs=-1, cv=10)))

In [7]:
scores_eta

[0.9163398692810457,
 0.9218954248366013,
 0.9218954248366013,
 0.9277777777777778,
 0.95,
 0.9555555555555555,
 0.9555555555555555,
 0.961111111111111,
 0.961111111111111,
 0.961111111111111,
 0.961111111111111,
 0.961111111111111,
 0.9555555555555555,
 0.9555555555555555,
 0.9555555555555555,
 0.9555555555555555,
 0.9555555555555555,
 0.9555555555555555,
 0.9555555555555555,
 0.9555555555555555,
 0.961111111111111,
 0.9555555555555555,
 0.961111111111111,
 0.961111111111111,
 0.9444444444444444,
 0.9555555555555555,
 0.95,
 0.9555555555555555,
 0.9555555555555555,
 0.9555555555555555,
 0.95,
 0.9444444444444444,
 0.95,
 0.9666666666666668,
 0.9555555555555555,
 0.95,
 0.9555555555555555,
 0.95,
 0.961111111111111,
 0.9555555555555555,
 0.961111111111111,
 0.9666666666666668,
 0.9666666666666668,
 0.95,
 0.95,
 0.9722222222222221,
 0.9555555555555555,
 0.9555555555555555,
 0.949673202614379,
 0.9555555555555555,
 0.9777777777777779,
 0.9555555555555555,
 0.9666666666666666,
 0.9666666

# XGBoost

In [8]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
wine = load_wine()
X, Y = wine["data"], wine["target"]
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.15, random_state=1000)

At this point, we need to prepare the data in a format called DMatrix, which is
compatible with XGBoost. Luckily, the framework allows us to load almost any kind
of data structure. Therefore, we just need to instantiate the classes:

In [9]:
import xgboost as xgb
dall = xgb.DMatrix(X, label=Y,
                   feature_names=wine['feature_names'])
dtrain = xgb.DMatrix(X_train, label=Y_train,
                     feature_names=wine['feature_names'])
dtest = xgb.DMatrix(X_test, label=Y_test,
                    feature_names=wine['feature_names'])


In [10]:
dall

<xgboost.core.DMatrix at 0x7fbab11b4710>

XGBoost offers two valid alternatives for multiclass problems: Softmax
(multi:softmax) and Softprob (multi:softprob). We are employing the latter, which
returns the full output of the softmax function, while the former returns only the
most likely class. In fact, the output will be a probability vector
$y_i = (p(c = 1), p(c = 2), \ldots, p(c = m))$, where each term $p(c = i)$
represents the probability that the right class is i.
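
The difference between the two outputs can be illustrated with plain NumPy. This is only a sketch of the softmax transformation, not XGBoost's internal code, and the raw scores are hypothetical values.

```python
import numpy as np


def softmax(scores):
    # Subtract the row-wise maximum for numerical stability
    z = scores - np.max(scores, axis=1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=1, keepdims=True)


# Hypothetical raw scores for two samples and m = 3 classes
raw_scores = np.array([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])

probs = softmax(raw_scores)        # Softprob-style output: one vector per sample
labels = np.argmax(probs, axis=1)  # Softmax-style output: only the winning class

print(probs)
print(labels)
```

Each row of probs sums to 1, and applying argmax to it yields the class label, which is the same operation performed later on the predictions before computing the confusion matrix.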

In [11]:
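import joblib
params = {
    # Not used by the native API (xgb.train/xgb.cv): the number of
    # boosting rounds is passed as a separate argument
    'n_estimators': 50,
    'max_depth': 2,
    'eta': 1.0,
    'objective': 'multi:softprob',
    'eval_metric': 'mlogloss',
    'num_class': 3,
    'lambda': 1.0,
    'seed': 1000,
    'nthread': joblib.cpu_count(),
}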

The max depth of the trees ($N_c = 50$) has been set to 2 to avoid overfitting. The
learning rate $\eta$ has been set to 1.0 and the parameter $\lambda$, which controls the L2
regularization, has been kept to its default value (1.0). This choice has been made
after a simple grid search, but I invite the reader to re-implement the exercise using
the XGBClassifier class, which is compatible with scikit-learn and can be analyzed
using GridSearchCV. It's always important to repeat that such large-capacity models,
when working with small datasets, can easily overfit. This behavior would be
paradoxical, because the validation accuracy could be lower than that of a simpler
linear model. The use of L2 regularization prevents the model from overlearning the
training set (or, at least, mitigates the tendency), hence its usage is always a factor to
consider.

In [13]:
nb_rounds = 20
cv_model = xgb.cv(params, dall,
 nb_rounds,
 nfold=10,
 seed=1000)
print(cv_model) #.describe()

    train-mlogloss-mean  ...  test-mlogloss-std
0              0.284897  ...           0.079161
1              0.121082  ...           0.089228
2              0.059105  ...           0.110782
3              0.032789  ...           0.098858
4              0.021392  ...           0.092548
5              0.015451  ...           0.091961
6              0.012204  ...           0.091815
7              0.010646  ...           0.092059
8              0.009978  ...           0.093110
9              0.009623  ...           0.093007
10             0.009349  ...           0.092107
11             0.009126  ...           0.093106
12             0.008947  ...           0.091982
13             0.008783  ...           0.091358
14             0.008652  ...           0.090829
15             0.008537  ...           0.090441
16             0.008443  ...           0.090323
17             0.008365  ...           0.090024
18             0.008301  ...           0.089393
19             0.008251  ...           0

In [14]:
evals = [(dtest, 'test'), (dtrain, 'train')]
model = xgb.train(params, dtrain,
 nb_rounds, evals)

[0]	test-mlogloss:0.458516	train-mlogloss:0.278144
[1]	test-mlogloss:0.287964	train-mlogloss:0.113728
[2]	test-mlogloss:0.224204	train-mlogloss:0.051498
[3]	test-mlogloss:0.180763	train-mlogloss:0.029227
[4]	test-mlogloss:0.145213	train-mlogloss:0.018459
[5]	test-mlogloss:0.140144	train-mlogloss:0.01348
[6]	test-mlogloss:0.13465	train-mlogloss:0.010576
[7]	test-mlogloss:0.143571	train-mlogloss:0.009546
[8]	test-mlogloss:0.14249	train-mlogloss:0.009148
[9]	test-mlogloss:0.139503	train-mlogloss:0.008862
[10]	test-mlogloss:0.138662	train-mlogloss:0.00886
[11]	test-mlogloss:0.138265	train-mlogloss:0.00886
[12]	test-mlogloss:0.138077	train-mlogloss:0.00886
[13]	test-mlogloss:0.137987	train-mlogloss:0.00886
[14]	test-mlogloss:0.137943	train-mlogloss:0.00886
[15]	test-mlogloss:0.137922	train-mlogloss:0.00886
[16]	test-mlogloss:0.137912	train-mlogloss:0.00886
[17]	test-mlogloss:0.137907	train-mlogloss:0.00886
[18]	test-mlogloss:0.137905	train-mlogloss:0.00886
[19]	test-mlogloss:0.137903	train-

In [15]:
from sklearn.metrics import confusion_matrix
Y_pred = model.predict(dtest)
print(confusion_matrix(Y_test,
 np.argmax(Y_pred, axis=1)))

[[ 6  0  0]
 [ 0 13  1]
 [ 0  0  7]]
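
The confusion matrix above can be turned into summary figures with a few lines of NumPy. This sketch copies the printed matrix by hand and relies on the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[6, 0, 0],
               [0, 13, 1],
               [0, 0, 7]])

accuracy = np.trace(cm) / np.sum(cm)          # correct predictions / all samples
recall = np.diag(cm) / np.sum(cm, axis=1)     # per true class (rows)
precision = np.diag(cm) / np.sum(cm, axis=0)  # per predicted class (columns)

print('Accuracy: {:.3f}'.format(accuracy))    # 26 / 27, about 0.963
print('Recall:', np.round(recall, 3))
print('Precision:', np.round(precision, 3))
```

Only one of the 27 test points is misclassified: a sample of the second class is assigned to the third one.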


In [17]:
!pip install shap

In [18]:
import shap
xg_explainer = shap.TreeExplainer(model)
shap_values = xg_explainer.shap_values(X)

In [20]:
shap_values

[array([[ 0.5820259 ,  0.        , -0.00701695, ...,  0.        ,
          0.        ,  2.2293222 ],
        [ 0.43322977,  0.        , -0.00701695, ...,  0.        ,
          0.        ,  2.3781183 ],
        [ 0.43322977,  0.        , -0.00701695, ...,  0.        ,
          0.        ,  2.4831321 ],
        ...,
        [ 0.4700444 ,  0.        , -0.00701695, ...,  0.        ,
          0.        ,  1.0358897 ],
        [ 0.10726569,  0.        , -0.00701695, ...,  0.        ,
          0.        ,  0.97070324],
        [ 0.4700444 ,  0.        , -0.01866045, ...,  0.        ,
          0.        , -1.2874091 ]], dtype=float32),
 array([[-1.5334089 ,  0.06683041,  0.        , ...,  0.07845446,
          0.        , -0.1029191 ],
        [-1.5334089 , -0.15197295,  0.        , ...,  0.07845446,
          0.        , -0.1029191 ],
        [-1.4450097 , -0.35949442,  0.        , ..., -0.00994464,
          0.        , -0.1029191 ],
        ...,
        [-1.0104495 , -0.35949442,  0. 

# Voting Classifiers

As the concept is
very simple, our goal is to show how to combine two completely different estimators
to improve the overall cross-validation accuracy. For this reason, we have selected
a logistic regression and a non-linear classifier (an RBF SVM), which are structurally
different. In particular, while the former is a linear model, the latter is a kernel-based
classifier that can solve complex non-linear problems.

The reason why we are employing these algorithms is that we'd like to correctly
classify the majority of data points using the linear model and exploit the non-linear
abilities of the SVM to reduce the uncertainty associated with borderline points. As
already pointed out, this dataset is quite simple and it's surprising how accurate a
soft voting classifier can be compared to the complexity of other methods.

This observation has to be considered from two opposite viewpoints. The first
one is about the complexity of the datasets employed in the examples (which
often require an ensemble). We have already explained that our goal is to show
the effectiveness of the methodologies and not to apply them in real-life cases
that require long training phases. Therefore, the results previously obtained are
absolutely valid and show how such models can overcome the limits of simpler
algorithms.

On the other hand, it's helpful to consider this example as an actual application of
Occam's razor. Sometimes, more complex models seem to perform
better, but slight modifications of simpler ones can make them much more accurate
and cost-effective. Considering that this is a didactic book, the reader should pay
attention to this kind of compromise and learn when it makes sense to dedicate some
time to optimizing simpler models instead of switching to more complex (and often
unmanageable) solutions.

In [21]:
X, Y = wine["data"], wine["target"]
ss = StandardScaler()
X = ss.fit_transform(X)


In [22]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

svm = SVC(kernel='rbf',
          gamma=0.01,
          random_state=1000)

print('SVM score: {:.3f}'.format(
    np.mean(cross_val_score(svm, X, Y,
                            n_jobs=-1, cv=10))))

lr = LogisticRegression(C=2.0,
                        max_iter=5000,
                        solver='lbfgs',
                        multi_class='auto',
                        random_state=1000)

print('Logistic Regression score: {:.3f}'.format(
    np.mean(cross_val_score(lr, X, Y,
                            n_jobs=joblib.cpu_count(), cv=10))))

SVM score: 0.983
Logistic Regression score: 0.983


As expected, the logistic regression achieved the same average CV accuracy as the
SVM (about 98.3%). Therefore, considering the different nature of the classifiers, a
hard-voting strategy is not the best choice. As we trust both classifiers and we'd like
to exploit their individual features, we have chosen soft voting with a weight vector
set to (0.5, 0.5). In this way, no classifier is dominant and each of them will contribute
equally to the prediction. Of course, we expect the SVM to be decisive in all
those borderline cases where the linearity of the logistic regression loses the ability
to capture small deviations.
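
The arithmetic behind this choice can be sketched for a single borderline point. The two probability vectors are hypothetical values, chosen so that the two classifiers disagree.

```python
import numpy as np

# Hypothetical class probabilities for one borderline sample (3 classes)
p_lr = np.array([0.52, 0.46, 0.02])   # logistic regression: barely class 0
p_svm = np.array([0.30, 0.68, 0.02])  # RBF SVM: clearly class 1

weights = np.array([0.5, 0.5])

# Soft voting: weighted average of the probability vectors, then argmax
p_soft = np.average(np.vstack([p_lr, p_svm]), axis=0, weights=weights)
soft_label = int(np.argmax(p_soft))

# Hard voting: each classifier casts a single vote for its own argmax
votes = np.bincount([np.argmax(p_lr), np.argmax(p_svm)], minlength=3)

print(p_soft, soft_label)
print(votes)
```

With only two estimators, hard voting ends in a tie whenever they disagree, while soft voting lets the more confident classifier (here the SVM) decide.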


The class VotingClassifier accepts a list of tuples (name of the estimator, instance)
that must be supplied through the estimators parameter. The strategy can be specified
using the voting parameter (it can be either "soft" or "hard") and the optional
weights can be supplied using the parameter with the same name.

In [23]:
from sklearn.ensemble import VotingClassifier

vc = VotingClassifier(
    estimators=[
        ('LR', LogisticRegression(C=2.0,
                                  max_iter=5000,
                                  solver='lbfgs',
                                  multi_class='auto',
                                  random_state=1000)),
        # probability=True is needed because soft voting averages
        # the class probabilities of the estimators
        ('SVM', SVC(kernel='rbf',
                    gamma=0.01,
                    probability=True,
                    random_state=1000))],
    voting='soft',
    weights=(0.5, 0.5))

print('Voting classifier score: {:.3f}'.format(
    np.mean(cross_val_score(vc, X, Y,
                            n_jobs=-1, cv=10))))


Voting classifier score: 0.994


Using a soft-voting strategy, the resulting estimator is able to outperform both the
logistic regression and the SVM by reducing the global uncertainty and reaching
an average CV score of about 99.4%. Indeed, the Wine dataset is almost linearly
separable, but a few data points lie in regions where a linear model will always
misclassify them. The presence of the RBF SVM enables this limit to
be overcome and helps the logistic regression when the sigmoid value is close to 0.5.
In those cases, the contribution of the SVM is enough to push the output above or
below the threshold so as to obtain a precise final classification.

In classical machine learning contexts, cross-validation is the standard way to
check the behavior of a model when trained with a large random subset and tested
on the remaining subsample. Ideally, we'd like to observe the same performance on
every fold, but it can also happen that the accuracy is higher in some folds and quite
a bit lower in others. When this phenomenon is observed and the dataset is the final
one, it probably means that the model is not able to manage one or more regions
of the sample space and a boosting approach could dramatically improve the final
accuracy.
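
A simple way to look for this phenomenon is to inspect the individual fold scores instead of their mean only. The sketch below assumes the Wine dataset and wraps the scaler and a logistic regression in a pipeline (an addition not used in the cells above), so that the scaling is refit on every training fold.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, Y = load_wine(return_X_y=True)

model = make_pipeline(StandardScaler(),
                      LogisticRegression(C=2.0, max_iter=5000))

# One accuracy value per fold
fold_scores = cross_val_score(model, X, Y, cv=10)

print(np.round(fold_scores, 3))
print('Mean: {:.3f}, std: {:.3f}, min: {:.3f}, max: {:.3f}'.format(
    fold_scores.mean(), fold_scores.std(),
    fold_scores.min(), fold_scores.max()))
```

A large gap between the minimum and the maximum fold score is the signal described above; a small standard deviation suggests that the model behaves consistently across the sample space.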