# Using Bayesian Optimization to Perform Cross Validation

Bayesian optimization is a global optimization method for noisy black-box functions. It is particularly useful for optimizing the hyperparameters of machine learning algorithms.

Bayesian optimization works by constructing a surrogate function, a probabilistic model that approximates the true objective function. The surrogate is updated after each evaluation, so the algorithm adapts to the structure of the objective and chooses the next point to evaluate more effectively. The final result is the set of hyperparameters with the best objective value found within the evaluation budget.

Bayesian optimization tolerates noisy evaluations and can accommodate constraints. Because each new point is chosen using all previous results, it often needs fewer function evaluations than grid or random search to find a good configuration. This makes it a popular choice for tuning machine learning models, where every evaluation means training a model.

Cross validation is a model evaluation method that is commonly used in machine learning. It is a technique for assessing how the results of a statistical analysis will generalize to an independent data set. This is important because the goal of any machine learning algorithm is to make accurate predictions on new, unseen data.

Cross validation involves dividing the original dataset into two or more subsets, performing the analysis on one subset (called the training set), and then evaluating the model on the other subset (called the test set or validation set). This procedure is repeated several times, with different subsets of the data used for training and validation, in order to get an estimate of the model's performance on unseen data.
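
The splitting procedure can be sketched in plain Python. The helper below is hypothetical (it is not part of scikit-learn or `datatoolkit`) and only shows how k-fold index sets are formed: each sample lands in the validation set exactly once.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train, validation) index lists for each of n_splits folds."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds take one additional sample
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, n_splits=3))
for train, validation in folds:
    print(len(train), validation)
```

In practice scikit-learn's splitters, such as the `StratifiedShuffleSplit` used later, handle shuffling and stratification as well.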

Cross validation is useful because it helps detect overfitting, which occurs when a model fits the training data too closely and does not generalize well to new data. Evaluating a model on multiple held-out subsets gives a more reliable estimate of its performance on unseen data than a single train/test split.
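
A small sketch of the gap that cross validation exposes, assuming a synthetic dataset with label noise and an unconstrained decision tree (both chosen only for illustration): the accuracy on the training data is typically much higher than the cross-validated accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of the labels flipped (an assumption for this sketch)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)

tree = DecisionTreeClassifier(random_state=0)  # no depth limit, so it can memorize

# Accuracy measured on the same data the tree was trained on
train_accuracy = tree.fit(X_demo, y_demo).score(X_demo, y_demo)

# Mean accuracy over 5 held-out folds
cv_accuracy = cross_val_score(tree, X_demo, y_demo, cv=5).mean()

print(f"training accuracy:        {train_accuracy:.3f}")
print(f"cross-validated accuracy: {cv_accuracy:.3f}")
```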

## Import

### Modules

In [21]:
%load_ext autoreload
%autoreload 2
# %load_ext watermark
# %watermark -n -u -v -iv -w

import sys
from pathlib import Path

from datatoolkit import BayesianSearchCV
from hyperopt import hp
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit, RandomizedSearchCV
from sklearn.datasets import load_iris, make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
import scipy.stats as ss
import pandas as pd



Set up paths

In [5]:
PROJECT_ROOT = Path.cwd().resolve().parent
sys.path.append(str(PROJECT_ROOT))

### Scripts

In [3]:
from datatoolkit.model_selection import BayesianSearchCV

## Examples

### Random Forest Classifier

Set the parameter space, a dictionary that maps each hyperparameter to its search distribution.

In [6]:
parameter_space = {
    'n_estimators': hp.uniformint('n_estimators', 100, 1000),
    'max_depth': hp.uniformint('max_depth', 1, 5),
    'min_weight_fraction_leaf': hp.uniform('min_weight_fraction_leaf', 0, 0.5),
    # A list (rather than a set) keeps the order of the options stable between runs
    'criterion': hp.choice('criterion', ['gini', 'entropy', 'log_loss']),
}

Set the estimator and the cross validation generator

In [9]:
estimator = RandomForestClassifier()
cv = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=42)

Load data

In [10]:
X, y = load_iris(return_X_y=True)
X = X[:, :2]
X = X[y < 2]
y = y[y < 2]
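
As a quick check of what the cross validation generator does with this data, the sketch below iterates over the splits and records the size of each part and the class counts in the validation part. The variable names are chosen for the example only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedShuffleSplit

# Same binary subset of iris as above: classes 0 and 1, first two features
X_all, y_all = load_iris(return_X_y=True)
X_bin, y_bin = X_all[y_all < 2][:, :2], y_all[y_all < 2]

splitter = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=42)

split_summary = []
for train_idx, val_idx in splitter.split(X_bin, y_bin):
    # Stratification keeps the class proportions the same in every validation set
    class_counts = np.bincount(y_bin[val_idx]).tolist()
    split_summary.append((len(train_idx), len(val_idx), class_counts))

print(split_summary)
```

With 100 samples and `test_size=0.2`, each of the three splits trains on 80 samples and validates on 20.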

Cross validate with `BayesianSearchCV`

In [11]:
bs = BayesianSearchCV(estimator=estimator, parameter_space=parameter_space, scoring=["f1_score", "roc_auc_score"], refit="f1_score", n_iter=5, cv=cv)
bs.fit(X, y)

100%|██████████| 5/5 [00:12<00:00,  2.49s/trial, best loss: 0.19554043336188565]


Analyze the results

In [12]:
cv_results_ = pd.DataFrame.from_dict(bs.cv_results_)
cv_results_[['parameters', 'rank_score', 'average_val_f1_score']]

| | parameters | rank_score | average_val_f1_score |
|---|---|---|---|
| 0 | {'criterion': 'log_loss', 'max_depth': 3, 'min... | 1 | 0.050961 |
| 1 | {'criterion': 'entropy', 'max_depth': 4, 'min_... | 3 | 0.151356 |
| 2 | {'criterion': 'log_loss', 'max_depth': 3, 'min... | 2 | 0.130303 |
| 3 | {'criterion': 'gini', 'max_depth': 4, 'min_wei... | 4 | 0.165641 |
| 4 | {'criterion': 'entropy', 'max_depth': 3, 'min_... | 5 | 0.178817 |


Check that `best_params_` matches the parameters of the top-ranked trial.

In [13]:
assert bs.best_params_ == cv_results_.query("rank_score == 1")['parameters'].values[0]

### Cross validating a pipeline

Load dataset

In [14]:
X, y = make_classification()
y

array([1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1,
       1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0,
       0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1])

Set up the pipeline

In [15]:
steps = [('pca', PCA()), ('rf', RandomForestClassifier())]
pipeline = Pipeline(steps)

Define parameter_space

In [16]:
parameter_space = {
    'rf__n_estimators': hp.uniformint('rf__n_estimators', 100, 1000),
    'rf__max_depth': hp.uniformint('rf__max_depth', 1, 5),
    'rf__min_weight_fraction_leaf': hp.uniform('rf__min_weight_fraction_leaf', 0, 0.5),
    # A list (rather than a set) keeps the order of the options stable between runs
    'rf__criterion': hp.choice('rf__criterion', ['gini', 'entropy', 'log_loss']),
    'pca__n_components': hp.uniformint('pca__n_components', 1, X.shape[1]),
}
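
The `step__parameter` keys follow scikit-learn's convention for addressing parameters of pipeline steps. A short sketch, using a separate `demo_pipeline` and arbitrary values chosen for the example, shows how such keys are routed to the right step:

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

demo_pipeline = Pipeline([("pca", PCA()), ("rf", RandomForestClassifier())])

# "<step name>__<parameter>" sets a parameter on the named step
demo_pipeline.set_params(pca__n_components=3, rf__max_depth=4)

print(demo_pipeline.named_steps["pca"].n_components)
print(demo_pipeline.named_steps["rf"].max_depth)

# get_params() lists every key that can be tuned this way
print("rf__criterion" in demo_pipeline.get_params())
```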

Cross validation with pipeline

In [17]:
# Note: this rebinds `cv` from the splitter defined earlier to the search object itself
cv = BayesianSearchCV(estimator=pipeline, parameter_space=parameter_space, scoring=["f1_score", "roc_auc_score"], refit="f1_score", n_iter=5, cv=cv)
cv.fit(X, y)

100%|██████████| 5/5 [00:23<00:00,  4.67s/trial, best loss: 1.4209269505755184]


Analyze the results

In [18]:
cv_results_ = pd.DataFrame.from_dict(cv.cv_results_)
cv_results_[['parameters', 'rank_score', 'average_val_f1_score']]

| | parameters | rank_score | average_val_f1_score |
|---|---|---|---|
| 0 | {'pca__n_components': 12, 'rf__criterion': 'gi... | 1 | 0.241158 |
| 1 | {'pca__n_components': 9, 'rf__criterion': 'log... | 3 | 0.276094 |
| 2 | {'pca__n_components': 6, 'rf__criterion': 'ent... | 5 | 0.300059 |
| 3 | {'pca__n_components': 19, 'rf__criterion': 'en... | 2 | 0.265497 |
| 4 | {'pca__n_components': 7, 'rf__criterion': 'ent... | 4 | 0.276094 |


Check that `best_params_` matches the parameters of the top-ranked trial.

In [19]:
assert cv.best_params_ == cv_results_.query("rank_score == 1")['parameters'].values[0]

## Benchmark with Scikit-learn's RandomizedSearchCV

In [28]:
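parameters = {
    # scipy's randint excludes the upper bound: these draw from [100, 999] and [1, 4]
    'n_estimators': ss.randint(100, 1000),
    'max_depth': ss.randint(1, 5),
    # uniform(loc, scale) covers [loc, loc + scale], here [0, 0.5]
    'min_weight_fraction_leaf': ss.uniform(0, 0.5),
    'criterion': ['gini', 'entropy', 'log_loss'],
}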
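
When comparing the two searches it helps to confirm the ranges these scipy distributions actually cover. The sketch below draws samples and inspects their extremes. The sample size and seed are arbitrary choices for the example.

```python
import scipy.stats as ss

# randint(low, high) draws integers from low up to high - 1
depth_draws = ss.randint(1, 5).rvs(size=1000, random_state=0)

# uniform(loc, scale) draws floats from loc up to loc + scale
fraction_draws = ss.uniform(0, 0.5).rvs(size=1000, random_state=0)

print(sorted(set(depth_draws.tolist())))
print(fraction_draws.min(), fraction_draws.max())
```

Since `ss.randint` never returns its upper bound, `max_depth` here stops at 4. If the hyperopt space is meant to include 5, a like-for-like benchmark would presumably use `ss.randint(1, 6)`.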

In [29]:
X, y = load_iris(return_X_y=True)
X = X[:, :2]
X = X[y < 2]
y = y[y < 2]

In [30]:
estimator = RandomForestClassifier()
cv = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=42)
rs = RandomizedSearchCV(estimator=estimator, param_distributions=parameters, scoring=["f1", "roc_auc"], refit="f1", n_iter=5, cv=cv)
rs.fit(X, y)

In [32]:
pd.DataFrame.from_dict(rs.cv_results_)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_criterion | param_max_depth | param_min_weight_fraction_leaf | param_n_estimators | params | split0_test_f1 | ... | split2_test_f1 | mean_test_f1 | std_test_f1 | rank_test_f1 | split0_test_roc_auc | split1_test_roc_auc | split2_test_roc_auc | mean_test_roc_auc | std_test_roc_auc | rank_test_roc_auc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.498771 | 0.095643 | 0.057616 | 0.013807 | gini | 2 | 0.259515 | 259 | {'criterion': 'gini', 'max_depth': 2, 'min_wei... | 0.736842 | ... | 0.869565 | 0.785469 | 0.059707 | 5 | 0.94 | 0.98 | 1.0 | 0.973333 | 0.024944 | 2 |
| 1 | 0.695296 | 0.010614 | 0.079006 | 0.000595 | log_loss | 4 | 0.443288 | 455 | {'criterion': 'log_loss', 'max_depth': 4, 'min... | 0.736842 | ... | 0.909091 | 0.798644 | 0.078282 | 4 | 0.93 | 0.99 | 1.0 | 0.973333 | 0.030912 | 2 |
| 2 | 1.191669 | 0.206461 | 0.108912 | 0.001227 | entropy | 3 | 0.034194 | 627 | {'criterion': 'entropy', 'max_depth': 3, 'min_... | 0.842105 | ... | 0.952381 | 0.913952 | 0.050844 | 1 | 0.97 | 0.99 | 1.0 | 0.986667 | 0.012472 | 1 |
| 3 | 1.202687 | 0.033928 | 0.138497 | 0.003269 | entropy | 2 | 0.492667 | 789 | {'criterion': 'entropy', 'max_depth': 2, 'min_... | 0.8 | ... | 0.909091 | 0.872727 | 0.051426 | 2 | 0.9 | 0.975 | 1.0 | 0.958333 | 0.042492 | 5 |
| 4 | 0.598624 | 0.003122 | 0.069966 | 0.000365 | entropy | 1 | 0.455561 | 401 | {'criterion': 'entropy', 'max_depth': 1, 'min_... | 0.736842 | ... | 0.909091 | 0.834359 | 0.072142 | 3 | 0.92 | 0.975 | 1.0 | 0.965 | 0.033417 | 4 |
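
To put this output side by side with the `BayesianSearchCV` tables, it can be reduced to the same three pieces of information: parameters, rank, and mean validation F1. The sketch below reruns a smaller random search (fewer trees and iterations, an assumption made to keep it fast) and selects the matching columns.

```python
import pandas as pd
import scipy.stats as ss
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedShuffleSplit

X_all, y_all = load_iris(return_X_y=True)
X_bin, y_bin = X_all[y_all < 2][:, :2], y_all[y_all < 2]

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": ss.randint(10, 50),
        "max_depth": ss.randint(1, 5),
    },
    scoring=["f1", "roc_auc"],
    refit="f1",  # with several metrics, refit names the one used for ranking
    n_iter=3,
    cv=StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=42),
    random_state=0,
)
search.fit(X_bin, y_bin)

# Keep only the columns comparable to the BayesianSearchCV results
summary = pd.DataFrame(search.cv_results_)[["params", "rank_test_f1", "mean_test_f1"]]
print(summary.sort_values("rank_test_f1"))
print(search.best_params_)
```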
