For this project we use Hyperopt, one of several libraries for automated hyperparameter tuning with Bayesian optimization. Hyperopt uses the Tree-structured Parzen Estimator (TPE).
Hyperopt has a simple syntax for structuring an optimization problem, and it extends beyond hyperparameter tuning to any problem that involves minimizing a function.

Objective of the Project
We will optimize the hyperparameters of a Gradient Boosting Machine using the Hyperopt library (with the Tree-structured Parzen Estimator algorithm). We will compare a manually implemented random search with the Bayesian model-based optimization method, to understand how the Bayesian method works and what benefits it has over uninformed search methods.

In [None]:
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import KFold

MAX_EVALS = 30
N_FOLDS = 10

DATASET

For experimentation, we have taken the Caravan Insurance Challenge data (https://www.kaggle.com/datasets/uciml/caravan-insurance-challenge?resource=download) 

We aim to determine whether or not a potential customer will buy an insurance policy by training a model on past data. We want to train a model to predict a binary outcome on testing data. (Supervised Classification) 

In [None]:
# Load the data and split it into train and test sets; the 'ORIGIN' column already marks each row as train or test
data = pd.read_csv(r'C:\Users\obero\Desktop\RUTGERS\Projects\Bayesian-Hyperparameter-Optimization\data\caravan-insurance-challenge.csv')

train = data[data['ORIGIN']=='train']
test = data[data['ORIGIN']=='test']

data.info()
train_labels = np.array(train['CARAVAN'].astype(np.int32)).reshape((-1,))
test_labels  = np.array(test['CARAVAN'].astype(np.int32)).reshape((-1,))

#data cleaning
train = train.drop(columns = ['ORIGIN','CARAVAN'])
test = test.drop(columns = ['ORIGIN','CARAVAN'])

#Conversion to numpy array for splitting in cross validation

features = np.array(train)
test_features = np.array(test)
labels = train_labels[:]

print('Train shape', train.shape)
print('Test shape',test.shape)
train.head()


Distribution of Labels

In [None]:
import matplotlib.pyplot as plt

plt.hist(labels,edgecolor='k')
plt.xlabel('Label')
plt.ylabel("Count")
plt.title("Count of Labels")

The labels are imbalanced: there are far more observations where an insurance policy was not bought (0) than where it was bought (1). Accuracy is therefore a poor metric for this task, because always predicting the majority class would already score highly. Instead, we will use the common classification metric of Receiver Operating Characteristic Area Under the Curve (ROC AUC). Randomly guessing on a classification problem yields an ROC AUC of 0.5, and a perfect classifier has an ROC AUC of 1.0. For a better baseline than random guessing, we can train a Gradient Boosting Machine with default hyperparameters and have it make predictions.
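
The contrast between accuracy and ROC AUC on imbalanced labels can be checked with a small standalone sketch. It uses made-up labels with an assumed positive rate of about 6% rather than the Caravan data, and `roc_auc` is a hypothetical helper that computes the metric from its pairwise definition (the notebook itself uses `roc_auc_score` from scikit-learn).

```python
import random

def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

random.seed(0)
# Toy imbalanced labels: roughly 6% positives (an assumed rate, for illustration only)
toy_labels = [1 if random.random() < 0.06 else 0 for _ in range(2000)]

# Always predicting the majority class (0) gives a high accuracy...
majority_accuracy = sum(1 for y in toy_labels if y == 0) / len(toy_labels)

# ...but a constant score carries no ranking information, so its AUC is 0.5
constant_auc = roc_auc(toy_labels, [0.0] * len(toy_labels))

# Random scores also land near 0.5, while scores that rank perfectly give 1.0
random_auc = roc_auc(toy_labels, [random.random() for _ in toy_labels])
perfect_auc = roc_auc(toy_labels, [float(y) for y in toy_labels])

print(majority_accuracy, constant_auc, random_auc, perfect_auc)
```

The pairwise loop is quadratic in the number of samples, which is fine for a toy check but is not how one would compute the metric on a large dataset.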

In [None]:
#Default GBM i.e. model with default hyperparameters

model = lgb.LGBMClassifier()
param = model.get_params()
param

In [None]:
from sklearn.metrics import roc_auc_score
from timeit import default_timer as timer
start = timer()
model.fit(features,labels)
train_time = timer() - start

predictions = model.predict_proba(test_features)[:,1]
auc = roc_auc_score(test_labels,predictions)

print("the baseline score on the test set is {:.4f}.".format(auc))
print("the baseline training time is {:.4f} seconds ".format(train_time))

We have to beat the baseline metric. Due to the small size of the dataset, hyperparameter tuning will have a modest but noticeable effect on the performance.

Let's start with hyperparameter tuning
 
First we will implement a common technique for hyperparameter optimization: random search. Each iteration, we choose a random set of model hyperparameters from a search space.

Random search uses the following four parts, which are also used in Bayesian hyperparameter optimization:

1.Domain: values over which to search

2.Optimization : pick the next values at random

3.Objective function to minimize: in this case our metric is cross validation ROC AUC

4.Results history that tracks the hyperparameters tried and the cross validation metric

Random search is available in scikit-learn as RandomizedSearchCV. However, because we are using early stopping (to determine the optimal number of estimators), we will implement the method ourselves.
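
The four parts listed above can be sketched on a toy problem before bringing in the GBM. Everything here is hypothetical and for illustration only: `toy_grid`, `toy_objective` and the two made-up hyperparameters `x` and `y` stand in for the real grid and for 1 - cross validation ROC AUC.

```python
import random

# 1. Domain: a discrete grid of values for each (made-up) hyperparameter
toy_grid = {
    'x': [i / 10 for i in range(-50, 51)],  # -5.0 ... 5.0
    'y': [i / 10 for i in range(-50, 51)],
}

# 3. Objective function to minimize (a stand-in for 1 - cross validation ROC AUC)
def toy_objective(params):
    return (params['x'] - 1.0) ** 2 + (params['y'] + 2.0) ** 2

random.seed(50)
history = []  # 4. Results history

for iteration in range(30):
    # 2. Optimization: pick the next values at random, ignoring past results
    params = {key: random.choice(values) for key, values in toy_grid.items()}
    history.append({'loss': toy_objective(params), 'params': params, 'iteration': iteration})

# The best trial is simply the record with the lowest loss
best = min(history, key=lambda record: record['loss'])
print(best)
```

Bayesian optimization keeps parts 1, 3 and 4 and replaces only part 2: the next values are chosen using the history instead of being drawn blindly.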

In [None]:
import random

Random search and Bayesian optimization both search for hyperparameters from a domain. For random (or grid) search, this domain is called a hyperparameter grid and uses discrete values for the hyperparameters.

In [None]:
param

Explanation of Parameters in the param_grid (Source - ChatGPT)

class_weight: Used to handle imbalanced data by weighting the classes. Possible values are None (all classes have equal weights) and balanced (weights are set inversely proportional to class frequencies in the input data).

boosting_type: Specifies the type of algorithm to use. Possible values are gbdt (traditional Gradient Boosting Decision Tree), goss (Gradient-based One-Side Sampling) and dart (Dropouts meet Multiple Additive Regression Trees).

num_leaves: Maximum number of leaves in each tree. More leaves increase the model complexity and can lead to overfitting.

learning_rate: Controls the impact of each tree on the final outcome. Lower rates mean more trees are needed to fit the data, but the model tends to be more robust to overfitting.

subsample_for_bin: Number of samples used to construct the feature histogram bins. Using fewer samples makes binning faster but the bin boundaries less precise.

min_child_samples: Minimum number of data points needed in a child (leaf). Used to control overfitting. Higher values prevent the model from learning relations that are highly specific to the particular sample selected for a tree.

reg_alpha (L1 regularization): Penalizes the absolute value of the leaf weights, which can shrink some of them to zero.

reg_lambda (L2 regularization): Penalizes the squared leaf weights. Smooths the weights to avoid overfitting by discouraging weights that are too large.

colsample_bytree: The fraction of features to be used for each tree. A smaller value can lead to faster training and helps prevent overfitting.

In [None]:
# Hyperparameter grid
param_grid = {
    'class_weight': [None, 'balanced'],
    'boosting_type': ['gbdt', 'goss', 'dart'],
    'num_leaves': list(range(30, 150)),
    'learning_rate': list(np.logspace(np.log(0.005), np.log(0.2), base = np.exp(1), num = 1000)),
    'subsample_for_bin': list(range(20000, 300000, 20000)),
    'min_child_samples': list(range(20, 500, 5)),
    'reg_alpha': list(np.linspace(0, 1)),
    'reg_lambda': list(np.linspace(0, 1)),
    'colsample_bytree': list(np.linspace(0.6, 1, 10))
}

# Subsampling (not applicable with 'goss')
subsample_dist = list(np.linspace(0.5, 1, 100))

The learning rate is represented by the logarithmic distribution because it can vary over several orders of magnitude.
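
To see what the logarithmic distribution changes, the sketch below compares log-uniform and plain uniform sampling over the same interval used in the grid (0.005 to 0.2). The cut-off of 0.02 is an arbitrary value chosen for illustration. In expectation about 38% of log-uniform draws fall below it (ln 4 / ln 40), against about 8% of uniform draws (0.015 / 0.195).

```python
import math
import random

random.seed(50)
low, high = 0.005, 0.2

# Log-uniform: sample uniformly in log space, then exponentiate
log_uniform = [math.exp(random.uniform(math.log(low), math.log(high))) for _ in range(10000)]

# Plain uniform over the same interval, for comparison
uniform = [random.uniform(low, high) for _ in range(10000)]

threshold = 0.02  # arbitrary cut-off, chosen for illustration
share_log = sum(1 for v in log_uniform if v < threshold) / len(log_uniform)
share_uniform = sum(1 for v in uniform if v < threshold) / len(uniform)

print('share below 0.02 (log-uniform):', round(share_log, 3))
print('share below 0.02 (uniform):', round(share_uniform, 3))
```

With a plain uniform distribution, small learning rates would rarely be tried, even though they span a large part of the range on a multiplicative scale.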

In [None]:
# Plotting the learning rate

plt.hist(param_grid['learning_rate'], color = 'red', edgecolor = 'black')
plt.xlabel('Learning Rate', size = 14); plt.ylabel('Count', size = 14); plt.title('Learning Rate Distribution', size = 18)

We can observe that all values lie between 0.005 and 0.200, and that smaller learning rates are more common.

The domain is also wide, spanning more than an order of magnitude, which reflects our uncertainty about the optimal value.

In [None]:
# plot the leaves and check for its distribution

plt.hist(param_grid['num_leaves'], color = 'm', edgecolor = 'k')
plt.xlabel('Number of Leaves', size = 14); plt.ylabel('Count', size = 14); plt.title('Number of Leaves Distribution', size = 18);

The number of leaves is uniformly distributed over the grid.

In [None]:
#The next task is to sample the set of hyperparameters from the grid using a dictionary comprehension
params = {key: random.sample(value,1)[0] for key, value in param_grid.items()}
params

If the boosting_type is not goss, add a subsample

In [None]:
params['subsample'] = random.sample(subsample_dist, 1)[0] if params['boosting_type'] != 'goss' else 1.0
params

The subsample is set to 1.0 if the boosting type is goss which is equivalent to not using any subsampling.

Cross Validation with Early stopping in LightGBM. 

Why is early stopping used? (Source: ChatGPT)

Early stopping is a form of regularization used to avoid overfitting when training a machine learning model, particularly in the context of iterative algorithms like those used in training deep neural networks or gradient boosting models. Here’s why early stopping is important in the context of cross-validation:

Prevent Overfitting: One of the primary reasons to use early stopping is to prevent the model from overfitting. Overfitting occurs when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. By stopping the training process once the model's performance ceases to improve (or begins to degrade) on a held-out validation dataset, you can ensure the model maintains a general capability to perform well on unseen data.

Optimize Training Time: Early stopping helps in reducing unnecessary training time by stopping the training process once further training no longer leads to better results on the validation set. This makes the training process more efficient by not wasting resources.

Model Selection: In cross-validation, particularly when using techniques like k-fold cross-validation, early stopping can be used to select the iteration or model that performs the best on the validation data, not merely the last iteration. This helps in selecting a model that strikes the right balance between bias and variance.
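
The stopping rule described above can be written down in a few lines. This is a simplified sketch on made-up validation scores, not LightGBM's implementation: `early_stopping_round` is a hypothetical helper that assumes a metric where higher is better (such as AUC).

```python
def early_stopping_round(validation_scores, stopping_rounds):
    """Return (best_round, best_score, rounds_run) for a metric where higher is better.

    Training halts once the score has not improved for `stopping_rounds` rounds.
    Rounds are 1-indexed to match the number of boosting iterations.
    """
    best_score = float('-inf')
    best_round = 0
    for current_round, score in enumerate(validation_scores, start=1):
        if score > best_score:
            best_score, best_round = score, current_round
        elif current_round - best_round >= stopping_rounds:
            # No improvement for `stopping_rounds` rounds: stop here
            return best_round, best_score, current_round
    return best_round, best_score, len(validation_scores)

# Made-up validation AUC per boosting round
toy_auc = [0.60, 0.66, 0.70, 0.72, 0.73, 0.729, 0.728, 0.731, 0.727, 0.726, 0.725, 0.724]
best_round, best_score, rounds_run = early_stopping_round(toy_auc, stopping_rounds=3)
print(best_round, best_score, rounds_run)
```

On these scores the best round is 8 (AUC 0.731) and training stops after round 11, so the last round is never run. The best round, not the last one, is what we later use as the number of estimators.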



In [None]:
#Create a lgb dataset, that is suitable for training lightGBM models.

train_set = lgb.Dataset(features,label=labels)

In [None]:
#Perform cross validation with 10 folds
r = lgb.cv(
    params=params,
    train_set=train_set,
    num_boost_round=500,
    nfold=10,
    metrics='auc',
    callbacks=[lgb.callback.early_stopping(stopping_rounds=30, verbose=False)], 
    seed=50
)
# Best mean AUC across the folds and the standard deviation at that boosting round
r_best = np.max(r['valid auc-mean'])
r_best_std = r['valid auc-stdv'][np.argmax(r['valid auc-mean'])]

print('The maximum ROC AUC on the validation set was {:.5f} with std of {:.5f}.'.format(r_best, r_best_std))
print('The ideal number of iterations was {}.'.format(np.argmax(r['valid auc-mean']) + 1))


Result Dataframe

In [None]:
# Dataframe to hold cv results, Use this dataframe to compare the different results obtained from different techniques
random_results = pd.DataFrame(columns = ['loss', 'params', 'iteration', 'estimators', 'time'],
                       index = list(range(MAX_EVALS)))

Objective Function

The objective function takes in the hyperparameters and returns the validation loss.

In the case of random search, the next values selected are not based on the past evaluation results.

In [None]:
def random_objective(params, iteration, n_folds=N_FOLDS):
    """ Function to execute CV with given hyperparameters and return metrics and timing. """
    start = timer()
    cv_results = lgb.cv(
        params=params,
        train_set=train_set,
        num_boost_round=500,
        nfold=n_folds,
        metrics='auc',
        callbacks=[lgb.early_stopping(stopping_rounds=50)],
        seed=50
    )
    end = timer()
    best_score = np.max(cv_results['valid auc-mean'])
    loss = 1 - best_score
    n_estimators = int(np.argmax(cv_results['valid auc-mean']) + 1)
    return [loss, params, iteration, n_estimators, end - start]

Random Search Implementation

In [None]:
%%capture

import random
random.seed(50)

# Iterate through the specified number of evaluations
for i in range(MAX_EVALS):
    params = {key: random.sample(value, 1)[0] for key, value in param_grid.items()}
    if params['boosting_type'] == 'goss':
        params['subsample'] = 1.0
    else:
        params['subsample'] = random.sample(subsample_dist, 1)[0]
    
    train_set = lgb.Dataset(features, label=labels) 
    results_list = random_objective(params, i)
    random_results.loc[i, :] = results_list

random_results.sort_values('loss', ascending=True, inplace=True)
random_results.head()

In [None]:
# Sort results by best validation score
random_results.sort_values('loss', ascending = True, inplace = True)
random_results.reset_index(inplace = True, drop = True)
random_results.head()

Random Search Performance

The baseline gradient boosting model achieved a score of 0.70 on the test set. We can now take the best parameters from random search and evaluate them on the test set as well.

In [None]:
random_results.loc[0, 'params']

The estimators key holds the number of boosting rounds at which the mean cross validation AUC (averaged over the 10 folds) was highest. We can use this as the optimal number of estimators in the gradient boosting model.

In [None]:
# Find the best parameters and number of estimators
best_random_params = random_results.loc[0, 'params'].copy()
best_random_estimators = int(random_results.loc[0, 'estimators'])
best_random_model = lgb.LGBMClassifier(n_estimators=best_random_estimators, n_jobs = -1, 
                                       objective = 'binary', **best_random_params, random_state = 50)

# Fit on the training data
best_random_model.fit(features, labels)

# Make test predictions
predictions = best_random_model.predict_proba(test_features)[:, 1]

print('The best model from random search scores {:.4f} on the test data.'.format(roc_auc_score(test_labels, predictions)))
print('This was achieved using {} search iterations.'.format(random_results.loc[0, 'iteration']))

## Bayesian Hyperparameter Optimization using Hyperopt

For Bayesian optimization, we need the following four parts:

1. Objective function

2. Domain space

3. Hyperparameter optimization algorithm

4. History of results

We already used all of these in random search, but for Hyperopt we will have to make a few changes.

Objective Function

This objective function will still take in the hyperparameters but it will return not a list but a dictionary. The only requirement for an objective function in Hyperopt is that it has a key in the return dictionary called "loss" to minimize and a key called "status" indicating if the evaluation was successful.

If we want to keep track of the number of iterations, we can declare a global variable called ITERATION that is incremented every time the function is called. In addition to returning comprehensive results, every time the function is evaluated we will write the results to a new line of a csv file. This is useful for very long runs when we want to check on the progress (it might not be the most elegant solution, but it is better than printing to the console because our results are saved).

The most important part of this function is that we now need to return a value to minimize and not the raw ROC AUC. We are trying to find the best value of the objective function, and even though a higher ROC AUC is better, Hyperopt works to minimize a function. Therefore, a simple solution is to return 1 - ROC AUC (we did this for random search as well, for consistency).

In [None]:
import csv
from hyperopt import STATUS_OK
from timeit import default_timer as timer

def objective(params, n_folds = N_FOLDS):
    """Objective function for Gradient Boosting Machine Hyperparameter Optimization"""
    
    # Keep track of evals
    global ITERATION
    
    ITERATION += 1
    out_file = 'hyperopt_results.csv'

    # Retrieve the subsample if present otherwise set to 1.0
    subsample = params['boosting_type'].get('subsample', 1.0)
    
    # Extract the boosting type
    params['boosting_type'] = params['boosting_type']['boosting_type']
    params['subsample'] = subsample
    
    # Make sure parameters that need to be integers are integers
    for parameter_name in ['num_leaves', 'subsample_for_bin', 'min_child_samples']:
        params[parameter_name] = int(params[parameter_name])
    
    start = timer()
    
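    # Perform n_folds cross validation with early stopping (same patience as random_objective)
    cv_results = lgb.cv(params, train_set, num_boost_round = 10000, nfold = n_folds, 
                         metrics = 'auc', seed = 50,
                         callbacks = [lgb.early_stopping(stopping_rounds = 50, verbose = False)])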
    
    run_time = timer() - start
    
    # Extract the best score
    best_score = np.max(cv_results['valid auc-mean'])
    
    # Loss must be minimized
    loss = 1 - best_score
    
    # Boosting rounds that returned the highest cv score
    n_estimators = int(np.argmax(cv_results['valid auc-mean']) + 1)

    # Write to the csv file ('a' means append); the with block closes the file afterwards
    with open(out_file, 'a', newline = '') as of_connection:
        writer = csv.writer(of_connection)
        writer.writerow([loss, params, ITERATION, n_estimators, run_time])
    
    # Dictionary with information for evaluation
    return {'loss': loss, 'params': params, 'iteration': ITERATION,
            'estimators': n_estimators, 
            'train_time': run_time, 'status': STATUS_OK}
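
The objective function above increments `ITERATION` and appends rows to a csv file, so both need to exist before the first call. A minimal setup sketch follows. `init_results_file` and the column names are hypothetical (the names mirror the keys of the returned dictionary), and a temporary path is used here only so the example does not touch the working directory.

```python
import csv
import os
import tempfile

def init_results_file(path):
    """Create (or overwrite) the results file and write the header row."""
    with open(path, 'w', newline='') as of_connection:
        writer = csv.writer(of_connection)
        writer.writerow(['loss', 'params', 'iteration', 'estimators', 'train_time'])

# Counter that the objective function increments through `global ITERATION`
ITERATION = 0

# Temporary path for illustration; the objective above appends to
# 'hyperopt_results.csv' in the working directory
demo_path = os.path.join(tempfile.gettempdir(), 'hyperopt_results_demo.csv')
init_results_file(demo_path)

# Read the file back to confirm the header is in place
with open(demo_path, newline='') as f:
    header = next(csv.reader(f))
print(header)
```

In the notebook itself, calling `init_results_file('hyperopt_results.csv')` once before the optimization would give the appended rows a header to sit under.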

Domain Space

Specifying the domain (called the space in Hyperopt) is a little trickier than in grid search. In Hyperopt, and other Bayesian optimization frameworks, the domain is not a discrete grid but instead has probability distributions for each hyperparameter. For each hyperparameter, we will use the same limits as with the grid, but instead of being defined at each point, the domain represents probabilities for each hyperparameter. This will probably become clearer in the code and the images!

In [None]:
from hyperopt import hp
from hyperopt.pyll.stochastic import sample

In [None]:
# Create the learning rate
learning_rate = {'learning_rate': hp.loguniform('learning_rate', np.log(0.005), np.log(0.2))}

In [None]:
import seaborn as sns
learning_rate_dist = []

# Draw 10000 samples from the learning rate domain
for _ in range(10000):
    learning_rate_dist.append(sample(learning_rate)['learning_rate'])
    
plt.figure(figsize = (8, 6))
sns.kdeplot(learning_rate_dist, color = 'red', linewidth = 2, fill = True);
plt.title('Learning Rate Distribution', size = 18); 
plt.xlabel('Learning Rate', size = 16); plt.ylabel('Density', size = 16);

The number of leaves again follows a uniform distribution. Here we use quniform, a quantized (discrete) uniform distribution, as opposed to a continuous one.

In [None]:
# Discrete uniform distribution
num_leaves = {'num_leaves': hp.quniform('num_leaves', 30, 150, 1)}
num_leaves_dist = []

# Sample 10000 times from the number of leaves distribution
for _ in range(10000):
    num_leaves_dist.append(sample(num_leaves)['num_leaves'])
    
# kdeplot
plt.figure(figsize = (8, 6))
sns.kdeplot(num_leaves_dist, linewidth = 2, fill = True);
plt.title('Number of Leaves Distribution', size = 18); plt.xlabel('Number of Leaves', size = 16); plt.ylabel('Density', size = 16);
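
The Hyperopt documentation describes `hp.quniform(label, low, high, q)` as returning a value like `round(uniform(low, high) / q) * q`. The sketch below mimics that formula in plain Python to show why the sampled values are floats and why the objective function casts them with `int()`. `quniform_sketch` is a hypothetical stand-in, not the library function.

```python
import random

def quniform_sketch(low, high, q, rng=random):
    """Mimic the documented behavior of hp.quniform: round(uniform(low, high) / q) * q."""
    return float(round(rng.uniform(low, high) / q) * q)

random.seed(50)

# Same range as num_leaves in the search space
draws = [quniform_sketch(30, 150, 1) for _ in range(5)]
print(draws)  # whole numbers, but stored as floats

# LightGBM expects an integer number of leaves, hence the int() cast
num_leaves_values = [int(value) for value in draws]
print(num_leaves_values)

# A coarser step, as used for subsample_for_bin: only multiples of 20000 appear
coarse = [quniform_sketch(20000, 300000, 20000) for _ in range(1000)]
print(sorted(set(coarse))[:3])
```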

Conditional Domain

In Hyperopt, we can use nested conditional statements to indicate hyperparameters that depend on other hyperparameters. For example, the goss boosting type cannot use subsample, so when we set up the boosting_type categorical variable, we set the subsample to 1.0 for goss, while for the other boosting types it is a float between 0.5 and 1.0. Let's see this with an example.

In [None]:
# Nested conditional: each hp.* label must be unique if the space is later passed to fmin
boosting_type = {'boosting_type': hp.choice('boosting_type', 
                                            [{'boosting_type': 'gbdt', 'subsample': hp.uniform('gbdt_subsample', 0.5, 1)}, 
                                             {'boosting_type': 'dart', 'subsample': hp.uniform('dart_subsample', 0.5, 1)},
                                             {'boosting_type': 'goss', 'subsample': 1.0}])}

# Draw a sample
params = sample(boosting_type)
params

In [None]:
# Retrieve the subsample if present otherwise set to 1.0
subsample = params['boosting_type'].get('subsample', 1.0)

# Extract the boosting type
params['boosting_type'] = params['boosting_type']['boosting_type']
params['subsample'] = subsample

params

This step is needed because the GBM cannot use the nested dictionary, so we set boosting_type and subsample as top-level keys. Nested conditionals allow us to use a different set of hyperparameters depending on other hyperparameters. For example, we can explore different models with completely different sets of hyperparameters by using nested conditionals. The only requirement is that the first nested statement must be based on a choice hyperparameter (the choice could be the type of model).

Complete Bayesian Domain

Now we can define the entire domain. Each variable needs a label and a few parameters specifying the type and extent of the distribution. For categorical variables such as the boosting type, we use choice. Other variable types include quniform, loguniform, and uniform. For the complete list, check out the documentation for Hyperopt.

In [None]:
# Define the search space
space = {
    'class_weight': hp.choice('class_weight', [None, 'balanced']),
    'boosting_type': hp.choice('boosting_type', [{'boosting_type': 'gbdt', 'subsample': hp.uniform('gbdt_subsample', 0.5, 1)}, 
                                                 {'boosting_type': 'dart', 'subsample': hp.uniform('dart_subsample', 0.5, 1)},
                                                 {'boosting_type': 'goss', 'subsample': 1.0}]),
    'num_leaves': hp.quniform('num_leaves', 30, 150, 1),
    'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.2)),
    'subsample_for_bin': hp.quniform('subsample_for_bin', 20000, 300000, 20000),
    'min_child_samples': hp.quniform('min_child_samples', 20, 500, 5),
    'reg_alpha': hp.uniform('reg_alpha', 0.0, 1.0),
    'reg_lambda': hp.uniform('reg_lambda', 0.0, 1.0),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.6, 1.0)
}

Example of Sampling from the Domain

Let's sample from the domain (using the conditional logic) to see the result of each draw. Every time we run this code, the results will change. (Again notice that we need to assign the top level keys to the keywords understood by the GBM).

In [None]:
# Sample from the full space
x = sample(space)

# Conditional logic to assign top-level keys
subsample = x['boosting_type'].get('subsample', 1.0)
x['boosting_type'] = x['boosting_type']['boosting_type']
x['subsample'] = subsample

x

In [None]:
x = sample(space)
subsample = x['boosting_type'].get('subsample', 1.0)
x['boosting_type'] = x['boosting_type']['boosting_type']
x['subsample'] = subsample
x

Optimization Algorithm

Although this is the most technical part of Bayesian optimization, defining the algorithm to use in Hyperopt is simple. We will use the Tree-structured Parzen Estimator (TPE), which is one method for constructing the surrogate function and choosing the next hyperparameters to evaluate.
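
To give some intuition for what TPE does, here is a heavily simplified one-dimensional sketch. It is not Hyperopt's implementation. It splits past evaluations into a low-loss group and the rest, models each group with a Parzen (kernel density) estimate, and picks the candidate with the highest ratio of the two densities. The fixed bandwidth, the `gamma` split fraction, the number of candidates and the toy loss are all assumptions made for illustration.

```python
import math
import random

def kde(x, points, bandwidth):
    """Average of Gaussian kernels centred on `points` (a Parzen window estimate)."""
    norm = bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / (len(points) * norm)

def tpe_suggest_sketch(history, low, high, rng, gamma=0.25, n_candidates=24):
    """Propose the next x from past (x, loss) pairs, in the spirit of TPE."""
    ordered = sorted(history, key=lambda pair: pair[1])
    n_good = max(1, int(math.ceil(gamma * len(ordered))))
    good = [x for x, _ in ordered[:n_good]]   # models l(x): the low-loss observations
    bad = [x for x, _ in ordered[n_good:]]    # models g(x): all other observations
    bandwidth = (high - low) / 10.0           # fixed bandwidth (an assumption)
    best_x, best_ratio = None, -1.0
    for _ in range(n_candidates):
        # Draw a candidate from l(x): pick a good point and perturb it
        candidate = rng.gauss(rng.choice(good), bandwidth)
        candidate = min(max(candidate, low), high)
        # Score the candidate by l(x) / g(x); the small constant avoids division by zero
        ratio = kde(candidate, good, bandwidth) / (kde(candidate, bad, bandwidth) + 1e-12)
        if ratio > best_ratio:
            best_x, best_ratio = candidate, ratio
    return best_x

def toy_loss(x):
    return (x - 2.0) ** 2

rng = random.Random(50)
low, high = -5.0, 5.0

# Start with a few random evaluations, since there is nothing to model yet
history = [(x, toy_loss(x)) for x in (rng.uniform(low, high) for _ in range(10))]

# Then let the density ratio guide every new evaluation
for _ in range(40):
    x_next = tpe_suggest_sketch(history, low, high, rng)
    history.append((x_next, toy_loss(x_next)))

best_x, best_loss = min(history, key=lambda pair: pair[1])
print(best_x, best_loss)
```

Unlike random search, each proposal here depends on the history: regions where low-loss points are dense relative to the other points are sampled more often. In Hyperopt all of this is selected by passing `tpe.suggest` as the `algo` argument to `fmin`.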

The final part is the result history. Here, we are using two methods to make sure we capture all the results:

1. A Trials object that stores the dictionary returned from the objective function

2. Writing to a csv file every iteration

The csv file option also lets us monitor the results of an on-going experiment.

Bayesian Optimization

Now we run the optimization. First we declare the global variable that will be used to keep track of the number of iterations. Then we call fmin, passing in everything we defined above and the maximum number of iterations.

In [1]:
import numpy as np
import pandas as pd
import lightgbm as lgb
from hyperopt import hp, fmin, tpe, Trials, STATUS_OK
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from timeit import default_timer as timer
import random
import os
import csv
import ast

# Load and prepare the dataset
data = pd.read_csv(r'C:\Users\obero\Desktop\RUTGERS\Projects\Bayesian-Hyperparameter-Optimization\data\caravan-insurance-challenge.csv')
data.drop(columns=['ORIGIN'], inplace=True)

# Split data into training and testing sets
train, test = train_test_split(data, test_size=0.2, random_state=50)
y_train = train.pop('CARAVAN')
y_test = test.pop('CARAVAN')
X_train = train
X_test = test

# Define the hyperparameter search space
space = {
    'boosting_type': hp.choice('boosting_type', ['gbdt', 'dart', 'goss']),
    'num_leaves': hp.quniform('num_leaves', 30, 150, 1),
    'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.2)),
    'subsample_for_bin': hp.quniform('subsample_for_bin', 20000, 300000, 20000),
    'min_child_samples': hp.quniform('min_child_samples', 20, 500, 5),
    'reg_alpha': hp.uniform('reg_alpha', 0.0, 1.0),
    'reg_lambda': hp.uniform('reg_lambda', 0.0, 1.0),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.6, 1.0),
    'subsample': hp.uniform('subsample', 0.5, 1),
    'force_col_wise': True
}

# Objective function for Bayesian Optimization
def objective(params):
    params['num_leaves'] = int(params['num_leaves'])
    params['subsample_for_bin'] = int(params['subsample_for_bin'])
    params['min_child_samples'] = int(params['min_child_samples'])
    train_set = lgb.Dataset(X_train, label=y_train)
    
    cv_results = lgb.cv(
        params,
        train_set,
        num_boost_round=100,
        nfold=5,
        metrics='auc',
        seed=50,
        callbacks=[lgb.early_stopping(stopping_rounds=50)]
    )
    
    best_score = np.max(cv_results['valid auc-mean'])
    loss = 1 - best_score
    return {'loss': loss, 'params': params, 'status': STATUS_OK, 'best_score': best_score}


In [2]:
# Trials object to track progress

bayes_trials = Trials()
SEED = 50

# Run Bayesian optimization
best_bayes = fmin(
    fn=objective,
    space=space,
    algo=tpe.suggest,
    max_evals=50,
    trials=bayes_trials,
    rstate=np.random.default_rng(SEED)
)

# Collecting results
bayes_results = pd.DataFrame({
    'loss': [x['loss'] for x in bayes_trials.results],
    'iteration': range(1, len(bayes_trials.results) + 1),
    'params': [x['params'] for x in bayes_trials.results]
})

print("Best hyperparameters from Bayesian optimization:", best_bayes)
print("Detailed results:", bayes_results.head())
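
One detail to keep in mind when reading `best_bayes`: for `hp.choice` parameters, `fmin` returns the index of the selected option rather than the option itself, and quniform parameters come back as floats. Hyperopt provides `space_eval(space, best_bayes)` to map the indices back to options. The standalone sketch below shows the same idea in plain Python. `decode_best` is a hypothetical helper and `best_example` is made up for illustration, not a real result.

```python
# Options in the same order as the hp.choice call in the search space
boosting_options = ['gbdt', 'dart', 'goss']

# Made-up stand-in for the dictionary returned by fmin (not a real result)
best_example = {'boosting_type': 2, 'num_leaves': 64.0, 'min_child_samples': 155.0,
                'subsample_for_bin': 120000.0, 'learning_rate': 0.05}

def decode_best(best, choices, integer_keys):
    """Map hp.choice indices back to their options and cast quantized floats to int."""
    decoded = dict(best)  # copy, so the original result is left untouched
    for key, options in choices.items():
        decoded[key] = options[int(decoded[key])]
    for key in integer_keys:
        decoded[key] = int(decoded[key])
    return decoded

decoded = decode_best(best_example, {'boosting_type': boosting_options},
                      ['num_leaves', 'subsample_for_bin', 'min_child_samples'])
print(decoded)
```

The `params` entries stored in `bayes_trials.results` do not need this step, because the objective function receives the already resolved values.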


[LightGBM] [Info] Total Bins 529                      
[LightGBM] [Info] Number of data points in the train set: 6285, number of used features: 61
[LightGBM] [Info] Total Bins 529                      
[LightGBM] [Info] Number of data points in the train set: 6285, number of used features: 61
[LightGBM] [Info] Total Bins 529                      
[LightGBM] [Info] Number of data points in the train set: 6286, number of used features: 61
[LightGBM] [Info] Total Bins 529                      
[LightGBM] [Info] Number of data points in the train set: 6286, number of used features: 61
[LightGBM] [Info] Total Bins 529                      
[LightGBM] [Info] Number of data points in the train set: 6286, number of used features: 61
[LightGBM] [Info] Start training from score 0.058711  
[LightGBM] [Info] Start training from score 0.058711  
[LightGBM] [Info] Start training from score 0.058861  
[LightGBM] [Info] Start training from score 0.058861  
[LightGBM] [Info] Start training from score 0




[LightGBM] [Info] Total Bins 578                                                 
[LightGBM] [Info] Number of data points in the train set: 6285, number of used features: 73
[LightGBM] [Info] Using GOSS                                                     
[LightGBM] [Info] Total Bins 578                                                 
[LightGBM] [Info] Number of data points in the train set: 6285, number of used features: 73
[LightGBM] [Info] Using GOSS                                                     
[LightGBM] [Info] Total Bins 578                                                 
[LightGBM] [Info] Number of data points in the train set: 6286, number of used features: 73
[LightGBM] [Info] Using GOSS                                                     
[LightGBM] [Info] Total Bins 578                                                 
[LightGBM] [Info] Number of data points in the train set: 6286, number of used features: 73
[LightGBM] [Info] Using GOSS                              




[LightGBM] [Info] Total Bins 529                                                 
[LightGBM] [Info] Number of data points in the train set: 6285, number of used features: 61
[LightGBM] [Info] Total Bins 529                                                 
[LightGBM] [Info] Number of data points in the train set: 6285, number of used features: 61
[LightGBM] [Info] Total Bins 529                                                 
[LightGBM] [Info] Number of data points in the train set: 6286, number of used features: 61
[LightGBM] [Info] Total Bins 529                                                 
[LightGBM] [Info] Number of data points in the train set: 6286, number of used features: 61
[LightGBM] [Info] Total Bins 529                                                 
[LightGBM] [Info] Number of data points in the train set: 6286, number of used features: 61
[LightGBM] [Info] Start training from score 0.058711                             
[LightGBM] [Info] Start training from score 0.05




[LightGBM] [Info] Total Bins 505                                                 
[LightGBM] [Info] Number of data points in the train set: 6285, number of used features: 55
[LightGBM] [Info] Total Bins 505                                                 
[LightGBM] [Info] Number of data points in the train set: 6285, number of used features: 55
[LightGBM] [Info] Total Bins 505                                                 
[LightGBM] [Info] Number of data points in the train set: 6286, number of used features: 55
[LightGBM] [Info] Total Bins 505                                                 
[LightGBM] [Info] Number of data points in the train set: 6286, number of used features: 55
[LightGBM] [Info] Total Bins 505                                                 
[LightGBM] [Info] Number of data points in the train set: 6286, number of used features: 55
[LightGBM] [Info] Start training from score 0.058711                             
[LightGBM] [Info] Start training from score 0.05





In [3]:
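# Evaluate the best Bayesian-optimization trial on the testing data
import ast
from sklearn.metrics import roc_auc_score

# Row with the lowest cross-validation loss
best_row = bayes_results.loc[bayes_results['loss'].idxmin()]
best_params = best_row['params']
# If the results were reloaded from a CSV file, params is a string and must be parsed into a dict
if isinstance(best_params, str):
    best_params = ast.literal_eval(best_params)

model = lgb.LGBMClassifier(n_estimators=1000, **best_params)
model.fit(X_train, y_train)

# Predict positive-class probabilities on the test data and score them
preds = model.predict_proba(X_test)[:, 1]
auc_score = roc_auc_score(y_test, preds)

print('The best model from Bayesian optimization scores {:.5f} AUC ROC on the test set.'.format(auc_score))
print('This was achieved after {} search iterations'.format(best_row['iteration']))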



[LightGBM] [Info] Number of positive: 462, number of negative: 7395
[LightGBM] [Info] Total Bins 505
[LightGBM] [Info] Number of data points in the train set: 7857, number of used features: 55
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.058801 -> initscore=-2.772994
[LightGBM] [Info] Start training from score -2.772994
The best model from Bayesian optimization scores 0.77092 AUC ROC on the test set.
This was achieved after 15 search iterations
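
To make the AUC ROC figure above easier to interpret, here is a minimal sketch of what the metric measures: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The function name `roc_auc` and the toy labels and scores are made up for illustration; the notebook itself relies on `roc_auc_score` from scikit-learn, which should be preferred in practice.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical, not taken from the Caravan dataset)
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so the test-set value reported above sits between the two.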


In [4]:
bayes_results

    loss      iteration  params
0   0.26296   1          {'boosting_type': 'dart', 'colsample_bytree': ...
1   0.253365  2          {'boosting_type': 'dart', 'colsample_bytree': ...
2   0.282635  3          {'boosting_type': 'goss', 'colsample_bytree': ...
3   0.257165  4          {'boosting_type': 'gbdt', 'colsample_bytree': ...
4   0.253912  5          {'boosting_type': 'gbdt', 'colsample_bytree': ...
5   0.253398  6          {'boosting_type': 'goss', 'colsample_bytree': ...
6   0.251606  7          {'boosting_type': 'gbdt', 'colsample_bytree': ...
7   0.253373  8          {'boosting_type': 'gbdt', 'colsample_bytree': ...
8   0.252418  9          {'boosting_type': 'gbdt', 'colsample_bytree': ...
9   0.253816  10         {'boosting_type': 'goss', 'colsample_bytree': ...
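
As a rough illustration of how the best trial is picked from a results table like this one, the sketch below ranks the ten rows displayed above by loss using only the standard library. The `(iteration, loss)` pairs are copied from those rows; only these ten rows are covered, so the result can differ from the best trial of the full search. The `params_text` string is a hypothetical stand-in, since the real parameter dictionaries are elided in the display.

```python
import ast

# (iteration, loss) pairs copied from the rows displayed above
displayed = [
    (1, 0.26296), (2, 0.253365), (3, 0.282635), (4, 0.257165),
    (5, 0.253912), (6, 0.253398), (7, 0.251606), (8, 0.253373),
    (9, 0.252418), (10, 0.253816),
]

# Hyperopt minimizes the loss, so the best trial is the one with the lowest value
ranked = sorted(displayed, key=lambda row: row[1])
best_iteration, best_loss = ranked[0]
print(best_iteration, best_loss)  # 7 0.251606

# When results are reloaded from a CSV file, the params column holds strings;
# ast.literal_eval turns such a string back into a dict without executing code.
params_text = "{'boosting_type': 'gbdt', 'colsample_bytree': 0.8}"  # hypothetical values
params = ast.literal_eval(params_text)
print(params['boosting_type'])  # gbdt
```

`ast.literal_eval` is used rather than `eval` because it only accepts Python literals, which is safer when the string comes from a file.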
