# Hyperparameter optimisation

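In this notebook I use Bayesian optimisation to tune hyperparameters. Unlike grid search and random search, it uses the results of past evaluations to choose the next hyperparameters to try, so it usually needs fewer evaluations to find a good configuration. Bayesian optimisation can be driven by different algorithms; here I explore the Tree-structured Parzen Estimator (TPE) algorithm implemented by hyperopt.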

In [1]:
import numpy as np
import pandas as pd
import os
import glob
import matplotlib.pyplot as plt
import datetime

In [2]:
# Importing utils 
os.chdir('C:\\Users\\hugo_\\OneDrive\\Documentos\\DataScience\\Repos\\kaggle_credit_risk\\code')

from utils import *

# Data directory
os.chdir('C:\\Users\\hugo_\\OneDrive\\Documentos\\DataScience\\Repos\\kaggle_credit_risk\\data\\treated')

In [3]:
train = pd.read_csv('train_eng.csv')
test = pd.read_csv('test_eng.csv')

In [4]:
# Our validation metric
from sklearn.metrics import roc_auc_score

# Importing LightGBM
import lightgbm as lgb

In [5]:
ID_train = train.SK_ID_CURR.values
ID_test = test.SK_ID_CURR.values

y_train = train.TARGET.values

X_train = train.drop(['SK_ID_CURR', 'TARGET'], axis=1).values
X_test = test.drop(['SK_ID_CURR'], axis=1).values

### Defining search space
First, let us define the search space for LGBM. Each hyperparameter must be defined as a probability distribution. These distributions are the priors that serve as input to the Bayesian optimisation algorithm. The hyperopt library offers several built-in distributions, such as uniform, log-uniform and quantised uniform.

In [6]:
# Define the search space
from hyperopt import hp

space_lgbm = {
    'class_weight': hp.choice('class_weight', [None, 'balanced']),
    'boosting_type': hp.choice('boosting_type', [{'boosting_type': 'gbdt', 'subsample': hp.uniform('gdbt_subsample', 0.5, 1)}, 
                                                 {'boosting_type': 'dart', 'subsample': hp.uniform('dart_subsample', 0.5, 1)},
                                                 {'boosting_type': 'goss', 'subsample': 1.0}]),
    'num_leaves': hp.quniform('num_leaves', 30, 150, 1),
    'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.2)),
    'subsample_for_bin': hp.quniform('subsample_for_bin', 20000, 300000, 20000),
    'min_child_samples': hp.quniform('min_child_samples', 20, 500, 5),
    'reg_alpha': hp.uniform('reg_alpha', 0.0, 1.0),
    'reg_lambda': hp.uniform('reg_lambda', 0.0, 1.0),
    'colsample_bytree': hp.uniform('colsample_by_tree', 0.6, 1.0)
}
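
To make the distributions in the space above more concrete, here is a minimal sketch of what they draw, written with the standard library only. The helper names (sample_uniform, sample_loguniform, sample_quniform) are hypothetical and not part of hyperopt; they only mirror the documented behaviour of hp.uniform, hp.loguniform and hp.quniform.

```python
import math
import random

def sample_uniform(rng, low, high):
    """Flat prior on [low, high], in the spirit of hp.uniform."""
    return rng.uniform(low, high)

def sample_loguniform(rng, low, high):
    """exp(u) with u uniform on [low, high], in the spirit of hp.loguniform.
    Note that low and high are bounds on the log of the value."""
    return math.exp(rng.uniform(low, high))

def sample_quniform(rng, low, high, q):
    """Uniform draw rounded to a multiple of q, in the spirit of hp.quniform.
    The result is a float, which is why integer parameters need a cast."""
    return float(round(rng.uniform(low, high) / q) * q)

rng = random.Random(0)

# Same bounds as 'learning_rate', 'subsample_for_bin' and 'reg_alpha' in space_lgbm
learning_rates = [sample_loguniform(rng, math.log(0.01), math.log(0.2)) for _ in range(1000)]
bin_sizes = [sample_quniform(rng, 20000, 300000, 20000) for _ in range(1000)]
reg_alphas = [sample_uniform(rng, 0.0, 1.0) for _ in range(1000)]

# A log-uniform prior puts about half of its mass below the geometric mean of the bounds
geometric_mean = math.sqrt(0.01 * 0.2)
share_below = sum(lr < geometric_mean for lr in learning_rates) / len(learning_rates)
print(round(geometric_mean, 4), share_below)
```

The log-uniform prior is a natural choice for the learning rate because it spends as many draws between 0.01 and 0.02 as between 0.1 and 0.2, instead of concentrating on the large values.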

With the search space defined, the Tree-structured Parzen Estimator algorithm has the prior distributions it will sample from and refine. Next, let's create a file to keep track of the optimisation results.

In [16]:
from hyperopt import tpe, Trials, STATUS_OK, fmin, space_eval
import csv

# optimization algorithm
tpe_algorithm = tpe.suggest

# Keep track of results
bayes_trials = Trials()

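# Getting a fraction of the data to optimise parameters (low RAM, sorry :T)
# Sample without replacement so that no row is duplicated across CV folds
n_rows = min(200000, X_train.shape[0]) # at most 200000 rows
indexes = np.random.choice(X_train.shape[0], size = n_rows, replace = False)
X_train_ = X_train[indexes,:]
y_train_ = y_train[indexes]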

# create the lgbm Dataset
train_set = lgb.Dataset(X_train_, label = y_train_)

# Output file
out_file = 'gbm_trials.csv'

# Write the headers to the file (the with block closes it afterwards)
with open(out_file, 'w') as of_connection:
    writer = csv.writer(of_connection)
    writer.writerow(['loss', 'params', 'iteration', 'estimators', 'train_time'])

Then, for the algorithm to work, we need to define the function it will optimise. I'll write this one specifically for LGBM, but later in this notebook I'll write a more generic function that supports other algorithms.

This function must return a dictionary with at least two keys: loss, which is the objective to minimise, and status, which is set to STATUS_OK to signal that the evaluation succeeded. The other keys are optional, but it is useful to keep track of them.
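
As a minimal sketch of that contract, the toy objective below returns the two required keys plus one optional key. The function toy_objective and its quadratic loss are hypothetical, and STATUS_OK is redefined locally as a stand-in (assumed to match the string constant exported by hyperopt) so that the snippet runs without hyperopt installed.

```python
# Stand-in for hyperopt.STATUS_OK so the sketch has no external dependency (assumption)
STATUS_OK = 'ok'

def toy_objective(params):
    """Hypothetical objective: a quadratic with its minimum at x = 3."""
    loss = (params['x'] - 3.0) ** 2
    return {
        'loss': loss,         # required: the value the optimiser minimises
        'status': STATUS_OK,  # required: marks the evaluation as successful
        'params': params,     # optional: extra bookkeeping stored with the trial
    }

result = toy_objective({'x': 2.5})
print(result['loss'], result['status'])
```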

In [17]:
from timeit import default_timer as timer

N_FOLDS = 5

def objective_lgbm(params, n_folds = N_FOLDS):
    """Objective function for Gradient Boosting Machine Hyperparameter Optimization"""
    
    # Keep track of evals
    global ITERATION
    
    ITERATION += 1
    
    # Retrieve the subsample if present otherwise set to 1.0
    subsample = params['boosting_type'].get('subsample', 1.0)
    
    # Extract the boosting type
    params['boosting_type'] = params['boosting_type']['boosting_type']
    params['subsample'] = subsample
    
    # Make sure parameters that need to be integers are integers
    for parameter_name in ['num_leaves', 'subsample_for_bin', 'min_child_samples']:
        params[parameter_name] = int(params[parameter_name])
    
    start = timer()
    
    # Perform n_folds cross validation
    cv_results = lgb.cv(params, train_set, num_boost_round = 10000, nfold = n_folds, 
                        early_stopping_rounds = 100, metrics = 'auc', seed = 50)
    
    run_time = timer() - start
    
    # Extract the best score
    best_score = np.max(cv_results['auc-mean'])
    
    # Loss must be minimized
    loss = 1 - best_score
    
    # Boosting rounds that returned the highest cv score
    n_estimators = int(np.argmax(cv_results['auc-mean']) + 1)

    # Write to the csv file ('a' means append); the with block closes the file
    with open(out_file, 'a') as of_connection:
        writer = csv.writer(of_connection)
        writer.writerow([loss, params, ITERATION, n_estimators, run_time])
    
    # Dictionary with information for evaluation
    return {'loss': loss, 'params': params, 'iteration': ITERATION,
            'estimators': n_estimators, 
            'train_time': run_time, 'status': STATUS_OK}
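
The way the loss and the number of estimators are derived from the cross-validation history can be checked on a small example. The AUC values below are made up for illustration; they only mimic the shape of a list of per-round mean scores.

```python
# Hypothetical per-round mean validation AUC (one value per boosting round)
auc_mean = [0.70, 0.74, 0.76, 0.755, 0.75]

best_score = max(auc_mean)                   # highest cross-validated AUC
best_round = auc_mean.index(best_score) + 1  # rounds are 1-based, list indices are 0-based
loss = 1 - best_score                        # the optimiser minimises, so flip the score

print(best_round, round(loss, 2))
```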

With everything set up, let's call the fmin function, passing it objective_lgbm and the search space to explore.

The algorithm works as follows:
1. Build a surrogate probability model of the objective function
2. Find the hyperparameters that perform best on the surrogate
3. Apply these hyperparameters to the true objective function
4. Update the surrogate model incorporating the new results
5. Repeat steps 2–4 until the maximum number of iterations or the time limit is reached
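
The loop described above can be sketched in a few lines for a single hyperparameter. This is a deliberately simplified, TPE-like illustration and not hyperopt's actual implementation: the objective, the fixed kernel bandwidth, the gamma quantile and all function names are assumptions made for the example.

```python
import math
import random

def true_objective(x):
    """Hypothetical expensive function to minimise; its minimum is at x = 2."""
    return (x - 2.0) ** 2

def kde(points, x, bandwidth=0.5):
    """Average of Gaussian bumps centred on `points`, evaluated at x."""
    norm = bandwidth * math.sqrt(2 * math.pi)
    total = sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) / norm for p in points)
    return total / len(points)

def tpe_like_search(n_startup=10, n_iter=40, gamma=0.25, n_candidates=24,
                    low=-5.0, high=5.0, seed=0):
    rng = random.Random(seed)

    # Random warm-up evaluations so the surrogate has something to model
    history = []
    for _ in range(n_startup):
        x = rng.uniform(low, high)
        history.append((x, true_objective(x)))

    for _ in range(n_iter):
        # Build the surrogate: one density for the best gamma share, one for the rest
        ranked = sorted(history, key=lambda item: item[1])
        n_good = max(1, int(gamma * len(ranked)))
        good = [x for x, _ in ranked[:n_good]]
        bad = [x for x, _ in ranked[n_good:]]

        # Find the most promising point on the surrogate: draw candidates near
        # good points and keep the one with the highest good/bad density ratio
        candidates = [min(high, max(low, rng.gauss(rng.choice(good), 0.5)))
                      for _ in range(n_candidates)]
        chosen = max(candidates, key=lambda c: kde(good, c) / (kde(bad, c) + 1e-12))

        # Evaluate the true objective and update the history (hence the surrogate)
        history.append((chosen, true_objective(chosen)))

    return history

history = tpe_like_search()
best_x, best_loss = min(history, key=lambda item: item[1])
print(len(history), best_x, best_loss)
```

The expensive function is called only once per iteration; all the candidate screening happens on the cheap density ratio, which is the point of using a surrogate.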

If you want to dig deeper into this process, read this [excellent article](https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f) on the subject and the [SigOpt Bayesian optimization primer](https://app.sigopt.com/static/pdf/SigOpt_Bayesian_Optimization_Primer.pdf).

In [None]:
# Iteration counter, incremented inside objective_lgbm on every evaluation

ITERATION = 0
MAX_EVALS = 100

# Run optimization
best = fmin(fn = objective_lgbm, space = space_lgbm, algo = tpe.suggest, 
            max_evals = MAX_EVALS, trials = bayes_trials, rstate = np.random.RandomState(50))

  1%|          | 1/98 [01:14<1:59:51, 74.14s/it, best loss: 0.23829185473987946]

### Finding the best parameters

In [None]:
print(space_eval(space_lgbm, best))

# Or it could be done by using the trials object

# Sort the trials with lowest loss (highest AUC) first
bayes_trials_results = sorted(bayes_trials.results, key = lambda x: x['loss'])
print(bayes_trials_results[:2])

In [None]:
results = pd.read_csv('gbm_trials.csv')

# Sort with best scores on top and reset index for slicing
results.sort_values('loss', ascending = True, inplace = True)
results.reset_index(inplace = True, drop = True)
results.head()

In [None]:
# The params column is stored in the csv file as a string. Let's use the ast module to convert it back.
import ast

# Convert from a string to a dictionary
ast.literal_eval(results.loc[0, 'params'])
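
The round trip described in the comment above can be reproduced with the standard library alone. The parameter values are hypothetical and an in-memory buffer stands in for the csv file.

```python
import ast
import csv
import io

# Hypothetical parameters, similar in shape to what the objective logs
params = {'boosting_type': 'gbdt', 'num_leaves': 64,
          'learning_rate': 0.05, 'class_weight': None}

# csv.writer stores str(params), so the dictionary becomes plain text
buffer = io.StringIO()
csv.writer(buffer).writerow([0.23, params])

# Reading the row back yields strings only
buffer.seek(0)
loss_text, params_text = next(csv.reader(buffer))

# literal_eval safely parses the text back into a dictionary
restored = ast.literal_eval(params_text)
print(type(params_text).__name__, restored == params)
```

Note that ast.literal_eval only understands Python literals (numbers, strings, None, booleans, lists, dicts and so on), so a dictionary holding something else, such as a model class, would not survive this round trip.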

# Evaluating best results

Now it's time to evaluate our results. Hopefully this tuned model achieves a good AUC.


In [None]:
# Extract the ideal number of estimators and hyperparameters
best_bayes_estimators = int(results.loc[0, 'estimators'])
best_bayes_params = ast.literal_eval(results.loc[0, 'params']).copy()

# Re-create the best model and train on the training data
best_bayes_model = lgb.LGBMClassifier(n_estimators = best_bayes_estimators, n_jobs = -1, 
                                       objective = 'binary', random_state = 50, **best_bayes_params)
best_bayes_model.fit(X_train, y_train)

In [None]:
# Creating submission file

pd.DataFrame(
    {
        'SK_ID_CURR': ID_test,
        'TARGET': best_bayes_model.predict_proba(X_test)[:,1]
    }
).to_csv('submission_lgbm_tunned.csv', index = None)

# Generic Bayesian Optimisation Pipeline

In [None]:
from sklearn.model_selection import StratifiedKFold

def objective(space):
    """Objective function for Generic Model Hyperparameter Optimization"""
    
    # Keep track of evals
    global ITERATION
    
    ITERATION += 1
        
    # KFold cross-validation
    kf = StratifiedKFold(n_splits = 5)
    
    # Getting the model
    model = space['model'](**space['params'])

    # Starting
    start = timer()
    
    # Perform 5-fold cross validation on the first 1000 rows only
    roc_aucs = []
    for train_index, val_index in kf.split(X_train[:1000,:], y_train[:1000]):
        X_train_, y_train_ = X_train[train_index, :], y_train[train_index]
        X_val_, y_val_ = X_train[val_index, :], y_train[val_index]

        # Change the model here
        model.fit(X_train_, y_train_)
        roc_aucs.append(roc_auc_score(y_val_, model.predict_proba(X_val_)[:,1]))
    
    run_time = timer() - start
    
    # Average the score over the folds
    best_score = np.mean(roc_aucs)
    
    # Loss must be minimized
    loss = 1 - best_score

    # Write to the csv file ('a' means append); the with block closes the file
    with open(out_file, 'a') as of_connection:
        writer = csv.writer(of_connection)
        writer.writerow([loss, space, ITERATION, run_time])
    
    # Dictionary with information for evaluation
    return {'loss': loss, 'params': space, 'iteration': ITERATION,
            'train_time': run_time, 'status': STATUS_OK}

def optimize(objective, space, trials, MAX_EVALS = 120, output_file = 'BayesOpt.csv', random_state = 42):
    
    # Output file: declared global so that the objective function appends to this same file
    global out_file
    out_file = output_file

    # Write the headers to the file
    with open(out_file, 'w') as of_connection:
        writer = csv.writer(of_connection)
        writer.writerow(['loss', 'params', 'iteration', 'train_time'])
    
    # Global variable
    global  ITERATION

    ITERATION = 0

    # Run optimization, seeding hyperopt with the random_state argument
    best = fmin(fn = objective, space = space, algo = tpe.suggest, 
                max_evals = MAX_EVALS, trials = trials, rstate = np.random.RandomState(random_state))
    
    return space_eval(space, best)
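
The generic objective above reads space['model'] and space['params'], so the search space passed to optimize needs those two keys. The sketch below shows a single sampled point of such a space. ConstantClassifier is a hypothetical stand-in for a scikit-learn style estimator, and in a real run the values inside params would be hyperopt expressions such as hp.uniform rather than fixed numbers.

```python
class ConstantClassifier:
    """Hypothetical estimator: always predicts the smoothed positive rate seen in fit."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.positive_rate_ = None

    def fit(self, X, y):
        # Smoothed share of positive labels in the training data
        self.positive_rate_ = (sum(y) + self.smoothing) / (len(y) + 2 * self.smoothing)
        return self

    def predict_proba(self, X):
        # One [P(class 0), P(class 1)] pair per row, as scikit-learn classifiers return
        p = self.positive_rate_
        return [[1.0 - p, p] for _ in X]

# One sampled point of the space: a model class plus its keyword arguments
space_point = {'model': ConstantClassifier, 'params': {'smoothing': 1.0}}

# This mirrors how the objective builds the model
model = space_point['model'](**space_point['params'])
model.fit([[0], [1], [2], [3]], [0, 0, 0, 1])
probabilities = model.predict_proba([[4]])
print(probabilities[0][1])
```

Keeping the model class inside the space is what lets the same objective be reused for different algorithms: only the dictionary changes, not the function.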
    