Licensed under the MIT License.

Copyright (c) 2021-2025. All rights reserved.

# Optuna Default

* About Optuna
  * Optuna Github: https://github.com/optuna/optuna
  * Optuna Examples: https://github.com/optuna/optuna-examples
* Optuna params depends on models' python libraries
  * Optuna Integrated LGBM CV params: https://optuna.readthedocs.io/en/stable/reference/generated/optuna.integration.lightgbm.LightGBMTunerCV.html
  * LGBM params: https://lightgbm.readthedocs.io/en/latest/Parameters.html

In [7]:
import pandas as pd
import numpy as np
import pickle
import timeit
import matplotlib.pyplot as plt


from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import cross_val_score

import optuna.integration.lightgbm as lgb
from lightgbm import LGBMClassifier
from optuna.integration import XGBoostPruningCallback
import xgboost as xgb
import optuna

In [3]:
df30 = pd.read_csv('../../crystal_ball/data_collector/structured_data/leaf.csv')
print(df30.shape)

df30.head()

(340, 16)


Unnamed: 0,species,specimen_number,eccentricity,aspect_ratio,elongation,solidity,stochastic_convexity,isoperimetric_factor,maximal_indentation_depth,lobedness,average_intensity,average_contrast,smoothness,third_moment,uniformity,entropy
0,1,1,0.72694,1.4742,0.32396,0.98535,1.0,0.83592,0.004657,0.003947,0.04779,0.12795,0.016108,0.005232,0.000275,1.1756
1,1,2,0.74173,1.5257,0.36116,0.98152,0.99825,0.79867,0.005242,0.005002,0.02416,0.090476,0.008119,0.002708,7.5e-05,0.69659
2,1,3,0.76722,1.5725,0.38998,0.97755,1.0,0.80812,0.007457,0.010121,0.011897,0.057445,0.003289,0.000921,3.8e-05,0.44348
3,1,4,0.73797,1.4597,0.35376,0.97566,1.0,0.81697,0.006877,0.008607,0.01595,0.065491,0.004271,0.001154,6.6e-05,0.58785
4,1,5,0.82301,1.7707,0.44462,0.97698,1.0,0.75493,0.007428,0.010042,0.007938,0.045339,0.002051,0.00056,2.4e-05,0.34214


In [4]:
# Using optuna, multiclass need to adjacent classes
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
df30['species'] = le.fit_transform(df30['species'])

print(min(df30['species']), max(df30['species']))

0 29


In [5]:
# train, test split for df30
y30 = df30['species']
X30 = df30.drop('species', axis=1)

X_train30, X_test30, y_train30, y_test30 = train_test_split(X30, y30, test_size=0.2,
                                               random_state=10, shuffle=True, stratify=y30)

X_train30.reset_index(inplace=True, drop=True)
X_test30.reset_index(inplace=True, drop=True)
y_train30.reset_index(inplace=True, drop=True)
y_test30.reset_index(inplace=True, drop=True)

print(X_train30.shape, X_test30.shape, y_train30.shape, y_test30.shape)
print(y_train30.nunique(), y_test30.nunique())

(272, 15) (68, 15) (272,) (68,)
30 30


## LGBM for df30

In [34]:
dtrain = lgb.Dataset(X_train30, y_train30)

params = {
    "objective": "multiclass",
    "metric": "multi_logloss",
    "verbosity": -1,
    "num_class": 30,  # have to specify num_class for multiclassification
    "random_state": 10
}

tuner = lgb.LightGBMTunerCV(
    params, 
    dtrain,
    time_budget=300,
    verbose_eval=False,
    folds=StratifiedKFold(n_splits=5, shuffle=True, random_state=10)
)
    
tuner.run()

print("Best score:", tuner.best_score)  # multi_logloss
best_params = tuner.best_params
print("Best params:", best_params)

[32m[I 2021-08-22 22:02:45,989][0m A new study created in memory with name: no-name-71dc17c0-d0b1-4217-a0ad-4d0abc47ace8[0m
feature_fraction, val_score: 1.188037:  14%|######4                                      | 1/7 [00:10<01:05, 10.93s/it][32m[I 2021-08-22 22:02:56,932][0m Trial 0 finished with value: 1.188036985075834 and parameters: {'feature_fraction': 0.6}. Best is trial 0 with value: 1.188036985075834.[0m
feature_fraction, val_score: 1.188037:  29%|############8                                | 2/7 [00:22<00:55, 11.11s/it][32m[I 2021-08-22 22:03:08,453][0m Trial 1 finished with value: 1.1899125130468209 and parameters: {'feature_fraction': 0.7}. Best is trial 0 with value: 1.188036985075834.[0m
feature_fraction, val_score: 1.188037:  43%|###################2                         | 3/7 [00:34<00:45, 11.30s/it][32m[I 2021-08-22 22:03:20,209][0m Trial 2 finished with value: 1.1919542441263555 and parameters: {'feature_fraction': 0.8}. Best is trial 0 with value: 1.1

Best score: 1.1457880287093116
Best params: {'objective': 'multiclass', 'metric': 'multi_logloss', 'verbosity': -1, 'num_class': 30, 'random_state': 10, 'feature_pre_filter': False, 'lambda_l1': 0.0, 'lambda_l2': 0.0, 'num_leaves': 31, 'feature_fraction': 0.4, 'bagging_fraction': 1.0, 'bagging_freq': 0, 'min_child_samples': 20}





In [35]:
model = LGBMClassifier(objective='multiclass', metric='multi_logloss', num_class= 30, 
                        random_state= 10, feature_pre_filter= False, lambda_l1= 0.0, lambda_l2= 0.0, num_leaves= 31,
                        feature_fraction= 0.4, bagging_fraction= 1.0, bagging_freq= 0, min_child_samples= 20)
model.fit(X_train30, y_train30)
y_pred30 = model.predict(X_test30)

balanced_accuracy = balanced_accuracy_score(y_test30, y_pred30)
print(f'The balanced accuracy on testing data from optimized model is {balanced_accuracy}')

The balanced accuracy on testing data from optimized model is 0.7777777777777778


## Multiple Estimators for df100

* Tips of Optuna Search Algorithms (Samplers): https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/003_efficient_optimization_algorithms.html#which-sampler-and-pruner-should-be-used
* Optuna LGBM: https://optuna.readthedocs.io/en/stable/reference/generated/optuna.integration.lightgbm.LightGBMTunerCV.html
* Optuna XGBoost: https://optuna.readthedocs.io/en/stable/reference/generated/optuna.integration.XGBoostPruningCallback.html#optuna.integration.XGBoostPruningCallback
  * xgboost.cv: https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.cv

In [2]:
df100 = pd.read_csv('../../crystal_ball/data_collector/structured_data/100leaves.csv')
print(df100.shape)

df100.head()

(1599, 193)


Unnamed: 0,Mar_1,Mar_2,Mar_3,Mar_4,Mar_5,Mar_6,Mar_7,Mar_8,Mar_9,Mar_10,...,Tex_56,Tex_57,Tex_58,Tex_59,Tex_60,Tex_61,Tex_62,Tex_63,Tex_64,species_int
0,0.003906,0.003906,0.027344,0.033203,0.007812,0.017578,0.023438,0.005859,0.0,0.015625,...,0.0,0.001953,0.000977,0.022461,0.0,0.0,0.001953,0.0,0.027344,0
1,0.017578,0.011719,0.023438,0.019531,0.003906,0.011719,0.015625,0.0,0.0,0.03125,...,0.0,0.010742,0.0,0.007812,0.0,0.0,0.0,0.0,0.021484,0
2,0.009766,0.021484,0.019531,0.027344,0.003906,0.025391,0.023438,0.0,0.001953,0.023438,...,0.0,0.019531,0.0,0.003906,0.0,0.0,0.0,0.0,0.012695,0
3,0.015625,0.009766,0.025391,0.027344,0.001953,0.001953,0.011719,0.0,0.001953,0.013672,...,0.0,0.000977,0.0,0.021484,0.0,0.0,0.0,0.0,0.014648,0
4,0.017578,0.041016,0.017578,0.005859,0.003906,0.027344,0.017578,0.003906,0.0,0.017578,...,0.0,0.003906,0.0,0.012695,0.0,0.0,0.0,0.0,0.004883,0


In [3]:
# train, test split for df100
y100 = df100['species_int']
X100 = df100.drop('species_int', axis=1)

X_train100, X_test100, y_train100, y_test100 = train_test_split(X100, y100, test_size=0.2,
                                               random_state=10, shuffle=True, stratify=y100)

X_train100.reset_index(inplace=True, drop=True)
X_test100.reset_index(inplace=True, drop=True)
y_train100.reset_index(inplace=True, drop=True)
y_test100.reset_index(inplace=True, drop=True)

print(X_train100.shape, X_test100.shape, y_train100.shape, y_test100.shape)
print(y_train100.nunique(), y_test100.nunique())

(1279, 192) (320, 192) (1279,) (320,)
100 100


In [4]:
# define the objective function with multiple estimators
def objective(trial):
    classifier = trial.suggest_categorical('classifier', ['XGBoost', 'LightGBM'])
    
    if classifier == 'LightGBM':
        dtrain100 = lgb.Dataset(X_train100, y_train100)
        
        params = {
            "objective": "multiclass",
            "metric": "multi_logloss",
            "verbosity": -1,
            "num_class": 100, 
            "random_state": 10
        }

        tuner = lgb.LightGBMTunerCV(
            params, 
            dtrain100,
            time_budget=100,
            verbose_eval=False,
            folds=StratifiedKFold(n_splits=10, shuffle=True, random_state=10)
        )
        tuner.run()
        return tuner.best_score

    else:
        dtrain100 = xgb.DMatrix(X_train100, y_train100)
        
        params = {
            "objective": "multi:softmax",
            "metric": "mlogloss",
            "verbosity": 0,
            "num_class": 100, 
            "random_state": 10,
            "stratified": True,
            "nfold": 10
        }
        
        pruning_callback = optuna.integration.XGBoostPruningCallback(trial, "test-mlogloss")
        history = xgb.cv(params, dtrain100, callbacks=[pruning_callback], seed=10) 
        
        return history['test-mlogloss-mean'].values[-1]  # also return the best score

### Defualt Search Algorithm (Sampler) - TPE

In [5]:
start = timeit.default_timer()

study = optuna.create_study(direction="minimize")
print(f"Default sampler is {study.sampler.__class__.__name__}")
study.optimize(objective, n_trials=12)

print("Number of finished trials: {}".format(len(study.trials)))
print("Best trial:")
trial = study.best_trial

print("  Value: {}".format(trial.value))

print("  Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))
    

stop = timeit.default_timer()
print('Time: ', stop - start)

[32m[I 2021-08-27 23:58:27,098][0m A new study created in memory with name: no-name-60afe879-852f-49d1-bcc9-d0bc17dae895[0m


Default sampler is TPESampler


[32m[I 2021-08-27 23:58:32,290][0m Trial 0 finished with value: 1.503399 and parameters: {'classifier': 'XGBoost'}. Best is trial 0 with value: 1.503399.[0m
[32m[I 2021-08-27 23:58:32,290][0m A new study created in memory with name: no-name-edba627d-1e1f-40ce-9bf3-5c620c63e525[0m
feature_fraction, val_score: 0.566254:  14%|######2                                     | 1/7 [06:11<37:08, 371.42s/it][32m[I 2021-08-28 00:04:43,744][0m Trial 0 finished with value: 0.5662536860145171 and parameters: {'feature_fraction': 1.0}. Best is trial 0 with value: 0.5662536860145171.[0m
feature_fraction, val_score: 0.566254:  14%|######2                                     | 1/7 [06:11<37:08, 371.44s/it]
  0%|                                                                                           | 0/20 [00:00<?, ?it/s]
  0%|                                                                                           | 0/10 [00:00<?, ?it/s]
  0%|                                                 

Number of finished trials: 12
Best trial:
  Value: 0.43121294760070084
  Params: 
    classifier: LightGBM
Time:  4178.8889868


In [9]:
model = LGBMClassifier(objective='multiclass', metric='multi_logloss', num_class= 100, 
                        random_state= 10)
model.fit(X_train100, y_train100)
y_pred100 = model.predict(X_test100)

balanced_accuracy = balanced_accuracy_score(y_test100, y_pred100)
print(f'The balanced accuracy on testing data from selected model is {balanced_accuracy}')

The balanced accuracy on testing data from selected model is 0.8916666666666667


### Random Search Sampler

In [11]:
sampler = optuna.samplers.RandomSampler()

start = timeit.default_timer()

study = optuna.create_study(direction="minimize", sampler=sampler)
print(f"Default sampler is {study.sampler.__class__.__name__}")
study.optimize(objective, n_trials=12)

print("Number of finished trials: {}".format(len(study.trials)))
print("Best trial:")
trial = study.best_trial

print("  Value: {}".format(trial.value))

print("  Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))
    

stop = timeit.default_timer()
print('Time: ', stop - start)

[32m[I 2021-08-28 08:59:25,803][0m A new study created in memory with name: no-name-8c1b054b-03ec-492a-9ec9-1ea88f0efc06[0m


Default sampler is RandomSampler


[32m[I 2021-08-28 08:59:34,404][0m Trial 0 finished with value: 1.503399 and parameters: {'classifier': 'XGBoost'}. Best is trial 0 with value: 1.503399.[0m
[32m[I 2021-08-28 08:59:42,879][0m Trial 1 finished with value: 1.503399 and parameters: {'classifier': 'XGBoost'}. Best is trial 0 with value: 1.503399.[0m
[32m[I 2021-08-28 08:59:42,882][0m A new study created in memory with name: no-name-4d2ea513-4449-442b-8daf-6c89290ae25f[0m
feature_fraction, val_score: 0.447473:  14%|######2                                     | 1/7 [08:47<52:44, 527.48s/it][32m[I 2021-08-28 09:08:30,372][0m Trial 0 finished with value: 0.44747315832570217 and parameters: {'feature_fraction': 0.6}. Best is trial 0 with value: 0.44747315832570217.[0m
feature_fraction, val_score: 0.447473:  14%|######2                                     | 1/7 [08:47<52:44, 527.48s/it]
  0%|                                                                                           | 0/20 [00:00<?, ?it/s]
  0%|       

Number of finished trials: 12
Best trial:
  Value: 0.38439777744004855
  Params: 
    classifier: LightGBM
Time:  3886.280057
