## Table of Contents

This notebook provides basic examples of how to use Optuna for hyperparameter tuning. The following sections walk through the procedure step by step:

1. [Defining the optimization problem: search space and objective](#1-defining-the-optimization-problem-search-space-and-objective)  
2. [First touch with Optuna for optimization](#2-first-touch-with-optuna-for-optimization)
3. [Analyzing the optimization results](#3-analyzing-the-optimization-results)
4. [Setting up baselines with enqueue trials](#4-setting-up-baselines-with-enqueue-trials)
5. [Use of multivariate samplers](#5-use-of-multivariate-samplers)

## Imports

In [1]:
from pathlib import Path
import sys
sys.path.insert(0, str(Path.cwd().parent))  # adjust .parent depth so 'src' is findable

In [2]:
import os
import optuna
import pandas as pd

from src.train_utils import retrieve_data_w_features

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error
import plotly


## Options

In [3]:
path_data = "../data/01_raw"

## Dataset

In [4]:
df = pd.read_parquet(os.path.join(path_data, "fremotor1prem0304.parquet"))
cols_to_drop = ["IDpol", "Year", "train_set", "val_set", "test_set", "big_train_set"]

X_big_train, y_big_train = retrieve_data_w_features(df=df, features_to_drop=cols_to_drop, split="big_train_set")
X_train, y_train = retrieve_data_w_features(df=df, features_to_drop=cols_to_drop, split="train_set")
X_val, y_val = retrieve_data_w_features(df=df, features_to_drop=cols_to_drop, split="val_set")
X_test, y_test = retrieve_data_w_features(df=df, features_to_drop=cols_to_drop, split="test_set")

## 1. Defining the optimization problem: search space and objective

To use Optuna, two things must be defined:

1. The sample space from which the hyperparameters will be sampled
2. The objective function, which scores each candidate solution so that it can be ranked as better or worse than the others

For this example, a simple `RandomForestRegressor` will be used on the dataset.

For the sample space, the following hyperparameters will be optimized:
- `max_depth`: discrete (int).
- `n_estimators`: discrete (int).
- `min_samples_split`: continuous (float); it can be discrete (int) too. As a float, it is read as a fraction of the training set, so a split requires at least `ceil(min_samples_split * n_samples)` samples.
- `max_features`: categorical (fixed categories).
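
To make the fractional `min_samples_split` concrete, the sketch below applies the `ceil(min_samples_split * n_samples)` rule to the two ends of the range used in this notebook. The helper name and the training-set size of 10,000 rows are hypothetical, chosen only to keep the arithmetic readable.

```python
import math


def min_samples_from_fraction(fraction: float, n_samples: int) -> int:
    """Number of samples a node needs before it may be split, given a fractional setting."""
    return math.ceil(fraction * n_samples)


n_samples = 10_000  # hypothetical training-set size, not the real one
for fraction in (0.001, 0.05):
    print(fraction, min_samples_from_fraction(fraction, n_samples))
```

Larger fractions therefore stop splitting earlier and yield shallower, more regularized trees.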

🎯 The objective will be to `minimize` the validation `RMSE` 
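
For reference, RMSE is the square root of the mean squared difference between targets and predictions. The plain-Python sketch below spells out that formula on made-up numbers; the notebook itself relies on scikit-learn's `root_mean_squared_error`.

```python
import math


def rmse(y_true: list[float], y_pred: list[float]) -> float:
    """Root mean squared error of two equally long sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values only: errors of 10, -10 and 30.
y_true_demo = [100.0, 200.0, 300.0]
y_pred_demo = [110.0, 190.0, 330.0]
print(rmse(y_true_demo, y_pred_demo))
```

Because errors are squared before averaging, the single error of 30 dominates the result, which is why RMSE is sensitive to large misses.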

In [53]:
X_big_train.select_dtypes(include=["float64", "float32", "int64", "int32"]).columns

Index(['DrivAge', 'BonusMalus', 'LicenceNb', 'VehAge'], dtype='object')

In [51]:
X_big_train.select_dtypes(include=["object", "category"]).columns

Index(['DrivGender', 'MaritalStatus', 'PayFreq', 'JobCode', 'VehClass',
       'VehPower', 'VehGas', 'VehUsage', 'Garage', 'Area', 'Region', 'Channel',
       'Marketing'],
      dtype='object')

In [41]:
X_big_train.dtypes

DrivAge           float64
DrivGender       category
MaritalStatus    category
BonusMalus        float64
LicenceNb         float64
PayFreq          category
JobCode          category
VehAge            float64
VehClass         category
VehPower         category
VehGas           category
VehUsage         category
Garage           category
Area             category
Region           category
Channel          category
Marketing        category
dtype: object

In [12]:
def training_objective(trial: optuna.trial.Trial) -> float:
    # Search space: one suggest_* call per hyperparameter
    max_depth = trial.suggest_int("max_depth", 1, 5)
    n_estimators = trial.suggest_int("n_estimators", 50, 300)
    min_samples_split = trial.suggest_float("min_samples_split", 0.001, 0.05)
    max_features = trial.suggest_categorical("max_features", ["log2", "sqrt"])
    model = RandomForestRegressor(
        max_depth=max_depth,
        n_estimators=n_estimators,
        min_samples_split=min_samples_split,
        max_features=max_features,
        random_state=42,
        n_jobs=-1,
    )
    # Assumes X_train / X_val are fully numeric; the category columns listed
    # above need encoding first (see the pipeline in section 2).
    model.fit(X_train, y_train)
    val_predictions = model.predict(X_val)
    # Objective: validation RMSE, to be minimized
    return root_mean_squared_error(y_true=y_val, y_pred=val_predictions)

## 2. First touch with Optuna for optimization

In [6]:
categorical_features = [
    'DrivGender', 'MaritalStatus', 'PayFreq', 'JobCode', 'VehClass',
    'VehPower', 'VehGas', 'VehUsage', 'Garage', 'Area', 'Region',
    'Channel', 'Marketing'
]
numeric_features = ['DrivAge', 'BonusMalus', 'LicenceNb', 'VehAge']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='unk')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

numeric_transformer = 'passthrough'

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features),
        ('num', numeric_transformer, numeric_features)
    ]
)
preprocessor


In [20]:
def training_objective(trial: optuna.trial.Trial) -> float:
    # Suggest hyperparameters
    # max_depth = trial.suggest_int("max_depth", 1, 6)
    n_estimators = trial.suggest_int("n_estimators", 50, 300)
    min_samples_split = trial.suggest_float("min_samples_split", 0.001, 0.05)
    max_features = trial.suggest_categorical("max_features", ["log2", "sqrt"])

    # Define model
    model = RandomForestRegressor(
        # max_depth=max_depth,
        n_estimators=n_estimators,
        min_samples_split=min_samples_split,
        max_features=max_features,
        random_state=42,
        n_jobs=-1,
    )

    # Build full pipeline
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', model)
    ])

    # Fit and evaluate
    pipeline.fit(X_train, y_train)
    val_predictions = pipeline.predict(X_val)
    return root_mean_squared_error(y_true=y_val, y_pred=val_predictions)

In [21]:
from optuna.samplers import TPESampler

study = optuna.create_study(
    study_name="basic_rf_opt",
    direction="minimize",
    sampler=TPESampler(seed=42, n_startup_trials=10),
)
study.optimize(training_objective, n_trials=100)

[I 2025-10-17 17:25:34,237] A new study created in memory with name: basic_rf_opt
[I 2025-10-17 17:25:34,478] Trial 0 finished with value: 157.81148667635904 and parameters: {'n_estimators': 144, 'min_samples_split': 0.047585001014085894, 'max_features': 'log2'}. Best is trial 0 with value: 157.81148667635904.
[I 2025-10-17 17:25:34,694] Trial 1 finished with value: 137.12248283927266 and parameters: {'n_estimators': 89, 'min_samples_split': 0.00864373149647393, 'max_features': 'sqrt'}. Best is trial 1 with value: 137.12248283927266.
[I 2025-10-17 17:25:34,984] Trial 2 finished with value: 150.3841166251167 and parameters: {'n_estimators': 200, 'min_samples_split': 0.03569555631200623, 'max_features': 'sqrt'}. Best is trial 1 with value: 137.12248283927266.
[I 2025-10-17 17:25:35,420] Trial 3 finished with value: 138.53487115284938 and parameters: {'n_estimators': 258, 'min_samples_split': 0.011404616423235533, 'max_features': 'sqrt'}. Best is trial 1 with value: 137.12248283927266.
[I

In [22]:
# Train final model with best hyperparameters
best_params = study.best_params
final_model = RandomForestRegressor(**best_params, random_state=42, n_jobs=-1)
final_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', final_model)
])
final_pipeline.fit(X_big_train, y_big_train)
test_predictions = final_pipeline.predict(X_test)
big_train_predictions = final_pipeline.predict(X_big_train)
big_train_rmse = root_mean_squared_error(y_true=y_big_train, y_pred=big_train_predictions)
print(f"Big Train RMSE: {big_train_rmse}")
test_rmse = root_mean_squared_error(y_true=y_test, y_pred=test_predictions)
print(f"Test RMSE: {test_rmse}")

Big Train RMSE: 98.8327925597008
Test RMSE: 117.9205853197179


In [23]:
optuna.visualization.plot_optimization_history(study=study, target_name="Validation RMSE")

In [24]:
optuna.visualization.plot_param_importances(study=study)

## 3. Analyzing the optimization results

## 4. Setting up baselines with enqueue trials

## 5. Use of multivariate samplers