# Assignment

**Goal**: Use the packages presented earlier (pydantic, pandera and optuna) in a realistic scenario: training a model and finding its best hyperparameters.

Using an open-source insurance dataset, we will go through the following steps:
* Validate the dataset with pandera, making sure it is ready to train a model.
* Validate the input parameters of the Bayesian optimization with pydantic.

**Table of contents**:
1. [Validate the dataset using pandera](#1-validate-the-dataset-using-pandera)
2. [Validate the parameters using pydantic](#2-validate-the-parameters-using-pydantic)
3. [Bonus](#3-bonus)

# Imports

In [1]:
from pathlib import Path
import sys
sys.path.insert(0, str(Path.cwd().parent))  # adjust .parent depth so 'src' is findable

## Packages

In [2]:
from typing import Any

import pandas as pd
import pandera.pandas as pa
from sklearn.metrics import root_mean_squared_error
from pydantic import (
    BaseModel,
    Field,
    StrictStr,
    field_validator,
)

In [3]:
from src.train_utils import (
    load_conf_parameters,
    retrieve_data_w_features,
    run_bayesian_optimization,
)


## Options

In [4]:
DATA_PATH = "../data/01_raw"

In [5]:
pd.options.display.max_columns = 150

## Dataset

In [6]:
df = pd.read_parquet(f"{DATA_PATH}/fremotor1prem0304.parquet")

In [7]:
df.head(5)

Unnamed: 0,IDpol,Year,DrivAge,DrivGender,MaritalStatus,BonusMalus,LicenceNb,PayFreq,JobCode,VehAge,VehClass,VehPower,VehGas,VehUsage,Garage,Area,Region,Channel,Marketing,PremTot,test_set,val_set,big_train_set,train_set
0,1000111.1,2003.0,44.0,F,Cohabiting,50.0,3.0,Half-yearly,Private employee,10.0,Cheaper,P10,Regular,Private+trip to office,Closed zbox,A2,Headquarters,A,M1,144.1,1,0,0,0
1,1000113.1,2003.0,26.0,F,Cohabiting,85.0,2.0,Annual,Other,8.0,Cheapest,P8,Regular,Private+trip to office,Opened collective parking,A7,Headquarters,A,M2,215.3,1,0,0,0
2,1000113.1,2003.0,27.0,F,Cohabiting,106.0,2.0,Half-yearly,Other,6.0,Cheaper,P11,Regular,Private+trip to office,Opened collective parking,A7,Headquarters,A,M2,611.6,1,0,0,0
3,1000173.1,2003.0,52.0,M,Cohabiting,50.0,2.0,Half-yearly,Private employee,2.0,Cheaper,P11,Regular,Private+trip to office,Closed zbox,A7,Headquarters,A,M1,415.2,0,0,1,1
4,1000173.101,2003.0,52.0,M,Cohabiting,50.0,2.0,Half-yearly,Private employee,1.0,Cheap,P13,Regular,Private+trip to office,Closed collective parking,A7,Headquarters,A,M3,487.8,0,0,1,1


# 1. Validate the dataset using pandera

**Goal**: The goal of this section is to validate the dataframe we will use to train our model.

We want to validate the dataframe before training our model. Your goal is to make sure the columns satisfy the following rules:
* **Year**: The year is between 2003 and 2004.
* **DrivAge**: The driver's age is plausible (e.g. between 18 and 100).
* **DrivGender**: The gender is either 'M' or 'F'.
* **MaritalStatus**: Possible values are "Cohabiting", "Married", "Single", "Widowed" or "Divorced".
* **BonusMalus**: The value of the bonus / malus is at least 50.
* **LicenceNb**: The number of licences is at least 1.
* **JobCode**: Possible values are "Private employee", "Public employee", "Retiree", "Other", "Craftsman", "Farmer" or "Retailer".
* **VehAge**: The vehicle age is plausible (e.g. not negative).
* **VehGas**: Either "Regular" or "Diesel".
* **Area**: Possible values range from A1 to A12 inclusive.

**Exercise**: Update the schema to check the rules defined above.

In [8]:
######################
### YOUR CODE HERE ###
######################

However, as you can see, some features contain NaN values. There are several ways to handle this:
* In the pandera schema, allow NaN values on the affected columns (`nullable=True`). We don't recommend this approach: many models can't handle NaN values, and when they can, you want to be sure the missing values are used deliberately.
* Set the option in the pandera schema that drops rows with NaN values.
* Handle it with classical feature engineering techniques (imputing values, creating a new category, etc.).
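The third option can be sketched with plain pandas on a toy dataframe. The columns `job` and `status` and the `"Unknown"` label are placeholders chosen for this illustration, not names from the dataset.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "job": ["Farmer", None, "Farmer", "Retiree"],
        "status": ["Single", "Married", None, "Single"],
    }
)

# Option A: impute with the most frequent category (the mode).
# mode() returns a Series because ties are possible; [0] takes the first value.
imputed = toy.assign(job=toy["job"].fillna(toy["job"].mode()[0]))

# Option B: keep track of missingness with a dedicated category.
flagged = toy.assign(status=toy["status"].fillna("Unknown"))

print(imputed["job"].tolist())
print(flagged["status"].tolist())
```

If you go for a dedicated category, remember that the new label must also be part of the allowed values in the schema, otherwise the `isin` check will reject it.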

**Exercise**: Create a function to handle missing values.

<details>

<summary>Click to reveal tip</summary>

Use `mode()[0]` on a column to get its most frequent category.

</details>

In [9]:
######################
### YOUR CODE HERE ###
######################

To train the model, it is important to split the dataset into train, validation and test sets. You want to make sure each row is assigned to one and only one of these sets. To do so, you can use custom checks from pandera.

**Exercise**: Adapt the schema to check that each row is in one and only one set.
1. Add the right checks for the columns `train_set`, `val_set` and `test_set`.
2. Add a schema check to make sure that each row is assigned to one and only one set.
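The row-level condition behind the second step can be explored with plain pandas first. This is only a sketch on made-up flags; the resulting boolean expression is the kind of function you can then wrap in a dataframe-level check.

```python
import pandas as pd

splits = pd.DataFrame(
    {
        "train_set": [1, 0, 0, 1],
        "val_set": [0, 1, 0, 1],
        "test_set": [0, 0, 0, 0],
    }
)

# With 0/1 flags, a row belongs to exactly one set when its flags sum to 1.
flags_per_row = splits[["train_set", "val_set", "test_set"]].sum(axis=1)
exactly_one = flags_per_row == 1

print(exactly_one.tolist())
# Index of the rows that are in no set or in several sets at once.
offending_rows = splits.index[~exactly_one].tolist()
print(offending_rows)
```

Row 2 belongs to no set and row 3 belongs to two sets, so both are flagged. Note that the sum trick is only reliable once each flag column has itself been checked to contain only 0 or 1, which is the purpose of the first step.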

<details>

<summary>Click to reveal tip</summary>

Use the `checks` parameter of the schema directly.

</details>

In [10]:
######################
### YOUR CODE HERE ###
######################

We also have the `big_train_set` column which indicates rows contained in the train or validation sets.

**Exercise**: Adapt the schema to check that rows in either train or validation sets are in the big train set.
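As a small sketch on made-up flags, the consistency rule can be expressed by rebuilding the expected `big_train_set` value from the two other columns and comparing it with the stored one.

```python
import pandas as pd

flags = pd.DataFrame(
    {
        "train_set": [1, 0, 0, 0],
        "val_set": [0, 1, 0, 0],
        "test_set": [0, 0, 1, 1],
        "big_train_set": [1, 1, 0, 1],
    }
)

# Expected value: 1 when the row is in the train or the validation set, 0 otherwise.
expected_big_train = ((flags["train_set"] == 1) | (flags["val_set"] == 1)).astype(int)
consistent = flags["big_train_set"] == expected_big_train

print(consistent.tolist())
```

The last row is a test row wrongly marked as part of the big train set, so it is the only inconsistent one.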

In [11]:
######################
### YOUR CODE HERE ###
######################

In [12]:
# Solution
def fill_missing_values(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values.

    Fill the missing values in the 'JobCode' and 'MaritalStatus' columns with the mode
    (most frequent value) of their respective columns. The input DataFrame is modified
    in place.

    Args:
        df (pd.DataFrame): Input DataFrame with potential missing values.

    Returns:
        (pd.DataFrame): The same DataFrame with the missing values filled.
    """
    df["JobCode"] = df["JobCode"].fillna(df["JobCode"].mode()[0])
    df["MaritalStatus"] = df["MaritalStatus"].fillna(df["MaritalStatus"].mode()[0])
    return df

schema_df = pa.DataFrameSchema(
    {
        "IDpol": pa.Column(str),
        "PayFreq": pa.Column(str, checks=pa.Check.isin(["Annual", "Half-yearly", "Quarterly", "Monthly"])),
        "VehClass": pa.Column(str, checks=pa.Check.isin([
            "Cheapest", "Cheaper", "Cheap", "Medium low", "Medium", "Medium high", "Expensive", "More expensive", "Most expensive",
        ])),
        "VehPower": pa.Column(str, checks=pa.Check.isin([f"P{i}" for i in range(1, 21)])),
        "VehUsage": pa.Column(str, checks=pa.Check.isin([
            "Private+trip to office", "Professional", "Professional run",
        ])),
        "Garage": pa.Column(str, checks=pa.Check.isin([
            "Closed zbox", "Closed collective parking", "Opened collective parking", "Street",
        ])),
        ######################
        ### YOUR CODE HERE ###
        ######################
        "Year": pa.Column(int, checks=[pa.Check.ge(2003), pa.Check.le(2004)], coerce=True),
        "DrivAge": pa.Column(int, checks=[pa.Check.ge(18), pa.Check.le(100)], coerce=True),
        "DrivGender": pa.Column(str, checks=pa.Check.isin(["M", "F"])),
        "MaritalStatus": pa.Column(str, checks=pa.Check.isin([
            "Cohabiting", "Married", "Single", "Widowed", "Divorced",
        ])),
        "BonusMalus": pa.Column(int, checks=pa.Check.ge(50), coerce=True),
        "LicenceNb": pa.Column(int, checks=pa.Check.ge(1), coerce=True),
        "JobCode": pa.Column(str, checks=pa.Check.isin([
            "Private employee", "Public employee", "Retiree", "Other", "Craftsman", "Farmer", "Retailer", "Unknown",
        ])),
        "VehAge": pa.Column(int, checks=pa.Check.ge(0), coerce=True),
        "VehGas": pa.Column(str, checks=pa.Check.isin(["Regular", "Diesel"])),
        "Area": pa.Column(str, checks=pa.Check.isin([f"A{i}" for i in range(1, 13)])),
        # Sets for the model
        "train_set": pa.Column(int, checks=pa.Check.isin([0, 1])),
        "val_set": pa.Column(int, checks=pa.Check.isin([0, 1])),
        "test_set": pa.Column(int, checks=pa.Check.isin([0, 1])),
        "big_train_set": pa.Column(int, checks=pa.Check.isin([0, 1])),
    },
    checks=[
        pa.Check(
            lambda df: (df[["train_set", "val_set", "test_set"]].sum(axis=1) == 1).all(),
            error="Exactly one of train_set, val_set and test_set must be 1 for each row.",
        ),
        pa.Check(
            lambda df: (
                (df["big_train_set"] == ((df["train_set"] == 1) | (df["val_set"] == 1)).astype(int))
            ).all(),
            error="big_train_set must be 1 if either train_set or val_set is 1, and 0 otherwise.",
        ),
    ]
)

df_filled = fill_missing_values(df)

df_validated = schema_df.validate(df_filled)

# 2. Validate the parameters using pydantic

**Goal**: The goal of this section is to show a realistic use of pydantic validation: checking that the parameters passed to your code are ones you accept.
The validation will be applied to the parameters used for the optuna optimization.

**Exercise**: Create a pydantic class that validates the parameters given in a conf file, so that rule violations are caught before your code runs:
1. Define the attribute `search_space` which represents the search space for the Bayesian optimization.
2. Define the attribute `categorical_feat` which represents the categorical features.
3. Define the attribute `default_params` which represents the default combinations of hyperparameters to test at the beginning of the Bayesian optimization.
4. Create a custom validator to validate the `search_space`. The search space should be a dict where the keys are parameter names and the values are dicts containing:
   * `sampling_type`: The type of the parameter. Should be `categorical`, `int` or `float`.
   * If the sampling type is categorical, an element `choices` containing the possible values.
   * If the sampling type is either int or float, elements `min` and `max` containing the minimum and maximum values for the search space.

For example:
```yaml
search_space: {
    "param_1": {
        "sampling_type": "categorical",
        "choices": ["cat_1", "cat_2"],
    },
    "param_2":{
        "sampling_type": "int",
        "min": min_val,
        "max": max_val,
    },
}
```
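To see why these three pieces of information are needed, here is a hedged sketch of how one entry of the search space could be turned into a sampled value. An optuna trial exposes `suggest_categorical`, `suggest_int` and `suggest_float`; the `StubTrial` class below is a hypothetical stand-in with the same method names so that the sketch runs on its own, and `sample_params` is an illustrative helper, not the `build_search_space` function from `src/train_utils.py`.

```python
from typing import Any


class StubTrial:
    """Hypothetical stand-in for an optuna trial.

    It always returns the first choice or the lower bound instead of sampling.
    """

    def suggest_categorical(self, name: str, choices: list[Any]) -> Any:
        return choices[0]

    def suggest_int(self, name: str, low: int, high: int) -> int:
        return low

    def suggest_float(self, name: str, low: float, high: float) -> float:
        return float(low)


def sample_params(trial: Any, search_space: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Draw one value per hyperparameter according to its sampling_type."""
    sampled: dict[str, Any] = {}
    for name, spec in search_space.items():
        kind = spec["sampling_type"]
        if kind == "categorical":
            sampled[name] = trial.suggest_categorical(name, spec["choices"])
        elif kind == "int":
            sampled[name] = trial.suggest_int(name, spec["min"], spec["max"])
        elif kind == "float":
            sampled[name] = trial.suggest_float(name, spec["min"], spec["max"])
        else:
            raise ValueError(f"Unknown sampling_type for {name}: {kind!r}")
    return sampled


space = {
    "param_1": {"sampling_type": "categorical", "choices": ["cat_1", "cat_2"]},
    "param_2": {"sampling_type": "int", "min": 10, "max": 100},
}
sampled_params = sample_params(StubTrial(), space)
print(sampled_params)
```

A missing `choices`, `min` or `max` key would only surface as an error in the middle of the optimization, which is exactly what validating the conf file up front avoids.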

In [13]:
# Solution
class validate_input_parameters(BaseModel):
    """A pydantic class to validate the input parameters of the process."""
    target_name: StrictStr = Field(
        pattern=r"^[A-Za-z0-9_]+$",
        description="Name of the target column.",
        frozen=True,
    )
    # ========================
    # ==== YOUR CODE HERE ====
    # ========================
    search_space: dict[str, dict[StrictStr, Any]] = Field(
        description="Search space for the Bayesian optimization.",
        frozen=True,
    )
    categorical_feat: list[StrictStr] = Field(
        description="List of the categorical features",
        frozen=True,
    )
    default_params: list[dict[StrictStr, Any]] = Field(
        description="List of combination of default parameters to check at the beginning of the bayesian optimization.",
        frozen=True,
    )

    @field_validator("search_space")
    @classmethod
    def validate_search_space(cls, value: dict[str, Any]) -> dict[str, Any]:
        """Validate the search space parameter for the Bayesian optimization."""
        for hyparam_name, sampling_params in value.items():
            # .get() avoids a bare KeyError when 'sampling_type' is missing.
            sampling_type = sampling_params.get("sampling_type")
            if sampling_type == "categorical":
                # For categorical parameters
                if "choices" not in sampling_params:
                    # Raise ValueError so that pydantic reports a ValidationError.
                    raise ValueError(
                        f"Check {hyparam_name}. For a categorical parameter, you should provide a 'choices' parameter for the possible values."
                    )
            elif sampling_type in ["float", "int"]:
                # For numerical parameters
                if "min" not in sampling_params or "max" not in sampling_params:
                    raise ValueError(f"Check {hyparam_name}. For numerical parameters, you should provide 'min' and 'max' parameters.")
            else:
                raise ValueError(f"Check {hyparam_name}. Sampling type should be either 'categorical', 'int' or 'float'. You provided {sampling_type}.")
        return value

In [14]:
params = load_conf_parameters("../conf/parameters.yml")

validated_params = validate_input_parameters(**params)

As you can see, some parameters don't follow the rules you defined in the pydantic class!

**Exercise**: Fix the parameters in the conf file so that they follow the rules you defined.
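To find out which parameters are at fault, it can help to catch the error and list the failing fields. The sketch below uses a reduced, hypothetical model (`ToyParams`) and a made-up faulty configuration; it assumes pydantic v2, as in the imports at the top of the notebook.

```python
from typing import Any

from pydantic import BaseModel, ValidationError, field_validator


class ToyParams(BaseModel):
    """Hypothetical, reduced version of the parameter model."""

    search_space: dict[str, dict[str, Any]]

    @field_validator("search_space")
    @classmethod
    def check_sampling_type(cls, value: dict[str, Any]) -> dict[str, Any]:
        for name, spec in value.items():
            if spec.get("sampling_type") not in {"categorical", "int", "float"}:
                raise ValueError(f"Check {name}: unsupported sampling_type.")
        return value


# Made-up faulty configuration: 'integer' is not an accepted sampling type.
bad_conf = {"search_space": {"max_iter": {"sampling_type": "integer", "min": 10, "max": 100}}}

try:
    ToyParams(**bad_conf)
    error_fields = []
except ValidationError as err:
    # Each entry describes one failing field: its location and a message.
    error_fields = [error["loc"] for error in err.errors()]
    for error in err.errors():
        print(error["loc"], "->", error["msg"])
```

The `loc` entry points to the attribute to fix in the conf file, and the message comes from the `ValueError` raised in the validator.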

In [15]:
#####################
### ACTIONS TO DO ###
#####################

Then, we can run the Bayesian optimization:

In [16]:
cols_to_drop = ["IDpol", "Year", "train_set", "val_set", "test_set", "big_train_set"]

X_big_train, y_big_train = retrieve_data_w_features(df=df, features_to_drop=cols_to_drop, split="big_train_set")
X_train, y_train = retrieve_data_w_features(df=df, features_to_drop=cols_to_drop, split="train_set")
X_val, y_val = retrieve_data_w_features(df=df, features_to_drop=cols_to_drop, split="val_set")
X_test, y_test = retrieve_data_w_features(df=df, features_to_drop=cols_to_drop, split="test_set")

best_params = run_bayesian_optimization(
    df_train=X_train,
    y_train=y_train,
    df_val=X_val,
    y_val=y_val,
    categorical_features=validated_params.categorical_feat,
    search_params=validated_params.search_space,
    default_params_list=validated_params.default_params,
)

best_params

[I 2025-11-01 22:20:52,407] A new study created in memory with name: basic_hgb_opt
Best trial: 0. Best value: 107.462:   2%|▏         | 1/50 [00:00<00:25,  1.91it/s]

[I 2025-11-01 22:20:52,939] Trial 0 finished with value: 107.46195150088059 and parameters: {'max_iter': 50, 'learning_rate': 0.1, 'l2_regularization': 1.0}. Best is trial 0 with value: 107.46195150088059.


Best trial: 0. Best value: 107.462:   4%|▍         | 2/50 [00:00<00:17,  2.71it/s]

[I 2025-11-01 22:20:53,200] Trial 1 finished with value: 112.03645313813234 and parameters: {'max_iter': 44, 'learning_rate': 0.4758500101408589, 'l2_regularization': 7.319939418114051}. Best is trial 0 with value: 107.46195150088059.


Best trial: 0. Best value: 107.462:   8%|▊         | 4/50 [00:01<00:15,  2.90it/s]

[I 2025-11-01 22:20:53,776] Trial 2 finished with value: 107.6255301759693 and parameters: {'max_iter': 64, 'learning_rate': 0.0864491338167939, 'l2_regularization': 1.5599452033620265}. Best is trial 0 with value: 107.46195150088059.
[I 2025-11-01 22:20:53,940] Trial 3 finished with value: 110.92010670995604 and parameters: {'max_iter': 15, 'learning_rate': 0.4344263114297182, 'l2_regularization': 6.011150117432088}. Best is trial 0 with value: 107.46195150088059.


Best trial: 0. Best value: 107.462:  10%|█         | 5/50 [00:02<00:19,  2.31it/s]

[I 2025-11-01 22:20:54,531] Trial 4 finished with value: 129.94096366108002 and parameters: {'max_iter': 74, 'learning_rate': 0.020086402204943198, 'l2_regularization': 9.699098521619943}. Best is trial 0 with value: 107.46195150088059.


Best trial: 5. Best value: 106.846:  12%|█▏        | 6/50 [00:02<00:22,  1.91it/s]

[I 2025-11-01 22:20:55,225] Trial 5 finished with value: 106.84609860409188 and parameters: {'max_iter': 85, 'learning_rate': 0.11404616423235532, 'l2_regularization': 1.8182496720710062}. Best is trial 5 with value: 106.84609860409188.


Best trial: 5. Best value: 106.846:  14%|█▍        | 7/50 [00:03<00:18,  2.30it/s]

[I 2025-11-01 22:20:55,482] Trial 6 finished with value: 109.50634730773871 and parameters: {'max_iter': 26, 'learning_rate': 0.1590786990501735, 'l2_regularization': 5.247564316322379}. Best is trial 5 with value: 106.84609860409188.


Best trial: 5. Best value: 106.846:  16%|█▌        | 8/50 [00:03<00:18,  2.31it/s]

[I 2025-11-01 22:20:55,910] Trial 7 finished with value: 107.45742830522933 and parameters: {'max_iter': 49, 'learning_rate': 0.15270227869704053, 'l2_regularization': 6.118528947223795}. Best is trial 5 with value: 106.84609860409188.


Best trial: 5. Best value: 106.846:  18%|█▊        | 9/50 [00:03<00:15,  2.69it/s]

[I 2025-11-01 22:20:56,148] Trial 8 finished with value: 111.21175632629901 and parameters: {'max_iter': 22, 'learning_rate': 0.1531508777822569, 'l2_regularization': 3.663618432936917}. Best is trial 5 with value: 106.84609860409188.


Best trial: 5. Best value: 106.846:  20%|██        | 10/50 [00:04<00:13,  2.93it/s]

[I 2025-11-01 22:20:56,420] Trial 9 finished with value: 109.83175380780865 and parameters: {'max_iter': 51, 'learning_rate': 0.39473622108257667, 'l2_regularization': 1.9967378215835974}. Best is trial 5 with value: 106.84609860409188.


Best trial: 5. Best value: 106.846:  22%|██▏       | 11/50 [00:04<00:14,  2.75it/s]

[I 2025-11-01 22:20:56,836] Trial 10 finished with value: 108.86838007340613 and parameters: {'max_iter': 88, 'learning_rate': 0.2666512384238821, 'l2_regularization': 1.9100360148552773}. Best is trial 5 with value: 106.84609860409188.


Best trial: 5. Best value: 106.846:  24%|██▍       | 12/50 [00:04<00:14,  2.54it/s]

[I 2025-11-01 22:20:57,300] Trial 11 finished with value: 107.43423931079255 and parameters: {'max_iter': 69, 'learning_rate': 0.20890065115787296, 'l2_regularization': 5.1301740183834}. Best is trial 5 with value: 106.84609860409188.


Best trial: 5. Best value: 106.846:  26%|██▌       | 13/50 [00:05<00:18,  2.00it/s]

[I 2025-11-01 22:20:58,041] Trial 12 finished with value: 107.13904459798019 and parameters: {'max_iter': 87, 'learning_rate': 0.06139327730628868, 'l2_regularization': 4.328178827638933}. Best is trial 5 with value: 106.84609860409188.


Best trial: 5. Best value: 106.846:  28%|██▊       | 14/50 [00:06<00:21,  1.68it/s]

[I 2025-11-01 22:20:58,859] Trial 13 finished with value: 116.76261914535365 and parameters: {'max_iter': 92, 'learning_rate': 0.025307360781797038, 'l2_regularization': 3.5558401810541027}. Best is trial 5 with value: 106.84609860409188.


Best trial: 5. Best value: 106.846:  30%|███       | 15/50 [00:06<00:18,  1.90it/s]

[I 2025-11-01 22:20:59,228] Trial 14 finished with value: 108.80929553974754 and parameters: {'max_iter': 99, 'learning_rate': 0.2831704031116692, 'l2_regularization': 6.337285824550085}. Best is trial 5 with value: 106.84609860409188.


Best trial: 5. Best value: 106.846:  32%|███▏      | 16/50 [00:07<00:17,  1.93it/s]

[I 2025-11-01 22:20:59,722] Trial 15 finished with value: 107.9370903554084 and parameters: {'max_iter': 94, 'learning_rate': 0.14975678508771134, 'l2_regularization': 2.140303555110155}. Best is trial 5 with value: 106.84609860409188.


Best trial: 5. Best value: 106.846:  34%|███▍      | 17/50 [00:08<00:18,  1.74it/s]

[I 2025-11-01 22:21:00,429] Trial 16 finished with value: 109.69859076268779 and parameters: {'max_iter': 84, 'learning_rate': 0.04399863776813073, 'l2_regularization': 0.23778048215570902}. Best is trial 5 with value: 106.84609860409188.


Best trial: 17. Best value: 106.753:  36%|███▌      | 18/50 [00:08<00:20,  1.55it/s]

[I 2025-11-01 22:21:01,236] Trial 17 finished with value: 106.75318289641064 and parameters: {'max_iter': 96, 'learning_rate': 0.0845937826354282, 'l2_regularization': 6.105017128867974}. Best is trial 17 with value: 106.75318289641064.


Best trial: 17. Best value: 106.753:  38%|███▊      | 19/50 [00:09<00:19,  1.56it/s]

[I 2025-11-01 22:21:01,871] Trial 18 finished with value: 107.0392378047381 and parameters: {'max_iter': 97, 'learning_rate': 0.11584002813593608, 'l2_regularization': 7.376377086559808}. Best is trial 17 with value: 106.75318289641064.


Best trial: 17. Best value: 106.753:  40%|████      | 20/50 [00:10<00:18,  1.58it/s]

[I 2025-11-01 22:21:02,483] Trial 19 finished with value: 110.94916045518733 and parameters: {'max_iter': 66, 'learning_rate': 0.052394046583764654, 'l2_regularization': 6.47037064871913}. Best is trial 17 with value: 106.75318289641064.


Best trial: 17. Best value: 106.753:  42%|████▏     | 21/50 [00:10<00:15,  1.85it/s]

[I 2025-11-01 22:21:02,805] Trial 20 finished with value: 109.94283456963805 and parameters: {'max_iter': 90, 'learning_rate': 0.34841480730860025, 'l2_regularization': 9.119167232179375}. Best is trial 17 with value: 106.75318289641064.


Best trial: 17. Best value: 106.753:  44%|████▍     | 22/50 [00:11<00:16,  1.67it/s]

[I 2025-11-01 22:21:03,543] Trial 21 finished with value: 106.97267978827404 and parameters: {'max_iter': 97, 'learning_rate': 0.1441711308016249, 'l2_regularization': 9.699006798352707}. Best is trial 17 with value: 106.75318289641064.


Best trial: 17. Best value: 106.753:  46%|████▌     | 23/50 [00:11<00:15,  1.74it/s]

[I 2025-11-01 22:21:04,056] Trial 22 finished with value: 107.55801568106776 and parameters: {'max_iter': 84, 'learning_rate': 0.21945644488000626, 'l2_regularization': 9.379654940376533}. Best is trial 17 with value: 106.75318289641064.


Best trial: 17. Best value: 106.753:  48%|████▊     | 24/50 [00:12<00:16,  1.58it/s]

[I 2025-11-01 22:21:04,831] Trial 23 finished with value: 108.7542925203563 and parameters: {'max_iter': 97, 'learning_rate': 0.04676474764226714, 'l2_regularization': 9.438482094909256}. Best is trial 17 with value: 106.75318289641064.


Best trial: 17. Best value: 106.753:  50%|█████     | 25/50 [00:13<00:16,  1.48it/s]

[I 2025-11-01 22:21:05,610] Trial 24 finished with value: 108.2322950128373 and parameters: {'max_iter': 100, 'learning_rate': 0.04763102887084621, 'l2_regularization': 5.775024427721742}. Best is trial 17 with value: 106.75318289641064.


Best trial: 17. Best value: 106.753:  52%|█████▏    | 26/50 [00:13<00:15,  1.56it/s]

[I 2025-11-01 22:21:06,162] Trial 25 finished with value: 107.48530071298948 and parameters: {'max_iter': 96, 'learning_rate': 0.2024855414275002, 'l2_regularization': 9.993624386415512}. Best is trial 17 with value: 106.75318289641064.


Best trial: 17. Best value: 106.753:  54%|█████▍    | 27/50 [00:14<00:13,  1.65it/s]

[I 2025-11-01 22:21:06,685] Trial 26 finished with value: 108.00930134394194 and parameters: {'max_iter': 73, 'learning_rate': 0.18272486611478012, 'l2_regularization': 1.9607594542555924}. Best is trial 17 with value: 106.75318289641064.


Best trial: 17. Best value: 106.753:  56%|█████▌    | 28/50 [00:14<00:10,  2.03it/s]

[I 2025-11-01 22:21:06,917] Trial 27 finished with value: 167.82510489356608 and parameters: {'max_iter': 23, 'learning_rate': 0.02366256256976662, 'l2_regularization': 8.609673685680681}. Best is trial 17 with value: 106.75318289641064.


Best trial: 17. Best value: 106.753:  58%|█████▊    | 29/50 [00:14<00:10,  2.04it/s]

[I 2025-11-01 22:21:07,402] Trial 28 finished with value: 107.29401910431827 and parameters: {'max_iter': 94, 'learning_rate': 0.15119393423714608, 'l2_regularization': 5.520321531256846}. Best is trial 17 with value: 106.75318289641064.


Best trial: 17. Best value: 106.753:  60%|██████    | 30/50 [00:15<00:09,  2.15it/s]

[I 2025-11-01 22:21:07,805] Trial 29 finished with value: 108.63748684887928 and parameters: {'max_iter': 60, 'learning_rate': 0.29754633374899386, 'l2_regularization': 8.541254843274125}. Best is trial 17 with value: 106.75318289641064.


Best trial: 17. Best value: 106.753:  62%|██████▏   | 31/50 [00:16<00:09,  1.91it/s]

[I 2025-11-01 22:21:08,466] Trial 30 finished with value: 106.99074683947958 and parameters: {'max_iter': 85, 'learning_rate': 0.1452616211921068, 'l2_regularization': 8.961420499493203}. Best is trial 17 with value: 106.75318289641064.


Best trial: 17. Best value: 106.753:  64%|██████▍   | 32/50 [00:16<00:09,  1.80it/s]

[I 2025-11-01 22:21:09,097] Trial 31 finished with value: 107.14882856940676 and parameters: {'max_iter': 83, 'learning_rate': 0.14076281760878637, 'l2_regularization': 9.804797870788795}. Best is trial 17 with value: 106.75318289641064.


Best trial: 17. Best value: 106.753:  66%|██████▌   | 33/50 [00:17<00:09,  1.87it/s]

[I 2025-11-01 22:21:09,580] Trial 32 finished with value: 107.35528166971592 and parameters: {'max_iter': 87, 'learning_rate': 0.16223967781662307, 'l2_regularization': 7.41153191179948}. Best is trial 17 with value: 106.75318289641064.


Best trial: 17. Best value: 106.753:  70%|███████   | 35/50 [00:17<00:06,  2.24it/s]

[I 2025-11-01 22:21:10,249] Trial 33 finished with value: 107.00601771726296 and parameters: {'max_iter': 81, 'learning_rate': 0.10624828710274628, 'l2_regularization': 8.186352307485334}. Best is trial 17 with value: 106.75318289641064.
[I 2025-11-01 22:21:10,397] Trial 34 finished with value: 113.41947650857344 and parameters: {'max_iter': 12, 'learning_rate': 0.22201982983363633, 'l2_regularization': 0.10989189243771236}. Best is trial 17 with value: 106.75318289641064.


Best trial: 17. Best value: 106.753:  72%|███████▏  | 36/50 [00:18<00:06,  2.30it/s]

[I 2025-11-01 22:21:10,802] Trial 35 finished with value: 107.80594700669182 and parameters: {'max_iter': 35, 'learning_rate': 0.22866961438205535, 'l2_regularization': 9.826861567778046}. Best is trial 17 with value: 106.75318289641064.


Best trial: 36. Best value: 106.626:  74%|███████▍  | 37/50 [00:19<00:06,  1.93it/s]

[I 2025-11-01 22:21:11,512] Trial 36 finished with value: 106.62567495958235 and parameters: {'max_iter': 98, 'learning_rate': 0.13127891494843358, 'l2_regularization': 8.326217135905448}. Best is trial 36 with value: 106.62567495958235.


Best trial: 36. Best value: 106.626:  76%|███████▌  | 38/50 [00:19<00:06,  1.87it/s]

[I 2025-11-01 22:21:12,084] Trial 37 finished with value: 106.68151680977583 and parameters: {'max_iter': 94, 'learning_rate': 0.12548829562855865, 'l2_regularization': 7.536655492734865}. Best is trial 36 with value: 106.62567495958235.


Best trial: 36. Best value: 106.626:  78%|███████▊  | 39/50 [00:20<00:06,  1.61it/s]

[I 2025-11-01 22:21:12,911] Trial 38 finished with value: 110.12909239248403 and parameters: {'max_iter': 88, 'learning_rate': 0.042748174449258686, 'l2_regularization': 7.2153791014055955}. Best is trial 36 with value: 106.62567495958235.


Best trial: 36. Best value: 106.626:  80%|████████  | 40/50 [00:20<00:05,  1.86it/s]

[I 2025-11-01 22:21:13,257] Trial 39 finished with value: 112.43983975412435 and parameters: {'max_iter': 88, 'learning_rate': 0.49641180536051777, 'l2_regularization': 6.605992458284577}. Best is trial 36 with value: 106.62567495958235.


Best trial: 36. Best value: 106.626:  82%|████████▏ | 41/50 [00:21<00:04,  1.83it/s]

[I 2025-11-01 22:21:13,818] Trial 40 finished with value: 107.23473141540659 and parameters: {'max_iter': 99, 'learning_rate': 0.19312783362336472, 'l2_regularization': 7.325055474542392}. Best is trial 36 with value: 106.62567495958235.


Best trial: 41. Best value: 106.437:  84%|████████▍ | 42/50 [00:22<00:05,  1.43it/s]

[I 2025-11-01 22:21:14,880] Trial 41 finished with value: 106.43685626802875 and parameters: {'max_iter': 99, 'learning_rate': 0.09475222987703978, 'l2_regularization': 7.2322349454003865}. Best is trial 41 with value: 106.43685626802875.


Best trial: 41. Best value: 106.437:  86%|████████▌ | 43/50 [00:23<00:04,  1.43it/s]

[I 2025-11-01 22:21:15,570] Trial 42 finished with value: 107.44122838004272 and parameters: {'max_iter': 84, 'learning_rate': 0.14106613498700257, 'l2_regularization': 6.810825258326328}. Best is trial 41 with value: 106.43685626802875.


Best trial: 41. Best value: 106.437:  88%|████████▊ | 44/50 [00:23<00:04,  1.35it/s]

[I 2025-11-01 22:21:16,415] Trial 43 finished with value: 107.04758826925774 and parameters: {'max_iter': 97, 'learning_rate': 0.06623011418423239, 'l2_regularization': 7.615790612507069}. Best is trial 41 with value: 106.43685626802875.


Best trial: 41. Best value: 106.437:  90%|█████████ | 45/50 [00:24<00:03,  1.66it/s]

[I 2025-11-01 22:21:16,695] Trial 44 finished with value: 108.70427410343304 and parameters: {'max_iter': 27, 'learning_rate': 0.3117889135605332, 'l2_regularization': 3.7734415134720214}. Best is trial 41 with value: 106.43685626802875.


Best trial: 41. Best value: 106.437:  92%|█████████▏| 46/50 [00:24<00:02,  1.93it/s]

[I 2025-11-01 22:21:17,012] Trial 45 finished with value: 157.89615456763977 and parameters: {'max_iter': 31, 'learning_rate': 0.021880743737880598, 'l2_regularization': 0.41965122581661696}. Best is trial 41 with value: 106.43685626802875.


Best trial: 41. Best value: 106.437:  94%|█████████▍| 47/50 [00:25<00:01,  1.72it/s]

[I 2025-11-01 22:21:17,738] Trial 46 finished with value: 115.97034336650506 and parameters: {'max_iter': 68, 'learning_rate': 0.03510119033483587, 'l2_regularization': 2.6612496300705093}. Best is trial 41 with value: 106.43685626802875.


Best trial: 41. Best value: 106.437:  96%|█████████▌| 48/50 [00:26<00:01,  1.50it/s]

[I 2025-11-01 22:21:18,610] Trial 47 finished with value: 106.55940045978865 and parameters: {'max_iter': 99, 'learning_rate': 0.11393554064054226, 'l2_regularization': 6.356501870689687}. Best is trial 41 with value: 106.43685626802875.


Best trial: 48. Best value: 106.114:  98%|█████████▊| 49/50 [00:27<00:00,  1.34it/s]

[I 2025-11-01 22:21:19,533] Trial 48 finished with value: 106.11371766638698 and parameters: {'max_iter': 94, 'learning_rate': 0.14650511854697806, 'l2_regularization': 4.80666423314504}. Best is trial 48 with value: 106.11371766638698.


Best trial: 48. Best value: 106.114: 100%|██████████| 50/50 [00:27<00:00,  1.80it/s]

[I 2025-11-01 22:21:20,198] Trial 49 finished with value: 106.97829670481916 and parameters: {'max_iter': 96, 'learning_rate': 0.17971247269577406, 'l2_regularization': 5.583428202094666}. Best is trial 48 with value: 106.11371766638698.





{'max_iter': 94,
 'learning_rate': 0.14650511854697806,
 'l2_regularization': 4.80666423314504}

Now that we have found the best hyperparameters, we can train on the big train set and evaluate on the test set!

**Exercise**: Train a final model on the big train set using the best hyperparameters and evaluate on the test set.

In [17]:
# Solution
from sklearn.ensemble import HistGradientBoostingRegressor

model = HistGradientBoostingRegressor(
    categorical_features=validated_params.categorical_feat,
    early_stopping=True,
    random_state=42,
    **best_params,
)
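# Note: passing the test set as X_val/y_val lets early stopping monitor the test set,
# so the test RMSE reported below is slightly optimistic.
model.fit(X=X_big_train, y=y_big_train, X_val=X_test, y_val=y_test)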

big_train_predictions = model.predict(X_big_train)
test_predictions = model.predict(X_test)

big_train_rmse = root_mean_squared_error(y_true=y_big_train, y_pred=big_train_predictions)
print(f"Big Train RMSE: {big_train_rmse}")
test_rmse = root_mean_squared_error(y_true=y_test, y_pred=test_predictions)
print(f"Test RMSE: {test_rmse}")

Big Train RMSE: 87.34474200996922
Test RMSE: 103.72454358053689


# 3. Bonus

You can see that the model overfits: the RMSE is about 87.3 on the big train set but about 103.7 on the test set!

**Exercise**: Modify the function in the `train_utils.py` file to limit the overfitting of the model.

<details>

<summary>Click to reveal tip 1</summary>

Add a term to the metric being optimized that penalizes the gap between the train and validation errors.

</details>

<details>

<summary>Click to reveal tip 2</summary>

Use an `alpha` parameter to control the impact of this gap.

</details>

In [18]:
# Solution
import optuna
import numpy as np
from src.train_utils import build_search_space

def optimize_hyperparams_hgb(
    trial: optuna.trial.Trial,
    search_params: dict[str, Any],
    df_train: pd.DataFrame,
    y_train: pd.Series,
    df_val: pd.DataFrame,
    y_val: pd.Series,
    categorical_features: list[str],
    alpha: float = 0.3,
) -> float:
    """Optimize the hyperparameters of the HistGradientBoosting model.

    Args:
        trial (optuna.trial.Trial): The Optuna trial object.
        search_params (dict[str, Any]): The search space parameters.
        df_train (pd.DataFrame): The training features.
        y_train (pd.Series): The training target.
        df_val (pd.DataFrame): The validation features.
        y_val (pd.Series): The validation target.
        categorical_features (list[str]): List of categorical feature names.
        alpha (float): Weight of the penalty on the gap between train and validation RMSE.

    Returns:
        (float): The validation RMSE plus alpha times the absolute train/validation RMSE gap.
    """
    # Build search space
    hyperparams = build_search_space(trial, search_params)

    # Define the model
    model = HistGradientBoostingRegressor(
        categorical_features=categorical_features,
        early_stopping=True,
        random_state=42,
        **hyperparams,
    )
    model.fit(X=df_train, y=y_train, X_val=df_val, y_val=y_val)
    val_predictions = model.predict(df_val)
    rmse_train = root_mean_squared_error(y_true=y_train, y_pred=model.predict(df_train))
    rmse_val = root_mean_squared_error(y_true=y_val, y_pred=val_predictions)

    return rmse_val + alpha * np.abs(rmse_train - rmse_val)
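To get a feel for the effect of the penalty, here is a small worked example of the returned score on two hypothetical candidates. The RMSE values are invented for the illustration and `penalized_score` simply reproduces the formula of the return statement.

```python
def penalized_score(rmse_train: float, rmse_val: float, alpha: float = 0.3) -> float:
    """Validation RMSE plus alpha times the absolute train/validation gap."""
    return rmse_val + alpha * abs(rmse_train - rmse_val)


# Candidate A: better validation RMSE but a large train/validation gap.
score_a = penalized_score(rmse_train=80.0, rmse_val=106.0)  # 106 + 0.3 * 26
# Candidate B: slightly worse validation RMSE but a much smaller gap.
score_b = penalized_score(rmse_train=100.0, rmse_val=107.0)  # 107 + 0.3 * 7

print(f"Candidate A: {score_a:.1f}")
print(f"Candidate B: {score_b:.1f}")
```

With `alpha = 0.3`, candidate B gets the lower score (109.1 versus 113.8) and would be preferred even though its validation RMSE is slightly worse. With `alpha = 0`, the score falls back to the plain validation RMSE and candidate A wins again.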