## Goals: Hyperparameter Optimisation of the *QRF* Model

This notebook proposes different methods of hyperparameter optimisation based on cross-validation:
* Random Search
* Genetic algorithm [Not yet included]

# 1. Data Import and Setup

Import the necessary libraries and set up the environment paths.

In [1]:
# Standard library imports
import os
import sys
from functools import partial

# Third-party imports
import pandas as pd
from quantile_forest import RandomForestQuantileRegressor
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV

# Append project root to sys.path for local imports
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..', '..', '..', '..')))

# Local application imports
from src.utils.model import get_station_stats, custom_log_likelihood
from src.utils.SpatioTemporalSplit import SpatioTemporalSplit

Define the constants:
* INPUT_DIR must be the same as the one defined in *00 Preprocessing/Feature Engineering*.
* MODEL_DIR is the directory where the exploration models will be saved.

In [4]:
INPUT_DIR = "../../../../data/input/"
MODEL_DIR = "../../../../models/exploration/"

SEED = 42
ALPHA = 0.1
WEEK_TO_PREDICT = 4        # forecast horizon; selects the water_flow_week{N} target
DATASET_SPEC = "soil_pca"  # suffix of the dataset file name

# Columns to drop: targets at the different horizons, station_code, and features removed by feature selection
TO_DROP = ["water_flow_week1", "station_code", "water_flow_week2", "water_flow_week3", "water_flow_week4"]

# 2. Data Loading
Load the baseline dataset and extract the target for the chosen forecast horizon.

In [5]:
# load the dataset
ds_train = pd.read_csv(f"{INPUT_DIR}dataset_{DATASET_SPEC}.csv")
train_data = ds_train.copy()
train_data.reset_index(inplace=True)
train_data = train_data.loc[:, ~train_data.columns.duplicated()]
ds_train = ds_train.set_index("ObsDate")
y_train = train_data[f"water_flow_week{WEEK_TO_PREDICT}"]
cv_data = train_data.copy()


# 3. Model Preparation

Compute station statistics (useful for scaling).

In [6]:
station_stats = get_station_stats(
    y_train.to_numpy(),
    train_data["station_code"].to_numpy()
)
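
The implementation of `get_station_stats` is not shown in this notebook. As a rough sketch only, assuming it groups the target by station and returns per-station summary statistics, the idea could look like the following. The function name `station_stats_sketch` and the returned fields are hypothetical and may differ from the project helper.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def station_stats_sketch(y, station_codes):
    """Hypothetical stand-in for get_station_stats: per-station summary of the target."""
    grouped = defaultdict(list)
    for value, code in zip(y, station_codes):
        grouped[code].append(float(value))
    return {
        code: {
            "mean": fmean(values),
            "std": pstdev(values),  # population standard deviation
            "min": min(values),
            "max": max(values),
        }
        for code, values in grouped.items()
    }


# Toy data: two stations with very different flow magnitudes
y_demo = [1.0, 3.0, 100.0, 300.0]
codes_demo = ["A", "A", "B", "B"]
stats_demo = station_stats_sketch(y_demo, codes_demo)
```

Statistics of this kind make it possible to rescale errors per station, so that stations with large flows do not dominate a score.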

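Create a Pipeline whose first step selects only the predictor columns. This keeps `station_code` and `ObsDate` available in `cv_data` for the splitter and the scorer without feeding them to the model.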

In [7]:
cols_to_drop = TO_DROP.copy()
cols_to_drop += ["ObsDate"]
predictor_cols = [col for col in cv_data.columns if col not in cols_to_drop]
preprocessor = ColumnTransformer(transformers=[
    ('select', 'passthrough', predictor_cols)
], remainder='drop')

# Baseline model; these hyperparameters are overridden during the search
qrf = RandomForestQuantileRegressor(n_estimators=10, max_depth=10, min_samples_leaf=10)

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', qrf)
])

Initialise the log-likelihood scorer.

In [8]:
scorer = partial(custom_log_likelihood,
                 cv_data=cv_data,
                 station_stats=station_stats,
                 alpha=ALPHA)
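
scikit-learn accepts a callable as `scoring` when it can be called as `scorer(estimator, X, y)` and returns a float, with higher values treated as better. `functools.partial` binds the extra keyword arguments so that only those three positional arguments remain. The sketch below illustrates the mechanism with a toy estimator and a toy scorer; `ConstantModel` and `toy_scorer` are hypothetical and unrelated to `custom_log_likelihood`, whose implementation is not shown here.

```python
from functools import partial


class ConstantModel:
    """Toy estimator that always predicts the same value."""

    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return [self.value for _ in X]


def toy_scorer(estimator, X, y, scale=1.0):
    """Hypothetical scorer: negative mean absolute error divided by `scale`."""
    preds = estimator.predict(X)
    mae = sum(abs(p - t) for p, t in zip(preds, y)) / len(y)
    return -mae / scale  # higher (closer to 0) is better


# Bind the extra keyword; the result is still called as (estimator, X, y)
bound_scorer = partial(toy_scorer, scale=2.0)
score = bound_scorer(ConstantModel(5.0), [[0], [1]], [4.0, 8.0])
```

Because the search maximises the score, a loss-like quantity would have to be negated before being returned by a scorer of this form.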

Initialise the spatio-temporal splitter.

In [9]:
cv = SpatioTemporalSplit(
    n_splits=10,
    date_col='ObsDate',
    station_col='station_code',
    temporal_frac=0.75,
    spatial_frac=0.75,
    random_state=42
)
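
The implementation of `SpatioTemporalSplit` lives in `src.utils` and is not shown here. The sketch below is one plausible reading of its parameters, not the project code: it assumes that each split trains on the earliest `temporal_frac` of the dates for a random `spatial_frac` of the stations, and tests on the remaining dates for the remaining stations. The function name and this exact behaviour are assumptions.

```python
import random


def spatio_temporal_split_sketch(dates, stations, n_splits=3,
                                 temporal_frac=0.75, spatial_frac=0.75, seed=42):
    """Yield (train_idx, test_idx) pairs; a hypothetical reading of the splitter."""
    rng = random.Random(seed)
    unique_dates = sorted(set(dates))
    unique_stations = sorted(set(stations))
    n_train_dates = int(len(unique_dates) * temporal_frac)
    n_train_stations = int(len(unique_stations) * spatial_frac)
    train_dates = set(unique_dates[:n_train_dates])  # earliest dates go to training
    for _ in range(n_splits):
        train_stations = set(rng.sample(unique_stations, n_train_stations))
        train_idx = [i for i, (d, s) in enumerate(zip(dates, stations))
                     if d in train_dates and s in train_stations]
        test_idx = [i for i, (d, s) in enumerate(zip(dates, stations))
                    if d not in train_dates and s not in train_stations]
        yield train_idx, test_idx


# Toy panel: 4 weekly dates x 4 stations (ISO dates sort chronologically as strings)
dates_demo = [d for d in ("2020-01-05", "2020-01-12", "2020-01-19", "2020-01-26")
              for _ in range(4)]
stations_demo = ["s1", "s2", "s3", "s4"] * 4
splits_demo = list(spatio_temporal_split_sketch(dates_demo, stations_demo))
```

Under these assumptions the test rows are unseen in both time and space, which gives a stricter estimate of generalisation than a purely random split.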


# 4. Hyperparameter Tuning

Define the hyperparameter distributions for the random search. Note that the values listed here are chosen so that the search runs quickly; explore wider parameter ranges for a real tuning run.

#### a. Random Search

In [14]:
param_distributions = {
    'model__n_estimators': [2, 5, 10, 15, 20, 30, 45, 60, 80, 100],
    'model__max_depth': [2, 5, 6, 7, 10, 13, 16, 20, 25, 30],
    'model__min_samples_leaf': [1, 2, 5, 9, 12, 15, 20],
    'model__min_samples_split': [2, 5, 10, 15, 20],
    'model__max_features': ['sqrt', 'log2', 0.3, 0.5, 0.7, None],
    'model__bootstrap': [True, False]
}

# Set up RandomizedSearchCV.
random_search = RandomizedSearchCV(
    estimator=pipeline,
    param_distributions=param_distributions,
    n_iter=75,            # Number of parameter settings sampled
    scoring=scorer,       # Use our custom scorer
    cv=cv,                # Our custom spatio-temporal splitter
    random_state=42,
    n_jobs=-1             # Use all available cores
)

random_search.fit(cv_data, y_train)
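
To put the search budget in perspective, the sketch below counts the combinations in the grid above and compares that with `n_iter=75`. It is a back-of-the-envelope illustration in plain Python, not part of the tuning pipeline. The candidate draw only mimics the idea of random search; scikit-learn's own sampler draws without replacement when every parameter is given as a list.

```python
import math
import random

# Same candidate lists as in the search cell
param_distributions = {
    "model__n_estimators": [2, 5, 10, 15, 20, 30, 45, 60, 80, 100],
    "model__max_depth": [2, 5, 6, 7, 10, 13, 16, 20, 25, 30],
    "model__min_samples_leaf": [1, 2, 5, 9, 12, 15, 20],
    "model__min_samples_split": [2, 5, 10, 15, 20],
    "model__max_features": ["sqrt", "log2", 0.3, 0.5, 0.7, None],
    "model__bootstrap": [True, False],
}

n_iter = 75
n_splits = 10

n_combinations = math.prod(len(values) for values in param_distributions.values())
coverage = n_iter / n_combinations
n_fits = n_iter * n_splits  # excludes the final refit on the full data

# Conceptual draw of one candidate: one value per parameter, chosen uniformly
rng = random.Random(42)
candidate = {name: rng.choice(values) for name, values in param_distributions.items()}

print(f"{n_combinations} combinations, {n_iter} sampled ({coverage:.2%}), {n_fits} fits")
# prints "42000 combinations, 75 sampled (0.18%), 750 fits"
```

With well under 1% of the grid visited, the reported best parameters can change noticeably from one run to the next.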



In [15]:
print("Best Parameters:", random_search.best_params_)
print("Best Score:", random_search.best_score_)

Best Parameters: {'model__n_estimators': 2, 'model__min_samples_split': 20, 'model__min_samples_leaf': 15, 'model__max_features': 0.5, 'model__max_depth': 10, 'model__bootstrap': False}
Best Score: 14373.411202444146


In [13]:
print("Best Parameters:", random_search.best_params_)
print("Best Score:", random_search.best_score_)

Best Parameters: {'model__n_estimators': 2, 'model__min_samples_leaf': 5, 'model__max_depth': 10}
Best Score: 31556.9611889695


In [11]:
print("Best Parameters:", random_search.best_params_)
print("Best Score:", random_search.best_score_)

Best Parameters: {'model__n_estimators': 2, 'model__min_samples_leaf': 1, 'model__max_depth': 5}
Best Score: 6530.383805338368


#### b. Genetic Algorithm

COMING SOON