## Goals: Hyperparameter Optimisation of the *QRF* Model

This notebook proposes several hyperparameter optimisation methods based on cross-validation:
* Random Search
* Genetic algorithm [Not yet included]

# 1. Data Import and Setup

Import the necessary libraries and set up the environment paths.

In [31]:
# Standard library imports
import os
import sys
from functools import partial

# Third-party imports
import pandas as pd
from quantile_forest import RandomForestQuantileRegressor
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingRegressor

# Append project root to sys.path for local imports
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..', '..', '..', '..')))

# Local application imports
from src.utils.model import get_station_stats, custom_log_likelihood
from src.utils.SpatioTemporalSplit import SpatioTemporalSplit
from src.utils.custom_models import SnowIndexComputeTransformer


Define the constants:
* INPUT_DIR must be the same as the one defined in *00 Preprocessing/Feature Engineering*.
* MODEL_DIR is the directory where the exploration models will be saved.

In [32]:
INPUT_DIR = "../../../../data/input/"
MODEL_DIR = "../../../../models/exploration/"

SEED = 42
ALPHA = 0.1
WEEK_TO_PREDICT = 4

DATASET_TRANSFORMS = [
    "rm_gnv_st",
    "pca",
    "snow_index",
    # "oh_enc_date",
    "cyc_enc_date",
    "clust_index",
    "scl_feat",
    # "scl_feat_wl", # Scale all except waterflow lag
    "scl_catch",
]

DATASET_SPEC = "_".join(DATASET_TRANSFORMS)

# Columns to drop: targets at every horizon, station_code, and features removed during feature selection
TO_DROP = ["water_flow_week1", "station_code", "water_flow_week2", "water_flow_week3", "water_flow_week4"]
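
The joined `DATASET_SPEC` string decides which CSV file is read in the next section. The sketch below shows how the file name is assembled; the transform list is shortened for illustration and is not the one used in this notebook.

```python
# Shortened transform list, for illustration only
transforms = ["rm_gnv_st", "pca", "snow_index"]
spec = "_".join(transforms)

input_dir = "../../../../data/input/"
path = f"{input_dir}dataset_{spec}.csv"
print(path)  # ../../../../data/input/dataset_rm_gnv_st_pca_snow_index.csv
```

Because the order of the transforms is part of the file name, reordering the list points the notebook at a different file.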

# 2. Data Loading
Load the baseline dataset.

In [33]:
# load the dataset
ds_train = pd.read_csv(f"{INPUT_DIR}dataset_{DATASET_SPEC}.csv")
train_data = ds_train.copy()
train_data.reset_index(inplace=True)
train_data = train_data.loc[:, ~train_data.columns.duplicated()]
ds_train = ds_train.set_index("ObsDate")
y_train = train_data[f"water_flow_week{WEEK_TO_PREDICT}"]
cv_data = train_data.copy()


# 3. Model Preparation

Compute station statistics (useful for scaling).

In [34]:
station_stats = get_station_stats(
    y_train.to_numpy(),
    train_data["station_code"].to_numpy()
)
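
The return value of `get_station_stats` is not shown in this notebook. As a rough illustration of what per-station statistics could look like, the sketch below groups a target vector by station code. The function `sketch_station_stats` and the chosen aggregates are assumptions, not the project's implementation.

```python
import numpy as np
import pandas as pd


def sketch_station_stats(y, station_codes):
    """Hypothetical stand-in for get_station_stats: summarise the target per station."""
    df = pd.DataFrame({"station_code": station_codes, "y": y})
    return df.groupby("station_code")["y"].agg(["mean", "std", "min", "max"])


# Two stations with very different flow magnitudes
y_demo = np.array([1.0, 3.0, 10.0, 30.0])
codes_demo = np.array(["A", "A", "B", "B"])

stats_demo = sketch_station_stats(y_demo, codes_demo)
print(stats_demo)
```

Statistics of this kind make it possible to normalise errors per station, so that stations with large flows do not dominate an aggregate score.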

Create a custom Pipeline to keep track of the station code

In [35]:
cols_to_drop = TO_DROP.copy()
cols_to_drop += ["ObsDate"]
predictor_cols = [col for col in cv_data.columns if col not in cols_to_drop]
preprocessor = ColumnTransformer(transformers=[
    ('select', 'passthrough', predictor_cols)
], remainder='drop')

snowIndexer = SnowIndexComputeTransformer()

qrf_week1 = RandomForestQuantileRegressor(n_estimators=10, max_depth=10, min_samples_leaf=10)
# qrf_week1 = GradientBoostingRegressor()

pipeline = Pipeline(steps=[
    # ('snowindexer', SnowIndexComputeTransformer(temp_col_name="tempartures_pca_1", rain_col_name="precipitations_pca_1",)),
    ('preprocessor', preprocessor),
    ('model', qrf_week1)
])
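
The pipeline receives the full frame, including `station_code` and `ObsDate`, while the `select` step passes only the predictor columns to the model. The sketch below reproduces that behaviour on toy data and lists the `model__` parameter names that a search can tune. `RandomForestRegressor` stands in for the quantile forest, and all column names are invented for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

# Toy frame: two predictors plus a station identifier that must not reach the model
X_demo = pd.DataFrame({
    "feat_a": [0.1, 0.2, 0.3, 0.4],
    "feat_b": [1.0, 0.0, 1.0, 0.0],
    "station_code": ["A", "A", "B", "B"],
})
y_demo = [1.0, 2.0, 3.0, 4.0]

# Keep only the listed predictors; every other column is dropped
selector = ColumnTransformer(
    transformers=[("select", "passthrough", ["feat_a", "feat_b"])],
    remainder="drop",
)
demo_pipeline = Pipeline(steps=[
    ("preprocessor", selector),
    ("model", RandomForestRegressor(n_estimators=5, random_state=0)),
])
demo_pipeline.fit(X_demo, y_demo)

# Number of columns the model actually saw
n_seen = demo_pipeline.named_steps["model"].n_features_in_
# Parameter names follow the "<step name>__<parameter>" convention
tunable = [k for k in demo_pipeline.get_params() if k.startswith("model__")]
print(n_seen)
print(tunable[:5])
```

Keeping the extra columns in the input frame lets the splitter and the scorer read them, even though the model never does.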

Initialise the log-likelihood scorer. The sign is flipped because `RandomizedSearchCV` treats higher scores as better.

In [36]:
def inverted_log_likelihood(estimator, X, y_true, cv_data, station_stats, alpha=0.1):
    return -custom_log_likelihood(estimator, X, y_true, cv_data, station_stats, alpha=alpha)

In [37]:
scorer = partial(inverted_log_likelihood,
                 cv_data=cv_data,
                 station_stats=station_stats,
                 alpha=ALPHA)
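
scikit-learn accepts any callable with the signature `(estimator, X, y)` as a scorer, which is why the extra arguments are bound with `partial`. The sketch below shows the same pattern with a toy loss. `demo_loss`, `demo_score` and the `scale` argument are hypothetical and unrelated to `custom_log_likelihood`.

```python
from functools import partial

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def demo_loss(estimator, X, y_true, scale):
    """Hypothetical loss: mean absolute error divided by a fixed scale (lower is better)."""
    return float(np.mean(np.abs(estimator.predict(X) - y_true)) / scale)


def demo_score(estimator, X, y_true, scale):
    # Negate the loss so that a larger score means a better model
    return -demo_loss(estimator, X, y_true, scale)


# Bind the extra argument so the callable matches (estimator, X, y)
demo_scorer = partial(demo_score, scale=2.0)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

scores = cross_val_score(LinearRegression(), X_demo, y_demo, scoring=demo_scorer, cv=3)
print(scores)
```

With this convention every score is zero or negative, and the search keeps the candidate whose score is closest to zero.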

Initialise the spatio-temporal splitter.

In [38]:
cv = SpatioTemporalSplit(
    n_splits=10,
    date_col='ObsDate',
    station_col='station_code',
    temporal_frac=0.75,
    spatial_frac=0.75,
    random_state=SEED
)
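
The source of `SpatioTemporalSplit` is not part of this notebook. To illustrate the interface scikit-learn expects from a custom splitter (`split` and `get_n_splits`), here is a minimal sketch that assumes one plausible strategy: train on a random subset of stations before a date cutoff, validate on the remaining stations after it. The class `SketchSpatioTemporalSplit` and its logic are assumptions, not the project's implementation.

```python
import numpy as np
import pandas as pd


class SketchSpatioTemporalSplit:
    """Hypothetical splitter: train on some stations before a cutoff date,
    validate on the held-out stations after it."""

    def __init__(self, n_splits=3, date_col="ObsDate", station_col="station_code",
                 temporal_frac=0.75, spatial_frac=0.75, random_state=None):
        self.n_splits = n_splits
        self.date_col = date_col
        self.station_col = station_col
        self.temporal_frac = temporal_frac
        self.spatial_frac = spatial_frac
        self.random_state = random_state

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

    def split(self, X, y=None, groups=None):
        rng = np.random.default_rng(self.random_state)
        dates = pd.to_datetime(X[self.date_col]).to_numpy()
        unique_dates = np.unique(dates)  # sorted unique dates
        # Last date that still belongs to the training period
        cutoff = unique_dates[max(0, int(len(unique_dates) * self.temporal_frac) - 1)]
        before_cutoff = dates <= cutoff

        stations = X[self.station_col].to_numpy()
        unique_stations = np.unique(stations)
        n_train = max(1, int(len(unique_stations) * self.spatial_frac))

        for _ in range(self.n_splits):
            # Draw a fresh random subset of training stations for every split
            train_stations = rng.choice(unique_stations, size=n_train, replace=False)
            in_train = np.isin(stations, train_stations)
            yield (np.flatnonzero(in_train & before_cutoff),
                   np.flatnonzero(~in_train & ~before_cutoff))


# Toy data: 4 stations with 8 weekly observations each
demo = pd.DataFrame({
    "ObsDate": np.tile(pd.date_range("2020-01-05", periods=8, freq="W"), 4),
    "station_code": np.repeat(["A", "B", "C", "D"], 8),
})
splits = list(SketchSpatioTemporalSplit(n_splits=3, random_state=42).split(demo))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Under these assumptions the validation rows come from stations and dates that the model never saw during training, which is a stricter test than a purely random split.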


# 4. Hyperparameter Tuning

Define the hyperparameter distributions for the random search. Note that the ranges presented here were chosen to keep the search fast; explore wider parameter ranges for a real tuning run.
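
For wider ranges, `RandomizedSearchCV` also accepts `scipy.stats` distributions in place of fixed lists, so each candidate is drawn from a continuous or integer range. The sketch below uses `ParameterSampler` to preview a few draws. The bounds are illustrative and have not been tuned for this dataset.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

# Illustrative bounds only; adjust them to the available compute budget
wide_param_distributions = {
    "model__n_estimators": randint(50, 501),     # integers in [50, 500]
    "model__max_depth": randint(2, 51),          # integers in [2, 50]
    "model__min_samples_leaf": randint(1, 31),   # integers in [1, 30]
    "model__max_features": uniform(0.1, 0.9),    # floats in [0.1, 1.0]
    "model__bootstrap": [True, False],           # lists are sampled uniformly
}

# Preview the candidates a random search would draw
samples = list(ParameterSampler(wide_param_distributions, n_iter=5, random_state=42))
for sample in samples:
    print(sample)
```

A dictionary like this can be passed as `param_distributions` without any other change to the search.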

#### a. Random Search

In [39]:
# param_distributions = {
#     'model__n_estimators': [2, 10, 20, 45, 60, 85, 100],
#     'model__max_depth': [2, 7, 13, 20, 30, 50],
#     'model__min_samples_leaf': [1, 4, 9, 15, 20, 30],
#     'model__min_samples_split': [2, 5, 10, 20],
#     'model__max_features': ['sqrt', 'log2', 0.3, 0.7, None],
#     'model__bootstrap': [True, False],
#     'snowindexer__altitude_weight': [0.9, 1, 1.1, 1.2],
#     'snowindexer__temp_weight': [0.9, 1, 1.1, 1.2],
#     'snowindexer__precip_weight': [0.1, 0.15, 0.2, 0.25],
# }

# param_distributions = {
#     'model__n_estimators': [52, 55, 57, 59, 61],
#     'model__max_depth': [21, 22, 23, 24, 25],
#     'model__min_samples_leaf': [11, 12, 13, 14, 16],
#     'model__min_samples_split': [18, 19, 20, 21, 22],
#     'model__max_features': [None],
#     'model__bootstrap': [True],
#     'snowindexer__altitude_weight': [0.2, 0.4, 0.6, 0.8, 1],
#     'snowindexer__temp_weight': [0.2, 0.4, 0.6, 0.8, 1],
#     'snowindexer__precip_weight': [0.2, 0.4, 0.6, 0.8, 1],
# }

param_distributions = {
    'model__n_estimators': [2, 10, 20, 45, 60, 85, 100],
    'model__max_depth': [2, 7, 13, 20, 30, 50],
    'model__min_samples_leaf': [1, 4, 9, 15, 20, 30],
    'model__min_samples_split': [2, 5, 10, 20],
    'model__max_features': ['sqrt', 'log2', 0.3, 0.7, None],
    'model__bootstrap': [True, False]
}

# Set up RandomizedSearchCV.
random_search = RandomizedSearchCV(
    estimator=pipeline,
    param_distributions=param_distributions,
    n_iter=60,            # Number of parameter settings sampled
    scoring=scorer,       # Use our custom scorer
    cv=cv,                # Our custom spatio-temporal splitter
    random_state=SEED,
    n_jobs=-1,            # Use all available cores
    verbose=3
)

random_search.fit(cv_data, y_train)

Fitting 10 folds for each of 60 candidates, totalling 600 fits


BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
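
With `n_jobs=-1`, the pipeline, scorer and splitter are serialised and sent to worker processes, and the error above is raised while that transfer is being undone. The cause is not identified in this notebook. One way to start narrowing it down is to round-trip each object through `pickle`; the helper `find_unpicklable` below is a hypothetical diagnostic, demonstrated on stand-in objects.

```python
import pickle
from functools import partial


def find_unpicklable(named_objects):
    """Return the names of the objects that fail a pickle round trip."""
    failures = []
    for name, obj in named_objects.items():
        try:
            pickle.loads(pickle.dumps(obj))
        except Exception as exc:  # report any serialisation failure
            failures.append(name)
            print(f"{name}: {type(exc).__name__}: {exc}")
    return failures


# Stand-in objects: a partial of a builtin pickles fine, a lambda does not
demo_objects = {
    "partial_of_builtin": partial(int, base=2),
    "lambda": lambda x: x,
}
failed = find_unpicklable(demo_objects)
print(failed)
```

In this notebook the helper could be called on `{"pipeline": pipeline, "scorer": scorer, "cv": cv}`. A round trip inside the notebook process cannot reveal import problems that occur only in the workers, so rerunning the search with `n_jobs=1` is a useful complementary check.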

### params for config : 

```python
pipeline = Pipeline(steps=[
    ('snowindexer', SnowIndexComputeTransformer(temp_col_name="tempartures_pca_1", rain_col_name="precipitations_pca_1",)),
    ('preprocessor', preprocessor),
    ('model', qrf_week1)
])

DATASET_TRANSFORMS = [
    "rm_gnv_st",
    "pca",
    # "snow_index",
    "oh_enc_date",
    "scl_wtr_flows",
    "scl_catch"
]
```


In [10]:
print("Best Parameters:", random_search.best_params_)
print("Best Score:", random_search.best_score_)

Best Parameters: {'snowindexer__temp_weight': 1, 'snowindexer__precip_weight': 0.2, 'snowindexer__altitude_weight': 1, 'model__n_estimators': 52, 'model__min_samples_split': 19, 'model__min_samples_leaf': 13, 'model__max_features': None, 'model__max_depth': 25, 'model__bootstrap': True}
Best Score: -2.0340065806334726
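
The best scores recorded in this section are close to each other, so the spread across folds matters when comparing configurations. The sketch below shows how `cv_results_` can be turned into a ranked table with the per-candidate mean and standard deviation. It runs on synthetic data, with `RandomForestRegressor` and default scoring as stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression problem, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 4))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=80)

demo_search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions={"n_estimators": [5, 10, 20], "max_depth": [2, 4, 8]},
    n_iter=4,
    cv=3,
    random_state=0,
)
demo_search.fit(X_demo, y_demo)

# One row per sampled candidate, best candidate first
results = (
    pd.DataFrame(demo_search.cv_results_)
    .sort_values("rank_test_score")
    [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
)
print(results.head())
```

If the gap between two candidates is smaller than `std_test_score`, the ranking between them may not be meaningful.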


### params for config : 
```python
DATASET_TRANSFORMS = [
    "rm_gnv_st",
    "pca",
    "snow_index",
    "oh_enc_date",
    "scl_wtr_flows",
    "scl_catch"
]
```


In [None]:
print("Best Parameters:", random_search.best_params_)
print("Best Score:", random_search.best_score_)

Best Parameters: {'model__n_estimators': 55, 'model__min_samples_split': 20, 'model__min_samples_leaf': 11, 'model__max_features': None, 'model__max_depth': 23, 'model__bootstrap': True}
Best Score: -1.9197539988339372


### params for config : 
```python
DATASET_TRANSFORMS = [
    "rm_gnv_st",
    "pca",
    "snow_index",
    "oh_enc_date",
    "scl_wtr_flows",
    "scl_catch"
]
```


In [None]:
print("Best Parameters:", random_search.best_params_)
print("Best Score:", random_search.best_score_)

Best Parameters: {'model__n_estimators': 60, 'model__min_samples_split': 20, 'model__min_samples_leaf': 9, 'model__max_features': None, 'model__max_depth': 13, 'model__bootstrap': True}
Best Score: -1.9379500224596236


### params for config : 
```python
DATASET_TRANSFORMS = [
    "rm_gnv_st",
    "pca",
    # "snow_index",
    "oh_enc_date",
    "scl_wtr_flows",
    "scl_catch"
]
```


In [None]:
print("Best Parameters:", random_search.best_params_)
print("Best Score:", random_search.best_score_)

Best Parameters: {'model__n_estimators': 59, 'model__min_samples_split': 18, 'model__min_samples_leaf': 14, 'model__max_features': None, 'model__max_depth': 25, 'model__bootstrap': True}
Best Score: -1.923900362267943


### params for config : 
```python
DATASET_TRANSFORMS = [
    "rm_gnv_st",
    "pca",
    # "snow_index",
    "oh_enc_date",
    "scl_wtr_flows",
    "scl_catch"
]
```


In [None]:
print("Best Parameters:", random_search.best_params_)
print("Best Score:", random_search.best_score_)

Best Parameters: {'model__n_estimators': 60, 'model__min_samples_split': 20, 'model__min_samples_leaf': 9, 'model__max_features': None, 'model__max_depth': 13, 'model__bootstrap': True}
Best Score: -1.9528121007193104


### params for config : 
```python
DATASET_TRANSFORMS = [
    "rm_gnv_st",
    "pca",
    # "snow_index",
    "oh_enc_date",
    "scl_wtr_flows",
    # "rm_st_id"
]
```


In [None]:
print("Best Parameters:", random_search.best_params_)
print("Best Score:", random_search.best_score_)

Best Parameters: {'model__n_estimators': 57, 'model__min_samples_split': 20, 'model__min_samples_leaf': 14, 'model__max_features': None, 'model__max_depth': 22, 'model__bootstrap': True}
Best Score: -2.067613593686774


### params for config : 
```python
DATASET_TRANSFORMS = [
    "rm_gnv_st",
    "pca",
    # "snow_index",
    "oh_enc_date",
    "scl_wtr_flows",
    # "rm_st_id"
]
```


In [None]:
# print("Best Parameters:", random_search.best_params_)
# print("Best Score:", random_search.best_score_)

Best Parameters: {'model__n_estimators': 55, 'model__min_samples_split': 22, 'model__min_samples_leaf': 16, 'model__max_features': None, 'model__max_depth': 25, 'model__bootstrap': True}
Best Score: -1.9144132951754593


### params for config : 
```python
DATASET_TRANSFORMS = [
    "rm_gnv_st",
    "pca",
    # "snow_index",
    "oh_enc_date",
    "scl_wtr_flows",
    # "rm_st_id"
]
```


In [None]:
# print("Best Parameters:", random_search.best_params_)
# print("Best Score:", random_search.best_score_)

Best Parameters: {'model__n_estimators': 56, 'model__min_samples_split': 19, 'model__min_samples_leaf': 13, 'model__max_features': None, 'model__max_depth': 23, 'model__bootstrap': True}
Best Score: -1.9230515071418086


### params for config : 
```python
DATASET_TRANSFORMS = [
    "rm_gnv_st",
    "pca",
    # "snow_index",
    "oh_enc_date",
    "scl_wtr_flows",
    # "rm_st_id"
]
```


In [None]:
# print("Best Parameters:", random_search.best_params_)
# print("Best Score:", random_search.best_score_)

Best Parameters: {'model__n_estimators': 57, 'model__min_samples_split': 20, 'model__min_samples_leaf': 11, 'model__max_features': None, 'model__max_depth': 17, 'model__bootstrap': True}
Best Score: -1.9198867119123673


### params for config : 
```python
DATASET_TRANSFORMS = [
    "rm_gnv_st",
    "pca",
    # "snow_index",
    "oh_enc_date",
    "scl_wtr_flows",
    # "rm_st_id"
]
```


In [None]:
# print("Best Parameters:", random_search.best_params_)
# print("Best Score:", random_search.best_score_)

Best Parameters: {'model__n_estimators': 60, 'model__min_samples_split': 20, 'model__min_samples_leaf': 9, 'model__max_features': None, 'model__max_depth': 13, 'model__bootstrap': True}
Best Score: -2.1232930047642933


### params for config : 
```python
DATASET_TRANSFORMS = [
    "remove_geneve_station",
    "full_pca",
    # "snow_index",
    "one_hot_encode_month_season",
    "scale_train_waterflows",
]
```


In [None]:
# print("Best Parameters:", random_search.best_params_)
# print("Best Score:", random_search.best_score_)

Best Parameters: {'model__n_estimators': 60, 'model__min_samples_split': 20, 'model__min_samples_leaf': 9, 'model__max_features': None, 'model__max_depth': 13, 'model__bootstrap': True}
Best Score: -2.1232930047642933


### params for qrf pca_and_sin_season_encode

In [None]:
# print("Best Parameters:", random_search.best_params_)
# print("Best Score:", random_search.best_score_)

Best Parameters: {'model__n_estimators': 60, 'model__min_samples_split': 20, 'model__min_samples_leaf': 9, 'model__max_features': None, 'model__max_depth': 13, 'model__bootstrap': True}
Best Score: -2.1232930047642933


### params for qrf full_pca

In [None]:
# print("Best Parameters:", random_search.best_params_)
# print("Best Score:", random_search.best_score_)

Best Parameters: {'model__n_estimators': 60, 'model__min_samples_split': 20, 'model__min_samples_leaf': 9, 'model__max_features': None, 'model__max_depth': 13, 'model__bootstrap': True}
Best Score: -2.1232930047642933


#### b. Genetic Algorithm

COMING SOON