## Goals: 10-Fold Cross-Validation Performance of the *QRF* Model

Running a cross-validation that mirrors our evaluation method requires three things:

* *Temporal and Spatial Splitting:* To accurately replicate the evaluation dataset, the splits must be designed so that no dates or stations in the evaluation fold appear in the training set.

<img src="../../images/eval.png" alt="Experiment Diagram" style="width:50%;" />

* *Data Scaling:* The evaluation dataset is scaled to balance the contribution of each station, so that errors from high-streamflow stations do not overshadow those from low-streamflow stations.
* *Prediction Intervals:* A non-standard evaluation approach based on log-likelihood is used to account for prediction intervals.
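
The splitting constraint in the first point can be illustrated on toy data. This is only a sketch: the station codes, dates and the choice of evaluation fold are invented, and it assumes the evaluation fold is the set of rows that are both at held-out stations and on held-out dates.

```python
import pandas as pd

# Toy data: 3 stations observed on 4 weekly dates (illustrative values only).
dates = ["2020-01-05", "2020-01-12", "2020-01-19", "2020-01-26"]
df = pd.DataFrame({
    "ObsDate": pd.to_datetime(dates * 3),
    "station_code": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
})

# Hypothetical evaluation fold: station C on the last two dates.
eval_stations = ["C"]
eval_dates = pd.to_datetime(["2020-01-19", "2020-01-26"])

in_eval_station = df["station_code"].isin(eval_stations)
in_eval_date = df["ObsDate"].isin(eval_dates)

# Evaluation rows: held-out stations on held-out dates.
eval_mask = in_eval_station & in_eval_date
# Training rows must share neither a station nor a date with the evaluation fold.
train_mask = ~in_eval_station & ~in_eval_date

train, evaluation = df[train_mask], df[eval_mask]
print(len(train), len(evaluation), len(df) - len(train) - len(evaluation))
```

Rows that share only a date or only a station with the evaluation fold end up in neither set, which is the price of keeping the two sets fully disjoint.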



This notebook addresses these requirements with a custom `SpatioTemporalSplit` class for folding, a pipeline that keeps the full dataset available for splitting and scaling, and a custom `custom_log_likelihood` scorer for performance evaluation.

### 1. Data Import and Setup

Import the necessary libraries and set up the environment paths.

In [1]:
# Standard library imports
import os
import sys
from functools import partial

# Third-party imports
import numpy as np
import pandas as pd
from quantile_forest import RandomForestQuantileRegressor
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline


# Append project root to sys.path for local imports
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..', '..', '..', '..')))

# Local application imports
from src.utils.model import get_station_stats, custom_log_likelihood
from src.utils.SpatioTemporalSplit import SpatioTemporalSplit

Define constants:
* `INPUT_DIR` must be the same as the one defined in *00 Preprocessing/Feature Engineering*.
* `MODEL_DIR` is the directory where the exploration models will be saved.

In [3]:
INPUT_DIR = "../../../../data/input/"
MODEL_DIR = "../../../../models/exploration/"

SEED = 42
ALPHA = 0.1
WEEK_TO_PREDICT = 1
DATASET_SPEC = "soil_pca"

# Columns excluded from the predictors: the targets at every horizon and station_code.
# Features rejected during feature selection would also be listed here.
TO_DROP = ["water_flow_week1", "station_code", "water_flow_week2", "water_flow_week3", "water_flow_week4"]

### 2. Data Loading
Load the dataset selected by `DATASET_SPEC` and extract the target for the chosen forecast horizon.

In [4]:
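# Load the dataset
ds_train = pd.read_csv(f"{INPUT_DIR}dataset_{DATASET_SPEC}.csv")
train_data = ds_train.copy()
# Note: reset_index() on the default RangeIndex adds an "index" column,
# which is not listed in TO_DROP
train_data.reset_index(inplace=True)
# Drop duplicated column names, keeping the first occurrence
train_data = train_data.loc[:, ~train_data.columns.duplicated()]
ds_train = ds_train.set_index("ObsDate")
# Target at the selected forecast horizon
y_train = train_data[f"water_flow_week{WEEK_TO_PREDICT}"]
# cv_data keeps every column (dates, stations, targets) for the splitter and the scorer
cv_data = train_data.copy()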


### 3. Model preparation

Compute station statistics (useful for scaling).

In [5]:
station_stats = get_station_stats(
    y_train.to_numpy(),
    train_data["station_code"].to_numpy()
)
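
The return value of `get_station_stats` is not shown in this notebook. As a rough sketch of why per-station statistics help with scaling, the hypothetical `station_stats_sketch` below computes a few summary values per station and uses the station mean to normalise absolute errors. The function name, the returned fields and the choice of the mean as scaling factor are all assumptions.

```python
import numpy as np

def station_stats_sketch(y, station_codes):
    """Hypothetical stand-in: per-station mean, std, min and max of the target."""
    stats = {}
    for code in np.unique(station_codes):
        values = y[station_codes == code]
        stats[code] = {
            "mean": float(values.mean()),
            "std": float(values.std()),
            "min": float(values.min()),
            "max": float(values.max()),
        }
    return stats

# Two illustrative stations whose flows differ by two orders of magnitude.
y = np.array([100.0, 120.0, 110.0, 1.0, 1.2, 1.1])
codes = np.array(["big", "big", "big", "small", "small", "small"])
stats = station_stats_sketch(y, codes)

# The same 10% relative error at every observation.
y_pred = y * 1.1
abs_err = np.abs(y_pred - y)

# Dividing by the station mean puts both stations on the same footing.
station_mean = np.array([stats[c]["mean"] for c in codes])
scaled_err = abs_err / station_mean

for code in ("big", "small"):
    mask = codes == code
    print(code, abs_err[mask].mean(), scaled_err[mask].mean())
```

Without scaling, the mean absolute error of the `big` station is about a hundred times that of the `small` one even though both predictions are off by the same relative amount; after scaling, the two stations contribute equally.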

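Create a Pipeline whose first step selects the predictor columns. This lets `cv_data` keep `station_code` and `ObsDate`, which the splitter and the scorer need, without passing them to the model.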

In [6]:
# Build a new list so that the TO_DROP constant is not modified in place
cols_to_drop = TO_DROP + ["ObsDate"]

predictor_cols = [col for col in cv_data.columns if col not in cols_to_drop]
preprocessor = ColumnTransformer(transformers=[
    ('select', 'passthrough', predictor_cols)
], remainder='drop')

qrf_week1 = RandomForestQuantileRegressor(n_estimators=10, max_depth=10, min_samples_leaf=10)

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', qrf_week1)
])
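
The effect of the `'passthrough'` selection step can be checked on a toy frame. In this sketch the column names `precip` and `temp` are invented, and `LinearRegression` stands in for the quantile forest purely to keep the example light.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Toy frame with bookkeeping columns (date, station) next to two predictors.
toy = pd.DataFrame({
    "ObsDate": pd.date_range("2020-01-05", periods=6, freq="W"),
    "station_code": ["A", "A", "A", "B", "B", "B"],
    "precip": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "temp": [10.0, 11.0, 9.0, 12.0, 8.0, 13.0],
})
target = 2.0 * toy["precip"] + 1.0

# Only the listed predictors reach the model; every other column is dropped.
predictors = ["precip", "temp"]
select = ColumnTransformer([("select", "passthrough", predictors)], remainder="drop")
toy_pipeline = Pipeline([("preprocessor", select), ("model", LinearRegression())])

# The full frame is passed in, so a splitter or scorer could still read
# ObsDate and station_code from it.
toy_pipeline.fit(toy, target)
n_model_features = toy_pipeline.named_steps["model"].n_features_in_
toy_predictions = toy_pipeline.predict(toy)
print(n_model_features, toy_predictions.shape)
```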


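Initialise the log-likelihood scorer. `partial` binds the extra arguments so that the resulting callable can be passed to `cross_val_score` as `scoring`.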

In [7]:
scorer = partial(custom_log_likelihood,
                 cv_data=cv_data,
                 station_stats=station_stats,
                 alpha=ALPHA)
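
`cross_val_score` accepts any callable with the signature `scorer(estimator, X, y)` as `scoring`, which is why the extra keyword arguments are bound with `partial`. The sketch below shows that calling convention only. It does not reproduce the formula of `custom_log_likelihood`, which lives in `src.utils.model`; the hypothetical `interval_scorer_sketch` simply returns the empirical coverage, and `ConstantIntervalModel` is a stub whose `predict(X, quantiles=...)` is assumed to mirror the quantile forest interface.

```python
from functools import partial

import numpy as np

class ConstantIntervalModel:
    """Stub estimator: returns the same lower/median/upper values for every row."""

    def predict(self, X, quantiles):
        base = np.array([8.0, 10.0, 12.0])  # illustrative values, one per quantile
        return np.tile(base, (len(X), 1))

def interval_scorer_sketch(estimator, X, y, alpha=0.1):
    """Hypothetical scorer: empirical coverage of the (1 - alpha) interval."""
    quantiles = [alpha / 2, 0.5, 1 - alpha / 2]
    preds = estimator.predict(X, quantiles=quantiles)
    lower, upper = preds[:, 0], preds[:, -1]
    y = np.asarray(y)
    coverage = np.mean((y >= lower) & (y <= upper))
    interval_size = np.mean(upper - lower)
    print(f"Fold: coverage = {coverage:.3f}, interval size = {interval_size:.3f}")
    return float(coverage)

# Bind the extra argument; the result has the (estimator, X, y) signature.
scorer_sketch = partial(interval_scorer_sketch, alpha=0.2)

X_toy = np.zeros((4, 1))
y_toy = np.array([9.0, 11.0, 13.0, 7.0])
score = scorer_sketch(ConstantIntervalModel(), X_toy, y_toy)
```

Two of the four toy targets fall inside the interval `[8, 12]`, so the sketch returns a coverage of 0.5.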


Initialise the spatio-temporal splitter

In [8]:

cv = SpatioTemporalSplit(
    n_splits=10,
    date_col='ObsDate',
    station_col='station_code',
    temporal_frac=0.75,
    spatial_frac=0.75,
    random_state=SEED
)
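
The implementation of `SpatioTemporalSplit` is not shown here. To be usable as `cv`, a splitter needs a `split` method that yields positional train and test indices and a `get_n_splits` method. The class below is a minimal sketch of that interface, not the project's code: it assumes `temporal_frac` is the share of the earliest dates used for training and `spatial_frac` the share of stations drawn at random for training in each fold.

```python
import numpy as np
import pandas as pd

class SpatioTemporalSplitSketch:
    """Hypothetical splitter: test rows are held-out stations on the latest dates."""

    def __init__(self, n_splits=3, date_col="ObsDate", station_col="station_code",
                 temporal_frac=0.75, spatial_frac=0.75, random_state=None):
        self.n_splits = n_splits
        self.date_col = date_col
        self.station_col = station_col
        self.temporal_frac = temporal_frac
        self.spatial_frac = spatial_frac
        self.random_state = random_state

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

    def split(self, X, y=None, groups=None):
        rng = np.random.default_rng(self.random_state)
        dates = np.unique(X[self.date_col].to_numpy())        # sorted unique dates
        stations = np.unique(X[self.station_col].to_numpy())  # sorted unique stations
        n_train_stations = max(1, int(len(stations) * self.spatial_frac))
        # First date of the test period (assumes temporal_frac < 1).
        cutoff = dates[int(len(dates) * self.temporal_frac)]
        is_early = (X[self.date_col] < cutoff).to_numpy()
        for _ in range(self.n_splits):
            train_stations = rng.choice(stations, size=n_train_stations, replace=False)
            is_train_station = X[self.station_col].isin(train_stations).to_numpy()
            # Rows sharing only a date or only a station with the test set are unused.
            train_idx = np.flatnonzero(is_train_station & is_early)
            test_idx = np.flatnonzero(~is_train_station & ~is_early)
            yield train_idx, test_idx

# Toy data: 4 stations observed on 8 weekly dates.
toy_dates = pd.date_range("2020-01-05", periods=8, freq="W")
toy = pd.DataFrame(
    [(d, s) for s in ["A", "B", "C", "D"] for d in toy_dates],
    columns=["ObsDate", "station_code"],
)

sketch_cv = SpatioTemporalSplitSketch(n_splits=3, random_state=42)
folds = list(sketch_cv.split(toy))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

In this sketch the date cutoff is the same in every fold and only the station draw changes; the real class may vary both.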



### 4. Cross-validation


In [9]:
scores = cross_val_score(
    pipeline,    # our pipeline estimator
    cv_data,     # full data with all columns needed for splitting
    y_train,     # target variable
    cv=cv,       # custom spatio-temporal splitter
    scoring=scorer # custom scorer
)

Fold: coverage = 0.760, interval size = 17.571
Fold: coverage = 0.808, interval size = 16.195
Fold: coverage = 0.764, interval size = 19.182
Fold: coverage = 0.749, interval size = 19.673
Fold: coverage = 0.791, interval size = 18.485
Fold: coverage = 0.856, interval size = 14.124
Fold: coverage = 0.849, interval size = 21.628
Fold: coverage = 0.848, interval size = 24.890
Fold: coverage = 0.798, interval size = 16.925
Fold: coverage = 0.799, interval size = 15.047


In [10]:
print("10-Fold CV performance (per fold log-likelihood):", scores)
print("Average log-likelihood:", np.mean(scores))

10-Fold CV performance (per fold log-likelihood): [1.45247746 1.32777317 1.50590227 1.55308338 1.51081303 1.27841663
 1.64575978 1.53403073 1.45057697 1.4702587 ]
Average log-likelihood: 1.472909210693793
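
The per-fold values printed above can be summarised a little further. The snippet below copies the scores and coverages from the cell outputs and reports their spread; it assumes that `ALPHA` is the miscoverage level, so that the nominal coverage of the interval is `1 - ALPHA`.

```python
import numpy as np

# Values copied from the cell outputs above.
scores = np.array([1.45247746, 1.32777317, 1.50590227, 1.55308338, 1.51081303,
                   1.27841663, 1.64575978, 1.53403073, 1.45057697, 1.4702587])
coverage = np.array([0.760, 0.808, 0.764, 0.749, 0.791,
                     0.856, 0.849, 0.848, 0.798, 0.799])
ALPHA = 0.1

mean_score = scores.mean()
std_score = scores.std(ddof=1)  # sample standard deviation across folds
mean_coverage = coverage.mean()
nominal_coverage = 1 - ALPHA    # assumption: ALPHA is the miscoverage level

print(f"log-likelihood: {mean_score:.3f} +/- {std_score:.3f}")
print(f"mean coverage: {mean_coverage:.3f} (nominal {nominal_coverage:.2f})")
```

Under that assumption, the average coverage across folds (about 0.80) sits below the nominal 0.90, which suggests the intervals produced by this small forest are somewhat too narrow.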
