## Goals: Training the *Final* Models

This notebook trains the final models on the full *baseline_dataset*. These models are then used to produce the final predictions on the evaluation data.

Here, we train a model designed to generalize across water stations in Brazil and France. However, you are not required to follow this approach and may opt to train separate models for different geographic *regions*.

This baseline model training example utilizes all available features, with hyperparameters chosen for quick execution rather than optimization. For hyperparameter tuning and feature selection explorations, refer to the `02_exploration` folder.

> **Note:** This notebook requires outputs from the `00 Preprocessing` notebooks.

<img src="../images/notebook-3.png" alt="Experiment Diagram" style="width:75%; text-align:center;" />

### 1. Data Import and Setup

This section imports the necessary libraries, sets up environment paths, and includes custom utility functions.

In [1]:
import os
import sys

import joblib
import numpy as np
import pandas as pd
import lightgbm as lgb
import tensorflow as tf

from interpret.glassbox import ExplainableBoostingRegressor
from mapie.regression import MapieQuantileRegressor
from quantile_forest import RandomForestQuantileRegressor

sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..', '..', '..')))

from src_new.utils.model import split_dataset, compare_models_per_station, create_deep_model

##### Constants
- **INPUT_DIR**: Directory for input data (same as in "02 - Feature Engineering").
- **MODEL_DIR**: Directory where trained models are saved.
- **DATASET_DIR**: Directory where the Zenodo dataset is unzipped.

##### Model Parameters

- **SEED**: 40 (for reproducibility)
- **NUMBER_OF_WEEK**: 4 (one model is trained per week)

##### FINAL_MODELS

- **mapie**: Combines LightGBM with MAPIE. **MAPIE** (Model Agnostic Prediction Interval Estimator) computes prediction intervals for any regression model using conformal methods.
- **qrf**: Quantile Random Forest (natively produces prediction intervals)
- **ebm**: Explainable Boosting Machine, used as an example of a model that does not natively produce prediction intervals but can be customized to do so.

In [2]:
INPUT_DIR = "../../../data/input/"
MODEL_DIR = "../../../models/"
DATASET_DIR = "../../../dataset/"

SEED = 40
NUMBER_OF_WEEK = 4  # Number of weeks to predict; one model is trained per week

FINAL_MODELS = ["qrf",
                # "mapie",
                #"ebm",
                #"deep_ensemble",
                ]
mapie_enbpi = {}
mapie = {}
qrf = {}
mapie_aci = {}

COLUMNS_TO_DROP = ["water_flow_week1", "water_flow_week2", "water_flow_week3", "water_flow_week4"]


### 2. Data Loading
Load the baseline dataset, index it by `ObsDate`, create the directory where the models are saved, and separate the features from the four weekly targets.

In [3]:
dataset_train = pd.read_csv(f"{INPUT_DIR}dataset_baseline.csv")

dataset_train = dataset_train.set_index("ObsDate")

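# Create the output directory for the final models if it does not exist
os.makedirs(f"{MODEL_DIR}final/", exist_ok=True)

# Features: every column except the four weekly targets
X_train = dataset_train.drop(columns=COLUMNS_TO_DROP)

# One target series per forecast week (week 1 to week 4)
y_train = {}
for i in range(NUMBER_OF_WEEK):
    y_train[i] = dataset_train[f"water_flow_week{i+1}"]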

### 3. Model Training
#### a. LGBM + MAPIE

##### MAPIE Model Training Overview

- **Configuration:**  
  - Sets `ALPHA` (0.1) as the target miscoverage rate, which corresponds to nominal 90% prediction intervals.
  - Defines `TIME_VALIDATION` as a split point for creating a validation set.
  - Configures LightGBM parameters (`LGBM_PARAMS`) for quantile regression.
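To make the roles of `ALPHA` and the `quantile` objective more concrete, here is a small sketch in plain Python. It assumes the usual convention that a central interval with miscoverage `alpha` spans the `alpha/2` and `1 - alpha/2` quantiles, and it shows the pinball (quantile) loss, which is the loss a quantile-regression objective minimizes. The helper names `interval_quantiles` and `pinball_loss` are illustrative and are not part of the notebook or of LightGBM.

```python
def interval_quantiles(alpha):
    """Return (lower, upper) quantile levels of a central interval with miscoverage alpha."""
    return alpha / 2, 1 - alpha / 2


def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss of predictions y_pred for quantile level q."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-prediction costs q per unit, over-prediction costs (1 - q) per unit
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)


ALPHA = 0.1
lower_q, upper_q = interval_quantiles(ALPHA)
print(lower_q, upper_q)  # quantile levels of a nominal 90% interval

# For a high quantile, predicting too low is penalized far more than predicting too high
y_true = [10.0, 12.0, 14.0]
loss_too_low = pinball_loss(y_true, [8.0, 10.0, 12.0], upper_q)
loss_too_high = pinball_loss(y_true, [12.0, 14.0, 16.0], upper_q)
print(loss_too_low, loss_too_high)
```

The asymmetry of the loss is what pushes a model trained at level `q` toward the `q`-th conditional quantile rather than the mean.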




In [4]:
ALPHA = 0.1
TIME_VALIDATION = "2000-01-01 00:00:00"

LGBM_PARAMS = {
    "max_depth": 15,
    "learning_rate": 0.01,
    "n_estimators": 500,
    "colsample_bytree": 0.7,
    "objective": "quantile",
    "alpha": ALPHA
}

- **Data Preparation:**  
  - Splits `dataset_train` into training and validation subsets using `split_dataset`.
  - Removes unnecessary columns from both the training and validation datasets.
  - Extracts target variables for each week (from `water_flow_week1` to `water_flow_week4`).

- **Model Training:**  
  For each week:
  - Initializes a LightGBM regressor with the specified parameters.
  - Wraps it in a `MapieQuantileRegressor` to estimate prediction intervals.
  - Trains the model on the training data and calibrates it using the validation data.
  - Saves the trained model to `MODEL_DIR/final/` under a timestamped filename.
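The calibration step can be illustrated with a minimal sketch of conformalized quantile regression (CQR), the approach that `MapieQuantileRegressor` is based on. This is not MAPIE's implementation: it is a plain-Python illustration on made-up numbers, and the helper names `conformal_quantile` and `cqr_correction` are hypothetical. The idea is to measure on the calibration set how far observations fall outside the raw quantile interval, then widen (or shrink) new intervals by the corresponding empirical quantile.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample (1 - alpha) quantile of the calibration scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-based rank in the sorted scores
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]


def cqr_correction(y_calib, lower_calib, upper_calib, alpha):
    """Return the amount by which raw quantile intervals should be widened."""
    # Positive score: observation outside the raw interval; negative: inside it
    scores = [max(lo - y, y - hi) for y, lo, hi in zip(y_calib, lower_calib, upper_calib)]
    return conformal_quantile(scores, alpha)


# Toy calibration set: observed flows and raw quantile predictions (made-up numbers)
y_calib = [10.0, 12.0, 9.0, 15.0, 11.0, 13.0, 8.0, 14.0, 10.5, 12.5]
lower_calib = [9.0, 10.0, 9.5, 11.0, 10.0, 11.0, 8.5, 12.0, 9.0, 11.0]
upper_calib = [11.0, 13.0, 11.0, 14.0, 12.0, 14.0, 10.0, 15.0, 12.0, 13.0]

correction = cqr_correction(y_calib, lower_calib, upper_calib, alpha=0.1)

# Apply the correction to a new raw interval
raw_lower, raw_upper = 10.0, 13.0
calibrated = (raw_lower - correction, raw_upper + correction)
print(correction, calibrated)
```

A larger calibration set gives a more stable correction, which is one reason to hold out a dedicated validation split instead of calibrating on the training data.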

In [5]:
if "mapie" in FINAL_MODELS:
    print("Training Mapie")

    # val_temporal is returned by split_dataset but is not used in this cell
    train_mapie, val_mapie, val_temporal = split_dataset(dataset_train, 0.75, TIME_VALIDATION)

    # Features: drop the four targets and the station identifier
    X_train_mapie = train_mapie.drop(columns=COLUMNS_TO_DROP + ["station_code"])
    print(len(X_train_mapie.columns))
    X_val = val_mapie.drop(columns=COLUMNS_TO_DROP + ["station_code"])

    # One target series per forecast week, for training and for calibration
    y_train_mapie = {}
    y_val = {}
    for i in range(NUMBER_OF_WEEK):
        y_train_mapie[i] = train_mapie[f"water_flow_week{i+1}"]
        y_val[i] = val_mapie[f"water_flow_week{i+1}"]

    for i in range(NUMBER_OF_WEEK):
        print(f"Training week {i}")
        # Fit on the training split and calibrate on the validation split
        regressor = lgb.LGBMRegressor(**LGBM_PARAMS)
        mapie[i] = MapieQuantileRegressor(estimator=regressor, method="quantile", cv="split", alpha=ALPHA)
        mapie[i].fit(X_train_mapie, y_train_mapie[i], X_calib=X_val, y_calib=y_val[i])

        # Save the model with a timestamp in the filename
        timestamp = pd.Timestamp.now().strftime("%Y-%m-%d_%H-%M-%S")
        model_path = f"{MODEL_DIR}final/mapie_quantile_{timestamp}_week_{i}.pkl"
        joblib.dump(mapie[i], model_path)

#### b. Quantile Random Forest (QRF)

For each week, a `RandomForestQuantileRegressor` is trained on the full training set and saved under a timestamped filename.


In [61]:
# The station identifier is not used as a feature
X_train_qrf = X_train.drop(columns=["station_code"])

if "qrf" in FINAL_MODELS:
    for i in range(NUMBER_OF_WEEK):
        print(f"Training week {i}")
        # Train one RandomForestQuantileRegressor per week; random_state makes runs reproducible
        qrf[i] = RandomForestQuantileRegressor(
            n_estimators=50, max_depth=7, min_samples_leaf=6, n_jobs=-1, random_state=SEED
        )
        qrf[i].fit(X_train_qrf, y_train[i])

        # Save the model with a timestamp in the filename
        timestamp = pd.Timestamp.now().strftime("%Y-%m-%d_%H-%M-%S")
        model_path = f"{MODEL_DIR}final/qrf_quantile_{timestamp}_week_{i}.pkl"
        joblib.dump(qrf[i], model_path)

Training week 0
Training week 1
Training week 2
Training week 3
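
Because every run writes new timestamped files, a later notebook needs a way to pick the most recent model for each week. The sketch below is one possible approach, not part of the original notebook: the helper `latest_model_path` is hypothetical, and it relies on the `%Y-%m-%d_%H-%M-%S` timestamp format used above, which sorts chronologically when sorted as text. The demonstration uses empty placeholder files in a temporary directory rather than real models.

```python
import tempfile
from pathlib import Path


def latest_model_path(model_dir, prefix, week):
    """Return the newest file named '{prefix}_{timestamp}_week_{week}.pkl', or None."""
    candidates = sorted(Path(model_dir).glob(f"{prefix}_*_week_{week}.pkl"))
    return candidates[-1] if candidates else None


# Demonstration with empty placeholder files standing in for saved models
with tempfile.TemporaryDirectory() as tmp:
    for stamp in ["2025-04-25_23-10-00", "2025-04-26_01-07-25"]:
        for week in range(4):
            (Path(tmp) / f"qrf_quantile_{stamp}_week_{week}.pkl").touch()

    newest_name = latest_model_path(tmp, "qrf_quantile", week=2).name
    missing = latest_model_path(tmp, "mapie_quantile", week=0)
    print(newest_name, missing)
```

With real files, the returned path could be passed to `joblib.load` to restore the model saved by `joblib.dump`.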
