## Objective: Final Prediction Computation

This notebook generates the final model predictions and formats them for submission on Codabench.

The evaluation dataset comprises data from 39 stations included in the training set and 13 stations exclusive to the evaluation set.

<img src="../images/notebook-4.png" alt="Experiment Diagram" style="width:75%; text-align:center;" />

### 1. Imports

Start by importing the necessary libraries, configuring the environment paths, and loading the custom utility functions.


In [34]:
import os
import sys
import zipfile

import joblib
import pandas as pd

sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..", "..", "..")))

from src.utils.analysis import create_predict_function, create_quantile_function
from src.utils.model import load_models_auto
from src.utils.model import XGBQuantileRegressor, XGBQRFModel, XGBQRF_SimpleModel

Define the constants:

- _DATASET_DIR_ must be the directory where you unzipped the _zenodo_ dataset.
- _EVAL_DIR_ stores the inference / evaluation data. It must be the same as the one defined in _01 Training > 01 - Modelisation_.
- _FINAL_MODEL_ names the model that is loaded when auto-loading is enabled (`USE_AUTO_SCAN = True`).


In [35]:
ALPHA = 0.1
NUMBER_OF_WEEK = 4
USE_AUTO_SCAN = True  # True: auto-load the most recent model; False: manually load a specific model
FINAL_MODEL = "xgb_qrf_simple"
DATASET_TRANSFORMS = [
    "rm_gnv_st",
    "pca",
    "snow_index",
    # "oh_enc_date",
    "cyc_enc_date",
    "clust_index",
    "clust_hydro",
    # "scl_feat",
    "scl_feat_wl",  # Scale all except waterflow lag
    "rm_wl",  # remove custom generated water_flow_lag 3w & 4w ---> Need USE_CUSTOM_PREPROCESS = True
    "slct_ma",  # keep only specific mobile average 2w or/and 3w or/and 4w ---> Need USE_CUSTOM_PREPROCESS = True
    "lag_slope",  # add an indicator that is calculated between water_flow_lag 1w and 2w
]

PCA_THRESHOLD = 0.98
N_CLUSTER = 10

DATASET_SPEC = "_".join(DATASET_TRANSFORMS)
# DATASET_SPEC = "custom_dataset_4"

if "pca" in DATASET_TRANSFORMS:
    DATASET_SPEC += f"_pct_{PCA_THRESHOLD}"

if "clust_index" in DATASET_TRANSFORMS:
    DATASET_SPEC += f"_geocl_{N_CLUSTER}"

if "clust_hydro" in DATASET_TRANSFORMS:
    DATASET_SPEC += f"_hydcl_{N_CLUSTER}"

# DATASET_SPEC = "custom_dataset_4"

ADJUSTED_BONDS = False

EVAL_DIR = "../../../data/evaluation/"
EVAL_DIR_MINI = "../../../data/evaluation_mini/"
MODEL_DIR = f"../../../models/{DATASET_SPEC}/"

PREDS_DIR = f"{EVAL_DIR}{DATASET_SPEC}/{FINAL_MODEL}/"
COMPUTE_MINICHALLENGE = True

USE_ONLY_BEST_FEATURES = False
BEST_FEATURES = [
        "precipitations_lag_1w_pca_2",
        "precipitations_pca_1",
        "precipitations_pca_2",
        "tempartures_lag_1w_pca_1",
        "tempartures_pca_1",
        "tempartures_pca_2",
        "soil_moisture_pca_1",
        "soil_moisture_pca_2",
        "soil_moisture_pca_3",
        "evaporation_lag_1w_pca_1",
        "evaporation_pca_1",
        "soil_composition_pca_1",
        "soil_composition_pca_4",
        "soil_composition_pca_6",
        "soil_composition_pca_7",
        "soil_composition_pca_9",
        "latitude",
        "longitude",
        "catchment",
        "altitude",
        "water_flow_lag_1w",
        "water_flow_lag_2w",
        "water_flow_ma_4w_lag_1w_gauss",
        "north_hemisphere",
        "snow_index",
        "month_cos"
    ]

print(f"PREDS_DIR: {PREDS_DIR}")
os.makedirs(PREDS_DIR, exist_ok=True)

PREDS_DIR: ../../../data/evaluation/rm_gnv_st_pca_snow_index_cyc_enc_date_clust_index_clust_hydro_scl_feat_wl_rm_wl_slct_ma_lag_slope_pct_0.98_geocl_10_hydcl_10/xgb_qrf_simple/
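
The directory name printed above is derived from the list of transforms plus a few parameter suffixes. The sketch below restates that rule as a standalone function so it can be checked on a short list. The function name `build_dataset_spec` is an illustrative assumption and is not part of `src.utils`.

```python
def build_dataset_spec(transforms, pca_threshold=0.98, n_cluster=10):
    """Join transform tags and append the parameter suffixes used above."""
    spec = "_".join(transforms)
    if "pca" in transforms:
        spec += f"_pct_{pca_threshold}"
    if "clust_index" in transforms:
        spec += f"_geocl_{n_cluster}"
    if "clust_hydro" in transforms:
        spec += f"_hydcl_{n_cluster}"
    return spec


spec = build_dataset_spec(["pca", "snow_index", "clust_index"])
print(spec)  # pca_snow_index_clust_index_pct_0.98_geocl_10
```

Because the model and prediction directories both embed this string, changing a transform or a parameter points the notebook at a different folder.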


### 2. Data and model loading

Load the inference dataset.


In [36]:
# load the dataset
inference_data = pd.read_csv(f"{EVAL_DIR}dataset_{DATASET_SPEC}.csv")
inference_data = inference_data.set_index("ObsDate")
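
Before predicting, it can help to confirm that the loaded frame has the shape the later cells expect. The sketch below runs on a tiny in-memory CSV with made-up values (the column `water_flow_lag_1w` is borrowed from the feature list above); it is a possible sanity check, not part of the original pipeline.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the evaluation CSV (values are made up).
csv_text = """ObsDate,station_code,water_flow_lag_1w
2004-02-01,A001,12.5
2004-04-25,A001,9.1
2004-02-01,B002,30.0
"""

sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["ObsDate"])
sample = sample.set_index("ObsDate")

# The prediction step relies on a "station_code" column being present.
assert "station_code" in sample.columns

# Count rows per station and look for missing feature values.
rows_per_station = sample.groupby("station_code").size()
n_missing = int(sample.isna().sum().sum())
print(rows_per_station.to_dict(), n_missing)
```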

In [37]:

import numpy as np
from quantile_forest import RandomForestQuantileRegressor


class ChainedQrfModel:
    def __init__(self, qrf_params: dict, qrf_features: dict, number_of_weeks: int = 4):
        self.qrf_params = qrf_params
        self.qrf_features = qrf_features
        self.number_of_weeks = number_of_weeks
        self.models = {}
        for i in range(self.number_of_weeks):
            self.models[i] = RandomForestQuantileRegressor(**self.qrf_params[i])

    def fit(self, X, y):
        print("Fitting QRF models")
        X_incremental = {}
        for i in range(self.number_of_weeks):
            print(f"Fitting week {i}")
            if i == 0:
                X_incremental[i] = X.copy(deep=True)
                self.models[i].fit(X_incremental[i][self.qrf_features[i]], y[i])
            else:
                # Use the previous week's predictions as features
                y_pred = self.models[i - 1].predict(X_incremental[i - 1][self.qrf_features[i - 1]], quantiles=[0.5])
                y_pred = np.reshape(y_pred, (-1, 1))
                
                X_incremental[i] = X_incremental[i - 1].copy(deep=True)
                print(f"week_{i}_pred")
                X_incremental[i][f"week_{i}_pred"] = y_pred

                if i == 1:
                    X_incremental[i][f"week_{i-1}_{i}_slope"] = (
                            X_incremental[i][f"week_{i}_pred"] - X_incremental[i]["water_flow_lag_1w"]
                        ) / X_incremental[i]["water_flow_lag_1w"].replace(0, np.nan)
                else:
                    X_incremental[i][f"week_{i-1}_{i}_slope"] = (
                            X_incremental[i][f"week_{i}_pred"] - X_incremental[i][f"week_{i-1}_pred"]
                        ) / X_incremental[i][f"week_{i-1}_pred"].replace(0, np.nan)

                self.models[i].fit(X_incremental[i][self.qrf_features[i]], y[i])

    def predict(self, X, quantiles=[0.05,0.5,0.95]):
        print("Predicting QRF models")
        predictions = {}
        X_incremental = {}
        for i in range(self.number_of_weeks):
            if i == 0:
                X_incremental[i] = X.copy(deep=True)
                predictions[i] = self.models[i].predict(X_incremental[i][self.qrf_features[i]], quantiles=quantiles)
            else:
                # Use the previous week's predictions as features
                y_pred = predictions[i - 1][:,1]
                y_pred = np.reshape(y_pred, (-1, 1))
                X_incremental[i] = X_incremental[i - 1].copy(deep=True)
                X_incremental[i][f"week_{i}_pred"] = y_pred

                if i == 1:
                    X_incremental[i][f"week_{i-1}_{i}_slope"] = (
                            X_incremental[i][f"week_{i}_pred"] - X_incremental[i]["water_flow_lag_1w"]
                        ) / X_incremental[i]["water_flow_lag_1w"].replace(0, np.nan)
                else:
                    X_incremental[i][f"week_{i-1}_{i}_slope"] = (
                            X_incremental[i][f"week_{i}_pred"] - X_incremental[i][f"week_{i-1}_pred"]
                        ) / X_incremental[i][f"week_{i-1}_pred"].replace(0, np.nan)
                
                predictions[i] = self.models[i].predict(X_incremental[i][self.qrf_features[i]], quantiles=quantiles)

        return predictions

(1390, 42)
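
The chained model feeds each week's median prediction to the next week, together with a relative slope whose denominator is protected against zeros. The sketch below isolates that slope computation on made-up numbers; the helper name `relative_slope` is an assumption introduced for illustration.

```python
import numpy as np
import pandas as pd


def relative_slope(current: pd.Series, previous: pd.Series) -> pd.Series:
    """Relative change (current - previous) / previous, NaN where previous is 0."""
    return (current - previous) / previous.replace(0, np.nan)


previous = pd.Series([10.0, 0.0, 4.0])
current = pd.Series([12.0, 5.0, 3.0])
slope = relative_slope(current, previous)
print(slope.tolist())  # [0.2, nan, -0.25]
```

Replacing zeros with `NaN` avoids infinite values, but it leaves missing entries in the feature, so the downstream model has to tolerate them or they have to be imputed.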

Load the final models.


In [38]:
# Load models based on conditions
final_models = []
if FINAL_MODEL == "mapie":
    if USE_AUTO_SCAN:
        final_models = load_models_auto("mapie_quantile", f"{MODEL_DIR}final/")
    else:
        final_models.append(
            joblib.load(
                f"{MODEL_DIR}final/mapie_quantile_2025-01-17_15-15-04_week0.pkl"
            )
        )
        final_models.append(
            joblib.load(
                f"{MODEL_DIR}final/mapie_quantile_2025-01-17_15-15-11_week1.pkl"
            )
        )
        final_models.append(
            joblib.load(
                f"{MODEL_DIR}final/mapie_quantile_2025-01-17_15-15-17_week2.pkl"
            )
        )
        final_models.append(
            joblib.load(
                f"{MODEL_DIR}final/mapie_quantile_2025-01-17_15-15-17_week3.pkl"
            )
        )
elif FINAL_MODEL == "qrf":

    if USE_AUTO_SCAN:
        final_models = load_models_auto("qrf_quantile", f"{MODEL_DIR}final/")
    else:
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-04_week0.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-11_week1.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-17_week2.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-17_week3.pkl")
        )
elif FINAL_MODEL == "gpr":
    selected_kernel = [
        "rbf",
        # "",
        # "",
    ]
    if USE_AUTO_SCAN:
        # Single quotes inside the f-string keep it valid before Python 3.12.
        final_models = load_models_auto(
            f"gpr_quantile_{''.join(selected_kernel)}", f"{MODEL_DIR}final/"
        )
    else:
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-04_week0.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-11_week1.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-17_week2.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-17_week3.pkl")
        )

elif FINAL_MODEL == "gbr":

    if USE_AUTO_SCAN:
        final_models = load_models_auto("gbr_quantile", f"{MODEL_DIR}final/")
    else:
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-04_week0.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-11_week1.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-17_week2.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-17_week3.pkl")
        )
elif FINAL_MODEL == "qrf_voting":

    if USE_AUTO_SCAN:
        final_models = load_models_auto("qrf_voting_quantile", f"{MODEL_DIR}final/")
    else:
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-04_week0.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-11_week1.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-17_week2.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-17_week3.pkl")
        )
elif FINAL_MODEL == "qrf_bagging":

    if USE_AUTO_SCAN:
        final_models = load_models_auto("qrf_bagging_quantile", f"{MODEL_DIR}final/")
    else:
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-04_week0.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-11_week1.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-17_week2.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-17_week3.pkl")
        )
elif FINAL_MODEL == "lgbm":

    if USE_AUTO_SCAN:
        models_low = load_models_auto("lgbm_quantile_q0.05", f"{MODEL_DIR}final/")
        models_med = load_models_auto("lgbm_quantile_q0.5", f"{MODEL_DIR}final/")
        models_high = load_models_auto("lgbm_quantile_q0.95", f"{MODEL_DIR}final/")
        final_models = [[] for _ in range(NUMBER_OF_WEEK)]
        final_models[0] = [models_low[0], models_med[0], models_high[0]]
        final_models[1] = [models_low[1], models_med[1], models_high[1]]
        final_models[2] = [models_low[2], models_med[2], models_high[2]]
        final_models[3] = [models_low[3], models_med[3], models_high[3]]
    else:
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-04_week0.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-11_week1.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-17_week2.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-17_week3.pkl")
        )
elif FINAL_MODEL == "ebm_ensemble":
    print("Loading EBM Ensemble")
    if USE_AUTO_SCAN:
        final_models = load_models_auto("ebm_ensemble", f"{MODEL_DIR}final/")
    else:
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/ebm_ensemble_2025-01-17_15-15-04_week0.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/ebm_ensemble_2025-01-17_15-15-11_week1.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/ebm_ensemble_2025-01-17_15-15-17_week2.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/ebm_ensemble_2025-01-17_15-15-17_week3.pkl")
        )
elif FINAL_MODEL == "deep_ensemble":
    if USE_AUTO_SCAN:
        final_models = load_models_auto("deep_ensemble", f"{MODEL_DIR}final/")
    else:
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/deep_ensemble_2025-01-17_15-15-04_week0.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/deep_ensemble_2025-01-17_15-15-11_week1.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/deep_ensemble_2025-01-17_15-15-17_week2.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/deep_ensemble_2025-01-17_15-15-17_week3.pkl")
        )
elif FINAL_MODEL == "xgb":
    if USE_AUTO_SCAN:
        final_models = load_models_auto("xgb", f"{MODEL_DIR}final/")
elif FINAL_MODEL == "xgb_qrf":
    if USE_AUTO_SCAN:
        final_models = load_models_auto("xgb_qrf", f"{MODEL_DIR}final/")
elif FINAL_MODEL == "xgb_qrf_simple":
    if USE_AUTO_SCAN:
        final_models = load_models_auto("xgb_qrf_simple", f"{MODEL_DIR}final/")
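
The manual branches above hard-code one timestamped file per week. As an alternative, the sketch below picks the most recent file for each week from a directory, assuming file names follow the pattern `<prefix>_<YYYY-MM-DD_HH-MM-SS>_week<i>.pkl` seen in those branches. The helper `find_latest_weekly_files` is hypothetical; how `load_models_auto` actually selects files is not shown in this notebook.

```python
import re
from pathlib import Path


def find_latest_weekly_files(prefix, directory, number_of_weeks=4):
    """Return, per week, the file with the latest timestamp in its name.

    Assumes names like '<prefix>_<YYYY-MM-DD_HH-MM-SS>_week<i>.pkl'.
    """
    pattern = re.compile(
        rf"^{re.escape(prefix)}_(\d{{4}}-\d{{2}}-\d{{2}}_\d{{2}}-\d{{2}}-\d{{2}})_week(\d+)\.pkl$"
    )
    latest = {}
    for path in Path(directory).glob("*.pkl"):
        match = pattern.match(path.name)
        if match is None:
            continue
        stamp, week = match.group(1), int(match.group(2))
        # Zero-padded timestamps sort chronologically as plain strings.
        if week not in latest or stamp > latest[week][0]:
            latest[week] = (stamp, path)
    missing = [w for w in range(number_of_weeks) if w not in latest]
    if missing:
        raise FileNotFoundError(f"No model file found for weeks {missing}")
    return [latest[w][1] for w in range(number_of_weeks)]
```

The returned paths could then be passed to `joblib.load` one by one, for example `[joblib.load(p) for p in find_latest_weekly_files("qrf_quantile", f"{MODEL_DIR}final/")]`.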

In [39]:
final_models.models

[<src.utils.model.XGBQRF_SimpleModel at 0x29b7d2690>,
 <src.utils.model.XGBQRF_SimpleModel at 0x29b80a600>,
 <src.utils.model.XGBQRF_SimpleModel at 0x29fefff20>,
 <src.utils.model.XGBQRF_SimpleModel at 0x29b808290>]

In [10]:
if "chained_qrf" in FINAL_MODEL:
    for i in range(NUMBER_OF_WEEK):
        print(final_models.models[i].feature_names_in_)

['precipitations_lag_1w_pca_1' 'precipitations_lag_1w_pca_2'
 'precipitations_pca_1' 'precipitations_pca_2' 'tempartures_lag_1w_pca_1'
 'tempartures_pca_1' 'tempartures_pca_2' 'soil_moisture_pca_1'
 'soil_moisture_pca_2' 'soil_moisture_pca_3' 'evaporation_lag_1w_pca_1'
 'evaporation_pca_1' 'soil_composition_pca_1' 'soil_composition_pca_4'
 'soil_composition_pca_6' 'soil_composition_pca_7'
 'soil_composition_pca_9' 'latitude' 'longitude' 'catchment' 'altitude'
 'water_flow_lag_1w' 'water_flow_lag_2w' 'water_flow_ma_4w_lag_1w_gauss'
 'north_hemisphere' 'snow_index' 'month_sin' 'month_cos' 'season_sin']
['precipitations_lag_1w_pca_1' 'precipitations_lag_1w_pca_2'
 'precipitations_pca_1' 'precipitations_pca_2' 'tempartures_lag_1w_pca_1'
 'tempartures_pca_1' 'tempartures_pca_2' 'soil_moisture_pca_1'
 'soil_moisture_pca_2' 'soil_moisture_pca_3' 'evaporation_lag_1w_pca_1'
 'evaporation_pca_1' 'soil_composition_pca_1' 'soil_composition_pca_3'
 'soil_composition_pca_4' 'soil_composition_pca_5'


### 3. Predictions computation

The evaluation data includes a spatio-temporal split and a temporal-only split.

<img src="../images/eval.png" alt="Experiment Diagram" style="width:50%;" />
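
One way to read the two splits, assuming that "temporal-only" means stations already seen during training and "spatio-temporal" means stations that appear only in the evaluation set, is to tag each evaluation row by its station. The station codes below are made up for illustration.

```python
import pandas as pd

# Hypothetical station codes; the real lists come from the training data.
train_stations = {"A001", "B002"}

eval_rows = pd.DataFrame(
    {"station_code": ["A001", "C003", "B002", "C003"]},
    index=pd.to_datetime(["2004-02-01", "2004-02-01", "2004-04-25", "2004-04-25"]),
)

# Stations seen in training belong to the temporal-only split;
# stations never seen in training belong to the spatio-temporal split.
seen = eval_rows["station_code"].isin(train_stations)
eval_rows["split"] = seen.map({True: "temporal", False: "spatio-temporal"})
print(eval_rows["split"].value_counts().to_dict())
```

Tagging rows this way makes it possible to inspect predictions separately for known and unknown stations.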


In [40]:
import numpy as np

predictions = inference_data[["station_code"]].copy()
y_pred_test_quantile = {}
y_pred_test = {}
X_test = inference_data.drop(columns=["station_code"])

if USE_ONLY_BEST_FEATURES:
    X_test = X_test[BEST_FEATURES]

if FINAL_MODEL == "chained_qrf":
    # y_pred_test = final_models.predict(X_test, quantiles=[0.04, 0.5, 0.96])
    y_pred_test = final_models.predict(X_test)

    for i in range(NUMBER_OF_WEEK):
        predictions[f"week_{i}_pred"] = y_pred_test[i][:, 1]
        predictions[f"week_{i}_sup"] = y_pred_test[i][:, 2]
        predictions[f"week_{i}_inf"] = y_pred_test[i][:, 0]
else:
    for i in range(NUMBER_OF_WEEK):

        if FINAL_MODEL == "qrf":
            # reorder the columns
            X_test = X_test[final_models[0].feature_names_in_]

        # if FINAL_MODEL == "xgb":
        #     X_test = (
        #         X_test.drop(columns=["north_hemisphere"])
        #         if "north_hemisphere" in X_test.columns
        #         else X_test
        #     )
        print(FINAL_MODEL)
        predict_adjusted = create_predict_function(final_models, i, FINAL_MODEL)
        quantile_adjusted = create_quantile_function(final_models, i, FINAL_MODEL, ALPHA)

        y_pred_test[i] = predict_adjusted(X_test)
        y_pred_test_quantile[i] = quantile_adjusted(X_test)

        if FINAL_MODEL == "lgbm":
            y_pred_test_quantile[i][y_pred_test_quantile[i] < 0] = 0
            y_pred_test[i][y_pred_test[i] < 0] = 0

        if FINAL_MODEL == "xgb":
            y_pred_test_quantile[i][:, 0] *= 0.98
            y_pred_test_quantile[i][:, 1] *= 1.02

    for i in range(NUMBER_OF_WEEK):
        predictions[f"week_{i}_pred"] = y_pred_test[i]
        predictions[f"week_{i}_sup"] = y_pred_test_quantile[i][:, 1]
        predictions[f"week_{i}_inf"] = y_pred_test_quantile[i][:, 0]

xgb_qrf_simple
model : xgb_qrf_simple
xgb_qrf_simple
model : xgb_qrf_simple
xgb_qrf_simple
model : xgb_qrf_simple
xgb_qrf_simple
model : xgb_qrf_simple
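
`ALPHA = 0.1` is passed to `create_quantile_function`, whose internals are not shown here. Assuming it follows the usual convention for a central prediction interval, the interval covers a share of `1 - ALPHA` and its bounds are the `ALPHA / 2` and `1 - ALPHA / 2` quantiles, which matches the 0.05 and 0.95 levels used elsewhere in this notebook. The helper `interval_quantiles` is an illustrative assumption.

```python
def interval_quantiles(alpha):
    """Lower and upper quantile levels of a central (1 - alpha) interval."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return alpha / 2, 1 - alpha / 2


lower, upper = interval_quantiles(0.1)
print(lower, upper)  # 0.05 0.95
```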


### 4. Saving the predictions


Save the predictions as a CSV file.

> The file must be named `predictions.csv`


In [41]:
predictions[["week_1_inf", "week_1_pred", "week_1_sup"]]

| ObsDate | week_1_inf | week_1_pred | week_1_sup |
|---|---|---|---|
| 2004-02-01 | 16.675714 | 35.680386 | 99.411428 |
| 2004-04-25 | 8.700643 | 23.240000 | 53.699285 |
| 2004-07-18 | 6.216643 | 14.028571 | 41.044285 |
| 2004-10-10 | 1.461500 | 4.097143 | 18.305785 |
| 2005-01-02 | 7.736000 | 18.548214 | 67.758856 |
| ... | ... | ... | ... |
| 2008-11-30 | 598.121157 | 1420.464966 | 3225.717523 |
| 2009-02-22 | 646.462763 | 894.302612 | 1652.615010 |
| 2009-05-17 | 431.919245 | 657.728271 | 832.463786 |
| 2009-08-09 | 375.655248 | 427.950897 | 499.771205 |


In [42]:
test = predictions["week_1_pred"] - predictions["week_1_inf"]
test.min()

np.float64(0.02078571356832981)
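
The check above looks at one bound for one week. A possible generalisation, sketched below on made-up values, counts the rows where the ordering `inf <= pred <= sup` fails for every forecast week. The function name `count_interval_violations` is an assumption introduced for illustration.

```python
import pandas as pd


def count_interval_violations(preds: pd.DataFrame, number_of_weeks: int = 4) -> dict:
    """Count rows per week where inf <= pred <= sup does not hold."""
    violations = {}
    for i in range(number_of_weeks):
        inf = preds[f"week_{i}_inf"]
        mid = preds[f"week_{i}_pred"]
        sup = preds[f"week_{i}_sup"]
        violations[i] = int(((inf > mid) | (mid > sup)).sum())
    return violations


# Made-up values for a single forecast week.
demo = pd.DataFrame(
    {
        "week_0_inf": [1.0, 5.0],
        "week_0_pred": [2.0, 4.0],  # second row: pred below inf
        "week_0_sup": [3.0, 6.0],
    }
)
print(count_interval_violations(demo, number_of_weeks=1))  # {0: 1}
```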

In [43]:
# save the predictions to a csv file
predictions["ObsDate"] = X_test.index
predictions.to_csv(f"{PREDS_DIR}predictions.csv", index=False)

Compress the submission file.

> The file needs to be compressed for Codabench.


In [44]:
# Create a ZIP file containing predictions.csv
with zipfile.ZipFile(f"{PREDS_DIR}predictions.zip", "w", zipfile.ZIP_DEFLATED) as zipf:
    zipf.write(f"{PREDS_DIR}predictions.csv", "predictions.csv")

You are now ready to submit: go to Codabench and upload the generated zip file in My Submissions > Phase 1.

You don't have to use this notebook to submit, but the file must include the following columns:

- station_code: Identification code of the station.
- ObsDate: Date of the prediction.
- For every forecast week i from 0 to 3:
  - week_i_pred
  - week_i_inf
  - week_i_sup
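
The column list above can be checked programmatically before saving. The sketch below builds the required names for a given number of weeks and reports the ones that are absent; `missing_submission_columns` is a hypothetical helper, not part of the challenge tooling.

```python
import pandas as pd


def missing_submission_columns(df: pd.DataFrame, number_of_weeks: int = 4) -> list:
    """Return the required submission columns that are absent from df."""
    required = ["station_code", "ObsDate"]
    for i in range(number_of_weeks):
        required += [f"week_{i}_pred", f"week_{i}_inf", f"week_{i}_sup"]
    return [col for col in required if col not in df.columns]


demo = pd.DataFrame(columns=["station_code", "ObsDate", "week_0_pred", "week_0_inf"])
print(missing_submission_columns(demo, number_of_weeks=1))  # ['week_0_sup']
```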

Save the dataset as a CSV file named predictions.csv.

> The file must be named predictions.csv, but the .zip file can have any name.

Compress the CSV file into a .zip archive.

> You cannot submit an uncompressed file. Ensure that the software you use does not create a subfolder inside the archive.
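
To confirm that an archive has no subfolder, its member names can be inspected with the standard `zipfile` module. The sketch below builds two small archives in memory, one flat and one nested, and checks each; the function name `archive_has_flat_predictions` is an assumption introduced for illustration.

```python
import io
import zipfile


def archive_has_flat_predictions(zip_bytes: bytes) -> bool:
    """True if 'predictions.csv' sits at the archive root, not in a subfolder."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return "predictions.csv" in archive.namelist()


# Build two small archives in memory: one flat, one with a subfolder.
flat, nested = io.BytesIO(), io.BytesIO()
with zipfile.ZipFile(flat, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("predictions.csv", "station_code,ObsDate\n")
with zipfile.ZipFile(nested, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("submission/predictions.csv", "station_code,ObsDate\n")

print(archive_has_flat_predictions(flat.getvalue()))    # True
print(archive_has_flat_predictions(nested.getvalue()))  # False
```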

Submit your file in [Codabench](https://www.codabench.org/competitions/4335):

> My Submissions > Phase 1 (keep all the tasks selected):

<img src="../images/submissions.png" alt="Experiment Diagram" style="width:75%; text-align:center;" />
