## Objective: Final Prediction Computation

This notebook generates the final model predictions and formats them for submission on Codabench.

The evaluation dataset comprises data from 39 stations included in the training set and 13 stations exclusive to the evaluation set.

<img src="../images/notebook-4.png" alt="Experiment Diagram" style="width:75%; text-align:center;" />

### 1. Imports

Start by importing the necessary libraries, configuring environment paths, and loading custom utility functions.


In [46]:
%load_ext autoreload
%autoreload 2

import os
import sys
import zipfile

import joblib
import pandas as pd

sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..", "..", "..")))

from src.utils.analysis import create_predict_function, create_quantile_function
from src.utils.model import (
    ChainedQrfModel,
    SpecialistQrfModel,
    XGBQRF_SimpleModel,
    XGBQRFModel,
    compare_models_per_station,
    load_models_auto,
    split_dataset,
)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Define the constants:

- _DATASET_DIR_ must be the directory where you unzipped the _zenodo_ dataset.
- _EVAL_DIR_ stores the inference / evaluation data; it must be the same as the one defined in _01 Training > 01 - Modelisation_.
- _FINAL_MODEL_ selects the model that will be loaded when auto-loading is used.


In [47]:
ALPHA = 0.1
NUMBER_OF_WEEK = 4
USE_AUTO_SCAN = True  # Toggle this to switch between auto-loading the latest model and manually loading a specific model
FINAL_MODEL = "qrf_voting"
DATASET_TRANSFORMS = [
    "rm_gnv_st",
    "pca",
    "snow_index",
    # "oh_enc_date",
    "cyc_enc_date",
    "clust_index",
    "scl_feat",
    # "scl_feat_wl",  # Scale all except waterflow lag
    "scl_catch",
]

PCA_THRESHOLD = 0.98
N_CLUSTER = 10

DATASET_SPEC = "_".join(DATASET_TRANSFORMS)

if "pca" in DATASET_TRANSFORMS:
    DATASET_SPEC += f"_pct_{PCA_THRESHOLD}"

if "clust_index" in DATASET_TRANSFORMS:
    DATASET_SPEC += f"_geocl_{N_CLUSTER}"

if "clust_hydro" in DATASET_TRANSFORMS:
    DATASET_SPEC += f"_hydcl_{N_CLUSTER}"

# Override the generated spec with the name of the dataset actually used here
DATASET_SPEC = "dataset_custom_21"

ADJUSTED_BONDS = False

EVAL_DIR = "../../../data/evaluation/"
EVAL_DIR_MINI = "../../../data/evaluation_mini/"
MODEL_DIR = f"../../../models/{DATASET_SPEC}/"

PREDS_DIR = f"{EVAL_DIR}{DATASET_SPEC}/{FINAL_MODEL}/"
COMPUTE_MINICHALLENGE = True

USE_ONLY_BEST_FEATURES = False
BEST_FEATURES = [
        "precipitations_lag_1w_pca_2",
        "precipitations_pca_1",
        "precipitations_pca_2",
        "tempartures_lag_1w_pca_1",
        "tempartures_pca_1",
        "soil_moisture_pca_1",
        "soil_moisture_pca_2",
        "soil_moisture_pca_3",
        "evaporation_lag_1w_pca_1",
        "evaporation_pca_1",
        "soil_composition_pca_1",
        "soil_composition_pca_4",
        "soil_composition_pca_6",
        "soil_composition_pca_7",
        "latitude",
        "longitude",
        "catchment",
        "altitude",
        "water_flow_lag_1w",
        "water_flow_lag_2w",
        "water_flow_ma_4w_lag_1w_gauss",
        "north_hemisphere",
        "snow_index",
        "month_cos"
    ]

print(f"PREDS_DIR: {PREDS_DIR}")
os.makedirs(PREDS_DIR, exist_ok=True)

PREDS_DIR: ../../../data/evaluation/dataset_custom_21/qrf_voting/
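
The configuration cell derives `DATASET_SPEC` from the transform tags before overriding it with a fixed name. As a minimal sketch, the same naming logic can be wrapped in a small helper; `build_dataset_spec` is a hypothetical name introduced here for illustration and is not part of `src.utils`.

```python
def build_dataset_spec(transforms, pca_threshold=0.98, n_cluster=10):
    """Join transform tags and append the parameter suffixes used in the cell above."""
    spec = "_".join(transforms)
    if "pca" in transforms:
        spec += f"_pct_{pca_threshold}"
    if "clust_index" in transforms:
        spec += f"_geocl_{n_cluster}"
    if "clust_hydro" in transforms:
        spec += f"_hydcl_{n_cluster}"
    return spec


# Toy list of tags (a subset of DATASET_TRANSFORMS), used only to show the output shape
example_spec = build_dataset_spec(["pca", "snow_index", "clust_index"])
print(example_spec)
```

Because the spec is also used to build `MODEL_DIR` and `PREDS_DIR`, any change to the transform list or its parameters points the notebook at a different set of files.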


### 2. Data and models loading

Loading of the inference dataset.


In [48]:
# load the dataset
inference_data = pd.read_csv(f"{EVAL_DIR}{DATASET_SPEC}.csv")
inference_data = inference_data.set_index("ObsDate")

if COMPUTE_MINICHALLENGE:
    inference_data_mini = pd.read_csv(f"{EVAL_DIR_MINI}{DATASET_SPEC}.csv")
    inference_data_mini = inference_data_mini.set_index("ObsDate")
    inference_data = pd.concat([inference_data, inference_data_mini], axis=0)
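
The load-and-concatenate pattern above can be tried on toy data. The sketch below uses in-memory CSV text in place of the real evaluation files; the column values are invented for illustration. It also shows that the `ObsDate` index is not unique after concatenation, since several stations share the same dates.

```python
import io

import pandas as pd

# Toy CSV contents standing in for the evaluation and mini-challenge files
eval_csv = io.StringIO(
    "ObsDate,station_code,water_flow_lag_1w\n"
    "2004-02-01,A,10.0\n"
    "2004-04-25,A,12.5\n"
)
mini_csv = io.StringIO(
    "ObsDate,station_code,water_flow_lag_1w\n"
    "2004-02-01,B,3.0\n"
)

eval_df = pd.read_csv(eval_csv).set_index("ObsDate")
mini_df = pd.read_csv(mini_csv).set_index("ObsDate")

# Stack the two datasets row-wise, as the notebook does when COMPUTE_MINICHALLENGE is True
combined = pd.concat([eval_df, mini_df], axis=0)

print(combined.shape)
print(combined.index.is_unique)
```

A row is therefore identified by the pair (`station_code`, `ObsDate`), not by the date alone.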

Loading of the final models.


In [49]:
# Load models based on conditions
final_models = []
if FINAL_MODEL == "mapie":
    if USE_AUTO_SCAN:
        final_models = load_models_auto("mapie_quantile", f"{MODEL_DIR}final/")
    else:
        final_models.append(
            joblib.load(
                f"{MODEL_DIR}final/mapie_quantile_2025-01-17_15-15-04_week0.pkl"
            )
        )
        final_models.append(
            joblib.load(
                f"{MODEL_DIR}final/mapie_quantile_2025-01-17_15-15-11_week1.pkl"
            )
        )
        final_models.append(
            joblib.load(
                f"{MODEL_DIR}final/mapie_quantile_2025-01-17_15-15-17_week2.pkl"
            )
        )
        final_models.append(
            joblib.load(
                f"{MODEL_DIR}final/mapie_quantile_2025-01-17_15-15-17_week3.pkl"
            )
        )
elif FINAL_MODEL == "qrf":

    if USE_AUTO_SCAN:
        final_models = load_models_auto("qrf_quantile", f"{MODEL_DIR}final/")
    else:
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-04_week0.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-11_week1.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-17_week2.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-17_week3.pkl")
        )
elif FINAL_MODEL == "gpr":
    selected_kernel = [
        "rbf",
        # "",
        # "",
    ]
    if USE_AUTO_SCAN:
        final_models = load_models_auto(
            f"gpr_quantile_{''.join(selected_kernel)}", f"{MODEL_DIR}final/"
        )
    else:
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-04_week0.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-11_week1.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-17_week2.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-17_week3.pkl")
        )

elif FINAL_MODEL == "gbr":

    if USE_AUTO_SCAN:
        final_models = load_models_auto("gbr_quantile", f"{MODEL_DIR}final/")
    else:
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-04_week0.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-11_week1.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-17_week2.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-17_week3.pkl")
        )
elif FINAL_MODEL == "qrf_voting":

    if USE_AUTO_SCAN:
        final_models = load_models_auto("qrf_voting_quantile", f"{MODEL_DIR}final/")
    else:
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-04_week0.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-11_week1.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-17_week2.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-17_week3.pkl")
        )
elif FINAL_MODEL == "qrf_bagging":

    if USE_AUTO_SCAN:
        final_models = load_models_auto("qrf_bagging_quantile", f"{MODEL_DIR}final/")
    else:
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-04_week0.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-11_week1.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-17_week2.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-17_week3.pkl")
        )
elif FINAL_MODEL == "lgbm":

    if USE_AUTO_SCAN:
        models_low = load_models_auto("lgbm_quantile_q0.05", f"{MODEL_DIR}final/")
        models_med = load_models_auto("lgbm_quantile_q0.5", f"{MODEL_DIR}final/")
        models_high = load_models_auto("lgbm_quantile_q0.95", f"{MODEL_DIR}final/")
        final_models = [[] for _ in range(NUMBER_OF_WEEK)]
        final_models[0] = [models_low[0], models_med[0], models_high[0]]
        final_models[1] = [models_low[1], models_med[1], models_high[1]]
        final_models[2] = [models_low[2], models_med[2], models_high[2]]
        final_models[3] = [models_low[3], models_med[3], models_high[3]]
    else:
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-04_week0.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-11_week1.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-17_week2.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/qrf_quantile_2025-01-17_15-15-17_week3.pkl")
        )
elif FINAL_MODEL == "ebm_ensemble":
    print("Loading EBM Ensemble")
    if USE_AUTO_SCAN:
        final_models = load_models_auto("ebm_ensemble", f"{MODEL_DIR}final/")
    else:
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/ebm_ensemble_2025-01-17_15-15-04_week0.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/ebm_ensemble_2025-01-17_15-15-11_week1.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/ebm_ensemble_2025-01-17_15-15-17_week2.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/ebm_ensemble_2025-01-17_15-15-17_week3.pkl")
        )
elif FINAL_MODEL == "deep_ensemble":
    if USE_AUTO_SCAN:
        final_models = load_models_auto("deep_ensemble", f"{MODEL_DIR}final/")
    else:
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/deep_ensemble_2025-01-17_15-15-04_week0.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/deep_ensemble_2025-01-17_15-15-11_week1.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/deep_ensemble_2025-01-17_15-15-17_week2.pkl")
        )
        final_models.append(
            joblib.load(f"{MODEL_DIR}final/deep_ensemble_2025-01-17_15-15-17_week3.pkl")
        )
elif FINAL_MODEL == "xgb":
    if USE_AUTO_SCAN:
        final_models = load_models_auto("xgb", f"{MODEL_DIR}final/")
elif FINAL_MODEL == "xgb_qrf":
    if USE_AUTO_SCAN:
        final_models = load_models_auto("xgb_qrf", f"{MODEL_DIR}final/")
elif FINAL_MODEL == "chained_qrf":
    if USE_AUTO_SCAN:
        final_models = joblib.load(f"{MODEL_DIR}final/chained_qrf_quantile_2025-04-18_11-55-29_4weeks.pkl")
elif FINAL_MODEL == "specialized_qrf":
    if USE_AUTO_SCAN:
        final_models = joblib.load(f"{MODEL_DIR}final/specialized_qrf_quantile_2025-04-18_14-08-20_4weeks.pkl")


In [50]:
final_models

[<src.utils.custom_models.VotingRandomForestQuantileRegressor at 0x1cfa6905d60>,
 <src.utils.custom_models.VotingRandomForestQuantileRegressor at 0x1cfa6b43fe0>,
 <src.utils.custom_models.VotingRandomForestQuantileRegressor at 0x1cfa6b47fb0>,
 <src.utils.custom_models.VotingRandomForestQuantileRegressor at 0x1cfa6b4bf80>]

In [51]:
if "chained_qrf" in FINAL_MODEL:
    for i in range(NUMBER_OF_WEEK):
        print(final_models.models[i].feature_names_in_)
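
The manual branches above hard-code one timestamped file per week. As a rough sketch of how the most recent file per week could be picked by name, the helper below assumes the layout `<prefix>_<YYYY-MM-DD>_<HH-MM-SS>_week<N>.pkl` seen in the paths above. `latest_model_paths` is a hypothetical function written for this illustration; it does not claim to reproduce what `load_models_auto` does internally.

```python
import re
import tempfile
from pathlib import Path

# Assumed filename layout: <prefix>_<YYYY-MM-DD>_<HH-MM-SS>_week<N>.pkl
PATTERN = re.compile(
    r"^(?P<prefix>.+)_(?P<stamp>\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2})_week(?P<week>\d+)\.pkl$"
)


def latest_model_paths(prefix, directory):
    """Return {week: path}, keeping the most recent timestamp for each week."""
    latest = {}
    for path in Path(directory).glob("*.pkl"):
        match = PATTERN.match(path.name)
        if match is None or match["prefix"] != prefix:
            continue
        week, stamp = int(match["week"]), match["stamp"]
        # Timestamps in this layout sort chronologically as plain strings
        if week not in latest or stamp > latest[week][0]:
            latest[week] = (stamp, path)
    return {week: path for week, (stamp, path) in sorted(latest.items())}


# Demo on empty placeholder files (no real models are created or loaded)
with tempfile.TemporaryDirectory() as tmp:
    for name in [
        "qrf_quantile_2025-01-17_15-15-04_week0.pkl",
        "qrf_quantile_2025-01-18_09-00-00_week0.pkl",
        "qrf_quantile_2025-01-17_15-15-11_week1.pkl",
        "qrf_voting_quantile_2025-01-19_10-00-00_week0.pkl",
    ]:
        (Path(tmp) / name).touch()
    selected = {
        week: path.name
        for week, path in latest_model_paths("qrf_quantile", tmp).items()
    }

print(selected)
```

Matching the full prefix keeps `qrf_quantile` files separate from `qrf_voting_quantile` files, which a plain `startswith` check would not do.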

### 3. Predictions computation

The evaluation data include a spatio-temporal split and a temporal-only split.

<img src="../images/eval.png" alt="Experiment Diagram" style="width:50%;" />


In [52]:
import numpy as np

predictions = inference_data[["station_code"]].copy()
y_pred_test_quantile = {}
y_pred_test = {}
# X_test = inference_data.drop(columns=["station_code"])
X_test = inference_data

if USE_ONLY_BEST_FEATURES:
    X_test = X_test[BEST_FEATURES]

if FINAL_MODEL == "chained_qrf":
    y_pred_test = final_models.predict(X_test, quantiles=[0.05, 0.5, 0.95])
    # y_pred_test = final_models.predict(X_test)

    for i in range(NUMBER_OF_WEEK):
        if ADJUSTED_BONDS:
            print("Adjusting bonds")
            y_pred_test[i][:, 0] *= 0.94
            y_pred_test[i][:, 2] *= 1.15
        predictions[f"week_{i}_pred"] = y_pred_test[i][:, 1]
        predictions[f"week_{i}_sup"] = y_pred_test[i][:, 2]
        predictions[f"week_{i}_inf"] = y_pred_test[i][:, 0]
elif FINAL_MODEL == "specialized_qrf":
    # y_pred_test = final_models.predict(X_test, quantiles=[0.04, 0.5, 0.96])
    y_pred_test = final_models.predict(X_test, quantiles=[0.05, 0.5, 0.95])

    for i in range(NUMBER_OF_WEEK):
        predictions[f"week_{i}_pred"] = y_pred_test[i][0.5]
        predictions[f"week_{i}_sup"] = y_pred_test[i][0.95]
        predictions[f"week_{i}_inf"] = y_pred_test[i][0.05]
elif FINAL_MODEL == "qrf":
    quantiles_weeks = {
        0: [0.04, 0.5, 0.95],
        1: [0.04, 0.5, 0.97],
        2: [0.04, 0.5, 0.98],
        3: [0.04, 0.5, 0.985],
    }
    for i in range(NUMBER_OF_WEEK):

        y_pred_test = final_models[i].predict(X_test, quantiles=quantiles_weeks[i])
        
        if ADJUSTED_BONDS:
            # y_pred_test is the 2-D (n_samples, 3) array for week i, so index its columns directly
            y_pred_test[:, 0] *= 0.98
            y_pred_test[:, 2] *= 1.02
        
        print(y_pred_test.shape)
        predictions[f"week_{i}_pred"] = y_pred_test[:,1]
        predictions[f"week_{i}_sup"] = y_pred_test[:,2]
        predictions[f"week_{i}_inf"] = y_pred_test[:,0]
elif FINAL_MODEL == "qrf_voting":
    for i in range(NUMBER_OF_WEEK):
        final_models[i].adjust_weights(location_confidence=2)
        y_pred_test = final_models[i].predict(X_test)
        
        predictions[f"week_{i}_pred"] = y_pred_test["mean"]
        predictions[f"week_{i}_sup"] = y_pred_test["upper"]
        predictions[f"week_{i}_inf"] = y_pred_test["lower"]
else:
    for i in range(NUMBER_OF_WEEK):

        if FINAL_MODEL == "qrf":
            # reorder the columns
            X_test = X_test[final_models[0].feature_names_in_]

        # if FINAL_MODEL == "xgb":
        #     X_test = (
        #         X_test.drop(columns=["north_hemisphere"])
        #         if "north_hemisphere" in X_test.columns
        #         else X_test
        #     )
        print(FINAL_MODEL)
        predict_adjusted = create_predict_function(final_models, i, FINAL_MODEL)
        quantile_adjusted = create_quantile_function(final_models, i, FINAL_MODEL, ALPHA)

        y_pred_test[i] = predict_adjusted(X_test)
        y_pred_test_quantile[i] = quantile_adjusted(X_test)

        if FINAL_MODEL == "lgbm":
            y_pred_test_quantile[i][y_pred_test_quantile[i] < 0] = 0
            y_pred_test[i][y_pred_test[i] < 0] = 0

        if FINAL_MODEL == "xgb":
            y_pred_test_quantile[i][:, 0] *= 0.95
            y_pred_test_quantile[i][:, 1] *= 1.1
        
        if FINAL_MODEL == "xgb_qrf" and ADJUSTED_BONDS:
            print("Adjusting bounds for xgbqrf")

            low_mean = 4e-1
            
            print(low_mean)

            
            y_pred_xgb_qrf =  np.stack([y_pred_test_quantile[i][:, 0], y_pred_test[i], y_pred_test_quantile[i][:, 1]], axis=1)

            y_pred_test_quantile[i][:, 0] *= 0.95

            y_pred_test[i][y_pred_test[i] < 0] =  low_mean
            y_pred_test_quantile[i][y_pred_test_quantile[i][:,0] < 0, 0] = low_mean*0.85
            y_pred_test_quantile[i][y_pred_test_quantile[i][:,1] < 0, 1] = low_mean*1.15

            y_pred_test_quantile[i][y_pred_test_quantile[i][:,0] < 40, 0] *= 0.05
            y_pred_test_quantile[i][y_pred_test_quantile[i][:,1] < 40, 1] *= 1.6

    for i in range(NUMBER_OF_WEEK):
        predictions[f"week_{i}_pred"] = y_pred_test[i]
        predictions[f"week_{i}_sup"] = y_pred_test_quantile[i][:, 1]
        predictions[f"week_{i}_inf"] = y_pred_test_quantile[i][:, 0]

Predicting for full_model_remove_station_identication
Predicting for france_remove_station_identication
Predicting for brazil_remove_station_identication
Predicting for full_model
Predicting for france_model
Predicting for brazil_model
Predicting for full_model_remove_station_identication
Predicting for france_remove_station_identication
Predicting for brazil_remove_station_identication
Predicting for full_model
Predicting for france_model
Predicting for brazil_model
Predicting for full_model_remove_station_identication
Predicting for france_remove_station_identication
Predicting for brazil_remove_station_identication
Predicting for full_model
Predicting for france_model
Predicting for brazil_model
Predicting for full_model_remove_station_identication
Predicting for france_remove_station_identication
Predicting for brazil_remove_station_identication
Predicting for full_model
Predicting for france_model
Predicting for brazil_model
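
Every branch of the prediction cell ends by writing three columns per week. The sketch below isolates that step for an `(n_samples, 3)` array ordered as (lower, median, upper). `add_week_columns` is a hypothetical helper, and clipping negative values to zero is an assumption borrowed from the `lgbm` branch rather than something applied to every model in the notebook.

```python
import numpy as np
import pandas as pd


def add_week_columns(predictions, week, quantile_preds):
    """Write the (lower, median, upper) columns of an (n, 3) array for one week."""
    # Assumption: water flow cannot be negative, so clip at zero (as the lgbm branch does)
    quantile_preds = np.clip(np.asarray(quantile_preds, dtype=float), 0.0, None)
    predictions[f"week_{week}_inf"] = quantile_preds[:, 0]
    predictions[f"week_{week}_pred"] = quantile_preds[:, 1]
    predictions[f"week_{week}_sup"] = quantile_preds[:, 2]
    return predictions


# Toy frame with two stations sharing one date, and toy quantile predictions
toy_predictions = pd.DataFrame(
    {"station_code": ["A", "B"]}, index=["2004-02-01", "2004-02-01"]
)
toy_quantiles = np.array([[-0.5, 2.0, 4.0], [1.0, 3.0, 6.0]])

toy_predictions = add_week_columns(toy_predictions, 0, toy_quantiles)
print(toy_predictions)
```

Assigning NumPy arrays (rather than pandas Series) fills the columns by position, which matters here because the `ObsDate` index contains repeated dates.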


### 4. Saving of the predictions


Save the predictions as a CSV file.

> The file must be named `predictions.csv`


In [53]:
predictions[["week_1_inf","week_1_pred", "week_1_sup"]]

| ObsDate | week_1_inf | week_1_pred | week_1_sup |
| --- | --- | --- | --- |
| 2004-02-01 | 13.443500 | 41.038263 | 74.498810 |
| 2004-04-25 | 9.371190 | 24.593002 | 45.920426 |
| 2004-07-18 | 6.790757 | 15.371916 | 26.697811 |
| 2004-10-10 | 1.967543 | 4.692386 | 11.109444 |
| 2005-01-02 | 8.250048 | 21.417081 | 47.318333 |
| ... | ... | ... | ... |
| 2014-04-20 | 560.272680 | 1126.902575 | 2004.108651 |
| 2014-07-13 | 259.858737 | 339.402818 | 504.397903 |
| 2014-10-05 | 120.999090 | 205.899595 | 365.965330 |
| 2014-12-28 | 703.902333 | 1290.718307 | 2180.586390 |


In [54]:
test = predictions["week_1_pred"] - predictions["week_1_inf"]
test.min()

np.float64(0.02436020408163262)
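
The check above looks at a single week and only at the lower bound. A slightly broader sketch, under the assumption that a valid interval satisfies `inf <= pred <= sup`, counts the violating rows for every week. `check_interval_order` is a hypothetical helper and the data below are toy values.

```python
import pandas as pd


def check_interval_order(predictions, n_weeks=4):
    """Return, per week, the number of rows where inf <= pred <= sup is violated."""
    violations = {}
    for week in range(n_weeks):
        inf = predictions[f"week_{week}_inf"]
        pred = predictions[f"week_{week}_pred"]
        sup = predictions[f"week_{week}_sup"]
        violations[week] = int(((inf > pred) | (pred > sup)).sum())
    return violations


# Toy data: the second row has a lower bound above the point prediction
toy = pd.DataFrame(
    {
        "week_0_inf": [1.0, 5.0],
        "week_0_pred": [2.0, 4.0],
        "week_0_sup": [3.0, 6.0],
    }
)
report = check_interval_order(toy, n_weeks=1)
print(report)
```

On the notebook's own `predictions` frame, the call would be `check_interval_order(predictions, NUMBER_OF_WEEK)`.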

In [55]:
# save the predictions to a csv file
predictions["ObsDate"] = X_test.index
predictions.to_csv(f"{PREDS_DIR}predictions.csv", index=False)

Compression of the submission file.

> The file needs to be compressed for Codabench.


In [56]:
# Create a ZIP file containing predictions.csv
with zipfile.ZipFile(f"{PREDS_DIR}predictions.zip", "w", zipfile.ZIP_DEFLATED) as zipf:
    zipf.write(f"{PREDS_DIR}predictions.csv", "predictions.csv")

You are ready to submit: go to Codabench and submit the generated zip file in My Submissions > Phase 1.

You don't have to use this notebook to submit, but the file format must include the following columns:

- station_code: Identification code of the station.
- ObsDate: Date of the prediction.
- For every prediction week i from 0 to 3:
  - week_i_pred
  - week_i_inf
  - week_i_sup
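
The column list above can be turned into a quick pre-submission check. The sketch below is one possible way to do it with pandas; `missing_submission_columns` is a hypothetical helper and is not part of the challenge tooling.

```python
import pandas as pd


def missing_submission_columns(df, n_weeks=4):
    """Return the required submission columns that are absent from df."""
    required = ["station_code", "ObsDate"]
    for week in range(n_weeks):
        required += [f"week_{week}_pred", f"week_{week}_inf", f"week_{week}_sup"]
    return [col for col in required if col not in df.columns]


# Toy frame that only contains the columns for week 0
toy_submission = pd.DataFrame(
    columns=["station_code", "ObsDate", "week_0_pred", "week_0_inf", "week_0_sup"]
)
missing = missing_submission_columns(toy_submission)
print(missing)
```

An empty list means every required column is present; extra columns are not reported by this check.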

Save the dataset as a CSV file named predictions.csv.

> The file must be named predictions.csv, but the .zip file can have any name.

Compress the CSV file into a .zip archive.

> You cannot submit an uncompressed file. Ensure that the software you use does not create a subfolder inside the archive.
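
To avoid a subfolder inside the archive, the CSV can be written with an explicit archive name and the result inspected afterwards. The sketch below uses a temporary directory and a placeholder CSV; the archive name `submission.zip` is arbitrary.

```python
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "predictions.csv"
    zip_path = Path(tmp) / "submission.zip"

    # Placeholder content standing in for the real predictions
    csv_path.write_text("station_code,ObsDate\nA,2004-02-01\n")

    # arcname stores the file at the archive root, whatever its path on disk
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zipf:
        zipf.write(csv_path, arcname="predictions.csv")

    # List the archive members to confirm there is no folder prefix
    with zipfile.ZipFile(zip_path) as zipf:
        archive_members = zipf.namelist()

print(archive_members)
```

If the listing shows anything other than a single `predictions.csv` entry, such as `some_folder/predictions.csv`, the archive contains a subfolder and should be rebuilt.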

Submit your file in [Codabench](https://www.codabench.org/competitions/4335):

> My Submissions > Phase 1 (keep all the tasks selected):

<img src="../images/submissions.png" alt="Experiment Diagram" style="width:75%; text-align:center;" />
