# Benchmark Models

This notebook implements a benchmark that uses IMERG rainfall data as a predictor of water discharge for the Senegal River.

In [None]:
import pandas as pd
from pathlib import Path
from sklearn.preprocessing import RobustScaler

from ombs_senegal.benchmark_model import FeatureGenerator, SimpleRegressionModel, BenchmarkScores
from ombs_senegal.benchmark_model import plot_interactive_benchmark_scores, plot_prediction_comparison

## Data preprocessing

In [None]:

DATA_PATH = Path("../../data")

In [None]:
df = pd.read_csv(DATA_PATH/"data_cumul.csv")

In [None]:
df = pd.read_csv(
    DATA_PATH/'data_cumul.csv', 
    sep=';', 
    usecols=['time', 'P_mean', 'P_cumul_7j', 'débit_insitu', 'débit_mgb'], 
    index_col='time',
    converters={"time": pd.to_datetime}
    )

In [None]:
def normalize(df):
    """Min-max scale each column to [0, 1] so all series share one plot axis."""
    return (df - df.min()) / (df.max() - df.min())

normalize(df).plot()

Select feature and target columns

In [None]:
x_col, y_col = ['P_cumul_7j','débit_mgb'], ['débit_insitu']

Split data

In [None]:
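# Chronological split: everything before 2019 is used for training.
train_mask = df.index < '2019-01-01'
# Copy the slices so the in-place scaling later does not write to a view of df.
train = df[train_mask].copy()
valid = df[~train_mask].copy()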

Scale data

In [None]:
features_scaler = RobustScaler()
train[x_col] = features_scaler.fit_transform(train[x_col])
valid[x_col] = features_scaler.transform(valid[x_col])
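
For reference, with its default settings `RobustScaler` centers each column on its median and divides by its interquartile range, both learned from the data passed to `fit`. The NumPy sketch below reproduces that computation on made-up numbers to show why the validation set is transformed with the training statistics rather than refitted. The arrays and the helper names `fit_robust` and `apply_robust` are illustrative assumptions, not part of this project.

```python
import numpy as np

def fit_robust(x):
    """Learn the per-column median and interquartile range (25th to 75th percentile)."""
    median = np.median(x, axis=0)
    q25, q75 = np.percentile(x, [25, 75], axis=0)
    return median, q75 - q25

def apply_robust(x, median, iqr):
    """Center on a previously learned median and scale by the learned IQR."""
    return (x - median) / iqr

# Made-up single-feature data with one extreme value in the training part.
train_x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
valid_x = np.array([[3.0], [5.0]])

median, iqr = fit_robust(train_x)                  # statistics come from train only
train_scaled = apply_robust(train_x, median, iqr)
valid_scaled = apply_robust(valid_x, median, iqr)  # reused unchanged on valid
```

The extreme value barely affects the median and the IQR, which is the reason to prefer this scaler over min-max or standard scaling when the series contains flood peaks.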

## Model training

In [None]:
predictions = []
# Grid search over polynomial degree (1 to 3) and context window (10 to 50, step 10).
for degree in range(1, 4):
    for window in range(10, 51, 10):
        feature_generator = FeatureGenerator(context_window=window, target_window=10, degree=degree)
        train_x, train_y = feature_generator.generate(train, x_col, y_col)
        valid_x, valid_y = feature_generator.generate(valid, x_col, y_col)

        model = SimpleRegressionModel()
        model.fit(train_x, train_y)
        predictions.append(model.predict_as_dataframe(valid_x, degree=degree, ctx_window=window))

predictions = pd.concat(predictions).reorder_levels(['degree', 'ctx_window', 'time']).to_xarray()
observations = valid[y_col].to_xarray().sel(time=slice(predictions.time.min(), predictions.time.max()))
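
`FeatureGenerator` and `SimpleRegressionModel` come from `ombs_senegal`, and their internals are not shown in this notebook. As a rough mental model only, the sketch below shows one common way to build this kind of benchmark: take the last `context_window` values of a series as features, expand them with element-wise powers up to `degree`, and fit a least-squares model for the next `target_window` values. The function names and the toy series are assumptions made for illustration, and the package may construct its features differently.

```python
import numpy as np

def make_windows(series, context_window, target_window):
    """Slice a 1-D series into (past window, future window) pairs."""
    n = len(series) - context_window - target_window + 1
    x = np.stack([series[i:i + context_window] for i in range(n)])
    y = np.stack([series[i + context_window:i + context_window + target_window]
                  for i in range(n)])
    return x, y

def polynomial_features(x, degree):
    """Stack a bias column with the element-wise powers x**1 .. x**degree."""
    powers = [x ** d for d in range(1, degree + 1)]
    bias = np.ones((x.shape[0], 1))
    return np.hstack([bias, *powers])

# Toy series standing in for a discharge record (not real data).
series = np.sin(np.linspace(0, 20, 300))
x, y = make_windows(series, context_window=10, target_window=3)
features = polynomial_features(x, degree=2)

# Ordinary least squares: one coefficient column per forecast step.
coef, *_ = np.linalg.lstsq(features, y, rcond=None)
pred = features @ coef
```

Larger windows and higher degrees both widen the feature matrix, which is why the grid above is scored on the held-out validation period rather than on the training fit.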


## Scoring

Since we have generated predictions for several parameter combinations, we keep only the best-performing model according to each metric.
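
The four metric names passed to `compute_scores` in the next cell are standard in hydrology. How `BenchmarkScores` computes them is not shown in this notebook, so the NumPy sketch below uses the usual textbook definitions (with the 2009 formulation of KGE) as an assumption, and the function names are illustrative. MAE and RMSE are errors in discharge units where lower is better, while NSE and KGE are skill scores where 1 is a perfect match.

```python
import numpy as np

def mae(obs, sim):
    """Mean absolute error."""
    return float(np.mean(np.abs(sim - obs)))

def rmse(obs, sim):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((sim - obs) ** 2)))

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus squared error relative to the spread of obs."""
    return float(1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2))

def kge(obs, sim):
    """Kling-Gupta efficiency (2009 form): correlation, variability ratio, bias ratio."""
    r = np.corrcoef(obs, sim)[0, 1]
    alpha = sim.std() / obs.std()
    beta = sim.mean() / obs.mean()
    return float(1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2))

# Small invented example, not data from this notebook.
obs = np.array([10.0, 20.0, 30.0, 40.0])
sim = np.array([12.0, 18.0, 33.0, 37.0])
metrics = {"mae": mae, "rmse": rmse, "nse": nse, "kge": kge}
scores = {name: fn(obs, sim) for name, fn in metrics.items()}
```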

In [None]:
#| hide
#| eval: false
benchmark_scores = BenchmarkScores()
scores_ds = benchmark_scores.compute_scores(
    predictions.to_array(),
    observations["débit_insitu"],
    ["mae", "rmse", "nse", "kge"])
best_scores = benchmark_scores.find_nbest_scores(
    scores_ds,
    how={"mae": "min", "rmse": "min", "nse": "max", "kge": "max"},
    n=1)
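
The `how` argument pairs each metric with the direction in which it improves. As a plain-Python illustration of that idea, selecting the single best configuration per metric amounts to an argmin or argmax over the parameter grid. The score values below are invented, and `pick_best` is a hypothetical helper rather than the `find_nbest_scores` implementation.

```python
# Invented scores keyed by (degree, ctx_window); not results from this notebook.
scores = {
    (1, 10): {"mae": 5.0, "nse": 0.80},
    (2, 40): {"mae": 3.0, "nse": 0.91},
    (3, 50): {"mae": 3.5, "nse": 0.93},
}

def pick_best(scores, how):
    """Return the best (degree, ctx_window) key for every metric listed in `how`."""
    best = {}
    for metric, direction in how.items():
        chooser = min if direction == "min" else max
        best[metric] = chooser(scores, key=lambda k: scores[k][metric])
    return best

best = pick_best(scores, how={"mae": "min", "nse": "max"})
```

Because different metrics can favor different configurations, the result may name more than one model, as it does in this toy case.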

## Results

In [None]:
plot_interactive_benchmark_scores(best_scores)


Based on the scatter plot of MAE against RMSE, polynomial regression with degree 2 and context windows of 30 to 50 days gives the best predictions: these configurations cluster in the lower-left corner of the plot, where both errors are lowest. In particular, degree 2 with a window of about 40 days achieves the best balance between Mean Absolute Error and Root Mean Squared Error, which suggests that these parameters give accurate and stable predictions without overfitting the data.

### Time series verification

Now that we have identified the optimal model parameters, we verify the model's performance by comparing its predicted discharge with both the observed values and the MGB model predictions. The comparison covers the full 10-day prediction horizon, so we can assess how well the model retains its predictive power as the lead time grows. The time series plots show the observed discharge, our model's predictions, and the MGB model predictions side by side.
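
One way to quantify how skill changes across the horizon is to compute an error per lead time. The sketch below does this with RMSE on synthetic arrays of shape `(n_samples, horizon)`. The data, the noise model, and the helper name `rmse_per_lead` are assumptions for illustration and do not come from this notebook.

```python
import numpy as np

def rmse_per_lead(obs, pred):
    """RMSE for each forecast step; inputs have shape (n_samples, horizon)."""
    return np.sqrt(np.mean((pred - obs) ** 2, axis=0))

rng = np.random.default_rng(0)
horizon = 10
obs = rng.normal(loc=100.0, scale=20.0, size=(500, horizon))

# Synthetic forecasts whose noise grows linearly with lead time.
noise_scale = np.arange(1, horizon + 1)
pred = obs + rng.normal(size=obs.shape) * noise_scale

errors = rmse_per_lead(obs, pred)
for step, err in enumerate(errors, start=1):
    print(f"t+{step}: RMSE = {err:.2f}")
```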


We first define the optimal models as follows:

In [None]:
best_models={
    "t+1": {"degree": 2, "ctx_window": 40}, 
    **{f"t+{i}": {"degree": 2, "ctx_window": 50} for i in range(2, 11)}
    }

We can now plot the data.

In [None]:
fig = plot_prediction_comparison(
    observed=observations["débit_insitu"], 
    predicted=predictions, 
    best_model=best_models, 
    mgb=df[~train_mask]["débit_mgb"].to_xarray(),
    scores=scores_ds
    )

## Save prediction data

We save the predictions of the best-performing models for later use.

In [None]:
# benchmark_ds = predictions.sel(degree=2, ctx_window=slice(30, 50))
# benchmark_ds.to_netcdf(DATA_PATH/'regression_benchmark.nc')