# TAMSAT Pertinence Analysis

This notebook analyzes the relevance of TAMSAT (Tropical Applications of Meteorology using SATellite data) rainfall data for hydrological modeling in Senegal. It explores the relationship between TAMSAT precipitation estimates and river discharge measurements to assess the dataset's utility for flood forecasting and water resource management in the region.


In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from pathlib import Path
import xarray as xr
import geopandas as gpd
import hvplot.pandas
import pandas as pd
import matplotlib.pyplot as plt

from ombs_senegal.region import get_region_mask


DATA_PATH = Path("../../data")

## TAMSAT Data preprocessing

This section preprocesses the TAMSAT rainfall data. First, we load the TAMSAT data and mask it to the region of interest.



In [None]:
from math import floor, ceil

In [None]:
roi_gdf = gpd.read_file(DATA_PATH/"point_ajustement/sub4/sub4_senegal.shp")
# total_bounds gives scalar (minx, miny, maxx, maxy) over all geometries,
# so floor/ceil below receive plain floats instead of one-element arrays
min_lon, min_lat, max_lon, max_lat = roi_gdf.total_bounds

tamsat = xr.open_dataset(DATA_PATH/"01-tamsatDaily.v3.1-20100101-20250531-20250603_-16.85_-6.05_10.15_18.95.nc")
tamsat = tamsat.sel(lat=slice(floor(min_lat), ceil(max_lat), -1), lon=slice(floor(min_lon), ceil(max_lon)))
mask = get_region_mask(tamsat, roi_gdf)

In [None]:
#| skip_export
roi_tamsat = tamsat.where(mask)
roi_tamsat = roi_tamsat.sel(time=slice(None, "2024-12-31"))

Since we're interested in the total rainfall across the basin rather than its spatial distribution, we'll sum up all rainfall values within the basin area. We'll save this aggregated data to avoid repeating the preprocessing steps.

In [None]:
#| skip_export
daily_total = roi_tamsat.sum(["lat", "lon"])
daily_total.to_netcdf(DATA_PATH/"tamsat_sub4_senegal_daily_total.nc")
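
To make the mask-then-sum step concrete, here is a minimal NumPy sketch on a made-up grid (2 days, 2 latitudes, 3 longitudes). The rainfall values and the mask are invented for illustration only. `tamsat.where(mask)` followed by `.sum(["lat", "lon"])` behaves analogously, because xarray skips NaN values by default when summing floating-point data.

```python
import numpy as np

# Hypothetical rainfall field: 2 days x 2 lat x 3 lon (values in mm)
rain = np.array([
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[0.0, 0.0, 1.0], [2.0, 0.0, 0.0]],
])
# Hypothetical basin mask: True for pixels inside the region of interest
mask = np.array([[True, True, False], [False, True, False]])

# Pixels outside the basin become NaN, mirroring `tamsat.where(mask)`
masked = np.where(mask, rain, np.nan)

# NaN-aware sum over the two spatial axes gives one total per day
daily_total = np.nansum(masked, axis=(1, 2))
print(daily_total)
```

Note that the result is a sum of per-pixel rainfall depths, not a volume. Dividing by the number of basin pixels would give a basin-mean depth instead, which only changes the series by a constant factor.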

## TAMSAT estimate to in situ correlation

We will analyze the correlation between TAMSAT rainfall estimates and observed river discharge (débit).
To reduce noise and identify long-term patterns, we'll aggregate the data annually. This will help us:
1. Evaluate how well TAMSAT rainfall estimates correspond to actual river flow
2. Assess the potential effectiveness of using TAMSAT data in our benchmark model
3. Account for seasonal patterns and lag effects between rainfall and discharge

The correlation analysis will provide insights into whether TAMSAT data can be a reliable predictor for river discharge in our study area.

In [None]:
insitu_df = pd.read_csv(
    DATA_PATH/'data_cumul.csv', 
    sep=';', 
    usecols=['time', 'débit_insitu', 'P_mean'], 
    index_col='time',
    converters={"time": pd.to_datetime}
    )

tamsat_daily_total = xr.load_dataset(DATA_PATH/"tamsat_sub4_senegal_daily_total.nc")

In [None]:
combined_df = pd.merge(insitu_df, tamsat_daily_total["rfe"].to_dataframe(), left_index=True, right_index=True)
yearly_df = combined_df.resample("YS").sum()
yearly_df = (yearly_df - yearly_df.min())/(yearly_df.max() - yearly_df.min())
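
The annual aggregation and min-max scaling above can be checked on a small synthetic example. The sketch below uses invented constant daily values per year and hypothetical ASCII column names (`rain`, `discharge`); it is an illustration of the two steps, not the notebook's data.

```python
import pandas as pd

idx = pd.date_range("2010-01-01", "2012-12-31", freq="D")
year = pd.Series(idx.year, index=idx)

# Invented daily values, constant within each year
daily = pd.DataFrame({
    "rain": year.map({2010: 1.0, 2011: 2.0, 2012: 3.0}),
    "discharge": year.map({2010: 10.0, 2011: 30.0, 2012: 20.0}),
})

# One row per year, labelled with the first day of the year ("YS")
yearly = daily.resample("YS").sum()

# Each column is rescaled independently to the [0, 1] range
yearly_norm = (yearly - yearly.min()) / (yearly.max() - yearly.min())
print(yearly_norm)
```

Because every column is scaled on its own, the normalized values are unitless and only the relative ranking of years within a column is preserved.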

#### Yearly correlation

In [None]:
def r2(x, y):
    res = x.sub(y).pow(2).sum()
    tot = x.sub(x.mean()).pow(2).sum()
    return 1 - res/tot

r2(yearly_df["débit_insitu"], yearly_df["rfe"]), r2(yearly_df["débit_insitu"], yearly_df["P_mean"])
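
A caveat on interpreting these numbers: `r2` as defined above measures how closely the second series matches the first one-to-one, so it is not symmetric and can be negative even when the two series are perfectly correlated. The toy series below are invented to show the difference from the squared Pearson correlation.

```python
import pandas as pd

def r2(x, y):
    """Coefficient of determination, treating y as a prediction of x."""
    res = x.sub(y).pow(2).sum()
    tot = x.sub(x.mean()).pow(2).sum()
    return 1 - res / tot

obs = pd.Series([0.0, 0.25, 0.5, 0.75, 1.0])
biased = obs + 0.5  # perfectly correlated with obs, but offset

score_same = r2(obs, obs)          # identical series
score_biased = r2(obs, biased)     # penalized for the constant offset
pearson_sq = obs.corr(biased) ** 2 # insensitive to the offset

print(score_same, score_biased, pearson_sq)
```

If the goal is to judge whether rainfall and discharge move together rather than whether they coincide after scaling, the squared Pearson correlation may be the more informative of the two.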

In [None]:
#| skip_export
plt.figure(figsize=(7,6))
plt.scatter(yearly_df['débit_insitu'], yearly_df['rfe'], label='TAMSAT')
plt.scatter(yearly_df['débit_insitu'], yearly_df['P_mean'], label='IMERG')

# Add year labels to each point
for idx, row in yearly_df.iterrows():
    plt.annotate(idx.year, (row['débit_insitu'], row['rfe']), xytext=(5,5), textcoords='offset points')
    plt.annotate(idx.year, (row['débit_insitu'], row['P_mean']), xytext=(5,5), textcoords='offset points')

plt.xlabel('Normalized in-situ discharge')
plt.ylabel('Normalized rainfall estimate')
plt.title('Annual discharge vs. rainfall (min-max normalized)')
plt.legend()

The graph above, together with the R² scores, shows that the correlation between TAMSAT and the river flow is smaller than expected.

#### Cross correlation

In order to determine the optimal smoothing window size, we will calculate the cross correlation between the rainfall and the river flow.

In [None]:
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import ccf

def find_optimal_window(
        rainfall: pd.Series,
        discharge: pd.Series,
        max_window: int = 100, 
        min_lag: int = 0, 
        max_lag: int = 30) -> pd.DataFrame:
    """Find optimal smoothing window with constrained lag range between rainfall and discharge time series."""
    def smooth(df, window, missing_val=0): return df.rolling(window=window).sum().fillna(missing_val)

    results = []    
    for window in range(1, max_window + 1):

        smoothed_rain = smooth(rainfall, window=window)
        
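        # Guard against NaNs; smooth() already fills the warm-up period with
        # missing_val, so this mask normally keeps every sample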
        valid_mask = ~np.isnan(smoothed_rain)
        smooth_rain_clean = smoothed_rain[valid_mask]
        discharge_clean = discharge[valid_mask]
        
        cross_corr = ccf(smooth_rain_clean, discharge_clean)
        
        # Only consider the specified lag range
        lag_range = slice(min_lag, max_lag + 1)
        restricted_ccf = cross_corr[lag_range]
        
        max_corr = np.max(np.abs(restricted_ccf))
        lag = np.argmax(np.abs(restricted_ccf)) + min_lag
        
        results.append({
            'window': window,
            'correlation': max_corr,
            'lag': lag
        })
            
    return pd.DataFrame(results)

In [None]:
best_correlations = find_optimal_window(combined_df['rfe'], combined_df['débit_insitu'])
best_correlations.hvplot.line(x='window', y='correlation', hover_cols=['lag'])
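
As a sanity check on the lag convention, the sketch below computes lagged correlations by shifting the rainfall series explicitly, so that a positive lag `k` always means "rainfall `k` days before discharge". The data are synthetic (random rainfall and a discharge that is simply rainfall delayed by 5 days) and the names are hypothetical. According to the statsmodels documentation, element `k` of `ccf(x, y)` is the correlation between `x[k:]` and `y[:n-k]`, so the argument order decides which series is treated as leading; shifting by hand sidesteps that ambiguity.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2020-01-01", periods=400, freq="D")

# Synthetic rainfall and a discharge that responds exactly 5 days later
rain = pd.Series(rng.gamma(shape=0.5, scale=4.0, size=400), index=idx)
true_lag = 5
discharge = rain.shift(true_lag)

# Correlate discharge at day t with rainfall at day t - k, for k = 0..30
lag_corr = pd.Series({k: discharge.corr(rain.shift(k)) for k in range(0, 31)})
best_lag = int(lag_corr.idxmax())
print(best_lag)
```

`Series.corr` ignores the NaN values introduced by `shift`, so no explicit trimming is needed in this sketch.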

We can see that the correlation peaks at a window size of around 60 days. We now take a closer look by plotting the smoothed and normalized daily data.

In [None]:
def smooth(df, window=7, missing_val=0): return df.rolling(window=window).sum().fillna(missing_val)

def normalize(df): return (df - df.min())/(df.max() - df.min())
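
To see what these two helpers do, here is a tiny worked example on an invented five-day series. The helper definitions are repeated so that the snippet runs on its own.

```python
import pandas as pd

def smooth(s, window=7, missing_val=0):
    # Trailing rolling sum; the first window-1 values are filled with missing_val
    return s.rolling(window=window).sum().fillna(missing_val)

def normalize(s):
    # Min-max scaling to the [0, 1] range
    return (s - s.min()) / (s.max() - s.min())

rain = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
acc = smooth(rain, window=3)
acc_norm = normalize(acc)

print(acc.tolist())       # [0.0, 0.0, 6.0, 9.0, 12.0]
print(acc_norm.tolist())  # [0.0, 0.0, 0.5, 0.75, 1.0]
```

Despite its name, `smooth` accumulates rather than averages: each value is the total rainfall over the preceding `window` days, and the warm-up period is set to zero instead of being dropped.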

In [None]:
window = 60
processed_df = combined_df.copy()
processed_df[f"rfe_w={window}"] = smooth(combined_df["rfe"], window=window)
normalized_df = normalize(processed_df)

normalized_df[[f"rfe_w={window}", "débit_insitu"]].hvplot.line()

While a window size of 60 days yields the highest correlation, such a long aggregation period may smooth out important short-term variations in the rainfall-discharge relationship. A shorter window might better capture these finer temporal dynamics, albeit with a potentially lower overall correlation. Based on the correlation curve above, we also consider three shorter windows alongside the 60-day one:
- 3 days, as it captures the immediate rainfall-discharge response while still showing linear improvement in correlation
- 7 days, as this is where the correlation curve begins to stabilize, suggesting it captures the main rainfall-discharge dynamics
- 15 days, as it provides a good compromise between short-term responsiveness and longer-term accumulation effects

In [None]:
processed_df = combined_df.copy()
w_vars = []
for window in [3, 7, 15, 60]:
    processed_df[f"rfe_w={window}"] = smooth(combined_df["rfe"], window=window)
    w_vars += [f"rfe_w={window}"]

normalize(processed_df)[[*w_vars, "débit_insitu"]].hvplot.line(width=1000, height=600)

## Model Benchmark with TAMSAT 

Based on the strong correlation observed between TAMSAT rainfall estimates and river flow, we will now evaluate the benchmark model using TAMSAT data. We will conduct two analyses:
1. Using only TAMSAT rainfall estimates and MGB water flow predictions as input features
2. Using all available parameters (TAMSAT rainfall, MGB flow, and other variables) as input features

Similar to our previous analysis with IMERG data, we will:
- Test different time window sizes to capture temporal patterns
- Evaluate multiple polynomial degrees to model non-linear relationships
- Compare model performance using standard metrics (MAE, RMSE, NSE, KGE) and visual analysis

This will allow us to:
- Assess TAMSAT's effectiveness as a predictor
- Compare results with the IMERG-based models
- Determine optimal model parameters
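
The idea of combining a context window with polynomial features can be sketched in plain NumPy. This is a simplified illustration under stated assumptions, not the implementation of `FeatureGenerator` or `SimpleRegressionModel`: the function name `make_lagged_features` is hypothetical, the data are synthetic, only a single input variable is used, and the polynomial expansion has no cross terms.

```python
import numpy as np

def make_lagged_features(x, y, context_window, horizon, degree):
    """Rows hold the last `context_window` inputs raised to powers 1..degree."""
    rows, targets = [], []
    for t in range(context_window, len(x) - horizon + 1):
        ctx = x[t - context_window:t]
        rows.append(np.concatenate([ctx ** p for p in range(1, degree + 1)]))
        targets.append(y[t + horizon - 1])
    return np.array(rows), np.array(targets)

# Synthetic data: the target is a quadratic function of the previous day's input
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=300)
y = np.empty_like(x)
y[0] = 0.0
y[1:] = 2.0 * x[:-1] ** 2 + 1.0

X, Y = make_lagged_features(x, y, context_window=3, horizon=1, degree=2)

# Ordinary least squares with an explicit intercept column
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
rmse = np.sqrt(np.mean((A @ coef - Y) ** 2))
print(X.shape, rmse)
```

With `degree=2` the quadratic relationship is recovered almost exactly, whereas `degree=1` would leave a residual. This is the kind of difference the degree sweep is meant to expose.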

In [None]:
import pandas as pd
from sklearn.preprocessing import RobustScaler
from ombs_senegal.benchmark_model import FeatureGenerator, SimpleRegressionModel, BenchmarkScores
from ombs_senegal.benchmark_model import plot_interactive_benchmark_scores, plot_prediction_comparison

In [None]:
#| skip_export
df = pd.read_csv(
    DATA_PATH/'data_cumul.csv', 
    sep=';', 
    usecols=['time', 'P_cumul_7j', 'débit_insitu', 'débit_mgb'], 
    index_col='time',
    converters={"time": pd.to_datetime}
    )

tamsat_daily_total = xr.load_dataset(DATA_PATH/"tamsat_sub_poly_daily_total.nc")

data = pd.merge(df, tamsat_daily_total["rfe"].to_dataframe(), left_index=True, right_index=True)


#### Preprocess data

Select feature and target columns

In [None]:
x_col, y_col = ["débit_mgb", "rfe"], ['débit_insitu']


Smooth data

In [None]:
data["rfe"] = smooth(data["rfe"], window=15)

Scale data

In [None]:
features_scaler = RobustScaler()

features = data[x_col]
data[x_col] = features_scaler.fit_transform(features)
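
Note that the scaler above is fitted on every row, so statistics from the validation period also influence the scaling. A possible alternative, sketched below, is to compute the scaling statistics on the training period only and apply them to all rows. The sketch reproduces what `RobustScaler` does with its default settings (centering on the median and dividing by the interquartile range) in plain pandas. The helper names, the column name `debit_mgb`, and the values are hypothetical.

```python
import pandas as pd

def fit_robust(train):
    """Per-column median and interquartile range of the training rows."""
    median = train.median()
    iqr = train.quantile(0.75) - train.quantile(0.25)
    return median, iqr

def apply_robust(df, median, iqr):
    return (df - median) / iqr

idx = pd.date_range("2018-12-28", periods=8, freq="D")
data = pd.DataFrame({
    "debit_mgb": [1.0, 2.0, 3.0, 4.0, 50.0, 6.0, 7.0, 8.0],
    "rfe": [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0],
}, index=idx)

# Chronological split: statistics come from the rows before 2019 only
train_mask = data.index < "2019-01-01"
median, iqr = fit_robust(data[train_mask])
scaled = apply_robust(data, median, iqr)
print(scaled)
```

With scikit-learn, the same effect would be obtained by calling `fit` on the training rows and `transform` on both splits.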


In [None]:
#| skip_export
# Build the mask from the merged frame so that it always matches `data` row for row
train_mask = data.index < '2019-01-01'

train = data[train_mask]
valid = data[~train_mask]

In [None]:

predictions = []
for degree in range(1, 4):
    for window in range(10, 51, 10):
        feature_generator = FeatureGenerator(context_window=window, target_window=10, degree=degree)        
        train_x, train_y = feature_generator.generate(train, x_col, y_col)
        valid_x, valid_y = feature_generator.generate(valid, x_col, y_col)

        model = SimpleRegressionModel()
        model.fit(train_x, train_y)
        predictions.append(model.predict_as_dataframe(valid_x, degree=degree, ctx_window=window))


predictions = pd.concat(predictions).reorder_levels(['degree', 'ctx_window', 'time']).to_xarray()
observations = valid[y_col[0]].to_xarray().sel(time=slice(predictions.time.min(), predictions.time.max()))


In [None]:
benchmark_scores = BenchmarkScores()
scores_ds = benchmark_scores.compute_scores(
    predictions,
    observations,
    ["mae", "rmse", "nse", "kge"])
best_scores = benchmark_scores.find_nbest_scores(
    scores_ds,
    how={"mae": "min", "rmse": "min", "nse": "max", "kge": "max"},
    n=1)

In [None]:
plot_interactive_benchmark_scores(best_scores)

Analysis of the benchmark results shows the expected pattern of decreasing forecast accuracy as prediction horizons increase. 

Using a 15-day smoothing window for rainfall data, the model achieves its best performance with polynomial features of degree 2 and context windows ranging from 30 to 50 timesteps. This configuration provides the best balance between capturing relevant temporal patterns and avoiding overfitting. To obtain a single global score, we normalize the metrics, inverting RMSE and MAE so that higher is always better, and take the mean of all the scores.

In [None]:
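def normalize_metrics(ds, lower_is_better=("mae", "rmse")):
    dims = ["degree", "ctx_window"]
    scaled = (ds - ds.min(dim=dims))/(ds.max(dim=dims) - ds.min(dim=dims))
    # Invert only the error metrics so that 1 is the best value for every metric
    for name in lower_is_better: scaled[name] = 1 - scaled[name]
    return scaled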


normalized_scores = normalize_metrics(scores_ds)
normalized_scores = normalized_scores.to_array().mean(dim="variable")
single_metric_best_scores = benchmark_scores.find_nbest_scores(normalized_scores.to_dataset(name="score"), how={"score": "max"}, n=1)
single_metric_best_scores

In [None]:
best_models = {}
for idx, row in single_metric_best_scores.reset_index().iterrows():
    best_models[row["forecast_horizon"]] = {"degree": row["degree"], "ctx_window": row["ctx_window"]}


We can now plot the predictions of the best models against the observations and the MGB flow.

In [None]:
fig = plot_prediction_comparison(
    observed=observations, 
    predicted=predictions, 
    best_model=best_models, 
    mgb=df[~train_mask]["débit_mgb"].to_xarray(),
    scores=scores_ds
    )

In [None]:
#| hide
#| eval: false
benchmark_ds = predictions.sel(degree=2, ctx_window=slice(30, 50))
benchmark_ds.to_netcdf(DATA_PATH/'tamsat_regression_benchmark.nc')

In [None]:
#| skip_export
from statsmodels.tsa.stattools import ccf

# Calculate cross-correlation between rainfall and flow
ccf_rfe = ccf(smooth(combined_df["rfe"], window=30), combined_df['débit_insitu'], adjusted=False)
#ccf_mgb = ccf(normalized_df['débit_mgb'], normalized_df['débit_insitu'], adjusted=False)

# Plot cross-correlations
plt.figure(figsize=(10,6))
lags = range(len(ccf_rfe))
plt.plot(lags, ccf_rfe, label='TAMSAT RFE vs In-situ Flow')
#plt.plot(lags, ccf_mgb, label='MGB Flow vs In-situ Flow')
plt.axhline(y=0, color='k', linestyle='--')
plt.xlabel('Lag (days)')
plt.ylabel('Cross-correlation')
plt.title('Cross-correlation Analysis')
plt.legend()
plt.grid(True)
plt.show()

