# Optimal interpolation

Optimal interpolation is a method used to combine spatially distributed data (the "background field") with point-based observations. This technique adjusts the entire field by incorporating deviations between the observed data and the field at the observation points, resulting in a statistically optimal adjustment of the background field. For example, it can be used to blend reanalysis precipitation data (such as ERA5) with actual observational records, ensuring that the reanalysis precipitation is corrected over the entire domain.

This page demonstrates how to use `xHydro` to perform optimal interpolation for hydrological modeling by integrating field-like simulations with point observations. In this case, the background field consists of outputs from a distributed hydrological model, while the point observations correspond to real hydrometric station measurements. The goal is to correct the background field (i.e., the hydrological model outputs) using optimal interpolation techniques, following the approach outlined in Lachance-Cloutier et al. (2017).

*Lachance-Cloutier, S., Turcotte, R. and Cyr, J.F., 2017. Combining streamflow observations and hydrologic simulations for the retrospective estimation of daily streamflow for ungauged rivers in southern Quebec (Canada). Journal of Hydrology, 550, pp.294-306.*

Optimal interpolation relies on a set of hyperparameters. Some of these are more complex than others, so let’s break down the main steps.

The first step is to compute the differences (or "departures") between the observed and simulated flow at the stations where both values are available. These differences must be scaled by the catchment area to ensure that errors are relative and can be properly interpolated. Also, we take the logarithm of these scaled values to prevent negative streamflow during extrapolation. We will reverse this transformation later in the process.
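The transformation described above can be sketched with NumPy. This is only an illustration: the function name, the sign convention (observed minus simulated) and the sample values are assumptions, not the internal `xHydro` implementation.

```python
import numpy as np


def log_scaled_departures(q_obs, q_sim, area_obs, area_sim):
    """Return log(q_obs / area_obs) - log(q_sim / area_sim), rows = stations."""
    q_obs = np.asarray(q_obs, dtype=float)
    q_sim = np.asarray(q_sim, dtype=float)
    # Flow per unit area makes catchments of different sizes comparable.
    spec_obs = q_obs / np.asarray(area_obs, dtype=float)[:, None]
    spec_sim = q_sim / np.asarray(area_sim, dtype=float)[:, None]
    return np.log(spec_obs) - np.log(spec_sim)


# Made-up values: two stations, three time steps (m3/s), drainage areas (km2).
q_obs = np.array([[10.0, 12.0, 8.0], [100.0, 90.0, 110.0]])
q_sim = np.array([[8.0, 12.0, 10.0], [120.0, 90.0, 100.0]])
area_obs = np.array([50.0, 500.0])
area_sim = np.array([50.0, 500.0])

departures = log_scaled_departures(q_obs, q_sim, area_obs, area_sim)

# Reversing the transformation: add the departure in log space,
# exponentiate, then multiply by the area again.
log_spec_sim = np.log(q_sim / area_sim[:, None])
q_back = np.exp(log_spec_sim + departures) * area_sim[:, None]
```

Because the correction is applied in log space and then exponentiated, the back-transformed flow is always strictly positive, whatever the size of the departure.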

Next, we need some additional information, which may or may not be available for our observation and simulation sites. These include estimates of:

* The variance of the observations at the gauged sites.
* The variance of the simulated flows at the observation sites.
* The variance of the simulated flows at the estimation sites, including those that also correspond to an observation site.

These can be estimated in real-world applications using long time series of log-transformed and scaled flows, or from measurement errors associated with the instrumentation at gauged sites. These parameters can also be fine-tuned based on past experience or through trial-and-error.
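As a rough sketch of the first option (long time series), the variance of the log-transformed, area-scaled flows can be computed per station. The helper name and the sample values below are hypothetical, and real applications would use much longer records.

```python
import numpy as np


def log_specific_variance(q, area):
    """Variance over time of log(q / area) for each station (rows = stations)."""
    q = np.asarray(q, dtype=float)
    area = np.asarray(area, dtype=float)
    log_spec = np.log(q / area[:, None])
    # nanvar ignores missing values, which are common in gauge records.
    return np.nanvar(log_spec, axis=1)


# Made-up series: the first station has a gap, the second is constant.
q = np.array([[10.0, 12.0, np.nan, 8.0], [100.0, 100.0, 100.0, 100.0]])
area = np.array([50.0, 500.0])

var_per_station = log_specific_variance(q, area)
```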

The final component we need is the error covariance function (ECF). In simple terms, optimal interpolation takes into account the distance between one or more observations and the site where we need to estimate a new flow value. Intuitively, a simulation station close to an observation station should be highly correlated with it, while a station farther away should be less correlated. Therefore, we need a covariance function that relates two quantities:

1. The degree of covariability between an observed and simulated point.
2. The distance between these points.

The ECF is key to this, and several models of it exist in the literature. In many cases, a model form will be chosen *a priori*, and its parameters will be adjusted to best represent the covariance between points.
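To see how a distance-based covariance turns departures into corrections, here is a minimal sketch of the generic textbook optimal interpolation update. It is not the `xHydro` implementation: the function names, the one-dimensional station positions and the exponential covariance with a 50 km length scale are all assumptions made for illustration.

```python
import numpy as np


def exp_cov(h, sill=1.0, length=50.0):
    """Assumed exponential covariance: decays with distance h (km)."""
    return sill * np.exp(-h / length)


def oi_update(background_est, departures_obs, cov_obs_obs, cov_est_obs, obs_error_var):
    """Correct background values at estimation sites from departures at gauges.

    background_est : (n_est,) background values at the estimation sites
    departures_obs : (n_obs,) observation minus background at the gauged sites
    cov_obs_obs    : (n_obs, n_obs) background error covariance between gauges
    cov_est_obs    : (n_est, n_obs) covariance, estimation sites vs gauges
    obs_error_var  : (n_obs,) observation error variance at the gauges
    """
    # Covariance of the departures: background error plus observation error.
    innovation_cov = cov_obs_obs + np.diag(obs_error_var)
    # Solve the linear system instead of forming an explicit inverse.
    weights = np.linalg.solve(innovation_cov, cov_est_obs.T).T
    return background_est + weights @ departures_obs, weights


x_obs = np.array([0.0, 100.0])        # gauge positions (km)
x_est = np.array([0.0, 10.0, 500.0])  # estimation site positions (km)
h_oo = np.abs(x_obs[:, None] - x_obs[None, :])
h_eo = np.abs(x_est[:, None] - x_obs[None, :])

analysis, weights = oi_update(
    background_est=np.zeros(3),
    departures_obs=np.array([0.5, -0.2]),
    cov_obs_obs=exp_cov(h_oo),
    cov_est_obs=exp_cov(h_eo),
    obs_error_var=np.zeros(2),  # zero means the observations are trusted exactly
)
```

In this toy setup, the estimation site that coincides with the first gauge receives the full departure, the site 10 km away receives most of it, and the site 500 km away is left almost unchanged.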

In this test example, we don’t have enough points or time steps to develop a meaningful model (or parameterization) from the data. As a result, we will impose a model. `xHydro` includes four built-in models, where `par[0]` and `par[1]` are the model parameters to be calibrated (under normal circumstances), and *h* represents the distance between points:

* **Model 1**: 
   $$
   \begin{flalign*}
   &\text{par}[0] \cdot \left( 1 + \frac{h}{\text{par}[1]} \right) \cdot \exp\left(- \frac{h}{\text{par}[1]} \right) && \text{— From Lachance-Cloutier et al. 2017.}
   \end{flalign*}
   $$
* **Model 2**:
   $$
   \begin{flalign*}
   &\text{par}[0] \cdot \exp\left( -0.5 \cdot \left( \frac{h}{\text{par}[1]} \right)^2 \right) &&
   \end{flalign*}
   $$
* **Model 3**:
   $$
   \begin{flalign*}
   &\text{par}[0] \cdot \exp\left( -\frac{h}{\text{par}[1]} \right) &&
   \end{flalign*}
   $$
* **Model 4**:
   $$
   \begin{flalign*}
   &\text{par}[0] \cdot \exp\left( -\frac{h^{\text{par}[1]}}{\text{par}[0]} \right) &&
   \end{flalign*}
   $$
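The four formulas above translate directly into NumPy. The function below is a transcription of those formulas for illustration only; its name and the parameter values used in the example are hypothetical and it is not the `xHydro` API.

```python
import numpy as np


def ecf_model(form, h, par):
    """Evaluate one of the four covariance models listed above at distance h."""
    h = np.asarray(h, dtype=float)
    if form == 1:
        return par[0] * (1 + h / par[1]) * np.exp(-h / par[1])
    if form == 2:
        return par[0] * np.exp(-0.5 * (h / par[1]) ** 2)
    if form == 3:
        return par[0] * np.exp(-h / par[1])
    if form == 4:
        return par[0] * np.exp(-(h ** par[1]) / par[0])
    raise ValueError(f"Unknown model form: {form}")


distances = np.array([0.0, 10.0, 50.0, 100.0])  # km, made-up values

# Illustrative parameters only; in practice they would be calibrated.
# Model 4 uses par[1] as an exponent, so it gets a different value.
curves = {
    1: ecf_model(1, distances, (1.0, 50.0)),
    2: ecf_model(2, distances, (1.0, 50.0)),
    3: ecf_model(3, distances, (1.0, 50.0)),
    4: ecf_model(4, distances, (1.0, 0.5)),
}
```

With these parameters, every model equals `par[0]` at zero distance and decreases as the distance grows; the models differ in how quickly the covariance drops off.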

In [None]:
import datetime as dt
from functools import partial
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pooch
import xarray as xr
from scipy.stats import norm

import xhydro as xh
import xhydro.optimal_interpolation
from xhydro.testing.helpers import deveraux

## Example with HYDROTEL data

Optimal interpolation relies on both observed and simulated datasets and requires the following information:

* Observed data for the gauged locations
* Simulated data for all locations
* Catchment areas (for error scaling)
* Catchment latitude and longitude (to develop the spatial error model)

This example will use a subset of data generated using the HYDROTEL hydrological model.

In [None]:
# Get data
test_data_path = deveraux().fetch(
    "optimal_interpolation/OI_data_corrected.zip",
    pooch.Unzip(),
)
directory_to_extract_to = Path(test_data_path[0]).parent

# Open the datasets and read the list of observation stations for later use.
qobs = xr.open_dataset(directory_to_extract_to / "A20_HYDOBS_TEST_corrected.nc").rename(
    {"streamflow": "q"}
)
qsim = xr.open_dataset(directory_to_extract_to / "A20_HYDREP_TEST_corrected.nc").rename(
    {"streamflow": "q"}
)
station_correspondence = xr.open_dataset(
    directory_to_extract_to / "station_correspondence.nc"
)
df_validation = pd.read_csv(
    directory_to_extract_to / "stations_retenues_validation_croisee.csv",
    sep=None,
    dtype=str,
)
observation_stations = list(df_validation["No_station"])

There are three datasets, as well as a list:

- **qobs**: The dataset containing point observations and station metadata.
- **qsim**: The dataset containing the background field simulations (e.g. the raw HYDROTEL results), including simulated station metadata.
- **station_correspondence**: A dataset that simply links station identifiers between the observed and simulated stations. This is necessary because observed stations use "real-world" identifiers, while distributed simulations often employ coded or sequentially numbered identifiers.
- **observation_stations**: A list of the stations from the observation set that we want to use to build the optimal interpolation.


In [None]:
qobs

In [None]:
qsim

In [None]:
station_correspondence

In [None]:
print(
    f"There are a total of {len(observation_stations)} selected observation stations."
)
print(observation_stations)

<div class="alert alert-warning"> <b>WARNING</b>
    
The optimal interpolation module in `xHydro` is still a work in progress and is heavily hard-coded, particularly regarding inputs. Expect significant changes as the code is refactored and improved.

</div>

The datasets need to follow specific formatting requirements.

For the observed dataset (`qobs` in this example), the following conditions must be met:
- The dimensions should be `station` and `time`.
- The streamflow data must be stored in a variable called `q` (the example above renames the original `streamflow` variable accordingly).
- The catchment drainage area must be represented in a variable named `drainage_area`.
- The latitude and longitude of the catchment centroids must be stored as `centroid_lat` and `centroid_lon` (these are not the hydrometric station coordinates).
- A variable called `station_id` must exist, containing a unique identifier for each station. This will be used to match the observation stations with their corresponding simulated stations.

For the simulation dataset (`qsim` in this example), the following requirements apply:
- The dimensions should be `station` and `time`.
- The streamflow data should be in a variable named `q` (the example above renames the original `streamflow` variable accordingly).
- The drainage area for each catchment, as simulated by the model, should be stored in a variable called `drainage_area`.
- The centroids of the catchments must be represented by the coordinates `lat` and `lon`.
- A variable called `station_id` must exist, containing a unique identifier for each simulated station, used to map it to the observed stations.

The correspondence table (`station_correspondence` in this example) must include:
- `station_id` for the observed stations.
- `reach_id` for the simulated stations.
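Before running the interpolation, it can help to check these requirements explicitly. The helper below is a hypothetical convenience, not part of `xHydro`; it only assumes that the object exposes `.dims` and supports the `in` operator for variable names, as `xarray.Dataset` does.

```python
def missing_entries(ds, required_dims, required_names):
    """List the required dimensions and variables/coordinates absent from ds."""
    missing = [f"dimension '{d}'" for d in required_dims if d not in ds.dims]
    missing += [f"variable '{n}'" for n in required_names if n not in ds]
    return missing


# Example usage with the observed dataset from this page:
# missing_entries(
#     qobs,
#     ["station", "time"],
#     ["drainage_area", "centroid_lat", "centroid_lon", "station_id"],
# )
```

An empty list means that every listed dimension and variable was found.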

Optimal interpolation in `xHydro` is primarily accessed through the `xhydro.optimal_interpolation.execute` function. When performing leave-one-out cross-validation across multiple catchments, the entire interpolation process is repeated for each catchment. In each iteration, one observation station is left out and kept independent for validation. This process can be time-consuming, but it can be parallelized by enabling the `parallelize` flag and setting `max_cores` according to your machine’s capacity. By default, the code uses only 1 core. If you increase it, the number of cores actually used is capped at `([number-of-available-cores / 2] - 1)` to avoid overloading your computer.
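The two ideas in this paragraph can be sketched in plain Python. Both helpers are hypothetical illustrations rather than `xHydro` functions, and reading the bracketed division in the cap formula as an integer division is an assumption.

```python
import os


def capped_cores(requested, available=None):
    """Limit the requested core count to (available // 2 - 1), never below 1."""
    if available is None:
        available = os.cpu_count() or 1
    cap = max(1, available // 2 - 1)
    return max(1, min(requested, cap))


def leave_one_out_splits(stations):
    """Yield (held_out_station, remaining_stations) for each station in turn."""
    for i, held_out in enumerate(stations):
        yield held_out, stations[:i] + stations[i + 1:]


# Made-up station identifiers.
splits = list(leave_one_out_splits(["A", "B", "C"]))
```

Each split would correspond to one full interpolation run that uses only the remaining stations, with the held-out station reserved to evaluate the result.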


In [None]:
help(xhydro.optimal_interpolation.execute)

In [None]:
ds = xh.optimal_interpolation.execute(
    qobs=qobs.sel(time=slice("2018-11-01", "2019-01-01")),
    qsim=qsim.sel(time=slice("2018-11-01", "2019-01-01")),
    station_correspondence=station_correspondence,
    observation_stations=observation_stations,
    form=1,
    ratio_var_bg=0.15,
    percentiles=[25, 50, 75],
    parallelize=False,
    max_cores=1,
    leave_one_out_cv=False,
)

ds

The returned dataset contains a streamflow variable called `q` with the dimensions `[percentile, station_id, time]`, providing estimates for any requested percentile to assess uncertainty. Let's now explore how the optimal interpolation has changed the streamflow at one catchment.


In [None]:
# Get the pair of station IDs (observed and simulated) for one of the stations used for optimal interpolation
pair = station_correspondence.where(
    station_correspondence.station_id == observation_stations[0], drop=True
)
obs_id = pair["station_id"].data
sim_id = pair["reach_id"].data

# Get the streamflow data
observed_flow_select = (
    qobs["q"]
    .where(qobs.station_id == obs_id, drop=True)
    .sel(time=slice("2018-11-01", "2019-01-01"))
    .squeeze()
)
raw_simulated_flow_select = (
    qsim["q"]
    .where(qsim.station_id == sim_id, drop=True)
    .sel(time=slice("2018-11-01", "2019-01-01"))
    .squeeze()
)
interpolated_flow_select = ds["q"].sel(
    station_id=sim_id[0], percentile=50.0, time=slice("2018-11-01", "2019-01-01")
)

In [None]:
plt.plot(observed_flow_select, label="Observed flow")
plt.plot(raw_simulated_flow_select, label="Raw simulation")
plt.plot(interpolated_flow_select, label="Interpolated simulation")
plt.xlabel("Simulation day")
plt.ylabel("Streamflow (m³/s)")
plt.legend()
plt.show()

We can observe that optimal interpolation generally helped bring the model simulation closer to the observed data.
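A visual comparison can be complemented with a score. The Nash-Sutcliffe efficiency (NSE) below is a commonly used metric chosen here for illustration; it is not part of the original example, and the function name and sample values are assumptions.

```python
import numpy as np


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    # Keep only the time steps where both series have a value.
    valid = ~(np.isnan(obs) | np.isnan(sim))
    obs, sim = obs[valid], sim[valid]
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)


# Made-up series: a simulation biased by +1 m3/s against the observations.
obs_demo = np.array([1.0, 2.0, 3.0, 4.0])
score_biased = nse(obs_demo, obs_demo + 1.0)
score_perfect = nse(obs_demo, obs_demo)
```

With the arrays from this page, one could compare `nse(observed_flow_select.values, raw_simulated_flow_select.values)` with `nse(observed_flow_select.values, interpolated_flow_select.values)`. Since this station was itself used in the interpolation, such a comparison is not an independent validation; the leave-one-out option exists for that purpose.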
