# Frequency analysis module - Regional analysis

In [None]:
import matplotlib.pyplot as plt
import xdatasets as xd
from lmoments3.distr import KappaGen
from sklearn.cluster import HDBSCAN, OPTICS, AgglomerativeClustering

import xhydro as xh
import xhydro.frequency_analysis as xhfa
import xhydro.gis as xhgis
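from xhydro.frequency_analysis.regional import (
    calc_h_z,
    calc_moments,
    calculate_rp_from_afr,
    get_group_from_fit,
    group_ds,
    mask_h_z,
    remove_small_regions,
)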

This notebook demonstrates how to use the xhydro package to perform a regional frequency analysis on a dataset of streamflow data. The first steps are similar to those of the local frequency analysis notebook, but we will keep them simple to focus on the regional frequency analysis.

Let's start by getting the stations of region 02 that are natural and have a minimum duration of 15 years.

In [None]:
ds = (
    xd.Query(
        **{
            "datasets": {
                "deh": {
                    "id": ["02*"],
                    "regulated": ["Natural"],
                    "variables": ["streamflow"],
                }
            },
            "time": {"start": "1970-01-01", "minimum_duration": (15 * 365, "d")},
        }
    )
    .data.squeeze()
    .load()
)

# This dataset lacks some attributes, so let's add them.
ds["id"].attrs["cf_role"] = "timeseries_id"
ds["streamflow"].attrs = {
    "long_name": "Streamflow",
    "units": "m3 s-1",
    "standard_name": "water_volume_transport_in_river_channel",
    "cell_methods": "time: mean",
}

ds

Here, we mask years with more than 15% of missing data and compute the annual and spring maxima.


In [None]:
timeargs = {
    "spring": {"date_bounds": ["02-11", "06-19"]},
    "annual": {},
}

ds_4fa = xh.indicators.get_yearly_op(
    ds, op="max", timeargs=timeargs, missing="pct", missing_options={"tolerance": 0.15}
)
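To make the missing-data rule more concrete, here is a minimal stdlib-only sketch. It is not xhydro's implementation: the helper `yearly_max_with_tolerance` is hypothetical, missing values are assumed to be NaN, and the exact handling of the boundary case (exactly 15% missing) in xhydro may differ.

```python
import math


def yearly_max_with_tolerance(values, tolerance=0.15):
    """Return the max of `values`, or None if too many values are missing.

    Hypothetical helper: missing values are assumed to be NaN.
    """
    n_missing = sum(1 for v in values if math.isnan(v))
    if n_missing / len(values) > tolerance:
        return None  # the year is masked
    return max(v for v in values if not math.isnan(v))


nan = float("nan")
# Made-up series standing in for one year of data each.
mostly_complete = [1.0, 5.0, nan, 3.0, 2.0, 4.0, 6.0, 2.5, 3.5, 1.5]  # 10% missing
too_sparse = [1.0, nan, nan, 3.0, 2.0, 4.0, 6.0, 2.5, 3.5, 1.5]  # 20% missing

print(yearly_max_with_tolerance(mostly_complete))
print(yearly_max_with_tolerance(too_sparse))
```

The first series keeps its maximum, while the second is masked because its missing fraction exceeds the tolerance.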

## Explanatory variables

### a) Extraction using `xhydro.gis`

Regional frequency analyses rely on explanatory variables to link the information at the various sites. For this example, we'll use catchment properties, but other variables such as climatological averages or land use data could also be used. Refer to the GIS example for more details.

In [None]:
gdf = xd.Query(
    **{
        "datasets": {
            "deh_polygons": {
                "id": ["02*"],
                "regulated": ["Natural"],
                "variables": ["streamflow"],
            }
        },
        "time": {"start": "1970-01-01", "minimum_duration": (15 * 365, "d")},
    }
).data.reset_index()

dswp = xhgis.watershed_properties(
    gdf[["Station", "geometry"]], unique_id="Station", output_format="xarray"
)
cent = dswp["centroid"].to_numpy()
lon = [ele[0] for ele in cent]
lat = [ele[1] for ele in cent]
dswp = dswp.assign(lon=("Station", lon))
dswp = dswp.assign(lat=("Station", lat))
dswp = dswp.drop_vars("centroid")
dswp

### b) Principal component analysis
To do our regional frequency analysis, we'll process the data with a principal component analysis (PCA) to reduce the dimensionality of our dataset.
The function `xhfa.regional.fit_pca` takes a `xarray.Dataset` as input and returns a `xarray.DataArray` with the principal components, along with the fitted PCA object.

In [6]:
data, pca = xhfa.regional.fit_pca(dswp, n_components=3)

We can see that the correlation between the components is close to 0, which means that the first 3 components are sufficiently uncorrelated to be used for the rest of our analysis.

In [None]:
data.to_dataframe(name="value").reset_index().pivot(
    index="Station", columns="components"
).corr()
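The near-zero correlation is expected, since PCA scores are uncorrelated by construction. The sketch below illustrates this with only two made-up descriptors. For two standardized variables, the principal axes are simply the two diagonals, so no linear algebra library is needed. The values and variable names are illustrative and are not taken from `dswp`.

```python
import math


def standardize(xs):
    """Center and scale a list of values to zero mean and unit variance."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / std for x in xs]


def correlation(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    zx, zy = standardize(xs), standardize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)


# Made-up catchment descriptors for six sites (illustrative values only).
area = [120.0, 340.0, 560.0, 800.0, 1500.0, 2300.0]
perimeter = [60.0, 95.0, 140.0, 150.0, 260.0, 300.0]

z_area, z_perim = standardize(area), standardize(perimeter)

# For two standardized variables, the principal axes are the two diagonals.
pc1 = [(a + b) / math.sqrt(2) for a, b in zip(z_area, z_perim)]
pc2 = [(a - b) / math.sqrt(2) for a, b in zip(z_area, z_perim)]

print(f"corr(area, perimeter) = {correlation(area, perimeter):.3f}")
print(f"corr(pc1, pc2)        = {correlation(pc1, pc2):.3f}")
```

The original descriptors are strongly correlated, whereas the two component scores are not.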

### c) Clustering
In this example, we'll use `AgglomerativeClustering`, but other methods would also provide valid results. The regional clustering itself is performed using `xhfa.regional.get_group_from_fit`, which can take the arguments of the scikit-learn functions as a dictionary.

In [None]:
groups = xhfa.regional.get_group_from_fit(
    AgglomerativeClustering, {"n_clusters": 3}, data
)
groups

## Regional analysis
**Hosking and Wallis** developed a method for regional frequency analysis that uses L-moments to analyze extreme values across different regions. Here’s a concise overview:
1. **L-Moments**: L-moments are summary statistics derived from linear combinations of order statistics. They are less sensitive to outliers compared to traditional moments (like mean and variance) and provide more robust estimates, especially for small sample sizes.
2. **Regional Frequency Analysis**: This approach involves pooling data from multiple sites or regions to determine the frequency distribution of extreme events, such as floods. Hosking and Wallis’s methods involve estimating the parameters of regional frequency distributions and evaluating the fit of these distributions to the data.
3. **Regional L-Moments**: These are used to summarize data from various sites within a region. By applying L-moment-based methods, parameters can be estimated, and the frequency of extreme events can be assessed across the region.
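As an illustration of the first point, the first three sample L-moments can be estimated from unbiased probability-weighted moments. This is a minimal stdlib-only sketch of the standard estimators. The function name `sample_l_moments` is hypothetical and the code is not necessarily how xhydro computes them.

```python
def sample_l_moments(values):
    """Return (l1, l2, l3) estimated from probability-weighted moments."""
    x = sorted(values)
    n = len(x)
    if n < 3:
        raise ValueError("At least three values are needed.")
    # Unbiased probability-weighted moments b0, b1, b2 (i is the 0-based rank).
    b0 = sum(x) / n
    b1 = sum(i * xi for i, xi in enumerate(x)) / (n * (n - 1))
    b2 = sum(i * (i - 1) * xi for i, xi in enumerate(x)) / (n * (n - 1) * (n - 2))
    l1 = b0  # location (the mean)
    l2 = 2 * b1 - b0  # scale
    l3 = 6 * b2 - 6 * b1 + b0  # related to asymmetry
    return l1, l2, l3


# Made-up annual maxima (m3/s), for illustration only.
annual_max = [1.0, 2.0, 3.0, 4.0, 5.0]
l1, l2, l3 = sample_l_moments(annual_max)
l_cv = l2 / l1  # L-coefficient of variation
l_skew = l3 / l2  # L-skewness
print(l1, l2, l_cv, l_skew)
```

Because the made-up sample is symmetric, its L-skewness is zero. The ratios `l_cv` and `l_skew` are the kind of dimensionless quantities that are compared between sites in a region.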


We calculate the L-moments for each station.


In [None]:
ds_moment = calc_moments(ds_4fa)
ds_moment

We need to reshape our datasets of annual maximums and L-moments according to the groupings found using the clustering algorithm. Since there is no convention for the name of that new dimension, xHydro requires it to be called `group_id`.

In [10]:
ds_groups = group_ds(ds_4fa, groups)
ds_moments_groups = group_ds(ds_moment, groups)

### H-Score (Homogeneity Score)

The **H-Score** measures the homogeneity of data across different sites or regions relative to the regional model:

- **H < 1: Homogeneous** - Indicates that data from different sites are quite similar and fit well with the regional model. This suggests that the model is appropriate for the region as a whole.

- **1 ≤ H < 2: Maybe Homogeneous** - Suggests some degree of heterogeneity, but the data might still fit reasonably well with the regional model. There could be some variations that the model does not fully capture.

- **H ≥ 2: Heterogeneous** - Indicates significant differences between sites or regions, suggesting that the model may not be suitable for all the data. The regions might be too diverse, or the model might need adjustments.

### Z-Score (Goodness of Fit)

The **Z-Score** assesses how well the theoretical distribution (based on the regional model) fits the observed data:

- **Z-Score Calculation**: This score quantifies the discrepancy between observed and expected values, standardized by their variability. It indicates whether the differences are statistically significant.

- **Interpretation**:

    - **Low Z-Score**: A good fit of the model to the observed data. Typically, an absolute value of the Z-Score below 1.64 suggests that the model is appropriate and the fit is statistically acceptable.
    
    - **High Z-Score**: Indicates significant discrepancies between the observed and expected values. An absolute value above 1.64 suggests that the model may not fit the data well, and adjustments might be necessary.


To calculate H and Z, we also need a `KappaGen` object from the lmoments3 library. This library is not part of the xhydro package, so it needs to be installed separately.

In [None]:
kap = KappaGen()
ds_H_Z = calc_h_z(ds_groups, ds_moments_groups, kap)
ds_H_Z

We filter the data to keep only the groups whose H and Z are below the thresholds. The thresholds can be specified, but default to 1 for H and 1.64 for Z.

In [12]:
mask = mask_h_z(ds_H_Z)
ds_groups_H1 = ds_groups.where(mask).load()
ds_moments_groups_H1 = ds_moments_groups.where(mask).load()
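The decision rule behind this mask can be sketched in a few lines. The helpers `classify_h` and `keep_region` are hypothetical, and the use of strict inequalities and of the absolute value of Z is an assumption based on the interpretation given above, not a description of `mask_h_z` itself.

```python
def classify_h(h):
    """Label a region according to its H-Score."""
    if h < 1:
        return "homogeneous"
    if h < 2:
        return "maybe homogeneous"
    return "heterogeneous"


def keep_region(h, z, h_max=1.0, z_max=1.64):
    """Keep a region only if both H and |Z| are below their thresholds."""
    return h < h_max and abs(z) < z_max


# Made-up (H, Z) pairs for three regions.
regions = {"A": (0.4, -0.9), "B": (1.3, 0.2), "C": (0.7, 2.1)}

for name, (h, z) in regions.items():
    print(name, classify_h(h), keep_region(h, z))

kept = [name for name, (h, z) in regions.items() if keep_region(h, z)]
print(kept)
```

Region B fails on homogeneity and region C fails on goodness of fit, so only region A is kept.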

In [13]:
# Centiles and return periods :
centiles = [x / 100.0 for x in range(101)]
return_periods = [
    2,
    10,
    100,
    1000,
    10000,
]
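As a reminder, for annual maxima a return period T (in years) corresponds to an annual non-exceedance probability of 1 - 1/T. The short sketch below makes the conversion explicit. The function name is hypothetical and the list of return periods is repeated so that the example is self-contained.

```python
def non_exceedance_probability(return_period):
    """Annual non-exceedance probability for a return period in years."""
    if return_period <= 1:
        raise ValueError("The return period must be greater than 1 year.")
    return 1.0 - 1.0 / return_period


return_periods = [2, 10, 100, 1000, 10000]
probabilities = [non_exceedance_probability(t) for t in return_periods]

for t, p in zip(return_periods, probabilities):
    print(f"T = {t:>5} years -> p = {p}")
```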

We can now calculate the quantiles for each group and return period. Also, since we don't want to base our analysis on very small regions, `remove_small_regions` removes any region smaller than a given threshold. By default, this threshold is 5.

In [14]:
Q_T = calculate_rp_from_afr(ds_groups_H1, ds_moments_groups_H1, return_periods)
Q_T = remove_small_regions(Q_T)

To plot the results, let's look at station 023401.

In [17]:
Q_reg = Q_T.sel(id="023401").dropna(dim="group_id", how="all")
reg = Q_reg.streamflow_max_annual.squeeze()

Let's compare the local and regional results.

In [18]:
params_loc = xhfa.local.fit(ds_4fa)
Q_loc = xhfa.local.parametric_quantiles(params_loc, return_periods)
loc = Q_loc.sel(id="023401", scipy_dist="genextreme").streamflow_max_annual

In [None]:
fig = plt.figure(figsize=(15, 4))
plt.plot(reg.return_period.values, reg.values, "blue", label="Regional")
plt.plot(loc.return_period.values, loc.values, "red", label="Local")
plt.xscale("log")
plt.grid(visible=True)
plt.xlabel("Return period (years)")
plt.ylabel("Discharge (m$^3$/s)")
plt.legend()

# Uncertainties
## Local frequency analysis uncertainties
To add some uncertainties, we will work with only one catchment and two distributions, as uncertainty estimation can be computationally intensive.
We select station 023401 and the 'genextreme' and 'pearson3' distributions.

For the local frequency analysis, we need to fit the distributions, so the computation time can be long.

In [20]:
ds_4fa_one_station = ds_4fa.sel(id="023401")
params_loc_one_station = params_loc.sel(
    id="023401", scipy_dist=["genextreme", "pearson3"]
)

### Bootstrapping the observations
One way to estimate uncertainties is to bootstrap the observations, here 200 times.

In [21]:
ds_4fa_iter = xhfa.uncertainities.boostrap_obs(ds_4fa_one_station, 200)
params_boot_obs = xhfa.local.fit(ds_4fa_iter, distributions=["genextreme", "pearson3"])

In [22]:
Q_boot_obs = xhfa.local.parametric_quantiles(
    params_boot_obs.load(), return_periods
).squeeze()
Q_boot_obs = Q_boot_obs.streamflow_max_annual
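The idea of bootstrapping the observations can be sketched with the standard library alone. The data are made up, the helper names are hypothetical, and the statistic is simply the sample mean instead of a quantile from a refitted distribution, so this only illustrates the resampling step.

```python
import random


def bootstrap_samples(values, n_samples, seed=0):
    """Resample `values` with replacement, `n_samples` times."""
    rng = random.Random(seed)
    return [rng.choices(values, k=len(values)) for _ in range(n_samples)]


def percentile(sorted_values, q):
    """Simple rank-based percentile of an already sorted list (q in [0, 1])."""
    index = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return sorted_values[index]


# Made-up annual maxima (m3/s), for illustration only.
annual_max = [212.0, 180.0, 305.0, 260.0, 198.0, 340.0, 275.0, 230.0, 290.0, 250.0]
samples = bootstrap_samples(annual_max, n_samples=200)

# Spread of the statistic of interest (here, the mean) across the resamples.
sample_means = sorted(sum(s) / len(s) for s in samples)
low, high = percentile(sample_means, 0.05), percentile(sample_means, 0.95)
print(f"90% interval of the mean: [{low:.1f}, {high:.1f}]")
```

Each resample has the same length as the original series and only contains observed values.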

### Resampling the fitted distributions
Here, instead of resampling the observations, we resample the fitted distributions 200 times to get the uncertainty.

In [23]:
values = xhfa.uncertainities.boostrap_dist(
    ds_4fa_one_station, params_loc_one_station, 200
)
params_boot_dist = xhfa.uncertainities.fit_boot_dist(values)

In [24]:
Q_boot_dist = xhfa.local.parametric_quantiles(
    params_boot_dist.load(), return_periods
).squeeze()
Q_boot_dist = Q_boot_dist.streamflow_max_annual
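This parametric approach can be sketched as follows. To keep the example stdlib-only, a Gumbel distribution fitted by the method of moments is swapped in for the 'genextreme' and 'pearson3' fits used in this notebook. The data are made up and all function names are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def fit_gumbel(values):
    """Method-of-moments estimates (loc, scale) of a Gumbel distribution."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    scale = std * math.sqrt(6) / math.pi
    loc = mean - EULER_GAMMA * scale
    return loc, scale


def gumbel_quantile(loc, scale, p):
    """Value with non-exceedance probability p (inverse CDF)."""
    return loc - scale * math.log(-math.log(p))


def draw_gumbel(loc, scale, size, rng):
    """Draw `size` values from the fitted distribution by inverse sampling."""
    draws = []
    while len(draws) < size:
        u = rng.random()
        if u > 0.0:  # avoid log(0)
            draws.append(gumbel_quantile(loc, scale, u))
    return draws


# Made-up annual maxima (m3/s), for illustration only.
annual_max = [212.0, 180.0, 305.0, 260.0, 198.0, 340.0, 275.0, 230.0, 290.0, 250.0]
loc, scale = fit_gumbel(annual_max)

rng = random.Random(0)
q100 = []
for _ in range(200):
    # Draw a synthetic series from the fitted distribution, then refit it.
    synthetic = draw_gumbel(loc, scale, len(annual_max), rng)
    q100.append(gumbel_quantile(*fit_gumbel(synthetic), p=1 - 1 / 100))

q100.sort()
print(f"100-year quantile, 5th to 95th percentile: [{q100[10]:.0f}, {q100[189]:.0f}]")
```

The key difference from the previous approach is that the synthetic series are drawn from the fitted distribution instead of from the observations themselves.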

In [25]:
loc_dist = Q_boot_dist.sel(scipy_dist="genextreme")
loc_obs = Q_boot_obs.sel(scipy_dist="genextreme")

In [None]:
fig, ax = plt.subplots()
fig.set_figheight(4)
fig.set_figwidth(15)

ax.plot(reg.return_period.values, reg.values, "blue", label="Regional")
ax.plot(
    loc_obs.return_period.values,
    loc_obs.quantile(0.5, "samples"),
    "red",
    label="bootstrap obs",
)
loc_obs_05 = loc_obs.quantile(0.05, "samples")
loc_obs_95 = loc_obs.quantile(0.95, "samples")
ax.fill_between(
    loc_obs.return_period.values, loc_obs_05, loc_obs_95, alpha=0.2, color="red"
)
loc_dist_05 = loc_dist.quantile(0.05, "samples")
loc_dist_95 = loc_dist.quantile(0.95, "samples")
ax.fill_between(
    loc_dist.return_period.values, loc_dist_05, loc_dist_95, alpha=0.2, color="green"
)
ax.plot(
    loc_dist.return_period.values,
    loc_dist.quantile(0.5, "samples"),
    "green",
    label="bootstrap dist",
)
plt.xscale("log")
plt.grid(visible=True)
plt.xlabel("Return period (years)")
plt.ylabel("Discharge (m$^3$/s)")
ax.legend()

## Regional frequency analysis uncertainties
### Bootstrapping the observations

For the regional analysis, we again use `boostrap_obs` to resample the observations, but, this time, it's much faster as no fit is involved.

In [27]:
ds_reg_samples = xhfa.uncertainities.boostrap_obs(ds_4fa, 200)
ds_moments_iter = xhfa.uncertainities.calc_moments_iter(ds_reg_samples).load()

In [28]:
Q_reg_boot = xhfa.uncertainities.calc_q_iter(
    "023401", "streamflow_max_annual", ds_groups_H1, ds_moments_iter, return_periods
)

In [29]:
reg_boot = Q_reg_boot.streamflow_max_annual.sel(id="023401")

Since we'll make a few plots to illustrate the results, let's write a function to simplify things a little.

In [30]:
def plot_ds_with_CI(
    ds_list, CI_dim_list, color_list, label_list, x_label, y_label, title=None
):
    fig, ax = plt.subplots()
    fig.set_figheight(4)
    fig.set_figwidth(15)

    for ds, CI_dim, color, label in zip(ds_list, CI_dim_list, color_list, label_list):
        x = ds.return_period.values
        # Median and 5th/95th percentiles along the chosen dimension
        y_50 = ds.quantile(0.5, CI_dim)
        y_05 = ds.quantile(0.05, CI_dim)
        y_95 = ds.quantile(0.95, CI_dim)
        ax.plot(x, y_50, color, label=label)
        ax.fill_between(x, y_05, y_95, alpha=0.2, color=color)

    ax.set_xscale("log")
    ax.grid(visible=True)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    if title is not None:
        ax.set_title(title)
    ax.legend()

In [None]:
plot_ds_with_CI(
    [loc_obs, loc_dist, reg_boot],
    ["samples", "samples", "samples"],
    ["blue", "green", "red"],
    ["bootstrap obs", "bootstrap dist", "Regional bootstrap"],
    "Return period (years)",
    "Discharge (m$^3$/s)",
)

### Multiple regions
Another way to estimate the uncertainty is to place the catchment of interest in many different regions. We can achieve this by trying different clustering methods or by performing a jackknife on the station list. We don't run too many tests here, since they can take quite a while and the goal is only to illustrate the possibilities.

We will try three clustering methods and, for each method, vary one of its parameters.

In [32]:
PARAM = {
    AgglomerativeClustering: {"arg_name": "n_clusters", "range": range(2, 12)},
    HDBSCAN: {"arg_name": "min_cluster_size", "range": range(6, 7)},
    OPTICS: {"arg_name": "min_samples", "range": range(4, 5)},
}

We now generate station combinations by removing 0 to n stations.

In [33]:
n = 2
combinations_list = xhfa.uncertainities.generate_combinations(data, n)
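To give an idea of what removing 0 to n stations means, here is a minimal sketch based on `itertools.combinations`. The helper `leave_out_combinations` is hypothetical and only reflects our reading of the description above. The actual behaviour of `generate_combinations` may differ.

```python
from itertools import combinations


def leave_out_combinations(stations, n):
    """All subsets obtained by removing 0, 1, ..., n stations."""
    result = []
    for n_removed in range(n + 1):
        result.extend(combinations(stations, len(stations) - n_removed))
    return result


# Made-up station identifiers.
stations = ["A", "B", "C", "D"]
combos = leave_out_combinations(stations, n=2)

print(len(combos))
for combo in combos:
    print(combo)
```

With 4 stations and n = 2, this gives 1 + 4 + 6 = 11 combinations. The count grows quickly with the number of stations, which is why n is kept small.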

So our station, instead of being in a single region, will be part of many regions.

In [34]:
groups = []

for model in [AgglomerativeClustering, HDBSCAN, OPTICS]:

    for p in PARAM[model]["range"]:
        d_param = {}
        d_param[PARAM[model]["arg_name"]] = p
        for combination in combinations_list:
            # Extract data for the current combination
            data_com = data.sel(Station=list(combination))
            # Get groups from the fit and add to the list
            groups = groups + get_group_from_fit(model, d_param, data_com)
unique_groups = [list(x) for x in {tuple(x) for x in groups}]
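The last line of the cell above removes duplicated groups. The toy example below shows the idiom on made-up station lists. The second variant, which sorts each group first, is an optional variation and not something the notebook does.

```python
# Made-up groups: the first and third are identical, the second and fourth
# contain the same stations in a different order.
groups = [["A", "B", "C"], ["D", "E"], ["A", "B", "C"], ["E", "D"]]

# Same idiom as above: lists are not hashable, so convert them to tuples first.
unique_as_is = [list(g) for g in {tuple(g) for g in groups}]

# Optional variation: sort each group so that member order does not matter.
unique_sorted = [list(g) for g in {tuple(sorted(g)) for g in groups}]

print(len(unique_as_is), len(unique_sorted))
```

Note that going through a set does not preserve the original order of the groups.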

The following steps are similar to the previous ones, just with more regions.

In [35]:
ds_groups = group_ds(ds_4fa, unique_groups)
ds_moments_groups = group_ds(ds_moment, unique_groups)

In [36]:
kap = KappaGen()
ds_H_Z = calc_h_z(ds_groups, ds_moments_groups, kap)

In [37]:
mask = mask_h_z(ds_H_Z)
ds_groups_H1 = ds_groups.where(mask).load()
ds_moments_groups_H1 = ds_moments_groups.where(mask).load()

Q_T = calculate_rp_from_afr(ds_groups_H1, ds_moments_groups_H1, return_periods)
Q_T = remove_small_regions(Q_T)

Q = Q_T.sel(id="023401").dropna(dim="group_id", how="all")

In [38]:
regional_multiple_region = Q.streamflow_max_annual

In [39]:
ds_moment = calc_moments(ds_4fa)

In [None]:
plot_ds_with_CI(
    [loc_obs, loc_dist, regional_multiple_region],
    ["samples", "samples", "group_id"],
    ["blue", "green", "red"],
    ["bootstrap obs", "bootstrap dist", "regional_multiple_region"],
    "Return period (years)",
    "Discharge (m$^3$/s)",
)

### Combining bootstrap and multiple regions

`calc_q_iter` checks in how many `group_id` the station is present and stacks them with the bootstrap samples.
In this case, we bootstrapped 200 times and have 533 regions, which results in 103600 samples (200 for each region in which the station is present).


In [None]:
Q_reg_boot = xhfa.uncertainities.calc_q_iter(
    "023401", "streamflow_max_annual", ds_groups_H1, ds_moments_iter, return_periods
)
Q_reg_boot

In [42]:
regional_multiple_region_boot = Q_reg_boot.sel(id="023401").streamflow_max_annual

In [None]:
plot_ds_with_CI(
    [loc_obs, loc_dist, regional_multiple_region, regional_multiple_region_boot],
    ["samples", "samples", "group_id", "samples"],
    ["blue", "green", "red", "black"],
    [
        "bootstrap obs",
        "bootstrap dist",
        "regional_multiple_region",
        "regional_multiple_region_boot",
    ],
    "Return period (years)",
    "Discharge (m$^3$/s)",
)