# Climate change analysis of hydrological data

In [None]:
# Imports
from pathlib import Path

import hvplot.xarray  # noqa
import numpy as np
import pooch
import xarray as xr
import xclim

import xhydro as xh
from xhydro.testing.helpers import deveraux

D = deveraux()

# Future streamflow file (1 file - Hydrotel driven by BCC-CSM-1.1(m))
streamflow_file = D.fetch("cc_indicators/streamflow_BCC-CSM1.1-m_rcp45.nc")

# Reference mean annual streamflow (QMOYAN) for 6 calibrations of Hydrotel
reference_files = D.fetch("cc_indicators/reference.zip", pooch.Unzip())

# Future deltas of QMOYAN (63 simulations x 6 calibrations of Hydrotel)
deltas_files = D.fetch("cc_indicators/deltas.zip", pooch.Unzip())

While there is a vast array of analyses that can be performed to assess the impacts of climate change on hydrology, this notebook covers some of the most common steps:

- Computing a list of relevant indicators over climatological periods.
- Computing future differences to assess the changes.
- Computing ensemble statistics to evaluate future changes and variability.

<div class="alert alert-info"><b>INFO</b>

Several functions from the `xscen` library have been integrated into `xhydro` to simplify access for users, such as those in `xhydro.indicators` and `xhydro.cc`. This notebook will cover the basics, but for further details on these functions, please refer to the following resources:

- [compute_indicators](https://xscen.readthedocs.io/en/latest/notebooks/2_getting_started.html#Computing-indicators)
- [climatological_op](https://xscen.readthedocs.io/en/latest/notebooks/2_getting_started.html#Climatological-operations)
- [compute_deltas](https://xscen.readthedocs.io/en/latest/notebooks/2_getting_started.html#Computing-deltas)
- [ensemble_statistics](https://xscen.readthedocs.io/en/latest/notebooks/2_getting_started.html#Ensemble-statistics)

</div>


## Computing hydrological indicators over a given time period

Hydrological indicators can be categorized into two main types:

- Frequential indicators: These indicators describe hydrological events that occur at recurring intervals. They include metrics like the maximum 20-year flow (`Qmax20`) or the minimum 2-year 7-day average flow in summer (`Q7min2_summer`). The methodology for computing these indicators is covered in the [Local Frequency Analysis](local_frequency_analysis.ipynb) notebook.
- Non-frequential indicators: These indicators do not explicitly describe recurrence, but rather absolute values or trends in hydrological variables. They include metrics like average yearly flow.

Since frequential indicators are already covered in another example, this notebook will focus on the methodology for computing non-frequential indicators using `xhydro.indicators.compute_indicators`. This function is built on top of `xclim` and supports both predefined indicators, such as `xclim.indicators.land.doy_qmax`, and custom indicators created using `xclim.core.indicator.Indicator.from_dict`. The latter option can be quite complex; see the box below for more information. For advanced users, indicator construction can also be defined through a YAML file.

The output of `xhydro.indicators.compute_indicators` is a dictionary, where each key represents the frequency of the requested indicators, following the `pandas` nomenclature. In our example, we will only use yearly data starting in January, so the frequency will be `YS-JAN`.


<div class="alert alert-info"> <b>INFO</b>

Custom indicators in `xHydro` are built by following the YAML formatting required by `xclim`.

A custom indicator built using `xclim.core.indicator.Indicator.from_dict` will need these elements:

- "data": A dictionary with the following information:
  - "base": The "YAML ID" obtained from [here](https://xclim.readthedocs.io/en/stable/indicators.html).
  - "input": A dictionary linking the default xclim input to the name of your variable. Needed only if it is different. In the link above, they are the string following "Uses:".
  - "parameters": A dictionary containing all other parameters for a given indicator. In the link above, the easiest way to access them is by clicking the link in the top-right corner of the box describing a given indicator.
  - More entries can be used here, as described [in the xclim documentation](https://xclim.readthedocs.io/en/latest/api.html#yaml-file-structure) under "identifier".
- "identifier": A custom name for your indicator. This will be the name returned in the results.
- "module": Needed, but can be anything. To prevent an accidental overwriting of `xclim` indicators, it is best to use something different from: ["atmos", "land", "generic"].

</div>


The example file used in this notebook is a daily time series of streamflow data, generated from the HYDROTEL hydrological model. This data is driven by bias-adjusted outputs from the BCC-CSM-1.1(m) climate model (RCP4.5), spanning the years 1950 to 2100. For this example, the dataset includes data from just 2 stations. The function `xhydro.indicators.compute_indicators` can be used with any number of indicators. Here, we will compute the mean annual flow and the mean summer-fall flow.
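
For intuition about what these two indicators measure, here is a plain NumPy sketch on a synthetic daily series. It is only an illustration under assumed values (the flows and dates are made up), not how `xclim` computes the indicators internally.

```python
import numpy as np

# Synthetic daily series for 2001-2002 (hypothetical values):
# 20 m3/s from June to November, 10 m3/s for the rest of the year
dates = np.arange("2001-01-01", "2003-01-01", dtype="datetime64[D]")
years = dates.astype("datetime64[Y]").astype(int) + 1970
months = dates.astype("datetime64[M]").astype(int) % 12 + 1
summer_fall = np.isin(months, [6, 7, 8, 9, 10, 11])
q = np.where(summer_fall, 20.0, 10.0)

qmoyan = {}  # mean annual flow, one value per year
qmoyea = {}  # mean summer-fall flow, one value per year
for year in np.unique(years):
    in_year = years == year
    qmoyan[int(year)] = float(q[in_year].mean())
    # Restrict the data to the relevant months before averaging
    qmoyea[int(year)] = float(q[in_year & summer_fall].mean())

print(qmoyan, qmoyea)
```

Restricting the data to June through November before averaging plays the same role as the `indexer` parameter of the custom indicator.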


In [None]:
ds = xr.open_dataset(streamflow_file).rename({"streamflow": "q"})
ds.q.hvplot(x="time", grid=True, widget_location="bottom", groupby="station")

In [None]:
help(xh.indicators.compute_indicators)

In [None]:
help(xclim.core.indicator.Indicator.from_dict)

In [None]:
indicators = [
    # 1st indicator: Mean annual flow
    xclim.core.indicator.Indicator.from_dict(
        data={
            "base": "stats",
            "input": {"da": "q"},
            "parameters": {"op": "mean"},
        },
        identifier="QMOYAN",
        module="hydro",
    ),
    # 2nd indicator: Mean summer-fall flow
    xclim.core.indicator.Indicator.from_dict(
        data={
            "base": "stats",
            "input": {"da": "q"},
            "parameters": {"op": "mean", "indexer": {"month": [6, 7, 8, 9, 10, 11]}},
        },  # The indexer is used to restrict available data to the relevant months only
        identifier="QMOYEA",
        module="hydro",
    ),
]

# Call compute_indicators
dict_indicators = xh.indicators.compute_indicators(ds, indicators=indicators)

dict_indicators

In [None]:
dict_indicators["YS-JAN"].QMOYAN.hvplot(
    x="time", grid=True, widget_location="bottom", groupby="station"
)

The next step is to compute averages over climatological periods. This can be done using the `xhydro.cc.climatological_op` function.

If the indicators themselves are not relevant to your analysis and you only need the climatological averages, you can directly use `xhydro.cc.produce_horizon` instead of combining `xhydro.indicators.compute_indicators` with `xhydro.cc.climatological_op`. The key advantage of `xhydro.cc.produce_horizon` is that it eliminates the `time` axis, replacing it with a `horizon` dimension that represents a slice of time. This is particularly useful when computing indicators with different output frequencies. An example of this approach is provided in the [Use Case Example](use_case.ipynb).


In [None]:
help(xh.cc.climatological_op)

In [None]:
# Call climatological_op. Here we don't need 'time' anymore, so we can use horizons_as_dim=True
ds_avg = xh.cc.climatological_op(
    dict_indicators["YS-JAN"],
    op="mean",
    periods=[[1981, 2010], [2011, 2040], [2041, 2070], [2071, 2100]],
    min_periods=29,
    horizons_as_dim=True,
    rename_variables=False,
).drop_vars(["time"])
ds_avg
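
To make the role of `periods` and `min_periods` more concrete, the following sketch computes a 30-year mean with plain NumPy and returns NaN when too few valid years are available. The helper `climatological_mean` and the data are hypothetical, and the handling of missing years is an assumption made for illustration rather than a description of the `xhydro` implementation.

```python
import numpy as np


def climatological_mean(years, values, period, min_periods):
    """Mean of `values` over the inclusive [start, end] period, or NaN if too few valid years."""
    start, end = period
    in_period = (years >= start) & (years <= end)
    valid = values[in_period]
    valid = valid[~np.isnan(valid)]
    if valid.size < min_periods:
        return float("nan")
    return float(valid.mean())


# Hypothetical annual indicator: 100, 101, ..., 159 for the years 1981 to 2040
years = np.arange(1981, 2041)
values = np.linspace(100.0, 159.0, years.size)
values[5] = np.nan  # one missing year (1986)

lenient = climatological_mean(years, values, (1981, 2010), min_periods=29)
strict = climatological_mean(years, values, (1981, 2010), min_periods=30)
print(lenient, strict)
```

With 29 valid years out of 30, the average is only computed when `min_periods` tolerates one missing year.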

Once the averages over time periods have been computed, calculating the differences between future and past values is straightforward. Simply call `xhydro.cc.compute_deltas` to perform this calculation.


In [None]:
help(xh.cc.compute_deltas)

In [None]:
ds_deltas = xh.cc.compute_deltas(
    ds_avg, reference_horizon="1981-2010", kind="%", rename_variables=False
)
ds_deltas
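
As a quick sanity check on what a percentage delta represents, the sketch below applies the usual relative-change formula, `100 * (future - reference) / reference`, to hypothetical horizon averages. The helper name and the numbers are assumptions for illustration only.

```python
def percent_delta(future, reference):
    """Relative change of `future` with respect to `reference`, in percent."""
    return 100.0 * (future - reference) / reference


# Hypothetical 30-year averages of an indicator (m3/s)
horizon_means = {
    "1981-2010": 50.0,
    "2011-2040": 52.0,
    "2041-2070": 55.0,
    "2071-2100": 47.5,
}

reference = horizon_means["1981-2010"]
deltas = {h: percent_delta(v, reference) for h, v in horizon_means.items()}
print(deltas)
```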

In [None]:
# Show the results as DataFrames
print("30-year averages")
display(ds_avg.QMOYAN.isel(station=0).to_dataframe())
print("Deltas")
display(ds_deltas.QMOYAN.isel(station=0).to_dataframe())

## Ensemble statistics

In a real-world application, the steps outlined so far would need to be repeated for all available hydroclimatological simulations. For this example, we will work with a subset of pre-computed deltas from the RCP4.5 simulations used in the 2022 Hydroclimatic Atlas of Southern Quebec.


In [None]:
ds_dict_deltas = {}
for f in deltas_files:
    # Use the file name (without extension) as the simulation ID
    sim_id = Path(f).stem
    ds_dict_deltas[sim_id] = xr.open_dataset(f)

print(f"Loaded data from {len(ds_dict_deltas)} simulations")

It is considered good practice to use multiple climate models when performing climate change analyses, especially since the impacts on the hydrological cycle can be nonlinear. Once multiple hydrological simulations are completed and ready for analysis, you can use `xhydro.cc.ensemble_stats` to access a variety of functions available in `xclim.ensembles`, such as calculating ensemble quantiles or assessing the agreement on the sign of change.

### Weighting simulations

When the ensemble of climate models is heterogeneous—such as when one model provides more simulations than others—it is recommended to weight the results accordingly. While this functionality is not currently available directly through `xhydro` (as it expects metadata specific to `xscen` workflows), the `xscen.generate_weights` function can help create an approximation of the weights based on available metadata.

The following attributes are required for the function to work properly:

- `'cat:source'` in all datasets
- `'cat:driving_model'` in regional climate models
- `'cat:institution'` in all datasets (if `independence_level='institution'`)
- `'cat:experiment'` in all datasets (if `split_experiments=True`)

The `xscen.generate_weights` function offers three possible independence levels:

- `model` (1 Model - 1 Vote): This assigns a total weight of 1 to all unique combinations of `'cat:source'` and `'cat:driving_model'`.
- `GCM` (1 GCM - 1 Vote): This assigns a total weight of 1 to all unique global climate models (GCMs), effectively averaging together regional climate simulations that originate from the same driving model.
- `institution` (1 institution - 1 Vote): This assigns a total weight of 1 to all unique `'cat:institution'` values.

In all cases, the "total weight of 1" is not necessarily distributed equally between the involved simulations. The function will attempt to respect the model genealogy when distributing the weights. For example, if an institution has produced 4 simulations from Model A and 1 simulation from Model B, using `independence_level='institution'` would result in a weight of 0.125 for each Model A run and 0.5 for the single Model B run.
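
The arithmetic of that example can be reproduced with a small pure-Python sketch. The helper `institution_weights` is hypothetical and assumes a simple two-level split (equal shares between the models of an institution, then equal shares between the runs of each model); the actual logic in `xscen.generate_weights` handles more cases.

```python
from collections import defaultdict


def institution_weights(simulations):
    """Return one weight per run; `simulations` is a list of (institution, model) tuples."""
    models_per_institution = defaultdict(set)
    runs_per_model = defaultdict(int)
    for institution, model in simulations:
        models_per_institution[institution].add(model)
        runs_per_model[(institution, model)] += 1

    # Each institution gets a total weight of 1, split between its models,
    # then between the runs of each model
    return [
        1.0 / len(models_per_institution[institution]) / runs_per_model[(institution, model)]
        for institution, model in simulations
    ]


# Hypothetical ensemble: 4 runs of Model A and 1 run of Model B, same institution
sims = [("InstX", "ModelA")] * 4 + [("InstX", "ModelB")]
weights = institution_weights(sims)
print(weights)
```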



In [None]:
import xscen

independence_level = "model"  # 1 Model - 1 Vote
weights = xscen.generate_weights(
    ds_dict_deltas, independence_level=independence_level
)

# Show the results. We multiply by 6 for the showcase here simply because there are 6 hydrological platforms in the results.
weights.where(weights.realization.str.contains("LN24HA"), drop=True) * 6

### Ensemble statistics with deterministic reference data

In most cases, you will have deterministic data for the reference period. This means that, for a given location, the 30-year average for a specific indicator is represented by a single value.


In [None]:
# The Hydrological Portrait produces probabilistic estimates, but we'll take the 50th percentile to fake deterministic data
ref = xr.open_dataset(reference_files[0]).sel(percentile=50).drop_vars("percentile")

Given that biases may still persist in climate simulations even after bias adjustment, which can impact hydrological modeling, we need to employ a perturbation technique to combine data over the reference period with climate simulations. This is particularly important in hydrology, where nonlinear interactions between climate and hydrological indicators can be significant. Multiple other methodologies exist for combining observed and simulated data, but comparing various approaches goes beyond the scope of this example.

The perturbation technique involves calculating ensemble percentiles on the deltas and then applying those percentiles to the reference dataset. For this example, we'll compute the 10th, 25th, 50th, 75th, and 90th percentiles of the ensemble, as well as the agreement on the sign of the change, using the `xhydro.cc.ensemble_stats` function.
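
The idea can be sketched with NumPy for a single station and horizon. The delta values and the reference flow below are hypothetical, and the percentiles are unweighted for simplicity, unlike a weighted ensemble.

```python
import numpy as np

# Hypothetical ensemble of deltas (%) for one station and one horizon
deltas = np.array([-4.0, -1.0, 2.0, 3.0, 5.0, 6.0, 8.0, 9.0, 12.0, 15.0])
reference = 50.0  # hypothetical reference mean annual flow (m3/s)

pcts = [10, 25, 50, 75, 90]
delta_pcts = np.percentile(deltas, pcts)  # unweighted ensemble percentiles
frac_positive = float((deltas > 0).mean())  # share of members with a positive change

# Perturbation: apply the percentiles of the deltas to the reference value
projected = reference * (1 + delta_pcts / 100)

print(delta_pcts, frac_positive, projected)
```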


In [None]:
help(xh.cc.ensemble_stats)

In [None]:
statistics = {
    "ensemble_percentiles": {"values": [10, 25, 50, 75, 90], "split": False},
    "robustness_fractions": {"test": None},
}

ens_stats = xh.cc.ensemble_stats(ds_dict_deltas, statistics, weights=weights)
ens_stats

This results in a large amount of data with many unique variables. To simplify the results, we'll compute three new statistics:

- The median change.
- The interquartile range of the change.
- The agreement between models using the IPCC categories.

The last statistic is slightly more complex. For more details on the categories of agreement for the sign of change, refer to the technical summary in "Climate Change 2021 – The Physical Science Basis: Working Group I Contribution to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change", [Cross-Chapter Box 1](https://www.cambridge.org/core/books/climate-change-2021-the-physical-science-basis/atlas/24E1C016DBBE4725BDFBC343695DE7DB). 

This can be computed from the results produced by `robustness_fractions`, but it requires a call to the function `xclim.ensembles.robustness_categories`. Each entry of the thresholds and operations has two elements: the first relates to the significance test, and the second refers to the fraction of simulations showing a positive delta. For example, "Agreement towards increase" is met if at least 66% of simulations show a significant change and at least 80% of simulations show a positive change.
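
The decision logic can be illustrated with a pure-Python sketch for a single pair of fractions. The `classify` helper, the category labels, and the threshold values are assumptions for illustration, and the sketch assumes that categories are tested in order with the first match winning.

```python
import operator

OPS = {">=": operator.ge, "<=": operator.le, "<": operator.lt, ">": operator.gt}

# Illustrative categories; each pair is (significant change, positive change)
categories = [
    "Agreement towards increase",
    "Agreement towards decrease",
    "Conflicting signals",
    "No change or no robust signal",
]
thresholds = [(0.66, 0.8), (0.66, 0.2), (0.66, 0.8), (0.66, None)]
ops = [(">=", ">="), (">=", "<="), (">=", "<"), ("<", None)]


def classify(changed, positive):
    """Return the first category whose conditions are met, or None."""
    for name, (t_changed, t_positive), (op_changed, op_positive) in zip(
        categories, thresholds, ops
    ):
        matches = OPS[op_changed](changed, t_changed)
        if op_positive is not None:
            matches = matches and OPS[op_positive](positive, t_positive)
        if matches:
            return name
    return None


print(classify(changed=0.9, positive=0.85))
print(classify(changed=0.9, positive=0.5))
print(classify(changed=0.3, positive=0.9))
```

Because the conditions are checked in order, a case with a low fraction of positive deltas is labelled as a decrease before the broader "Conflicting signals" condition is reached.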


In [None]:
out = xr.Dataset()

out["QMOYAN_median"] = ens_stats["QMOYAN"].sel(percentiles=50)
out["QMOYAN_iqr"] = ens_stats["QMOYAN"].sel(percentiles=75) - ens_stats["QMOYAN"].sel(
    percentiles=25
)

In [None]:
from xclim.ensembles import robustness_categories

categories = [
    "Agreement towards increase",
    "Agreement towards decrease",
    "Conflicting signals",
    "No change or no robust signal",
]
thresholds = [[0.66, 0.8], [0.66, 0.2], [0.66, 0.8], [0.66, np.nan]]
ops = [[">=", ">="], [">=", "<="], [">=", "<"], ["<", None]]

out["QMOYAN_robustness_categories"] = robustness_categories(
    changed_or_fractions=ens_stats["QMOYAN_changed"],
    agree=ens_stats["QMOYAN_positive"],
    categories=categories,
    thresholds=thresholds,
    ops=ops,
)

Finally, using a perturbation method, future values for QMOYAN can be obtained by multiplying the reference indicator with the percentiles of the ensemble deltas.

In [None]:
out["QMOYAN_projected"] = ref.QMOYAN * (1 + ens_stats.QMOYAN / 100)

In [None]:
out

### Ensemble statistics with probabilistic reference data

This method is similar to the previous section, but it applies to cases like the [Hydroclimatic Atlas of Southern Quebec](https://cehq.gouv.qc.ca/atlas-hydroclimatique/) or results from the [Optimal Interpolation](optimal_interpolation.ipynb) notebook, where hydrological indicators for the historical period are represented by a probability density function (PDF) rather than a single discrete value. In such cases, the ensemble percentiles cannot simply be multiplied by the reference value.

In this example, instead of a single value, `QMOYAN` is represented by 21 percentiles that capture the uncertainty surrounding this statistic. Similar to the future simulations, we also have 6 hydrological platforms to consider.

<div class="alert alert-warning"> <b>WARNING</b>

In these cases, the percentiles in `ref` represent <b>uncertainty</b> (e.g., related to hydrological modeling or input data uncertainty), not interannual variability. At this stage, the temporal average should already have been calculated.

</div>


In [None]:
ref = xr.open_mfdataset(reference_files, combine="nested", concat_dim="platform")

ref

In [None]:
# This can also be represented as a cumulative distribution function (CDF)
import matplotlib.pyplot as plt

for platform in ref.platform:
    plt.plot(
        ref.QMOYAN.isel(station=0).sel(platform=platform),
        ref.QMOYAN.percentile / 100,
        "grey",
    )
# Axis labels and title only need to be set once, outside the loop
plt.xlabel("Mean annual flow (m³/s)")
plt.ylabel("Probability")
plt.title("CDF for QMOYAN @ ABIT00057 \nEach line is a hydrological platform")

Due to their probabilistic nature, the historical reference values cannot be easily combined with the future deltas. To address this, the `xhydro.cc.weighted_random_sampling` and `xhydro.cc.sampled_indicators` functions have been designed. Together, these functions will:

1. Sample 'n' values from the historical distribution, in accordance with the 'percentile' dimension.
2. Sample 'n' values from the delta distribution, using the provided weights.
3. Create the future distribution by applying the sampled deltas to the sampled historical distribution element-wise.
4. Compute the percentiles of the future distribution.
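
The four steps above can be sketched with NumPy for a single station and horizon. All values are hypothetical, and the way probabilities are derived from the percentiles (each percentile represents the interval between the midpoints of its neighbours) is an assumption made for this illustration, not necessarily what `xhydro.cc.weighted_random_sampling` does.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical reference distribution: 5 percentiles of mean annual flow (m3/s)
percentiles = np.array([10, 25, 50, 75, 90])
ref_values = np.array([8.0, 9.0, 10.0, 11.0, 12.5])

# Step 1: sample the reference values, weighting each percentile by the width
# of the interval it represents (assumption: midpoints between percentiles)
edges = np.concatenate([[0], (percentiles[:-1] + percentiles[1:]) / 2, [100]])
p_ref = np.diff(edges) / 100
hist_sample = rng.choice(ref_values, size=n, p=p_ref)

# Step 2: sample the deltas (%) using the simulation weights
deltas = np.array([-5.0, 0.0, 4.0, 10.0])
weights = np.array([0.5, 0.5, 1.0, 1.0])
delta_sample = rng.choice(deltas, size=n, p=weights / weights.sum())

# Step 3: apply the sampled deltas to the sampled reference values element-wise
fut = hist_sample * (1 + delta_sample / 100)

# Step 4: compute the percentiles of the future distribution
fut_pct = np.percentile(fut, percentiles)
print(fut_pct)
```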

First, we will sample within the reference dataset to combine the results of the 6 hydrological platforms together.

In [None]:
help(xh.cc.weighted_random_sampling)

In [None]:
deltas = xclim.ensembles.create_ensemble(ds_dict_deltas)

hist_dist = xh.cc.weighted_random_sampling(
    ds=ref,
    include_dims=["platform"],
    n=10000,
    seed=0,
)

hist_dist

In [None]:
# Let's show how the historical distribution was sampled and reconstructed
def _make_cdf(ds, bins):
    """Return the bin edges and the empirical CDF of QMOYAN at the first station."""
    count, bin_edges = np.histogram(ds.QMOYAN.isel(station=0), bins=bins)
    pdf = count / count.sum()
    return bin_edges, np.cumsum(pdf)


# Barplot
plt.subplot(2, 1, 1)
uniquen = np.unique(hist_dist.QMOYAN.isel(station=0), return_counts=True)
plt.bar(uniquen[0], uniquen[1], width=0.01, color="k")
plt.ylabel("Number of instances")
plt.title("Sampling within the historical distribution")

# CDF
plt.subplot(2, 1, 2)
for i, platform in enumerate(ref.platform):
    plt.plot(
        ref.QMOYAN.isel(station=0).sel(platform=platform),
        ref.percentile / 100,
        "grey",
        label="CDFs from the percentiles" if i == 0 else None,
    )
bc, c = _make_cdf(hist_dist, bins=50)
plt.plot(bc[1:], c, "r", label=f"Sampled historical CDF (n={10000})", linewidth=3)
plt.ylabel("Probability")
plt.xlabel("QMOYAN (m³/s)")
plt.legend()

plt.tight_layout()

We can do the same for the deltas. Since `weights` already contains all dimensions that we want to sample from, we don't need `include_dims` here.

In [None]:
delta_dist = xh.cc.weighted_random_sampling(
    ds=deltas,
    weights=weights,
    n=10000,
    seed=0,
)

delta_dist

In [None]:
# Then, let's show how the deltas were sampled, for the last horizon
plt.subplot(2, 1, 1)
uniquen = np.unique(delta_dist.QMOYAN.isel(station=0, horizon=-1), return_counts=True)
plt.bar(uniquen[0], uniquen[1], width=0.25, color="k")
plt.ylabel("Number of instances")
plt.title("Sampling within the delta distribution")

plt.subplot(2, 1, 2)
bc, c = _make_cdf(delta_dist.isel(horizon=-1), bins=100)  # Last horizon only
plt.plot(bc[1:], c, "k", label=f"Sampled deltas CDF (n={10000})", linewidth=3)
plt.ylabel("Probability")
plt.xlabel("Deltas (%)")
plt.legend()

plt.tight_layout()

Once the two distributions have been acquired, `xhydro.cc.sampled_indicators` can be used to combine them element-wise and reconstruct a future distribution. The resulting distribution will possess the unique dimensions from both datasets. Here, this means that we get a reconstructed distribution for each future horizon.
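
How the result ends up with the dimensions of both inputs can be illustrated with NumPy broadcasting. The arrays below are synthetic, and the dimension names in the comments are only labels for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_horizons = 1000, 3

# Hypothetical sampled distributions
hist = rng.normal(10.0, 1.0, size=n_samples)  # dims: (sample,)
deltas = rng.normal(5.0, 2.0, size=(n_samples, n_horizons))  # dims: (sample, horizon)

# Element-wise along 'sample', broadcast along 'horizon'
fut = hist[:, np.newaxis] * (1 + deltas / 100)
print(fut.shape)
```

Each historical sample is paired with the delta sample at the same position, once per horizon.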

In [None]:
help(xh.cc.sampled_indicators)

In [None]:
fut_dist, fut_pct = xh.cc.sampled_indicators(
    ds_dist=hist_dist,
    deltas_dist=delta_dist,
    delta_kind="percentage",
    percentiles=ref.percentile,
)

fut_dist

Since we used the `percentiles` argument, it also computed a series of percentiles.

In [None]:
fut_pct

In [None]:
# The distributions themselves can be used to create boxplots and compare the historical distribution to the future ones.
plt.boxplot(
    [
        hist_dist.QMOYAN.isel(station=0),
        fut_dist.QMOYAN.isel(station=0, horizon=0),
        fut_dist.QMOYAN.isel(station=0, horizon=1),
        fut_dist.QMOYAN.isel(station=0, horizon=2),
    ],
    labels=["Historical", "2011-2040", "2041-2070", "2071-2100"],
)

plt.ylabel("Mean annual flow (m³/s)")
plt.tight_layout()

The same statistics as before can also be computed by using the 10,000 samples within `delta_dist`.

In [None]:
# Rename the sampling dimension, since the ensemble functions expect 'realization'
delta_dist = delta_dist.rename({"sample": "realization"})  # xclim compatibility
ens_stats = xh.cc.ensemble_stats(delta_dist, statistics)

out_prob = xr.Dataset()
out_prob["QMOYAN_median"] = ens_stats["QMOYAN"].sel(percentiles=50)
out_prob["QMOYAN_iqr"] = ens_stats["QMOYAN"].sel(percentiles=75) - ens_stats[
    "QMOYAN"
].sel(percentiles=25)
out_prob["QMOYAN_robustness_categories"] = robustness_categories(
    changed_or_fractions=ens_stats["QMOYAN_changed"],
    agree=ens_stats["QMOYAN_positive"],
    categories=categories,
    thresholds=thresholds,
    ops=ops,
)

out_prob