# Climate change analysis of hydrological data

In [None]:
# Imports
from pathlib import Path

import hvplot.xarray  # noqa
import numpy as np
import pooch
import xarray as xr
import xclim

import xhydro as xh
from xhydro.testing.helpers import deveraux

D = deveraux()

# Future streamflow file (1 file - Hydrotel driven by BCC-CSM-1.1(m))
streamflow_file = D.fetch("cc_indicators/streamflow_BCC-CSM1.1-m_rcp45.nc")

# Reference mean annual streamflow (QMOYAN) for 6 calibrations of Hydrotel
reference_files = D.fetch("cc_indicators/reference.zip", pooch.Unzip())

# Future deltas of QMOYAN (63 simulations x 6 calibrations of Hydrotel)
deltas_files = D.fetch("cc_indicators/deltas.zip", pooch.Unzip())

While there is a huge variety of analyses that could be done to assess the impacts of climate change on hydrology, this notebook will go through some of the most common steps:

- Computing a list of relevant indicators over climatological periods
- Computing future deltas
- Computing ensemble statistics to assess future changes

<div class="alert alert-info"> <b>INFO</b>

Multiple functions in `xh.indicators` and `xh.cc` have been leveraged from the `xscen` library and made accessible to `xhydro` users. For more information on these functions, it is recommended to look at:

- [compute_indicators](https://xscen.readthedocs.io/en/latest/notebooks/2_getting_started.html#Computing-indicators)
- [climatological_op](https://xscen.readthedocs.io/en/latest/notebooks/2_getting_started.html#Climatological-operations)
- [compute_deltas](https://xscen.readthedocs.io/en/latest/notebooks/2_getting_started.html#Computing-deltas)
- [ensemble_statistics](https://xscen.readthedocs.io/en/latest/notebooks/2_getting_started.html#Ensemble-statistics)

</div>

## Computing hydrological indicators over a given time period

In [None]:
# The file used as an example is a daily timeseries of streamflow data generated from the Hydrotel hydrological model
# driven by bias-adjusted data from the BCC-CSM-1.1(m) climate model (RCP4.5), from 1950 to 2100.
# For this example, the dataset covers only 2 stations.
ds = xr.open_dataset(streamflow_file)
ds.streamflow.hvplot(x="time", grid=True, widget_location="bottom", groupby="station")

Hydrological indicators can be separated into two broad categories:

- Frequential indicators, such as the maximum 20-year flow (*Qmax20*) or the minimum 2-year 7-day averaged flow in summer (*Q7min2_summer*). Computing these is already covered in the [Local Frequency Analysis notebook](local_frequency_analysis.ipynb).
- Non-frequential indicators, such as the average yearly flow.

Since frequential indicators have already been covered in another example, this notebook will instead look at the methodology that would be used to compute non-frequential indicators using `xhydro.indicators.compute_indicators`. The inputs of that function are:

- *ds*: the Dataset.
- *indicators*: a list of indicators to compute, or the path to a YAML file containing those.
- *periods* (optional): either [start, end] or list of [start, end] of continuous periods over which to compute the indicators.
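
To make the idea of a non-frequential indicator concrete, here is a minimal NumPy sketch of what a yearly mean, with and without a month restriction, amounts to. It does not use `xhydro` or `xclim`; the synthetic flow values and the `yearly_mean` helper are assumptions made for illustration only.

```python
import numpy as np

# Synthetic daily streamflow for two full years (hypothetical values, in m³/s)
time = np.arange("2001-01-01", "2003-01-01", dtype="datetime64[D]")
rng = np.random.default_rng(0)
flow = 10 + 5 * rng.random(time.size)

# Calendar year and month of each daily value
years = time.astype("datetime64[Y]").astype(int) + 1970
months = time.astype("datetime64[M]").astype(int) % 12 + 1


def yearly_mean(flow, years, mask=None):
    """Mean flow per calendar year, optionally restricted to the days where `mask` is True."""
    if mask is None:
        mask = np.ones(flow.shape, dtype=bool)
    return {int(y): float(flow[mask & (years == y)].mean()) for y in np.unique(years)}


# Mean annual flow: every day of the year is used
mean_annual = yearly_mean(flow, years)

# Mean summer-fall flow: only June to November are kept, similar to a month indexer
summer_fall = np.isin(months, [6, 7, 8, 9, 10, 11])
mean_summer_fall = yearly_mean(flow, years, mask=summer_fall)

print(mean_annual)
print(mean_summer_fall)
```

Both results hold one value per year, which is why such indicators end up at an annual output frequency.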

<div class="alert alert-info"> <b>INFO</b>

Custom indicators are built by following the YAML formatting required by `xclim`. More information is available [in the xclim documentation](https://xclim.readthedocs.io/en/latest/api.html#yaml-file-structure).

The list of YAML IDs is available [here](https://xclim.readthedocs.io/en/stable/indicators.html).

</div>

In [None]:
# We'll define 2 indicators to compute by using dictionaries.
#
# We minimally need to define three things under `data`:
#    1. 'base': A base indicator for the computation, identified through its YAML ID (here, 'stats').
#    2. 'parameters': Specific parameters to use instead of the defaults.
#      - This potentially includes an 'indexer' parameter to focus on particular periods of the year.
#    3. 'input': The name of the input variable. The key here must be the variable name used by xclim (here, 'da').
#
# The 'identifier' is the label that will be given by 'xclim' to the new indicator. The 'module' can be anything.

indicators = [
    # 1st indicator: Mean annual flow
    xclim.core.indicator.Indicator.from_dict(
        data={
            "base": "stats",
            "input": {"da": "streamflow"},
            "parameters": {"op": "mean"},
        },
        identifier="QMOYAN",
        module="hydro",
    ),
    # 2nd indicator: Mean summer-fall flow
    xclim.core.indicator.Indicator.from_dict(
        data={
            "base": "stats",
            "input": {"da": "streamflow"},
            "parameters": {"op": "mean", "indexer": {"month": [6, 7, 8, 9, 10, 11]}},
        },  # The indexer is used to restrict available data to the relevant months only
        identifier="QMOYEA",
        module="hydro",
    ),
]

# Call compute_indicators
dict_indicators = xh.indicators.compute_indicators(ds, indicators=indicators)

dict_indicators

In [None]:
dict_indicators["YS-JAN"].QMOYAN.hvplot(
    x="time", grid=True, widget_location="bottom", groupby="station"
)

Since indicators could be output at varying frequencies, `compute_indicators` will return a dictionary where the keys are the output frequencies. In this example, we only have one key: `YS-JAN` (annual data starting in January). The keys follow the `pandas` nomenclature.

The next step is to obtain averages over climatological periods. The `xh.cc.climatological_op` function can be called for this purpose. The inputs of that function are:

- *ds*: Dataset to use for the computation.
- *op*: Operation to perform over time. While other operations are technically possible, the following are recommended and tested:  ['max', 'mean', 'median', 'min', 'std', 'sum', 'var', 'linregress'].
- *window* (optional): Number of years to use for the rolling operation. If None, all the available data will be used.
- *min_periods* (optional): For the rolling operation, minimum number of years required for a value to be computed.
- *stride*: Stride (in years) at which to provide an output from the rolling window operation.
- *periods* (optional): Either [start, end] or list of [start, end] of continuous periods to be considered.
- *rename_variables*: If True, '_clim_{op}' will be added to variable names.
- *horizons_as_dim*: If True, the output is indexed by 'horizon' and by a frequency dimension ('month', 'season' or 'year'), both provided as dimensions and coordinates.
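
As a rough illustration of how *periods* and *min_periods* interact, the sketch below averages a hypothetical annual series over fixed periods with plain NumPy. The `period_means` helper and the synthetic values are assumptions for illustration; this is not how `xh.cc.climatological_op` is implemented.

```python
import numpy as np

# Hypothetical annual indicator values for 1981-2099 (the series stops one year short of 2100)
years = np.arange(1981, 2100)
values = np.linspace(100.0, 130.0, years.size)

periods = [[1981, 2010], [2011, 2040], [2041, 2070], [2071, 2100]]
min_periods = 29


def period_means(years, values, periods, min_periods):
    """Average `values` over each inclusive [start, end] period.

    A period with fewer than `min_periods` valid years yields NaN.
    """
    out = {}
    for start, end in periods:
        in_period = (years >= start) & (years <= end)
        n_valid = int(np.count_nonzero(in_period & ~np.isnan(values)))
        if n_valid >= min_periods:
            out[f"{start}-{end}"] = float(np.nanmean(values[in_period]))
        else:
            out[f"{start}-{end}"] = float("nan")
    return out


clim = period_means(years, values, periods, min_periods)
print(clim)
```

With `min_periods = 29`, the last period still returns a value from its 29 available years; requiring 30 years would turn it into NaN.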

In [None]:
# Define the periods using a list of lists
periods = [[1981, 2010], [2011, 2040], [2041, 2070], [2071, 2100]]
# This is an example of a model where the data stops in 2099, so we can use 'min_periods'
# to still obtain a value for the last period.
min_periods = 29

# Call climatological_op. Here we don't need 'time' anymore, so we can use horizons_as_dim=True
ds_avg = xh.cc.climatological_op(
    dict_indicators["YS-JAN"],
    op="mean",
    periods=periods,
    min_periods=min_periods,
    horizons_as_dim=True,
    rename_variables=False,
).drop_vars(["time"])
ds_avg

Computing deltas is then as easy as calling `xh.cc.compute_deltas`. The inputs of that function are:

- *ds*: Dataset to use for the computation.
- *reference_horizon*: Either a YYYY-YYYY string corresponding to the 'horizon' coordinate of the reference period, or a xr.Dataset containing the climatological mean.
- *kind*: ['+', '/', '%'] Whether to provide absolute, relative, or percentage deltas. Can also be a dictionary separated per variable name.
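
The three kinds of deltas can be written out explicitly. The sketch below is a plain NumPy illustration with hypothetical values; the `delta` helper is an assumption for this example and is not part of `xhydro`.

```python
import numpy as np


def delta(future, reference, kind):
    """Return the change of `future` with respect to `reference`."""
    if kind == "+":  # absolute delta
        return future - reference
    if kind == "/":  # relative delta (ratio)
        return future / reference
    if kind == "%":  # percentage delta
        return 100 * (future - reference) / reference
    raise ValueError(f"Unknown kind: {kind!r}")


# Hypothetical 30-year averages (m³/s)
reference = 120.0
future = np.array([126.0, 132.0, 108.0])

absolute = delta(future, reference, "+")
relative = delta(future, reference, "/")
percentage = delta(future, reference, "%")

print(absolute, relative, percentage)
```

A percentage delta of 5 therefore corresponds to a ratio of 1.05 and, for this reference value, to an absolute change of 6 m³/s.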

In [None]:
# Here, we'll use a string from the 'horizon' dimension.
reference_horizon = "1981-2010"
kind = "%"

ds_deltas = xh.cc.compute_deltas(
    ds_avg, reference_horizon=reference_horizon, kind=kind, rename_variables=False
)
ds_deltas

In [None]:
# Show the results as Dataframes
print("30-year averages")
display(ds_avg.QMOYAN.isel(station=0).to_dataframe())
print("Deltas")
display(ds_deltas.QMOYAN.isel(station=0).to_dataframe())

## Ensemble statistics

In [None]:
# To save time, let's open pre-computed deltas for the RCP4.5 simulations used in the 2022 Hydroclimatic Atlas
ds_dict_deltas = {}
for f in deltas_files:
    sim_id = Path(f).stem  # Avoid shadowing the built-in `id`
    ds_dict_deltas[sim_id] = xr.open_dataset(f)

It is good practice to use multiple climate models to perform climate change analyses, especially since the impacts on the hydrological cycle can be nonlinear. Once multiple hydrological simulations have been run and are ready to be analysed, `xh.cc.ensemble_stats` can be used to call a variety of functions available in `xclim.ensembles`, such as for getting ensemble quantiles or the agreement on the sign of the change.

### Weighting simulations
If the ensemble of climate models is heterogeneous, for example if a given climate model has provided more simulations, it is recommended to weight the results accordingly. While this is not currently available through `xhydro`, `xscen.generate_weights` can create a first approximation of the weights to use, based on available metadata.

The following attributes are required for the function to work:

- 'cat:source' in all datasets
- 'cat:driving_model' in regional climate models
- 'cat:institution' in all datasets if independence_level='institution'
- 'cat:experiment' in all datasets if split_experiments=True

That function has three possible independence levels:

- *model*: 1 Model - 1 Vote
- *GCM*: 1 GCM - 1 Vote
- *institution*: 1 institution - 1 Vote
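
The "1 Model - 1 Vote" idea can be sketched in a few lines of plain Python. This is a simplified illustration that assumes each model's weights should sum to one; the simulation names and the `one_model_one_vote` helper are hypothetical, and the actual logic in `xscen.generate_weights` handles more cases (driving models, institutions, experiments).

```python
from collections import Counter

# Hypothetical ensemble: (simulation name, climate model that produced it)
simulations = [
    ("sim1", "ModelA"),
    ("sim2", "ModelA"),
    ("sim3", "ModelA"),
    ("sim4", "ModelB"),
    ("sim5", "ModelC"),
]


def one_model_one_vote(simulations):
    """Split one vote equally between all simulations of the same model."""
    counts = Counter(model for _, model in simulations)
    return {sim: 1 / counts[model] for sim, model in simulations}


example_weights = one_model_one_vote(simulations)
print(example_weights)
```

Here, the three simulations of the hypothetical `ModelA` each get a third of a vote, so that this model does not dominate the ensemble statistics.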

In [None]:
import xscen

independence_level = "model"  # 1 Model - 1 Vote

weights = xscen.generate_weights(ds_dict_deltas, independence_level=independence_level)

# Show the results. We multiply by 6 for the showcase here simply because there are 6 hydrological platforms in the results.
weights.where(weights.realization.str.contains("LN24HA"), drop=True) * 6

### Use Case #1: Deterministic reference data

In most cases, you'll likely have deterministic data for the reference period, meaning that for a given location, the 30-year average for the indicator is a single value.

In [None]:
# The Hydrological Portrait produces probabilistic estimates, but we'll take the 50th percentile to fake deterministic data
ref = xr.open_dataset(reference_files[0]).sel(percentile=50).drop_vars("percentile")

Multiple methodologies exist to combine the information from observed and simulated data. Because biases may remain in the climate simulations even after bias adjustment and affect the hydrological modelling, we'll use a perturbation technique. This is especially relevant in hydrology because of the nonlinear interactions between the climate and hydrological indicators.

The perturbation technique consists of computing ensemble percentiles on the deltas, then applying them to the reference dataset. For this example, we'll compute the 10th, 25th, 50th, 75th, and 90th percentiles of the ensemble, as well as the agreement on the sign of change, using `xh.cc.ensemble_stats`. The inputs of that function are:

- *datasets*: List of file paths or xarray Dataset/DataArray objects to include in the ensemble. A dictionary can be passed instead of a list, in which case the keys are used as coordinates along the new `realization` axis.
- *statistics*: Dictionary of `xclim.ensembles` statistics to be called, with their arguments.
- *weights* (optional): Weights to apply along the 'realization' dimension.
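
In its simplest form, the perturbation technique boils down to the arithmetic below. This is an unweighted NumPy sketch with hypothetical values for a single station and horizon; it only illustrates the principle and does not reproduce `xh.cc.ensemble_stats`.

```python
import numpy as np

# Hypothetical reference value of the indicator (m³/s)
reference = 50.0

# Hypothetical percentage deltas from an ensemble of 8 simulations
ensemble_deltas = np.array([-12.0, -5.0, -2.0, 1.0, 3.0, 4.0, 8.0, 15.0])

# 1. Ensemble percentiles computed on the deltas (all simulations weighted equally here)
wanted = [10, 25, 50, 75, 90]
delta_percentiles = np.percentile(ensemble_deltas, wanted)

# 2. Apply the percentiles of the deltas to the reference value
projected = reference * (1 + delta_percentiles / 100)

# 3. Fraction of simulations that agree on a positive change
fraction_positive = float(np.mean(ensemble_deltas > 0))

for p, d, q in zip(wanted, delta_percentiles, projected):
    print(f"p{p}: delta = {d:.2f} %, projected = {q:.2f} m³/s")
print(f"Fraction of positive deltas: {fraction_positive}")
```

Because the deltas are applied to the reference value rather than the simulated flows being used directly, residual biases in the simulations have less influence on the projected values.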

In [None]:
# Statistics to compute
statistics = {
    "ensemble_percentiles": {"values": [10, 25, 50, 75, 90], "split": False},
    "robustness_fractions": {"test": None},
}  # Robustness fractions is the function that provides the agreement between models.

# Here, we call ensemble_stats on the dictionary deltas, since this is the information that we want to extrapolate.
# If relevant, weights are added at this step
ens_stats = xh.cc.ensemble_stats(ds_dict_deltas, statistics, weights=weights)

In [None]:
# Additional statistics not explicitly supported by ensemble_stats
from xclim.ensembles import robustness_categories

# Interquartile range
ens_stats["QMOYAN_iqr"] = ens_stats["QMOYAN"].sel(percentiles=75) - ens_stats[
    "QMOYAN"
].sel(percentiles=25)

# Categories of agreement for the sign of change. This follows the Advanced IPCC Atlas categories.
# See the Cross-Chapter Box 1 for reference: https://www.cambridge.org/core/books/climate-change-2021-the-physical-science-basis/atlas/24E1C016DBBE4725BDFBC343695DE7DB
# For thresholds and ops, the first entry is related to the significance test, while the 2nd is related to the percentage of simulations that see a positive delta.
# For example, "Agreement towards increase" is met if at least 66% of simulations see a significant change AND at least 80% of simulations see a positive change.
categories = [
    "Agreement towards increase",
    "Agreement towards decrease",
    "Conflicting signals",
    "No change or no robust signal",
]
thresholds = [[0.66, 0.8], [0.66, 0.2], [0.66, 0.8], [0.66, np.nan]]
ops = [[">=", ">="], [">=", "<="], [">=", "<"], ["<", None]]

ens_stats["QMOYAN_robustness_categories"] = robustness_categories(
    changed_or_fractions=ens_stats["QMOYAN_changed"],
    agree=ens_stats["QMOYAN_positive"],
    categories=categories,
    thresholds=thresholds,
    ops=ops,
)

# The future values for QMOYAN can be obtained by multiplying the reference indicator with the percentiles of the ensemble deltas
ens_stats["QMOYAN_projected"] = ref.QMOYAN * (1 + ens_stats.QMOYAN / 100)

In [None]:
ens_stats
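
The way the `thresholds` and `ops` lists translate into categories can be sketched for a single pair of fractions. This is a plain-Python illustration that assumes the first matching category takes precedence; the `classify` helper is hypothetical and `xclim.ensembles.robustness_categories` remains the reference for the exact behaviour.

```python
import math
import operator

# Map the operator strings to Python comparison functions
OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt}

categories = [
    "Agreement towards increase",
    "Agreement towards decrease",
    "Conflicting signals",
    "No change or no robust signal",
]
thresholds = [[0.66, 0.8], [0.66, 0.2], [0.66, 0.8], [0.66, math.nan]]
ops = [[">=", ">="], [">=", "<="], [">=", "<"], ["<", None]]


def classify(changed, positive):
    """Return the first category whose conditions are met (assumed precedence rule)."""
    for name, (chg_thr, agr_thr), (chg_op, agr_op) in zip(categories, thresholds, ops):
        matches = OPS[chg_op](changed, chg_thr)
        if agr_op is not None:
            # The second condition only applies when an operator is provided
            matches = matches and OPS[agr_op](positive, agr_thr)
        if matches:
            return name
    return None


print(classify(changed=1.0, positive=0.9))
print(classify(changed=1.0, positive=0.1))
print(classify(changed=1.0, positive=0.5))
print(classify(changed=0.5, positive=0.9))
```

Under this assumption, the order of the lists matters: a strongly negative ensemble satisfies both the second and the third set of conditions, and is labelled by the second.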

### Use Case #2: Probabilistic reference data

This method follows a similar approach to Use Case #1, but for a case like the [Hydrological Atlas of Southern Quebec](https://cehq.gouv.qc.ca/atlas-hydroclimatique/), where the hydrological indicators computed for the historical period are represented by a probability density function (PDF), rather than a discrete value. This means that the ensemble percentiles can't simply be multiplied by the reference value.

<div class="alert alert-info"> <b>INFO</b>

Note that the percentiles in `ref` are <b>not</b> the interannual variability, but rather the uncertainty related, for example, to hydrological modelling or the quality of the input data. At this stage, the temporal average should already have been done.

</div>

In [None]:
ref = xr.open_mfdataset(reference_files, combine="nested", concat_dim="platform")

# Rather than a single value, QMOYAN is represented by 21 percentiles that try to represent the uncertainty surrounding this statistic.
# Like for the future simulations, we also have 6 hydrological platforms to take into account.
ref

In [None]:
# This can also be represented as a cumulative distribution function (CDF)
import matplotlib.pyplot as plt

for platform in ref.platform:
    plt.plot(
        ref.QMOYAN.isel(station=0).sel(platform=platform),
        ref.QMOYAN.percentile / 100,
        "grey",
    )
plt.xlabel("Mean annual flow (m³/s)")
plt.ylabel("Probability")
plt.title("CDF for QMOYAN @ ABIT00057 \nEach line is a hydrological platform")

Because of their probabilistic nature, the historical reference values can't easily be combined with the future deltas. The `weighted_random_sampling` and `sampled_indicators` functions have been created to circumvent this issue. Together, these functions will:

1. Sample 'n' values from the historical distribution, weighting the percentiles by their associated coverage.
2. Sample 'n' values from the delta distribution, using the provided weights.
3. Create the future distribution by applying the sampled deltas to the sampled historical distribution, element-wise.
4. Compute the percentiles of the future distribution.
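
The four steps above can be sketched with NumPy for a single station and horizon. All values are hypothetical, and the way the coverage of each percentile is derived here (half the distance to its neighbours, bounded by 0 and 100) is an assumption made for illustration; it is not taken from the `xhydro` implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical reference distribution, described by a few percentiles (values in m³/s)
percentile = np.array([5.0, 25.0, 50.0, 75.0, 95.0])
ref_values = np.array([40.0, 46.0, 50.0, 55.0, 63.0])

# Assumed coverage of each percentile: half the distance to its neighbours
edges = np.concatenate([[0.0], (percentile[:-1] + percentile[1:]) / 2, [100.0]])
coverage = np.diff(edges) / 100  # sums to 1

# Hypothetical percentage deltas and their ensemble weights
deltas = np.array([-10.0, -2.0, 3.0, 12.0])
delta_weights = np.array([0.5, 0.5, 1.0, 1.0])

# 1. Sample n values from the historical distribution, weighted by coverage
hist_sample = rng.choice(ref_values, size=n, p=coverage)

# 2. Sample n values from the deltas, using the normalized weights
delta_sample = rng.choice(deltas, size=n, p=delta_weights / delta_weights.sum())

# 3. Apply the sampled deltas to the sampled historical values, element-wise
fut_sample = hist_sample * (1 + delta_sample / 100)

# 4. Compute the percentiles of the reconstructed future distribution
fut_percentiles = np.percentile(fut_sample, percentile)
print(fut_percentiles)
```

Fixing the seed of the random generator keeps the sampling reproducible, which is the role of the `seed` argument in the cells that follow.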

In [None]:
print(xh.cc.weighted_random_sampling.__doc__)

In [None]:
n = 10000
deltas = xclim.ensembles.create_ensemble(
    ds_dict_deltas
)  # The function expects an xarray object. This xclim function can be used to easily create the required input.

# First, we sample within the reference dataset to combine the results of the 6 hydrological platforms together.
hist_dist = xh.cc.weighted_random_sampling(
    ds=ref,
    include_dims=["platform"],
    n=n,
    seed=0,
)

# Let's show how the historical distribution was sampled and reconstructed


def _make_cdf(ds, bins):
    count, bins_count = np.histogram(ds.QMOYAN.isel(station=0), bins=bins)
    pdf = count / count.sum()
    return bins_count, np.cumsum(pdf)


# Barplot
plt.subplot(2, 1, 1)
uniquen = np.unique(hist_dist.QMOYAN.isel(station=0), return_counts=True)
plt.bar(uniquen[0], uniquen[1], width=0.01, color="k")
plt.ylabel("Number of instances")
plt.title("Sampling within the historical distribution")

# CDF
plt.subplot(2, 1, 2)
for i, platform in enumerate(ref.platform):
    plt.plot(
        ref.QMOYAN.isel(station=0).sel(platform=platform),
        ref.percentile / 100,
        "grey",
        label="CDFs from the percentiles" if i == 0 else None,
    )
bc, c = _make_cdf(hist_dist, bins=50)
plt.plot(bc[1:], c, "r", label=f"Sampled historical CDF (n={n})", linewidth=3)
plt.ylabel("Probability")
plt.xlabel("QMOYAN (m³/s)")
plt.legend()

plt.tight_layout()

In [None]:
# We can also inspect the array to see that the `platform` and `percentile` dimensions have indeed been reduced.
hist_dist

In [None]:
# We can do the same for the deltas. Since `weights` already contains all dimensions that we want to sample from, we don't need `include_dims` here.
delta_dist = xh.cc.weighted_random_sampling(
    ds=deltas,
    weights=weights,
    n=n,
    seed=0,
)

# Then, let's show how the deltas were sampled, for the last horizon
plt.subplot(2, 1, 1)
uniquen = np.unique(delta_dist.QMOYAN.isel(station=0, horizon=-1), return_counts=True)
plt.bar(uniquen[0], uniquen[1], width=0.25, color="k")
plt.ylabel("Number of instances")
plt.title("Sampling within the delta distribution")

plt.subplot(2, 1, 2)
bc, c = _make_cdf(delta_dist.isel(horizon=-1), bins=100)  # Last horizon only, as in the bar plot
plt.plot(bc[1:], c, "k", label=f"Sampled deltas CDF (n={n})", linewidth=3)
plt.ylabel("Probability")
plt.xlabel("Deltas (%)")
plt.legend()

plt.tight_layout()

In [None]:
# We can inspect the results here too.
delta_dist

Once the two distributions have been acquired, `xh.cc.sampled_indicators` can be used to combine them element-wise and reconstruct a future distribution.

In [None]:
print(xh.cc.sampled_indicators.__doc__)

In [None]:
fut_dist, fut_pct = xh.cc.sampled_indicators(
    ds_dist=hist_dist,
    deltas_dist=delta_dist,
    delta_kind="percentage",
    percentiles=ref.percentile,
)

# The resulting distribution will possess the unique dimensions from both datasets.
# Here, this means that we get a reconstructed distribution for each future horizon.
fut_dist

In [None]:
# Since we used the `percentiles` argument, it also computed a series of percentiles.
fut_pct

In [None]:
# The distributions themselves can be used to create boxplots and compare the historical distribution to the future ones.
plt.boxplot(
    [
        hist_dist.QMOYAN.isel(station=0),
        fut_dist.QMOYAN.isel(station=0, horizon=0),
        fut_dist.QMOYAN.isel(station=0, horizon=1),
        fut_dist.QMOYAN.isel(station=0, horizon=2),
    ],
    labels=["Historical", "2011-2040", "2041-2070", "2071-2100"],
)

plt.ylabel("Mean annual flow (m³/s)")
plt.tight_layout()

In [None]:
# The same statistics as before can also be computed by using delta_dist
delta_dist = delta_dist.rename({"sample": "realization"})  # xclim compatibility
ens_stats_2 = xh.cc.ensemble_stats(delta_dist, statistics)

# Inter-quartile range
ens_stats_2["QMOYAN_iqr"] = ens_stats_2["QMOYAN"].sel(percentiles=75) - ens_stats_2[
    "QMOYAN"
].sel(percentiles=25)

# Categories of agreement on the sign of change
ens_stats_2["QMOYAN_robustness_categories"] = robustness_categories(
    changed_or_fractions=ens_stats_2["QMOYAN_changed"],
    agree=ens_stats_2["QMOYAN_positive"],
    categories=categories,
    thresholds=thresholds,
    ops=ops,
)

ens_stats_2