Commit 04d923f

Merge 73f4bbb into e6df64d
2 parents e6df64d + 73f4bbb
RondeauG committed Oct 4, 2023
Showing 16 changed files with 1,293 additions and 11 deletions.
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -6,6 +6,7 @@ include README.rst
 include .zenodo.json
 
 recursive-include docs *.rst conf.py Makefile make.bat *.jpg *.png *.gif
+recursive-include docs *.ipynb
 recursive-include tests *
 recursive-exclude * __pycache__
 recursive-exclude * *.py[co]
12 changes: 7 additions & 5 deletions docs/index.rst
@@ -4,17 +4,19 @@ Welcome to xHydro's documentation!
 .. toctree::
    :maxdepth: 2
    :caption: Contents:
+   :hidden:
 
    readme
    installation
    usage
+   notebooks/local_frequency_analysis
    apidoc/modules
    contributing
    authors
    history
 
-Indices and tables
-==================
-* :ref:`genindex`
-* :ref:`modindex`
-* :ref:`search`
+.. Indices and tables
+.. ==================
+.. * :ref:`genindex`
+.. * :ref:`modindex`
+.. * :ref:`search`
347 changes: 347 additions & 0 deletions docs/notebooks/local_frequency_analysis.ipynb
@@ -0,0 +1,347 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Frequency analysis module"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Basic imports\n",
"import xarray as xr\n",
"import hvplot.xarray\n",
"import numpy as np\n",
"import xdatasets as xd\n",
"\n",
"import xhydro as xh\n",
"import xhydro.frequency_analysis as xhfa"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Extracting and preparing the data\n",
"For this example, we'll conduct a frequency analysis using historical time series from various sites. We begin by obtaining a dataset comprising hydrological information. Here, we use the [xdataset](https://hydrologie.github.io/xdatasets/notebooks/getting_started.html) library to acquire hydrological data from the [Ministère de l'Environnement, de la Lutte contre les changements climatiques, de la Faune et des Parcs](https://www.cehq.gouv.qc.ca/atlas-hydroclimatique/stations-hydrometriques/index.htm) in Quebec, Canada. Specifically, our query focuses on stations with IDs beginning with `020`, possessing a natural flow pattern and limited to streamflow data. \n",
"\n",
"Users may prefer to generate their own `xarray.DataArray` using their individual dataset. At a minimum, the `xarray.DataArray` used for frequency analysis has to follow these principles:\n",
"\n",
"- The dataset needs a `time` dimension.\n",
"- If there is a spatial dimension, such as `id` in the example below, it needs an attribute `cf_role` with `timeseries_id` as its value.\n",
"- The variable will at the very least need a `units` attribute, although other attributes such as `long_name` and `cell_methods` are also expected by `xclim` (which is called at various points during the frequency analysis) and warnings will be generated if they are missing."
]
},
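{
"cell_type": "markdown",
"metadata": {},
"source": [
"For illustration only, here is a minimal sketch of a dataset built from scratch that satisfies these principles. The station names and streamflow values are synthetic and purely hypothetical:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A hypothetical, minimal DataArray that follows the principles listed above.\n",
"import pandas as pd\n",
"\n",
"time = pd.date_range(\"1970-01-01\", \"1999-12-31\", freq=\"D\")\n",
"da = xr.DataArray(\n",
"    np.random.rand(time.size, 2),  # synthetic streamflow values\n",
"    coords={\"time\": time, \"id\": [\"station_a\", \"station_b\"]},\n",
"    dims=[\"time\", \"id\"],\n",
"    name=\"streamflow\",\n",
")\n",
"da[\"id\"].attrs[\"cf_role\"] = \"timeseries_id\"  # identifies the spatial dimension\n",
"da.attrs = {\"long_name\": \"Streamflow\", \"units\": \"m3 s-1\", \"cell_methods\": \"time: mean\"}"
]
},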
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ds = xd.Query(\n",
" **{\n",
" \"datasets\":{\n",
" \"deh\":{\n",
" \"id\" :[\"020*\"],\n",
" \"regulated\":[\"Natural\"],\n",
" \"variables\":[\"streamflow\"],\n",
" }\n",
" }, \"time\":{\"start\": \"1970-01-01\", \n",
" \"minimum_duration\":(15*365, 'd')},\n",
"\n",
" }\n",
").data.squeeze()\n",
"\n",
"# This dataset lacks some of the aforementioned attributes, so we need to add them.\n",
"ds[\"id\"].attrs[\"cf_role\"] = \"timeseries_id\"\n",
"ds[\"streamflow\"].attrs = {\"long_name\": \"Streamflow\", \"units\": \"m3 s-1\", \"standard_name\": \"water_volume_transport_in_river_channel\", \"cell_methods\": \"time: mean\"}\n",
"\n",
"ds"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ds.streamflow.dropna('time', 'all').hvplot(x='time', grid=True, widget_location='bottom', groupby='id')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Customizing the analysis settings\n",
"\n",
"### a) Defining seasons\n",
"We can define seasons using indexers that are compatible with `xclim.core.calendar.select_time`. There are currently four accepted types of indexers:\n",
"\n",
"- `month`, followed by a sequence of month numbers.\n",
"- `season`, followed by one or more of 'DJF', 'MAM', 'JJA' and 'SON'.\n",
"- `doy_bounds`, followed by a sequence representing the inclusive bounds of the period to be considered (start, end).\n",
"- `date_bounds`, which is the same as above, but using a month-day (%m-%d) format.\n",
"\n",
"For the purpose of getting block maxima through `xhydro.indicators.get_yearly_op`, the indexers need to be grouped within a dictionary, with the key being the label to be given to the requested period of the year. A second key can be used to instruct on the resampling frequency, for example to wrap around the year for winter."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Some examples\n",
"timeargs = {\n",
" \"spring\": {\"date_bounds\": [\"02-11\", \"06-19\"]},\n",
" \"summer\": {\"doy_bounds\": [152, 243]},\n",
" \"fall\": {\"month\": [9, 10, 11]},\n",
" \"winter\": {\"season\": ['DJF'], \"freq\": \"AS-DEC\"}, # To correctly wrap around the year, we need to specify the resampling frequency.\n",
" \"august\": {\"month\": [8]},\n",
" \"annual\": {}\n",
" }"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### b) Getting block maxima\n",
"Upon selecting each desired season, we can extract block maxima timeseries from every station using `xhydro.indicators.get_yearly_op`. The main arguments are:\n",
"\n",
"- `op`: the operation to compute. One of \"max\", \"min\", \"mean\", \"sum\".\n",
"- `input_var`: the name of the variable. Defaults to \"streamflow\".\n",
"- `window`: the size of the rolling window. A \"mean\" is performed on the rolling window prior to the `op` operation.\n",
"- `timeargs`: as defined previously. Leave at `None` to get the annual maxima.\n",
"- `missing` and `missing_options`: to define tolerances for missing data. See [this page](https://xclim.readthedocs.io/en/stable/checks.html#missing-values-identification) for more information.\n",
"- `interpolate_na`: whether to interpolate missing data prior to the `op` operation. Only used for `sum`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The function returns a `xarray.Dataset` with 1 variable per indexer."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Here, we hide years with more than 15% of missing data.\n",
"ds_4fa = xh.indicators.get_yearly_op(ds, op=\"max\", timeargs=timeargs, missing=\"pct\", missing_options={\"tolerance\": 0.15})\n",
"\n",
"ds_4fa"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"ds_4fa.streamflow_max_summer.dropna('time', 'all').hvplot(x='time', grid=True, widget_location='bottom', groupby='id')"
]
},
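{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of the `window` argument described above, the call below would extract the yearly maxima of the 7-day rolling-mean streamflow rather than of the daily values. Everything else mirrors the previous call; `ds_7day` is just an illustrative variable name:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: a 7-day rolling mean is applied to the streamflow before taking the block maxima.\n",
"ds_7day = xh.indicators.get_yearly_op(ds, op=\"max\", window=7, timeargs=timeargs, missing=\"pct\", missing_options={\"tolerance\": 0.15})\n",
"\n",
"ds_7day"
]
},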
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### c) Using custom seasons per year or per station\n",
"\n",
"Using individualized date ranges for each year or each catchment is not explicitely supported, so users should instead mask their data prior to calling `get_yearly_op`. Note that when doing this, `missing` should be adjusted accordingly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create a mask beforehand\n",
"import random\n",
"nyears = np.unique(ds.time.dt.year).size\n",
"dom_start = xr.DataArray(np.random.randint(1, 30, size=(nyears, )).astype(\"str\"), dims=(\"year\"), coords={\"year\": np.unique(ds.time.dt.year)})\n",
"dom_end = xr.DataArray(np.random.randint(1, 30, size=(nyears, )).astype(\"str\"), dims=(\"year\"), coords={\"year\": np.unique(ds.time.dt.year)})\n",
"\n",
"mask = xr.zeros_like(ds[\"streamflow\"])\n",
"for y in dom_start.year.values:\n",
" # Random mask of dates per year, between April and June.\n",
" mask.loc[{\"time\": slice(str(y)+\"-04-\"+str(dom_start.sel(year=y).item()), str(y)+\"-06-\"+str(dom_end.sel(year=y).item()))}] = 1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"mask.hvplot(x='time', grid=True, widget_location='bottom', groupby='id')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The name of the indexer will be used to identify the variable created here\n",
"timeargs_custom = {\"custom\": {}}\n",
"\n",
"# We use where() to mask the data that we want to ignore\n",
"masked = ds.where(mask==1)\n",
"# Since we masked almost all of the year, our tolerance for missing data should be changed accordingly\n",
"missing = \"at_least_n\"\n",
"missing_options = {\"n\": 45}\n",
"\n",
"# We use xr.merge() to combine the results with the previous dataset.\n",
"ds_4fa = xr.merge([ds_4fa, xh.indicators.get_yearly_op(masked, op=\"max\", timeargs=timeargs_custom, missing=missing, missing_options=missing_options)])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ds_4fa.streamflow_max_custom.dropna('time', 'all').hvplot(x='time', grid=True, widget_location='bottom', groupby='id')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### d) Computing volumes\n",
"\n",
"The frequency analysis can also be performed on volumes, using a similar workflow. The main difference is that if we're starting from streamflow, we'll first need to convert them into volumes using `xhydro.indicators.compute_volume`. Also, if required, `get_yearly_op` has an argument `interpolate_na` that can be used to interpolate missing data prior to the sum."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Get a daily volume from a daily streamflow\n",
"ds[\"volume\"] = xh.indicators.compute_volume(ds[\"streamflow\"], out_units=\"hm3\")\n",
"\n",
"# We'll take slightly different indexers\n",
"timeargs_vol = {\n",
" \"spring\": {\"date_bounds\": [\"04-30\", \"06-15\"]},\n",
" \"annual\": {}\n",
" }\n",
"\n",
"# The operation that we want here is the sum, not the max.\n",
"ds_4fa = xr.merge([ds_4fa, xh.indicators.get_yearly_op(ds, op=\"sum\", input_var=\"volume\", timeargs=timeargs_vol, missing=\"pct\", missing_options={\"tolerance\": 0.15}, interpolate_na=True)])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ds_4fa.volume_sum_spring.dropna('time', 'all').hvplot(x='time', grid=True, widget_location='bottom', groupby='id')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Local frequency analysis\n",
"\n",
"Once we have our yearly maximums (or volumes/minimums), the first step in a local frequency analysis is to call `xhfa.local.fit` to obtain distribution parameters. The options are:\n",
"\n",
"- `distributions`: list of [SciPy distributions](https://docs.scipy.org/doc/scipy/reference/stats.html#continuous-distributions). Defaults to [\"expon\", \"gamma\", \"genextreme\", \"genpareto\", \"gumbel_r\", \"pearson3\", \"weibull_min\"].\n",
"- `min_years`: minimum number of years required to fit the data.\n",
"- `method`: fitting method. Defaults to the maximum likelihood."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# To speed up the Notebook, we'll only perform the analysis on a subset of variables\n",
"params = xhfa.local.fit(ds_4fa[[\"streamflow_max_spring\", \"volume_sum_spring\"]], min_years=15)\n",
"\n",
"params"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Information Criteria such as the AIC, BIC, and AICC are useful to determine which statistical distribution is better suited to a given location. These three criteria can be computed using `xhfa.local.criteria`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"criteria = xhfa.local.criteria(ds_4fa[[\"streamflow_max_spring\", \"volume_sum_spring\"]], params)\n",
"\n",
"criteria"
]
},
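{
"cell_type": "markdown",
"metadata": {},
"source": [
"For intuition, the sketch below computes the AIC by hand for a single station and distribution, using $\\mathrm{AIC} = 2k - 2\\ln(\\hat{L})$, where $k$ is the number of fitted parameters and $\\hat{L}$ the maximized likelihood. This is illustrative only; the values may differ slightly from those of `xhfa.local.criteria` depending on its fitting options."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative only: AIC by hand for one station, using a Gumbel distribution.\n",
"from scipy import stats\n",
"\n",
"data = ds_4fa[\"streamflow_max_spring\"].isel(id=0).dropna(\"time\").values\n",
"fitted = stats.gumbel_r.fit(data)  # maximum likelihood estimates of (loc, scale)\n",
"log_likelihood = stats.gumbel_r.logpdf(data, *fitted).sum()\n",
"aic = 2 * len(fitted) - 2 * log_likelihood\n",
"aic"
]
},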
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, return periods can be obtained using `xhfa.local.parametric_quantiles`. The options are:\n",
"\n",
"- `t`: the return period(s) in years.\n",
"- `mode`: whether the return period is the probability of exceedance (max) or non-exceedance (min). Defaults to `max`."
]
},
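{
"cell_type": "markdown",
"metadata": {},
"source": [
"For intuition, in `max` mode the $T$-year return level is the quantile of the fitted distribution with a non-exceedance probability of $1 - 1/T$. A minimal sketch with SciPy and made-up Gumbel parameters (not the actual implementation, which works on the parameters fitted above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: the 100-year return level of a Gumbel distribution with illustrative parameters.\n",
"from scipy import stats\n",
"\n",
"T = 100\n",
"p = 1 - 1 / T  # non-exceedance probability of the 100-year event\n",
"stats.gumbel_r.ppf(p, loc=100, scale=25)"
]
},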
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"rp = xhfa.local.parametric_quantiles(params, t=[20, 100])\n",
"\n",
"rp"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
},
"vscode": {
"interpreter": {
"hash": "e28391989cdb8b31df72dd917935faad186df3329a743c813090fc6af977a1ca"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}
8 changes: 8 additions & 0 deletions environment-dev.yml
@@ -5,6 +5,10 @@ dependencies:
   - python >=3.9,<3.12
   # Don't forget to sync the changes here with environment.yml!
   # Main packages
+  - numpy
+  - scipy
+  - statsmodels
+  - xarray
   - xclim >=0.45.0
   - xscen >=0.7.1
   # Dev
@@ -33,3 +37,7 @@ dependencies:
   # Packaging
   - build
   - wheel
+  # Notebooks
+  - hvplot
+  - pip:
+    - xdatasets
