Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding first gis module to perform geospatial operations, notebook builds #61

Merged
merged 103 commits into from
Mar 12, 2024

Conversation

sebastienlanglois
Copy link
Contributor

@sebastienlanglois sebastienlanglois commented Dec 19, 2023

Pull Request Checklist:

  • This PR addresses an already opened issue (for bug fixes / features)
  • (If applicable) Documentation has been added / updated (for bug fixes / features).
  • (If applicable) Tests have been added.
  • CHANGES.rst has been updated (with summary of main changes).
    • Link to issue (:issue:number) and pull request (:pull:number) has been added.

What kind of change does this PR introduce?

This PR adds a GIS module for usual geospatial operations that are common in hydrology such a watershed delineation, watershed properties extraction, etc. It adapts the work that's been done in ravenpy while also adding some new functionalities.

Watershed Delineation

  • Support concurrent delineation of multiple watersheds simultaneously.
  • Enable access to official watershed polygons (shapefiles/geojson/geoparquet) from authoritative sources (DEH, HYDAT, USGS, HQ, etc.) —implemented collaboratively with xdatasets.

Physiographic Variable (or others) Extraction

  • Support simultaneous extraction of physiographic variables across multiple watersheds.
  • Facilitate the extraction of variables present in STAC catalogs (e.g., Planetary Computer).
  • Implement extraction considering pixel weighting rather than an "all_touched" approach, as this can significantly impact final results —implemented collaboratively with xdatasets.

Does this PR introduce a breaking change?

No

Other information:

This PR also integrates the changes from #65 and #68

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@sebastienlanglois sebastienlanglois added documentation Improvements or additions to documentation enhancement New feature or request labels Dec 19, 2023
@sebastienlanglois
Copy link
Contributor Author

@Zeitsperre @TC-FF
I would like to prepare the .po file for the doc that I'll be adding with this PR. Is there any documentation about the process ?

@Zeitsperre
Copy link
Collaborator

@sebastienlanglois

I mentioned this to @TC-FF in my other Pull Request, but I haven't set things up to translate the fucntions and classes of the library. The reason for this is because docstrings can change much more often than the rest of the documentation. This means that po files will need to be manually touched up on every pull request that changes code.

Quickly going over the documentation on po files (https://www.gnu.org/software/gettext/manual/html_node/PO-Files.html) It doesn't appear that the #: comments require for a message to be on particular line (which is good), but if we want to have translated docstrings, we likely need to add documentation for performing this.

@TC-FF Would you want to add a new section in CONTRIBUTING.rst explaining how to do this (with poedit)? I'll open an issue to discuss committing the docstring (autodoc) files.

@sebastienlanglois
Copy link
Contributor Author

@sebastienlanglois

I mentioned this to @TC-FF in my other Pull Request, but I haven't set things up to translate the fucntions and classes of the library. The reason for this is because docstrings can change much more often than the rest of the documentation. This means that po files will need to be manually touched up on every pull request that changes code.

Quickly going over the documentation on po files (https://www.gnu.org/software/gettext/manual/html_node/PO-Files.html) It doesn't appear that the #: comments require for a message to be on particular line (which is good), but if we want to have translated docstrings, we likely need to add documentation for performing this.

@TC-FF Would you want to add a new section in CONTRIBUTING.rst explaining how to do this (with poedit)? I'll open an issue to discuss committing the docstring (autodoc) files.

Sorry I was not super clear, I meant the documentation for the examples based on the notebooks. I agree that docstrings/classes/fonctions translation is probably too difficult to maintain. But for notebooks and .rst translation, I agree that a new section in CONTRIBUTING.rst would be useful.

@sebastienlanglois
Copy link
Contributor Author

@Zeitsperre
I need to have xagg in the environment running the notebooks for the RTD documentation. I've added the library to the environment-dev.yml file but I'm usure where to put it in pyproject.toml.

Is is part of dev = [...] or docs = [ ... ] ? Thanks!

environment-dev.yml Outdated Show resolved Hide resolved
@Zeitsperre
Copy link
Collaborator

Zeitsperre commented Dec 20, 2023

@Zeitsperre I need to have xagg in the environment running the notebooks for the RTD documentation. I've added the library to the environment-dev.yml file but I'm usure where to put it in pyproject.toml.

Is is part of dev = [...] or docs = [ ... ] ? Thanks!

So we listed xagg in the extra recipe of xdatasets. If the dependency is needed for extra functionality of xdatasets within the xhydro library, I'd add it to the pyproject.toml like so:

[project.optional-dependencies]
...
extra = ["xdatasets[extra]>=0.3.1"]

If it's exclusively for an xagg function to show off in a notebook, I'd add xagg to the docs recipe:

[project.optional-dependencies]
docs = [
  ...
  "xagg"
]

@sebastienlanglois
Copy link
Contributor Author

@Zeitsperre I need to have xagg in the environment running the notebooks for the RTD documentation. I've added the library to the environment-dev.yml file but I'm usure where to put it in pyproject.toml.
Is is part of dev = [...] or docs = [ ... ] ? Thanks!

So we listed xagg in the extra recipe of xdatasets. If the dependency is needed for extra functionality of xdatasets within the xhydro library, I'd add it to the pyproject.toml like so:

[project.optional-dependencies]
...
extra = ["xdatasets[extra]>=0.3.1"]

If it's exclusively for an xagg function to show off in a notebook, I'd add xagg to the docs recipe:

[project.optional-dependencies]
docs = [
  ...
  "xagg"
]

It is indeed to use extra functionnalities of xdatasets but not within the xhydro library. In the notebook example, we first call xdatasets [extra] and then, using the results, we call xhydro. Would that use case be covered by this ? :

[project.optional-dependencies]
docs = [
  ...
  "xagg"
]

@Zeitsperre
Copy link
Collaborator

It is indeed to use extra functionnalities of xdatasets but not within the xhydro library. In the notebook example, we first call xdatasets [extra] and then, using the results, we call xhydro. Would that use case be covered by this ? :

[project.optional-dependencies]
docs = [
  ...
  "xagg"
]

That looks good to me!

@sebastienlanglois
Copy link
Contributor Author

@Zeitsperre
Other question, the RTD build is failing due to excessive memory consumption that is caused by the extraction and processing of data from xdatasets. I know I could reduce the amount of data and calculations involved but at the same time, I would prefer to show a real and useful example in the documentation.

As an alternative, I was thinking that I could :

  1. Convert my code cell to markdown
  2. Precalculate and store the query's results from xdatasets, perhaps in xhydro-testdata or in cloud storage
  3. Use a hidden cell the would fetch the precalculated data

In essence, this would reduce memory consumption for the RTD build while hiding the fact that the data is precalculated. We assume that the user will have more than enough (+8Gb) memory when actually testing the notebook in their own environment.

As I'm not that familiar with the RTD build, do you know if there is a way (cell tag, metadata, etc.) to hide a cell when bulding the docs while also allowing its execution ? Or alternatively, do you have a better idea that this ? Thanks !

N.B. : The same xdatasets' query runs fine when using a github runner (7 Gb RAM). I'm curious to see what are the specs for the RTD runners.

@Zeitsperre
Copy link
Collaborator

@sebastienlanglois

I think the total memory allocation for each builder is around 4GB (but it can be bumped up if you send the ReadTheDocs maintainers a message from the admin page of xhydro on RTD with some justifications). Given that it's a completely free service, I don't like the idea of asking for more memory too often (we've been granted more memory for a few projects at Ouranos).

If it isn't important to run the notebook on every change to xhydro, one thing you could do is set it so that that particular notebook is not run on ReadTheDocs. This would be as simple as doing the following:

  • Setting the pre-commit hook for nbstripout to not remove outputs from your particular notebook (https://github.com/kynan/nbstripout) and setting nbsphinx_execute to auto
    • This means that the notebook is never run (because it always has outputs), but because the other notebooks are stripped of outputs, they are always run.

This approach is a bit problematic since it means that the notebook is never checked at all. An alternative approach could look like:

  • Do the same as above, but add a notebook checking build on GitHub that runs all notebooks, including the xagg-dependent one. Notebook failures will block Pull Request merges
    • If the notebooks are relatively stable, this shouldn't often cause problems for contributors.
  • Disable notebooks from being run at all on ReadTheDocs, and have a GitHub Workflow that runs all notebooks and commits their outputs every time that a merge occurs to the main branch.
    • Fancier / more complex, but leverages GitHub's more liberal memory/processing power to reduce strain on ReadTheDocs.
    • Could be made simpler if the notebooks are run on a cron-based schedule (e.g. every Sunday night) or when preparing a new release (also complex, but possible).

There are a lot of ways to go about doing this. I've implemented similar approaches in xclim (see: https://github.com/Ouranosinc/xclim/blob/e678308f2c5cde75a3c63cddcab7835dab1e422e/docs/conf.py#L206). Curious to hear what you think of these.

@sebastienlanglois
Copy link
Contributor Author

sebastienlanglois commented Dec 20, 2023

@Zeitsperre
Thanks, those are some really great suggestions!

Looking ahead, we might come across other notebooks that are memory-intensive, making it make sense to switch to Github Workflows. However since this might involve quite a bit of work, we could maybe start by checking out your second suggestion for now :

  • pre-commit hook for nbstripout + running a notebook checking build on GitHub .

So, if I get it right, when we run the notebook checking build on Github, it goes through all the notebooks. Does this imply that the notebooks not affected with nbstripout removing outputs (all notebooks except the xagg dependant one, for now) would run twice—once in the Github checking build and again during the ReadTheDocs build (assuming Github check completed successfully) ?

@sebastienlanglois
Copy link
Contributor Author

@Zeitsperre
For now, I have deactivated the execution of nbsphinx only for this notebook (GIS notebook) following this approach:
https://nbsphinx.readthedocs.io/en/0.9.3/never-execute.html

image

Coming back from the holidays, we can set this up properly!

Copy link
Collaborator

@RondeauG RondeauG left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mes commentaires que j'avais laissés sont encore valides, mais mineurs et devraient pouvoir facilement être réglés. @sebastienlanglois , quand penses-tu avoir le temps de regarder ?

@sebastienlanglois
Copy link
Contributor Author

Mes commentaires que j'avais laissés sont encore valides, mais mineurs et devraient pouvoir facilement être réglés. @sebastienlanglois , quand penses-tu avoir le temps de regarder ?

Je profite de la semaine de relâche pour terminer ça! Ce sera fini d'ici la fin de la semaine sans faute!

@sebastienlanglois
Copy link
Contributor Author

@Zeitsperre @RondeauG
I've updated this branch based on @RondeauG's review. It should now be ready to merge but for some reason, the notebooks testing suite is not successful and I don't know why, The GIS notebook runs correctly on my local machine (python 3.10) with a environment-dev.yml conda environment.

@Zeitsperre
Copy link
Collaborator

@sebastienlanglois The kernelspec is a weird artifact that notebooks use. I have a hook in a few other projects that simply removes that field (since if it's wrong, that causes errors; no problem if missing). I'll see if I can find the config.

@Zeitsperre
Copy link
Collaborator

@sebastienlanglois

The last thing we need to fix here is that the notebooks need to be re-run with their outputs saved. ReadTheDocs can't handle running the notebook examples, but if they are saved to the notebooks, they can parse and display them at least.

@sebastienlanglois
Copy link
Contributor Author

@sebastienlanglois

The last thing we need to fix here is that the notebooks need to be re-run with their outputs saved. ReadTheDocs can't handle running the notebook examples, but if they are saved to the notebooks, they can parse and display them at least.

Thanks! That's what I figured from our previous discussions so I was surprised that it didn't work out. Now I understand that the culprit was the kernelspec property in the notebook. I will push the notebook with outputs in a few minutes and hopefully, we are done with this PR!

Copy link
Collaborator

@RondeauG RondeauG left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quelques suggestions très mineures, mais ça me semble prêt ! Bravo :)

.github/workflows/main.yml Outdated Show resolved Hide resolved
Comment on lines +116 to +120
ESMF_VERSION := $(shell cat $(ESMFMKFILE) | grep "ESMF_VERSION_STRING=" | awk -F= '{print $$2}' | tr -d "'")
install-esmpy: clean ## install esmpy from git based on installed ESMF_VERSION
pip install git+https://github.com/esmf-org/esmf.git@v$(ESMF_VERSION)\#subdirectory=src/addon/esmpy

install: install-esmpy ## install the package to the active Python's site-packages
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Zeitsperre We can leave this here as an additional security, but this should not be required anymore (yay!)

  • If xESMF is badly installed or initiated, xscen will simply deactivate the functions using it. xhydro/xdatasets do not use it at all, for now.
  • esmf/esmpy 8.6.0 has been released and should have fixed that bug too.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to, we can remove the install-esmpy step from install. This just means that we don't install the GitHub repo version when we run $ make install. I can see having that disabled as being beneficial for users in systems that don't have grep, awk, and tr installed (powershell and command prompt, probably)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave that for another PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wanted to point out that xdatasets has xesmf as a dependency because of xagg. I had initially created a spinoff xagg library with no xesmf dependency (because xesmf is used for regridding but not for spatial averaging in xagg, which is the only thing required for xdatasets) but we now have xagg as a dependency.

xesmf is required for the notebooks but might not be for the tests.

docs/index.rst Outdated Show resolved Hide resolved
environment-dev.yml Outdated Show resolved Hide resolved
xhydro/gis.py Outdated Show resolved Hide resolved
@github-actions github-actions bot added the approved Approved for additional tests label Mar 12, 2024
sebastienlanglois and others added 4 commits March 12, 2024 11:31
Co-authored-by: RondeauG <38501935+RondeauG@users.noreply.github.com>
Co-authored-by: RondeauG <38501935+RondeauG@users.noreply.github.com>
Co-authored-by: RondeauG <38501935+RondeauG@users.noreply.github.com>
Co-authored-by: RondeauG <38501935+RondeauG@users.noreply.github.com>
@RondeauG
Copy link
Collaborator

RondeauG commented Mar 12, 2024

@sebastienlanglois Il y a eu un mixup dans les triggers pour les tests, mais c'est bon maintenant. Tu peux merger quand tu es prêt !

@sebastienlanglois sebastienlanglois merged commit 376684f into main Mar 12, 2024
24 checks passed
@sebastienlanglois sebastienlanglois deleted the basin-delineation branch March 12, 2024 17:25
@sebastienlanglois
Copy link
Contributor Author

sebastienlanglois commented Mar 22, 2024

@Zeitsperre @RondeauG I know this PR is closed but just wanted to point out that xagg has changed the xesmf dependency as optional (only required for regridding). This means that xdatasets no longer has a xesmf dependency when installing xagg.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Approved for additional tests CI documentation Improvements or additions to documentation enhancement New feature or request notebooks Run tests against notebooks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Geospatial operations for hydrological analysis
4 participants