Summary
~58,000 watershed boundary polygons (plus upstream river networks) have been delineated from MERIT-Hydro for the gauges already supported by RivRetrieve. This issue proposes making them accessible directly through the library.
What exists
hydra-shed (a pure-Rust reimplementation of delineator by Matt Heberger) was used to delineate catchments for every gauge station in the RivRetrieve catalog. The ~58k basins correspond to all gauges whose coordinates have more than 3 decimal places of precision and whose outlets fall inside the MERIT-Hydro basin domain. ~565 outlets could not be snapped to the MERIT-Hydro network and are excluded.
Coverage
| Country | Gauges delineated |
|---|---|
| USA | 23,841 |
| Canada | 7,630 |
| Australia | 6,241 |
| France | 5,327 |
| Brazil | 4,579 |
| Norway | 4,541 |
| Poland | 1,301 |
| South Africa | 1,285 |
| Czech Republic | 825 |
| Japan | 816 |
| Slovenia | 713 |
| Chile | 529 |
| Germany (Berlin) | 189 |
| Lithuania | 95 |
| Portugal | 73 |
| Total | ~58,000 |
GeoPackage schema (two layers per file)
| Layer | Key columns | Geometry |
|---|---|---|
| `watersheds` | `gauge_id`, `gauge_name`, `gauge_lat`, `gauge_lon`, `snap_lat`, `snap_lon`, `snap_dist`, `area`, `basin`, `country`, `low_res` | POLYGON |
| `rivers` | `gauge_id`, `comid`, `uparea`, `strahler`, `shreve` | LINESTRING |
Total data size is ~31 GB across 15 GeoPackage files (USA alone is ~10 GB), so per-country lazy downloading is essential.
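Since a GeoPackage file is an SQLite database, a downloaded file can be sanity-checked with the standard library alone. The sketch below is illustrative only: the helper names `list_feature_layers` and `watershed_attributes` are hypothetical, and it assumes the layer tables are named exactly as in the schema above. It reads attributes only; geometry would still be loaded through `geopandas`.

```python
import sqlite3
from contextlib import closing
from typing import Dict, List, Optional


def list_feature_layers(gpkg_path: str) -> List[str]:
    """Return the feature layers registered in the file's gpkg_contents table."""
    with closing(sqlite3.connect(gpkg_path)) as conn:
        rows = conn.execute(
            "SELECT table_name FROM gpkg_contents "
            "WHERE data_type = 'features' ORDER BY table_name"
        ).fetchall()
    return [name for (name,) in rows]


def watershed_attributes(gpkg_path: str, gauge_id: str) -> Optional[Dict]:
    """Look up a few non-geometry columns for one gauge (None if absent)."""
    # Column names are taken from the schema table in this issue.
    columns = ("gauge_id", "gauge_name", "area", "country")
    with closing(sqlite3.connect(gpkg_path)) as conn:
        row = conn.execute(
            f"SELECT {', '.join(columns)} FROM watersheds WHERE gauge_id = ?",
            (gauge_id,),
        ).fetchone()
    return dict(zip(columns, row)) if row else None
```

A lookup like this could also back the "gauge not found" check without loading a multi-gigabyte layer into memory.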
Proposed hosting
HuggingFace Datasets at `rivretrieve/watersheds`: free, programmatic access via `huggingface_hub`, with built-in caching.
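A minimal sketch of what per-country lazy downloading could look like. The one-file-per-country naming (`<country_code>.gpkg`) and the helper names are assumptions, not a settled layout. The `downloader` parameter is a hypothetical injection point so the function can be exercised without network access.

```python
from typing import Callable, Optional

REPO_ID = "rivretrieve/watersheds"  # dataset repo named in this proposal


def gpkg_filename(country_code: str) -> str:
    # Assumed layout: one GeoPackage per country at the repo root.
    return f"{country_code}.gpkg"


def fetch_country_gpkg(
    country_code: str,
    downloader: Optional[Callable[..., str]] = None,
) -> str:
    """Return a local path to the country's GeoPackage, downloading if needed."""
    if downloader is None:
        # Lazy import keeps huggingface_hub an optional dependency.
        from huggingface_hub import hf_hub_download

        downloader = hf_hub_download
    # hf_hub_download caches files locally, so repeat calls reuse the download.
    return downloader(
        repo_id=REPO_ID,
        filename=gpkg_filename(country_code),
        repo_type="dataset",
    )
```

With this shape, only the countries a user actually queries are ever downloaded, which matters given the ~10 GB USA file.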
Proposed API
```python
from rivretrieve import PortugalFetcher

fetcher = PortugalFetcher()

# Returns a GeoDataFrame with the watershed polygon for gauge "04K/04A"
watershed = fetcher.get_watershed("04K/04A")

# Optionally include the upstream river network
watershed, rivers = fetcher.get_watershed("04K/04A", include_rivers=True)
```

Implementation outline
- `COUNTRY_CODE` class attribute on each fetcher (e.g. `PortugalFetcher.COUNTRY_CODE = "portugal"`). Fetchers without watershed data (Spain, UK-EA, UK-NRFA) still get the attribute; they produce a clear `ValueError` listing available countries.
- `get_watershed()` concrete method on the `RiverDataFetcher` base class, so all fetchers inherit it automatically:

  ```python
  def get_watershed(self, gauge_id: str, include_rivers: bool = False):
      from . import watersheds  # lazy import
      return watersheds.query_watershed(
          country_code=self.COUNTRY_CODE,
          gauge_id=gauge_id,
          include_rivers=include_rivers,
      )
  ```
- New module `rivretrieve/watersheds.py`: a thin wrapper around `huggingface_hub.hf_hub_download` and `geopandas.read_file`. Handles per-country gpkg download, caching, and gauge ID lookup.
- `geopandas` as an optional dependency:

  ```python
  # setup.py
  extras_require={
      "geo": ["geopandas>=0.12.0", "huggingface_hub>=0.20.0"],
  }
  ```

  ```
  pip install rivretrieve[geo]
  ```
- Clear error messages for: gauge not found in database, country not covered, geopandas/huggingface_hub not installed.
- Unit tests with a tiny fixture GeoPackage, mocking `hf_hub_download`.
- Docs page `docs/watersheds.rst`.
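The three error cases listed in the outline could be handled roughly as follows. This is a sketch under assumptions: the helper names (`require_geo_dependencies`, `check_country`, `check_gauge`) and the message wording are hypothetical, and the set of available countries would come from wherever the library ends up recording coverage.

```python
import importlib.util
from typing import Collection, Iterable

GEO_MODULES = ("geopandas", "huggingface_hub")


def require_geo_dependencies(modules: Iterable[str] = GEO_MODULES) -> None:
    """Raise ImportError with an install hint if an optional module is missing."""
    missing = [name for name in modules if importlib.util.find_spec(name) is None]
    if missing:
        raise ImportError(
            f"Watershed support requires: {', '.join(missing)}. "
            "Install with: pip install rivretrieve[geo]"
        )


def check_country(country_code: str, available: Collection[str]) -> None:
    """Raise ValueError listing covered countries if this one has no data."""
    if country_code not in available:
        raise ValueError(
            f"No watershed data for '{country_code}'. "
            f"Available countries: {', '.join(sorted(available))}"
        )


def check_gauge(gauge_id: str, known_ids: Collection[str], country_code: str) -> None:
    """Raise ValueError if the gauge has no delineated watershed."""
    if gauge_id not in known_ids:
        raise ValueError(
            f"Gauge '{gauge_id}' not found in the '{country_code}' watershed database."
        )
```

Running the dependency check first means users without the `geo` extra get the install hint rather than a raw import traceback.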
Questions for discussion
- Is the HuggingFace hosting approach acceptable, or would you prefer a different mechanism (e.g. Zenodo, direct S3)?
- Should `COUNTRY_CODE` be a required abstract class attribute, or is a softer convention acceptable?
- Should the rivers layer be exposed at all in v1, or should v1 stay watershed-only?
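To make the `COUNTRY_CODE` question concrete, here is one way the "required" option could be enforced, using `__init_subclass__` so that a missing attribute fails at class-definition time. The `RiverDataFetcher` below is a stripped-down stand-in for illustration, not the library's actual base class.

```python
class RiverDataFetcher:
    """Stand-in base class; the real one has more responsibilities."""

    COUNTRY_CODE = None  # subclasses are expected to override this

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Strict option: reject subclasses that do not set a usable code.
        if not isinstance(cls.COUNTRY_CODE, str) or not cls.COUNTRY_CODE:
            raise TypeError(
                f"{cls.__name__} must define a non-empty string COUNTRY_CODE"
            )


class PortugalFetcher(RiverDataFetcher):
    COUNTRY_CODE = "portugal"
```

The softer convention would drop the `__init_subclass__` check and let `get_watershed()` raise when `COUNTRY_CODE` is `None`. One trade-off of the strict variant is that intermediate helper base classes would also have to declare a code.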
Credits
- Watershed delineation: hydra-shed (Rust), implementing the algorithm from delineator by Matt Heberger
- Hydrography: MERIT-Hydro (Yamazaki et al., 2019)