[Feature] Add MERIT-Hydro watershed boundaries for ~58k gauges #87

@CooperBigFoot

Summary

~58,000 watershed boundary polygons (plus upstream river networks) have been delineated from MERIT-Hydro for the gauges already supported by RivRetrieve. This issue proposes making them accessible directly through the library.

What exists

hydra-shed (a pure-Rust reimplementation of Matt Heberger's delineator) was used to delineate catchments for the gauge stations in the RivRetrieve catalog. The ~58k basins cover all gauges whose coordinates are given to more than 3 decimal places and whose outlets fall inside the MERIT-Hydro basin domain. ~565 outlets could not be snapped to the MERIT-Hydro network and are excluded.

Coverage

| Country | Gauges delineated |
| --- | --- |
| USA | 23,841 |
| Canada | 7,630 |
| Australia | 6,241 |
| France | 5,327 |
| Norway | 4,541 |
| Brazil | 4,579 |
| Poland | 1,301 |
| South Africa | 1,285 |
| Czech Republic | 825 |
| Japan | 816 |
| Slovenia | 713 |
| Chile | 529 |
| Germany (Berlin) | 189 |
| Lithuania | 95 |
| Portugal | 73 |
| Total | ~58,000 |

GeoPackage schema (two layers per file)

| Layer | Key columns | Geometry |
| --- | --- | --- |
| watersheds | gauge_id, gauge_name, gauge_lat, gauge_lon, snap_lat, snap_lon, snap_dist, area, basin, country, low_res | POLYGON |
| rivers | gauge_id, comid, uparea, strahler, shreve | LINESTRING |

Total data size is ~31 GB across 15 GeoPackage files (USA alone is ~10 GB), so per-country lazy downloading is essential.
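Because a GeoPackage is an SQLite database, the layer names in a downloaded file can be checked without geopandas. The sketch below is only an illustration of that idea: `list_feature_layers` is a hypothetical helper, not part of RivRetrieve, and it relies on the `gpkg_contents` catalogue table that the GeoPackage standard requires.

```python
import sqlite3
from contextlib import closing


def list_feature_layers(gpkg_path: str) -> list[str]:
    """Return the names of the vector layers stored in a GeoPackage file."""
    # gpkg_contents is the catalogue table defined by the GeoPackage standard;
    # rows with data_type 'features' describe vector layers.
    query = (
        "SELECT table_name FROM gpkg_contents "
        "WHERE data_type = 'features' ORDER BY table_name"
    )
    with closing(sqlite3.connect(gpkg_path)) as conn:
        return [name for (name,) in conn.execute(query)]
```

For a file that follows the schema above, this should list `rivers` and `watersheds`.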

Proposed hosting

HuggingFace Datasets at rivretrieve/watersheds — free, programmatic access via huggingface_hub, built-in caching.
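A minimal sketch of what the per-country download could look like, under assumptions: the helper name `fetch_country_gpkg` and the `<country_code>.gpkg` file naming are hypothetical and would need to match the actual repository layout. `hf_hub_download` returns a local path and reuses its cache on later calls. The `download` parameter exists only so the function can be exercised without network access.

```python
from typing import Callable, Optional

REPO_ID = "rivretrieve/watersheds"  # dataset repo named in this proposal


def fetch_country_gpkg(
    country_code: str,
    download: Optional[Callable[..., str]] = None,
) -> str:
    """Return a local path to one country's GeoPackage, downloading on first use."""
    if download is None:
        # Imported lazily so the core package works without the optional extra.
        from huggingface_hub import hf_hub_download as download
    return download(
        repo_id=REPO_ID,
        filename=f"{country_code}.gpkg",  # assumed file naming, not confirmed
        repo_type="dataset",
    )
```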

Proposed API

from rivretrieve import PortugalFetcher

fetcher = PortugalFetcher()

# Returns a GeoDataFrame with the watershed polygon for gauge "04K/04A"
watershed = fetcher.get_watershed("04K/04A")

# Optionally include the upstream river network
watershed, rivers = fetcher.get_watershed("04K/04A", include_rivers=True)

Implementation outline

  1. COUNTRY_CODE class attribute on each fetcher (e.g. PortugalFetcher.COUNTRY_CODE = "portugal"). Fetchers without watershed data (Spain, UK-EA, UK-NRFA) still get the attribute — they produce a clear ValueError listing available countries.

  2. get_watershed() concrete method on RiverDataFetcher base class — all fetchers inherit it automatically:

    def get_watershed(self, gauge_id: str, include_rivers: bool = False):
        from . import watersheds  # lazy import
        return watersheds.query_watershed(
            country_code=self.COUNTRY_CODE,
            gauge_id=gauge_id,
            include_rivers=include_rivers,
        )
  3. New module rivretrieve/watersheds.py — thin wrapper around huggingface_hub.hf_hub_download and geopandas.read_file. Handles per-country gpkg download, caching, and gauge ID lookup.

  4. geopandas as optional dependency:

    # setup.py
    extras_require={
        "geo": ["geopandas>=0.12.0", "huggingface_hub>=0.20.0"],
    }

    # Install with the optional extra:
    pip install "rivretrieve[geo]"
    
  5. Clear error messages for: gauge not found in database, country not covered, geopandas/huggingface_hub not installed.

  6. Unit tests with a tiny fixture GeoPackage, mocking hf_hub_download.

  7. Docs page docs/watersheds.rst.
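Items 3, 5 and 6 of the outline could fit together roughly as follows. This is a sketch under stated assumptions, not the final module: the `AVAILABLE_COUNTRIES` values, the `resolve_path` and `read_layer` parameters, and the choice of `KeyError` for a missing gauge are all hypothetical. Passing the path resolver and layer reader in as callables is one way to let unit tests run against small in-memory fixtures instead of real downloads.

```python
from typing import Callable, Optional

# Illustrative subset only; the real list and spelling of codes are assumptions.
AVAILABLE_COUNTRIES = ("canada", "portugal", "usa")


def query_watershed(
    country_code: str,
    gauge_id: str,
    include_rivers: bool = False,
    *,
    resolve_path: Callable[[str], str],
    read_layer: Optional[Callable[[str, str], object]] = None,
):
    """Look up one gauge's watershed (and optionally its rivers) in a country file."""
    if country_code not in AVAILABLE_COUNTRIES:
        raise ValueError(
            f"No watershed data for {country_code!r}. "
            f"Available countries: {', '.join(AVAILABLE_COUNTRIES)}"
        )
    if read_layer is None:
        try:
            import geopandas
        except ImportError as exc:
            raise ImportError(
                "Watershed support needs the optional extra: "
                "pip install 'rivretrieve[geo]'"
            ) from exc

        def read_layer(path, layer):
            return geopandas.read_file(path, layer=layer)

    path = resolve_path(country_code)  # e.g. a cached per-country download
    watersheds = read_layer(path, "watersheds")
    match = watersheds[watersheds["gauge_id"] == gauge_id]
    if len(match) == 0:
        raise KeyError(
            f"Gauge {gauge_id!r} not found in the {country_code} watershed data"
        )
    if not include_rivers:
        return match
    rivers = read_layer(path, "rivers")
    return match, rivers[rivers["gauge_id"] == gauge_id]
```

In production the default reader loads each layer with `geopandas.read_file`, so the returned objects are GeoDataFrames filtered on `gauge_id`.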

Questions for discussion

  1. Is the HuggingFace hosting approach acceptable, or would you prefer a different mechanism (e.g. Zenodo, direct S3)?
  2. Should COUNTRY_CODE be a required abstract class attribute, or is a softer convention acceptable?
  3. Should the rivers layer be exposed at all in v1, or keep it watershed-only?
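To make question 2 concrete, here is one way the stricter option could be enforced. It is a sketch only: the `RiverDataFetcher` below is a stripped-down stand-in for the real base class, and checking in `__init_subclass__` is just one possible mechanism.

```python
class RiverDataFetcher:
    """Stand-in for the real base class, reduced to the COUNTRY_CODE check."""

    COUNTRY_CODE: str  # annotation only; each concrete fetcher must set a value

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Fail when the subclass is defined, not later when get_watershed() is called.
        if not isinstance(getattr(cls, "COUNTRY_CODE", None), str):
            raise TypeError(f"{cls.__name__} must define a COUNTRY_CODE string")


class PortugalFetcher(RiverDataFetcher):
    COUNTRY_CODE = "portugal"
```

With this check, a fetcher that omits the attribute raises `TypeError` as soon as its class body is evaluated. The trade-off is that intermediate abstract subclasses would also need a value, which may be an argument for the softer convention.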

Credits
