
What is the relationship between hydrofunctions, Ulmo, dataretrieval, HyRiver, and others? #79

mroberge opened this issue Mar 18, 2021 · 13 comments



mroberge commented Mar 18, 2021

In response to this issue DOI-USGS/dataretrieval-python#8 and comments from @emiliom @jkreft-usgs @DanCodigaMWRA

There are several open-source software projects that allow you to request, parse, and analyze hydrology data from the USGS NWIS. Why are there overlapping projects? My guess is that it is a combination of scientists writing code to meet their very specific research needs, people creating projects without searching for what already exists, and people feeling uncomfortable working with people they don't know yet. I'd like to work with the maintainers of other projects to eliminate some of the redundancy and improve cooperation.

My name is Martin Roberge, and I'm the author of hydrofunctions. I do research on stream hydrology and I'm an educator. I made hydrofunctions to meet my specific needs: I download lots of stream gauge data from the USGS which I then analyze inside Pandas dataframes in a Jupyter notebook. Since most of my students do not come from programming backgrounds, I have spent most of my time trying to make hydrofunctions easy for beginners to use.

My main goals for hydrofunctions are:

  • Easy to Use
    • Comprehensive User's manual with lots of examples and applications
    • Extensive docstrings so you can use help() while coding. The docstrings have examples too.
    • Helpful error messages for common mistakes
  • Preserve the information from the NWIS
    • Maintain access to the original WaterML / json
    • Keep data quality flags
    • Access the sensor & station metadata inside the original WaterML
    • Keep track of sensor sampling frequency, in case you measure some things once an hour and other things once a day.
    • Print a quick data summary so you know what you have
  • Create custom dataframes from your request
    • Include one site or many
    • Output data for one sensor or all of the sensors.
    • Output only the data quality flags for QA/QC if you want
  • Code Quality:
    • Write lots of tests. I find it difficult to develop without them: if you don't have a test for every function and every branch in the logic, how will you know if the latest change creates a new bug?
    • Use continuous integration to make sure everything works all the time.
    • Document everything

The problem is that I am just one person, and every hour I spend adding functionality to hydrofunctions is an hour I could have spent measuring how fast flood waves travel down a river, or whatever I'm up to that day. I would love to collaborate with someone else.

Other projects that work with NWIS data are:

  • PyGeoHydro package is part of the HyRiver project by Taher Chegini @cheginit. In a little more than a year he has created an enormous set of packages for data access, mapping, and analysis. HyRiver is mostly focused on working with mapping webservices, but the PyGeoHydro package allows you to request daily discharge from the NWIS and outputs it in a dataframe. He seems focused on watershed data: he makes it possible to calculate the landcover for your watershed, access the National Inventory of Dams, access meteorology data, make maps from the NHDplus data... he includes a lot of functionality.
  • Ulmo: this is the OG, as the kids used to say (first commit in 2011!). It can request data from at least fifteen different sources. ulmo.usgs.nwis.get_site_data() is the function that requests stream gauge data. Ulmo processes the original WaterML and returns a dictionary that needs further processing before the data can be used in a dataframe. It can be finicky when you are requesting stream gauge data, and I can't always figure out what is wrong with my requests. Emilio Mayorga @emiliom is the lead developer now.
  • dataretrieval was set up in 2018 as a Python alternative to the dataRetrieval R package. It is tailored specifically to the services provided by the NWIS and is maintained by USGS staff. You can access the different NWIS services easily by specifying the service in your request, and everything gets output to a dataframe. It is maintained by Timothy Hodson @thodson-usgs.
  • Pastas is for working with hydrological time series. It doesn't appear to have any functions that collect data; instead, most of its functionality lets you carefully control how you fill missing data, resample data to different frequencies, create artificial datasets, model how a dependent time series responds to an independent one, and conduct various time series analyses. It is actively maintained by a large team of collaborators and is led by Raoul Collenteur @raoulcollenteur.
  • pywaterinfo is a package for downloading stream data from the Flanders Environment Agency. It is run by @stijnvanhoey
  • streamstats will retrieve the HUC8 watershed for a given point. It uses the USGS StreamStats API. This is from the very active earthlab group, who seem to do a lot of training and workshops for Earth Scientists. @mbjoseph @lwasser
  • WellApplication is a collection of tools for analyzing groundwater; it can collect data from NWIS. Maintained by Paul Inkenbrandt. @inkenbrandt
  • hydropy is no longer actively maintained.

Please let me know if anyone thinks that I have mischaracterized their project.
I would love to hear your opinion about how these different projects could collaborate or how we could 'stake out ground' so that we don't replicate functionality. Why re-invent the wheel?
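To make the "preserve the information from the NWIS" goals above concrete, here is a minimal sketch of parsing a simplified, hypothetical WaterML-like JSON response into a pandas dataframe while keeping the data quality flags. The field names (dateTime, value, qualifiers) mirror the general shape of NWIS responses but are illustrative, not the exact schema that any of these packages uses.

```python
import pandas as pd

# Hypothetical, simplified version of the JSON that NWIS WaterML services
# return: each value carries a timestamp and a data-quality flag.
raw = {
    "site": "01585200",
    "variable": "discharge",
    "values": [
        {"dateTime": "2021-03-01T00:00:00", "value": "4.5", "qualifiers": ["P"]},
        {"dateTime": "2021-03-01T00:15:00", "value": "4.7", "qualifiers": ["P"]},
        {"dateTime": "2021-03-01T00:30:00", "value": "4.6", "qualifiers": ["A"]},
    ],
}

# Build a dataframe that keeps both the measurements and the quality flags,
# rather than discarding the flags during parsing.
df = pd.DataFrame(raw["values"])
df["dateTime"] = pd.to_datetime(df["dateTime"])
df["value"] = df["value"].astype(float)
df["qualifiers"] = df["qualifiers"].str.join(",")
df = df.set_index("dateTime")

print(df)
```

Keeping the qualifiers as a column, rather than dropping them during parsing, is what makes a flags-only QA/QC view possible later.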

@inkenbrandt

I think this is a good summary, Martin. I would actually be keen on transferring my maintenance efforts to supporting your script. I personally haven't contributed to other repositories because I don't feel experienced enough to make meaningful improvements. Everyone has their own way of accessing the NWIS services and making them available through commands. I am used to the commands in my library and know exactly what they are doing, but I wouldn't be opposed to collaborating. I like how you have added tests and good documentation to your library. I am also an earth-science educator, but just for introductory geology.

@thodson-usgs

Thanks Martin,
You summarize dataretrieval well. In the Unix spirit, dataretrieval tries to do one thing - data retrieval of (mostly) USGS hydrologic data - and do that one thing well. I'm happy to help support additional USGS datasets or to work to standardize and integrate dataretrieval into a more cohesive ecosystem of tools. Although I often leave it on the back burner, I don't plan to abandon it unless a well-supported and stable alternative emerges. At this point it requires minimal maintenance, and I know I can fix it quickly whenever the REST APIs change.
-Tim

@cheginit

Thanks Martin,

The summary of HyRiver is good. As you said, I mostly focus on watershed data. Your package was the reason that I didn't add coverage for all NWIS services and only added the daily data that I needed for my research at the time. I designed the HyRiver project with extensibility in mind. Each of the six packages in this software stack has a specific purpose and can be used as a standalone project or within the other packages. For example, PyGeoOGC and PyGeoUtils are the engines of the project that all the other packages rely on for generating queries and converting responses to dataframes for other web services. These two packages are general and can be used with any geospatial web service.

Regarding coordination for further development, I agree. @emiliom created the hydro_pycommunity repo for this purpose.

I think a good starting point would be establishing a framework for the projects that are within the scope of this collaboration. For example, we could create a repo that provides guidelines for developers starting a new project, such as a categorized list of existing efforts (an awesome-style repo) and the steps for creating a new project. We could come up with a template (maybe using cookiecutter) so that projects meet some minimum requirements for linting, documentation, etc. The README file should include some specific sections, for example: Motivation, Scope, Usage, Installation, Credits, etc.
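As one illustration of the template idea: a cookiecutter project is driven by a cookiecutter.json file of prompts. The keys below are hypothetical, just a sketch of what a shared hydro-project template might ask for:

```json
{
  "project_name": "my-hydro-package",
  "project_slug": "{{ cookiecutter.project_name.lower().replace('-', '_') }}",
  "author_name": "Your Name",
  "open_source_license": ["MIT", "BSD-3-Clause", "Apache-2.0"],
  "use_continuous_integration": "y",
  "use_sphinx_docs": "y"
}
```

The template itself would then supply the pre-configured linting, documentation, and README scaffolding (Motivation, Scope, Usage, Installation, Credits) that such guidelines call for.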

mroberge pinned this issue Mar 19, 2021

lwasser commented Mar 24, 2021

hi @mroberge et al! i just wanted to mention that i just received funding for @pyOpenSci and we will be starting an effort to help subcommunities organize for exactly this purpose. we also have considered needs such as finding other maintainers and such and i'm super open to what exactly the needs are to better support open source python.

I'm a little under the weather this week with what has happened in my town of Boulder, but wonder if there is a way in a few weeks to circle back and check in on whether pyopensci could facilitate helping you and this group build some community around your (and our) tools. I use hydrofunctions in my courses and really appreciate the package and the effort it takes to maintain a package like this.

@mroberge

Hi @lwasser ! Congratulations on getting funding! I've been following earthlab since @mbjoseph contacted me.

I'm interested! @pyOpenSci looks like a great idea and I would be happy to contribute and work with you.

@stijnvanhoey

Thanks @mroberge. With respect to pywaterinfo, it is indeed a (small) Python wrapper around the API used by the Flemish environmental agency to access the data available on https://www.waterinfo.be/Meetreeksen/ (they provide stream data, but also water quality parameters). In terms of maintenance and development, I will quote @thodson-usgs:

I don't plan to abandon it unless a well-supported and stable alternative emerges. At this point it requires minimal maintenance, and I know I can fix it quickly whenever the REST APIs change.

I'm certainly interested in a more community-oriented approach. I do have the impression pywaterinfo is the only package here not oriented toward USGS data, but we can always see at which level some common ground can be found. For example, could we agree on an output (dataframe) format that aligns with the other packages, so users can easily reuse a given analysis on data sets from different sources (waterinfo, USGS, ...)?

As for the documentation/cookiecutter-template/... guidelines described by @cheginit: this is a very useful proposal, but I think we should build on the excellent work @lwasser and @pyOpenSci are already doing instead of defining a new, separate set of guidelines.

Looking forward to further collaboration.


lwasser commented Mar 25, 2021

@mroberge @stijnvanhoey this all sounds great to me! i am guessing we will begin real work in May or June but i'd love to see how PyOS can work with you both and this community. we also have a cookiecutter -- and will be developing better standards in our contributing & dev guide over the next year. i'd love to get this community's input as we develop resources to support exactly this use case!


cheginit commented May 25, 2021

Hi all, I've been invited to give a 15-minute (virtual) talk at the Pangeo Showcase about HyRiver, tomorrow (May 25th) at 4 pm EDT. I am going to talk about the state of the project and future directions. I think it would be a good opportunity to meet and have a discussion. I would be happy to see you there.

Edit: The correct date as Martin mentioned is May 26th, 4 pm EDT.

@mroberge

Looking forward to seeing it! (4pm Wednesday, May 26)


emiliom commented May 26, 2021

Thanks for sharing @cheginit. Unfortunately I have a conflict.

While I'm here: so sorry for not chiming in on this great thread 😞 ! I'll use Taher's ping as motivation for finally following up hopefully by the end of this week. Thanks to @mroberge for starting it, and great seeing everyone's input.

@cheginit

@emiliom Sure, I understand. The recording and a link to the presentation material are here.


aaraney commented Jun 18, 2021

@mroberge this is an interesting issue; it's good to see so much community engagement! @cheginit told me about it and suggested I mention the project I help develop and maintain, HydroTools, at the Office of Water Prediction.

HydroTools is a namespace package, organized like a toolbox, that is designed with data scientists in mind. As such, we've taken a different approach than it appears hydrofunctions has: we have a canonical pandas dataframe format that all of our tools output and comply with, and we enforce opinionated data representations like naive UTC datetimes and heavy use of categorical data types. Currently we have tools for retrieving and caching National Water Model forecasts from Google Cloud, an event-based evaluation metrics package, a package of common hydrologic metrics, and an NWIS IV data service client that implements caching and asynchronous data retrieval and includes data quality flags.

The two motivations for building our nwis_client were (in my opinion; I can't speak for my colleagues) (1) to support large-scale model evaluation activities, where NWIS data at the continental scale is often required, and (2) to enforce UTC time on both inputs and outputs.

Our work does not access any NWIS services other than the instantaneous value service and we don't offer any plotting, quality control, or data resampling methods.
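The two conventions described above, naive UTC datetimes and categorical dtypes, can be sketched in plain pandas. This is an illustration of the idea rather than HydroTools' actual code, and the column name usgs_site_code is assumed:

```python
import pandas as pd

# Timezone-aware timestamps, e.g. as parsed from a response in US Eastern time.
ts = pd.Series(pd.to_datetime(["2021-06-18T08:00:00-04:00",
                               "2021-06-18T09:00:00-04:00"]))

# Convert to UTC, then drop the timezone to get "naive UTC" datetimes.
naive_utc = ts.dt.tz_convert("UTC").dt.tz_localize(None)

# Repetitive string columns (site codes, variable names, flags) compress
# well when stored as pandas categoricals.
sites = pd.Series(["01585200"] * 4 + ["01581500"] * 4, name="usgs_site_code")
sites_cat = sites.astype("category")

print(naive_utc.iloc[0])   # 2021-06-18 12:00:00 (no tz info)
print(sites_cat.dtype)     # category
```

Categoricals store each distinct string once plus small integer codes, which matters when a continental-scale pull repeats the same site codes and flags millions of times.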

@mroberge

Thank you @aaraney ! I've been looking over HydroTools and love so many of its features:

  • the subpackage structure to import what you want
  • the separate _restclient, with built-in caching and async requests
    • Great docstrings! And they look even better in Sphinx with the Furo theme
  • I like the tidy long-form dataframes... and then saving space by making some repetitive strings categorical data.
