
xarray and rioxarray compatibility #50

Open

scottyhq opened this issue May 7, 2021 · 8 comments

@scottyhq
Contributor

scottyhq commented May 7, 2021

xarray 0.18 and rioxarray 0.4 were just released (http://xarray.pydata.org/en/stable/whats-new.html#v0-18-0-6-may-2021), and they include some important changes to backend configuration. Specifically, rioxarray.open_rasterio() is now available as an xarray backend, so you can load datasets like so:

xds = xarray.open_dataset("my.tif", engine="rasterio")

I actually haven't looked closely at the stackstac code yet to know how the dataset-opening logic currently compares between these libraries... but it would be great to add some docs on integrating with rioxarray. One nice use case is saving a small netCDF time-series subset to use locally in QGIS.
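
For example (a minimal sketch, assuming rioxarray >= 0.4 is installed so that the "rasterio" engine is registered; the file name and subset indices are placeholders):

import xarray as xr

# Open a raster through rioxarray's backend entrypoint.
xds = xr.open_dataset("my.tif", engine="rasterio")

# Take a small spatial subset and save it as netCDF for offline use in QGIS.
subset = xds.isel(x=slice(0, 256), y=slice(0, 256))
subset.to_netcdf("subset.nc")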

Also, I'm wondering if it's possible to relax (^0) or bump (^0.18.0) the current xarray pin for the next release? We could add some matrix tests against xarray master to ensure things don't break, since minor bumps in xarray currently seem more like major bumps ;) (#26).

xarray = "^0.17.0"

@gjoseph92
Owner

I separated the xarray version conversation into #51, since I think we can close that quickly, whereas rioxarray compatibility deserves more discussion.

tl;dr: we can and should integrate with rioxarray right now, but I'd prefer to see the functionality you actually want from rioxarray split into a separate, IO-agnostic library, and a simple metadata convention tying them together.

I doubt stackstac will ever actually use rioxarray internally for IO, since they function at very different levels and have different goals. rioxarray aims to produce a DataArray/Dataset just from a GeoTIFF, etc. file, so the metadata comes from opening the file itself. stackstac's whole goal is to never open the files themselves, and get all metadata from STAC.

So I think rioxarray compatibility for stackstac would simply mean following rioxarray's CRS data model (a grid_mapping attr referencing the name of a grid mapping coordinate, which is itself a DataArray containing a spatial_ref or crs_wkt attr).
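
Concretely, that layout looks roughly like this (an illustrative sketch rather than stackstac's actual code; the EPSG code and array contents are placeholders):

import numpy as np
import xarray as xr
from pyproj import CRS

crs = CRS.from_epsg(32633)  # placeholder CRS

# The data variable's grid_mapping attr names the grid mapping coordinate...
da = xr.DataArray(
    np.zeros((2, 2)),
    dims=("y", "x"),
    attrs={"grid_mapping": "spatial_ref"},
)

# ...and that coordinate is a scalar DataArray whose attrs carry the CRS as WKT.
da = da.assign_coords(
    spatial_ref=xr.DataArray(
        0, attrs={"spatial_ref": crs.to_wkt(), "crs_wkt": crs.to_wkt()}
    )
)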

We can certainly do this, though personally I think it's a bit cumbersome for users. I suspect that most people (at least STAC users) would prefer to not care about the details of CF conventions. Plus, the STAC Projection extension and CF models represent similar information, but are structured differently, and it would be nice if neither had to impose its structure on users of the other. As I mentioned in the meeting a couple months ago (pangeo-data/cog-best-practices#4 (comment)), I agree that it's important to be able to read and write data that follows CF conventions. However, I don't know that users actually want to work with CRSs according to CF conventions in xarray. If the CRS was represented more simply, yet still could be read from and written to a netCDF file without information loss, would users be happy?

I think the takeaway of the meeting a couple months ago was:

We're broadly in agreement that it'd be great if libraries doing I/O (rioxarray, a hypothetical library that has the I/O component of stackstac) exposed geospatial information consistently. Then downstream analysis / visualization libraries can rely on this information, regardless of how the data was loaded.

Basically, I think that as a common interchange interface between libraries, the rioxarray CRS model is more complex and netCDF-specific than we'd ideally want. I don't know though if it's necessary to capture the full range of complexity of CRSs—I'd just love if something simpler could work.

Eventually, I think the answer here will be custom indexes in xarray (pydata/xarray#4979 pydata/xarray#1603 cc @dcherian).

Another thing is that rioxarray isn't as focused on dask for some operations (corteva/rioxarray#119). Since stackstac is entirely dask-oriented, I'm hesitant to encourage users towards a library that won't work as well on their data. I wrote stackstac.reproject_array for this reason. I know @djhoese has been thinking about this with geoxarray/geoxarray#16 / pyresample. I'm curious if geoxarray could become the right namespace to put this IO-agnostic functionality in, and rioxarray could remain rasterio-oriented just for IO? Or maybe rioxarray takes this on and gets more first-class dask support?

All that said though—for now, we might as well use a crs attribute (I'm more reluctant on the grid_mapping stuff) so that the rio accessor can understand stackstac's outputs.

But perhaps this issue can be a place to start discussing that common CRS data model, and where we'd want operations on it to live?

cc @snowman2 @TomAugspurger

@djhoese

djhoese commented May 7, 2021

Regarding indexers, also see xarray-contrib/xoak#14 where the xoak devs and I discussed pyresample, geoxarray, xoak, and xarray's new base Indexer class.

Putting this kind of stuff in geoxarray (or a library like it) is what I had hoped it would become. When I first created the repository I had all these ideas of what it could be and how it would work; then I learned that a lot of the stuff I was hoping to depend on (persistence in xarray accessors and/or attributes) just isn't how things work. I basically stopped trying to develop geoxarray and waited and watched to see how this community developed what it needed, rather than coming up with a solution for something that wasn't being asked for. Based on this discussion and others that have come up more and more, I think we're at a point where we (xoak, xarray, rioxarray, pyresample/satpy, stackstac, etc.) are asking for a solution like geoxarray.

I suspect that most people (at least STAC users) would prefer to not care about the details of CF conventions.

I agree that most people don't want to be limited by CF conventions, especially when they become cumbersome in situations like grid_mapping attributes, etc.

However, I don't know that users actually want to work with CRSs according to CF conventions in xarray. If the CRS was represented more simply, yet still could be read from and written to a netCDF file without information loss, would users be happy?

I think this is the key question. I think @snowman2 and the rioxarray project have figured out a good chunk of the "gotchas" involved with how this information can/should be represented in xarray objects. I personally wish CF conventions didn't dictate what in-memory/xarray representation looked like. I think a library (perhaps geoxarray) is needed that defines the "preferred" way to represent these properties in xarray objects, provides conversions to and from the various common schemes (CF, lon/lat coordinate arrays, etc), and likely provides wrappers or convenience functions around other functionality related to CRS-aware usage (resampling, indexing, etc).
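
As a small illustration of the kind of conversion such a library could wrap, pyproj already round-trips between CF grid_mapping attributes and a CRS object (available since pyproj 2.2; the EPSG code here is just an example):

from pyproj import CRS

crs = CRS.from_epsg(4326)
cf_attrs = crs.to_cf()            # dict of CF grid_mapping attributes (incl. crs_wkt)
same_crs = CRS.from_cf(cf_attrs)  # rebuild an equivalent CRS from those attributes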

Maybe we need to define the best-case interface we can think of for these types of usages? For example, assuming xarray were completely geospatial-focused, knew everything about coordinate reference systems, and didn't necessarily need backwards compatibility, how would someone working with GeoTIFFs or STAC geo data use it?

my_data_arr.sel(y=..., x=..., crs=some_crs)
resampled_data_arr = my_data_arr.resample(crs, y, x)
my_data_arr.crs.semi_major_radius

@snowman2

snowman2 commented May 8, 2021

stackstac's whole goal is to never open the files themselves, and get all metadata from STAC.

This is very similar to how opendatacube works, though they don't currently use STAC. Someone on the call a while back mentioned they have thought about using STAC. Might be worth connecting with them.

stackstac would simply mean following rioxarray's CRS data model

I think this would be a great path forward, as it would bring consistency to the CRS storage mechanism across non-xarray communities as well, including the GDAL stack (rasterio/QGIS/etc.). Had a similar discussion about this in opendatacube a while back ref.

If the CRS was represented more simply, yet still could be read from and written to a netCDF file without information loss

Here are some things to consider about this:

  • There are many ways to lose information in xarray, and storing it as a coordinate has historically been the way to keep it around, though the Index might be another alternative.
  • Information can be lost when exporting to a netCDF file and reading it with other software. Currently xarray.open_rasterio stores the CRS as a string in the crs attribute ref. However, since that is a non-standard location, the CRS cannot be determined by other software (e.g. GDAL/rasterio/QGIS/ArcGIS) after the dataset is written with to_netcdf.

We can certainly do this, though personally I think it's a bit cumbersome for users.

This is something you could handle with a method/function so that users don't have to worry about the details, similar to rio.write_crs.
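
For reference, the rioxarray version of that call looks roughly like this (a minimal sketch; the tiny DataArray and EPSG code are placeholders standing in for stackstac's output):

import numpy as np
import rioxarray  # noqa: F401 -- registers the .rio accessor on xarray objects
import xarray as xr

# Stand-in for a stacked DataArray with y/x dimensions (e.g. stackstac output).
da = xr.DataArray(np.zeros((2, 2)), dims=("y", "x"))

da = da.rio.write_crs("EPSG:32633")  # stores the CRS on a spatial_ref coordinate
print(da.rio.crs)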

I wrote stackstac.reproject_array for this reason.

Nice! Might be worth collaborating with the opendatacube group on this as well, since they have some code that uses dask/rasterio to reproject ref.

Or maybe rioxarray takes this on and gets more first-class dask support?

I would definitely like to see dask support improved in rioxarray. Here are some things to consider:

  • Reprojection with dask is something that is likely going to get complicated ref. There are currently 14 resampling methods in GDAL ref, and they will likely add more in the future.

  • Having it in a separate library would let users who aren't necessarily interested in using rioxarray, such as stackstac/datacube, leverage this without being tied to rioxarray ref.

  • One thing to keep in mind is that rioxarray has rasterio as a dependency. That means that to use any part of it, you need GDAL. geoxarray/pyresample sounds like a good no-GDAL dask reprojection option, as they are already working towards that, as you mentioned ref.

@djhoese

djhoese commented May 8, 2021

@snowman2 is there anything in rioxarray's method for storing and handling CRS information that you would change if backwards compatibility didn't matter? If I were to port this functionality to geoxarray, where would you suggest I start looking in rioxarray?

Side question: does anyone know if entrypoints (like xarray now uses for engines) have performance penalties as the environment grows? I know using pkg_resources can be slow the more packages are installed, but I'm not sure about entrypoints.

@snowman2

snowman2 commented May 8, 2021

is there anything in rioxarray's method for storing and handling of CRS information that you would change if backwards compatibility didn't matter?

Can't think of anything at the moment that I would change. I would recommend looking at the crs and write_crs methods for implementation details.

@RichardScottOZ
Contributor

ODC and STAC:

https://github.com/opendatacube/datacube-stac-example

So they are working on things - DEA has a STAC catalogue, etc.

@RichardScottOZ
Contributor

import satsearch

# bbox is a hypothetical [min_lon, min_lat, max_lon, max_lat] area of interest.
bbox = [149.0, -36.0, 150.0, -35.0]

stac_items = satsearch.Search(
    url="https://explorer.sandbox.dea.ga.gov.au/stac/",
    bbox=bbox,
    collections=["s1_gamma0_geotif_scene"],
    datetime="2000-02-01/2021-03-31",
).items()

@sharkinsspatial
Contributor

I had posted an issue around some of this discussion on an internal Development Seed repo but I realized that it might be worthwhile to share. I copied it over to this gist and got some great comments and thoughts from @djhoese and @geospatial-jeff around their development plans for the future that might be relevant. Not sure if this thread is the best spot to continue the discussions around this? I know @scottyhq had a more general thread on this topic but I seem to have misplaced it.
