zarr thoughts #203

Open
martindurant opened this issue Sep 8, 2021 · 1 comment

Comments

@martindurant
Contributor

This is not an issue, but a place for a conversation about zarr as a library for reading reference'd datasets and about how xarray interprets it. I wasn't sure where to put this; we could discuss it at the next meeting.

  • zarr requires regular chunking throughout. This is unlike dask-array, where chunk sizes can vary arbitrarily along each dimension (each piece is still an ND cuboid). This limits the range of datasets that xarray(zarr(references)) can be applied to. Workarounds would be complex, involving subselection from on-disk chunks after reading them and/or reading many on-disk chunks per virtual chunk
  • astro coordinates (for example) map poorly to the xarray model:
    • you may have a time-series composed of a repeated sequence of bandpass filters. Either each bandpass has its own time coordinate, so arithmetic between filters is not possible; or you create a merged time coordinate where most values for any given filter are null; or you pretend that all the timestamps within a given sequence of filters are equal
    • astro (and dicom...) represents coordinates analytically instead of as materialised coordinates; multiple images of the same region of sky are typically offset in a mosaic (to build up a larger aggregate image, with overlaps), dither (to have the same sky exposed on multiple pixels to average out sensitivity variations) and jitter (natural movement of the telescope). Furthermore, the instrument commonly has multiple detectors that may have gaps or overlaps.
    • what an astronomer (or radiologist) actually wants to do is, for example, "sum the intensity in this geometric area as a function of time" or "find the flux ratio between these two filters for an aperture around this point"
  • xarray assumes various CF conventions such as scale/offset of data, even though zarr has its own filters for dealing with that kind of thing. Other things like "unit" may also have special meaning even though they are more generic terms beyond CF.
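
To make the chunking workaround in the first point concrete, here is a minimal one-dimensional sketch in plain Python. It assumes the on-disk chunk sizes along an axis are known and irregular, and that zarr is presented with regular "virtual" chunks of a fixed size. The function name `map_virtual_chunks` and the sample sizes are hypothetical, not part of zarr or of any reference-file format; the point is only to show that each virtual chunk may need slices from several on-disk chunks.

```python
from itertools import accumulate


def map_virtual_chunks(disk_sizes, virtual_size):
    """For each regular virtual chunk, list the on-disk pieces it needs.

    Each piece is (disk_chunk_index, start, stop), where start/stop are
    offsets *within* that on-disk chunk. Hypothetical helper for illustration.
    """
    # Cumulative boundaries of the irregular on-disk chunks, e.g. [0, 3, 8, 10]
    bounds = [0, *accumulate(disk_sizes)]
    total = bounds[-1]
    plan = []
    for v_start in range(0, total, virtual_size):
        v_stop = min(v_start + virtual_size, total)
        pieces = []
        for i, (d_start, d_stop) in enumerate(zip(bounds, bounds[1:])):
            # Overlap between the virtual chunk and this on-disk chunk
            lo, hi = max(v_start, d_start), min(v_stop, d_stop)
            if lo < hi:
                pieces.append((i, lo - d_start, hi - d_start))
        plan.append(pieces)
    return plan


# On-disk chunks of length 3, 5 and 2, exposed as virtual chunks of length 4
plan = map_virtual_chunks([3, 5, 2], 4)
for n, pieces in enumerate(plan):
    print(n, pieces)
```

In this example the first virtual chunk needs all of on-disk chunk 0 plus the first element of chunk 1, so a single read of one virtual chunk touches two stored chunks and discards most of the second. In N dimensions the same bookkeeping applies per axis, which is where the complexity comes from.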
@rabernat
Contributor

rabernat commented Sep 8, 2021

  • zarr requires regular chunking throughout.

This is up for discussion in the zarr v3 spec (zarr-developers/zarr-specs#40)

  • astro coordinates (for example) map poorly to the xarray model

I believe that most of these can eventually be solved via custom indexers. Under the hood I assume the raw data (not coordinates) are nd-arrays, no?

  • astro (and dicom...) represents coordinates analytically instead of as materialised coordinates

This is a bit similar to a projection on geospatial data. Merging discrete, irregularly sampled satellite images into a global map and then calculating statistics at each point is basically the killer feature of Google Earth Engine. Some work in the STAC space shows how we might do it in the FOSS world.

  • xarray assumes various CF conventions such as scale/offset of data

These are all optional (decode_cf=False).
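
For context, the scale/offset part of CF decoding is simple arithmetic on the stored values, driven by the `scale_factor` and `add_offset` attributes. The sketch below is plain Python, not xarray's implementation; `unpack_cf` and the sample values are hypothetical. With `decode_cf=False` you would see the raw values and the untouched attributes instead.

```python
def unpack_cf(raw, attrs):
    """Apply CF-style unpacking: decoded = raw * scale_factor + add_offset.

    Hypothetical helper; missing attributes fall back to the identity.
    """
    scale = attrs.get("scale_factor", 1.0)
    offset = attrs.get("add_offset", 0.0)
    return [value * scale + offset for value in raw]


# Packed integers as they might be stored, plus CF packing attributes
raw = [0, 10, 20]
attrs = {"scale_factor": 0.5, "add_offset": 100.0}

decoded = unpack_cf(raw, attrs)
print(decoded)
```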
