zarr thoughts #203

martindurant · 2021-09-08T18:44:28Z

This is not an issue as such, but a place for a conversation about zarr as a library for reading referenced datasets and how xarray interprets it. I wasn't sure where to put this; we could discuss it at the next meeting.

zarr requires regular chunking throughout. This is unlike dask-array, where chunks can be arbitrary sizes along each dimension, so that the pieces themselves are still N-D cuboids. This limits the range of datasets that xarray(zarr(references)) can be applied to. Workarounds would be complex, involving subselection from the on-disk chunks that are read and/or reading many on-disk chunks per virtual chunk.
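To make the cost of that workaround concrete, here is a minimal pure-Python sketch along a single axis. It assumes irregular on-disk chunk sizes and a fixed virtual (zarr-style) chunk size, and works out which on-disk chunks each virtual chunk would need and which slice of each. The function name `plan_reads` and its return layout are hypothetical, not part of zarr or any reference-filesystem API.

```python
def plan_reads(disk_sizes, virtual_chunk):
    """Map regular virtual chunks onto irregular on-disk chunks (one axis).

    Returns one list per virtual chunk; each entry is a tuple
    (disk_chunk_index, start_within_disk_chunk, stop_within_disk_chunk).
    """
    # Cumulative boundaries of the on-disk chunks, e.g. [3, 5, 2] -> [0, 3, 8, 10]
    edges = [0]
    for size in disk_sizes:
        edges.append(edges[-1] + size)
    total = edges[-1]

    plans = []
    for vstart in range(0, total, virtual_chunk):
        vstop = min(vstart + virtual_chunk, total)
        pieces = []
        for idx in range(len(disk_sizes)):
            dstart, dstop = edges[idx], edges[idx + 1]
            # Overlap between this virtual chunk and this on-disk chunk
            lo, hi = max(vstart, dstart), min(vstop, dstop)
            if lo < hi:
                pieces.append((idx, lo - dstart, hi - dstart))
        plans.append(pieces)
    return plans


# On-disk chunks of length 3, 5 and 2, exposed as virtual chunks of length 4
plans = plan_reads([3, 5, 2], virtual_chunk=4)
for i, pieces in enumerate(plans):
    print(i, pieces)
```

In this toy case the first virtual chunk needs two on-disk reads, and the second on-disk chunk is read twice with different subselections. In N dimensions the same bookkeeping multiplies across every axis.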
astro coordinates (for example) map poorly to the xarray model:
- you may have a time series composed of a repeated sequence of bandpass filters. Either each bandpass has its own time coordinate, so arithmetic between filters is not possible; or you create a merged time coordinate where most values for any given filter are null; or you pretend that all the timestamps within a given sequence of filters are equal
- astro (and dicom...) represents coordinates analytically instead of as materialised coordinates; multiple images of the same region of sky are typically offset in a mosaic (to build up a larger aggregate image, with overlaps), dither (to have the same sky exposed on multiple pixels to average out sensitivity variations) and jitter (natural movement of the telescope). Furthermore, the instrument commonly has multiple detectors that may have gaps or overlaps.
- what an astronomer (or radiologist) actually wants to do is, for example, "sum the intensity in this geometric area as a function of time" or "find the flux ratio between these two filters for an aperture around this point"
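As a rough illustration of the last two points, the following pure-Python sketch sums the intensity inside a circular aperture for each frame of a time series. The sky position of each pixel is computed analytically from a per-frame offset (standing in for dither or jitter) instead of being read from a stored coordinate array. The helper names, the simple "pixel index plus offset" mapping and the toy data are all assumptions made for the example; a real pipeline would use a proper world-coordinate transformation.

```python
import math


def aperture_sum(frame, offset, centre, radius):
    """Sum the pixels of one 2-D frame that fall inside a circular aperture.

    The sky position of pixel (row, col) is computed on the fly as
    (row + offset[0], col + offset[1]); no coordinate array is materialised.
    """
    total = 0.0
    for r, row in enumerate(frame):
        for c, value in enumerate(row):
            y, x = r + offset[0], c + offset[1]
            if math.hypot(y - centre[0], x - centre[1]) <= radius:
                total += value
    return total


def light_curve(frames, times, offsets, centre, radius):
    """Aperture sum as a function of time, one (time, flux) pair per frame."""
    return [
        (t, aperture_sum(frame, off, centre, radius))
        for frame, t, off in zip(frames, times, offsets)
    ]


# Two 3x3 frames of uniform intensity; the second is dithered by one row.
flat = [[1.0, 1.0, 1.0] for _ in range(3)]
curve = light_curve(
    frames=[flat, flat],
    times=[0.0, 1.0],
    offsets=[(0, 0), (1, 0)],
    centre=(1, 1),
    radius=1.0,
)
print(curve)
```

The same aperture covers a different set of detector pixels in each frame, which is the part that does not map onto a single shared coordinate per dimension.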
xarray assumes various CF conventions, such as scale/offset of data, even though zarr has its own filters for dealing with that kind of thing. Other attributes such as "unit" may also be given special meaning, even though they are generic terms that are not specific to CF.

rabernat · 2021-09-08T19:02:07Z

> zarr requires regular chunking throughout.

This is up for discussion in the zarr v3 spec (zarr-developers/zarr-specs#40)

> astro coordinates (for example) map poorly to the xarray model

I believe that most of these can eventually be solved via custom indexers. Under the hood I assume the raw data (not coordinates) are nd-arrays, no?

> astro (and dicom...) represents coordinates analytically instead of as materialised coordinates

This is a bit similar to a projection on geospatial data. Merging discrete, irregularly sampled satellite images into a global map and then calculating statistics at each point is basically the killer feature of Google Earth Engine. Some stuff in the STAC space shows how we might do it in the FOSS world.

> xarray assumes various CF conventions such as scale/offset of data

These are all optional (decode_cf=False).
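For readers unfamiliar with what that switch controls, here is a minimal pure-Python sketch of the CF scale/offset convention: stored values are unpacked as `raw * scale_factor + add_offset`, and values equal to `_FillValue` become missing. The helper `decode_cf_values` is hypothetical and only mimics the idea; it is not xarray's implementation.

```python
import math


def decode_cf_values(raw, attrs, decode_cf=True):
    """Apply CF-style unpacking to a sequence of stored values.

    With decode_cf=False the stored values are returned unchanged, which is
    the behaviour you would want if zarr filters already handle the packing.
    """
    if not decode_cf:
        return list(raw)
    scale = attrs.get("scale_factor", 1.0)
    offset = attrs.get("add_offset", 0.0)
    fill = attrs.get("_FillValue")
    return [math.nan if v == fill else v * scale + offset for v in raw]


attrs = {"scale_factor": 0.5, "add_offset": 10.0, "_FillValue": -999}
raw = [0, 4, -999]

decoded = decode_cf_values(raw, attrs)                      # unpacked floats, fill -> NaN
untouched = decode_cf_values(raw, attrs, decode_cf=False)   # stored values as-is
print(decoded, untouched)
```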
