Xarray Cannot Read `ConsolidateMetadata` Reference Output #675

ranchodeluxe · 2024-01-25T23:31:48Z

Versions:

zarr: '2.16.1'
xarray.: '2023.12.0'
pangeo-forge-recipes: @main

Problem:

Given a WriteCombinedReference beam pipeline like this:

    with pipeline as p:
        (
            p
            | beam.Create(pattern.items())
            | OpenWithKerchunk(file_type=pattern.file_type)
            | WriteCombinedReference()
            | ConsolidateMetadata()
        )

we can only read the reference output with zarr.open_consolidated but not xr.open_dataset(..., consolidated=True)

(Pdb) xr.open_dataset(store, engine='zarr', consolidated=True)
*** KeyError: '.zmetadata'

Offending line on deserialization: https://github.com/zarr-developers/zarr-python/blob/main/zarr/storage.py#L2944

indeed looking at the raw reference.json it doesn't contain a .zmetadata key even though zarr.consolidate_metadata should be writing one. Totaly guess, but might have to do with the version branching here: https://github.com/zarr-developers/zarr-python/blob/main/zarr/convenience.py#L1250-L1263

The text was updated successfully, but these errors were encountered:

abarciauskas-bgse · 2024-01-26T03:55:45Z

I also don't know if this is helpful but I'm a bit confused as well how zarr.consolidate_metadata is intended to or ever was working. I was also looking at convenience.py and noticed that consolidate_metadata returns open_consolidated and open_consolidated uses, by default an r+ mode which has a docstring:

'r+' means
read/write (must exist) although only writes to data are allowed,
changes to metadata including creation of new arrays or group
are not allowed.

Which stuck out to me because this method is intended to add a metadata group, I think.

Overall, I'm not seeing where the .zmetadata key would be written to disk, either as its own file or into the reference json file. But I'm probably missing something?

maxrjones · 2024-01-26T16:54:42Z

@martindurant suggested in fsspec/kerchunk#240 (comment) adding consolidation in MultiZarrtoZarr, if I understand correctly. In general, I haven't seen examples of Zarr Python being used to modify kerchunk reference files, so I'm wondering if that functionality would need to be provided by kerchunk?

rabernat · 2024-01-26T17:02:00Z

I also don't know if this is helpful but I'm a bit confused as well how zarr.consolidate_metadata is intended to or ever was working.

You can have a look at how Xarray does it.
https://github.com/pydata/xarray/blob/ca4f12133e9643c197facd17b54d5040a1bda002/xarray/backends/zarr.py#L660-L661

Calling zarr.consolidate_metadata(store) mutates the store. It lists the metadata documents, builds the consolidated metadata dictionary, and writes it as json to the .zmetadata in the store.

For some thoughts on what belongs in Zarr vs. Kerchunk, I shared my perspective here: fsspec/kerchunk#377 (comment)

maxrjones · 2024-01-26T17:41:37Z

For some thoughts on what belongs in Zarr vs. Kerchunk, I shared my perspective here: fsspec/kerchunk#377 (comment)

It was helpful to read your perspective on where features belong, thanks for sharing. I interpreted "So when it comes time to refactor those prototypes into more stable interfaces" as when Zarr V3 is finalized and the base implementation is available in Zarr Python. Does the concept of a chunk manifest have the potential to apply to the V2 implementation in Zarr Python separately from kerchunk, to guide where the modifying references and metadata would happen if this feature were needed in the very near future?

rabernat · 2024-01-26T17:55:17Z

V2 is not extensible. The lack of extension points was one of the main reasons we needed to create V3.

The lack of extensibility of V2 has driven people to implement all kinds of creative hacks at the Store level. Kerchunk reference filesystem is a perfect example of that. But the limitation of the Store API is that it's just a key / value interface. Stores don't "know" anything about Arrays, Groups, etc. This fact has led to Kerchunk essentially having to re-implement a bunch of logic that exists elsewhere in the stack, e.g. for concatenating arrays.

It should certainly be possible to generate V3 metadata for existing V2 datasets without rewriting any chunks. That's the best path for migrating existing data. I don't think V2 will every be able to retroactively gain something like a chunk manifest.

martindurant · 2024-01-26T18:08:44Z

Before going too far, why would you want to consolidate metadata? In the case of kerchunk, all of the metadata is already separated, and "consolidate" would just duplicate information to no benefit.

Having said that, I suppose the only thing missing from the pipeline is to write the references back. As @rabernat said, they have been changed in memory but not saved. Here is an example of amending kerchunk references - it works fine for chunks or metadata!

when Zarr V3 is finalized

A lot of development has been waiting on this.

Kerchunk essentially having to re-implement a bunch of logic that exists elsewhere in the stack

Well, persisting xarray's internal representation would have been better, but not tractable (by me!). That's assuming xarray-compatible datasets, which does not cover all zarrable datasets.

rabernat · 2024-01-26T19:05:45Z

A lot of development has been waiting on this.

Agreed. The slow and uneven rollout of V3 has really been a drag on our whole community. 😞 Lots of lessons to be learned from this.

Fortunately there is a light at the end of the tunnel, thanks to the work @jhamman has been doing to finally get V3 implemented in Zarr Python.

This was referenced Jan 25, 2024

Passing: GPM MERGIR NASA-IMPACT/veda-pforge-job-runner#17

Closed

WIP: Failing Test Case #676

Closed

norlandrhagen mentioned this issue Jan 26, 2024

ConsolidateMetadata2 #678

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Xarray Cannot Read `ConsolidateMetadata` Reference Output #675

Xarray Cannot Read `ConsolidateMetadata` Reference Output #675

ranchodeluxe commented Jan 25, 2024 •

edited

Loading

abarciauskas-bgse commented Jan 26, 2024

maxrjones commented Jan 26, 2024

rabernat commented Jan 26, 2024 •

edited

Loading

maxrjones commented Jan 26, 2024

rabernat commented Jan 26, 2024

martindurant commented Jan 26, 2024

rabernat commented Jan 26, 2024

Xarray Cannot Read ConsolidateMetadata Reference Output #675

Xarray Cannot Read ConsolidateMetadata Reference Output #675

Comments

ranchodeluxe commented Jan 25, 2024 • edited Loading

Versions:

Problem:

abarciauskas-bgse commented Jan 26, 2024

maxrjones commented Jan 26, 2024

rabernat commented Jan 26, 2024 • edited Loading

maxrjones commented Jan 26, 2024

rabernat commented Jan 26, 2024

martindurant commented Jan 26, 2024

rabernat commented Jan 26, 2024

Xarray Cannot Read `ConsolidateMetadata` Reference Output #675

Xarray Cannot Read `ConsolidateMetadata` Reference Output #675

ranchodeluxe commented Jan 25, 2024 •

edited

Loading

rabernat commented Jan 26, 2024 •

edited

Loading