[bug] appending to zarr in object store fails #11

Closed

jhamman opened this issue Aug 17, 2020 · 7 comments

jhamman (Collaborator) commented Aug 17, 2020

I've been trying out the http_xarray_zarr pipeline we recently added to this repository. I think I've run into a bug (or an invalid assumption) related to rapid appends to Zarr stores backed by object storage. Here's a simple example:

import xarray as xr

from zarr.storage import MemoryStore
import gcsfs

fs = gcsfs.GCSFileSystem()

ds = xr.tutorial.load_dataset('rasm')

# create two MutableMapping stores
stores = [
    ('memory', MemoryStore()),
    ('gc', fs.get_mapper('carbonplan-scratch/test.zarr', create=True))
]

# write one year of data at a time to each store.
for name, store in stores:
    for i, (year, part) in enumerate(ds.groupby('time.year')):
        if i > 0:
            append_dim = 'time'
            mode = 'a'
        else:
            append_dim = None
            mode = 'w'
        part.to_zarr(store, append_dim=append_dim, mode=mode)

# check that the two stores are round-trip-copies of the original
for name, store in stores:
    actual = xr.open_zarr(store)
    try:
        xr.testing.assert_equal(actual, ds)
    except AssertionError:
        print(name, 'failed!')
        raise

This raises for the GCS store:

gc failed!
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-4-0491bf399a91> in <module>
      2     actual = xr.open_zarr(store)
      3     try:
----> 4         xr.testing.assert_equal(actual, ds)
      5     except AssertionError:
      6         print(name, 'failed!')

    [... skipping hidden 1 frame]

AssertionError: Left and right Dataset objects are not equal
Differing dimensions:
    (time: 4, x: 275, y: 205) != (time: 36, x: 275, y: 205)
Differing coordinates:
L * time     (time) object 1980-09-16 12:00:00 ... 1980-12-17 00:00:00
R * time     (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
Differing data variables:
L   Tair     (time, y, x) float64 dask.array<chunksize=(2, 103, 275), meta=np.ndarray>
R   Tair     (time, y, x) float64 nan nan nan nan nan ... 29.8 28.66 28.19 28.21

You'll note the dimensions of the two datasets differ along the time axis (4 vs. 36). I first noticed this behavior when using the Prefect Flow mentioned above to build a Zarr store incrementally in GCS. I'm leaning toward this being an issue with object-store or filesystem-mapper consistency, but I'm not sure how to diagnose that exactly. I'm hoping @rabernat or @martindurant can provide some guidance here.
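
One quick check (a hypothetical diagnostic, reusing the stores list from the example above) would be to read the time coordinate's Zarr metadata straight out of each mapper and compare the shapes they record:

import json

# Hypothetical diagnostic: pull the raw Zarr metadata for the time
# coordinate from each mapper and print the shape it records.
for name, store in stores:
    zarray = json.loads(store['time/.zarray'])
    print(name, 'time/.zarray shape:', zarray['shape'])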

martindurant (Contributor)

I do see the same thing, even after adding .invalidate_cache() calls and a sleep between iterations. Although chunk files are being added under the time coordinate, time/.zarray retains a shape of (4,). Note that GCS supports "generations" of the contents of a given key, i.e., even if you don't want to keep previous values, previous versions of a file may persist for a while after an update. Perhaps a zarr-specific mapper like zarr-developers/zarr-python#546 should cache known .z* keys in memory rather than relying on the backend?
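
A minimal sketch of that caching idea (class name hypothetical; this is an illustration of the behavior, not the API proposed in zarr-developers/zarr-python#546):

from collections.abc import MutableMapping

class MetadataCachingStore(MutableMapping):
    """Hypothetical wrapper: keep .z* metadata keys in memory so reads
    never depend on the (possibly stale) object store."""

    def __init__(self, base):
        self.base = base
        self._meta = {}  # in-memory copies of .zarray/.zattrs/.zgroup/.zmetadata

    @staticmethod
    def _is_meta(key):
        return key.rsplit('/', 1)[-1].startswith('.z')

    def __getitem__(self, key):
        if self._is_meta(key) and key in self._meta:
            return self._meta[key]
        return self.base[key]

    def __setitem__(self, key, value):
        if self._is_meta(key):
            self._meta[key] = value
        self.base[key] = value

    def __delitem__(self, key):
        self._meta.pop(key, None)
        del self.base[key]

    def __iter__(self):
        return iter(self.base)

    def __len__(self):
        return len(self.base)

Wrapping the gcsfs mapper in something like this before passing it to to_zarr would mean metadata reads always see the last value written by this process, though it does nothing for readers in other processes.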

rabernat (Contributor)

I would say this bug is consistent with my experience, but I was never able to nail down the problem precisely.

I also have a counter-example in which append did work with gcsfs: https://github.com/pangeo-forge/pangeo-smithy/blob/0c31241198db764ffc3b2979ecd2413b26cba5ad/pipeline.py

jhamman (Collaborator, Author) commented Aug 19, 2020

I found this doc useful when exploring whether this could be a consistency issue: https://cloud.google.com/storage/docs/consistency

tl;dr: GCS lists read-after-write among its strongly consistent operations.

rabernat (Contributor)

Am I correct in thinking that the main problem is the .zarray file update? If so, one way forward would be to bypass xarray/zarr and just write a script that sequentially updates a JSON file in GCS, as sketched below. If that still breaks, try taking gcsfs out of the picture.
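
Something like the following could serve as that probe (a sketch; the bucket path is a placeholder, and the gcsfs caching behavior is the thing under test):

import json
import time

import gcsfs

fs = gcsfs.GCSFileSystem()
path = 'carbonplan-scratch/consistency-test.json'  # placeholder path

# Sequentially overwrite a small JSON file and immediately read it back,
# reporting any iteration where the read does not match the write.
for i in range(20):
    expected = json.dumps({'shape': [i]}).encode()
    fs.pipe(path, expected)
    fs.invalidate_cache()
    actual = fs.cat(path)
    if actual != expected:
        print(f'iteration {i}: wrote {expected!r}, read back {actual!r}')
    time.sleep(1)

If this loop misbehaves, repeating the reads via the raw HTTP endpoints would take gcsfs out of the picture.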

martindurant (Contributor) commented Aug 20, 2020

I have been trying to update a key in place, and I find that immediate read-after-write consistency does not hold: you can get previous value(s) back from either the public HTTP link or the www.googleapis.com/download/storage endpoint for a long time. I even saw things like:

In [51]: gcs.pipe('mdtemp/zmetadata', b'hel')
https://www.googleapis.com/upload/storage/v1/b/mdtemp/o None b'--==0==\nContent-Type: application/json; charset=UTF-8\n\n{"name": "zmetadata"}\n--==0==\nContent-Type: application/octet-stream\n\nhel\n--==0==--' {'Content-Type': 'multipart/related; boundary="==0=="', 'User-Agent': 'python-gcsfs/0.6.2+11.gae121f7.dirty', 'authorization': ''} {'uploadType': 'multipart'}

In [52]: requests.get("https://storage.googleapis.com/mdtemp/zmetadata").content
Out[52]: b'hello'

In [53]: requests.get("https://storage.googleapis.com/mdtemp/zmetadata").content
Out[53]: b'hello2'

In [54]: requests.get("https://storage.googleapis.com/mdtemp/zmetadata").content
Out[54]: b'hello'

(i.e., never the most recent value, and not even consistent from one read to the next; this remained true for minutes afterward, even though the byte count shown in the storage console had updated)

One possible mitigation: when uploading a file, the API returns a generation ID, which should be unique. One can request contents by a specific generation ID, and/or specify that an update should only succeed against a generation-match precondition. How to code this into gcsfs?
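
To make the mechanism concrete, here is a rough sketch against the raw GCS JSON API (bucket, object name, and token are placeholders; none of this is exposed by gcsfs today):

import requests
from urllib.parse import quote

BUCKET = 'mdtemp'   # placeholder bucket
NAME = 'zmetadata'  # placeholder object name
headers = {'Authorization': 'Bearer <access-token>'}  # a valid OAuth2 token

# Upload; the response includes the new object's unique generation ID.
resp = requests.post(
    f'https://storage.googleapis.com/upload/storage/v1/b/{BUCKET}/o',
    params={'uploadType': 'media', 'name': NAME},
    headers=headers,
    data=b'hello',
)
generation = resp.json()['generation']

# Read back exactly that generation, sidestepping any stale copies.
resp = requests.get(
    f'https://storage.googleapis.com/storage/v1/b/{BUCKET}/o/{quote(NAME, safe="")}',
    params={'alt': 'media', 'generation': generation},
    headers=headers,
)
assert resp.content == b'hello'

# Make the next update conditional on the generation we just saw; GCS
# rejects the write with HTTP 412 if the object changed in the meantime.
resp = requests.post(
    f'https://storage.googleapis.com/upload/storage/v1/b/{BUCKET}/o',
    params={'uploadType': 'media', 'name': NAME, 'ifGenerationMatch': generation},
    headers=headers,
    data=b'hello2',
)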

martindurant (Contributor)

It is also true that the directory listing can become out of date, which would give you the wrong file size when opening a file, but this is not an issue when you want to read the whole file (which is the Zarr case).

rabernat (Contributor)

This is resolved by #27, which no longer uses appending.
