[bug] appending to zarr in object store fails #11

Closed

jhamman opened this issue Aug 17, 2020 · 7 comments

jhamman (Collaborator) commented Aug 17, 2020

I've been trying out the http_xarray_zarr pipeline we recently added to this repository. I think I've run into a bug (or an invalid assumption) related to rapid appends to Zarr stores backed by object storage. Here's a simple example:

import xarray as xr

from zarr.storage import MemoryStore
import gcsfs

fs = gcsfs.GCSFileSystem()

ds = xr.tutorial.load_dataset('rasm')

# create two MutableMapping stores
stores = [
    ('memory', MemoryStore()),
    ('gc', fs.get_mapper('carbonplan-scratch/test.zarr', create=True))
]

# write one year of data at a time to each store.
for name, store in stores:
    for i, (year, part) in enumerate(ds.groupby('time.year')):
        if i > 0:
            append_dim = 'time'
            mode = 'a'
        else:
            append_dim = None
            mode = 'w'
        part.to_zarr(store, append_dim=append_dim, mode=mode)

# check that the two stores are round-trip-copies of the original
for name, store in stores:
    actual = xr.open_zarr(store)
    try:
        xr.testing.assert_equal(actual, ds)
    except AssertionError:
        print(name, 'failed!')
        raise

This raises for the GCS store:

gc failed!
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-4-0491bf399a91> in <module>
      2     actual = xr.open_zarr(store)
      3     try:
----> 4         xr.testing.assert_equal(actual, ds)
      5     except AssertionError:
      6         print(name, 'failed!')

    [... skipping hidden 1 frame]

AssertionError: Left and right Dataset objects are not equal
Differing dimensions:
    (time: 4, x: 275, y: 205) != (time: 36, x: 275, y: 205)
Differing coordinates:
L * time     (time) object 1980-09-16 12:00:00 ... 1980-12-17 00:00:00
R * time     (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
Differing data variables:
L   Tair     (time, y, x) float64 dask.array<chunksize=(2, 103, 275), meta=np.ndarray>
R   Tair     (time, y, x) float64 nan nan nan nan nan ... 29.8 28.66 28.19 28.21

You'll note the dimensions of the two datasets differ along the time axis (4 vs. 36). I first noticed this behavior when using the Prefect Flow mentioned above to build a Zarr store incrementally in GCS. I'm leaning toward this being an issue with object-store or filesystem-mapper consistency, but I'm not sure how to diagnose that exactly. I'm hoping @rabernat or @martindurant can provide some guidance here.
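
One quick check (a hypothetical diagnostic, reusing the stores list from the example above) would be to read the time coordinate's Zarr metadata straight out of each mapper and compare the shapes they record:

import json

# Hypothetical diagnostic: pull the raw Zarr metadata for the time
# coordinate from each mapper and print the shape it records.
for name, store in stores:
    zarray = json.loads(store['time/.zarray'])
    print(name, 'time/.zarray shape:', zarray['shape'])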

martindurant (Contributor)

I do see the same thing, even after adding .invalidate_cache() calls and a sleep between iterations. Although chunk files are being added under the time coordinate, time/.zarray retains a shape of (4,). Note that GCS supports "generations" of the contents of a given key, i.e., even if you don't want to keep previous values, previous versions of a file may persist for a while after an update. Perhaps a zarr-specific mapper like zarr-developers/zarr-python#546 should cache known .z* keys in memory rather than relying on the backend?
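
A minimal sketch of that caching idea (class name hypothetical; this is an illustration of the behavior, not the API proposed in zarr-developers/zarr-python#546):

from collections.abc import MutableMapping

class MetadataCachingStore(MutableMapping):
    """Hypothetical wrapper: keep .z* metadata keys in memory so reads
    never depend on the (possibly stale) object store."""

    def __init__(self, base):
        self.base = base
        self._meta = {}  # in-memory copies of .zarray/.zattrs/.zgroup/.zmetadata

    @staticmethod
    def _is_meta(key):
        return key.rsplit('/', 1)[-1].startswith('.z')

    def __getitem__(self, key):
        if self._is_meta(key) and key in self._meta:
            return self._meta[key]
        return self.base[key]

    def __setitem__(self, key, value):
        if self._is_meta(key):
            self._meta[key] = value
        self.base[key] = value

    def __delitem__(self, key):
        self._meta.pop(key, None)
        del self.base[key]

    def __iter__(self):
        return iter(self.base)

    def __len__(self):
        return len(self.base)

Wrapping the gcsfs mapper in something like this before passing it to to_zarr would mean metadata reads always see the last value written by this process, though it does nothing for readers in other processes.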

rabernat (Contributor)

I would say this bug is consistent with my experience, but I was never able to nail down the problem precisely.

I also have a counter-example in which append did work with gcsfs: https://github.com/pangeo-forge/pangeo-smithy/blob/0c31241198db764ffc3b2979ecd2413b26cba5ad/pipeline.py

jhamman (Collaborator, Author) commented Aug 19, 2020

I found this doc useful when exploring whether this could be a consistency issue: https://cloud.google.com/storage/docs/consistency

tl;dr: GCS lists read-after-write among its strongly consistent operations.

rabernat (Contributor)

Am I correct in thinking that the main problem is the .zarray file update? If so, one way forward would be to bypass xarray/zarr and just write a script that sequentially updates a JSON file in GCS, as sketched below. If that still breaks, try taking gcsfs out of the picture.
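
Something like the following could serve as that probe (a sketch; the bucket path is a placeholder, and the gcsfs caching behavior is the thing under test):

import json
import time

import gcsfs

fs = gcsfs.GCSFileSystem()
path = 'carbonplan-scratch/consistency-test.json'  # placeholder path

# Sequentially overwrite a small JSON file and immediately read it back,
# reporting any iteration where the read does not match the write.
for i in range(20):
    expected = json.dumps({'shape': [i]}).encode()
    fs.pipe(path, expected)
    fs.invalidate_cache()
    actual = fs.cat(path)
    if actual != expected:
        print(f'iteration {i}: wrote {expected!r}, read back {actual!r}')
    time.sleep(1)

If this loop misbehaves, repeating the reads via the raw HTTP endpoints would take gcsfs out of the picture.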

martindurant (Contributor) commented Aug 20, 2020

I have been trying to update a key in place, and I find that immediate read-after-write consistency does not hold: you can get previous value(s) back from either the public HTTP link or the www.googleapis.com/download/storage endpoint for a long time. I even saw things like:

In [51]: gcs.pipe('mdtemp/zmetadata', b'hel')
https://www.googleapis.com/upload/storage/v1/b/mdtemp/o None b'--==0==\nContent-Type: application/json; charset=UTF-8\n\n{"name": "zmetadata"}\n--==0==\nContent-Type: application/octet-stream\n\nhel\n--==0==--' {'Content-Type': 'multipart/related; boundary="==0=="', 'User-Agent': 'python-gcsfs/0.6.2+11.gae121f7.dirty', 'authorization': ''} {'uploadType': 'multipart'}

In [52]: requests.get("https://storage.googleapis.com/mdtemp/zmetadata").content
Out[52]: b'hello'

In [53]: requests.get("https://storage.googleapis.com/mdtemp/zmetadata").content
Out[53]: b'hello2'

In [54]: requests.get("https://storage.googleapis.com/mdtemp/zmetadata").content
Out[54]: b'hello'

(i.e., never the most recent value, and not even consistent from one read to the next; this remained true for minutes afterward, even though the byte count shown in the storage console had updated)

One possible mitigation: when uploading a file, the API returns a generation ID, which should be unique. One can request contents by a specific generation ID, and/or specify that an update should only succeed against a generation-match precondition. How to code this into gcsfs?
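
To make the mechanism concrete, here is a rough sketch against the raw GCS JSON API (bucket, object name, and token are placeholders; none of this is exposed by gcsfs today):

import requests
from urllib.parse import quote

BUCKET = 'mdtemp'   # placeholder bucket
NAME = 'zmetadata'  # placeholder object name
headers = {'Authorization': 'Bearer <access-token>'}  # a valid OAuth2 token

# Upload; the response includes the new object's unique generation ID.
resp = requests.post(
    f'https://storage.googleapis.com/upload/storage/v1/b/{BUCKET}/o',
    params={'uploadType': 'media', 'name': NAME},
    headers=headers,
    data=b'hello',
)
generation = resp.json()['generation']

# Read back exactly that generation, sidestepping any stale copies.
resp = requests.get(
    f'https://storage.googleapis.com/storage/v1/b/{BUCKET}/o/{quote(NAME, safe="")}',
    params={'alt': 'media', 'generation': generation},
    headers=headers,
)
assert resp.content == b'hello'

# Make the next update conditional on the generation we just saw; GCS
# rejects the write with HTTP 412 if the object changed in the meantime.
resp = requests.post(
    f'https://storage.googleapis.com/upload/storage/v1/b/{BUCKET}/o',
    params={'uploadType': 'media', 'name': NAME, 'ifGenerationMatch': generation},
    headers=headers,
    data=b'hello2',
)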

martindurant (Contributor)

It is also true that the directory listing can become out of date, which would give you the wrong file size when opening a file, but this is not an issue when you want to read the whole file (which is the Zarr case).

rabernat (Contributor)

This is resolved by #27, which no longer uses appending.
