Add support for streaming Zarr stores for hosted datasets #4096

Closed · jacobbieker opened this issue Apr 5, 2022 · 11 comments
Labels: enhancement (New feature or request)
@jacobbieker

Is your feature request related to a problem? Please describe.
Lots of geospatial data is stored in the Zarr format. This format works well for n-dimensional data and coordinates, and can have good compression. Unfortunately, HF datasets doesn't support streaming data in the Zarr format, as far as I can tell. Zarr stores are designed to be easily streamed from cloud storage, especially with xarray and fsspec. Since geospatial data tends to be very large, on the order of TBs or tens of TBs for a single dataset, it can be difficult for users to store the dataset locally. Just adding Zarr stores with HF git doesn't work well (see #3823), as Zarr splits the data into lots of small chunks for fast loading, and that doesn't work well with git. I've somewhat gotten around that issue by tarring each Zarr store and uploading them as single files, which seems to be working (see https://huggingface.co/datasets/openclimatefix/gfs-reforecast for example data files, although the script isn't written yet). This does mean that streaming doesn't quite work, though. On the other hand, in https://huggingface.co/datasets/openclimatefix/eumetsat_uk_hrv we stream a Zarr store from a public GCP bucket quite easily.
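For context, streaming such a store is nearly a one-liner with xarray and fsspec (a minimal sketch; the bucket path is a placeholder, and reading gs:// URLs requires gcsfs to be installed):

```python
import xarray as xr

# Only metadata is read here; chunks are fetched lazily on access.
ds = xr.open_zarr("gs://some-public-bucket/eumetsat_uk_hrv.zarr")
print(ds)
```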

Describe the solution you'd like
A way to upload Zarr stores for hosted datasets so that we can stream them with xarray and fsspec.

Describe alternatives you've considered
  • Tarring each Zarr store individually and just extracting them in the dataset script → downside: this is a lot of data that probably doesn't fit locally for many potential users.
  • Pre-preparing examples in a format like Parquet → this would use a lot more storage and give a lot less flexibility; in eumetsat_uk_hrv, we use the one Zarr store for multiple different configurations.

@jacobbieker jacobbieker added the enhancement New feature or request label Apr 5, 2022
@jacobbieker jacobbieker changed the title from "Add support for streaming Zarr stores fomr hosted datasets" to "Add support for streaming Zarr stores for hosted datasets" Apr 5, 2022
@albertvillanova albertvillanova self-assigned this Apr 6, 2022
@albertvillanova
Member

Hi @jacobbieker, thanks for your request and for your study of possible alternatives.

We are very interested in finding a way to make datasets useful to you.

Looking at the Zarr docs, I saw that among its storage alternatives, there is the ZIP file format: https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.ZipStore

This might be convenient for many reasons:

  • On the one hand, we avoid the Git issue with a huge number of small files: the chunk files are packed into a single ZIP file
  • On the other hand, the ZIP file format is especially suited for streaming data because it allows random access to its component files (i.e. it supports random access to its chunks); a minimal sketch follows
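As a minimal sketch of the ZipStore approach (assuming the zarr-python v2 API; the file name and array shape are placeholders):

```python
import numpy as np
import zarr

# Write: every chunk ends up inside a single ZIP archive instead of many small files.
store = zarr.storage.ZipStore("2016_001.zarr.zip", mode="w")
root = zarr.group(store=store)
root.create_dataset("precip", data=np.zeros((24, 512, 512)), chunks=(1, 512, 512))
store.close()  # the archive is only finalized on close

# Read: the ZIP central directory allows random access to individual chunks.
store = zarr.storage.ZipStore("2016_001.zarr.zip", mode="r")
print(zarr.open_group(store=store)["precip"].shape)
```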

Anyway, I think that a Python loading script will be necessary: you need to implement additional logic to select certain chunks (based on date or other criteria).
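A hypothetical sketch of that logic (the function body, variable names, and date criterion are illustrative only, not a tested implementation):

```python
import xarray as xr

def _generate_examples(filepath):
    # Lazily open the zipped Zarr store over HTTP via fsspec's zip:: URL chaining.
    ds = xr.open_dataset("zip:///::" + filepath, engine="zarr", chunks={})
    # Select only certain chunks, e.g. by date.
    ds = ds.sel(time=slice("2016-01-01", "2016-01-02"))
    for key, timestamp in enumerate(ds["time"].values):
        yield key, {"timestamp": str(timestamp)}
```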

Please, let me know if this makes sense to you.

@jacobbieker
Author

Ah okay, I missed the ZIP file option for Zarr; I'll try that with our repos and see if it works! Thanks a lot!

@albertvillanova
Member

Hi @jacobbieker, does the Zarr ZipStore work for your use case?

@jacobbieker
Author

Hi,

Yes, it seems to! I got it working for https://huggingface.co/datasets/openclimatefix/mrms, thanks for the help!

@rabernat

rabernat commented Apr 21, 2022

On behalf of the Zarr developers, let me say THANK YOU for working to support Zarr on HF! 🙏 Zarr is a 100% open-source and community-driven project (fiscally sponsored by NumFOCUS). We see it as an ideal format for ML training datasets, particularly in scientific domains.

I think the solution of zipping the Zarr store is a reasonable way to balance the constraints of Git LFS with the structure of Zarr.

It would be amazing to get something on the Hugging Face Datasets Docs about how to best work with Zarr. Let me know if there's a way I could help with that effort.

@rabernat

Also just noting here that I was able to lazily open @jacobbieker's dataset over the internet from the HF Hub 🚀!

```python
import xarray as xr
url = "https://huggingface.co/datasets/openclimatefix/mrms/resolve/main/data/2016_001.zarr.zip"
zip_url = 'zip:///::' + url
ds = xr.open_dataset(zip_url, engine='zarr', chunks={})
```
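(The `zip:///::` prefix is fsspec URL chaining: the part after `::` is fetched as a remote file, the ZIP archive inside it is exposed as a filesystem, and Zarr can then read individual chunk files over HTTP without downloading the whole archive.)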

[Screenshot: repr of the lazily opened xarray Dataset]

@rabernat

rabernat commented Apr 21, 2022

However, I wasn't able to get streaming working using the datasets API:

```python
from datasets import load_dataset
ds = load_dataset("openclimatefix/mrms", streaming=True, split='train')
item = next(iter(ds))
```
This fails with a `FileNotFoundError`; the printed log and traceback:

```
No config specified, defaulting to: mrms/2021
zip://::https://huggingface.co/datasets/openclimatefix/mrms/resolve/main/data/2016_001.zarr.zip
data/2016_001.zarr.zip
zip://2016_001.zarr.zip::https://huggingface.co/datasets/openclimatefix/mrms/resolve/main/data/2016_001.zarr.zip
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Input In [1], in <cell line: 3>()
      1 from datasets import load_dataset
      2 ds = load_dataset("openclimatefix/mrms", streaming=True, split='train')
----> 3 item = next(iter(ds))

File /opt/miniconda3/envs/hugginface/lib/python3.9/site-packages/datasets/iterable_dataset.py:497, in IterableDataset.__iter__(self)
    496 def __iter__(self):
--> 497     for key, example in self._iter():
    498         if self.features:
    499             # we encode the example for ClassLabel feature types for example
    500             encoded_example = self.features.encode_example(example)

File /opt/miniconda3/envs/hugginface/lib/python3.9/site-packages/datasets/iterable_dataset.py:494, in IterableDataset._iter(self)
    492 else:
    493     ex_iterable = self._ex_iterable
--> 494 yield from ex_iterable

File /opt/miniconda3/envs/hugginface/lib/python3.9/site-packages/datasets/iterable_dataset.py:87, in ExamplesIterable.__iter__(self)
     86 def __iter__(self):
---> 87     yield from self.generate_examples_fn(**self.kwargs)

File ~/.cache/huggingface/modules/datasets_modules/datasets/openclimatefix--mrms/2a6f697014d7eb3caf586ca137d47ca38785ae2fe36248611b021f8248b59936/mrms.py:150, in MRMS._generate_examples(self, filepath, split)
    147 filepath = "https://huggingface.co/datasets/openclimatefix/mrms/resolve/main/data/2016_001.zarr.zip"
    148 # TODO: This method handles input defined in _split_generators to yield (key, example) tuples from the dataset.
    149 # The `key` is for legacy reasons (tfds) and is not important in itself, but must be unique for each example.
--> 150 with zarr.storage.FSStore(fsspec.open("zip::" + filepath, mode='r'), mode='r') as store:
    151     data = xr.open_zarr(store)
    152     for key, row in enumerate(data["time"].values):

File /opt/miniconda3/envs/hugginface/lib/python3.9/site-packages/zarr/storage.py:1120, in FSStore.__init__(self, url, normalize_keys, key_separator, mode, exceptions, dimension_separator, **storage_options)
   1117 import fsspec
   1118 self.normalize_keys = normalize_keys
-> 1120 protocol, _ = fsspec.core.split_protocol(url)
   1121 # set auto_mkdir to True for local file system
   1122 if protocol in (None, "file") and not storage_options.get("auto_mkdir"):

File /opt/miniconda3/envs/hugginface/lib/python3.9/site-packages/fsspec/core.py:514, in split_protocol(urlpath)
    512 def split_protocol(urlpath):
    513     """Return protocol, path pair"""
--> 514     urlpath = stringify_path(urlpath)
    515     if "://" in urlpath:
    516         protocol, path = urlpath.split("://", 1)

File /opt/miniconda3/envs/hugginface/lib/python3.9/site-packages/fsspec/utils.py:315, in stringify_path(filepath)
    313     return filepath
    314 elif hasattr(filepath, "__fspath__"):
--> 315     return filepath.__fspath__()
    316 elif isinstance(filepath, pathlib.Path):
    317     return str(filepath)

File /opt/miniconda3/envs/hugginface/lib/python3.9/site-packages/fsspec/core.py:98, in OpenFile.__fspath__(self)
     96 def __fspath__(self):
     97     # may raise if cannot be resolved to local file
---> 98     return self.open().__fspath__()

File /opt/miniconda3/envs/hugginface/lib/python3.9/site-packages/fsspec/core.py:140, in OpenFile.open(self)
    132 def open(self):
    133     """Materialise this as a real open file without context
    134 
    135     The file should be explicitly closed to avoid enclosed file
   (...)
    138     been deleted; but a with-context is better style.
    139     """
--> 140     out = self.__enter__()
    141     closer = out.close
    142     fobjects = self.fobjects.copy()[:-1]

File /opt/miniconda3/envs/hugginface/lib/python3.9/site-packages/fsspec/core.py:103, in OpenFile.__enter__(self)
    100 def __enter__(self):
    101     mode = self.mode.replace("t", "").replace("b", "") + "b"
--> 103     f = self.fs.open(self.path, mode=mode)
    105     self.fobjects = [f]
    107     if self.compression is not None:

File /opt/miniconda3/envs/hugginface/lib/python3.9/site-packages/fsspec/spec.py:1009, in AbstractFileSystem.open(self, path, mode, block_size, cache_options, compression, **kwargs)
   1007 else:
   1008     ac = kwargs.pop("autocommit", not self._intrans)
-> 1009     f = self._open(
   1010         path,
   1011         mode=mode,
   1012         block_size=block_size,
   1013         autocommit=ac,
   1014         cache_options=cache_options,
   1015         **kwargs,
   1016     )
   1017     if compression is not None:
   1018         from fsspec.compression import compr

File /opt/miniconda3/envs/hugginface/lib/python3.9/site-packages/fsspec/implementations/zip.py:96, in ZipFileSystem._open(self, path, mode, block_size, autocommit, cache_options, **kwargs)
     94 if mode != "rb":
     95     raise NotImplementedError
---> 96 info = self.info(path)
     97 out = self.zip.open(path, "r")
     98 out.size = info["size"]

File /opt/miniconda3/envs/hugginface/lib/python3.9/site-packages/fsspec/archive.py:42, in AbstractArchiveFileSystem.info(self, path, **kwargs)
     40     return self.dir_cache[path + "/"]
     41 else:
---> 42     raise FileNotFoundError(path)

FileNotFoundError:
```

Is this a bug? Or am I just doing it wrong...

@jacobbieker
Author

I'm still messing around with that dataset, so the data might have moved. I currently have each year of MRMS precipitation rate data as its own Zarr, but as they are quite large (on the order of 100 GB each) I'm working to split them into single days, so the files are still being moved around. I was just trying to get a proof of concept working originally.

@jacobbieker
Author

I've mostly finished rearranging the data and uploading some more, so this now works:

```python
import datasets
ds = datasets.load_dataset("openclimatefix/mrms", streaming=True, split="train")
item = next(iter(ds))
print(item.keys())
print(item["timestamp"])
```

The MRMS data now covers most of 2016–2022, with quite a few gaps that I'm working on filling in.

@jacobbieker
Author

Hi @albertvillanova, I noticed there is now HfFileSystem, where the docs show an example of writing a Zarr store directly to the Hub, with no mention of having too many files. Is there still a restriction on having lots of files in datasets? It would be more convenient to keep the geospatial data in one large Zarr store rather than in multiple smaller ones, but I'm happy to continue using zipped Zarrs if that's the recommended way.
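For reference, the kind of direct write I mean (a sketch assuming the fsspec hf:// protocol registered by huggingface_hub; the repo path is a placeholder and I haven't tested this at scale):

```python
import numpy as np
import xarray as xr

# Hypothetical example repo; hf:// resolution is provided by huggingface_hub's
# fsspec integration, so huggingface_hub must be installed and authenticated.
ds = xr.Dataset({"precip": (("time", "y", "x"), np.zeros((4, 64, 64)))})
ds.to_zarr("hf://datasets/openclimatefix/some-repo/data.zarr", mode="w")
```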

@albertvillanova
Member

albertvillanova commented Dec 7, 2023

Hi @jacobbieker.

Thanks for coming back to this pending issue.

In fact, we are now using the fsspec API in our HfFileSystem, which was not the case when you created this issue.
On the other hand, I am not sure about the current limitations, either in the number of files or in loading performance.

  • If I remember correctly, there is a limit on the maximum number of files per directory: 10k

I think it would be best to try a proof of concept again, discuss any issues that arise, and see whether we can fix them on our end (in both datasets and the Hub).
We would really like to support the Zarr format 100% and make the Hub really convenient for your use case, so do not hesitate to report any problem: you can ping me on the Hub as @albertvillanova.
