Add support for streaming Zarr stores for hosted datasets #4096
Comments
Hi @jacobbieker, thanks for your request and study of possible alternatives. We are very interested in finding a way to make this work. Looking at the Zarr docs, I saw that among its storage alternatives there is the ZIP file format: https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.ZipStore This might be convenient for several reasons.
Anyway, I think that a Python loading script will be necessary: you need to implement additional logic to select certain chunks (based on date or other criteria). Please let me know if this makes sense to you.
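As an illustration of the ZipStore idea, here is a minimal sketch using zarr-python v2 (the array name, shape, and chunking are made up for the example):

```python
import numpy as np
import zarr

# Write the whole store into one .zarr.zip file instead of thousands of
# small chunk files, which plays much better with Git LFS.
store = zarr.ZipStore("example.zarr.zip", mode="w")
root = zarr.group(store=store)
root.create_dataset(
    "precipitation_rate",           # array name is illustrative
    data=np.random.rand(24, 256, 256),
    chunks=(1, 256, 256),
)
store.close()  # the ZipStore must be closed to finalize the archive
```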
Ah okay, I missed the option of ZIP files for Zarr; I'll try that with our repos and see if it works! Thanks a lot!
Hi @jacobbieker, does the Zarr ZipStore work for your use case?
Hi, yes, it seems to! I got it working for https://huggingface.co/datasets/openclimatefix/mrms. Thanks for the help!
On behalf of the Zarr developers, let me say THANK YOU for working to support Zarr on HF! 🙏 Zarr is a 100% open-source and community-driven project (fiscally sponsored by NumFOCUS). We see it as an ideal format for ML training datasets, particularly in scientific domains. I think the solution of zipping the Zarr store is a reasonable way to balance the constraints of Git LFS with the structure of Zarr. It would be amazing to get something on the Hugging Face Datasets docs about how to best work with Zarr. Let me know if there's a way I could help with that effort.
Also just noting here that I was able to lazily open @jacobbieker's dataset over the internet from HF hub 🚀!

```python
import xarray as xr

url = "https://huggingface.co/datasets/openclimatefix/mrms/resolve/main/data/2016_001.zarr.zip"
zip_url = 'zip:///::' + url
ds = xr.open_dataset(zip_url, engine='zarr', chunks={})
```
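(For reference: fsspec treats `zip:///::https://…` as a chained URL, meaning it accesses the remote file over HTTPS and then exposes its contents as a ZIP filesystem, which is what lets xarray open the zipped Zarr store lazily instead of downloading everything up front.)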
However, I wasn't able to get streaming working using the Datasets API:

```python
from datasets import load_dataset

ds = load_dataset("openclimatefix/mrms", streaming=True, split='train')
item = next(iter(ds))
```

This raises a `FileNotFoundError` (traceback omitted). Is this a bug? Or am I just doing it wrong...
I'm still messing around with that dataset, so the data might have moved. I currently have each year of MRMS precipitation rate data as its own Zarr store, but as they are quite large (on the order of 100 GB each) I'm working to split them into single days, so they are still being moved around; I was originally just trying to get a proof of concept working.
I've mostly finished rearranging the data now and uploading some more, so this works now:

```python
import datasets

ds = datasets.load_dataset("openclimatefix/mrms", streaming=True, split="train")
item = next(iter(ds))
print(item.keys())
print(item["timestamp"])
```

The MRMS data now covers most of 2016-2022, with quite a few gaps I'm working on filling in.
Hi @albertvillanova, I noticed there is now the `HfFileSystem`, where the docs show an example of writing a Zarr store directly to the Hub, with no mention of having too many files. Is there still a restriction on having lots of files in a repository?
Hi @jacobbieker. Thanks for coming back to this pending issue. In fact, we are now using the
I think it would be best to try a POC again and discuss any issues that arise and whether we can fix them on our end (both
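For reference, the `HfFileSystem` pattern referred to above looks roughly like this (the repo id is hypothetical; this assumes `huggingface_hub` and `zarr` are installed and you have write access to the repo):

```python
import numpy as np
import xarray as xr

# huggingface_hub's HfFileSystem registers the hf:// protocol with fsspec,
# so a Zarr store can be written straight to a Hub dataset repo.
# The repo id below is a placeholder, not a real repository.
ds = xr.Dataset({"precip": (("time", "y", "x"), np.random.rand(4, 32, 32))})
ds.to_zarr("hf://datasets/openclimatefix/example-repo/data.zarr", mode="w")

# Reading it back lazily works the same way:
ds2 = xr.open_dataset(
    "hf://datasets/openclimatefix/example-repo/data.zarr",
    engine="zarr",
    chunks={},
)
```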
Is your feature request related to a problem? Please describe.
Lots of geospatial data is stored in the Zarr format. This format works well for n-dimensional data and coordinates, and can have good compression. Unfortunately, as far as I can tell, HF Datasets doesn't support streaming data in Zarr format. Zarr stores are designed to be easily streamed from cloud storage, especially with xarray and fsspec. Since geospatial data tends to be very large, on the order of TBs or tens of TBs for a single dataset, it can be difficult for users to store the dataset locally. Just adding Zarr stores with HF git doesn't work well (see #3823), as Zarr splits the data into lots of small chunks for fast loading, and that doesn't work well with git. I've somewhat gotten around that issue by tarring each Zarr store and uploading them as single files, which seems to be working (see https://huggingface.co/datasets/openclimatefix/gfs-reforecast for example data files, although the script isn't written yet). This does mean that streaming doesn't quite work, though. On the other hand, in https://huggingface.co/datasets/openclimatefix/eumetsat_uk_hrv we stream a Zarr store from a public GCP bucket quite easily.
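As a rough sketch of what that streaming pattern looks like with xarray and fsspec (the bucket path below is a placeholder, not the actual eumetsat_uk_hrv location; this assumes `gcsfs` is installed):

```python
import xarray as xr

# Lazily open a Zarr store straight from a public GCP bucket. Only the
# metadata is fetched up front; chunks are downloaded on access.
ds = xr.open_dataset(
    "gs://some-public-bucket/eumetsat_uk_hrv.zarr",  # placeholder path
    engine="zarr",
    chunks={},  # let the chunking follow the store's own layout
)
```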
Describe the solution you'd like
A way to upload Zarr stores for hosted datasets so that we can stream it with xarray and fsspec.
Describe alternatives you've considered
Tarring each Zarr store individually and just extracting them in the dataset script -> Downside: this is a lot of data that probably doesn't fit locally for many potential users.
Pre-preparing examples in a format like Parquet -> Downside: this would use a lot more storage and offer a lot less flexibility; in eumetsat_uk_hrv, we use the one Zarr store for multiple different configurations.