Use of Zarr on HPC file systems #659

Closed
pbranson opened this issue Jun 27, 2019 · 9 comments

Comments

@pbranson
Member

pbranson commented Jun 27, 2019

When using Zarr on HPC filesystems for large datasets, a large number of files is frequently created, depending on the chunk_size utilised.

On HPC systems, file/inode quotas are often imposed, because many small files can adversely affect the stability of the filesystem and place significant load on the metadata servers.

After creating the Zarr datasets in a directory, they can be stored into zip files (without compression, as the chunks should already be compressed) and zarr.storage.ZipStore utilised to access the files. This works well, i.e.:

ds = xr.open_zarr(zarr.storage.ZipStore(zipfile, mode='r'))

The recently added zarr.consolidate_metadata also works.
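
For reference, a minimal sketch of producing one such archive (file names are illustrative; it assumes zarr >= 2.3 for consolidate_metadata and zips the directory store without compression, since the chunks are already compressed):

import os
import zipfile
import xarray as xr
import zarr

# Hypothetical monthly source file; chunk so that each Zarr chunk is a sensible size
ds = xr.open_dataset('ww3.aus_4m.197901.nc').chunk({'time': 240})

# Write a normal directory store, then consolidate its metadata into .zmetadata
root_dir = 'ww3.aus_4m.197901.zarr'
store = zarr.storage.DirectoryStore(root_dir)
ds.to_zarr(store)
zarr.consolidate_metadata(store)

# Zip the directory with ZIP_STORED (no archive-level compression)
with zipfile.ZipFile('ww3.aus_4m.197901.zip', 'w', zipfile.ZIP_STORED) as zf:
    for dirpath, _, files in os.walk(root_dir):
        for fname in files:
            path = os.path.join(dirpath, fname)
            zf.write(path, os.path.relpath(path, root_dir))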

Whilst this works well, I have often found it is more efficient to still partition the dataset into years/months, depending on the source files and the size of the dataset.

I am wondering what the recommended way to consolidate metadata across multiple zarr ZipStores might be. Is it possible to create an intake catalog for this? I think that would require some minor alterations to xr.open_zarr to check for a .zip file extension and use ZipStore. Alternatively, is it possible to create some sort of .zmetadata file that consolidates the metadata?

ww3.aus_4m.197901.zip
ww3.aus_4m.197902.zip
ww3.aus_4m.197903.zip
ww3.aus_4m.197904.zip
ww3.aus_4m.197905.zip
ww3.aus_4m.197906.zip

At present I use the following boilerplate code to load a multi-ZipStore dataset:

import xarray as xr
import zarr
from dask import delayed, compute

def open_mf_zarrzipstore(zips, keep_vars=None):
    @delayed
    def open_zip_store(z, keep_vars=None):
        # Open a single zip archive as a read-only Zarr store with consolidated metadata
        store = zarr.storage.ZipStore(z, mode='r')
        ds = xr.open_zarr(store, consolidated=True)
        if keep_vars is not None:
            ds = ds[keep_vars]
        return ds

    # Open all archives in parallel, then concatenate along the time dimension
    mf_ds = compute([open_zip_store(z, keep_vars) for z in zips])[0]
    ds = xr.concat(mf_ds, dim='time')
    return ds

ds = open_mf_zarrzipstore(zips, keep_vars=['hs'])

Thanks for any advice.

@guillaumeeb
Member

Hi @pbranson, that's really interesting to see other people using Zarr on HPC systems. cc @tinaok @fbriol @apatlpo.

First of all

When using Zarr on HPC filesystems for large datasets, frequently a large number of files are created depending on the chunk_size utilised.

Can't you just increase the chunk_size to have reasonably sized files? I've not personally used Zarr a lot on big datasets, but I've seen users (cc'ed above) generate millions of files on our system with an incorrect chunk size (a tens-of-TB dataset with 7 MB chunks...). My first advice was to increase the chunk size so as to have at least 100 MB chunks, or even more. This recommendation is also valid on cloud object stores. But I imagine you are probably aware of this.
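
For reference, a minimal rechunking sketch (the file pattern and chunk size are illustrative only; the right time chunk depends on the grid size):

import xarray as xr

# Hypothetical source files; pick the time chunk so each Zarr chunk is O(100 MB)
ds = xr.open_mfdataset('ww3.aus_4m.1979*.nc')
ds = ds.chunk({'time': 240})   # rechunk the dask arrays; to_zarr writes one file per chunk
ds.to_zarr('ww3.aus_4m.1979.zarr')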

Whilst this works well, often I have found the it is more efficient to still partition the dataset into years/months depending on the source files and size of the dataset.

Do you know where this comes from? Is Xarray able to work correctly with chunks from inside zip files, e.g. accessing different parts of one zip file from multiple processes?

Sorry about this reply; I'm merely asking more questions rather than answering the major part of your issue...

@pbranson
Member Author

pbranson commented Jun 28, 2019

Do you know where this comes from?

In my instance it is mainly due to the poor performance of using open_mfdataset on many poorly chunked netCDF files. The dask graph creates very many file-open requests to the netCDF files, and I think there was some sort of threading/file-locking issue that would cause these file opens to take many seconds each.

By working on a file-by-file basis to make 7 GB zips, comprised of approximately 50-100 MB chunks, the conversion could be sent to the scheduler as a job array, with a small dask cluster tackling each file.

Regarding parallel access to the zip files: yep, it works, and it seems to add only minimal overhead (I haven't quantified it yet). My understanding is that the zip header has all the chunks and their binary offsets in the file, so a read consults the header and seeks to the chunk location.
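
That behaviour can be illustrated with the standard library alone (the chunk key below is hypothetical):

import zipfile

with zipfile.ZipFile('ww3.aus_4m.197901.zip') as zf:
    info = zf.getinfo('hs/0.0.0')      # hypothetical Zarr chunk key
    print(info.header_offset)          # byte offset recorded in the central directory
    chunk_bytes = zf.read('hs/0.0.0')  # reading one member is a seek + read, not a full scan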

@pbranson
Member Author

pbranson commented Jul 3, 2019

I wonder if @martindurant has some suggestions on this?

@martindurant
Contributor

Yes indeed, reading files from within a zip archive ought to add little overhead, but obviously the system is having to do the same offsetting somewhere along the way. Just don't use tar :)

My first question: is there anything wrong with your current approach and the little code you posted? It would be reasonable to have something like this as an option in intake-xarray, although I'm not sure how specialised it is.

Since ZipFileSystem is now implemented in fsspec, it would be possible (I should say, will be possible in the near future) to use URLs only and have Dask/xarray sort things out.
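
A sketch of what that fsspec route might look like once wired up (assuming a recent fsspec with the "zip" protocol registered; the mapper is the key/value mapping that zarr and xarray consume):

import fsspec
import xarray as xr

fs = fsspec.filesystem('zip', fo='ww3.aus_4m.197901.zip')
mapper = fs.get_mapper('')  # expose the archive root as a key/value store
ds = xr.open_zarr(mapper, consolidated=True)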

I don't know about storing the sets of metadata outside the archives. Obviously, it could be done, something along the lines of the existing consolidation mechanism, or something in Intake, or a new class of dict-store. Effectively, it's like creating a zarr group out of existing data-sets.
However, it would take some planning. From the point of view of the original problem here, perhaps the current two-stage consolidation (write zarr as normal, then create a single metadata object for the data-set) could be short-circuited to avoid the creation of the many small .z* files, at the cost of making the dataset unreadable without the consolidated metadata and possibly hard to change.
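
A speculative sketch of the cross-archive consolidation idea, treating each archive as a member of a virtual group (the group names and output file are made up, and the result is not directly readable by zarr; it only gathers the existing .zmetadata entries):

import json
import zarr

combined = {}
for name, path in [('197901', 'ww3.aus_4m.197901.zip'),
                   ('197902', 'ww3.aus_4m.197902.zip')]:
    store = zarr.storage.ZipStore(path, mode='r')
    meta = json.loads(store['.zmetadata'].decode())['metadata']
    for key, value in meta.items():
        combined['{}/{}'.format(name, key)] = value  # prefix keys per archive
    store.close()

with open('combined.zmetadata', 'w') as f:
    json.dump({'zarr_consolidated_format': 1, 'metadata': combined}, f)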

@pbranson
Member Author

pbranson commented Jul 3, 2019 via email

@martindurant
Contributor

I'll raise the idea of a "metadata only" zarr group, which could consolidate the consolidated metadata of its members and record how they are stored, at today's zarr meeting. You could also raise an issue at zarr for something like this, or restate your problem there, to see if they have any better ideas.

@tinaok
Contributor

tinaok commented Jul 25, 2019

Hi, what is the stripe count for your Zarr store on your Lustre file system? Depending on your chunk size and the number of chunks in your Zarr store, modifying the stripe count from your system default might improve performance.
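
For anyone unfamiliar with Lustre striping, a hedged sketch of inspecting and changing it (the path and stripe count are placeholders; setting the stripe on a directory only affects files created in it afterwards):

import subprocess

target = '/scratch/project/ww3.aus_4m.zarr'  # hypothetical store directory
subprocess.run(['lfs', 'getstripe', target], check=True)              # show the current layout
subprocess.run(['lfs', 'setstripe', '-c', '8', target], check=True)   # default stripe count for new files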

@stale

stale bot commented Sep 23, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Sep 23, 2019
@stale

stale bot commented Sep 30, 2019

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.
