Use of Zarr on HPC file systems #659

Closed
pbranson opened this issue Jun 27, 2019 · 9 comments

Comments

@pbranson
Member

pbranson commented Jun 27, 2019

When using Zarr on HPC filesystems for large datasets, a large number of files is frequently created, depending on the chunk_size utilised.

On HPC systems, file/inode quotas are often imposed, because many small files can adversely affect the stability of the filesystem and place significant load on the metadata servers.

After creating the Zarr datasets in a directory, they can be stored into zip files (without compression, as the chunks should already be compressed) and zarr.storage.ZipStore utilised to access the files. This works well, i.e.:

ds = xr.open_zarr(zarr.storage.ZipStore(zipfile, mode='r'))

The recently added zarr.consolidate_metadata also works.
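
For reference, a minimal sketch of producing one such archive (file names are illustrative; it assumes zarr >= 2.3 for consolidate_metadata and zips the directory store without compression, since the chunks are already compressed):

import os
import zipfile
import xarray as xr
import zarr

# Hypothetical monthly source file; chunk so that each Zarr chunk is a sensible size
ds = xr.open_dataset('ww3.aus_4m.197901.nc').chunk({'time': 240})

# Write a normal directory store, then consolidate its metadata into .zmetadata
root_dir = 'ww3.aus_4m.197901.zarr'
store = zarr.storage.DirectoryStore(root_dir)
ds.to_zarr(store)
zarr.consolidate_metadata(store)

# Zip the directory with ZIP_STORED (no archive-level compression)
with zipfile.ZipFile('ww3.aus_4m.197901.zip', 'w', zipfile.ZIP_STORED) as zf:
    for dirpath, _, files in os.walk(root_dir):
        for fname in files:
            path = os.path.join(dirpath, fname)
            zf.write(path, os.path.relpath(path, root_dir))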

Whilst this works well, I have often found it is more efficient to still partition the dataset into years/months, depending on the source files and the size of the dataset.

I am wondering what the recommended way to consolidate metadata across multiple zarr ZipStores might be. Is it possible to create an intake catalog for this? I think that would require some minor alterations to xr.open_zarr to check for a .zip file extension and use ZipStore. Alternatively, is it possible to create some sort of .zmetadata file that consolidates the metadata?

ww3.aus_4m.197901.zip
ww3.aus_4m.197902.zip
ww3.aus_4m.197903.zip
ww3.aus_4m.197904.zip
ww3.aus_4m.197905.zip
ww3.aus_4m.197906.zip

At present I use the following boilerplate code to load a multi-ZipStore dataset:

import xarray as xr
import zarr
from dask import delayed, compute

def open_mf_zarrzipstore(zips, keep_vars=None):
    @delayed
    def open_zip_store(z, keep_vars=None):
        # Open a single zip archive as a read-only Zarr store with consolidated metadata
        store = zarr.storage.ZipStore(z, mode='r')
        ds = xr.open_zarr(store, consolidated=True)
        if keep_vars is not None:
            ds = ds[keep_vars]
        return ds

    # Open all archives in parallel, then concatenate along the time dimension
    mf_ds = compute([open_zip_store(z, keep_vars) for z in zips])[0]
    ds = xr.concat(mf_ds, dim='time')
    return ds

ds = open_mf_zarrzipstore(zips, keep_vars=['hs'])

Thanks for any advice.

@guillaumeeb
Member

Hi @pbranson, that's really interesting to see other people using Zarr on HPC systems. cc @tinaok @fbriol @apatlpo.

First of all

When using Zarr on HPC filesystems for large datasets, frequently a large number of files are created depending on the chunk_size utilised.

Can't you just increase the chunk_size to have reasonably sized files? I've not personally used Zarr a lot on big datasets, but I've seen users (cc'ed above) generate millions of files on our system with an incorrect chunk size (a tens-of-TB dataset with 7 MB chunks...). My first advice was to increase the chunk size so as to have at least 100 MB chunks, or even more. This recommendation is also valid on cloud object stores. But I imagine you are probably aware of this.
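
For reference, a minimal rechunking sketch (the file pattern and chunk size are illustrative only; the right time chunk depends on the grid size):

import xarray as xr

# Hypothetical source files; pick the time chunk so each Zarr chunk is O(100 MB)
ds = xr.open_mfdataset('ww3.aus_4m.1979*.nc')
ds = ds.chunk({'time': 240})   # rechunk the dask arrays; to_zarr writes one file per chunk
ds.to_zarr('ww3.aus_4m.1979.zarr')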

Whilst this works well, often I have found the it is more efficient to still partition the dataset into years/months depending on the source files and size of the dataset.

Do you know where this comes from? Is Xarray able to work correctly with chunks from inside zip files, e.g. accessing different parts of one zip file from multiple processes?

Sorry about this reply; I'm merely asking more questions rather than answering the major part of your issue...

@pbranson
Member Author

pbranson commented Jun 28, 2019

Do you know where this comes from?

In my instance it is mainly due to the poor performance of using open_mfdataset on many poorly chunked netCDF files. The dask graph creates very many file-open requests to the netCDF files, and I think there was some sort of threading/file-locking issue that would cause these file opens to take many seconds each.

By working on a file-by-file basis to make 7 GB zips, comprised of approximately 50-100 MB chunks, the conversion could be sent to the scheduler as a job array, with a small dask cluster tackling each file.

Regarding parallel access to the zip files: yep, it works, and it seems to add only minimal overhead (I haven't quantified it yet). My understanding is that the zip header has all the chunks and their binary offsets in the file, so a read consults the header and seeks to the chunk location.
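
That behaviour can be illustrated with the standard library alone (the chunk key below is hypothetical):

import zipfile

with zipfile.ZipFile('ww3.aus_4m.197901.zip') as zf:
    info = zf.getinfo('hs/0.0.0')      # hypothetical Zarr chunk key
    print(info.header_offset)          # byte offset recorded in the central directory
    chunk_bytes = zf.read('hs/0.0.0')  # reading one member is a seek + read, not a full scan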

@pbranson
Member Author

pbranson commented Jul 3, 2019

I wonder if @martindurant has some suggestions on this?

@martindurant
Contributor

Yes indeed, reading files from within a zip archive ought to add little overhead, but obviously the system is having to do the same offsetting somewhere along the way. Just don't use tar :)

My first question: is there anything wrong with your current approach and the little code you posted? It would be reasonable to have something like this as an option in intake-xarray, although I'm not sure how specialised it is.

Since ZipFileSystem is now implemented in fsspec, it would be possible (I should say, will be possible in the near future) to use URLs only and have Dask/xarray sort things out.
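
A sketch of what that fsspec route might look like once wired up (assuming a recent fsspec with the "zip" protocol registered; the mapper is the key/value mapping that zarr and xarray consume):

import fsspec
import xarray as xr

fs = fsspec.filesystem('zip', fo='ww3.aus_4m.197901.zip')
mapper = fs.get_mapper('')  # expose the archive root as a key/value store
ds = xr.open_zarr(mapper, consolidated=True)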

I don't know about storing the sets of metadata outside the archives. Obviously, it could be done, something along the lines of the existing consolidation mechanism, or something in Intake, or a new class of dict-store. Effectively, it's like creating a zarr group out of existing data-sets.
However, it would take some planning. From the point of view of the original problem here, perhaps the current two-stage consolidation (write zarr as normal, then create a single metadata object for the data-set) could be short-circuited to avoid the creation of the many small .z* files, at the cost of making the dataset unreadable without the consolidated metadata and possibly hard to change.
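
A speculative sketch of the cross-archive consolidation idea, treating each archive as a member of a virtual group (the group names and output file are made up, and the result is not directly readable by zarr; it only gathers the existing .zmetadata entries):

import json
import zarr

combined = {}
for name, path in [('197901', 'ww3.aus_4m.197901.zip'),
                   ('197902', 'ww3.aus_4m.197902.zip')]:
    store = zarr.storage.ZipStore(path, mode='r')
    meta = json.loads(store['.zmetadata'].decode())['metadata']
    for key, value in meta.items():
        combined['{}/{}'.format(name, key)] = value  # prefix keys per archive
    store.close()

with open('combined.zmetadata', 'w') as f:
    json.dump({'zarr_consolidated_format': 1, 'metadata': combined}, f)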

@pbranson
Member Author

pbranson commented Jul 3, 2019 via email

@martindurant
Contributor

I'll raise the idea of a "metadata only" zarr group, which could consolidate the consolidated metadata of its members and record how they are stored, at today's zarr meeting. You could also raise an issue at zarr for something like this, or restate your problem there, to see if they have any better ideas.

@tinaok
Contributor

tinaok commented Jul 25, 2019

Hi, what is the stripe count for your Zarr store on your Lustre file system? Depending on your chunk size and the number of chunks in your Zarr store, modifying the stripe count from your system default might improve performance.
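
For anyone unfamiliar with Lustre striping, a hedged sketch of inspecting and changing it (the path and stripe count are placeholders; setting the stripe on a directory only affects files created in it afterwards):

import subprocess

target = '/scratch/project/ww3.aus_4m.zarr'  # hypothetical store directory
subprocess.run(['lfs', 'getstripe', target], check=True)              # show the current layout
subprocess.run(['lfs', 'setstripe', '-c', '8', target], check=True)   # default stripe count for new files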

@stale

stale bot commented Sep 23, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Sep 23, 2019
@stale

stale bot commented Sep 30, 2019

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.
