@lmtroper lmtroper commented Apr 9, 2024

Changelogs

Implements listings caching using fsspec's dircache to reduce the number of ls calls made during Zarr upload and download.

The fsspec info() method was responsible for all of the ls calls made during upload, so I targeted it with listings caching. When listings caching is enabled, each call to ls first checks the cache for a listing of the requested path; if none is found, it checks the cached listing of the parent path. The idea is to avoid listing child paths when we already know that the parent path is empty:

[]
ls_path:  /storage/dataset/lmtroper/93a16e4534324def975d5a97285722c5/ls/data.zarr
[]
ls_path:  /storage/dataset/lmtroper/93a16e4534324def975d5a97285722c5/ls/data.zarr/.zgroup
[]
pipe_path:  /storage/dataset/lmtroper/93a16e4534324def975d5a97285722c5/data.zarr/.zgroup
ls_path:  /storage/dataset/lmtroper/93a16e4534324def975d5a97285722c5/ls/data.zarr
[{'name': 'dataset/lmtroper/93a16e4534324def975d5a97285722c5/data.zarr/.zgroup', 'size': 24, 'type': 'file'}]
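
For illustration, the lookup order described above can be sketched roughly as follows. This is a minimal, standalone sketch and not the code in this PR: `ListingCache`, `fake_remote_ls`, and `remote_files` are hypothetical names, a plain dict stands in for fsspec's dircache, and the cache is assumed to be invalidated explicitly after a write.

```python
from typing import Callable


class ListingCache:
    """Hypothetical wrapper that caches listings (not fsspec or Polaris code)."""

    def __init__(self, remote_ls: Callable[[str], list]):
        self._remote_ls = remote_ls  # function that performs the real ls call
        self._cache = {}             # path -> list of entry dicts
        self.remote_calls = 0        # counts how often we hit the remote

    @staticmethod
    def _parent(path: str) -> str:
        path = path.rstrip("/")
        return path.rsplit("/", 1)[0] if "/" in path else ""

    def ls(self, path: str) -> list:
        path = path.rstrip("/")
        # 1. Exact hit: this path was listed before.
        if path in self._cache:
            return self._cache[path]
        # 2. Parent hit: answer from the parent's cached listing if possible.
        parent = self._parent(path)
        if parent in self._cache:
            matches = [e for e in self._cache[parent] if e["name"] == path]
            if not matches:
                return []  # parent is known not to contain this child
            if matches[0]["type"] == "file":
                return matches
            # A directory entry still needs its own listing, so fall through.
        # 3. Miss: ask the remote and remember the answer.
        self.remote_calls += 1
        listing = self._remote_ls(path)
        self._cache[path] = listing
        return listing

    def invalidate(self, path: str) -> None:
        """Drop cached listings affected by a write to `path`."""
        path = path.rstrip("/")
        self._cache.pop(path, None)
        self._cache.pop(self._parent(path), None)


# In-memory stand-in for the remote store: full file path -> size in bytes.
remote_files = {}


def fake_remote_ls(path: str) -> list:
    prefix = path.rstrip("/") + "/"
    return [
        {"name": name, "size": size, "type": "file"}
        for name, size in remote_files.items()
        if name.startswith(prefix) and "/" not in name[len(prefix):]
    ]


fs = ListingCache(fake_remote_ls)
fs.ls("dataset/data.zarr")          # remote call; the parent turns out to be empty
fs.ls("dataset/data.zarr/.zgroup")  # answered from the cached parent listing
print(fs.remote_calls)              # 1
```

In this sketch a write would be followed by `fs.invalidate(path)`, so that the next ls of the parent goes back to the remote and sees the new file, similar to the last two lines of the log above.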

Profiling with listing caching:

Profiling Report
====================================================================================================
Date: 2024-04-09
Time: 15:12:45
Size: 953.67 MB
Repeats: 1
Polaris version: dev
Zarr version: 2.16.1
====================================================================================================
                         Creating the Zarr archive: 0:00:01.336539 ± 0:00:00
         Creating dataset from Source Zarr archive: 0:00:01.979446 ± 0:00:00
                      Uploading dataset to the Hub: 0:14:26.429550 ± 0:00:00
                          Loading dataset from Hub: 0:00:03.912605 ± 0:00:00
                          Caching dataset to local: 0:23:45.847385 ± 0:00:00
           Baseline Zarr only upload to Cloudflare: 0:04:38.541630 ± 0:00:00
             Baseline dataset upload to Cloudflare: 0:05:00.708451 ± 0:00:00
       Baseline Zarr only download from Cloudflare: 0:01:16.668102 ± 0:00:00
         Baseline dataset download from Cloudflare: 0:01:12.491876 ± 0:00:00
====================================================================================================
                        Actual / Baseline - Upload: 2.881 ± 0.000
                      Actual / Baseline - Download: 19.669 ± 0.000
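
For reference, the two ratios can be reproduced from the timings in the report. The sketch below assumes that "Actual" refers to the Hub upload and the local caching step, and that "Baseline" refers to the dataset (not Zarr-only) Cloudflare rows; the report does not state this, but the arithmetic matches those rows. The helper `parse` is a hypothetical name.

```python
from datetime import timedelta


def parse(ts: str) -> timedelta:
    """Parse an H:MM:SS.ffffff string as printed in the profiling report."""
    hours, minutes, seconds = ts.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


# Uploading dataset to the Hub / Baseline dataset upload to Cloudflare
upload_ratio = parse("0:14:26.429550") / parse("0:05:00.708451")
# Caching dataset to local / Baseline dataset download from Cloudflare
download_ratio = parse("0:23:45.847385") / parse("0:01:12.491876")

print(round(upload_ratio, 3))    # 2.881
print(round(download_ratio, 3))  # 19.669
```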

@lmtroper lmtroper added the feature label (Annotates any PR that adds new features; Used in the release process) Apr 9, 2024
@lmtroper lmtroper requested a review from cwognum April 9, 2024 15:07

@cwognum cwognum left a comment

Nice optimization, @lmtroper!

There is a more official implementation that simplifies the code further. It's not very well documented, so I went through an existing implementation to understand how it can be used.

The good news: you reached a very similar solution! Great minds think alike! 😉

@lmtroper lmtroper requested a review from cwognum April 9, 2024 16:32

@cwognum cwognum left a comment

Implementation looks great!

However, in your experience it has been slower, right?

I'm not sure why it would be... Maybe leave the PR open for now while we investigate?

@lmtroper lmtroper merged commit cbe5f9f into main Apr 9, 2024
@lmtroper lmtroper deleted the feat/polarisfs-listings-dircache branch April 9, 2024 17:45