
@lmtroper lmtroper commented Mar 26, 2024

Changelogs

Incorporated the Zarr function that consolidates metadata when opening a Zarr group in read mode. This reduces the number of ls calls and speeds up reads from the Zarr group.
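
To illustrate why consolidation cuts the number of store requests, here is a minimal standard-library sketch. It does not use Zarr itself: `CountingStore`, `make_store`, and the other helpers are hypothetical names, and the dict stands in for a remote bucket laid out like a Zarr v2 archive.

```python
import json

# File names that hold metadata in a Zarr v2 archive.
META_NAMES = {".zgroup", ".zarray", ".zattrs"}


class CountingStore(dict):
    """Dict-backed key-value store that counts reads (a stand-in for a remote bucket)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.reads = 0

    def __getitem__(self, key):
        self.reads += 1
        return super().__getitem__(key)


def make_store(n_arrays):
    """Build a toy layout: one .zgroup plus one .zarray document per array."""
    store = CountingStore()
    store[".zgroup"] = json.dumps({"zarr_format": 2})
    for i in range(n_arrays):
        store[f"array_{i}/.zarray"] = json.dumps({"shape": [100], "chunks": [10]})
    return store


def is_metadata(key):
    return key.rsplit("/", 1)[-1] in META_NAMES


def consolidate(store, metadata_key=".zmetadata"):
    """Copy every metadata document into a single key."""
    metadata = {k: json.loads(v) for k, v in store.items() if is_metadata(k)}
    store[metadata_key] = json.dumps(
        {"zarr_consolidated_format": 1, "metadata": metadata}
    )


def read_metadata_unconsolidated(store):
    """One listing, then one read per metadata document."""
    keys = [k for k in store.keys() if is_metadata(k)]
    return {k: json.loads(store[k]) for k in keys}


def read_metadata_consolidated(store, metadata_key=".zmetadata"):
    """A single read returns all metadata at once."""
    return json.loads(store[metadata_key])["metadata"]
```

With five arrays, the unconsolidated path issues six reads (one per metadata document) while the consolidated path issues one. Against a remote store each read is a network round trip, which is where the saving comes from.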

Profiling without consolidation

====================================================================================================
Date: 2024-03-28
Time: 16:21:26
Size: 8.0 KB
Repeats: 1
Polaris version: 0.0.2.dev191+g82e7db2
Zarr version: 2.17.1
====================================================================================================
                         Creating the Zarr archive: 0:00:00.002254 ± 0:00:00
         Creating dataset from Source Zarr archive: 0:00:00.003214 ± 0:00:00
                      Uploading dataset to the Hub: 0:00:27.666530 ± 0:00:00
                          Loading dataset from Hub: 0:00:01.549233 ± 0:00:00
                   Iterating over dataset (remote): 0:00:25.385904 ± 0:00:00
                          Caching dataset to local: 0:00:18.538832 ± 0:00:00
                    Iterating over dataset (local): 0:00:00.001493 ± 0:00:00
           Baseline Zarr only upload to Cloudflare: 0:00:00.636373 ± 0:00:00
             Baseline dataset upload to Cloudflare: 0:00:03.207451 ± 0:00:00
       Baseline Zarr only download from Cloudflare: 0:00:00.251135 ± 0:00:00
         Baseline dataset download from Cloudflare: 0:00:00.432873 ± 0:00:00
====================================================================================================
                        Actual / Baseline - Upload: 8.626 ± 0.000
                      Actual / Baseline - Download: 42.827 ± 0.000



Profiling with consolidation

Profiling Report
====================================================================================================
Date: 2024-04-04
Time: 11:15:12
Size: 8.0 KB
Repeats: 2
Polaris version: dev
Zarr version: 2.16.1
====================================================================================================
                         Creating the Zarr archive: 0:00:00.018600 ± 0:00:00.007932
         Creating dataset from Source Zarr archive: 0:00:00.021886 ± 0:00:00.002237
                      Uploading dataset to the Hub: 0:00:35.845067 ± 0:00:02.073791
                          Loading dataset from Hub: 0:00:01.928347 ± 0:00:00.003685
                          Caching dataset to local: 0:00:03.662794 ± 0:00:00.299349
                    Iterating over dataset (local): 0:00:00.005156 ± 0:00:00.000391
           Baseline Zarr only upload to Cloudflare: 0:00:00.973820 ± 0:00:00.110363
             Baseline dataset upload to Cloudflare: 0:00:03.534744 ± 0:00:00.056176
       Baseline Zarr only download from Cloudflare: 0:00:00.447640 ± 0:00:00.131711
         Baseline dataset download from Cloudflare: 0:00:00.386626 ± 0:00:00.024372
====================================================================================================
                        Actual / Baseline - Upload: 10.134 ± 0.426
                      Actual / Baseline - Download: 9.463 ± 0.178
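
For readers comparing the two reports, the sketch below recomputes the "Actual / Baseline" ratios from the mean timings. It assumes, based on the numbers matching, that the upload ratio is Hub upload time over baseline dataset upload time, and that the download ratio is caching time over baseline dataset download time. The helper names are hypothetical. The two runs also differ in date, Zarr version, Polaris version, and repeat count, so the comparison is indicative only.

```python
def parse_duration(text):
    """Convert an 'H:MM:SS.ffffff' string, as printed in the reports, to seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# Mean timings copied from the two reports above.
WITHOUT_CONSOLIDATION = {
    "upload": "0:00:27.666530",
    "cache": "0:00:18.538832",
    "baseline_upload": "0:00:03.207451",
    "baseline_download": "0:00:00.432873",
}
WITH_CONSOLIDATION = {
    "upload": "0:00:35.845067",
    "cache": "0:00:03.662794",
    "baseline_upload": "0:00:03.534744",
    "baseline_download": "0:00:00.386626",
}


def ratios(report):
    """Return (upload ratio, download ratio) under the assumption stated above."""
    t = {name: parse_duration(value) for name, value in report.items()}
    return t["upload"] / t["baseline_upload"], t["cache"] / t["baseline_download"]


def caching_speedup(before, after):
    """How many times faster caching to local became."""
    return parse_duration(before["cache"]) / parse_duration(after["cache"])
```

For the single-repeat run this reproduces the reported 8.626 and 42.827. For the two-repeat run it lands near, but not exactly on, 10.134 and 9.463, presumably because the report averages per-repeat ratios. Caching to local drops from about 18.5 s to about 3.7 s, roughly a fivefold speedup.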

@lmtroper lmtroper added the enhancement New feature or request label Mar 26, 2024
@lmtroper lmtroper requested a review from cwognum March 26, 2024 15:32

@cwognum cwognum left a comment

To optimize performance, we don't want to consolidate the metadata every time.

From the Zarr docs:

>>> zarr.consolidate_metadata(store)  

This creates a special key with a copy of all of the metadata from all of the metadata objects in the store.

The key here refers to a single file for us (by default named .zmetadata, although this can be changed with the metadata_key parameter of the consolidate methods).

I think what we would want to do is:

  1. Consolidate the archive locally.
  2. Then copy over the consolidated archive to the Hub with zarr.convenience.copy_all.
  3. If this doesn't work (I assume it would copy over the .zmetadata file, but maybe not?), then we should find a way to copy this single file over manually.

Could you look into the above? I would be curious to know if this is possible!

Not having to consolidate everything on the Hub would make things a lot faster!

cwognum commented Mar 26, 2024

Now that #83 is merged, could you look into adding the consolidation to the flow for using Zarr datasets from the Hub:

  • We want to consolidate the Zarr archive just before uploading it to the Hub (here)
  • We assume that any archive uploaded to the Hub has been consolidated, so when we open it here it should use open_consolidated. Maybe we add an as_consolidated parameter to client.open_zarr_file()?

@lmtroper lmtroper requested a review from cwognum March 28, 2024 14:20

@cwognum cwognum left a comment

Great work!

Let's assume that any Zarr archive that is loaded for a dataset has been consolidated. This means that we should also change these lines of code to load in consolidated mode!


@cwognum cwognum left a comment

Almost there!

The test cases are failing, though, because the test Zarr archive is not consolidated! The formatting check also fails right now.

@cwognum cwognum merged commit e918577 into main Mar 28, 2024
@lmtroper lmtroper linked an issue Mar 28, 2024 that may be closed by this pull request
3 tasks
@cwognum cwognum deleted the 74-optimizing-polarisfs-for-zarr-files-up-to-10gb branch March 28, 2024 21:37

Development

Successfully merging this pull request may close these issues.

Optimizing PolarisFS for Zarr files up to 10GB
