Bugfix: Set equal chunking for shapes and dataset #211
Conversation
63b145b to 995818c
Codecov Report: all modified and coverable lines are covered by tests ✅

@@            Coverage Diff             @@
##             main     #211      +/-   ##
==========================================
- Coverage   86.22%   86.17%   -0.05%
==========================================
  Files          82       82
  Lines       10549    10548       -1
  Branches     2293     2293
==========================================
- Hits         9096     9090       -6
- Misses        933      938       +5
  Partials      520      520
Looks good, do you want to add a changelog entry?
Yeah, I'll add a changelog entry, but I might just try it out today to make sure there aren't any other bugs related to this!
rsciio/_hierarchical.py (outdated)

    shape = shape.rechunk(data.chunks)
    shape = da.from_array(
        ragged_shape, chunks=data.chunks
    )  # same chunks as data
Update the comment to also mention the reason?
Ping @CSSFrancis! 😉 Just to avoid conflicts with #213.
@ericpre Oops, let me add that comment and then we can merge. This has been working for the last week (at least well enough that I forgot about coming back to the fix :))
And a changelog entry please!
Description of the change
I noticed this morning that I was having issues loading larger arrays of vectors using the distributed dask backend.
The problem is that the shape array is not saved with the same chunking scheme as the dataset, so when you load the shape array its chunks have to be split across all of the different cores. This is slow, probably in part because dask doesn't know the size of the array.
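The idea behind the fix can be sketched roughly like this (a minimal illustration with made-up sizes; the names `data` and `ragged_shape` follow the PR diff, but the arrays here are dummies):

```python
import numpy as np
import dask.array as da

# A stand-in for the main dataset, with an explicit chunking scheme.
data = da.zeros((8, 8), chunks=(4, 4))

# One shape entry per navigation position, built eagerly as a numpy array.
ragged_shape = np.ones((8, 8), dtype=object)

# The fix: give the auxiliary shape array the *same* chunks as the data,
# instead of letting dask pick automatic chunks. Each shape chunk then
# lines up 1:1 with a data chunk, so loading a chunk of data never needs
# to gather shape information from other chunks/workers.
shape = da.from_array(ragged_shape, chunks=data.chunks)

print(shape.chunks == data.chunks)  # → True
```

With mismatched chunking, aligning the two arrays at load time would require an extra rechunk step in the task graph, which is where the slowdown described above comes from.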
Progress of the PR

- changelog entry in the upcoming_changes folder (see upcoming_changes/README.rst),
- docs/readthedocs.org:rosettasciio build of this PR (link in github checks)

Minimal example of the bug fix or the new feature
For a 1024x1024 array of peak positions the old saving scheme took 1 minute to load; with the new scheme that drops to ~1 second.