Bugfix: Set equal chunking for shapes and dataset #211

Merged
merged 3 commits into from Jan 23, 2024

Conversation

CSSFrancis
Member

Description of the change

I noticed this morning that I was having issues loading larger arrays of vectors using the distributed-dask backend.

The problem is that the shape array is not saved with the same chunking scheme as the data, so when you try to load the shape array, its memory has to be split between all of the different cores. This is slow, probably partially because dask doesn't know the size of the array which

Progress of the PR

  • Change implemented (can be split into several points),
  • add a changelog entry in the upcoming_changes folder (see upcoming_changes/README.rst),
  • Check formatting of the changelog entry (and eventual user guide changes) in the docs/readthedocs.org:rosettasciio build of this PR (link in github checks)
  • add tests,
  • ready for review.

Minimal example of the bug fix or the new feature

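import hyperspy.api as hs

# s is an existing signal loaded earlier (not shown here)
pks = s.find_peaks()
pks.save("ragged.zspy")
pks = hs.load("ragged.zspy", lazy=True)  # load the file that was just saved
pks.compute()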

For a 1024x1024 array of peak positions, a file saved with the old scheme took 1 minute to load; with the new scheme that becomes ~1 second.
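
As a rough illustration of why mismatched chunking can be costly, the sketch below counts, along a single axis, how many chunks of a second array each data chunk overlaps. The chunk sizes and the helper names `chunk_edges` and `overlaps_per_chunk` are hypothetical and chosen only for illustration; this is not the rosettasciio implementation, and the chunking the old saving scheme actually used is not stated here.

```python
from itertools import accumulate


def chunk_edges(chunks):
    """Return (start, stop) index pairs for dask-style chunk sizes along one axis."""
    stops = list(accumulate(chunks))
    starts = [0] + stops[:-1]
    return list(zip(starts, stops))


def overlaps_per_chunk(data_chunks, shape_chunks):
    """For each data chunk, count how many shape chunks it overlaps along one axis."""
    shape_edges = chunk_edges(shape_chunks)
    counts = []
    for d_start, d_stop in chunk_edges(data_chunks):
        n = sum(
            1
            for s_start, s_stop in shape_edges
            if s_start < d_stop and d_start < s_stop
        )
        counts.append(n)
    return counts


# Hypothetical chunk sizes along one navigation axis of length 1024.
data_chunks = (256, 256, 256, 256)
mismatched = (400, 400, 224)  # assumed example of a different chunking
aligned = data_chunks         # same chunking as the data

print(overlaps_per_chunk(data_chunks, mismatched))  # [1, 2, 1, 2]
print(overlaps_per_chunk(data_chunks, aligned))     # [1, 1, 1, 1]
```

With aligned chunks, every data block needs exactly one block of the shape array, so each worker can pair the two without pulling pieces from other blocks.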

@CSSFrancis CSSFrancis force-pushed the fix_vector_loading_distributed branch from 63b145b to 995818c Compare January 17, 2024 15:00

codecov bot commented Jan 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (c85f768) 86.22% compared to head (68b72bf) 86.17%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #211      +/-   ##
==========================================
- Coverage   86.22%   86.17%   -0.05%     
==========================================
  Files          82       82              
  Lines       10549    10548       -1     
  Branches     2293     2293              
==========================================
- Hits         9096     9090       -6     
- Misses        933      938       +5     
  Partials      520      520              


@ericpre
Member

ericpre commented Jan 17, 2024

Looks good, do you want to add a changelog entry?

@CSSFrancis
Member Author

> Looks good, do you want to add a changelog entry?

Yea I'll add a changelog entry, but I might just try it out today to make sure there aren't any other bugs related to this!

shape = shape.rechunk(data.chunks)
shape = da.from_array(
    ragged_shape, chunks=data.chunks
)  # same chunks as data
Member

Update the comment to also mention the reason?
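
For context on the snippet under review, here is a minimal, self-contained sketch of what passing `chunks=data.chunks` to `da.from_array` does. The 4x4 grid, the random peak arrays, and the way `ragged_shape` is built are assumptions made for illustration; only the `da.from_array(ragged_shape, chunks=data.chunks)` call mirrors the line in the diff.

```python
import numpy as np
import dask.array as da

# Hypothetical stand-in for a ragged dataset: a 4x4 navigation grid where each
# element is an (n, 2) array of peak positions with a different n.
rng = np.random.default_rng(0)
ragged = np.empty((4, 4), dtype=object)
ragged_shape = np.empty((4, 4), dtype=object)
for idx in np.ndindex(ragged.shape):
    ragged[idx] = rng.random((rng.integers(1, 5), 2))
    ragged_shape[idx] = ragged[idx].shape  # per-element shape, stored as a tuple

# Chunk the data into 2x2 blocks (an arbitrary choice for this sketch).
data = da.from_array(ragged, chunks=(2, 2))

# Give the shape array the same chunks as the data, as in the reviewed line.
shape = da.from_array(ragged_shape, chunks=data.chunks)

# For comparison: the same array held in a single chunk (assumed example).
single = da.from_array(ragged_shape, chunks=ragged_shape.shape)

print(shape.chunks == data.chunks)  # True
print(shape.numblocks, single.numblocks)
```

Because `shape` and `data` share block boundaries, each block of one lines up with exactly one block of the other.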

@ericpre
Member

ericpre commented Jan 23, 2024

Ping @CSSFrancis! 😉

Just to avoid conflicts with #213.

@CSSFrancis
Member Author

> Ping @CSSFrancis! 😉
>
> Just to avoid conflicts with #213.

@ericpre Oops let me add that comment and then we can merge. This has been working for the last week (at least well enough that I forgot about coming back to the fix :))

@ericpre
Member

ericpre commented Jan 23, 2024

And a changelog entry please!

@ericpre ericpre added this to the v0.4 milestone Jan 23, 2024
@ericpre ericpre merged commit fb675f3 into hyperspy:main Jan 23, 2024
28 of 31 checks passed
@CSSFrancis CSSFrancis deleted the fix_vector_loading_distributed branch January 23, 2024 19:49