Bugfix: Set equal chunking for shapes and dataset #211
Conversation
63b145b to 995818c
Codecov Report: all modified and coverable lines are covered by tests ✅

@@            Coverage Diff             @@
##             main     #211      +/-   ##
==========================================
- Coverage   86.22%   86.17%   -0.05%
==========================================
  Files          82       82
  Lines       10549    10548       -1
  Branches     2293     2293
==========================================
- Hits         9096     9090       -6
- Misses        933      938       +5
  Partials      520      520
Looks good, do you want to add a changelog entry?
Yeah, I'll add a changelog entry, but I might just try it out today to make sure there aren't any other bugs related to this!
rsciio/_hierarchical.py (outdated)

    shape = shape.rechunk(data.chunks)
    shape = da.from_array(
        ragged_shape, chunks=data.chunks
    )  # same chunks as data
Update the comment to also mention the reason?
Ping @CSSFrancis! 😉 Just to avoid conflicts with #213.
@ericpre Oops, let me add that comment and then we can merge. This has been working for the last week (at least well enough that I forgot about coming back to the fix :))
And a changelog entry please!
Description of the change
I noticed this morning that I was having issues loading larger arrays of vectors using the distributed dask backend.
The problem is that the shape array is not saved with the same chunking scheme as the dataset, so when you load the shape array its chunks have to be split across all of the different cores. This is slow, probably in part because dask doesn't know the size of the array.
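The idea behind the fix can be sketched roughly like this (a minimal illustration with made-up sizes; the names `data` and `ragged_shape` follow the PR diff, but the arrays here are dummies):

```python
import numpy as np
import dask.array as da

# A stand-in for the main dataset, with an explicit chunking scheme.
data = da.zeros((8, 8), chunks=(4, 4))

# One shape entry per navigation position, built eagerly as a numpy array.
ragged_shape = np.ones((8, 8), dtype=object)

# The fix: give the auxiliary shape array the *same* chunks as the data,
# instead of letting dask pick automatic chunks. Each shape chunk then
# lines up 1:1 with a data chunk, so loading a chunk of data never needs
# to gather shape information from other chunks/workers.
shape = da.from_array(ragged_shape, chunks=data.chunks)

print(shape.chunks == data.chunks)  # → True
```

With mismatched chunking, aligning the two arrays at load time would require an extra rechunk step in the task graph, which is where the slowdown described above comes from.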
Progress of the PR

- changelog entry in the upcoming_changes folder (see upcoming_changes/README.rst),
- docs/readthedocs.org:rosettasciio build of this PR (link in github checks)

Minimal example of the bug fix or the new feature
For a 1024x1024 array of peak positions the old saving scheme took 1 minute to load; with the new scheme that drops to ~1 second.