New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix dset.iter_chunks() with selection #2381
Fix dset.iter_chunks() with selection #2381
Conversation
Resolves #2341 |
for more information, see https://pre-commit.ci
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #2381 +/- ##
=======================================
Coverage 89.53% 89.53%
=======================================
Files 17 17
Lines 2380 2380
=======================================
Hits 2131 2131
Misses 249 249 ☔ View full report in Codecov by Sentry. |
Looks like codecov had an issue uploading, rerunning the failing jobs. |
Adds a small runtime hit when there is no selection. I'm wondering that if import h5py
import numpy
a = numpy.arange(25).reshape(5, 5)
h5f = h5py.File('iterchunks.h5', 'w')
d = h5f.create_dataset('test1', data=a, chunks=(2, 2)) This branch In [22]: %timeit -n 1000000 -r 7 [x for x in d.iter_chunks()]
37.3 µs ± 1.25 µs per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each) Master In [10]: %timeit -n 1000000 -r 7 [x for x in d.iter_chunks()]
32 µs ± 265 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each) |
h5py/_hl/dataset.py
Outdated
# we still have room to extend along this dimensions | ||
return tuple(slices) | ||
break | ||
|
||
if dim > 0: | ||
# reset to the start and continue iterating with higher dimension | ||
self._chunk_index[dim] = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the code in master at present, this is the bit I'd be looking to change. When it hits the end of the selection in one dimension, this is telling it to go back to the start in that dimension - but the overall start of the dimension rather than the start of the selection. So I would try something like this:
self._chunk_index[dim] = 0 | |
self._chunk_index[dim] = s.start // self._layout[dim] |
If that works, I suspect it might fix it without the added complexity elsewhere in the class. In particular, I'm suspicious of the line self._chunk_index[dim] = 1
you've added a few lines above. I suspect this hardcoded 1 works for the particular case you've added as a test, but will still be wrong in other cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will get this tomorrow. I hope it's simpler bc I felt what I was doing was complicated but it was only solution I was conceptualizing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
git checkout master -- h5py/_hl/dataset.py
and swapping that does indeed make it pass for the additional test case.
This is good because I was uncomfortable with the slowness I introduced in the common case.
for more information, see https://pre-commit.ci
…ki/h5py into bad_slices_Dataset_iter_chunks
Thanks! Yes, I think the failures are where it's trying to test with numpy 2, which is in beta now, so nothing to do with this. #2386 is working on that. |
Close/reopen to re-run tests with that PR merged. |
Found a solution that works with other test cases.
The way I saw the problem, we would calculate a chunk_slice that was not a subset of the requested slice. So the I added a check that checks that the calculated chunk slice is a subset of the requested slice.