Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix dset.iter_chunks() with selection #2381

Merged
merged 7 commits into from Mar 24, 2024

Conversation

Delengowski
Copy link
Contributor

Found a solution that works with other test cases.

The way I saw the problem, we would calculate a chunk_slice that was not a subset of the requested slice. So the I added a check that checks that the calculated chunk slice is a subset of the requested slice.

@Delengowski
Copy link
Contributor Author

Resolves #2341

Copy link

codecov bot commented Feb 16, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.53%. Comparing base (6b512e5) to head (c537b8c).
Report is 17 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #2381   +/-   ##
=======================================
  Coverage   89.53%   89.53%           
=======================================
  Files          17       17           
  Lines        2380     2380           
=======================================
  Hits         2131     2131           
  Misses        249      249           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@aragilar
Copy link
Member

Looks like codecov had an issue uploading, rerunning the failing jobs.

@Delengowski
Copy link
Contributor Author

Adds a small runtime hit when there is no selection. I'm wondering that if souce_sel is passed as None we should keep that information and not perform that check at all. This is all assuming my check is the correct way to go about it and there's not a simpler solution.

import h5py
import numpy
a = numpy.arange(25).reshape(5, 5)
h5f = h5py.File('iterchunks.h5', 'w')
d = h5f.create_dataset('test1', data=a, chunks=(2, 2))

This branch

In [22]: %timeit -n 1000000 -r 7 [x for x in d.iter_chunks()]
37.3 µs ± 1.25 µs per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Master

In [10]: %timeit -n 1000000 -r 7 [x for x in d.iter_chunks()]
32 µs ± 265 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

@takluyver takluyver changed the title Adds test case for bug report and resolves Fix dset.iter_chunks() with selection Mar 13, 2024
# we still have room to extend along this dimensions
return tuple(slices)
break

if dim > 0:
# reset to the start and continue iterating with higher dimension
self._chunk_index[dim] = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the code in master at present, this is the bit I'd be looking to change. When it hits the end of the selection in one dimension, this is telling it to go back to the start in that dimension - but the overall start of the dimension rather than the start of the selection. So I would try something like this:

Suggested change
self._chunk_index[dim] = 0
self._chunk_index[dim] = s.start // self._layout[dim]

If that works, I suspect it might fix it without the added complexity elsewhere in the class. In particular, I'm suspicious of the line self._chunk_index[dim] = 1 you've added a few lines above. I suspect this hardcoded 1 works for the particular case you've added as a test, but will still be wrong in other cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will get this tomorrow. I hope it's simpler bc I felt what I was doing was complicated but it was only solution I was conceptualizing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

git checkout master -- h5py/_hl/dataset.py and swapping that does indeed make it pass for the additional test case.

This is good because I was uncomfortable with the slowness I introduced in the common case.

@Delengowski
Copy link
Contributor Author

@takluyver
Copy link
Member

Thanks! Yes, I think the failures are where it's trying to test with numpy 2, which is in beta now, so nothing to do with this. #2386 is working on that.

@takluyver
Copy link
Member

Close/reopen to re-run tests with that PR merged.

@takluyver takluyver closed this Mar 22, 2024
@takluyver takluyver reopened this Mar 22, 2024
@takluyver takluyver merged commit 060a0d0 into h5py:master Mar 24, 2024
42 checks passed
@takluyver takluyver added this to the 3.11 milestone Apr 5, 2024
@takluyver takluyver added the bug label Apr 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants