
Implement MultiManager and low level H5Dread_multi call #2351

Open · wants to merge 14 commits into base: master

Conversation

@mattjala commented Dec 5, 2023

This implements the function read_multi in h5d.pyx to call HDF5's H5Dread_multi. Because it is meant to operate on multiple datasets at once, I figured it made more sense not to implement it as a method on DatasetID.

To avoid duplicating type-conversion handling, I modified dset_rw to handle both the single- and multi-dataset cases.

After review, I plan to implement a similar function for H5Dwrite_multi, and then high-level calls for both, perhaps through a MultiManager class similar to AttributeManager (see the sketch below).

Addresses #2298
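
To make the MultiManager idea concrete, here is a hypothetical usage sketch, loosely modeled on how AttributeManager wraps a group of related operations. The class name, constructor, and indexing behaviour are illustrative assumptions, not the API implemented in this PR:

import numpy as np
import h5py

# Hypothetical sketch only: assumes a MultiManager that wraps several
# datasets and reads them in one HDF5 call. Names and call shape are
# illustrative, not the actual API from this PR.
with h5py.File("example.h5", "w") as f:
    d1 = f.create_dataset("a", data=np.arange(10))
    d2 = f.create_dataset("b", data=np.arange(10) * 2)

    mm = h5py.MultiManager([d1, d2])   # assumed constructor
    results = mm[...]                  # one H5Dread_multi call underneath
    # results would be a list of arrays, one per dataset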

h5py/h5d.pyx (outdated)

IF HDF5_VERSION >= (1, 14, 0):
    @with_phil
    def read_multi(count, list dataset_ids, list mspace_ids, list fspace_ids,
                   list type_ids, list bufs not None, PropID dxpl=None):
@mattjala (Author) commented:

These parameters are lists instead of numpy arrays to avoid an issue with the destination buffers.

Specifically, the destination buffer for each dataset read should be a numpy array. If bufs were itself a numpy array containing those buffers, constructing it would copy their data instead of referencing the same memory, so the original destination arrays the user created would never be populated.
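
A quick numpy demonstration of the copying behaviour described above (a standalone illustration, not code from the PR):

import numpy as np

a = np.zeros(3)
b = np.zeros(3)

# Packing the buffers into a numpy array copies their contents into new
# storage, so writes to the packed array never reach the originals.
packed = np.array([a, b])
packed[0, :] = 1
print(a)          # [0. 0. 0.] -- the user's array was not populated

# A plain list keeps references to the same memory, so writes go through.
bufs = [a, b]
bufs[0][:] = 1
print(a)          # [1. 1. 1.]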

Member replied:

Thanks, that makes sense.

The Python function probably shouldn't take count as the first parameter: it's necessary in C to tell you how many items you can read after a pointer, but Python lists already know their length.

But if that's the case, I'd probably do a check up front that the lists are all the same length, and fail with a clear error if not.
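
A minimal sketch of that up-front check (hypothetical helper name, written in plain Python rather than the PR's Cython):

def _check_multi_lengths(dataset_ids, mspace_ids, fspace_ids, type_ids, bufs):
    """Derive count from the list lengths and fail clearly on any mismatch."""
    count = len(dataset_ids)
    for name, seq in (("mspace_ids", mspace_ids), ("fspace_ids", fspace_ids),
                      ("type_ids", type_ids), ("bufs", bufs)):
        if len(seq) != count:
            raise ValueError(
                f"All argument lists must have the same length: "
                f"len(dataset_ids) == {count} but len({name}) == {len(seq)}")
    return count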

codecov bot commented Dec 11, 2023

Codecov Report

Attention: Patch coverage is 92.77778%, with 13 lines in your changes missing coverage. Please review.

Project coverage is 89.80%. Comparing base (4c01efa) to head (8ba2be3).
Report is 65 commits behind head on master.

Files                 Patch %   Lines
h5py/_hl/dataset.py   92.73%    13 Missing ⚠️
@@            Coverage Diff             @@
##           master    #2351      +/-   ##
==========================================
+ Coverage   89.53%   89.80%   +0.26%     
==========================================
  Files          17       17              
  Lines        2380     2570     +190     
==========================================
+ Hits         2131     2308     +177     
- Misses        249      262      +13     


@takluyver (Member) commented:

As dset_rw is kind of critical to h5py, and this introduces extra mallocs in there (i.e. extra places that might fail or slow things down), I wonder if it would make sense to write a separate dset_rw_multi. Even if that means some code duplication, it should mean the addition doesn't make the existing APIs any worse.

dataset = filehandle.create_dataset("data", shape,
                                    dtype=dt, data=data_in)

self.assertTrue(numpy.array_equal(data_in, dataset[...]))
Member commented:

Suggested change:
-    self.assertTrue(numpy.array_equal(data_in, dataset[...]))
+    numpy.testing.assert_array_equal(data_in, dataset[...])

And throughout. Specific assert methods like this can give you more useful information if the thing they're checking fails, whereas assertTrue can only tell you it was not true. E.g.

AssertionError: 
Arrays are not equal

(shapes (5,), (6,) mismatch)
 x: array([0, 1, 2, 3, 4])
 y: array([0, 1, 2, 3, 4, 5])


@ut.skipIf(h5py.version.hdf5_version_tuple < (1, 14, 0),
           "read_multi requires HDF5 >= 1.14.0")
class TestReadMulti(TestCase):
Member commented:

This is in the file test_h5d_direct_chunk.py, the rest of which is all about the direct chunk read/write methods. I think this file is small enough that we may as well just rename it to test_h5d.py.

@@ -151,6 +151,63 @@ def open(ObjectID loc not None, char* name, PropID dapl=None):
    """
    return DatasetID(H5Dopen(loc.id, name, pdefault(dapl)))

IF HDF5_VERSION >= (1, 14, 0):
    @with_phil
    def rw_multi(list dataset_ids, list mspace_ids, list fspace_ids,
@mattjala (Author) commented:

This could be broken up into read_multi and write_multi for readability, but they would be identical except for one parameter passed to dset_rw_multi (see the sketch below).
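
For illustration, that split might look like the following thin wrappers. The names and the write flag are hypothetical, and this is plain Python rather than the PR's actual Cython code:

# Hypothetical sketch: read_multi/write_multi as thin wrappers around a
# shared implementation, differing only in the direction flag.
def read_multi(dataset_ids, mspace_ids, fspace_ids, type_ids, bufs, dxpl=None):
    return _dset_rw_multi(dataset_ids, mspace_ids, fspace_ids, type_ids,
                          bufs, dxpl, write=False)

def write_multi(dataset_ids, mspace_ids, fspace_ids, type_ids, bufs, dxpl=None):
    return _dset_rw_multi(dataset_ids, mspace_ids, fspace_ids, type_ids,
                          bufs, dxpl, write=True)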

@mattjala (Author) commented Jan 2, 2024

I have an implementation of a high-level MultiManager class that's ready for review. Would it be best to bring that into this PR, or make a separate request after this is merged?

@mattjala changed the title from "Implement low level H5Dread_multi call" to "Implement MultiManager and low level H5Dread_multi call" on Feb 15, 2024