
Implement MultiManager and low level H5Dread_multi call #2351

Open · wants to merge 14 commits into base: master

Conversation

@mattjala commented Dec 5, 2023

This implements the function read_multi in h5d.pyx to call HDF5's H5Dread_multi. Because it is meant to operate on multiple datasets at once, I figured it made more sense not to implement it as a method on DatasetID.

To avoid duplicating type-conversion handling, I modified dset_rw to handle both the single- and multi-dataset cases.

After review, I plan to implement a similar function for H5Dwrite_multi, and then high-level calls for both, perhaps through a MultiManager class similar to AttributeManager (see the sketch below).

Addresses #2298
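
To make the MultiManager idea concrete, here is a hypothetical usage sketch, loosely modeled on how AttributeManager wraps a group of related operations. The class name, constructor, and indexing behaviour are illustrative assumptions, not the API implemented in this PR:

import numpy as np
import h5py

# Hypothetical sketch only: assumes a MultiManager that wraps several
# datasets and reads them in one HDF5 call. Names and call shape are
# illustrative, not the actual API from this PR.
with h5py.File("example.h5", "w") as f:
    d1 = f.create_dataset("a", data=np.arange(10))
    d2 = f.create_dataset("b", data=np.arange(10) * 2)

    mm = h5py.MultiManager([d1, d2])   # assumed constructor
    results = mm[...]                  # one H5Dread_multi call underneath
    # results would be a list of arrays, one per dataset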

h5py/h5d.pyx (outdated)

IF HDF5_VERSION >= (1, 14, 0):
    @with_phil
    def read_multi(count, list dataset_ids, list mspace_ids, list fspace_ids,
                   list type_ids, list bufs not None, PropID dxpl=None):
@mattjala (Author) commented:

These parameters are lists instead of numpy arrays to avoid an issue with the destination buffers.

Specifically, the destination buffer for each dataset read should be a numpy array. If bufs were itself a numpy array containing those buffers, constructing it would copy their data instead of referencing the same memory, so the original destination arrays the user created would never be populated.
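
A quick numpy demonstration of the copying behaviour described above (a standalone illustration, not code from the PR):

import numpy as np

a = np.zeros(3)
b = np.zeros(3)

# Packing the buffers into a numpy array copies their contents into new
# storage, so writes to the packed array never reach the originals.
packed = np.array([a, b])
packed[0, :] = 1
print(a)          # [0. 0. 0.] -- the user's array was not populated

# A plain list keeps references to the same memory, so writes go through.
bufs = [a, b]
bufs[0][:] = 1
print(a)          # [1. 1. 1.]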

Member replied:

Thanks, that makes sense.

The Python function probably shouldn't take count as the first parameter: it's necessary in C to tell you how many items you can read after a pointer, but Python lists already know their length.

But if that's the case, I'd probably do a check up front that the lists are all the same length, and fail with a clear error if not.
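
A minimal sketch of that up-front check (hypothetical helper name, written in plain Python rather than the PR's Cython):

def _check_multi_lengths(dataset_ids, mspace_ids, fspace_ids, type_ids, bufs):
    """Derive count from the list lengths and fail clearly on any mismatch."""
    count = len(dataset_ids)
    for name, seq in (("mspace_ids", mspace_ids), ("fspace_ids", fspace_ids),
                      ("type_ids", type_ids), ("bufs", bufs)):
        if len(seq) != count:
            raise ValueError(
                f"All argument lists must have the same length: "
                f"len(dataset_ids) == {count} but len({name}) == {len(seq)}")
    return count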

codecov bot commented Dec 11, 2023

Codecov Report

Attention: Patch coverage is 92.77778%, with 13 lines in your changes missing coverage. Please review.

Project coverage is 89.80%. Comparing base (4c01efa) to head (8ba2be3).
Report is 65 commits behind head on master.

Files                 Patch %   Lines
h5py/_hl/dataset.py   92.73%    13 Missing ⚠️
@@            Coverage Diff             @@
##           master    #2351      +/-   ##
==========================================
+ Coverage   89.53%   89.80%   +0.26%     
==========================================
  Files          17       17              
  Lines        2380     2570     +190     
==========================================
+ Hits         2131     2308     +177     
- Misses        249      262      +13     


@takluyver (Member) commented:

As dset_rw is kind of critical to h5py, and this introduces extra mallocs in there (i.e. extra places that might fail or slow things down), I wonder if it would make sense to write a separate dset_rw_multi. Even if that means some code duplication, it should mean the addition doesn't make the existing APIs any worse.

dataset = filehandle.create_dataset("data", shape,
                                    dtype=dt, data=data_in)

self.assertTrue(numpy.array_equal(data_in, dataset[...]))
Member commented:

Suggested change:
-    self.assertTrue(numpy.array_equal(data_in, dataset[...]))
+    numpy.testing.assert_array_equal(data_in, dataset[...])

And throughout. Specific assert methods like this can give you more useful information if the thing they're checking fails, whereas assertTrue can only tell you it was not true. E.g.

AssertionError: 
Arrays are not equal

(shapes (5,), (6,) mismatch)
 x: array([0, 1, 2, 3, 4])
 y: array([0, 1, 2, 3, 4, 5])


@ut.skipIf(h5py.version.hdf5_version_tuple < (1, 14, 0),
           "read_multi requires HDF5 >= 1.14.0")
class TestReadMulti(TestCase):
Member commented:

This is in the file test_h5d_direct_chunk.py, the rest of which is all about the direct chunk read/write methods. I think this file is small enough that we may as well just rename it to test_h5d.py.

@@ -151,6 +151,63 @@ def open(ObjectID loc not None, char* name, PropID dapl=None):
    """
    return DatasetID(H5Dopen(loc.id, name, pdefault(dapl)))

IF HDF5_VERSION >= (1, 14, 0):
    @with_phil
    def rw_multi(list dataset_ids, list mspace_ids, list fspace_ids,
@mattjala (Author) commented:

This could be broken up into read_multi and write_multi for readability, but they would be identical except for one parameter passed to dset_rw_multi (see the sketch below).
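
For illustration, that split might look like the following thin wrappers. The names and the write flag are hypothetical, and this is plain Python rather than the PR's actual Cython code:

# Hypothetical sketch: read_multi/write_multi as thin wrappers around a
# shared implementation, differing only in the direction flag.
def read_multi(dataset_ids, mspace_ids, fspace_ids, type_ids, bufs, dxpl=None):
    return _dset_rw_multi(dataset_ids, mspace_ids, fspace_ids, type_ids,
                          bufs, dxpl, write=False)

def write_multi(dataset_ids, mspace_ids, fspace_ids, type_ids, bufs, dxpl=None):
    return _dset_rw_multi(dataset_ids, mspace_ids, fspace_ids, type_ids,
                          bufs, dxpl, write=True)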

@mattjala (Author) commented Jan 2, 2024

I have an implementation of a high-level MultiManager class that's ready for review. Would it be best to bring that into this PR, or make a separate request after this is merged?

@mattjala changed the title from "Implement low level H5Dread_multi call" to "Implement MultiManager and low level H5Dread_multi call" on Feb 15, 2024