
Exposing the attrs size limitation solution to h5py.Group API #2311

Closed
wants to merge 1 commit

Conversation

selmanozleyen

@selmanozleyen commented Sep 17, 2023

Hi,

This is for #1053. As @aragilar says #1053 (comment):

I'm going to tag this as a good first issue, the patch(es) will need to add H5Pset_attr_phase_change to https://github.com/h5py/h5py/blob/master/h5py/api_functions.txt, add a wrapper function to https://github.com/h5py/h5py/blob/master/h5py/h5a.pyx, and then expose a pythonic version via https://github.com/h5py/h5py/blob/master/h5py/_hl/attrs.py (you don't need to do everything in one PR).

We still need to expose a more pythonic version to solve this. However, from reading that thread I don't see how this can be solved in attrs.py; it seems we would need to add this as an option when creating the group. This PR, firedrakeproject/firedrake#2432, does that by creating a wrapper Group class.

For this reason I decided to open this PR, as having an option to fix this seems crucial in some applications (see scverse/anndata#874).

Remaining Tasks:

  • Run simple static checks with tox -e pre-commit
  • Run the tests with e.g. tox -e py37-test-deps
  • If your change is visible to someone using or building h5py, add a release
    note in the news/ folder.

ping: @ivirshup

@codecov

codecov bot commented Sep 17, 2023

Codecov Report

Patch coverage: 62.50% and project coverage change: -0.10% ⚠️

Comparison is base (8bed19b) 89.82% compared to head (7ee9db0) 89.72%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2311      +/-   ##
==========================================
- Coverage   89.82%   89.72%   -0.10%     
==========================================
  Files          17       17              
  Lines        2397     2403       +6     
==========================================
+ Hits         2153     2156       +3     
- Misses        244      247       +3     
Files Changed Coverage Δ
h5py/_hl/group.py 95.60% <62.50%> (-1.03%) ⬇️


@takluyver
Member

I understand that downstream projects like Firedrake may need to just work around this issue, but within h5py I don't think this is the right approach. We already have the pieces in the low-level API (#1638) to work around this, and it's not a regression, so I don't think it's urgent to do anything right away.

From the investigation in the Firedrake PR, it seems like there's a bug in HDF5 where the call to H5Pset_attr_phase_change() is ignored unless you also call H5Pset_attr_creation_order(). So I think the next steps are to work out if that's still the case with current HDF5 (e.g. by building h5py against a recent version of HDF5), and if so whether it can be fixed in HDF5.
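For context, the low-level pieces referred to here can be combined by hand. The sketch below (filename and sizes are arbitrary; this is an illustration, not the API proposed in this PR) builds a group creation property list that requests dense attribute storage via set_attr_phase_change, and also enables creation-order tracking, since that is what the Firedrake investigation found to be necessary for the phase-change setting to take effect:

```python
import h5py
import numpy as np

with h5py.File('large-attr-lowlevel.h5', 'w') as f:
    # Hand-built group creation property list via the low-level API.
    gcpl = h5py.h5p.create(h5py.h5p.GROUP_CREATE)
    # Send all attributes straight to dense storage
    # (max_compact=0, min_dense=0), bypassing the ~64 KiB header limit.
    gcpl.set_attr_phase_change(0, 0)
    # Empirically required for the call above to take effect
    # (see the Firedrake investigation): track creation order too.
    gcpl.set_attr_creation_order(h5py.h5p.CRT_ORDER_TRACKED)

    gid = h5py.h5g.create(f.id, b'g', gcpl=gcpl)
    grp = h5py.Group(gid)
    # ~400 kB attribute, well over the compact-storage limit.
    grp.attrs['big'] = np.arange(100_000, dtype=np.uint32)
```

This is essentially the workaround downstream projects use today; the question in this PR is whether (and how) to surface it in the high-level API.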

@selmanozleyen
Author

selmanozleyen commented Sep 25, 2023

Thanks for your reply @takluyver !

I understand that downstream projects like Firedrake may need to just work around this issue, but within h5py I don't think this is the right approach.

Yes, I can imagine. As a workaround I will use a custom create function for groups, using the low-level API, in my current work (scverse/anndata#874). But did I understand this correctly: we can't apply this fix after the Group is created? Or is it that we can't fix it at the AttributeManager level?

Assuming that was the case, I just wanted to put this PR here in case we didn't have such a PR simply because nobody had gotten to it yet :D. I understand the hesitation to add this option as an argument to the h5py.Group creator, as it would be a big change.

@takluyver
Member

Yes, I believe this needs to be done when creating the object (group or dataset) to which you're going to attach attributes; I don't think there's a way to do it later (but HDF5 is a complex thing and it changes over time - it's possible there's something I don't know).

I've just realised that now we've avoided the other bug that was affecting the Firedrake tests ("record is not in B-tree", see #2274), creating objects with track_order=True seems to be enough to work around the "object header message too large" error, at least in h5py 3.9 with HDF5 1.12.2. I don't understand why that works, and it's possible it will break again, but seemingly it allows an easy workaround.

@ivirshup

ivirshup commented Oct 2, 2023

I don't understand why that works, and it's possible it will break again, but seemingly it allows an easy workaround.

I could have sworn that the first time I saw size-limitation errors it was because we were using track_order=True. I think track_order=True is actually the reason we're writing this set of info to our attributes (column names in a dataframe) instead of putting it elsewhere. I'll check if I can dig this up...

@erikhuck

@selmanozleyen Thank you for proposing this solution. I need this for my project which uses a dataset with over 14,000 columns. I see you've added the attrs_limit parameter to the create_group method. Do you think it'd be possible to have that same parameter as part of the create_dataset method as well? Or is create_dataset deprecated at this point?

@takluyver
Member

takluyver commented Feb 27, 2024

Have you tried the track_order=True workaround? If that's not working for you, can you check your h5py & HDF5 versions (print(h5py.version.info)), because I thought this workaround was sufficient with recent versions of h5py.

This PR is also just a slightly different variant of the workaround, not really solving the underlying bug, which is probably in HDF5. Since I believe we can already do the same thing with track_order, I'm not inclined to add another option. It's longstanding enough that we could just document 'if you run into this error when creating attributes, try setting track_order=True'.

We could also look at setting track_order=True as the default, working around this unless you deliberately turn it off. But before doing that, I'd want someone to collect some info on the impacts of it - e.g. file size and write performance - and try to understand why HDF5 doesn't track creation order by default (as in Chesterton's fence).

@erikhuck

@takluyver The low-level h5p module has the ability to configure HDF5 to allow attributes beyond the ~64 KiB object header limit, correct? And isn't the h5py module a wrapper over the low-level API? So couldn't we just update h5py to use the solution from the low-level API?

@takluyver
Member

There's a confusing set of things going on here. At one point we thought that the call to set_attr_phase_change was necessary to store large attributes, but it seemed to be ignored unless you also enabled attribute order tracking. Then it turned out that enabling order tracking is both necessary and sufficient - you don't need the call to set_attr_phase_change at all.

I've just tried the code below, and I can create 4 MB attributes on both groups and datasets by setting track_order=True when creating them. If I don't pass track_order, then I get the 'object header message too large' error.

import h5py
import numpy as np

N = 1_000_000

f = h5py.File('large-attr.h5', 'w')

grp = f.create_group('g', track_order=True)
for i in range(10):
    grp.attrs[f'a{i}'] = np.arange(N, dtype=np.uint32)

dset = grp.create_dataset('ds', shape=(1,), dtype=np.uint32, track_order=True)
for i in range(10):
    dset.attrs[f'a{i}'] = np.arange(N, dtype=np.uint32)

@erikhuck

erikhuck commented Mar 1, 2024

@takluyver Thank you for the explanation. It appears that the change needed is really a documentation change rather than a code change, and that the underlying issue may be with the HDF5 library itself rather than with h5py.

@takluyver
Member

Good idea! I've opened PR #2390 to add this to the docs.

@ivirshup

ivirshup commented Mar 4, 2024

I can't seem to find any old case where we had problems with track_order, so it's very possible I'm completely misremembering.

I thought the docs on dense attribute storage had some useful info about possible downsides (under "Special Issues"):

Notably, not much about ordering.

@takluyver
Member

I thought the docs on dense attribute storage had some useful info about possible downsides (under "Special Issues"):

They mention that large attributes may not work with older HDF5 versions, but since this was added in 1.8, at this point it's pretty safe to rely on them. They also point out that attributes, unlike datasets, can't be compressed; that's still valid, but compression in HDF5 always has to be explicitly enabled, so if you know you want it, you'll quickly find out that you can't have it with attributes.

Notably, not much about ordering.

No. In theory you can specify that attributes should go to dense storage, allowing larger attributes, without turning on order tracking (which I assume uses a little bit of extra space). In practice, what we observe is that you need to turn order tracking on as well, but that's a workaround.
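A quick way to check whether this is still the case on a given HDF5 build is to request dense storage without order tracking and see whether a large attribute goes through. The sketch below (filename and sizes are arbitrary) reports which behaviour is observed rather than asserting one:

```python
import h5py
import numpy as np

# Group creation property list with dense attribute storage forced,
# but creation-order tracking deliberately left OFF.
gcpl = h5py.h5p.create(h5py.h5p.GROUP_CREATE)
gcpl.set_attr_phase_change(0, 0)

with h5py.File('dense-no-order.h5', 'w') as f:
    grp = h5py.Group(h5py.h5g.create(f.id, b'g', gcpl=gcpl))
    try:
        grp.attrs['big'] = np.arange(1_000_000, dtype=np.uint32)  # 4 MB
        print('worked: set_attr_phase_change alone was enough here')
    except (OSError, RuntimeError) as e:
        print('failed, as observed in the Firedrake PR:', e)
```

If this fails on current HDF5 as well, that would support reporting the behaviour upstream as an HDF5 bug.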
