Exposing the attrs size limitation solution to h5py.Group API #2311
Conversation
Codecov Report — patch coverage, additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           master    #2311      +/-   ##
==========================================
- Coverage   89.82%   89.72%   -0.10%
==========================================
  Files          17       17
  Lines        2397     2403       +6
==========================================
+ Hits         2153     2156       +3
- Misses        244      247       +3
```

☔ View full report in Codecov by Sentry.
I understand that downstream projects like Firedrake may need to just work around this issue, but within h5py I don't think this is the right approach. We already have the pieces in the low-level API (#1638) to work around this, and it's not a regression, so I don't think it's urgent to do anything right away. From the investigation in the Firedrake PR, it seems like there's a bug in HDF5 where the call to
Thanks for your reply @takluyver !
Thanks for your reply @takluyver !
Yes, I could imagine. As a workaround I will use a custom create function for groups using the low-level API in my current work, scverse/anndata#874. But did I get this right: we can't apply this fix after the Group is created? Or we can't fix this at the AttributeManager level? Assuming that's the case, I just wanted to put this PR here in case we didn't have such a PR yet because nobody had cared to open one :D. I understand the hesitation to add this option as an argument to the h5py.Group creator, as it would be a big change.
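For reference, the low-level workaround mentioned above can be sketched roughly as follows. This is not an official h5py API: the helper name `create_dense_group` is invented for this example, and it combines `set_attr_phase_change` (exposed in #1638) with attribute creation-order tracking, which later comments in this thread observe is needed in practice.

```python
# Rough sketch (assumed helper, not h5py's public API): create a group whose
# attributes go straight into dense storage, so individual attributes can
# exceed the ~64 KiB object header limit.
import h5py
import numpy as np

def create_dense_group(parent, name):
    # Group-creation property list
    gcpl = h5py.h5p.create(h5py.h5p.GROUP_CREATE)
    # Force dense attribute storage from the very first attribute
    gcpl.set_attr_phase_change(0, 0)
    # In practice (per this thread), creation-order tracking also needs
    # to be enabled for large attributes to work reliably
    gcpl.set_attr_creation_order(
        h5py.h5p.CRT_ORDER_TRACKED | h5py.h5p.CRT_ORDER_INDEXED)
    gid = h5py.h5g.create(parent.id, name.encode('utf-8'), gcpl=gcpl)
    return h5py.Group(gid)

with h5py.File('dense-attrs.h5', 'w') as f:
    grp = create_dense_group(f, 'g')
    # ~400 KiB attribute, well past the 64 KiB compact-storage limit
    grp.attrs['big'] = np.arange(100_000, dtype=np.uint32)
```

The same property-list calls should apply to dataset creation via `h5py.h5p.DATASET_CREATE`.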
Yes, I believe this needs to be done when creating the object (group or dataset) to which you're going to attach attributes; I don't think there's a way to do it later (but HDF5 is a complex thing and it changes over time - it's possible there's something I don't know). I've just realised that, now we've avoided the other bug that was affecting the Firedrake tests ("record is not in B-tree", see #2274), creating objects with
I could have sworn the first time I saw size limitation errors was because we were using
@selmanozleyen Thank you for proposing this solution. I need this for my project, which uses a dataset with over 14,000 columns. I see you've added the
Have you tried the
This PR is also just a slightly different variant of the workaround, not really solving the underlying bug, which is probably in HDF5. Since I believe we can already do the same thing with
We could also look at setting
@takluyver The low-level h5p module has the ability to configure HDF5 to allow attributes beyond the 64 KiB header limit, correct? And isn't the h5py module a wrapper over the low-level API? So couldn't we just update h5py to use the solution in the low-level API?
There's a confusing set of things going on here. At one point we thought that the call to
I've just tried the code below, and I can create 4 MB attributes on both groups and datasets by setting:

```python
import h5py
import numpy as np

N = 1_000_000

f = h5py.File('large-attr.h5', 'w')
grp = f.create_group('g', track_order=True)
for i in range(10):
    grp.attrs[f'a{i}'] = np.arange(N, dtype=np.uint32)

dset = grp.create_dataset('ds', shape=(1,), dtype=np.uint32, track_order=True)
for i in range(10):
    dset.attrs[f'a{i}'] = np.arange(N, dtype=np.uint32)
```
@takluyver Thank you for the explanation. It appears that the change needed is really a documentation change rather than a code change, and that the issue may lie with the HDF5 format itself rather than with h5py.
Good idea! I've opened PR #2390 to add this to the docs. |
I can't seem to find any old case where we had problems with
I thought the docs on dense attribute storage had some useful info about possible downsides (under "Special Issues"). Notably, there's not much about ordering.
They mention that large attributes may not work for older HDF5 versions, but since this was added in HDF5 1.8, at this point it's pretty safe to rely on them. They also point out that attributes can't be compressed, unlike datasets, which is still valid. But you always have to explicitly enable compression in HDF5, so if you know you want it, you'll quickly find out that you can't have it with attributes.
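To illustrate that trade-off: when data is large and compressible, one alternative to a big attribute is storing it as a compressed dataset. A minimal sketch (the file and dataset names here are invented for this example):

```python
# Sketch: attributes can't be compressed, but datasets can, so large
# compressible metadata can live in a dataset instead of an attribute.
import h5py
import numpy as np

data = np.zeros(1_000_000, dtype=np.uint32)  # highly compressible

with h5py.File('compressed-meta.h5', 'w') as f:
    # The gzip filter is built into HDF5; requesting compression makes
    # h5py use chunked storage automatically.
    ds = f.create_dataset('large_meta', data=data, compression='gzip')
    # Small descriptive attributes are still fine
    ds.attrs['description'] = 'stored as a dataset, not an attribute'
```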
No, in theory you can specify that attributes should be in dense storage, allowing larger attributes, without turning on order tracking (which I assume uses a little bit of extra space). In practice, what we observe is that you need to turn order tracking on, but that's a workaround.
Hi,
This is for #1053. As @aragilar says in #1053 (comment), we still need to expose a more Pythonic version to solve this. However, from reading that merge I don't see how this can be solved in `attrs.py`; it seems we would need to add this as an option when creating the group. The merge firedrakeproject/firedrake#2432 does that by creating a wrapper Group class. For this reason I decided to start this PR, as having an option to fix this seems crucial in some applications (see scverse/anndata#874).
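The wrapper approach referenced above could look roughly like this. This is only an illustrative sketch using h5py's public API, not the Firedrake code; the class name is invented, and it leans on `track_order=True`, the workaround discussed in this thread, rather than on a new option.

```python
# Hypothetical wrapper: every group is created with track_order=True,
# which (per this thread) makes HDF5 use dense attribute storage and
# so allows attributes larger than the ~64 KiB header limit.
import h5py
import numpy as np

class TrackedGroupFile(h5py.File):
    """h5py.File subclass whose groups default to track_order=True."""
    def create_group(self, name, track_order=True):
        return super().create_group(name, track_order=track_order)

with TrackedGroupFile('wrapped.h5', 'w') as f:
    g = f.create_group('g')
    g.attrs['big'] = np.arange(100_000, dtype=np.uint32)  # ~400 KiB
```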
Remaining tasks:

- [ ] Run `tox -e pre-commit`
- [ ] Run `tox -e py37-test-deps`
- [ ] Add a note in the `news/` folder.
ping: @ivirshup