Unusual vlen type Attribute can be created but not read. #1817
Hi @frejanordsiek, the good news is that I think I've worked out how to fix this - see PR #1819. The bad news is that I can't really find a workaround to get this data in any reasonable way for h5py 3.0 and 3.1. I've managed to guess and hack together a way that seems to get the data if you're really desperate, but I can't recommend using this in any serious code. There's a potential for segfaults if it goes wrong, and it may leak memory even if it works.

Hack to read a vlen array of strings:

```python
import struct, ctypes
import numpy as np

attr = f.attrs.get_id('test')
storage_size = attr.get_storage_size()  # Should be 16 * number of entries
index_buf = np.zeros(storage_size, dtype=np.uint8)
attr.read(index_buf, mtype=attr.get_type())  # Read with no conversion
# index_buf now contains lengths & pointers to the real data

# Can we dereference a pointer in Python?
length, ptr = struct.unpack('<QQ', index_buf[:16])
data_buf = ctypes.create_string_buffer(length)
ctypes.memmove(data_buf, ptr, length)
res = np.frombuffer(data_buf, dtype='S1')
# res is now array([b'a', b'b'], dtype='|S1')
```

I'll reiterate: this is a terrible hack and not suitable for serious use. It was interesting to work out, and maybe it can be useful in very specific circumstances, but I wouldn't rely on it. If your data type is vlen arrays of length-1 strings, is it practical to use vlen strings instead? I think these are logically equivalent, but they're better tested in h5py.
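The pointer-chasing part of the hack can be exercised without HDF5 at all. Here is a sketch that fabricates one 16-byte {length, pointer} entry in the same 64-bit little-endian layout assumed above (the payload bytes are made up for illustration) and recovers it through the identical unpack/memmove steps:

```python
import ctypes
import struct

import numpy as np

# Fake the vlen payload that HDF5 would normally own, and pack a
# {length, pointer} entry the way one appears in index_buf above.
payload = ctypes.create_string_buffer(b"ab", 2)
entry = struct.pack('<QQ', 2, ctypes.addressof(payload))

# The same recovery steps as the hack: unpack, then copy from the address.
length, ptr = struct.unpack('<QQ', entry)
data_buf = ctypes.create_string_buffer(length)
ctypes.memmove(data_buf, ptr, length)
res = np.frombuffer(data_buf, dtype='S1')
# res == array([b'a', b'b'], dtype='|S1')
```
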
P.S. The low-level API should be stable and usable - it is part of h5py's API, and it is documented. It's not as convenient as the high-level API, of course, but you are welcome to use it if you want to. Unfortunately the bug here is at an even lower level, so it doesn't help much. This doc page is a good starting point for the low-level API: https://docs.h5py.org/en/stable/high/lowlevel.html
I've also hit this bug in our project's HDF5 format. We write and read an array of compound datasets, each element of which includes a variable-length string constructed with the standard way to define variable-length strings in HDF5:

```c
hid_t datatype = H5Tcopy(H5T_C_S1);
H5Tset_size(datatype, H5T_VARIABLE);
H5Tset_strpad(datatype, H5T_STR_NULLTERM);
```

Thanks for looking into this @takluyver!
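For anyone following along from Python, the same type can be built through h5py's low-level `h5t` bindings. A sketch mirroring the three C calls (treat the exact names as best-effort against the low-level docs rather than gospel):

```python
# Mirror H5Tcopy / H5Tset_size / H5Tset_strpad via h5py's low-level API.
from h5py import h5t

tid = h5t.C_S1.copy()             # H5Tcopy(H5T_C_S1)
tid.set_size(h5t.VARIABLE)        # H5Tset_size(datatype, H5T_VARIABLE)
tid.set_strpad(h5t.STR_NULLTERM)  # H5Tset_strpad(datatype, H5T_STR_NULLTERM)
# tid.is_variable_str() now reports True
```
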
@sethrj this specific issue shouldn't affect vlen strings (it's in the conversion pathway for vlen arrays other than strings). Could you try with #1819 and see if your issue is also fixed or if there's something else to fix? Docs on installing h5py from source: https://docs.h5py.org/en/stable/build.html#source-installation
Our CI also builds wheels, which offer another option for testing out a PR. You can find them from the CI runs for PR #1819. Download the archive for the platform you want, unpack, and install the wheel with pip.
Oh, ok, the data structure in question has another non-string vlen array, each element of which is a compound datatype 😬 Effectively in C++ it looks like:

so it's not surprising that this one is a troublemaker. I don't think I have time to test today, but I will try. 😞
😨 That's an exciting extra level of complexity, but it is plausible that you're hitting this same issue, then. From my investigation earlier today, HDF5 is not particularly efficient at storing or accessing structured data like this. Each vlen field (string or array) basically has a pointer inside the file to where it's stored, so it will have to do a separate little read for each element. There's extra inefficiency for vlen arrays in h5py, because we have the overhead of creating lots of little numpy arrays.
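The "lots of little numpy arrays" overhead is easy to see in isolation. A rough illustration using plain numpy (the exact byte counts are implementation details and vary by platform):

```python
import sys

import numpy as np

n = 1000

# One contiguous array: n bytes of payload, one header.
flat = np.zeros(n, dtype=np.uint8)

# n one-byte arrays, the shape vlen-array reads come back in: each element
# carries a full ndarray object on top of its single byte of payload.
little = np.empty(n, dtype=object)
for i in range(n):
    little[i] = np.zeros(1, dtype=np.uint8)

# The per-element object overhead dwarfs the actual data.
assert flat.nbytes == n
assert sys.getsizeof(little[0]) > 50
```
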
@takluyver Thank you for the fix and the workaround. Looking at it, I can see what you mean about it having segfault and memory leak risks. I will have to think about implementing that, or just having the package offer less functionality on h5py versions that can't read the Attribute correctly. As for the Attribute's type, it wasn't my choice to format it like this. My package provides compatibility with an HDF5 schema made by software from another party (in case anyone is interested, it is the
@takluyver I managed to make a slightly improved version of the code you suggested for reading it more manually,

which outputs

It does a quick check that the size makes sense and doesn't use

There are a couple of things I worry about, though. This works quite well on a 64-bit little-endian system (I've incorporated this into my package and run the unit tests many times with no issues), but I wonder if the type for
I decided to dive into the official HDF5 example code for a vlen Attribute (https://bitbucket.hdfgroup.org/projects/HDFFV/repos/hdf5-examples/browse/1_10/C/H5T/h5ex_t_vlenatt.c) and the header

which should work on any endianness and bit-depth of pointers supported by numpy and HDF5.
Nice! I don't think you need the phil lock, because the low-level functions should also acquire it. I found a way to dereference a pointer in ctypes without an intermediate copy:

```python
carr = (ctypes.c_char * length).from_address(ptr)
np.frombuffer(carr, dtype='S1')
```

I'd still be hesitant to use this technique seriously, but that's up to you!
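Worth noting about this trick: `from_address` plus `frombuffer` gives a view over memory something else still owns, not a copy. A standalone illustration, with a local buffer standing in for the HDF5-owned pointer:

```python
import ctypes

import numpy as np

buf = ctypes.create_string_buffer(b"ab", 2)  # stand-in for HDF5-owned memory
carr = (ctypes.c_char * 2).from_address(ctypes.addressof(buf))
view = np.frombuffer(carr, dtype='S1')

# Mutating the underlying memory changes the "array" too...
buf[0] = b'z'
# view is now [b'z', b'b']

# ...so take a copy if the result must outlive the buffer (or the file).
safe = view.copy()
buf[1] = b'z'
# safe still holds [b'z', b'b']; view now sees [b'z', b'z']
```
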
Thanks for the suggested improvement. It seems to work, though I am putting a copy operation after the frombuffer because I want the ndarray to still be valid after the file is closed and everything is cleaned up.

About the phil lock, I worry about

I am hesitant about the technique as well, to be honest. Seeing as it is what one would do in C code based on the HDF Group's example code, I am a bit less worried. Still, it is indeed risky, but it is necessary for the package to support the two most recent h5py releases with no functionality loss, since losing functionality can surprise users and makes the unit tests harder.
Reopening as this probably shouldn't have been closed yet...
If you explicitly close a file object from the high-level API, that will invalidate all objects belonging to that file. Closing a file via the low-level API can also do this, but it's not the default. As far as I know, that's the only thing that would break an attribute reference you're holding (short of code actively trying to break stuff). I'd like to aim for a 3.2 release soon, so hopefully you won't need this for too long.

@aragilar AFAIK, we did fix the bug (in #1819), and that's when I'd normally close the issue for it. 🙂
Ah, cool, I thought it was only a partial fix, I'll close this again. |
No need to re-open this again, but I made a slightly improved version for anyone else reading this who needs the workaround. This version avoids making any intermediate copies and instead allocates the
Thanks again. |
Thanks Freja. I suspect there might still be a memory leak, because I don't think anything frees the data you use as the source for
I've just released 3.2, which should fix this. |
Thank you. The release fixed it. By the way, I found out from someone that the workaround I posted earlier segfaults on 32-bit systems. I think it boiled down to the difference between the pointer size in the file and on the system not being taken into account. Anyhow, the following version doesn't segfault on 32-bit or 64-bit little-endian (who knows about big-endian) and also doesn't read the size_t as an intp.
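One way to avoid baking in the 8-byte fields of `'<QQ'` is to size the record from the platform itself. A sketch, assuming hvl_t's native `{size_t len; void *p;}` layout and that `size_t` matches the pointer width (true on common platforms, not guaranteed everywhere):

```python
import ctypes

import numpy as np

# np.uintp is an unsigned integer wide enough to hold a pointer, so the
# record size tracks the platform: 8 bytes on 32-bit, 16 bytes on 64-bit.
hvl_dtype = np.dtype([('len', np.uintp), ('ptr', np.uintp)])

assert hvl_dtype.itemsize == 2 * ctypes.sizeof(ctypes.c_void_p)
```
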
First, I've done my testing in the following setup
However, other people using the package I wrote that has the unusual Attribute have been reporting the problem in h5py 3.0.0. I've tested the same code with 2.10.0 just earlier today with no problem and have been using the Attribute successfully in every h5py 2.x.y version since 2.3 (first one where this Attribute was even possible to create, since 2.2 did not support writing it).
The Attribute is an array of vlens of type numpy.dtype('S1').
The following code makes and writes the Attribute, but then when it reads it afterwards gets an error
with the following output
Similar results are obtained trying to read it with `f.attrs.get('test')`, `f.attrs.items()`, and `f.attrs.values()`. I also can't read it at a lower level by
and get the same error (probably because those other methods do this under the hood).
For reference, I ran `h5dump data.h5` on the file and got the following result.

I do not know if this is a bug, an intentional feature drop, or something else. If it is a bug and is fixed in the next release of h5py, I still need a workaround to read this in the versions that have the problem, since I need to support all h5py versions from 2.3 to present with full functionality. It is almost surely readable using the low-level libhdf5 bindings in h5py, but to be honest I am not sure where to start or how stable those bindings are. How would one read this kind of Attribute in h5py 3.x in its present state?
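The setup described above can be sketched roughly as follows. This is a reconstruction, not the original snippet: the attribute name 'test' and the vlen-of-'S1' dtype are taken from this thread, and the in-memory core driver is only there to keep the sketch self-contained. On h5py 3.0/3.1 the final read raises the error discussed here; on 2.x and 3.2+ it succeeds:

```python
import h5py
import numpy as np

# vlen (variable-length) arrays whose elements are 1-byte strings.
dt = h5py.vlen_dtype(np.dtype('S1'))

with h5py.File('data.h5', 'w', driver='core', backing_store=False) as f:
    value = np.empty(1, dtype=dt)
    value[0] = np.array([b'a', b'b'], dtype='S1')
    f.attrs.create('test', value, dtype=dt)
    read_back = f.attrs['test']  # raises on h5py 3.0/3.1
```
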