Unusual vlen type Attribute can be created but not read. #1817

Closed
frejanordsiek opened this issue Feb 17, 2021 · 19 comments · Fixed by #1819

@frejanordsiek

First, I've done my testing in the following setup

h5py    3.1.0
HDF5    1.12.0
Python  3.9.1 (default, Jan 20 2021, 00:00:00) 
[GCC 10.2.1 20201125 (Red Hat 10.2.1-9)]
sys.platform    linux
sys.maxsize     9223372036854775807
numpy   1.20.0
cython (built with) 0.29.21
numpy (built against) 1.19.3
HDF5 (built against) 1.12.0

However, other people using the package I wrote that uses this unusual Attribute have been reporting the problem in h5py 3.0.0 as well. I tested the same code with 2.10.0 earlier today with no problem, and the Attribute has worked in every h5py 2.x.y version since 2.3 (the first version where this Attribute could even be created, since 2.2 did not support writing it).

The Attribute is an array of vlens of type numpy.dtype('S1').

The following code creates and writes the Attribute, but reading it back afterwards raises an error

>>> import numpy, h5py
>>> dt = h5py.vlen_dtype(numpy.dtype('S1'))
>>> a = numpy.empty((1, ), dtype=dt)
>>> a[0] = numpy.array([b'a', b'b'], dtype='S1')
>>> f = h5py.File('data.h5', mode='a')
>>> f.attrs.create('test', a)
>>> f.attrs['test']

with the following output

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-b22c92ac9bc4> in <module>
      5 f = h5py.File('data.h5', mode='a')
      6 f.attrs.create('test', a)
----> 7 f.attrs['test']

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

~/.local/lib/python3.9/site-packages/h5py/_hl/attrs.py in __getitem__(self, name)                                                                               
     75 
     76         arr = numpy.ndarray(shape, dtype=dtype, order='C')
---> 77         attr.read(arr, mtype=htype)
     78 
     79         string_info = h5t.check_string_dtype(dtype)

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5a.pyx in h5py.h5a.AttrID.read()

h5py/_proxy.pyx in h5py._proxy.attr_rw()

h5py/_conv.pyx in h5py._conv.vlen2ndarray()

h5py/_conv.pyx in h5py._conv.conv_vlen2ndarray()

ValueError: data type must provide an itemsize

Similar results are obtained trying to read it with f.attrs.get('test'), f.attrs.items(), and f.attrs.values().

I also can't read it at a lower level with

>>> id = f.attrs.get_id('test')
>>> b = numpy.zeros(id.shape, dtype=id.dtype)
>>> id.read(b)

and get the same error (probably because those other methods do this under the hood).

For reference, I ran h5dump data.h5 on the file and got the following result

HDF5 "data.h5" {
GROUP "/" {
   ATTRIBUTE "test" {
      DATATYPE  H5T_VLEN { H5T_STRING {
         STRSIZE 1;
         STRPAD H5T_STR_NULLPAD;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }}
      DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      DATA {
      (0): ("a", "b")
      }
   }
}
}

I do not know if this is a bug, an intentional feature drop, or something else. If it is a bug and is fixed in the next release of h5py, I still need a workaround to read this in the versions that have the problem, since I need to support all h5py versions from 2.3 to the present with full functionality. It is almost surely readable using the low-level libhdf5 bindings in h5py, but to be honest I am not sure where to start or how stable those bindings are. How would one read this kind of Attribute in h5py 3.x in its present state?

@takluyver
Member

takluyver commented Feb 18, 2021

Hi @frejanordsiek,

The good news is, I think I've worked out how to fix this - see PR #1819.

The bad news is that I can't really find a workaround to get this data in any reasonable way for h5py 3.0 and 3.1. I've managed to guess and hack together a way that seems to get the data if you're really desperate, but I can't recommend using this in any serious code. There's a potential for segfaults if it goes wrong, and it may leak memory even if it works.

Hack to read vlen array of strings
import struct, ctypes
import numpy as np

attr = f.attrs.get_id('test')
storage_size = attr.get_storage_size()  # Should be 16 * number of entries
index_buf = np.zeros(storage_size, dtype=np.uint8)
attr.read(index_buf, mtype=attr.get_type())  # Read with no conversion

# index_buf now contains lengths & pointers to the real data
# Can we dereference a pointer in Python?

length, ptr = struct.unpack('<QQ', index_buf[:16])
data_buf = ctypes.create_string_buffer(length)
ctypes.memmove(data_buf, ptr, length)
res = np.frombuffer(data_buf, dtype='S1')
# res is now array([b'a', b'b'], dtype='|S1')

I'll reiterate: this is a terrible hack and not suitable for serious use. It was interesting to work out, and maybe it can be useful in very specific circumstances, but I wouldn't rely on it.

If your data type is vlen arrays of length-1 strings, is it practical to use vlen strings instead? I think these are logically equivalent, but they're better tested in h5py.
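If it helps, here's an untested sketch of what that swap could look like, reusing the file f and imports from the snippet above (h5py.string_dtype is the vlen string dtype; the attribute name 'test_str' is just an example):

str_dt = h5py.string_dtype(encoding='ascii')   # variable-length string dtype
b = numpy.array([b'ab'], dtype=object)         # one vlen string instead of a vlen array of 'S1' items
f.attrs.create('test_str', b, dtype=str_dt)
f.attrs['test_str']   # reads back fine on 3.0/3.1; the exact str/bytes representation depends on the h5py version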

@takluyver takluyver added the bug label Feb 18, 2021
@takluyver takluyver added this to the 3.2 milestone Feb 18, 2021
@takluyver
Member

P.S.

the lowlevel libhdf5 bindings in h5py, but to be honest I am not sure where to start and how stable those bindings are

The low-level API should be stable and usable - it is part of h5py's API, and it is documented. It's not as convenient as the high-level API, of course, but you are welcome to use it if you want to. Unfortunately the bug here is in an even lower level, so it doesn't help much.

This doc page is a good starting point for the low-level API: https://docs.h5py.org/en/stable/high/lowlevel.html
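To give a flavour, inspecting an attribute through the low-level API looks roughly like this (these are existing low-level calls; it stops short of read(), which is where this bug lives):

attr = f.attrs.get_id('test')                        # h5py.h5a.AttrID
htype = attr.get_type()                              # h5py.h5t.TypeID for the stored datatype
dims = attr.get_space().get_simple_extent_dims()     # dataspace dimensions
nbytes = attr.get_storage_size()                     # raw storage size in bytes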

@sethrj
Contributor

sethrj commented Feb 18, 2021

I've also hit this bug in our project's HDF5 format. We write and read an array of compound datasets, each element of which includes a variable-length string constructed in the standard way for defining variable-length strings in HDF5:

    hid_t datatype = H5Tcopy(H5T_C_S1);
    H5Tset_size(datatype, H5T_VARIABLE);
    H5Tset_strpad(datatype, H5T_STR_NULLTERM);

Thanks for looking into this @takluyver !

@takluyver
Member

@sethrj this specific issue shouldn't affect vlen strings (it's in the conversion pathway for vlen arrays other than strings). Could you try with #1819 and see if your issue is also fixed or if there's something else to fix?

Docs on installing h5py from source: https://docs.h5py.org/en/stable/build.html#source-installation

@takluyver
Member

Our CI also builds wheels, which offer another option for testing out a PR. You can find them from PR #1819 here:

https://dev.azure.com/h5pyappveyor/h5py/_build/results?buildId=975&view=artifacts&pathAsName=false&type=publishedArtifacts

Download the archive for the platform you want, unpack, and pip install the wheel file for the appropriate Python version.

@sethrj
Contributor

sethrj commented Feb 18, 2021

Oh, ok, the data structure in question has another non-string varlen array, each element of which is a compound datatype 😬

Effectively in C++ it looks like:

struct Nasty
{
    int id;
    std::string  name;
    std::vector<std::pair<int, double>> values;
    bool flag;
    bool another_flag;
};

so it's not surprising that this one is a troublemaker. I don't think I have time to test today, but I will try. 😞

@takluyver
Member

😨 That's an exciting extra level of complexity, but it is plausible that you're hitting this same issue, then.

From my investigation earlier today, HDF5 is not particularly efficient at storing or accessing structured data like this. Each vlen field (string or array) basically has a pointer inside the file to where it's stored, so it will have to do a separate little read for each element. There's extra inefficiency for vlen arrays in h5py, because we have the overhead of creating lots of little numpy arrays.

@frejanordsiek
Author

@takluyver Thank you for the fix and the workaround. Looking at it, I can see what you mean about the segfault and memory-leak risks. I will have to think about whether to implement it or just have the package offer less functionality on h5py versions that can't read the Attribute correctly.

As for the Attribute's type, it wasn't my choice to format it like this. My package provides compatibility with an HDF5 schema made by software of another party (in case anyone is interested, it is the MATLAB_fields Attribute for structs in Matlab's v7.3 MAT file format).

@frejanordsiek
Author

@takluyver I managed to make a slightly improved version of the code you suggested for reading it more manually.

>>> import ctypes
>>> import numpy
>>> import h5py
>>> import h5py._objects
>>>
>>> dt = h5py.vlen_dtype(numpy.dtype('S1'))
>>> a = numpy.empty((2, ), dtype=dt)
>>> a[0] = numpy.array([b'a', b'b'], dtype='S1')
>>> a[1] = numpy.array([b'c', b'd', b'e'], dtype='S1')
>>> f = h5py.File('data.h5', mode='a')
>>> f.attrs.create('test', a)
>>>
>>> with h5py._objects.phil:
>>>     attr_id = f.attrs.get_id('test')
>>>     attr_size = attr_id.get_storage_size()
>>>     if attr_size % 16 != 0:
>>>         raise RuntimeError('Data size is not a multiple of 16 bytes.')
>>>     raw_buf = numpy.empty(attr_size // 8, dtype='<u8')
>>>     attr_id.read(raw_buf, mtype=attr_id.get_type())
>>>     attr = numpy.empty(len(raw_buf) // 2, dtype='object')
>>>     for i in range(len(attr)):
>>>         length = int(raw_buf[2 * i])
>>>         ptr = int(raw_buf[(2 * i) + 1])
>>>         buf = ctypes.create_string_buffer(length)
>>>         ctypes.memmove(buf, ptr, length)
>>>         attr[i] = numpy.frombuffer(buf, dtype='S1')
>>> attr

which outputs

array([array([b'a', b'b'], dtype='|S1'),
       array([b'c', b'd', b'e'], dtype='|S1')], dtype=object)

It does a quick check that the size makes sense and doesn't use struct.unpack as an intermediate step. Looking around inside the h5py code, including the code for reading Attributes, it looks like most functions need the with_phil wrapper from h5py._objects for thread safety, so I added manual wrapping using the lock.

There are a couple of things I worry about, though. This works quite well on a 64-bit little-endian system (I've incorporated it into my package and run the unit tests many times with no issues), but I wonder whether the type for raw_buf is correct on 32-bit and/or big-endian systems, and whether it should really be unsigned. And should the attr_id be closed afterwards?

@frejanordsiek
Author

I decided to dive into the official HDF5 example code for a vlen Attribute (https://bitbucket.hdfgroup.org/projects/HDFFV/repos/hdf5-examples/browse/1_10/C/H5T/h5ex_t_vlenatt.c) and the header H5Tpublic.h to look at the hvl_t type (https://bitbucket.hdfgroup.org/projects/HDFFV/repos/hdf5/browse/src/H5Tpublic.h?at=refs%2Ftags%2Fhdf5-1_12_0#190). The length field is a size_t which is then followed by a void pointer. So, the proper dtype is 'p'. Then, the code to read it would look like

>>> import ctypes
>>> import numpy
>>> import h5py
>>> import h5py._objects
>>>
>>> dt = h5py.vlen_dtype(numpy.dtype('S1'))
>>> a = numpy.empty((2, ), dtype=dt)
>>> a[0] = numpy.array([b'a', b'b'], dtype='S1')
>>> a[1] = numpy.array([b'c', b'd', b'e'], dtype='S1')
>>> f = h5py.File('data.h5', mode='a')
>>> f.attrs.create('test', a)
>>>
>>> dt = numpy.dtype('p')
>>> with h5py._objects.phil:
>>>     attr_id = f.attrs.get_id('test')
>>>     attr_size = attr_id.get_storage_size()
>>>     if attr_size % (2 * dt.itemsize) != 0:
>>>         raise RuntimeError('Data size is not a multiple of {0} bytes.'.format(2 * dt.itemsize))
>>>     raw_buf = numpy.empty(attr_size // dt.itemsize, dtype=dt)
>>>     attr_id.read(raw_buf, mtype=attr_id.get_type())
>>>     attr = numpy.empty(len(raw_buf) // 2, dtype='object')
>>>     for i in range(len(attr)):
>>>         length = int(raw_buf[2 * i])
>>>         ptr = int(raw_buf[(2 * i) + 1])
>>>         buf = ctypes.create_string_buffer(length)
>>>         ctypes.memmove(buf, ptr, length)
>>>         attr[i] = numpy.frombuffer(buf, dtype='S1')
>>> attr

which should work on any endian-ness and bit-depth of pointers supported by numpy and HDF5.

@takluyver
Member

Nice! I don't think you need the phil lock, because the low-level functions should also acquire it.

I found a way to dereference a pointer in ctypes without memmove(), which I think might avoid leaking the memory that HDF5 puts the vlen data in:

carr = (ctypes.c_char * length).from_address(ptr)
np.frombuffer(carr, dtype='S1')

I'd still be hesitant to use this technique seriously, but that's up to you!

@frejanordsiek
Author

Thanks for the suggested improvement. It seems to work, though I am putting a copy operation after the frombuffer because I want the ndarray to still be valid after the file is closed and everything is cleaned up.

numpy.frombuffer((length * ctypes.c_char).from_address(ptr), dtype='S1').copy()

About the phil lock, I worry about attr_id being pulled out from under the code by other calls to h5py. Are Attribute IDs only cleaned up when they are explicitly closed or collected by the garbage collector or can other external calls unintentionally close them?

I am hesitant about the technique as well, to be honest. Seeing as it is what one would do in C code based on the HDF Group's example code, I am a bit less worried. Still, it is indeed risky, but it is necessary for the package to support the two most recent h5py releases without losing functionality, since a functionality gap would surprise users and make the unit tests harder.

@aragilar
Member

Reopening as this probably shouldn't have been closed yet...

@aragilar aragilar reopened this Feb 22, 2021
@takluyver
Member

If you explicitly close a file object from the high-level API, that will invalidate all objects belonging to that file. Closing a file through the low-level API can also do this, but it's not the default. As far as I know, that's the only thing that would break an attribute reference you're holding (short of code actively trying to break stuff).
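For illustration (the exact exception may vary, but the identifier becomes unusable):

attr_id = f.attrs.get_id('test')
f.close()                        # high-level close invalidates objects belonging to this file
attr_id.get_storage_size()       # now fails, since the identifier is no longer valid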

I'd like to aim for a 3.2 release soon, so hopefully you won't need this for too long.

@aragilar AFAIK, we did fix the bug (in #1819), and that's when I'd normally close the issue for it. 🙂

@aragilar
Member

Ah, cool, I thought it was only a partial fix, I'll close this again.

@frejanordsiek
Author

No need to re-open this again, but I made a slightly improved version for anyone else reading this who needs the workaround. This version avoids making any intermediate copies; it allocates the ndarray first and uses ctypes.memmove to copy the data directly into it, so there is a much lower chance of a memory leak.

>>> import ctypes
>>> import numpy
>>> import h5py
>>> import h5py._objects
>>>
>>> dt = h5py.vlen_dtype(numpy.dtype('S1'))
>>> a = numpy.empty((2, ), dtype=dt)
>>> a[0] = numpy.array([b'a', b'b'], dtype='S1')
>>> a[1] = numpy.array([b'c', b'd', b'e'], dtype='S1')
>>> f = h5py.File('data.h5', mode='a')
>>> f.attrs.create('test', a)
>>>
>>> dt = numpy.dtype('p')
>>> with h5py._objects.phil:
>>>     attr_id = f.attrs.get_id('test')
>>>     attr_size = attr_id.get_storage_size()
>>>     if attr_size % (2 * dt.itemsize) != 0:
>>>         raise RuntimeError('Data size is not a multiple of {0} bytes.'.format(2 * dt.itemsize))
>>>     raw_buf = numpy.empty(attr_size // dt.itemsize, dtype=dt)
>>>     attr_id.read(raw_buf, mtype=attr_id.get_type())
>>>     attr = numpy.empty(len(raw_buf) // 2, dtype='object')
>>>     for i in range(len(attr)):
>>>         length = int(raw_buf[2 * i])
>>>         ptr = int(raw_buf[(2 * i) + 1])
>>>         attr[i] = numpy.empty(length, dtype='S1')
>>>         ctypes.memmove(attr[i].ctypes.data, ptr, length)
>>> attr

Thanks again.

@takluyver
Member

Thanks Freja. I suspect there might still be a memory leak, because I don't think anything frees the data you use as the source for memmove(). But then we seem to have some memory leak reading vlen data with h5py anyway, per #1747. 😕

@takluyver
Member

I've just released 3.2, which should fix this.

@frejanordsiek
Author

frejanordsiek commented Mar 14, 2021

Thank you. The release fixed it.

By the way, I found out from someone that the workaround I posted earlier segfaults on 32-bit systems. I think it boiled down to not taking the system's pointer size into account. Anyhow, the following version doesn't segfault on 32-bit or 64-bit little-endian systems (who knows about big-endian) and also doesn't read the size_t as an intp.

>>> import ctypes
>>> import numpy
>>> import h5py
>>> import h5py._objects
>>>
>>> dt = h5py.vlen_dtype(numpy.dtype('S1'))
>>> a = numpy.empty((2, ), dtype=dt)
>>> a[0] = numpy.array([b'a', b'b'], dtype='S1')
>>> a[1] = numpy.array([b'c', b'd', b'e'], dtype='S1')
>>> f = h5py.File('data.h5', mode='a')
>>> f.attrs.create('test', a)
>>>
>>> dt = numpy.dtype([('length', numpy.uintp), ('pointer', numpy.intp)])
>>> with h5py._objects.phil:
>>>     attr_id = f.attrs.get_id('test')
>>>     raw_buf = numpy.empty(attr_id.shape, dtype=dt)
>>>     attr_id.read(raw_buf, mtype=attr_id.get_type())
>>>     attr = numpy.empty(raw_buf.shape, dtype='object')
>>>     for i in range(len(attr)):
>>>         length = int(raw_buf[i]['length'])
>>>         ptr = int(raw_buf[i]['pointer'])
>>>         attr[i] = numpy.empty(length, dtype='S1')
>>>         ctypes.memmove(attr[i].ctypes.data, ptr, length)
>>> attr
