Unusual vlen type Attribute can be created but not read. #1817

Closed
frejanordsiek opened this issue Feb 17, 2021 · 19 comments · Fixed by #1819

@frejanordsiek

First, I've done my testing in the following setup

h5py    3.1.0
HDF5    1.12.0
Python  3.9.1 (default, Jan 20 2021, 00:00:00) 
[GCC 10.2.1 20201125 (Red Hat 10.2.1-9)]
sys.platform    linux
sys.maxsize     9223372036854775807
numpy   1.20.0
cython (built with) 0.29.21
numpy (built against) 1.19.3
HDF5 (built against) 1.12.0

However, other people using the package I wrote that uses this unusual Attribute have been reporting the problem in h5py 3.0.0 as well. I tested the same code with 2.10.0 earlier today with no problem, and the Attribute has worked in every h5py 2.x.y version since 2.3 (the first version where this Attribute could even be created, since 2.2 did not support writing it).

The Attribute is an array of vlens of type numpy.dtype('S1').

The following code creates and writes the Attribute, but reading it back afterwards raises an error

>>> import numpy, h5py
>>> dt = h5py.vlen_dtype(numpy.dtype('S1'))
>>> a = numpy.empty((1, ), dtype=dt)
>>> a[0] = numpy.array([b'a', b'b'], dtype='S1')
>>> f = h5py.File('data.h5', mode='a')
>>> f.attrs.create('test', a)
>>> f.attrs['test']

with the following output

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-b22c92ac9bc4> in <module>
      5 f = h5py.File('data.h5', mode='a')
      6 f.attrs.create('test', a)
----> 7 f.attrs['test']

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

~/.local/lib/python3.9/site-packages/h5py/_hl/attrs.py in __getitem__(self, name)                                                                               
     75 
     76         arr = numpy.ndarray(shape, dtype=dtype, order='C')
---> 77         attr.read(arr, mtype=htype)
     78 
     79         string_info = h5t.check_string_dtype(dtype)

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5a.pyx in h5py.h5a.AttrID.read()

h5py/_proxy.pyx in h5py._proxy.attr_rw()

h5py/_conv.pyx in h5py._conv.vlen2ndarray()

h5py/_conv.pyx in h5py._conv.conv_vlen2ndarray()

ValueError: data type must provide an itemsize

Similar results are obtained trying to read it with f.attrs.get('test'), f.attrs.items(), and f.attrs.values().

I also can't read it at a lower level with

>>> id = f.attrs.get_id('test')
>>> b = numpy.zeros(id.shape, dtype=id.dtype)
>>> id.read(b)

and get the same error (probably because those other methods do this under the hood).

For reference, I ran h5dump data.h5 on the file and got the following result

HDF5 "data.h5" {
GROUP "/" {
   ATTRIBUTE "test" {
      DATATYPE  H5T_VLEN { H5T_STRING {
         STRSIZE 1;
         STRPAD H5T_STR_NULLPAD;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }}
      DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
      DATA {
      (0): ("a", "b")
      }
   }
}
}

I do not know if this is a bug, an intentional feature drop, or something else. If it is a bug and is fixed in the next release of h5py, I still need a workaround to read this in the versions that have the problem, since I need to support all h5py versions from 2.3 to the present with full functionality. It is almost surely readable using the low-level libhdf5 bindings in h5py, but to be honest I am not sure where to start or how stable those bindings are. How would one read this kind of Attribute in h5py 3.x in its present state?

@takluyver
Member

takluyver commented Feb 18, 2021

Hi @frejanordsiek,

The good news is, I think I've worked out how to fix this - see PR #1819.

The bad news is that I can't really find a workaround to get this data in any reasonable way for h5py 3.0 and 3.1. I've managed to guess and hack together a way that seems to get the data if you're really desperate, but I can't recommend using this in any serious code. There's a potential for segfaults if it goes wrong, and it may leak memory even if it works.

Hack to read vlen array of strings
import struct, ctypes
import numpy as np

attr = f.attrs.get_id('test')
storage_size = attr.get_storage_size()  # Should be 16 * number of entries
index_buf = np.zeros(storage_size, dtype=np.uint8)
attr.read(index_buf, mtype=attr.get_type())  # Read with no conversion

# index_buf now contains lengths & pointers to the real data
# Can we dereference a pointer in Python?

length, ptr = struct.unpack('<QQ', index_buf[:16])
data_buf = ctypes.create_string_buffer(length)
ctypes.memmove(data_buf, ptr, length)
res = np.frombuffer(data_buf, dtype='S1')
# res is now array([b'a', b'b'], dtype='|S1')

I'll reiterate: this is a terrible hack and not suitable for serious use. It was interesting to work out, and maybe it can be useful in very specific circumstances, but I wouldn't rely on it.

If your data type is vlen arrays of length-1 strings, is it practical to use vlen strings instead? I think these are logically equivalent, but they're better tested in h5py.
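If it helps, here's an untested sketch of what that swap could look like, reusing the file f and imports from the snippet above (h5py.string_dtype is the vlen string dtype; the attribute name 'test_str' is just an example):

str_dt = h5py.string_dtype(encoding='ascii')   # variable-length string dtype
b = numpy.array([b'ab'], dtype=object)         # one vlen string instead of a vlen array of 'S1' items
f.attrs.create('test_str', b, dtype=str_dt)
f.attrs['test_str']   # reads back fine on 3.0/3.1; the exact str/bytes representation depends on the h5py version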

@takluyver takluyver added the bug label Feb 18, 2021
@takluyver takluyver added this to the 3.2 milestone Feb 18, 2021
@takluyver
Member

P.S.

the lowlevel libhdf5 bindings in h5py, but to be honest I am not sure where to start and how stable those bindings are

The low-level API should be stable and usable - it is part of h5py's API, and it is documented. It's not as convenient as the high-level API, of course, but you are welcome to use it if you want to. Unfortunately the bug here is in an even lower level, so it doesn't help much.

This doc page is a good starting point for the low-level API: https://docs.h5py.org/en/stable/high/lowlevel.html
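To give a flavour, inspecting an attribute through the low-level API looks roughly like this (these are existing low-level calls; it stops short of read(), which is where this bug lives):

attr = f.attrs.get_id('test')                        # h5py.h5a.AttrID
htype = attr.get_type()                              # h5py.h5t.TypeID for the stored datatype
dims = attr.get_space().get_simple_extent_dims()     # dataspace dimensions
nbytes = attr.get_storage_size()                     # raw storage size in bytes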

@sethrj
Contributor

sethrj commented Feb 18, 2021

I've also hit this bug in our project's HDF5 format. We write and read an array of compound datasets, each element of which includes a variable-length string constructed in the standard way for defining variable-length strings in HDF5:

    hid_t datatype = H5Tcopy(H5T_C_S1);
    H5Tset_size(datatype, H5T_VARIABLE);
    H5Tset_strpad(datatype, H5T_STR_NULLTERM);

Thanks for looking into this @takluyver !

@takluyver
Member

@sethrj this specific issue shouldn't affect vlen strings (it's in the conversion pathway for vlen arrays other than strings). Could you try with #1819 and see if your issue is also fixed or if there's something else to fix?

Docs on installing h5py from source: https://docs.h5py.org/en/stable/build.html#source-installation

@takluyver
Member

Our CI also builds wheels, which offer another option for testing out a PR. You can find them from PR #1819 here:

https://dev.azure.com/h5pyappveyor/h5py/_build/results?buildId=975&view=artifacts&pathAsName=false&type=publishedArtifacts

Download the archive for the platform you want, unpack, and pip install the wheel file for the appropriate Python version.

@sethrj
Contributor

sethrj commented Feb 18, 2021

Oh, ok, the data structure in question has another non-string varlen array, each element of which is a compound datatype 😬

Effectively in C++ it looks like:

struct Nasty
{
    int id;
    std::string  name;
    std::vector<std::pair<int, double>> values;
    bool flag;
    bool another_flag;
};

so it's not surprising that this one is a troublemaker. I don't think I have time to test today, but I will try. 😞

@takluyver
Member

😨 That's an exciting extra level of complexity, but it is plausible that you're hitting this same issue, then.

From my investigation earlier today, HDF5 is not particularly efficient at storing or accessing structured data like this. Each vlen field (string or array) basically has a pointer inside the file to where it's stored, so it will have to do a separate little read for each element. There's extra inefficiency for vlen arrays in h5py, because we have the overhead of creating lots of little numpy arrays.

@frejanordsiek
Author

@takluyver Thank you for the fix and the workaround. Looking at it, I can see what you mean about the segfault and memory-leak risks. I will have to think about whether to implement it or just have the package offer less functionality on h5py versions that can't read the Attribute correctly.

As for the Attribute's type, it wasn't my choice to format it like this. My package provides compatibility with an HDF5 schema made by software of another party (in case anyone is interested, it is the MATLAB_fields Attribute for structs in Matlab's v7.3 MAT file format).

@frejanordsiek
Author

@takluyver I managed to make a slightly improved version of the code you suggested for reading it more manually.

>>> import ctypes
>>> import numpy
>>> import h5py
>>> import h5py._objects
>>>
>>> dt = h5py.vlen_dtype(numpy.dtype('S1'))
>>> a = numpy.empty((2, ), dtype=dt)
>>> a[0] = numpy.array([b'a', b'b'], dtype='S1')
>>> a[1] = numpy.array([b'c', b'd', b'e'], dtype='S1')
>>> f = h5py.File('data.h5', mode='a')
>>> f.attrs.create('test', a)
>>>
>>> with h5py._objects.phil:
>>>     attr_id = f.attrs.get_id('test')
>>>     attr_size = attr_id.get_storage_size()
>>>     if attr_size % 16 != 0:
>>>         raise RuntimeError('Data size is not a multiple of 16 bytes.')
>>>     raw_buf = numpy.empty(attr_size // 8, dtype='<u8')
>>>     attr_id.read(raw_buf, mtype=attr_id.get_type())
>>>     attr = numpy.empty(len(raw_buf) // 2, dtype='object')
>>>     for i in range(len(attr)):
>>>         length = int(raw_buf[2 * i])
>>>         ptr = int(raw_buf[(2 * i) + 1])
>>>         buf = ctypes.create_string_buffer(length)
>>>         ctypes.memmove(buf, ptr, length)
>>>         attr[i] = numpy.frombuffer(buf, dtype='S1')
>>> attr

which outputs

array([array([b'a', b'b'], dtype='|S1'),
       array([b'c', b'd', b'e'], dtype='|S1')], dtype=object)

It does a quick check that the size makes sense and doesn't use struct.unpack as an intermediate step. Looking around inside the h5py code, including the code for reading Attributes, it looks like most functions need the with_phil wrapper from h5py._objects for thread safety, so I added manual wrapping using the lock.

There are a couple of things I worry about, though. This works quite well on a 64-bit little-endian system (I've incorporated it into my package and run the unit tests many times with no issues), but I wonder whether the type for raw_buf is correct on 32-bit and/or big-endian systems, and whether it should really be unsigned. And should the attr_id be closed afterwards?

@frejanordsiek
Author

I decided to dive into the official HDF5 example code for a vlen Attribute (https://bitbucket.hdfgroup.org/projects/HDFFV/repos/hdf5-examples/browse/1_10/C/H5T/h5ex_t_vlenatt.c) and the header H5Tpublic.h to look at the hvl_t type (https://bitbucket.hdfgroup.org/projects/HDFFV/repos/hdf5/browse/src/H5Tpublic.h?at=refs%2Ftags%2Fhdf5-1_12_0#190). The length field is a size_t which is then followed by a void pointer. So, the proper dtype is 'p'. Then, the code to read it would look like

>>> import ctypes
>>> import numpy
>>> import h5py
>>> import h5py._objects
>>>
>>> dt = h5py.vlen_dtype(numpy.dtype('S1'))
>>> a = numpy.empty((2, ), dtype=dt)
>>> a[0] = numpy.array([b'a', b'b'], dtype='S1')
>>> a[1] = numpy.array([b'c', b'd', b'e'], dtype='S1')
>>> f = h5py.File('data.h5', mode='a')
>>> f.attrs.create('test', a)
>>>
>>> dt = numpy.dtype('p')
>>> with h5py._objects.phil:
>>>     attr_id = f.attrs.get_id('test')
>>>     attr_size = attr_id.get_storage_size()
>>>     if attr_size % (2 * dt.itemsize) != 0:
>>>         raise RuntimeError('Data size is not a multiple of {0} bytes.'.format(2 * dt.itemsize))
>>>     raw_buf = numpy.empty(attr_size // dt.itemsize, dtype=dt)
>>>     attr_id.read(raw_buf, mtype=attr_id.get_type())
>>>     attr = numpy.empty(len(raw_buf) // 2, dtype='object')
>>>     for i in range(len(attr)):
>>>         length = int(raw_buf[2 * i])
>>>         ptr = int(raw_buf[(2 * i) + 1])
>>>         buf = ctypes.create_string_buffer(length)
>>>         ctypes.memmove(buf, ptr, length)
>>>         attr[i] = numpy.frombuffer(buf, dtype='S1')
>>> attr

which should work on any endian-ness and bit-depth of pointers supported by numpy and HDF5.

@takluyver
Member

Nice! I don't think you need the phil lock, because the low-level functions should also acquire it.

I found a way to dereference a pointer in ctypes without memmove(), which I think might avoid leaking the memory that HDF5 puts the vlen data in:

carr = (ctypes.c_char * length).from_address(ptr)
np.frombuffer(carr, dtype='S1')

I'd still be hesitant to use this technique seriously, but that's up to you!

@frejanordsiek
Author

Thanks for the suggested improvement. It seems to work, though I am putting a copy operation after the frombuffer because I want the ndarray to still be valid after the file is closed and everything is cleaned up.

numpy.frombuffer((length * ctypes.c_char).from_address(ptr), dtype='S1').copy()

About the phil lock, I worry about attr_id being pulled out from under the code by other calls to h5py. Are Attribute IDs only cleaned up when they are explicitly closed or collected by the garbage collector or can other external calls unintentionally close them?

I am hesitant about the technique as well, to be honest. Seeing as it is what one would do in C code based on the HDF Group's example code, I am a bit less worried. Still, it is indeed risky, but it is necessary for the package to support the two most recent h5py releases without losing functionality, since a functionality gap would surprise users and make the unit tests harder.

@aragilar
Member

Reopening as this probably shouldn't have been closed yet...

@aragilar aragilar reopened this Feb 22, 2021
@takluyver
Member

If you explicitly close a file object from the high-level API, that will invalidate all objects belonging to that file. Closing a file through the low-level API can also do this, but it's not the default. As far as I know, that's the only thing that would break an attribute reference you're holding (short of code actively trying to break stuff).
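For illustration (the exact exception may vary, but the identifier becomes unusable):

attr_id = f.attrs.get_id('test')
f.close()                        # high-level close invalidates objects belonging to this file
attr_id.get_storage_size()       # now fails, since the identifier is no longer valid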

I'd like to aim for a 3.2 release soon, so hopefully you won't need this for too long.

@aragilar AFAIK, we did fix the bug (in #1819), and that's when I'd normally close the issue for it. 🙂

@aragilar
Member

Ah, cool, I thought it was only a partial fix, I'll close this again.

@frejanordsiek
Author

No need to re-open this again, but I made a slightly improved version for anyone else reading this who needs the workaround. This version avoids making any intermediate copies; it allocates the ndarray first and uses ctypes.memmove to copy the data directly into it, so there is a much lower chance of a memory leak.

>>> import ctypes
>>> import numpy
>>> import h5py
>>> import h5py._objects
>>>
>>> dt = h5py.vlen_dtype(numpy.dtype('S1'))
>>> a = numpy.empty((2, ), dtype=dt)
>>> a[0] = numpy.array([b'a', b'b'], dtype='S1')
>>> a[1] = numpy.array([b'c', b'd', b'e'], dtype='S1')
>>> f = h5py.File('data.h5', mode='a')
>>> f.attrs.create('test', a)
>>>
>>> dt = numpy.dtype('p')
>>> with h5py._objects.phil:
>>>     attr_id = f.attrs.get_id('test')
>>>     attr_size = attr_id.get_storage_size()
>>>     if attr_size % (2 * dt.itemsize) != 0:
>>>         raise RuntimeError('Data size is not a multiple of {0} bytes.'.format(2 * dt.itemsize))
>>>     raw_buf = numpy.empty(attr_size // dt.itemsize, dtype=dt)
>>>     attr_id.read(raw_buf, mtype=attr_id.get_type())
>>>     attr = numpy.empty(len(raw_buf) // 2, dtype='object')
>>>     for i in range(len(attr)):
>>>         length = int(raw_buf[2 * i])
>>>         ptr = int(raw_buf[(2 * i) + 1])
>>>         attr[i] = numpy.empty(length, dtype='S1')
>>>         ctypes.memmove(attr[i].ctypes.data, ptr, length)
>>> attr

Thanks again.

@takluyver
Member

Thanks Freja. I suspect there might still be a memory leak, because I don't think anything frees the data you use as the source for memmove(). But then we seem to have some memory leak reading vlen data with h5py anyway, per #1747. 😕

@takluyver
Member

I've just released 3.2, which should fix this.

@frejanordsiek
Author

frejanordsiek commented Mar 14, 2021

Thank you. The release fixed it.

By the way, I found out from someone that the workaround I posted earlier segfaults on 32-bit systems. I think it boiled down to not taking the system's pointer size into account. Anyhow, the following version doesn't segfault on 32-bit or 64-bit little-endian systems (who knows about big-endian) and also doesn't read the size_t as an intp.

>>> import ctypes
>>> import numpy
>>> import h5py
>>> import h5py._objects
>>>
>>> dt = h5py.vlen_dtype(numpy.dtype('S1'))
>>> a = numpy.empty((2, ), dtype=dt)
>>> a[0] = numpy.array([b'a', b'b'], dtype='S1')
>>> a[1] = numpy.array([b'c', b'd', b'e'], dtype='S1')
>>> f = h5py.File('data.h5', mode='a')
>>> f.attrs.create('test', a)
>>>
>>> dt = numpy.dtype([('length', numpy.uintp), ('pointer', numpy.intp)])
>>> with h5py._objects.phil:
>>>     attr_id = f.attrs.get_id('test')
>>>     raw_buf = numpy.empty(attr_id.shape, dtype=dt)
>>>     attr_id.read(raw_buf, mtype=attr_id.get_type())
>>>     attr = numpy.empty(raw_buf.shape, dtype='object')
>>>     for i in range(len(attr)):
>>>         length = int(raw_buf[i]['length'])
>>>         ptr = int(raw_buf[i]['pointer'])
>>>         attr[i] = numpy.empty(length, dtype='S1')
>>>         ctypes.memmove(attr[i].ctypes.data, ptr, length)
>>> attr
