Trouble setting value for vlen type on compound data #1921

Open
FFY00 opened this issue Jul 6, 2021 · 8 comments

@FFY00
Contributor

FFY00 commented Jul 6, 2021

  • Operating System: Arch Linux
  • Where Python was acquired: Arch Linux
$ python -c 'import h5py; print(h5py.version.info)'
Summary of the h5py configuration
---------------------------------

h5py    3.2.1
HDF5    1.12.0
Python  3.9.5 (default, May 24 2021, 12:50:35)
[GCC 11.1.0]
sys.platform    linux
sys.maxsize     9223372036854775807
numpy   1.20.3
cython (built with) 0.29.22
numpy (built against) 1.20.1
HDF5 (built against) 1.12.0

Reproducible

import h5py
import numpy as np


f = h5py.File('test.h5','w')

table = f.create_dataset(
    'packets',
    shape=(1,),
    chunks=True,
    maxshape=(None,),
    dtype=[
        ('timestamp', np.float64),
        ('data', h5py.vlen_dtype(np.uint8)),
    ],
)

table[0, 'timestamp'] = 1.5
table[0, 'data'] = np.frombuffer(b'test', dtype=np.uint8)

f.close()
$ python example.py
Traceback (most recent call last):
  File "/home/anubis/git/usbviewer/example.py", line 19, in <module>
    table[0, 'data'] = np.frombuffer(b'test', dtype=np.uint8)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/usr/lib/python3.9/site-packages/h5py/_hl/dataset.py", line 848, in __setitem__
    val = val.view(numpy.dtype([(names[0], dtype)]))
  File "/usr/lib/python3.9/site-packages/numpy/core/_internal.py", line 459, in _view_is_safe
    raise TypeError("Cannot change data-type for object array.")
TypeError: Cannot change data-type for object array.

I do not understand what I am doing wrong; the array is the correct type. Am I using the API incorrectly and not setting the value? Admittedly, I am a bit confused by this API and am just trying to figure out how to actually write the data 🙃. (Also, is there any way to write the whole row as a tuple, e.g. table[0] = (1.5, array)?)

@FFY00
Contributor Author

FFY00 commented Jul 6, 2021

It seems that the data I provided has a different type. It is dtype('O'), while it was expecting dtype([('data', 'O')]). Are there any significant differences there? AFAIK it's only the label, so I would expect this to cast silently.
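To make the difference between the two dtypes concrete, here is a small NumPy-only sketch (no h5py involved; the variable names are illustrative). It compares a plain object dtype with a one-field structured dtype and then attempts the same kind of `.view()` call that appears in the traceback above. The outcome of the view is recorded rather than assumed, since the exact behaviour and message may differ between NumPy versions.

```python
import numpy as np

plain = np.dtype(object)                 # dtype('O')
wrapped = np.dtype([('data', object)])   # dtype([('data', 'O')])

# Same item size (one pointer each), but not the same dtype: one is a
# structured type with a named field, the other has no fields at all.
same_itemsize = plain.itemsize == wrapped.itemsize
print(plain == wrapped, plain.fields, wrapped.names, same_itemsize)

# An object array holding one variable-length payload.
val = np.empty(1, dtype=object)
val[0] = np.frombuffer(b'test', dtype=np.uint8)

# Reinterpreting object memory under another dtype is the step that
# failed in the traceback; record what this NumPy version does.
try:
    val.view(wrapped)
    view_outcome = 'view succeeded'
except TypeError as exc:
    view_outcome = f'TypeError: {exc}'
print(view_outcome)
```

So the field name is more than a label to NumPy: the two dtypes compare unequal, and going from one to the other requires a view or a copy rather than a silent cast.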

Anyway, adjusting the code to match the type.

import h5py
import numpy as np


f = h5py.File('test.h5','w')

table = f.create_dataset(
    'packets',
    shape=(1,),
    chunks=True,
    maxshape=(None,),
    dtype=[
        ('timestamp', np.float64),
        ('data', h5py.vlen_dtype(np.uint8)),
    ],
)

table[0, 'timestamp'] = 1.5
table[0, 'data'] = np.frombuffer(b'test', dtype=[('data', np.uint8)])

f.close()

I am now getting:

$ python example.py
Traceback (most recent call last):
  File "/home/anubis/git/usbviewer/example.py", line 19, in <module>
    table[0, 'data'] = np.frombuffer(b'test', dtype=[('data', np.uint8)])
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/usr/lib/python3.9/site-packages/h5py/_hl/dataset.py", line 946, in __setitem__
    mspace = h5s.create_simple(selection.expand_shape(mshape))
  File "/usr/lib/python3.9/site-packages/h5py/_hl/selections.py", line 269, in expand_shape
    raise TypeError("Can't broadcast %s -> %s" % (source_shape, self.array_shape))  # array shape
TypeError: Can't broadcast (4,) -> ()
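The shape in that message can be reproduced without h5py. A minimal sketch, using only NumPy: `np.frombuffer` with a one-field structured dtype produces four 1-byte records, i.e. an array of shape `(4,)`, whereas a single dataset element is a scalar selection of shape `()`.

```python
import numpy as np

# Each byte of b'test' becomes its own one-field record, so the result has
# four elements rather than one element that contains four bytes.
records = np.frombuffer(b'test', dtype=[('data', np.uint8)])
print(records.shape, records.dtype.itemsize)

# The payload meant for a single vlen cell is still the plain uint8 array.
payload = np.frombuffer(b'test', dtype=np.uint8)
print(records['data'].tolist() == payload.tolist())
```

In other words, wrapping the element type in a field changes what each element is, not how many elements there are, which is why the value no longer fits a single cell.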

@FFY00
Contributor Author

FFY00 commented Jul 6, 2021

I tried patching expand_shape to return the source shape if self.array_shape is empty (which seems to be the case for vlen types). No exception is raised, but the data isn't written. I am out of ideas now.

@takluyver
Member

What about if you put the array into an object array, something like this:

a = np.array(None, dtype=object)
a[()] = np.frombuffer(b'test', dtype=np.uint8)

vlen data in h5py is somewhere between awkward and a kludge. Compound datatypes are somewhat better, but still not the main thing the API is designed around. So it's not a great surprise that using a vlen inside a compound datatype causes problems. h5py, HDF5 and NumPy are all at their best with homogeneous arrays of numbers.

@FFY00
Contributor Author

FFY00 commented Jul 7, 2021

That fails with

Traceback (most recent call last):
  File "/home/anubis/git/usbviewer/example.py", line 20, in <module>
    data[()] = np.frombuffer(b'test', dtype=np.uint8)
ValueError: setting an array element with a sequence.
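For reference, one way to get an array stored as a single element of an object array is to fill a 1-element object array by integer index. This is a NumPy-only sketch with illustrative names; whether h5py then accepts the result for one field of a compound dataset is not established here.

```python
import numpy as np

payload = np.frombuffer(b'test', dtype=np.uint8)

# Integer indexing into a 1-D object array stores the array itself as a
# single element instead of trying to broadcast its four values.
holder = np.empty(1, dtype=object)
holder[0] = payload
print(holder.shape, holder[0] is payload)

# A 0-d version of the same element, for code that expects a scalar shape.
scalar_holder = holder.reshape(())
print(scalar_holder.shape, scalar_holder.item() is payload)
```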

I also tried doing np.void(b'test') and np.object_(b'test') instead of np.frombuffer(b'test', dtype=np.uint8), but they both fail with

Traceback (most recent call last):
  File "/home/anubis/git/usbviewer/example.py", line 22, in <module>
    table[0, 'data'] = data
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/usr/lib/python3.9/site-packages/h5py/_hl/dataset.py", line 948, in __setitem__
    self.id.write(mspace, fspace, val, mtype, dxpl=self._dxpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 232, in h5py.h5d.DatasetID.write
  File "h5py/_proxy.pyx", line 145, in h5py._proxy.dset_rw
  File "h5py/_conv.pyx", line 784, in h5py._conv.ndarray2vlen
AttributeError: 'bytes' object has no attribute 'dtype'

> vlen data in h5py is somewhere between awkward and a kludge. Compound datatypes are somewhat better, but still not the main thing the API is designed around. So it's not a great surprise that using a vlen inside a compound datatype causes problems. h5py, HDF5 and NumPy are all at their best with homogeneous arrays of numbers.

I understand; this stack is mostly designed to hold datasets, not for variable data storage like what I am doing.

I think I am gonna go with

import struct

import h5py
import numpy as np


f = h5py.File('test.h5','w')

table = f.create_dataset('packets',
    shape=(1,),
    chunks=True,
    maxshape=(None,),
    dtype=h5py.vlen_dtype(np.uint8),
)

table[0] = np.frombuffer(struct.pack('f', 1.5) + b'test', dtype=np.uint8)
...
((timestamp,), data) = struct.unpack('f', table[0][:4]), table[0][4:].tobytes()
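One caveat with the `struct` approach: the format `'f'` is a 4-byte single-precision float in native byte order, whereas the compound dtype tried earlier used `np.float64`. The sketch below uses an explicit little-endian double (`'<d'`) instead. The helper names `encode_packet` and `decode_packet` are hypothetical, and only the standard library is used.

```python
import struct
from typing import Tuple

# '<d' = little-endian 8-byte double, matching a float64 timestamp.
HEADER = struct.Struct('<d')

def encode_packet(timestamp: float, data: bytes) -> bytes:
    """Prefix the raw payload with a fixed-size timestamp header."""
    return HEADER.pack(timestamp) + data

def decode_packet(blob: bytes) -> Tuple[float, bytes]:
    """Split a blob produced by encode_packet back into its two parts."""
    (timestamp,) = HEADER.unpack_from(blob, 0)
    return timestamp, bytes(blob[HEADER.size:])

ts = 1625558400.123456
blob = encode_packet(ts, b'test')
timestamp, data = decode_packet(blob)
print(len(blob), timestamp == ts, data)

# The same timestamp pushed through a 4-byte float does not survive intact.
(single,) = struct.unpack('<f', struct.pack('<f', ts))
print(single == ts)
```

The resulting `blob` can be wrapped with `np.frombuffer(blob, dtype=np.uint8)` before being assigned to `table[0]`, as in the code above.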

And if I ever need more complex data, pickle.loads/dumps.

import pickle

import h5py
import numpy as np


f = h5py.File('test.h5','w')

table = f.create_dataset('packets',
    shape=(1,),
    chunks=True,
    maxshape=(None,),
    dtype=h5py.vlen_dtype(np.uint8),
)

table[0] = np.frombuffer(pickle.dumps((1.5, b'test')), dtype=np.uint8)
...
timestamp, data = pickle.loads(table[0])
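The pickle round trip can be checked on its own, without a file. This is a minimal sketch in which the array `cell` stands in for what reading `table[0]` would return (an assumption); converting it back to `bytes` before unpickling keeps the intent explicit.

```python
import pickle

import numpy as np

record = (1.5, b'test')

# Serialise to bytes, then view those bytes as the uint8 array a vlen cell holds.
cell = np.frombuffer(pickle.dumps(record), dtype=np.uint8)

# Convert back to bytes and unpickle to recover the original tuple.
timestamp, data = pickle.loads(cell.tobytes())
print(timestamp, data)
```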

It still would be very nice to see this issue fixed 😕

@takluyver
Member

I'd be happy to see it get fixed - but I probably won't prioritise working on it myself. 😉

@FFY00
Contributor Author

FFY00 commented Jul 11, 2021

I would be happy to give it a try, but I would probably need some pointers. If/when you, or any of the other maintainers, have time, please let me know.

@takluyver
Member

Well, glancing at Dataset.__setitem__, I can see there are a few special cases around vlen data and accessing fields of compound types. I'm guessing they don't interact correctly:

h5py/h5py/_hl/dataset.py

Lines 783 to 803 in a6c6590

# Generally we try to avoid converting the arrays on the Python
# side. However, for compound literals this is unavoidable.
vlen = h5t.check_vlen_dtype(self.dtype)
if vlen is not None and vlen not in (bytes, str):
    try:
        val = numpy.asarray(val, dtype=vlen)
    except ValueError:
        try:
            val = numpy.array([numpy.array(x, dtype=vlen)
                               for x in val], dtype=self.dtype)
        except ValueError:
            pass
    if vlen == val.dtype:
        if val.ndim > 1:
            tmp = numpy.empty(shape=val.shape[:-1], dtype=object)
            tmp.ravel()[:] = [i for i in val.reshape(
                (numpy.product(val.shape[:-1], dtype=numpy.ulonglong), val.shape[-1]))]
        else:
            tmp = numpy.array([None], dtype=object)
            tmp[0] = val
        val = tmp

h5py/h5py/_hl/dataset.py

Lines 808 to 813 in a6c6590

if len(names) == 1 and self.dtype.fields is not None:
    # Single field selected for write, from a non-array source
    if not names[0] in self.dtype.fields:
        raise ValueError("No such field for indexing: %s" % names[0])
    dtype = self.dtype.fields[names[0]][0]
    cast_compound = True

h5py/h5py/_hl/dataset.py

Lines 851 to 877 in a6c6590

# Make a compound memory type if field-name slicing is required
elif len(names) != 0:
    mshape = val.shape
    # Catch common errors
    if self.dtype.fields is None:
        raise TypeError("Illegal slicing argument (not a compound dataset)")
    mismatch = [x for x in names if x not in self.dtype.fields]
    if len(mismatch) != 0:
        mismatch = ", ".join('"%s"'%x for x in mismatch)
        raise ValueError("Illegal slicing argument (fields %s not in dataset type)" % mismatch)
    # Write non-compound source into a single dataset field
    if len(names) == 1 and val.dtype.fields is None:
        subtype = h5t.py_create(val.dtype)
        mtype = h5t.create(h5t.COMPOUND, subtype.get_size())
        mtype.insert(self._e(names[0]), 0, subtype)
    # Make a new source type keeping only the requested fields
    else:
        fieldnames = [x for x in val.dtype.names if x in names] # Keep source order
        mtype = h5t.create(h5t.COMPOUND, val.dtype.itemsize)
        for fieldname in fieldnames:
            subtype = h5t.py_create(val.dtype.fields[fieldname][0])
            offset = val.dtype.fields[fieldname][1]
            mtype.insert(self._e(fieldname), offset, subtype)

@urig

urig commented Jul 18, 2021

I've bumped into a similar issue when trying to insert a compound object made up of vlen strings. I posted it on stackoverflow.com and received an answer that points in a different direction (which worked for me):

You're setting the dtype as a tuple of variable length strings, so you'd set the tuple all at once. By only setting the label element, the other two tuple values aren't being set, so they are not string types.

In other words, perhaps try changing this:

table[0, 'timestamp'] = 1.5
table[0, 'data'] = np.frombuffer(b'test', dtype=np.uint8)

into this:

table[0] = 1.5, np.frombuffer(b'test', dtype=np.uint8)
