
regression in h5py 3.4.0: fletcher32 filter on variable length strings dataset #1948

Closed
paulmueller opened this issue Aug 23, 2021 · 8 comments

Comments

@paulmueller
Contributor

Summary of the h5py configuration:

h5py 3.4.0
HDF5 1.12.1
Python 3.8.10 (default, Jun 2 2021, 10:49:15)
[GCC 9.4.0]
sys.platform linux (I am on Ubuntu 20.04.2 LTS)
sys.maxsize 9223372036854775807
numpy 1.19.5
cython (built with) 0.29.24
numpy (built against) 1.17.5
HDF5 (built against) 1.12.1

The following code works with h5py<3.4.0:

import h5py

dt = h5py.special_dtype(vlen=str)

with h5py.File("test.h5", mode="w") as h5:
    log_dset = h5.create_dataset("peter",
                                 (10,),
                                 dtype=dt,
                                 maxshape=(None,),
                                 chunks=True,
                                 fletcher32=True,
                                 compression="gzip")

With h5py 3.4.0, I get the error:

Traceback (most recent call last):
  File "test.py", line 6, in <module>
    log_dset = h5.create_dataset("peter",
  File "/home/paul/repos/dclab/.env/lib/python3.8/site-packages/h5py/_hl/group.py", line 149, in create_dataset
    dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
  File "/home/paul/repos/dclab/.env/lib/python3.8/site-packages/h5py/_hl/dataset.py", line 137, in make_new_dset
    dset_id = h5d.create(parent.id, name, tid, sid, dcpl=dcpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 87, in h5py.h5d.create
ValueError: Unable to create dataset (not suitable for filters)

The error goes away when I remove fletcher32=True. But I would like to have that extra check, so this looks like a regression to me.
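
For reference, the same call with fletcher32 removed (gzip left in place) creates the dataset without error on h5py 3.4.0:

import h5py

dt = h5py.special_dtype(vlen=str)

with h5py.File("test.h5", mode="w") as h5:
    # identical to the snippet above, minus fletcher32=True
    log_dset = h5.create_dataset("peter",
                                 (10,),
                                 dtype=dt,
                                 maxshape=(None,),
                                 chunks=True,
                                 compression="gzip")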

paulmueller added a commit to DC-analysis/dclab that referenced this issue Aug 23, 2021
@takluyver
Member

I think the only relevant change in h5py 3.4 is that the pre-built wheels bundle HDF5 1.12.1, and it's a change in HDF5 itself that's affecting you. The 'not suitable for filters' message does seem to have been added in 1.12.1; the original commit is here:

HDFGroup/hdf5@16349c5

From the code, it looks like it will refuse to apply any filters to vlen strings. I wonder if previously filters were either being ignored, or applied to the pointers to vlen data rather than the data itself, making them mostly pointless.
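
A rough way to probe that hypothesis (a sketch, assuming an h5py build against HDF5 older than 1.12.1, where the combination is still accepted): write highly compressible vlen strings once without and once with the filters and compare file sizes. If gzip reached the string payload, the filtered file would be far smaller; if it only sees the vlen descriptors in the chunks, both files stay roughly the same size.

import os

import h5py
import numpy as np

# 100 very compressible 10 kB strings
data = np.array(["x" * 10_000] * 100, dtype=object)
dt = h5py.special_dtype(vlen=str)

for name, kwds in [("plain.h5", {}),
                   ("filtered.h5", {"compression": "gzip", "fletcher32": True})]:
    with h5py.File(name, mode="w") as h5:
        h5.create_dataset("s", data=data, dtype=dt, chunks=True, **kwds)
    print(name, os.path.getsize(name), "bytes")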

You can get in touch with HDF group via help@hdfgroup.org or the HDF forum.

@keszybz

keszybz commented Sep 24, 2021

Actually this is not related to strings; it fails the same way with ints, for example. If I understand the code correctly, HDF5 now rejects fletcher32 for any variable-length array:

bad_for_filters = (H5S_NULL == space_class || H5S_SCALAR == space_class
        || H5T_VLEN == type_class ||
        ...;

(This also causes pytables tests to fail with hdf5-1.10.7-2.fc35.x86_64.)
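
For instance, the same error shows up with a variable-length integer dtype (minimal sketch; file and dataset names made up):

import h5py
import numpy as np

dt = h5py.vlen_dtype(np.int32)  # variable-length ints, no strings involved

with h5py.File("test_vlen_int.h5", mode="w") as h5:
    # ValueError: Unable to create dataset (not suitable for filters)
    # on HDF5 builds that include the check above
    h5.create_dataset("numbers",
                      (10,),
                      dtype=dt,
                      chunks=True,
                      fletcher32=True)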

keszybz added a commit to keszybz/PyTables that referenced this issue Sep 25, 2021
hdf5 doesn't like that:
h5py/h5py#1948
HDFGroup/hdf5@16349c5

> The 'not suitable for filters' message does seem to have been added
> in 1.12.1. From the code, it looks like it will refuse to apply any
> filters to vlen strings.

This patch is based on the assumption that the hdf5 code is *correct*, and
that fletcher32 indeed shouldn't be used with vlarrays. But if hdf5 is wrong,
then the fix should be on their side. It might make sense to apply this to
get the tests passing again, even if hdf5 is ultimately adjusted too.

Fixes PyTables#845.
@paulmueller
Contributor Author

I created a thread at the HDF forum: https://forum.hdfgroup.org/t/fletcher32-filter-on-variable-length-string-datasets-not-suitable-for-filters/9038

paulmueller added a commit to DC-analysis/dclab that referenced this issue Oct 11, 2021
@paulmueller
Contributor Author

It seems like what @takluyver hypothesized is true. fletcher32 and compression are pointless with vlen dtype datasets. So this is not an h5py bug.

It would still be nice to have a more elaborate error message in h5py.
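
For anyone who really needs the checksum on the string payload: one alternative (not discussed in this thread) is a fixed-length bytes dtype, which keeps the characters inside the chunks so the filter pipeline actually processes them. A sketch, with "S64" as an arbitrary maximum length:

import h5py

with h5py.File("test_fixed.h5", mode="w") as h5:
    # Fixed-length byte strings are stored inside the chunks, so gzip and
    # fletcher32 really apply to the character data.
    h5.create_dataset("peter_fixed",
                      (10,),
                      dtype="S64",          # arbitrary maximum length
                      maxshape=(None,),
                      chunks=True,
                      fletcher32=True,
                      compression="gzip")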

@takluyver
Member

I'm hesitant to try to be clever from the h5py side. If we add a check and raise an error before creating the dataset, and then a future version of HDF5 makes checksumming vlen data valid, then that's a bug in h5py. And automatically diagnosing errors after the fact is hard.

There are plenty of errors where the message we get from HDF5 is not especially clear or specific (this example is pretty clear compared to some). I'd rather not set a precedent that h5py should be trying to intercept them and provide better error messages, because a) that's a mammoth task, and b) it sounds like a bug minefield.

@paulmueller
Contributor Author

That sounds reasonable. Thanks for your help!

@keszybz

keszybz commented Oct 12, 2021

Yeah, I agree… Trying to improve this in h5py would only complicate things in the long run.

@takluyver
Member

Thanks for being understanding about it! 🙂

mennthor pushed a commit to mennthor/PyTables that referenced this issue Dec 16, 2021