Provide an interface for registering filter plugins #928

Closed · aragilar opened this issue Sep 19, 2017 · 16 comments
@aragilar (Member)

Currently, filters need to be built into HDF5, packaged as part of h5py, or registered by the filter provider via HDF5_PLUGIN_PATH. None of these options is ideal, and they can lead to hard-to-debug problems (see #923). HDF5 provides a function that appends to its internal plugin search path (rather than relying on an environment variable), which h5py currently does not wrap. We should wrap it, provide fallbacks for older HDF5 versions, and offer a way of querying the current path (if possible).

See silx-kit/hdf5plugin#14 for further discussion about what the interface should look like.
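
For illustration, a minimal sketch of what calling that function directly could look like via ctypes, before any h5py wrapper exists (the library lookup is an assumption, and the handle must resolve to the same libhdf5 that h5py is linked against; H5PLappend requires HDF5 >= 1.10.1):

```python
# Workaround sketch, not an h5py API: extend HDF5's internal plugin search
# path by calling H5PLappend directly.
import ctypes
import ctypes.util

_libname = ctypes.util.find_library("hdf5")  # assumption: may need an explicit path,
if _libname is None:                         # and must match h5py's own libhdf5
    raise OSError("libhdf5 not found")
_libhdf5 = ctypes.CDLL(_libname)
_libhdf5.H5PLappend.argtypes = [ctypes.c_char_p]
_libhdf5.H5PLappend.restype = ctypes.c_int   # herr_t

def append_plugin_path(path):
    """Append a directory to HDF5's internal filter-plugin search path."""
    if _libhdf5.H5PLappend(path.encode()) < 0:
        raise OSError("H5PLappend failed for %r" % path)
```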

@aldanor commented Dec 23, 2017

One way is through setuptools entry points, kind of like how pytest registers its plugins, IIRC.
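
A hypothetical sketch of what entry-point discovery might look like; the group name "h5py.filters" and the callable protocol are illustrative assumptions, not an existing h5py API:

```python
# Hypothetical plugin discovery via package metadata. Plugin packages would
# declare an entry point in the "h5py.filters" group (name is an assumption).
from importlib.metadata import entry_points  # Python 3.10+ for the group= form

def load_filter_plugins():
    for ep in entry_points(group="h5py.filters"):
        register = ep.load()  # each plugin package exposes a callable...
        register()            # ...that e.g. appends its directory to the search path
```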

@aldanor commented Mar 16, 2018

@aragilar Any thoughts/progress re: this?

@vasole (Contributor) commented Mar 16, 2018

I am waiting for HDF5 1.10.2 to be released to see whether plugins can be dealt with properly.

In principle, that version should allow us to check which plugins are available and to add new ones as needed, thus solving the problems encountered in the hdf5plugin module (it has to be imported before h5py, it cannot add plugins selectively, ...).

@aparamon (Member)

The HDF5 library now provides a set of procedures (the H5PL family) to customize plugin discovery:
https://support.hdfgroup.org/HDF5/doc/RM/RM_H5PL.html

@aparamon (Member) commented Dec 21, 2018

@tacaswell @takluyver What do you think of providing a set of pre-built popular plugins as a separate pip-installable package, something like https://pypi.org/project/hdf5plugin?

With the new H5PL* procedures, it seems the restriction "import hdf5plugin before importing h5py" can be lifted.

Maybe even better would be to package each plugin as a separate pip package, so that users have fine-grained control over which plugins to install (note that the plugins generally have different licenses). h5py's LZF implementation could be extracted into a separate package too.
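
For illustration, a per-plugin pip package could register its bundled filter at import time along these lines; this assumes an h5py wrapper for the H5PL search path (which did not exist at the time of writing, but was later added by #1256), and the package layout is hypothetical:

```python
# Hypothetical __init__.py for a pip-installable filter-plugin package.
# Assumes the compiled filter (.so/.dylib/.dll) is shipped in a "plugins"
# subdirectory and that h5py exposes the H5PL path API as h5py.h5pl.
import os
import h5py.h5pl

_PLUGIN_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "plugins")
h5py.h5pl.append(_PLUGIN_DIR.encode())  # import order relative to h5py no longer matters
```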

@takluyver (Member) commented Dec 21, 2018 via email

@vasole (Contributor) commented Dec 21, 2018

At this point hdf5plugin contains the plugins needed to read data from our detectors. I intend to add the Blosc family too.

Please consider that, in order to supply plugin filters usable everywhere, we had to patch out the plugins' direct calls to the HDF5 library under Linux and macOS; otherwise they are bound to a particular version of the HDF5 library. The price to pay is that those filters can only be used for reading under Linux and macOS. If they could be shipped built against the same library as h5py, the limitation would disappear.

@vasole (Contributor) commented Dec 21, 2018

I forgot to add that we would be willing to help.

@aparamon (Member) commented Dec 21, 2018

@takluyver So far I have had great success with the byte- (or bit-) shuffle + LZ4 combo: my data (mass spectra) typically compresses ~10x (slightly worse than zlib), compression is fast (unlike zlib), and decompression beats uncompressed reads from an SSD.
Zstandard is also very cool, beating zlib on all fronts (though it decompresses more slowly than LZ4).
Blosc and Blosc2 are promising approaches that I'm going to try on my data soon. The general idea of a meta-compression pipeline works extremely well on modern CPUs.
New algorithms pop up here and there all the time, almost every one claiming to be Pareto-optimal ;-)
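
For reference, writing through Bitshuffle+LZ4 from h5py looks roughly like this once the plugin is discoverable; the filter ID 32008 is bitshuffle's registered ID, and the (0, 2) options follow its convention of (block size: 0 = auto, inner compressor: 2 = LZ4), though the exact values here are illustrative:

```python
# Illustrative: write a chunked dataset through the Bitshuffle+LZ4 filter by
# its registered HDF5 filter ID. The plugin must be on HDF5's search path.
import h5py
import numpy as np

BITSHUFFLE_ID = 32008  # registered filter ID for bitshuffle

data = np.random.randint(0, 4096, size=(1024, 1024), dtype=np.uint16)
with h5py.File("spectra.h5", "w") as f:
    f.create_dataset("data", data=data, chunks=(64, 1024),
                     compression=BITSHUFFLE_ID,
                     compression_opts=(0, 2))  # (auto block size, LZ4)
```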

@vasole The inconvenient architecture of HDF5 filter plugins (the HDF5 library dynamically loads plugins, but the plugins in turn link against and dynamically load the HDF5 library) is a known problem, unfortunately:
https://github.com/zarr-developers/zarr/issues/317#issuecomment-431947526
It was also discussed with @epourmal, and she agreed in principle that back-linking from the plugin to the library was a bad architectural design decision.
I was surprised to learn that it causes problems on POSIX systems too: I thought it was most painful on Windows. I now believe the most valuable outcome would be for regular h5py users to be shielded from these depressing internals and able to just pip install the desired plugins.
Did I understand correctly that you have a working solution that makes the plugins work painlessly under the hood?

@vasole (Contributor) commented Dec 21, 2018

@aparamon

Well, the LZ4 filter plugin does not suffer from that problem, while the Bitshuffle+LZ4 one does. My understanding is that if a filter follows the directives of the HDF5 Group, it does not need to link against the library.

@vasole (Contributor) commented Dec 21, 2018

Sorry, comment updated.

@aparamon (Member) commented Dec 21, 2018

> My understanding is that if a filter follows the directives of the HDF5 Group it does not need to link against the library.

@vasole Quite the contrary; see the H5allocate_memory docs:

> At this time, the only intended use for this function is to allocate memory that will be returned to the library as a data buffer from a third-party filter.

I read it as: your filter should use H5allocate_memory and therefore link against the HDF5 library.

In practice, though, I found that approach much more fragile than simply ensuring that the same memory manager is used inside both the library and the plugin (on POSIX that is a 99.9% reliable assumption anyway). And many plugin authors (including your obedient servant) do ignore the guideline and just go ahead with malloc/free.

Another aspect is inspecting the data type/data space. Byte-level compressors rarely need this, but it is crucial for e.g. fpzip or MAFISC. For that, linking against HDF5 is the only elegant option (duplicating the data type/data space description in the filter parameters is not considered "elegant"; besides, the filter parameter interfaces are already settled).

@takluyver (Member)
Aha, so it's mostly about different compression filters?

If you're interested, go ahead. I haven't investigated the H5PL machinery, but it sounds like that gives you a neat way to make it usable. Does this require any change in h5py itself?

I'm not interested in working on compression plugins myself at the moment: we want our data files to be readily readable by different tools written in different languages, not just Python, so for now the downsides of requiring a plugin to read data outweigh the benefits.


@vasole (Contributor) commented Aug 14, 2020

@aragilar @takluyver

This should be closed by #1166, which was rebased, continued, and merged as #1256.
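
For readers landing here later: the interface merged in #1256 is exposed as the low-level h5py.h5pl module. A minimal usage sketch, assuming h5py >= 2.10 (paths are bytes, per the low-level API):

```python
import h5py.h5pl

h5py.h5pl.append(b"/opt/hdf5/plugins")  # add a directory to the plugin search path
for i in range(h5py.h5pl.size()):       # inspect the current search path
    print(h5py.h5pl.get(i))
```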

@aragilar (Member, Author)

@vasole Thanks for letting us know (and for your filter work).
