
[feature request] Discover and load compression plugins #2163

Open

flying-sheep opened this issue Oct 12, 2022 · 27 comments
@flying-sheep
Contributor

flying-sheep commented Oct 12, 2022

As mentioned in #2161 and #928 (comment), the official way for packages to support plugins is using entry points.

By officially supporting plugins that register new compression schemes, unpythonic and error-prone mechanisms like hdf5plugin modifying h5py’s global variables on import can be avoided.

hdf5plugin could then register itself, e.g. like this:

[project.entry-points.'h5py.compression']
lz4 = 'hdf5plugin:LZ4'
zstd = 'hdf5plugin:Zstd'

While h5py loads the registered plugins like this:

from collections import ChainMap
from importlib.metadata import entry_points

FILTERS_BUILTIN = {'gzip': ..., 'lzf': ..., 'szip': ...}
# entry_points(group=...) needs Python 3.10+; older versions can use the
# importlib_metadata backport.
FILTERS_PLUGIN = {ep.name: ep.load() for ep in entry_points(group='h5py.compression')}
# ChainMap checks FILTERS_BUILTIN first, so built-in filters win on name clashes.
FILTERS = ChainMap(FILTERS_BUILTIN, FILTERS_PLUGIN)
@vasole
Contributor

vasole commented Oct 12, 2022

By officially supporting plugins that register new compression schemes, unpythonic and error-prone mechanisms like hdf5plugin modifying h5py’s global variables can be avoided.

I have nothing against people promoting their own developments, but not at the cost of giving wrong information (to be kind) about other developments.

As far as I know hdf5plugin DOES NOT modify global variables. It just registers filters with HDF5.

@ivirshup

I would support h5py providing an entry points mechanism for registering filters, but would phrase the reasoning a bit differently.

Right now users have to import hdf5plugin to use its filters through h5py.

This means users and applications have to be aware of whether the compression of their data is provided by hdf5plugin. If instead h5py could know that hdf5plugin was installed, it could register its compressors ahead of time.

Overall this would be a much nicer experience downstream. This request in anndata is an example of something that could be resolved.
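For concreteness, a minimal sketch of the status quo (file and dataset names are hypothetical):

# Today: reading e.g. a Zstd-compressed dataset only works if hdf5plugin
# has been imported, because importing it registers the filters with libhdf5.
import hdf5plugin  # noqa: F401  (imported only for its side effect)
import h5py

with h5py.File('data.h5', 'r') as f:  # hypothetical file
    arr = f['array'][...]  # would fail with OSError without the import above

With entry points, the import line would no longer be needed just for reading.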

@t20100
Contributor

t20100 commented Oct 12, 2022

From the hdf5plugin side, I don't see an issue with looking at an entry point-based integration with h5py (silx-kit/hdf5plugin#14 (comment)).

unpythonic and error-prone mechanisms like hdf5plugin modifying h5py’s global variables can be avoided.

Using entry points will only spare the import hdf5plugin line (or an import of another package providing compression filters, like bitshuffle); it will not change how compression filters are registered with libhdf5, because compression filters are managed by libhdf5, not h5py.

It's only a question of explicit/implicit loading of compression filters.

@vasole
Contributor

vasole commented Oct 12, 2022

This means users and applications have to be aware of whether the compression of their data is provided by hdf5plugin.
If instead h5py could know that hdf5plugin was installed, it could register its compressors ahead of time.

What would that change with respect to HDF5 loading plugins via HDF5_PLUGIN_PATH?

hdf5plugin respects the user environment, thus if a filter is provided via HDF5_PLUGIN_PATH (or any other means), it will be respected.

If one does not want to use hdf5plugin, one simply does not import it. It only provides missing filters.
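For example, a minimal sketch (the plugin directory is hypothetical; the variable only needs to be set before libhdf5 first searches for plugins):

import os

# Point libhdf5 at a directory of compiled filter plugins. Setting it
# before importing h5py ensures libhdf5 will see it.
os.environ['HDF5_PLUGIN_PATH'] = '/opt/hdf5/plugins'  # hypothetical path

import h5py

with h5py.File('data.h5', 'r') as f:
    data = f['compressed'][...]  # libhdf5 auto-loads the needed filter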

@vasole
Contributor

vasole commented Oct 12, 2022

Right now users have to import hdf5plugin to use its filters through h5py.

Wrong.

They can use HDF5_PLUGIN_PATH, which is the default mechanism provided by the HDF5 library.

They can also register filters themselves via the API provided by HDF5 and wrapped by h5py.
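For instance, a sketch using h5py's low-level wrappers (the directory is hypothetical; 32015 is the HDF5 filter ID registered for Zstandard):

import h5py

ZSTD_ID = 32015  # filter IDs are assigned by the HDF Group registry

# Ask libhdf5 whether a filter is currently available.
print(h5py.h5z.filter_avail(ZSTD_ID))

# Extend the plugin search path at runtime through the wrapped H5PL API.
h5py.h5pl.append(b'/opt/hdf5/plugins')  # hypothetical directory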

@flying-sheep
Contributor Author

flying-sheep commented Oct 12, 2022

I have nothing against people promoting their own developments, but not at the cost of giving wrong information (to be kind) about other developments.

True, that’s too strong language, apologies. What I mean is that importing a module should come without side effects, and having the pluggable project do the registering ensures that it’s done the right way and at the right point in time.

@ivirshup

ivirshup commented Oct 12, 2022

@vasole

If instead h5py could know that hdf5plugin was installed, it could register its compressors ahead of time.

What would that change respect to HDF5 loading plugins via HDF5_PLUGIN_PATH?

I don't think it would change anything. Honestly, I would be fine with h5py just importing hdf5plugin if it's available.

I have no opinions about how HDF5 loads plugins, just the usage from python.

Right now users have to import hdf5plugin to use its filters through h5py.

Wrong.

They can use HDF5_PLUGIN_PATH, which is the default mechanism provided by the HDF5 library.

I agree there are other ways to deal with the hdf5 library, but to use hdf5plugin I have to import it. It would be nice if users had access to its filters just from it being installed.

@vasole
Contributor

vasole commented Oct 12, 2022

Apologies accepted, although what is unpythonic and error-prone is the mechanism provided by libhdf5, not hdf5plugin.

Even if you wipe out hdf5plugin, the problem is still there.

In the end, it is a question of preferring implicit or explicit.

From the Zen of Python (https://peps.python.org/pep-0020/), the latter seems to be preferred. In any case, it is up to the h5py developers to decide.

The Zen of Python

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than right now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

@ivirshup

@vasole

what is unpythonic and error-prone is the mechanism provided by libhdf5, not hdf5plugin.

I also agree with this. This is a big part of why I would like the handling of plugin loading to happen at h5py's import time, and not my library's import time.

I do not have the skill set to fix any issues that arise from libhdf5's process. The developers of h5py and hdf5plugin do.

@t20100 and @vasole

From the Zen of Python (https://peps.python.org/pep-0020/), the latter seems to be preferred. In any case, it is up to the h5py developers to decide.

It's only a question of explicit/implicit loading of compression filters.

Yes. However, I would say loading of compression is already implicit, both for the filters h5py provides and for things registered with libhdf5. Implicit handling of compression is the norm for many libraries, including pandas and zarr.

I also think an import having happened at any point isn't much more explicit than a package being installed.

I would like it if the level of knowledge required to get implicit support of more hdf5 compressors was "can install a python package" instead of "can manage hdf5 filter plugins".

@t20100
Contributor

t20100 commented Oct 12, 2022

However I would say loading of compression is already implicit both for the ones h5py provides and for things registered to libhdf5. Implicit handling of compression is the norm for many libraries, including pandas and zarr.

Indeed, that's a good argument.

@takluyver
Member

I have mixed feelings about this.

HDF5 has its own mechanisms to discover and load plugins dynamically. I totally agree that they're somewhat clumsy - partly because that's just an awkward thing to do in C code, and partly because the mechanisms they're based around - fixed default paths and environment variables - don't really fit well with our world of Python environments.

However, I'm wary of trying to add another layer of smart plugin discovery on top of what's already there - I think it has the potential to be confusing in corner cases, e.g. if HDF5's native plugin machinery would find one version of a plugin and the (hypothetical) Python entry points mechanism would find another. This is compounded by the fact that HDF5 plugins use global state (e.g. the plugin search path and the loaded plugins), so we can't say 'use this filter just for this read' without modifying global state. And it's not always clear when HDF5 is modifying its state - e.g. checking if a filter is available with H5Zfilter_avail can load plugins, so doing this before or after modifying the plugin path could change what filter it uses.

I'm also against loading entry points on import, because this involves scanning the metadata of every installed Python package (including all those that don't offer HDF5 filters), which can be quite slow in a large environment. So if we do go for entry points, I'd want to only look for extra filters when we want to read/write data with a filter that's not already available.

If we do, one question would be whether h5py should register (or add to the search path) all the plugins it can find when this mechanism is triggered, or whether the entry points should be keyed by HDF5 filter ID, so that it selectively registers only the plugins needed for the filters you're trying to use.
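A rough sketch of the second option, assuming a hypothetical entry point group whose names are HDF5 filter IDs:

from importlib.metadata import entry_points

def load_filter_plugin(filter_id):
    """Look for a plugin providing one missing filter, on demand."""
    # Hypothetical convention, e.g. in the provider's pyproject.toml:
    # [project.entry-points.'h5py.filters']
    # 32015 = 'hdf5plugin:Zstd'
    for ep in entry_points(group='h5py.filters'):
        if ep.name == str(filter_id):
            ep.load()  # loading registers the filter with libhdf5
            return True
    return False

This would only scan package metadata once a read or write actually hits an unavailable filter.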

@flying-sheep
Contributor Author

I think it has the potential to be confusing in corner cases, e.g. if HDF5's native plugin machinery would find one version of a plugin and the (hypothetical) Python entry points mechanism would find another

That’s easy to fix by adding a way to output diagnostic data (how and where each loaded plugin comes from) and issuing warnings when that case happens.

which can be quite slow in a large environment

Can you provide statistics for that claim? I never noticed a perceptible delay in any importlib.metadata call.

@takluyver
Member

That’s easy to fix by adding a way to output diagnostic data (how and where each loaded plugin comes from) and issuing warnings when that case happens.

I don't think HDF5 exposes APIs to make this possible in h5py. I don't think we can tell where a filter is loaded from, or when there is >1 instance of a filter findable on a search path. I'm not even sure we can get a list of filters already loaded.

Can you provide statistics for that claim? I never noticed a perceptible delay in any importlib.metadata call.

Xarray effectively does this on import (to get its own version number 😞 ), and can often take a couple of seconds to load in realistic settings. It's hard to get good measurements because of all the layers of caching that I can't easily control, but doing hundreds of file lookups on a network filesystem adds up.

I'm not the only maintainer here, but I'm going to be strongly -1 on anything that involves scanning entry points before they're needed.
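For what it's worth, a crude way to measure the scan cost in a given environment (caching makes the numbers unreliable, as said):

import time
from importlib.metadata import entry_points

start = time.perf_counter()
# This scans the metadata of every installed package, not just filter providers.
found = list(entry_points(group='h5py.compression'))
print(f'{len(found)} entry points in {time.perf_counter() - start:.3f}s')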

@flying-sheep
Contributor Author

flying-sheep commented Oct 17, 2022

So if we do go for entry points, I'd want to only look for extra filters when we want to read/write data with a filter that's not already available.

That’s exactly what we want to avoid. Modifying global state just before IO makes things really hard to debug, especially since that happens in native code, and especially because of what you say here:

I don't think HDF5 exposes APIs to make this possible in h5py. I don't think we can tell where a filter is loaded from, or when there is >1 instance of a filter findable on a search path. I'm not even sure we can get a list of filters already loaded.

Terrifying. So how about aiming for a fix there first?

Relying on HDF5_PLUGIN_PATH instead means that hdf5plugin either does nothing until people configure that env variable (hard to discover), modifies that env variable at runtime (generally to be avoided), or does something equally confusing.

And as said above, whether we rely on that env variable or not, there should be additional ways to debug things.

Xarray effectively does this on import (to get its own version number 😞), and can often take a couple of seconds to load in realistic settings.

I know that pkg_resources required a whopping 150ms to import in a small python environment. Switching to importlib.metadata improved that a lot: mtkennerly/dunamai#29

Are you sure your version of Xarray is already using importlib.metadata and your disk cache is warm? Even on a network file system, two seconds for a warm cache sounds like it shouldn’t be a thing.

Also: I reserve my slowest storage for the most sequential data (media files) and my fastest for code and databases. Maybe using a network drive for coding is just not ideal?

@vasole
Contributor

vasole commented Oct 17, 2022

Relying on HDF5_PLUGIN_PATH instead means that hdf5plugin either does nothing until people configure that env variable (hard to discover), modifies that env variable at runtime (generally to be avoided), or does something equally confusing.

Do you really know how the HDF5 library plugin machinery works?

Do you know how the hdf5plugin module works? Because it is exactly the contrary of what you have written:

  • hdf5plugin might do nothing if that plugin variable has been configured.
  • hdf5plugin does not modify that environment variable.

Answer my two questions above for yourself, and then you will figure out whether you are in a position to evaluate if things are confusing, or whether you did not do your homework. From all the nonsense you have been constantly throwing at the hdf5plugin module, I doubt you can answer yes to those two questions.

@flying-sheep
Contributor Author

flying-sheep commented Oct 17, 2022

You misinterpreted what I wrote. I gave three mutually exclusive options for how we could make hdf5plugin work. Obviously it doesn’t work in all three ways at the same time now.

From all the nonsense you have been constantly throwing at the hdf5plugin module

Looks like something I wrote offended you in some way. That wasn’t my intention. I’m just here to offer my experience regarding UX from a Python viewpoint. I’m not claiming to have any objective truths, but I have spent a long time interacting with Python packages and the (scientific and otherwise) Python community. I therefore think that my intuitions about what’s expected from a good citizen in the Python package world are close to the mark.

Now, when those expectations are violated by some package, that still doesn’t mean that the package’s maintainers are wrong, just that I think there should be reasons why the package diverges from common expectations.

I hope that helps!

@t20100
Contributor

t20100 commented Oct 17, 2022

h5py and hdf5plugin are constrained by libhdf5's API and behaviour (as are all library wrappers), so it's not always easy or possible to be Pythonic.

Now, to focus on the entry point/behaviour issue, from the hdf5plugin side:

  • We are ready to adapt to an entry point integration if the h5py dev team is taking this way.
  • To me, it is not worth breaking the current behaviour of import hdf5plugin loading HDF5 filters in favour of, e.g., import hdf5plugin; hdf5plugin.load_filters().

@aragilar
Member

aragilar commented Oct 17, 2022

Looking at this bug, I think two things are being conflated:

  1. The need to explicitly import hdf5plugin (vs. loading via an entry point) when reading (as likely you'll want to import hdf5plugin to get its nice filter wrappers when writing).
  2. Debugging why filters are not working is hard.

For 1, given that as far as I know hdf5plugin is the only python package providing HDF5 filters that h5py can use (PyTables is a bit special), adding an entry-point system is overall worse for everyone (as either you import hdf5plugin and it's not needed or you don't realise you have compressed data and you haven't installed hdf5plugin). Entry-points are not going to help with debugging, as they won't tell you about the state of libhdf5 in the current process.

For 2, hdf5plugin does appear to log almost all of what it does with respect to loading filters (the only bit missing would be an explicit "hdf5plugin has loaded <plugin> at <path>"), and libhdf5 can have its logging enabled as well (by default h5py switches off the stdout/stderr logging of libhdf5). It doesn't look like there's an easy way (given the libhdf5 API for filters) to work out where filters come from (unless somehow you've got a way to get a reference to the underlying filter function and a way to work out where that function comes from, which is likely OS-specific), and I don't think there's much chance of that changing (at best, some of that information could be in the name field passed to H5Zregister).

On the hdf5plugin side, I could see adding a debug_info() function (feel free to choose a different name) which gives details of what it has done (e.g. what version it is, the versions of the filters it's loaded, what failed) as a possible improvement.

On the libhdf5 side, as far as I know, there's no way to get details of what libhdf5 has autoloaded (e.g. path to the shared libraries), but if someone (@flying-sheep?) looked at libhdf5 and added functions to extract that info, I don't see any reason why h5py would not wrap them.
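For debugging what happens today, a sketch of the knobs that already exist (assuming hdf5plugin logs through a standard logger named 'hdf5plugin'):

import logging
import h5py._errors

# Show what hdf5plugin logs while registering filters.
logging.basicConfig(level=logging.DEBUG)
logging.getLogger('hdf5plugin').setLevel(logging.DEBUG)

# Re-enable libhdf5's own stderr error stack, which h5py silences by default.
h5py._errors.unsilence_errors()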

@t20100
Contributor

t20100 commented Oct 17, 2022

hdf5plugin is the only python package providing HDF5 filters that h5py can use

That's no longer true: bitshuffle now also provides its HDF5 filter the same way hdf5plugin does (though hdf5plugin still provides the bitshuffle filter).
Except for this, AFAIK there is no other Python package providing filters.

hdf5plugin does appear to log almost all of what it does with respect to loading filters (the only bit missing would be an explicit "hdf5plugin has loaded <plugin> at <path>")

On the hdf5plugin side, I could see adding a debug_info() function (feel free to choose a different name) which gives details of what it has done (e.g. what version it is, the versions of the filters it's loaded, what failed) as a possible improvement.

Sure, I opened silx-kit/hdf5plugin#185 for it.

@flying-sheep
Contributor Author

flying-sheep commented Oct 17, 2022

h5py and hdf5plugin are constrained by libhdf5's API and behaviour (as are all library wrappers), so it's not always easy or possible to be Pythonic.

That’s exactly what I meant in my last paragraph. It makes sense to invest thought into those boundaries, because if it turns out that making things idiomatic is possible, that’s a big win!

either you import hdf5plugin and it's not needed …

Not a problem if importing it has no side effects, takes something like 0.1 ms, and uses memory in the kilobyte range or less.

… or you don't realise you have compressed data and you haven't installed hdf5plugin

I think that’s a great starting point for defining intended behavior.

In kinda user story format, and combined with the reason for filing this issue:

  • I downloaded an HDF5 file. When attempting to load it with h5py, I want to be informed if I don’t have the necessary filters installed and how to get them. (Maybe a link to some wiki/docs page? See the sketch after this list.)
  • As a library author, I want to allow my users to select a compression filter from a plugin without adding logic that registers that plugin on demand.
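A sketch of what the first story could look like (check_filters and the hint text are hypothetical; the introspection uses h5py's low-level API):

import h5py

def check_filters(dset):
    """Raise a helpful error if a dataset uses a filter that isn't loaded."""
    dcpl = dset.id.get_create_plist()
    for i in range(dcpl.get_nfilters()):
        filter_id = dcpl.get_filter(i)[0]  # (id, flags, values, name)
        if not h5py.h5z.filter_avail(filter_id):
            raise OSError(
                f'Dataset {dset.name!r} needs HDF5 filter {filter_id}, which '
                'is not loaded. Installing hdf5plugin may provide it.'
            )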

@vasole
Contributor

vasole commented Oct 17, 2022

@flying-sheep It is much more constructive to state needs than to provide wrong information about other people's work.

In hdf5plugin we tried as best we could not to be intrusive:

  • It is not necessary to import hdf5plugin prior to h5py
  • hdf5plugin does not try to load a filter that has been already loaded by any other means
  • hdf5plugin does not modify HDF5_PLUGIN_PATH
  • On my machine, the command import h5py; import time; t0 = time.time(); import hdf5plugin; print("Elapsed = %.3f seconds" % (time.time() - t0)) reports 0.011 seconds.

Concerning your needs:

  • For reading, one solution (from a user point of view) could be that h5py tries to import hdf5plugin when encountering the error OSError: Can't read data (can't open directory: /usr/local/hdf5/lib/plugin). h5py could be configured to make that behavior acceptable or not. The default could be to try to import hdf5plugin, but not silently if the behavior was not explicitly chosen (at the very least a warning should be issued). Developers would therefore see no influence at all (memory footprint, loading time, dependencies...) until the functionality is actually needed. Given that the hdf5plugin developers have not developed the plugin filters themselves and are just making them available, I see a danger if h5py is perceived as endorsing the usage of hdf5plugin to the detriment of the original plugin authors' work. (A sketch of this fallback follows below.)

  • For writing, I think things should be explicit.

It's up to h5py developers to take a decision.
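A minimal sketch of that read fallback (read_with_fallback is hypothetical, not an existing h5py API; whether it is opt-in would be an h5py configuration decision):

import importlib
import warnings

def read_with_fallback(dset):
    """Read a dataset; on a filter error, try importing hdf5plugin once."""
    try:
        return dset[...]
    except OSError as exc:
        try:
            importlib.import_module('hdf5plugin')  # registers its filters
        except ImportError:
            raise exc  # hdf5plugin not installed: surface the original error
        warnings.warn('Imported hdf5plugin to provide a missing HDF5 filter')
        return dset[...]  # retry once with the extra filters registered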

@vasole
Contributor

vasole commented Oct 17, 2022

It would be so simple if the work done by the hdf5plugin developers could be done by the HDF Group ...

@ajelenak
Contributor

In kinda user story format, and combined with the reason for filing this issue:

  • I downloaded an HDF5 file. When attempting to load it with h5py, I want to be informed if I don’t have the necessary filters installed and how to get them. (Maybe a link to some wiki/docs page?)

Opening an HDF5 file is a very simple operation. The above would make that operation much more complicated because it would require scanning all the HDF5 datasets in that file to find out their filter information. Opening performance could be very bad so I am not in favor of something like this.

@epourmal

In kinda user story format, and combined with the reason for filing this issue:

  • I downloaded an HDF5 file. When attempting to load it with h5py, I want to be informed if I don’t have the necessary filters installed and how to get them. (Maybe a link to some wiki/docs page?)

Opening an HDF5 file is a very simple operation. The above would make that operation much more complicated because it would require scanning all the HDF5 datasets in that file to find out their filter information. Opening performance could be very bad so I am not in favor of something like this.

I agree. It would be much easier to add an "info" API similar to what the "h5dump -Hp -A 0" command produces. If an HDF5 file was created with paged allocation, the info API will be pretty fast.

@flying-sheep
Contributor Author

Opening an HDF5 file is a very simple operation. The above would make that operation much more complicated because it would require scanning all the HDF5 datasets in that file to find out their filter information. Opening performance could be very bad so I am not in favor of something like this.

I guess I could have been more specific here: with “load” I meant “read fully and convert to Python types”, specifically attempting to read some array that has an unknown filter. No need to scan the whole thing, just a good error message (which might already be happening, for all I know!).

@ivirshup

Just pinging back on this with another data point: numcodecs (how zarr does this) defines an entry point group and checks for entries at import time. Packages providing a codec are loaded when needed.

@ajelenak
Contributor

It's not exactly comparable, because there the zarr Python package plays the same role as the HDF5 library does here.
