[feature request] Discover and load compression plugins #2163
Comments
> By officially supporting plugins that register new compression schemes, unpythonic and error prone mechanisms like hdf5plugin modifying h5py’s global variables can be avoided.

I have nothing against people promoting their own developments, but not at the cost of giving wrong information (to be kind) about other developments. As far as I know, hdf5plugin DOES NOT modify global variables. It just registers filters with HDF5.
I would support this. Right now users have to `import hdf5plugin` explicitly, which means users and applications have to be aware whether the compression of their data is provided by h5py itself or by a plugin. Overall this would be a much nicer experience downstream. This request in …
From …: using entry points will only spare the explicit `import`; it’s only a question of explicit vs. implicit loading of compression filters.
What would that change with respect to HDF5 loading plugins via `HDF5_PLUGIN_PATH`? hdf5plugin respects the user environment: if a filter is provided via `HDF5_PLUGIN_PATH` (or any other means), it will be respected. If one does not want to use hdf5plugin, one simply does not import it. It only provides missing filters.
Wrong. They can use `HDF5_PLUGIN_PATH`, which is the default mechanism provided by the HDF5 library. They can also register filters themselves via the API provided by HDF5 and wrapped by h5py.
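For illustration, a minimal sketch of the environment-variable route; the plugin directory below is a hypothetical example, and the variable must be set before libhdf5 is initialised, i.e. before h5py is first imported:

```python
import os

# HDF5 consults HDF5_PLUGIN_PATH when the library initialises, so this
# must run before h5py (and therefore libhdf5) is imported.
# The directory below is a hypothetical example.
os.environ.setdefault("HDF5_PLUGIN_PATH", "/opt/hdf5/plugins")

# import h5py  # would now see the extra plugin directory on first use
```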
True, that’s too strong language, apologies. What I mean is that importing a module should come without side effects, and having the pluggable project do the registering makes sure that it’s done the right way and at the right point in time.
I don't think it would change anything. Honestly, I would be fine with …. I have no opinions about how HDF5 loads plugins, just the usage from Python.
I agree there are other ways to deal with the HDF5 library, but to use hdf5plugin I have to import it. It would be nice if users had access to its filters just from it being installed.
Apologies accepted, although what is unpythonic and error prone is the mechanism provided by HDF5 itself. Even if you wipe out hdf5plugin, the problem is there. To me, at the very end, it is a question of preferring implicit or explicit. From the Zen of Python (https://peps.python.org/pep-0020/), the latter seems to be preferred: “Explicit is better than implicit.” In any case, it is up to the h5py developers to decide. “Beautiful is better than ugly.”
I also agree with this. This is a big part of why I would like the handling of plugin loading to happen at h5py import time, and not my library’s import time. I do not have the skill set to fix any issues that arise from libhdf5's process. The developers of h5py and hdf5plugin do.
Yes. However, I would say loading of compression is already implicit, both for the filters h5py provides and for things registered with libhdf5. Implicit handling of compression is the norm for many libraries, including …. I also think an import having happened at any point isn't much more explicit than a package being installed. I would like it if the level of knowledge required to get implicit support for more HDF5 compressors was "can install a Python package" instead of "can manage HDF5 filter plugins".
Indeed, that's a good argument.
I have mixed feelings about this. HDF5 has its own mechanisms to discover and load plugins dynamically. I totally agree that they're somewhat clumsy - partly because that's just an awkward thing to do in C code, and partly because the mechanisms they're based around - fixed default paths and environment variables - don't really fit well with our world of Python environments.

However, I'm wary of trying to add another layer of smart plugin discovery on top of what's already there. I think it has the potential to be confusing in corner cases, e.g. if HDF5's native plugin machinery would find one version of a plugin and the (hypothetical) Python entry points mechanism would find another.

This is compounded by the fact that HDF5 plugins use global state (e.g. the plugin search path and the loaded plugins), so we can't say 'use this filter just for this read' without modifying global state. And it's not always clear when HDF5 is modifying its state - e.g. checking if a filter is available with … can itself load the plugin.

I'm also against loading entry points on import, because this involves scanning the metadata of every installed Python package (including all those that don't offer HDF5 filters), which can be quite slow in a large environment. So if we do go for entry points, I'd want to only look for extra filters when we want to read/write data with a filter that's not already available.

If we do, one question would be whether h5py should register (or add to the search path) all plugins it can find when this mechanism is triggered, or whether the entry points are keyed by HDF5 filter ID, so that it selectively registers only the plugins needed for the filters you're trying to use.
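A rough sketch of such a lazy lookup, using only the standard library; the entry-point group name `h5py.filters` is purely hypothetical, not an existing h5py convention:

```python
from importlib.metadata import entry_points

# Hypothetical group name that filter packages would declare.
PLUGIN_GROUP = "h5py.filters"

def find_filter_plugins(group=PLUGIN_GROUP):
    """Scan installed distributions for filter plugins, only when called."""
    eps = entry_points()
    try:
        return list(eps.select(group=group))  # Python 3.10+
    except AttributeError:
        return list(eps.get(group, []))       # Python 3.8/3.9 fallback
```

Calling this only when a required filter turns out to be missing keeps the metadata-scan cost off the import path.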
That’s easy to fix by adding a way to output diagnostic data (how and where each loaded plugin comes from) and issuing warnings when that case happens.
Can you provide statistics for that claim? I never noticed a perceptible delay in any ….
I don't think HDF5 exposes APIs to make this possible in h5py. I don't think we can tell where a filter is loaded from, or when there is >1 instance of a filter findable on a search path. I'm not even sure we can get a list of filters already loaded.
Xarray effectively does this on import (to get its own version number 😞), and can often take a couple of seconds to load in realistic settings. It's hard to get good measurements because of all the layers of caching that I can't easily control, but doing hundreds of file lookups on a network filesystem adds up. I'm not the only maintainer here, but I'm going to be strongly -1 on anything that involves scanning entry points before they're needed.
That’s exactly what we want to avoid. Modifying global state just before IO makes things really hard to debug, especially since that happens in native code, and especially because of what you say here:
Terrifying. So how about aiming for a fix there first? Relying on …. And as said above, whether we rely on that env variable or not, there should be additional ways to debug things.
I know that pkg_resources required a whopping 150 ms to import in a small Python environment. Switching to `importlib.metadata` …. Are you sure your version of Xarray is already using importlib.metadata and your disk cache is warm? Even on a network file system, two seconds for a warm cache sounds like it shouldn’t be a thing. Also: I reserve my slowest storage for the most sequential data (media files) and my fastest for code and databases. Maybe using a network drive for coding is just not ideal?
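Numbers like these depend heavily on the environment and cache state, but the scan cost under discussion is easy to measure locally:

```python
import time
from importlib.metadata import entry_points

start = time.perf_counter()
eps = entry_points()  # reads the metadata of every installed distribution
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"entry_points() scan took {elapsed_ms:.1f} ms")
```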
Do you really know how the HDF5 library plugin machinery works? Do you know how the hdf5plugin module works? Because it is exactly the contrary of what you have written.
Try to answer my two questions above yourself, and then you will figure out whether you are in a position to evaluate if things are confusing, or whether you did not do your homework. From all the nonsense you have been constantly throwing at the hdf5plugin module, I doubt you can answer yes to those two questions.
You misinterpreted what I wrote. I gave three mutually exclusive options of how we could make hdf5plugin work. Obviously it doesn’t work in all three ways at the same time now.
Looks like something I wrote offended you in some way. That wasn’t my intention. I’m just here to offer my experience regarding UX from a Python viewpoint. I’m not claiming to have any objective truths. I did spend a long time interacting with Python packages and the (scientific and otherwise) Python community. I therefore think that my intuitions about what’s expected from a good citizen in the Python package world are close to what’s true. Now when those expectations are violated by some package, that still doesn’t mean that this package’s maintainers are wrong, just that I think there should be reasons why the package diverges from common expectations. I hope that helps!
Now, to focus on the entry point/behaviour issue, from …:
Looking at this bug, I think two things are being conflated:
For 1: given that, as far as I know, ….

For 2: …. On the … side, …. On the … side, ….
That's no longer true: ….
Sure, I opened silx-kit/hdf5plugin#185 for it.
That’s exactly what I meant in my last paragraph. It makes sense to invest thought into those boundaries, because if it turns out that making things idiomatic is possible, that’s a big win!
Not a problem if importing it has no side effects, takes like 0.1 ms, and uses memory in the kilobyte range or less.
I think that’s a great starting point to define intended behavior. In kinda user-story format, and combined with the reason for filing this issue:

…
@flying-sheep It is much more constructive to state needs than to provide wrong information about other people's work. In ….
Concerning your needs:
It's up to the h5py developers to make a decision.
It would be so simple if the work done by ….
Opening an HDF5 file is a very simple operation. The above would make that operation much more complicated, because it would require scanning all the HDF5 datasets in that file to find out their filter information. Opening performance could be very bad, so I am not in favor of something like this.
I agree. It would be much easier to add an "info" API similar to what the `h5dump -Hp -A 0` command produces. If an HDF5 file was created with paged allocation, the info API will be pretty fast.
I guess I could have been more specific here: with "load" I meant "read fully and convert to Python types", specifically attempting to read some array that has an unknown filter. No need to scan the whole thing, just a good error message (might already be happening, for all I know!).
Just pinging back on this with another data point: ….
It's not exactly comparable, because the zarr Python package there is in the same role as the HDF5 library here.
As mentioned in #2161 and #928 (comment), the official way for packages to support plugins is using entry points.
By officially supporting plugins that register new compression schemes, unpythonic and error prone mechanisms like hdf5plugin modifying h5py’s global state on import can be avoided. The hdf5plugin package could then register itself e.g. like this:
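A sketch of what such a declaration might look like; the entry-point group `h5py.plugins` and the `hdf5plugin.register` callable are hypothetical names used for illustration, not an existing h5py or hdf5plugin API:

```python
# setup.py sketch - the group name and the register callable are hypothetical.
from setuptools import setup

setup(
    name="hdf5plugin",
    entry_points={
        "h5py.plugins": [
            "hdf5plugin = hdf5plugin:register",
        ],
    },
)
```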
While h5py … loads their projects: ….