Use entrypoints to manage drivers. Add subcommand. #236
Before looking at this, I wanted to comment that a data source can specify its driver(s) by giving a fully qualified class name, |
@bryevdv , you may have opinions on managing a plugin system via configuration. Note that the data drivers are just one thing that is pluggable in Intake. |
For my use case, I want data sources to stick with just a name, not a fully-specified DriverClass, to maintain the degree of freedom whereby the consumer of the data source can have their own opinions about which of the potentially-multiple drivers they want to use. |
I am also interested to see more flexible ways to deal with plugins. The current system was basically copied from flask after I did some looking around for better approaches and gave up. The main requirement (in my mind) is that installing a driver (with pip or conda) automatically makes it available in the registry without requiring an explicit registration step from the user. Although most package managers support automated post-install scripts, it is much easier if they are not required. Being able to manage plugins that offer alternative implementations of the same format is interesting. That certainly makes it easier for people to experiment with better ways to read already-supported file formats. |
Yeah, I think it's a good idea to be able to specify which class provides a particular driver, and which are not to be loaded at all, but I don't see an easy way to switch to a model where we don't do the package scan at all, since that's how things are designed now. On the CLI commands, I think they are basically correct (and exactly the same should be available in the API also). The config like
drivers:
  zarr: intake_xarray.xzarr.ZarrSource
  csv: false
  netcdf: intake_xarray.netcdf.NetCDFSource
where "csv" and |
That way, any name that is not explicitly given in the config (which is an action done by a user) will be automatically imported without placing special files anywhere or calling import-time scripts. |
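The resolution rule described in this comment can be sketched as follows. This is only an illustration under stated assumptions: `resolve_drivers`, its `scanned` argument, and the sample scan results are hypothetical and are not part of Intake's API. A config entry naming a class overrides the scan, `false` disables the name, and any name absent from the config falls back to whatever the package scan found.

```python
def resolve_drivers(scanned, config):
    """Combine package-scan results with explicit user config.

    scanned: dict mapping driver name -> class path found by the package scan
    config:  dict mapping driver name -> class path (str) or False (disabled)
    """
    resolved = dict(scanned)  # start from the scan: unlisted names pass through
    for name, choice in config.items():
        if choice is False:
            resolved.pop(name, None)  # explicitly disabled by the user
        else:
            resolved[name] = choice   # user pins a specific implementation
    return resolved


# Made-up scan results, for illustration only.
scanned = {
    "csv": "intake.source.csv.CSVSource",
    "zarr": "some_other_package.ZarrSource",  # hypothetical competing driver
    "textfiles": "intake.source.textfiles.TextFilesSource",
}
# Mirrors the config quoted in the comment above.
config = {
    "zarr": "intake_xarray.xzarr.ZarrSource",
    "csv": False,
    "netcdf": "intake_xarray.netcdf.NetCDFSource",
}
drivers = resolve_drivers(scanned, config)
```

Here "textfiles" survives untouched because the user never mentioned it, which is the "automatically imported" behaviour the comment asks for.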
I agree completely. The breakthrough moment for me was when I learned that this can be done in a plain setup.py via the data_files argument. Other package managers can decide to use post-install hooks instead if they so choose, of course, as Jupyter’s documentation notes, but it’s not necessary.
I agree. We could nonetheless undertake adding this mechanism to all known existing drivers and then add a config option to opt out of the package scan (and thus make import faster and arguably “cleaner”). That option could even potentially become the default in the far future after a long warning cycle. |
If it ends up in |
I have used this pattern in the past:
import os
import sys

if os.name == 'nt':
    # Windows: a single per-user location under %APPDATA%
    _user_conf = os.path.join(os.environ['APPDATA'], 'databroker')
    CONFIG_SEARCH_PATH = (_user_conf,)
else:
    # Elsewhere: per-user, then per-environment (sys prefix), then system-wide
    _user_conf = os.path.join(os.path.expanduser('~'), '.intake')
    _sys_prefix_etc = os.path.join(os.path.dirname(os.path.dirname(sys.executable)),
                                   'etc', 'intake')
    _system_etc = os.path.join('/', 'etc', 'intake')
    CONFIG_SEARCH_PATH = (_user_conf, _sys_prefix_etc, _system_etc)
Of course one has to decide how to compose multiple potentially-conflicting configurations, but I think giving each driver a separate file makes this fairly clean. (I believe that was a factor in driving Jupyter to this approach.) The CLI can be explicit about where it found the config that is stipulating the state of each driver. I don't want to make intake's configuration an order of magnitude more complicated all at once, but having a "sys prefix" solution for conda (and environments generally) may well be worth including in the first iteration. |
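One way to compose potentially conflicting configurations while keeping the CLI explicit about provenance is sketched below. It assumes that each location on the search path has already been parsed into a mapping from driver name to an enabled flag, and that earlier locations take precedence. `merge_driver_states` and the sample locations are hypothetical, not existing Intake functions.

```python
def merge_driver_states(layers):
    """Merge per-location driver settings, highest precedence first.

    layers: list of (location, {driver_name: enabled_bool}) pairs, ordered
            like CONFIG_SEARCH_PATH (user first, system-wide last).
    Returns {driver_name: (enabled_bool, location_that_decided)}.
    """
    merged = {}
    for location, states in layers:
        for name, enabled in states.items():
            # setdefault keeps the first (highest-precedence) value seen
            merged.setdefault(name, (enabled, location))
    return merged


layers = [
    ("~/.intake", {"zarr": False}),               # user disables one driver
    ("/etc/intake", {"zarr": True, "csv": True}),  # sysadmin defaults
]
for name, (enabled, location) in sorted(merge_driver_states(layers).items()):
    state = "enabled" if enabled else "disabled"
    print(f"{name}: {state} (from {location})")
```

Recording the deciding location next to each value is what lets a listing command say where a given driver's state came from.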
Exactly where to put stuff is complex enough that there is a module for it, and in intake we want the config directory to be itself configurable via an environment variable. In addition, we need a firm way to prevent different packages from having conflicting file-names for the nugget that they would want to insert into that location. Finally, it's not clear whether the driver preferences should be per install prefix, user, or global. Sorry if I only seem to be raising problems! |
Haha, no problem. Forethought and wariness are called for here, especially as we'll be stuck with maintaining whatever we decide approximately forever.
In this draft, intake expects to find config in a directory named exactly like the package. We can't prevent packages from stepping on another package's name in error, but I think we are safe from collisions between two correct packages, no?
I believe Jupyter looks in all three of those in that order, shows in the startup logs and CLIs' output which files it has found and used, and merges them. Tricky but good for situations where the sysadmin provides some basics but the user wants to customize "just this one thing". [Edit: Or, in a personal-use context, where you have some general settings in
Nice, I hadn't encountered appdirs yet. It seems quite lightweight, and may be worth using here. |
Sorry to be slow to get to this... Some thoughts, much of which is already covered by this PR.
Previously, individual catalogues could provide a more specific drivers config too (they can still provide what is effectively the first item), this could be considered for reimplementation. |
Sounds good to me, @martindurant. I think these items are currently actionable:
And these need further thought and discussion:
|
Agree totally with that summary, @danielballan. Shall we then keep this PR to the directly actionable items? I'm OK with allowing intake-imported packages to announce themselves via a file-per-package, as you intended. That approach would still require a separate config file too, for driver name clashes. I would expect, for now, packages to directly call the intake API to announce themselves via a conda post-install hook, so we can keep the implementation flexible for the future. |
I'll have to read the whole thread, so apologies if I'm restating something. Maybe @minrk or @yuvipanda recall better and have been more involved. I believe one of the reasons has to do with packaging. If it's a single file, then you need to execute a script that modifies the config file at install/uninstall. If it's a directory, then the package can contain a config file which is just added/removed on install/uninstall, which avoids executing the conda hooks. |
I don't have a link off the top of my head, but the principle is that most package installers (pip wheels, conda, apt) either can't (wheels) or greatly prefer not to (conda, apt) take actions to modify files at installation time. They like as much as possible for installation to be placing unmodified files in stable locations. So whatever action it is that you want to take place at install time (in our case, enabling an extension), it is much easier for packagers to handle that by placing a file, rather than by editing a file. There have been all kinds of issues with conda's post-link and pre-unlink hooks, which are where we can take these modification actions, and caused headaches (e.g. dependencies being unavailable because they are not ready, maybe the enable/disable step occurred manually already, etc.). So as much as possible, I would encourage actions to be taken at install time to be:
to avoid conflicts between user actions and packager install/uninstall. I'm not sure I can speak to whether conf.d or setuptools entrypoints or whatever is best for your case in terms of discoverability. We've had good success in nbconvert and more recently jupyterhub with using entrypoints purely for the discoverability step, as long as what is to be discovered is e.g. a Python function or class (not the case for nbextensions, but it is the case for nbconvert Exporters). @takluyver's entrypoints package makes using the entrypoints standard established by setuptools super nice without having to invoke pkg_resources and all the baggage that comes with. In our case for extensions, discoverability isn't enough because we don't want the set of enabled extensions to always be the set of installed extensions, which is part of what @danielballan is asking for, I think. But if that is okay for you, then entrypoints is probably a good solution and you can skip defining your own conf.d standard. Even if you want that to be the default behavior, you could probably get away without conf.d and have a user-config extension whitelist (and/or blacklist) for cases where "less than all installed" is desired. If either of these suits your use case well enough, that's probably what I would recommend before conf.d. |
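A minimal sketch of the "entrypoints for discovery, user allowlist/blocklist for selection" approach described above, using the standard library's `importlib.metadata` rather than the entrypoints package mentioned in the comment. The group name "intake.drivers" and the functions `discover` and `select_enabled` are assumptions made for illustration; nothing in this thread fixes them.

```python
from importlib.metadata import entry_points


def discover(group="intake.drivers"):  # group name is an assumption
    """Return {name: 'module:object'} for installed entry points in `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):   # Python 3.10+
        found = eps.select(group=group)
    else:                        # Python 3.8 / 3.9: plain dict keyed by group
        found = eps.get(group, [])
    return {ep.name: ep.value for ep in found}


def select_enabled(discovered, allow=None, block=()):
    """Apply a user allowlist and/or blocklist to the discovered drivers.

    allow=None means everything installed is enabled by default.
    """
    return {
        name: target
        for name, target in discovered.items()
        if (allow is None or name in allow) and name not in block
    }


# Hypothetical discovery result, so the example does not depend on which
# packages happen to be installed in the current environment.
discovered = {
    "zarr": "intake_xarray.xzarr:ZarrSource",
    "csv": "intake.source.csv:CSVSource",
}
enabled = select_enabled(discovered, block={"csv"})
```

With this split, installing a package is enough to make its driver discoverable, and "less than all installed" remains a purely user-side setting.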
@danielballan , sounds like the many-file option is the clear winner, as you predicted. A general set of preferences is still needed, though (e.g., which driver to use for given data type name), which could be a file unto itself or in the main config. I think the many-files should be in the install directory so that pip and conda are ok with them (and they reflect installed packages local to that env anyway), but the config can stay in the confdir, ~/.intake/ by default, for now. |
Thanks for taking the time to provide that useful input, @Carreau and @minrk. Agreed with all of the above, @martindurant. I think we have enough clarity at this point that I should move the code forward, and then we can kick the tires and discuss some more. |
Thanks to all the people here for developing and for joining the conversation! |
@danielballan , do you still have plans here? |
Yep! Thanks for your patience -- I just got back from a nice long vacation. |
32ed125 to 6224958
Merge ? |
OK then! |
Thanks, all! I had planned to add tests before the merge, but I'll just open a separate PR. In parallel, I will start adding entrypoints to |
Thanks, @danielballan |
I would like it to be possible to:
intake*
I think the Jupyter notebook serverextension system has settled on a nice way to manage this kind of configuration (after many iterations and pivots over the years). This PR imitates that system. It's just a first pass to evaluate interest and would need more careful thought before being merged.
Demo:
An intake drivers subcommand can list the drivers that are added to intake.registry at import time.
A verbose option includes __file__ locations, potentially useful for untangling issues with environments.
Now suppose I want to disable the 'zarr' driver provided by intake_xarray. Perhaps I have a different implementation that I want to use with 'zarr' and I need to avoid the name collision.
I can later re-enable it:
The enable/disable state is stored in a separate YAML file for each driver in ~/.intake/drivers.d, imitating the system used by Jupyter. For backward compatibility, drivers in packages that begin with intake* are included in the registry unless they are explicitly disabled. (That is, they need not have any configuration in ~/.intake/drivers.d.) Drivers in packages with other names can be explicitly enabled:
The enable command created the following file at ~/.intake/drivers.d/offbrand_catalog.MongoMetadataStoreCatalog.yml:
As documented by Jupyter, packages can automatically enable their drivers at install time by using data_files in setup.py to place the corresponding files in ~/.intake/drivers.d/.
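The one-file-per-driver scheme could be sketched as below. This is not the code in this PR: the helper names, the `enabled: true` file contents, and the exact default rule are assumptions chosen to mirror the description above. The point it illustrates is that changing a driver's state only ever touches that driver's own file, so no shared config file is edited.

```python
import os
import tempfile


def _state_file(conf_dir, driver):
    # One file per fully qualified driver class, inside drivers.d
    return os.path.join(conf_dir, "drivers.d", driver + ".yml")


def set_enabled(conf_dir, driver, enabled):
    """Record the state of one driver in its own file (hypothetical format)."""
    path = _state_file(conf_dir, driver)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write("enabled: {}\n".format("true" if enabled else "false"))
    return path


def is_enabled(conf_dir, driver):
    """An explicit file wins; otherwise only intake* packages default to on."""
    path = _state_file(conf_dir, driver)
    if os.path.exists(path):
        with open(path) as f:
            return f.read().strip() == "enabled: true"
    return driver.split(".")[0].startswith("intake")


conf_dir = tempfile.mkdtemp()  # stand-in for ~/.intake
name = "offbrand_catalog.MongoMetadataStoreCatalog"
before = is_enabled(conf_dir, name)       # no file, non-intake package
path = set_enabled(conf_dir, name, True)  # drops a file into drivers.d
after = is_enabled(conf_dir, name)
```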