Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use entrypoints to manage drivers. Add subcommand. #236

Merged
merged 30 commits into from Jul 22, 2019

Conversation

@danielballan
Copy link
Collaborator

commented Jan 18, 2019

I would like it to be possible to:

  • Provide drivers that are discoverable by intake without necessarily packaging them in a package named intake*
  • Have the option to disable a specific intake driver from getting autodiscovered without uninstalling the package that provides it or disabling other drivers in that package

I think the Jupyter notebook serverextension system has settled on a nice way to manage this kind of configuration (after many iterations and pivots over the years). This PR imitates that system. It's just a first pass to evaluate interest and would need more careful thought before being merged.

Demo:

An intake drivers subcommand can list the drivers that are added to intake.registry at import time.

$ intake drivers list
netcdf                        intake_xarray.netcdf.NetCDFSource
opendap                       intake_xarray.opendap.OpenDapSource
rasterio                      intake_xarray.raster.RasterIOSource
remote-xarray                 intake_xarray.xarray_container.RemoteXarray
zarr                          intake_xarray.xzarr.ZarrSource

A verbose option includes __file__ locations, potentially useful for untangling issues with environments.

$ intake drivers list -v
netcdf                        intake_xarray.netcdf.NetCDFSource @ /home/dallan/Repos/bnl/intake-xarray/intake_xarray/netcdf.py
opendap                       intake_xarray.opendap.OpenDapSource @ /home/dallan/Repos/bnl/intake-xarray/intake_xarray/opendap.py
rasterio                      intake_xarray.raster.RasterIOSource @ /home/dallan/Repos/bnl/intake-xarray/intake_xarray/raster.py
remote-xarray                 intake_xarray.xarray_container.RemoteXarray @ /home/dallan/Repos/bnl/intake-xarray/intake_xarray/xarray_container.py
zarr                          intake_xarray.xzarr.ZarrSource @ /home/dallan/Repos/bnl/intake-xarray/intake_xarray/xzarr.py

Now suppose I want to disable the 'zarr'` driver provided by intake_xarray. Perhaps I have a different implementation that I want to use with 'zarr' and I need to avoid the name collision.

$ intake drivers disable intake_xarray.xzarr.ZarrSource
$ intake drivers list
netcdf                        intake_xarray.netcdf.NetCDFSource
opendap                       intake_xarray.opendap.OpenDapSource
rasterio                      intake_xarray.raster.RasterIOSource
remote-xarray                 intake_xarray.xarray_container.RemoteXarray
$ python -c "import intake; print('zarr' in intake.registry)"
False

I can later re-enable it:

$ intake drivers enable intake_xarray.xzarr.ZarrSource
$ python -c "import intake; print('zarr' in intake.registry)"
True

The enable/disable state is stored in a separate YAML file for each driver in ~/.intake/drivers.d, imitating the system used by Jupyter. For backward compatibility, drivers in packages that begin with intake* are included in the registry unless they are explicitly disabled. (That is, they need not have any configuration in ~/.intake/drivers.d.) Drivers in packages with other names can be explicitly enabled:

$ intake drivers enable offbrand_catalog.MongoMetadataStoreCatalog
$ intake drivers list
netcdf                        intake_xarray.netcdf.NetCDFSource
opendap                       intake_xarray.opendap.OpenDapSource
rasterio                      intake_xarray.raster.RasterIOSource
remote-xarray                 intake_xarray.xarray_container.RemoteXarray
zarr                          intake_xarray.xzarr.ZarrSource
mongo_metadatastore           offbrand_catalog.MongoMetadataStoreCatalog

The enable command created the following file at ~/.intake/drivers.d/offbrand_catalog.MongoMetadataStoreCatalog.yml:

offbrand_catalog.MongoMetadataStoreCatalog:                                                                                                                                                   
  enabled: true

As documented by Jupyter, packages can automatically enable their drivers at install time by using data_files in setup.py to place the corresponding files in ~/.intake/drivers.d/.

@martindurant

This comment has been minimized.

Copy link
Member

commented Jan 18, 2019

Before looking at this, I wanted to comment that a data source can specify its driver(s) by giving a fully qualified class name, package.module.DriverClass, and that a catalog yaml can give a set of modules to search for drivers at the time the file is parsed. Also, it is conceivable that a data source could be loaded by multiple drivers, and that multiple drivers happen to share the same simple name ( #163 ).

@martindurant

This comment has been minimized.

Copy link
Member

commented Jan 18, 2019

@bryevdv , you may have opinions on managing a plugin system via configuration. Note that the data drivers are just one thing that is pluggable in Intake.

@danielballan

This comment has been minimized.

Copy link
Collaborator Author

commented Jan 18, 2019

For my use case, I want data sources to stick with just a name, not a fully-specified DriverClass, to maintain the degree of freedom whereby the consumer of the data source can have their own opinions about which of the potentially-multiple drivers they want to use.

@seibert

This comment has been minimized.

Copy link
Collaborator

commented Jan 18, 2019

I am also interested to see more flexible ways to deal with plugins. The current system was basically copied from flask after I did some looking around for better approaches and gave up.

The main requirement (in my mind) is that installing a driver (with pip or conda) automatically makes it available in the registry without requiring an explicit registration step from the user. Although most package managers support automated post-install scripts, it is much easier if they are not required.

Being able to manage plugins that offer alternative implementations of the same format is interesting. That certainly makes it easier for people to experiment with better ways to read already-supported file formats.

@martindurant

This comment has been minimized.

Copy link
Member

commented Jan 18, 2019

Yeah, I think it's a good idea to be able to specify which class provides a particular driver, and which are not to be loaded at all, but I don't see an easy way to switch to a model where we don't do the package scan at all, since that's how thins are designed now.

On the CLI commands, I think they are basically correct (and exactly the same should be available in API also). The config like

drivers:
  zarr: intake_xarray.xzarr.ZarrSource
  csv: false
  netcdf: intake_xarray.netcdf.NetCDFSource

where "csv" and open_csv will be missing, and "netcdf" represents a choice. intake drivers -v would then give details of a) entries in the config, whether they imported successfully, and b) the discover process, as each name is registered and any that are skipped because of configuration or because of name collision.

@martindurant

This comment has been minimized.

Copy link
Member

commented Jan 18, 2019

That way, any name that is not explicitly given in the config (which is an action done by a user), it will be automatically imported without placing special files in any place or calling import-time scripts.

@danielballan

This comment has been minimized.

Copy link
Collaborator Author

commented Jan 18, 2019

The main requirement (in my mind) is that installing a driver (with pip or conda) automatically makes it available in the registry without requiring an explicit registration step from the user. Although most package managers support automated post-install scripts, it is much easier if they are not required.

I agree completely. The breakthrough moment for me was when I learned that this can be done in a plain setup.py via the data_files argument. Other package managers can decide to use post-install hooks instead if they so choose, of course, as Jupyter’s documentation notes, but it’s not necessary.

I don't see an easy way to switch to a model where we don't do the package scan at all, since that's how thins are designed now.

I agree. We could nonetheless undertake adding this mechanism to all known existing drivers and then add a config option to opt out of the package scan (and thus make import faster and arguably “cleaner”). That option could even potentially become the default in the far future after a long warning cycle.

@martindurant

This comment has been minimized.

Copy link
Member

commented Jan 18, 2019

this can be done in a plain setup.py via the data_files argument

It it ends up in ~/.intake, then I don't think conda will find it, so then we'd be wanting to search in multiple places. I can imagine it getting somewhat complex, and I'm scared of cases of "I installed X but I can't find it" down the line.

@danielballan

This comment has been minimized.

Copy link
Collaborator Author

commented Jan 18, 2019

I have used this pattern in the past:

if os.name == 'nt':
    _user_conf = os.path.join(os.environ['APPDATA'], 'databroker')
    CONFIG_SEARCH_PATH = (_user_conf,)
else:
    _user_conf = os.path.join(os.path.expanduser('~'), '.intake')
    _sys_prefix_etc = os.path.join(os.path.dirname(os.path.dirname(sys.executable)),
                               'etc', 'intake')
    _system_etc = os.path.join('/', 'etc', 'intake')
    CONFIG_SEARCH_PATH = (_user_conf, _sys_prefix_etc, _system_etc)

Of course one has to decide how to compose multiple potentially-comflicting configurations, but I think giving each driver a separate file makes this fairly clean. (I believe that was a factor in driving Jupyter to this approach.) The CLI can be explicit about where it found the config that is stipulating the state of each driver.

I don't want to make intake's configuration an order of magnitude more complicated all at once, but having a "sys prefix" solution for conda (and environments generally) may well be worth including in the first iteration.

@martindurant

This comment has been minimized.

Copy link
Member

commented Jan 18, 2019

Exactly where to put stuff is complex enough that there is a module for it, and in intake we want the config directory to be itself configurable via an environment variable. In addition, we need a firm way to prevent different packages from having conflicting file-names for the nugget that they would want to insert into that location. Finally, it's not clear whether the driver preferences should be per install prefix, user, or global.

Sorry if I only seem to be raising problems!

@danielballan

This comment has been minimized.

Copy link
Collaborator Author

commented Jan 18, 2019

Haha, no problem. Forethought and wariness are called for here, especially as we'll be stuck with maintaining whatever we decide approximately forever.

we need a firm way to prevent different packages from having conflicting file-names for the nugget that they would want to insert into that location.

In this draft, intake expects to find config in a directory named exactly like the package. We can't prevent packages from stepping on another package's name in error, but I think we are safe from collisions between two correct packages, no?

Finally, it's not clear whether the driver preferences should be per install prefix, user, or global.

I believe Jupyter looks in all three of those in that order, shows in the startup logs and CLIs' output which files it has found and used, and merges them. Tricky but good for situations where the sysadmin provides some basics by the user wants to customize "just this one thing". [Edit: Or, in a personal-use context, where you have some general settings in ~/.intake and some per-env settings in environments to lay over top.]

Nice, I hadn't encountered appdirs yet. It seems quite lightweight, may be worth using here.

@martindurant

This comment has been minimized.

Copy link
Member

commented Jan 25, 2019

Sorry to be slow to get to this... Some thoughts, much of which is already covered by this PR.

  • the idea to be able to dis/able and list the set of drivers via the CLI is certainly very useful and a definite good addition. The same functionality should be available via the library API too, though
  • with debug logging (e.g., INTAKE_LOG_LEVEL=DEBUG), the decisions of which modules are inspected and which drivers imported from where should be shown, i.e., the stages of all_enabled_drivers need logger output
  • the current autodiscover behaviour must be maintained for backward compatibility for the foreseeable future; they do not need to added to an enabled list for this to happen, but they must not appear on a disabled list; collisions here should cause warnings
  • I agree with the idea of a place where installed packages can put files which then mark these packages for import by intake without being names intake_*. That could be done via the CLI with a script, API call, or special install file hook. I regard the last as very brittle, because it brings the problems of exactly where the files should go: global, user, environment, config-dir, appdirs default ... which are not necessarily fixed at install time; and I don't think there's a way for setuptools and conda to use the same method for placing a file. This needs some thought
  • the config directory can gain a new file detailing data drivers (could in theory be a section of the main config too):
    • a set of specific modules/classes to import; The appearance of a "config.d" file is exactly equivalent to a module entered here
    • a set of specific modules/classes not to import, and
    • a mapping name: [<class path> | 'disabled'] for either expressing a preference for which driver should fulfil a given driver name, or prevent registry of all drivers of the given name.
      This does overlap latter point overlaps with the first, which may be rarely used, therefore. It is handy to prevent the import of a module before finding the name of the contained class and rejecting it, but we would need to decide precedence when something is in both the avoid list and the mapping.

Previously, individual catalogues could provide a more specific drivers config too (they can still provide what is effectively the first item), this could be considered for reimplementation.

@danielballan

This comment has been minimized.

Copy link
Collaborator Author

commented Jan 27, 2019

Sounds good to me, @martindurant. I think these items are currently actionable:

  • Back-compat: packages names intake* should be autodiscovered with no changes needed to those packages or special action by the user
  • Make all functionality accessible via API
  • Test
  • Add debug log calls to all_drivers_enabled and anywhere else it makes sense

And these need further thought and discussion:

  • Decide on a place where installed packages can put files which then mark these packages for import by intake without being named intake*, and whether to enable at install time.

    I regard [a special install file hook] as very brittle, because it brings the problems of exactly where the files should go: global, user, environment, config-dir, appdirs default ... which are not necessarily fixed at install time.

    The API call or equivalent CLI invocation should allow the user to differentiate with flags like --user, --sys-prefix, --global. Maybe it would be safest to start with the position, "Users must manually enable plugins whose names do not match intake*," and forgo the install hook for now. Then in a second step we can discuss whether it makes sense to follow Jupyter down the path of automatically enabling at the --sys-prefix level in `setup.py.

    I don't think there's a way for setuptools and conda to use the same method for placing a file.

    In the Jupyter example linked above, I believe conda would successfully discover and package the placed files, because they go inside conda's prefix.

  • Consider trade-offs between placing these settings in one central file (x global/user/sys-prefix) or distributed across one-file-per-driver, in the init.d style that I have implemented here. I recall that IPython/Jupyter moved from one central file to one-file-per-extension and trying the former for a release or two and finding it problematci. I am not sure I correctly recall their reasoning. @Carreau Would you have time to direct us the old issue or PR with the discussion on this, or summarize lessons learned?

  • Review how Catalogs themselves can specify driver configuration, and carefully consider how this interacts with the proposed new configuration scheme.

@martindurant

This comment has been minimized.

Copy link
Member

commented Jan 28, 2019

Agree totally with that summary, @danielballan .

Shall we keep this PR, then, to the direct actionable items, and I'm OK to allow for intake-imported packages to announce themselves via a file-per-package, as you intended. That approach would still require a separate config file too, for driver name clashes. I would expect, for now, packages to directly call the intake API to announce themselves via a conda post-install hook, so we can keep the implementation flexible for the future.

@Carreau

This comment has been minimized.

Copy link
Contributor

commented Jan 29, 2019

Would you have time to direct us the old issue or PR with the discussion on this, or summarize lessons learned?

I'll have to read the whole thread, so apologies if I'm re-stating something. Maybe @minrk or @yuvipanda recall better and have been more involve.

I believe one of the reasoning is with regard to packaging. If it's a single file, then you need to execute a script that modify the config file at install/uninstall. if it's a directory then the package can contain a config file which is just added/removed on install/uninstall, and avoid the conda hooks to be executed.

@minrk

This comment has been minimized.

Copy link

commented Jan 30, 2019

I recall that IPython/Jupyter moved from one central file to one-file-per-extension and trying the former for a release or two and finding it problematic. I am not sure I correctly recall their reasoning.

I don't have a link off the top of my head, but the principle is that most package installers (pip wheels, conda, apt) either can't (wheels) or greatly prefer not to (conda, apt) take actions to modify files at installation time. They like as much as possible for installation to be placing unmodified files in stable locations. So whatever action it is that you want to take place at install time (in our case, enabling an extension), it is much easier for packagers to handle that by placing a file, rather than by editing a file. There have been all kinds of issues with conda's post-link and pre-unlink hooks, which are where we can take these modification actions, and caused headaches (e.g. dependencies being unavailable because they are not ready, maybe the enable/disable step occurred manually already, etc.).

So as much as possible, I would encourage actions to be taken at install time to be:

  1. plopping a file in a location
  2. user-overrides to this not involving editing the same file, but editing a different file that takes precedence

to avoid conflicts between user actions and packager install/uninstall.

I'm not sure I can speak to whether conf.d or setuptools entrypoints or whatever is best for your case in terms of discoverability. We've had good success in nbconvert and more recently jupyterhub with using entrypoints purely for the discoverability step, as long as what is to be discovered is e.g. a Python function or class (not the case for nbextensions, but it is the case for nbconvert Exporters). @takluyver's entrypoints package makes using the entrypoints standard established by setuptools super nice without having to invoke pkg_resources and all the baggage that comes with. In our case for extensions, discoverability isn't enough because we don't want the set of enabled extensions to always be the set of installed extensions, which is part of what @danielballan is asking for, I think. But if that is okay for you, then entrypoints is probably a good solution and you can skip defining your own conf.d standard. Even if you want that to be the default behavior, you could probably get away without conf.d and have a user-config extension whitelist (and/or blacklist) for cases where "less than all installed" is desired. If either of these suits your use case well enough, that's probably what I would recommend before conf.d.

@martindurant

This comment has been minimized.

Copy link
Member

commented Jan 30, 2019

@danielballan , sounds like the many-file option is the clear winner, as you predicted. A general set of preferences is still needed, though (e.g., which driver to use for given data type name), which could be a file unto itself or in the main config. I think the many-files should be in the install directory so that pip and conda are ok with them (and they reflect installed packages local to that env anyway), but the config can stay in the confdir, ~/.intake/ by default, for now.

@danielballan

This comment has been minimized.

Copy link
Collaborator Author

commented Jan 30, 2019

Thanks for taking the time to provide that useful input, @Carreau and @minrk.

Agreed with all of the above, @martindurant. I think we have enough clarity at this point that I should move the code forward, and then we can kick the tires and discuss some more.

@martindurant

This comment has been minimized.

Copy link
Member

commented Jan 30, 2019

Thanks to all the people here for developing and for joining the conversation!

@martindurant

This comment has been minimized.

Copy link
Member

commented Mar 20, 2019

@danielballan , do you still have plans here?

@danielballan

This comment has been minimized.

Copy link
Collaborator Author

commented Mar 25, 2019

Yep! Thanks for your patience -- I just got back from a nice long vacation.

@danielballan danielballan force-pushed the danielballan:drivers-subcommand branch from 32ed125 to 6224958 Jul 8, 2019

@danielballan

This comment has been minimized.

Copy link
Collaborator Author

commented Jul 10, 2019

Maybe @stuartcampbell, @tacaswell, @gwbischof and/or @jklynch would take a look.

# Discover drivers via entrypoints.
group = entrypoints.get_group_named('intake.drivers', path=path)
group_all = entrypoints.get_group_all('intake.drivers', path=path)
if group_all != group:

This comment has been minimized.

Copy link
@Carreau

Carreau Jul 11, 2019

Contributor

group_all:list , group:dict, so this will likely always be false ? – would need to check.
Force set-coercion, or compare length ?

This comment has been minimized.

Copy link
@danielballan

danielballan Jul 11, 2019

Author Collaborator

Ah, I meant this to be a len comparison.

if len(matches) != 1:
winner = group[name]
logger.debug(
"There are %d 'intake.driver' entrypoints for the name "

This comment has been minimized.

Copy link
@Carreau

Carreau Jul 11, 2019

Contributor

f- all the strings ?

This comment has been minimized.

Copy link
@Carreau

Carreau Jul 11, 2019

Contributor

Scratch that, is a log.debug.

@Carreau

This comment has been minimized.

Copy link
Contributor

commented Jul 11, 2019

Thanks, it's a good start. 🤣

danielballan added some commits Jul 11, 2019

@martindurant

This comment has been minimized.

Copy link
Member

commented Jul 16, 2019

In as far as this doesn't break existing functionality, I'm minded to merge. First we should have some tests, though, which in practice would mean working in a temporary config directory and calling both API and CLI functions to register/remove/forbid drivers. Also some tests which edit the config files directly? That tests the specific implementation of the config, which may change later.

@danielballan

This comment has been minimized.

Copy link
Collaborator Author

commented Jul 16, 2019

OK, great, I'll move ahead with adding tests. FYI, I'm on vacation the rest of this week, so don't expect commits to be rolling in until next week.

plugin_prefix : str
DEPRECATED. Default is 'intake_'.
do_package_scan : boolean
Default is True.

This comment has been minimized.

Copy link
@jsignell

jsignell Jul 19, 2019

Member

Will this be deprecated as well? It seems like the default will at least be changed at some point. Not sure how best to expose that though.

This comment has been minimized.

Copy link
@danielballan

danielballan Jul 22, 2019

Author Collaborator

Good catch. Added in dcc84a0. Thanks!


.. code-block:: bash
intake drivers disable some_format_name

This comment has been minimized.

Copy link
@jsignell

jsignell Jul 19, 2019

Member

How does this work for name collisions? Would they all be disabled?

This comment has been minimized.

Copy link
@danielballan

danielballan Jul 22, 2019

Author Collaborator

Yes. The user could then enable one driver specifically or leave no drivers enabled for that name.

@martindurant

This comment has been minimized.

Copy link
Member

commented Jul 22, 2019

Sounds like everything is ready here.
Does anything remain to be done?

@Carreau

This comment has been minimized.

Copy link
Contributor

commented Jul 22, 2019

Does anything remain to be done?

Merge ?

@martindurant

This comment has been minimized.

Copy link
Member

commented Jul 22, 2019

OK then!
This may be a very nice little thing to write a blog about.

@martindurant martindurant merged commit 4a5cdf2 into intake:master Jul 22, 2019

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@danielballan

This comment has been minimized.

Copy link
Collaborator Author

commented Jul 22, 2019

Thanks, all! I had planned to add tests before the merge, but I'll just open a separate PR. In parallel, I will start adding entrypoints to intake-* packages. (If they are used with an old version of intake, pre-dating this PR, they will do no harm.)

@danielballan danielballan deleted the danielballan:drivers-subcommand branch Jul 22, 2019

@martindurant

This comment has been minimized.

Copy link
Member

commented Jul 22, 2019

Thanks, @danielballan

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
7 participants
You can’t perform that action at this time.