Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use entrypoints to manage drivers. Add subcommand. #236

Merged
merged 30 commits into from
Jul 22, 2019

Conversation

danielballan
Copy link
Member

I would like it to be possible to:

  • Provide drivers that are discoverable by intake without necessarily packaging them in a package named intake*
  • Have the option to disable a specific intake driver from getting autodiscovered without uninstalling the package that provides it or disabling other drivers in that package

I think the Jupyter notebook serverextension system has settled on a nice way to manage this kind of configuration (after many iterations and pivots over the years). This PR imitates that system. It's just a first pass to evaluate interest and would need more careful thought before being merged.

Demo:

An intake drivers subcommand can list the drivers that are added to intake.registry at import time.

$ intake drivers list
netcdf                        intake_xarray.netcdf.NetCDFSource
opendap                       intake_xarray.opendap.OpenDapSource
rasterio                      intake_xarray.raster.RasterIOSource
remote-xarray                 intake_xarray.xarray_container.RemoteXarray
zarr                          intake_xarray.xzarr.ZarrSource

A verbose option includes __file__ locations, potentially useful for untangling issues with environments.

$ intake drivers list -v
netcdf                        intake_xarray.netcdf.NetCDFSource @ /home/dallan/Repos/bnl/intake-xarray/intake_xarray/netcdf.py
opendap                       intake_xarray.opendap.OpenDapSource @ /home/dallan/Repos/bnl/intake-xarray/intake_xarray/opendap.py
rasterio                      intake_xarray.raster.RasterIOSource @ /home/dallan/Repos/bnl/intake-xarray/intake_xarray/raster.py
remote-xarray                 intake_xarray.xarray_container.RemoteXarray @ /home/dallan/Repos/bnl/intake-xarray/intake_xarray/xarray_container.py
zarr                          intake_xarray.xzarr.ZarrSource @ /home/dallan/Repos/bnl/intake-xarray/intake_xarray/xzarr.py

Now suppose I want to disable the 'zarr'` driver provided by intake_xarray. Perhaps I have a different implementation that I want to use with 'zarr' and I need to avoid the name collision.

$ intake drivers disable intake_xarray.xzarr.ZarrSource
$ intake drivers list
netcdf                        intake_xarray.netcdf.NetCDFSource
opendap                       intake_xarray.opendap.OpenDapSource
rasterio                      intake_xarray.raster.RasterIOSource
remote-xarray                 intake_xarray.xarray_container.RemoteXarray
$ python -c "import intake; print('zarr' in intake.registry)"
False

I can later re-enable it:

$ intake drivers enable intake_xarray.xzarr.ZarrSource
$ python -c "import intake; print('zarr' in intake.registry)"
True

The enable/disable state is stored in a separate YAML file for each driver in ~/.intake/drivers.d, imitating the system used by Jupyter. For backward compatibility, drivers in packages that begin with intake* are included in the registry unless they are explicitly disabled. (That is, they need not have any configuration in ~/.intake/drivers.d.) Drivers in packages with other names can be explicitly enabled:

$ intake drivers enable offbrand_catalog.MongoMetadataStoreCatalog
$ intake drivers list
netcdf                        intake_xarray.netcdf.NetCDFSource
opendap                       intake_xarray.opendap.OpenDapSource
rasterio                      intake_xarray.raster.RasterIOSource
remote-xarray                 intake_xarray.xarray_container.RemoteXarray
zarr                          intake_xarray.xzarr.ZarrSource
mongo_metadatastore           offbrand_catalog.MongoMetadataStoreCatalog

The enable command created the following file at ~/.intake/drivers.d/offbrand_catalog.MongoMetadataStoreCatalog.yml:

offbrand_catalog.MongoMetadataStoreCatalog:                                                                                                                                                   
  enabled: true

As documented by Jupyter, packages can automatically enable their drivers at install time by using data_files in setup.py to place the corresponding files in ~/.intake/drivers.d/.

@martindurant
Copy link
Member

Before looking at this, I wanted to comment that a data source can specify its driver(s) by giving a fully qualified class name, package.module.DriverClass, and that a catalog yaml can give a set of modules to search for drivers at the time the file is parsed. Also, it is conceivable that a data source could be loaded by multiple drivers, and that multiple drivers happen to share the same simple name ( #163 ).

@martindurant
Copy link
Member

@bryevdv , you may have opinions on managing a plugin system via configuration. Note that the data drivers are just one thing that is pluggable in Intake.

@danielballan
Copy link
Member Author

For my use case, I want data sources to stick with just a name, not a fully-specified DriverClass, to maintain the degree of freedom whereby the consumer of the data source can have their own opinions about which of the potentially-multiple drivers they want to use.

@seibert
Copy link
Collaborator

seibert commented Jan 18, 2019

I am also interested to see more flexible ways to deal with plugins. The current system was basically copied from flask after I did some looking around for better approaches and gave up.

The main requirement (in my mind) is that installing a driver (with pip or conda) automatically makes it available in the registry without requiring an explicit registration step from the user. Although most package managers support automated post-install scripts, it is much easier if they are not required.

Being able to manage plugins that offer alternative implementations of the same format is interesting. That certainly makes it easier for people to experiment with better ways to read already-supported file formats.

@martindurant
Copy link
Member

Yeah, I think it's a good idea to be able to specify which class provides a particular driver, and which are not to be loaded at all, but I don't see an easy way to switch to a model where we don't do the package scan at all, since that's how thins are designed now.

On the CLI commands, I think they are basically correct (and exactly the same should be available in API also). The config like

drivers:
  zarr: intake_xarray.xzarr.ZarrSource
  csv: false
  netcdf: intake_xarray.netcdf.NetCDFSource

where "csv" and open_csv will be missing, and "netcdf" represents a choice. intake drivers -v would then give details of a) entries in the config, whether they imported successfully, and b) the discover process, as each name is registered and any that are skipped because of configuration or because of name collision.

@martindurant
Copy link
Member

That way, any name that is not explicitly given in the config (which is an action done by a user), it will be automatically imported without placing special files in any place or calling import-time scripts.

@danielballan
Copy link
Member Author

The main requirement (in my mind) is that installing a driver (with pip or conda) automatically makes it available in the registry without requiring an explicit registration step from the user. Although most package managers support automated post-install scripts, it is much easier if they are not required.

I agree completely. The breakthrough moment for me was when I learned that this can be done in a plain setup.py via the data_files argument. Other package managers can decide to use post-install hooks instead if they so choose, of course, as Jupyter’s documentation notes, but it’s not necessary.

I don't see an easy way to switch to a model where we don't do the package scan at all, since that's how thins are designed now.

I agree. We could nonetheless undertake adding this mechanism to all known existing drivers and then add a config option to opt out of the package scan (and thus make import faster and arguably “cleaner”). That option could even potentially become the default in the far future after a long warning cycle.

@martindurant
Copy link
Member

this can be done in a plain setup.py via the data_files argument

It it ends up in ~/.intake, then I don't think conda will find it, so then we'd be wanting to search in multiple places. I can imagine it getting somewhat complex, and I'm scared of cases of "I installed X but I can't find it" down the line.

@danielballan
Copy link
Member Author

I have used this pattern in the past:

if os.name == 'nt':
    _user_conf = os.path.join(os.environ['APPDATA'], 'databroker')
    CONFIG_SEARCH_PATH = (_user_conf,)
else:
    _user_conf = os.path.join(os.path.expanduser('~'), '.intake')
    _sys_prefix_etc = os.path.join(os.path.dirname(os.path.dirname(sys.executable)),
                               'etc', 'intake')
    _system_etc = os.path.join('/', 'etc', 'intake')
    CONFIG_SEARCH_PATH = (_user_conf, _sys_prefix_etc, _system_etc)

Of course one has to decide how to compose multiple potentially-comflicting configurations, but I think giving each driver a separate file makes this fairly clean. (I believe that was a factor in driving Jupyter to this approach.) The CLI can be explicit about where it found the config that is stipulating the state of each driver.

I don't want to make intake's configuration an order of magnitude more complicated all at once, but having a "sys prefix" solution for conda (and environments generally) may well be worth including in the first iteration.

@martindurant
Copy link
Member

Exactly where to put stuff is complex enough that there is a module for it, and in intake we want the config directory to be itself configurable via an environment variable. In addition, we need a firm way to prevent different packages from having conflicting file-names for the nugget that they would want to insert into that location. Finally, it's not clear whether the driver preferences should be per install prefix, user, or global.

Sorry if I only seem to be raising problems!

@danielballan
Copy link
Member Author

danielballan commented Jan 18, 2019

Haha, no problem. Forethought and wariness are called for here, especially as we'll be stuck with maintaining whatever we decide approximately forever.

we need a firm way to prevent different packages from having conflicting file-names for the nugget that they would want to insert into that location.

In this draft, intake expects to find config in a directory named exactly like the package. We can't prevent packages from stepping on another package's name in error, but I think we are safe from collisions between two correct packages, no?

Finally, it's not clear whether the driver preferences should be per install prefix, user, or global.

I believe Jupyter looks in all three of those in that order, shows in the startup logs and CLIs' output which files it has found and used, and merges them. Tricky but good for situations where the sysadmin provides some basics by the user wants to customize "just this one thing". [Edit: Or, in a personal-use context, where you have some general settings in ~/.intake and some per-env settings in environments to lay over top.]

Nice, I hadn't encountered appdirs yet. It seems quite lightweight, may be worth using here.

@martindurant
Copy link
Member

Sorry to be slow to get to this... Some thoughts, much of which is already covered by this PR.

  • the idea to be able to dis/able and list the set of drivers via the CLI is certainly very useful and a definite good addition. The same functionality should be available via the library API too, though
  • with debug logging (e.g., INTAKE_LOG_LEVEL=DEBUG), the decisions of which modules are inspected and which drivers imported from where should be shown, i.e., the stages of all_enabled_drivers need logger output
  • the current autodiscover behaviour must be maintained for backward compatibility for the foreseeable future; they do not need to added to an enabled list for this to happen, but they must not appear on a disabled list; collisions here should cause warnings
  • I agree with the idea of a place where installed packages can put files which then mark these packages for import by intake without being names intake_*. That could be done via the CLI with a script, API call, or special install file hook. I regard the last as very brittle, because it brings the problems of exactly where the files should go: global, user, environment, config-dir, appdirs default ... which are not necessarily fixed at install time; and I don't think there's a way for setuptools and conda to use the same method for placing a file. This needs some thought
  • the config directory can gain a new file detailing data drivers (could in theory be a section of the main config too):
    • a set of specific modules/classes to import; The appearance of a "config.d" file is exactly equivalent to a module entered here
    • a set of specific modules/classes not to import, and
    • a mapping name: [<class path> | 'disabled'] for either expressing a preference for which driver should fulfil a given driver name, or prevent registry of all drivers of the given name.
      This does overlap latter point overlaps with the first, which may be rarely used, therefore. It is handy to prevent the import of a module before finding the name of the contained class and rejecting it, but we would need to decide precedence when something is in both the avoid list and the mapping.

Previously, individual catalogues could provide a more specific drivers config too (they can still provide what is effectively the first item), this could be considered for reimplementation.

@danielballan
Copy link
Member Author

danielballan commented Jan 27, 2019

Sounds good to me, @martindurant. I think these items are currently actionable:

  • Back-compat: packages names intake* should be autodiscovered with no changes needed to those packages or special action by the user
  • Make all functionality accessible via API
  • Test
  • Add debug log calls to all_drivers_enabled and anywhere else it makes sense

And these need further thought and discussion:

  • Decide on a place where installed packages can put files which then mark these packages for import by intake without being named intake*, and whether to enable at install time.

    I regard [a special install file hook] as very brittle, because it brings the problems of exactly where the files should go: global, user, environment, config-dir, appdirs default ... which are not necessarily fixed at install time.

    The API call or equivalent CLI invocation should allow the user to differentiate with flags like --user, --sys-prefix, --global. Maybe it would be safest to start with the position, "Users must manually enable plugins whose names do not match intake*," and forgo the install hook for now. Then in a second step we can discuss whether it makes sense to follow Jupyter down the path of automatically enabling at the --sys-prefix level in `setup.py.

    I don't think there's a way for setuptools and conda to use the same method for placing a file.

    In the Jupyter example linked above, I believe conda would successfully discover and package the placed files, because they go inside conda's prefix.

  • Consider trade-offs between placing these settings in one central file (x global/user/sys-prefix) or distributed across one-file-per-driver, in the init.d style that I have implemented here. I recall that IPython/Jupyter moved from one central file to one-file-per-extension and trying the former for a release or two and finding it problematci. I am not sure I correctly recall their reasoning. @Carreau Would you have time to direct us the old issue or PR with the discussion on this, or summarize lessons learned?

  • Review how Catalogs themselves can specify driver configuration, and carefully consider how this interacts with the proposed new configuration scheme.

@martindurant
Copy link
Member

Agree totally with that summary, @danielballan .

Shall we keep this PR, then, to the direct actionable items, and I'm OK to allow for intake-imported packages to announce themselves via a file-per-package, as you intended. That approach would still require a separate config file too, for driver name clashes. I would expect, for now, packages to directly call the intake API to announce themselves via a conda post-install hook, so we can keep the implementation flexible for the future.

@Carreau
Copy link
Contributor

Carreau commented Jan 29, 2019

Would you have time to direct us the old issue or PR with the discussion on this, or summarize lessons learned?

I'll have to read the whole thread, so apologies if I'm re-stating something. Maybe @minrk or @yuvipanda recall better and have been more involve.

I believe one of the reasoning is with regard to packaging. If it's a single file, then you need to execute a script that modify the config file at install/uninstall. if it's a directory then the package can contain a config file which is just added/removed on install/uninstall, and avoid the conda hooks to be executed.

@minrk
Copy link

minrk commented Jan 30, 2019

I recall that IPython/Jupyter moved from one central file to one-file-per-extension and trying the former for a release or two and finding it problematic. I am not sure I correctly recall their reasoning.

I don't have a link off the top of my head, but the principle is that most package installers (pip wheels, conda, apt) either can't (wheels) or greatly prefer not to (conda, apt) take actions to modify files at installation time. They like as much as possible for installation to be placing unmodified files in stable locations. So whatever action it is that you want to take place at install time (in our case, enabling an extension), it is much easier for packagers to handle that by placing a file, rather than by editing a file. There have been all kinds of issues with conda's post-link and pre-unlink hooks, which are where we can take these modification actions, and caused headaches (e.g. dependencies being unavailable because they are not ready, maybe the enable/disable step occurred manually already, etc.).

So as much as possible, I would encourage actions to be taken at install time to be:

  1. plopping a file in a location
  2. user-overrides to this not involving editing the same file, but editing a different file that takes precedence

to avoid conflicts between user actions and packager install/uninstall.

I'm not sure I can speak to whether conf.d or setuptools entrypoints or whatever is best for your case in terms of discoverability. We've had good success in nbconvert and more recently jupyterhub with using entrypoints purely for the discoverability step, as long as what is to be discovered is e.g. a Python function or class (not the case for nbextensions, but it is the case for nbconvert Exporters). @takluyver's entrypoints package makes using the entrypoints standard established by setuptools super nice without having to invoke pkg_resources and all the baggage that comes with. In our case for extensions, discoverability isn't enough because we don't want the set of enabled extensions to always be the set of installed extensions, which is part of what @danielballan is asking for, I think. But if that is okay for you, then entrypoints is probably a good solution and you can skip defining your own conf.d standard. Even if you want that to be the default behavior, you could probably get away without conf.d and have a user-config extension whitelist (and/or blacklist) for cases where "less than all installed" is desired. If either of these suits your use case well enough, that's probably what I would recommend before conf.d.

@martindurant
Copy link
Member

@danielballan , sounds like the many-file option is the clear winner, as you predicted. A general set of preferences is still needed, though (e.g., which driver to use for given data type name), which could be a file unto itself or in the main config. I think the many-files should be in the install directory so that pip and conda are ok with them (and they reflect installed packages local to that env anyway), but the config can stay in the confdir, ~/.intake/ by default, for now.

@danielballan
Copy link
Member Author

Thanks for taking the time to provide that useful input, @Carreau and @minrk.

Agreed with all of the above, @martindurant. I think we have enough clarity at this point that I should move the code forward, and then we can kick the tires and discuss some more.

@martindurant
Copy link
Member

Thanks to all the people here for developing and for joining the conversation!

@martindurant
Copy link
Member

@danielballan , do you still have plans here?

@danielballan
Copy link
Member Author

Yep! Thanks for your patience -- I just got back from a nice long vacation.

@Carreau
Copy link
Contributor

Carreau commented Jul 22, 2019

Does anything remain to be done?

Merge ?

@martindurant
Copy link
Member

OK then!
This may be a very nice little thing to write a blog about.

@martindurant martindurant merged commit 4a5cdf2 into intake:master Jul 22, 2019
@danielballan
Copy link
Member Author

Thanks, all! I had planned to add tests before the merge, but I'll just open a separate PR. In parallel, I will start adding entrypoints to intake-* packages. (If they are used with an old version of intake, pre-dating this PR, they will do no harm.)

@danielballan danielballan deleted the drivers-subcommand branch July 22, 2019 21:14
@martindurant
Copy link
Member

Thanks, @danielballan

danielballan added a commit to danielballan/intake-parquet that referenced this pull request Jul 26, 2019
danielballan added a commit to danielballan/intake-splunk that referenced this pull request Jul 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants