Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-14191: Merge butler configuration files with defaults #39

Merged
merged 26 commits into from May 17, 2018

Conversation

timj
Copy link
Member

@timj timj commented May 9, 2018

  • Also rewrite doImport

timj added 10 commits May 9, 2018 12:58
* schema is now a top level item in config file
* repository yaml correctly finds registry and datastore defaults
  on a per-class basis.
Python docs really don't want you to use __import__ and strongly
prefer importlib. I think this implementation is arguably easier
to understand and has a key advantage of not hiding ImportError
from the caller (this is important when loading plugins for
formatters and the like since you want to know why your
formatter failed to import).
It was not used anywhere.
and also validate that some keys are present.
This simplifies the constructor code since it no longer
requires that you know the subset and you know that the
subclass has the right bit of the config.

For example, allow a SchemaConfig to be created from a butler config file
and locate the "schema" component.
----------
config : `Config`, optional
By default checks itself, but if ``config`` is given, this
config will be checked instead.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to make validate a public class method instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment below on validation of ConfigSubset but would you prefer class attributes Config.requiredKeys and a public method in Config.validate which used requiredKeys but could be overridden? Why a class method and not an instance method?

pass
class DatastoreConfig(ConfigSubset):
component = "datastore"
requiredKeys = ("cls",)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this check now duplicated in Butler(Config)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ButlerConfig doesn't inherit from ConfigSubset so no. I could move the validation code in ConfigSubset.__init__() to Config.__init__() and always have it as an option if you set requiredKeys.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good.

else:
raise ValueError("Incompatible Registry configuration: {}".format(config))
cls = doImport(config['cls'])
cls = doImport(config['registry.cls'])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If config now is a RegistryConfig doesn't it have just the registry subset?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see, because we want the full Butler config for it. Maybe then explicitly specify that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I think it's better for RegistryConfig to only have registry stuff and SchemaConfig to only have schema stuff. I'll change that.

Copy link
Member

@TallJimbo TallJimbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good; mostly very minor comments. I would like to make sure we figure out where to put the StorageClass->Formatter mapping defaults so those don't have to be repeated in each repository (and that raises the question of which config values, if any, should always go in the repo config instead of being pulled from the defaults).

# Find the butler configs
self.defaultsDir = None
if "DAF_BUTLER_DIR" in os.environ:
self.defaultsDir = os.path.join(os.environ["DAF_BUTLER_DIR"], "config")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utils.getPackageDir?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wondered about that. utils would work fine but I do wonder about adding LSST dependencies for no real gain. We only use utils in tests and I would argue that we don't need that dependency. That leaves only lsst.log and sphgeom stopping us from putting this package directly on pypi for everyone to use without needing any non-standard dependencies. (I'm thinking about world domination here...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some of the same thoughts about world domination (see RFC-430 and RFC-460 vis a vis getting that sphgeom dependency out on pypi). But I think using DAF_BUTLER_DIR here just avoids an explicit dependency on utils via an implicit dependency on EUPS and perhaps sconsUtils, and I think the utils version may be slightly easier to rip out and replace with something unencumbered later.


Parameters
----------
other : `str`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might want to rename this argument to something more descriptive.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I called it that because that's what it was called in Config (which came from daf_persistence). I'll have a think.

cls: lsst.daf.butler.datastores.posixDatastore.PosixDatastore
records:
table: PosixDatastoreRecords
create: true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit surprised to not see the default mapping from StorageClass to Formatter here. Those are specific for PosixDatastore, right?

I don't actually care if they go here or elsewhere, but we need an easy way to construct a Butler that knows about those defaults.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't entirely sure what defaults would mean since the formatters really do lock in the file format, but you are right that FITS is where we are so our default for "how to write an Exposure" should be in this file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of those defaults currently live in the gen3 module of ci_hsc (on the gen3-middleware branch). I'd love to get those moved out of Python code into YAML. I'd also be happy to take care of making a ci_hsc branch compatible with this ticket once you're done with the daf_butler changes.

# Look for class specific defaults
for section in ("datastore", "registry"):
k = f"{section}.cls"
print("Checking {}: {}".format(k, butlerConfig[k]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove or use logging.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Sorry. Should have deleted it.


# Default schema
schema = Config(os.path.join(self.defaultsDir, "registry", "default_schema.yaml"))
self.update(schema)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be using RegistryConfig here? ("No" is a perfectly good answer; I'm just curious why not.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably. Before I added ConfigSubset there wasn't any difference. Now there is so I'll revisit that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although, these configs have to fit in the global structure so should not have the schema stripped. The advantage of using SchemaConfig is you get the validation so I have to see if I can read the subset and bind it to the larger config hierarchy.

@pschella
Copy link
Collaborator

Since they are all defaults we may want to rename default_schema.yaml to something else.

# Find the butler configs
self.defaultsDir = None
if "DAF_BUTLER_DIR" in os.environ:
self.defaultsDir = os.path.join(os.environ["DAF_BUTLER_DIR"], "config")
Copy link
Collaborator

@pschella pschella May 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still wonder if having just one level of overrides is future proof enough. As opposed to a (colon-separated) $DAF_BUTLER_CONFIG_PATH.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can do that. I thought we agreed that once we have the merging code in one place we can easily add it but since I'm there I'll add this other variable as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you envisage the $DAF_BUTLER_CONFIG_PATH containing a set of search paths that look identical to the on in daf_butler/config (so we look for storageClasses.yaml, etc)? And that the defaults should be filled in by reading butler configs first, then the path in reverse order?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should hold off on this until we have a use case for it. I tend to think that having extra paths that aren't connected to either a repo or a software version could do more harm than good.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented this last night... How about I push the change this morning as a new commit and you can take a look.

@timj
Copy link
Member Author

timj commented May 10, 2018

Regarding default_schema.yaml I'm happy to delete the registry directory and call the file registry_schema.yaml.

timj added 3 commits May 14, 2018 16:09
* Each configuration subclass now knows its root component
  and name of defaults configuration file.
* Search DAF_BUTLER_CONFIG_PATHS for files.
* Registry.fromConfig can now take a separate schema config
  since Schema is in a different area of the model.
# code.
# Should this be a log message? Are we using lsst.log?
# print("Checking path {} for {} ({})".format(pathDir, configClass, configFile))
if not os.path.isabs(configFile):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this apply absolute paths over and over? Maybe that's not a problem.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It does. Hadn't thought of that. I don't think it's a problem as such, but it will slow things down.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not simply invert the loop order?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. since the configs read from the paths are subsets of the config I can loop over configs and then paths.

are constructed by reading first the global defaults, and then adding
in overrides from each entry in the colon-separated
``$DAF_BUTLER_CONFIG_PATH`` in reverse order such that the entries ahead
in the list take priority. The registry and datastore configurations
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this opposite behavior to shell $PATH? If so why?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not meant to be the opposite order. In $PATH things early in the path take precedence over things later in the path, so I thought I had to reverse the order when reading in the defaults so that earlier ones in the path override defaults appearing late in the path.

specializedConfigs = (DatastoreConfig, RegistryConfig)

# Look for class specific default files to be added to the file list
# These class definitions must come from the suppliedbutler config and
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

supplied butler.

# Look for class specific default files to be added to the file list
# These class definitions must come from the suppliedbutler config and
# are not defaulted.
for componentConfig in specializedConfigs:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not simply loop over all components and have those subclasses determine if they are special?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E.g. by having a component member.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you saying I should look through all the top level keys in the supplied butler config looking for .cls members, then ask the imported classes for their config files and associated config classes? (the latter they don't know but they could).

butlerConfig = Config(other)

# All the configs that can be associated with default files
configComponents = (SchemaConfig, StorageClassConfig, DatastoreConfig, RegistryConfig)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps make this a class member?

# code.
# Should this be a log message? Are we using lsst.log?
# print("Checking path {} for {} ({})".format(pathDir, configClass, configFile))
if not os.path.isabs(configFile):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not simply invert the loop order?

@@ -94,7 +94,7 @@ def __init__(self, other=None):
def ppprint(self):
"""helper function for debugging, prints a config out in a readable way in the debugger.

use: pdb> print myConfigObject.pprint()
use: pdb> print myConfigObject.ppprint()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ppprint?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't write the method...

"""

def __init__(self, other=None):
super().__init__(other)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if other is a ConfigSubset?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What am I missing?

>>> from lsst.daf.butler import ConfigSubset
>>> x = ConfigSubset({"A": 5, "b": 6})
>>> x
{'A': 5, 'b': 6}
>>> y = ConfigSubset(x)
>>> y
{'A': 5, 'b': 6}
>>> x["c"] = 3
>>> x
{'A': 5, 'b': 6, 'c': 3}
>>> y
{'A': 5, 'b': 6}

@@ -33,7 +34,8 @@ class DataUnitRegistryTestCase(lsst.utils.tests.TestCase):
"""
def setUp(self):
self.testDir = os.path.dirname(__file__)
self.schemaFile = os.path.join(self.testDir, "../config/registry/default_schema.yaml")
self.schemaFile = os.path.join(lsst.utils.getPackageDir("daf_butler"), "config",
SchemaConfig.defaultConfigFile)
self.config = SchemaConfig(self.schemaFile)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need to specify the schemaFile? Or does calling SchemaConfig() load the default file?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment, the only defaulting is in ButlerConfig. I can see how each subclass doing its own defaulting and getting the search path from a shared routine might be helpful (and more obvious).

@@ -51,7 +52,7 @@ def setUp(self):
self.configFile = os.path.join(self.testDir, "config/basic/butler.yaml")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a default file for Butler too perhaps?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If only for users to copy. But it might be useful to do butler = Butler() and simply get something working with a local SqliteRegistry and PosixDatastore in a default local subdirectory.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps even more useful if that then can write the default butler.yaml config in it (something like git init .) that you can subsequently modify.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But maybe the default constructor is not the best for that. And instead you want an explicit class method e.g. butler = Butler.makeDefault() (with an optional path).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd go with an explicit class method with a required path argument (which could accept ".", I suppose).

inputPosixDatastore = PosixDatastore(config=inputConfig, registry=self.registry)
outputConfig = inputConfig.copy()
outputConfig['datastore.root'] = os.path.join(self.testDir, "./test_output_datastore")
outputConfig['root'] = os.path.join(self.testDir, "./test_output_datastore")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we now assuming that Datastore and Registry always have the same root? What if (for a particular datastore) there is no such thing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand. This is a DatastoreConfig so by definition it's a datastore config not a registry config. Or are you asking a different question?

Copy link
Member

@TallJimbo TallJimbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I've made some minor comments, but like @pschella, from this point I'm happy for you to merge when you think it's in good shape.
If you do want to merge before moving the StorageClass->Formatter lookup defaults to a non-test config file, I'll go ahead and add those on DM-14378 (there are a number of additional mappings I want to move to daf_butler from ci_hsc anyway, once I have a place to put them in daf_butler).

@@ -94,7 +94,7 @@ def __init__(self, other=None):
def ppprint(self):
"""helper function for debugging, prints a config out in a readable way in the debugger.

use: pdb> print myConfigObject.pprint()
use: pdb> print myConfigObject.ppprint()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think pdb requires parentheses after print these days.

the configuration supplied might contain more keys than you are interested
in and you only want one bit of it. For example, your config might contain
``schema`` if it's part of a global config but it might be the contents
of ``schema``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing the end of this sentence, I think.


requiredKeys = ()
"""Keys that are required to be specified in the configuration.
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these relative to the top level or the component? (should be documented)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relative to the Config root of this config (which does not include the component name).

# this checks a specific part of the tree
# We may need to turn off validation since we do not
# require that each defaults file found is fully
# consistent.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think we'll need to defer validation until the full config has been loaded and merged.

ModuleNotFoundError
No module in the supplied import string could be found.
ImportError
pythonType is found but can not be imported.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be best to catch and re-raise any of these as ImportError; it's much more likely that the caller will want to catch anything (without having a too-general except Exception block) than try to distinguish between them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ModuleNotFoundError is a subclass of ImportError. I was pleased to discover that the simplification of doImport is only possible because we are using python 3.6 (previous versions did not have ModuleNotFoundError). I was trying to keep the interface looking like import. I suppose I could convert AttributeError to ImportError (although I thought I had managed to get import to throw that at one point, maybe via from).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, using a hierarchy of exceptions is the best scenario. If you can convert the AttributeError to ImportError, I do think that'd be best, even if that is a bit unlike import itself.

defaultConfigFile = None
"""Path to configuration defaults. Relative to $DAF_BUTLER_DIR/config or
absolute path. Can be None if no defaults specified.
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we not actually have a default for SqlRegistry configuration?

@TallJimbo
Copy link
Member

TallJimbo commented May 15, 2018

Regarding default_schema.yaml I'm happy to delete the registry directory and call the file registry_schema.yaml.

Should we just call it schema.yaml, so overrides can use the same filename? (Overrides should be rare, but probably permitted in the same way StorareClass overrides are).

@timj
Copy link
Member Author

timj commented May 15, 2018

Everything can be overridden using $DAF_BUTLER_CONFIG_PATH if they use the same file name as is in the butler config directory. I'm okay with schema.yaml.

timj added 6 commits May 15, 2018 17:13
This simplifies catch logic for doImport and most people do
not need to understand AttributeError vs ImportError in this
context.
Now each configuration determines its own defaults
automatically. A ButlerConfig now simply asks each
subconfig for a merge of the defaults with the
externally supplied values.
Does not completely work yet because "run" can not be defaulted.
timj added 5 commits May 17, 2018 10:51
* Clearer statement on priority of search paths
* Add explicit search path override to support tests
* Merge defaults before overriding with external values
Copy link
Member

@TallJimbo TallJimbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good; one question and one doc bug.

Also, could you rename registry_schema.yaml to just schema.yaml?


# Select the part we need from it
if self.component is not None and self.component in externalConfig:
externalConfig.data = externalConfig.data[self.component]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean we merge the wrong externalConfig values when these conditions are not met, or does that mean we're assuming that externalConfig already corresponds to just the stuff under this component?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It means that if you give a config with a root of "datastore" and the config knows it's a datastore then your config will take the contents of the "datastore" part of the tree. If there is no "datastore" in the tree it assumes it's all meant for this config (and will then check for mandatory keys). ie you can give it a datastore.yaml that doesn't have the "datastore" part.

"""Search the supplied paths, merging the configuration values

The values read will override values currently stored in the object.
Every file found in the path will be read. The priority
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Orphaned sentence.

@timj timj merged commit f0893da into master May 17, 2018
@ktlim ktlim deleted the tickets/DM-14191 branch August 25, 2018 06:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants