New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-13814: Revamp how datastores and composites work #26
Conversation
3e1b954
to
1a5326c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks ok, modulo comments. But I'd like a second look at it by Jim
Assembled exposure. | ||
""" | ||
micomps = {} | ||
hasMI = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer longer names: maskedImageComponents
. Annoying to type, but easier to read.
if c in components: | ||
hasMI = True | ||
v = components[c] | ||
micomps[c] = v |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here too: c -> component
, v -> value
.
if wcs is not None: | ||
exposure.setWcs(wcs) | ||
|
||
# Set other components |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering if something in obs_base
already knows how to create an exposure from its components so I hadn't finished populating it. Maybe @TallJimbo knows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, nothing like this already exists. The afw APIs probably ought to be updated someday, so you can just pass kwargs with components to the Exposure
constructor.
exposure = makeExposure(maskedImage, wcs=wcs) | ||
|
||
if not isinstance(exposure, pytype): | ||
raise RuntimeError("Internal inconsistency in Exposure assembly") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a very descriptive error message.
python/lsst/daf/butler/butler.py
Outdated
# and also has components | ||
if storageClass.assemblerClass.disassemble is not None and storageClass.components: | ||
components = storageClass.assembler().disassemble(obj) | ||
if components: # Probably need to always have all components |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would think so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This gets back to the "if a component is defined to be None, does it really exist". In this case I think disassembly must always return something unless it's a completely blank Exposure.
if name == "config" or (isinstance(info, str) and info.endswith(".yaml")): | ||
# This seems to be a location of another file so process that | ||
self.addFromConfig(sconfig["config"]) | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like duplicate from Config
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know. I wrote that from scratch. Does Config
know that a storageClasses
entry in a config can recursively load another config file if it sees the key config
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, but it does have some check for a YAML file argument to its constructor that then builds it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to create a storageClass called extraStorageClassesHere.yaml
and I thought that you'd want the option of specifying multiple sources of extra definitions.
python/lsst/daf/butler/core/utils.py
Outdated
def __call__(cls, *args, **kwargs): # noqa N805 | ||
if cls not in cls._instances: | ||
cls._instances[cls] = super(Singleton, cls).__call__(*args, **kwargs) | ||
return cls._instances[cls] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this do about multiple instances generated with different arguments? Now you confusingly get back an instance constructed with different arguments then you supplied.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you only ever get the first thing you constructed back. The **kwargs
are there to allow this metaclass to be generally useful. I never call the StorageClassFactory
constructor with arguments now because it no longer makes any sense because you don't know if you are the first one. I could remove the arguments from this definition if you want to force our uses of Singleton
to only work if there are no arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think that would be safer.
typeName = datasetType.name | ||
template = self.templates.getTemplate(typeName) | ||
location = self.locationFactory.fromPath(template.format(ref.dataId, | ||
datasetType=typeName) + "#predicted") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume we don't use #component
anymore? So this won't clash?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. I should remove that code from Location
.
|
||
@staticmethod | ||
def calcChecksum(filename, algorithm="blake2b", block_size=8192): | ||
"""Calculate the checksum of the supplied file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is calc
our convention for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know. computeChecksum
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See rule 3-26. Abbreviations in names SHOULD be avoided in https://developer.lsst.io/cpp/style.html (which, given that it is about naming) I assume also applies to Python. compute
is a specific example.
tests/test_butler.py
Outdated
|
||
# Get a component | ||
summary = butler.get("{}.{}".format(datasetTypeName, "summary"), dataId) | ||
self.assertEqual(summary, metric.summary) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about the other components?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. I'll use a for loop and getattr
.
return "DatasetType({}, {}, {})".format(self.name, self.storageClass.name, self.dataUnits) | ||
|
||
@staticmethod | ||
def nameWithComponent(datasetTypeName, componentName): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pschella you might have an opinion on nameWithComponent
, component
and componentTypeName
in DatasetType
. They contain the logic for "calexp.wcs" naming convention so I wanted it to be in the one place in this class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I had one structural comment about where disassemblers are associated with StorageClasses that I suspect is just a misunderstanding on my part, and a bunch of smaller things that I think are just minor cleanups.
I also think we're missing a piece of the lookup for components of deferred virtual composites in Registry (@pschella); there's an if
statement in the confluence page pseudocode for that that doesn't seem to be implemented yet here.
if wcs is not None: | ||
exposure.setWcs(wcs) | ||
|
||
# Set other components |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, nothing like this already exists. The afw APIs probably ought to be updated someday, so you can just pass kwargs with components to the Exposure
constructor.
if "apCorrMap" in components: | ||
info.setApCorrMap(components.pop("calib")) | ||
if "coaddInputs" in components: | ||
info.setCoaddInputs(components.pop("coaddInputs")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessarily advocating this, but if you're interested, most of these ExposureInfo
setters can take None
(I think VisitInfo
may be the only exception), so you could use a pattern like
info.setX(components.pop("x", None))
storageClass = datasetType.storageClass | ||
|
||
# Check to see if this storage class has a disassembler | ||
if storageClass.assemblerClass.disassemble is not None and storageClass.components: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whether this disassemble
method is None
depends on the Butler
config, right? How does that work if the storageClass
instance is being retrieved from the Registry? (I'd have thought the assembler
couldn't be attached to the storageClass
for this reason, or if it was, it'd have to be attached by Butler
itself, and then only if it knows it's the only Butler
using that particular StorageClass
instance).
That said, what we need from the config probably isn't the disassembler itself (since that should always be the same, which of course why it's nice to attach it to StorageClass
intrinsically) - we just need a flag in the config that says whether we should use it, to use in this if statement.
(Or maybe I just don't understand how it works, of course).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
storage classes are constructed via StorageClassFactory from config file. That's a singleton (as we discussed last week). The definition includes a definition of the assembler class. This is a name at class construction time. When you ask for an assembler instance from the storageclass it does a runtime import of the assembler (so you only need afw Exposure knowledge when you are actively wanting to deal with an Exposure). This is all hidden from registry and butler so registry is not involved other than asking the factory to convert the name into an instance when creating the DatasetType, and butler can just assume that the assembler will be there. You always have an assembler defined, but a disassemble method can be nulled out if you don't want to split up the in memory dataset into components.
Does that help?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That all makes sense if the disassemble
method is either nulled out or not the original config file that defines the StorageClass
, and can never change after that.
Let's say we're talking about a StorageClass
that's define in the global configuration.
Can per-butler override configuration null the disassemble method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StorageClassFactory is a singleton so you always retrieve the same StorageClass with a given name so if you have multiple butlers instantiated you do indeed end up with no way to enable or disable the disassembler per-butler. StorageClass doesn't have any instance data either so you can't override it by configuration. In the previous implementation this wasn't a problem because each butler got its own StorageClassFactory and the StorageClass definitions in the factory were allowed to be different even if they had the same fully qualified name. The switch to a singleton does require me to think some more about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. I don't mind merging this as is; I don't think fixing it in the future will break anything structural.
path : `str` | ||
Path to Dataset, relative to `Datastore` root. | ||
storageClass : `StorageClass` | ||
`StorageClass` used when writing the file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I understand how this works from our conversation yesterday, but this would be a good place to describe how/why/when it can differ from the StorageClass
used to read the file.
try: | ||
storedFileInfo = self.getStoredFileInfo(ref) | ||
except KeyError: | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This currently returns False
for virtual composites. I think that might be a problem; we definitely need a way to distinguish that state from "it used to exist, but has been deleted", which I don't think we have right now, but the solution doesn't have to be having this return True
for virtual composites.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. What does existence mean...? I was treating existence as "can I retrieve this thing" not "has it ever existed".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is also a datastore.exists and not a butler.exists. If butler splits up an exposure and writes it as separate items datastore never sees the parent DatasetRef so can't understand that it does indeed have all the bits needed to make the composite. The only way this would really work is if butler could register the parent DatasetRef with the datastore after it's put all the components.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeping Datastore from knowing about these parents is probably a good idea. I think that just means we need a way for Registry to know that no Datastore entry should exist. That was going to be the assembler
field in Dataset, and I think we can now just make that into a bool.
@@ -106,13 +106,13 @@ class FileTemplate: | |||
def __init__(self, template): | |||
self.template = template | |||
|
|||
def format(self, dataUnits, datasetType=None, component=None): | |||
def format(self, dataId, datasetType=None, component=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this may not need a component
argument anymore, since that's packed into the datasetType
.
In the future, it will probably need to have a run
argument, as we'll need templates to be unique across runs as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Now that I've encoded the dot syntax into DatasetType
I don't think a separate component is needed any more. I think I need to use the new DatasetType
splitter here now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I think it should just take a DatasetRef
directly (it has all the information).
""" | ||
|
||
# if this has never been written then we have to guess | ||
if ref.id is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check only tells you whether some Datastore
has written it, not whether this Datastore
has, and I think that's what you want to be checking.
That said, I think users actually may want to specify (e.g. with a boolean arg) whether they're asking for a guess URI for something they haven't written yet or a URI for something that should exist, so they can get an exception when they don't get what they want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I wrote this I didn't have an exists method. I'll add an option to only retrieve things that actually exist.
raise FileNotFoundError("No such file: {0}".format(location.uri)) | ||
os.remove(location.preferredPath()) | ||
os.remove(location.path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delete from StoredFileInfo
, too?
# Concrete composite written as a single file (we hope) | ||
try: | ||
data = fileDescriptor.storageClass.assembler().getComponent(data, comp) | ||
except AttributeError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprised the expected exception is an AttributeError
; is that a detail of the default implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/lsst/daf_butler/blob/master/python/lsst/daf/butler/core/composites.py#L205
You prefer KeyError
? I throw AttributeError
because I'm testing with hasattr
in there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong opinion; I just wanted to make sure this was being considered as part of the CompositeAssembler
public API rather than being an accident (and I see now that the exception type is documented there as the one derived classes should raise, so I'm happy).
* Datastore APIs now take DatasetRefs * Assembly and disassembly is now done by butler. * Datastore now records path and formatter indexed by DatasetRef. - No longer uses a URI to record that information. * URIs can now be retrieved separately. * StorageClassFactory is now a Singleton. * DatasetType now has a StorageClass instance (not name) * StorageClasses can now be defined in multiple config files. * afw Exposure assembly much improved. * File templates now use DatasetRef * DatasetType now knows how to to represent components in the name. * Read Formatters now have access to the storage class used to write and the storage class requested to read the data. * Location no longer knows about components.
Following the new scheme, DatasetRefs are now the means by which a datastore get/put work. Complete reorganization of how that happens and composite disassembly now happens at the butler level. All tests now pass. Currently uses a transient in-memory store in Datastore for keeping track of formatter that was used for a dataset.
Still need to add tests for butler put/get.