Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-13814: Revamp how datastores and composites work #26

Merged
merged 3 commits into from Mar 30, 2018
Merged

Conversation

timj
Copy link
Member

@timj timj commented Mar 28, 2018

Following the new scheme, DatasetRefs are now the means by which a datastore get/put work. Complete reorganization of how that happens and composite disassembly now happens at the butler level. All tests now pass. Currently uses a transient in-memory store in Datastore for keeping track of formatter that was used for a dataset.

Still need to add tests for butler put/get.

@timj timj force-pushed the tickets/DM-13814 branch 2 times, most recently from 3e1b954 to 1a5326c Compare March 29, 2018 19:47
Copy link
Collaborator

@pschella pschella left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks ok, modulo comments. But I'd like a second look at it by Jim

Assembled exposure.
"""
micomps = {}
hasMI = False
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer longer names: maskedImageComponents. Annoying to type, but easier to read.

if c in components:
hasMI = True
v = components[c]
micomps[c] = v
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here too: c -> component, v -> value.

if wcs is not None:
exposure.setWcs(wcs)

# Set other components
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if something in obs_base already knows how to create an exposure from its components so I hadn't finished populating it. Maybe @TallJimbo knows.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, nothing like this already exists. The afw APIs probably ought to be updated someday, so you can just pass kwargs with components to the Exposure constructor.

exposure = makeExposure(maskedImage, wcs=wcs)

if not isinstance(exposure, pytype):
raise RuntimeError("Internal inconsistency in Exposure assembly")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a very descriptive error message.

# and also has components
if storageClass.assemblerClass.disassemble is not None and storageClass.components:
components = storageClass.assembler().disassemble(obj)
if components: # Probably need to always have all components
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would think so.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gets back to the "if a component is defined to be None, does it really exist". In this case I think disassembly must always return something unless it's a completely blank Exposure.

if name == "config" or (isinstance(info, str) and info.endswith(".yaml")):
# This seems to be a location of another file so process that
self.addFromConfig(sconfig["config"])
continue
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like duplicate from Config.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. I wrote that from scratch. Does Config know that a storageClasses entry in a config can recursively load another config file if it sees the key config?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, but it does have some check for a YAML file argument to its constructor that then builds it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to create a storageClass called extraStorageClassesHere.yaml and I thought that you'd want the option of specifying multiple sources of extra definitions.

def __call__(cls, *args, **kwargs): # noqa N805
if cls not in cls._instances:
cls._instances[cls] = super(Singleton, cls).__call__(*args, **kwargs)
return cls._instances[cls]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this do about multiple instances generated with different arguments? Now you confusingly get back an instance constructed with different arguments then you supplied.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you only ever get the first thing you constructed back. The **kwargs are there to allow this metaclass to be generally useful. I never call the StorageClassFactory constructor with arguments now because it no longer makes any sense because you don't know if you are the first one. I could remove the arguments from this definition if you want to force our uses of Singleton to only work if there are no arguments.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think that would be safer.

typeName = datasetType.name
template = self.templates.getTemplate(typeName)
location = self.locationFactory.fromPath(template.format(ref.dataId,
datasetType=typeName) + "#predicted")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume we don't use #component anymore? So this won't clash?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. I should remove that code from Location.


@staticmethod
def calcChecksum(filename, algorithm="blake2b", block_size=8192):
"""Calculate the checksum of the supplied file.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is calc our convention for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. computeChecksum?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See rule 3-26. Abbreviations in names SHOULD be avoided in https://developer.lsst.io/cpp/style.html (which, given that it is about naming) I assume also applies to Python. compute is a specific example.


# Get a component
summary = butler.get("{}.{}".format(datasetTypeName, "summary"), dataId)
self.assertEqual(summary, metric.summary)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the other components?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. I'll use a for loop and getattr.

return "DatasetType({}, {}, {})".format(self.name, self.storageClass.name, self.dataUnits)

@staticmethod
def nameWithComponent(datasetTypeName, componentName):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pschella you might have an opinion on nameWithComponent, component and componentTypeName in DatasetType. They contain the logic for "calexp.wcs" naming convention so I wanted it to be in the one place in this class.

Copy link
Member

@TallJimbo TallJimbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I had one structural comment about where disassemblers are associated with StorageClasses that I suspect is just a misunderstanding on my part, and a bunch of smaller things that I think are just minor cleanups.

I also think we're missing a piece of the lookup for components of deferred virtual composites in Registry (@pschella); there's an if statement in the confluence page pseudocode for that that doesn't seem to be implemented yet here.

if wcs is not None:
exposure.setWcs(wcs)

# Set other components
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, nothing like this already exists. The afw APIs probably ought to be updated someday, so you can just pass kwargs with components to the Exposure constructor.

if "apCorrMap" in components:
info.setApCorrMap(components.pop("calib"))
if "coaddInputs" in components:
info.setCoaddInputs(components.pop("coaddInputs"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily advocating this, but if you're interested, most of these ExposureInfo setters can take None (I think VisitInfo may be the only exception), so you could use a pattern like

info.setX(components.pop("x", None))

storageClass = datasetType.storageClass

# Check to see if this storage class has a disassembler
if storageClass.assemblerClass.disassemble is not None and storageClass.components:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whether this disassemble method is None depends on the Butler config, right? How does that work if the storageClass instance is being retrieved from the Registry? (I'd have thought the assembler couldn't be attached to the storageClass for this reason, or if it was, it'd have to be attached by Butler itself, and then only if it knows it's the only Butler using that particular StorageClass instance).

That said, what we need from the config probably isn't the disassembler itself (since that should always be the same, which of course why it's nice to attach it to StorageClass intrinsically) - we just need a flag in the config that says whether we should use it, to use in this if statement.

(Or maybe I just don't understand how it works, of course).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

storage classes are constructed via StorageClassFactory from config file. That's a singleton (as we discussed last week). The definition includes a definition of the assembler class. This is a name at class construction time. When you ask for an assembler instance from the storageclass it does a runtime import of the assembler (so you only need afw Exposure knowledge when you are actively wanting to deal with an Exposure). This is all hidden from registry and butler so registry is not involved other than asking the factory to convert the name into an instance when creating the DatasetType, and butler can just assume that the assembler will be there. You always have an assembler defined, but a disassemble method can be nulled out if you don't want to split up the in memory dataset into components.

Does that help?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That all makes sense if the disassemble method is either nulled out or not the original config file that defines the StorageClass, and can never change after that.

Let's say we're talking about a StorageClass that's define in the global configuration.

Can per-butler override configuration null the disassemble method?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StorageClassFactory is a singleton so you always retrieve the same StorageClass with a given name so if you have multiple butlers instantiated you do indeed end up with no way to enable or disable the disassembler per-butler. StorageClass doesn't have any instance data either so you can't override it by configuration. In the previous implementation this wasn't a problem because each butler got its own StorageClassFactory and the StorageClass definitions in the factory were allowed to be different even if they had the same fully qualified name. The switch to a singleton does require me to think some more about this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I don't mind merging this as is; I don't think fixing it in the future will break anything structural.

path : `str`
Path to Dataset, relative to `Datastore` root.
storageClass : `StorageClass`
`StorageClass` used when writing the file.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand how this works from our conversation yesterday, but this would be a good place to describe how/why/when it can differ from the StorageClass used to read the file.

try:
storedFileInfo = self.getStoredFileInfo(ref)
except KeyError:
return False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This currently returns False for virtual composites. I think that might be a problem; we definitely need a way to distinguish that state from "it used to exist, but has been deleted", which I don't think we have right now, but the solution doesn't have to be having this return True for virtual composites.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. What does existence mean...? I was treating existence as "can I retrieve this thing" not "has it ever existed".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is also a datastore.exists and not a butler.exists. If butler splits up an exposure and writes it as separate items datastore never sees the parent DatasetRef so can't understand that it does indeed have all the bits needed to make the composite. The only way this would really work is if butler could register the parent DatasetRef with the datastore after it's put all the components.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping Datastore from knowing about these parents is probably a good idea. I think that just means we need a way for Registry to know that no Datastore entry should exist. That was going to be the assembler field in Dataset, and I think we can now just make that into a bool.

@@ -106,13 +106,13 @@ class FileTemplate:
def __init__(self, template):
self.template = template

def format(self, dataUnits, datasetType=None, component=None):
def format(self, dataId, datasetType=None, component=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this may not need a component argument anymore, since that's packed into the datasetType.

In the future, it will probably need to have a run argument, as we'll need templates to be unique across runs as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Now that I've encoded the dot syntax into DatasetType I don't think a separate component is needed any more. I think I need to use the new DatasetType splitter here now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think it should just take a DatasetRef directly (it has all the information).

"""

# if this has never been written then we have to guess
if ref.id is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check only tells you whether some Datastore has written it, not whether this Datastore has, and I think that's what you want to be checking.

That said, I think users actually may want to specify (e.g. with a boolean arg) whether they're asking for a guess URI for something they haven't written yet or a URI for something that should exist, so they can get an exception when they don't get what they want.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I wrote this I didn't have an exists method. I'll add an option to only retrieve things that actually exist.

raise FileNotFoundError("No such file: {0}".format(location.uri))
os.remove(location.preferredPath())
os.remove(location.path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete from StoredFileInfo, too?

# Concrete composite written as a single file (we hope)
try:
data = fileDescriptor.storageClass.assembler().getComponent(data, comp)
except AttributeError:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised the expected exception is an AttributeError; is that a detail of the default implementation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/lsst/daf_butler/blob/master/python/lsst/daf/butler/core/composites.py#L205

You prefer KeyError? I throw AttributeError because I'm testing with hasattr in there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong opinion; I just wanted to make sure this was being considered as part of the CompositeAssembler public API rather than being an accident (and I see now that the exception type is documented there as the one derived classes should raise, so I'm happy).

* Datastore APIs now take DatasetRefs
* Assembly and disassembly is now done by butler.
* Datastore now records path and formatter indexed by DatasetRef.
  - No longer uses a URI to record that information.
* URIs can now be retrieved separately.
* StorageClassFactory is now a Singleton.
* DatasetType now has a StorageClass instance (not name)
* StorageClasses can now be defined in multiple config files.
* afw Exposure assembly much improved.
* File templates now use DatasetRef
* DatasetType now knows how to to represent components in the name.
* Read Formatters now have access to the storage class used to write
  and the storage class requested to read the data.
* Location no longer knows about components.
@timj timj merged commit 551b585 into master Mar 30, 2018
@ktlim ktlim deleted the tickets/DM-13814 branch August 25, 2018 06:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants