New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-24288: Do disassembly inside datastore #269
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few line comments, and you can feel free to defer any that seem involved to other tickets.
Bigger picture (and definitely out of scope for now), the path of reading and writing internal records through the various intermediate base classes is feeling pretty twisty now, and I think we may be approaching the point where we need to do some refactoring. I don't know what the solution is, but GenericBaseDatastore
in particular now adds a ton of abstract methods and indirection while providing very little nontrivial logic to its subclasses.
python/lsst/daf/butler/_butler.py
Outdated
# Add Registry Dataset entry. If not a virtual composite, add | ||
# and attach components at the same time. | ||
dataId = self.registry.expandDataId(dataId, graph=datasetType.dimensions, **kwds) | ||
ref, = self.registry.insertDatasets(datasetType, run=run, dataIds=[dataId], | ||
producer=producer, recursive=not isVirtualComposite) | ||
producer=producer, recursive=False) # not isVirtualComposite) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Relic comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I wasn't brave enough to disable it completely on the first go.
checksum=record["checksum"], | ||
file_size=record["file_size"]) | ||
|
||
def getStoredItemsInfo(self, ref): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have both a getStoredItemsInfo
and a getStoredItemInfo
? Seems like one of those ought to be renamed or removed, though this seems preexisting so maybe it's out of scope for this ticket.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The plural is the new one. The old one has to stay around until export and getUri are fixed.
# Use the path to determine the location | ||
return [(self.locationFactory.fromPath(r.path), r) for r in records] | ||
|
||
def _can_remove_dataset_artifact(self, ref, location): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this function only exists for cases like DECam raws, where there are multiple top-level datasets associated with the same artifact, not anything to do with composites?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was there previously (I think somewhere else) and is for DECam-like things but also to do with composites when they are stored in the registry. If you wanted to delete a component I had to make sure you didn't delete the file if there were multiple components associated with one file.
if parentRef is not None: | ||
# Should already conform and we know no components | ||
return DatasetRef(storage.datasetType, parentRef.dataId, id=parentRef.id, | ||
run=parentRef.run, conform=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little concerned that there's no way to tell from the DatasetRef
itself that the ID refers to the parent. Maybe that concern just goes away when we drop in-Registry disassembly/components, but for now I think it might be good to add a field to DatasetRef
to indicate when it's in this state. I suspect we'd also find that the existence of that flag might simplify logic elsewhere, and in the future (if in-Registry disassembly is gone) we could turn that field into a property that delegates to whether the DatasetType
is a component.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you drop registry-composites then findDatasets should immediately drop the component name from the search and assume it's present. That would stop the "oh, you asked for a component but I know nothing about the components". You want some kind of "isParent" property?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was going to say isComponent
, but I suspect we're talking about the same meaning and just differing on whether the subject of the name is the ID or the dataset type.
Anyhow, maybe it would just make more sense to rip out Registry components sooner rather than later. I do think we'll want that related-but-not-the-same baggage gone when we try to actually support rawplus
anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I was saying "this may look like a component DatasetRef but the ID comes from the parent". Whereas I'm not sure "isComponent" is as explicit. You can tell it's a component by looking at ref.datasetType, what you can't tell is that there is no ID directly associated with this DatasetRef.
Now add components into the composite DatasetRef since this is currently how registry gives that information to datastore.
This failure was previously masked by datastore tests not properly emulating registry.
* Add component name to internal datastore records * Allow a component to be retrieved even if registry does not know about component. * Allow registry to find datasets by looking at the parent of a component. * Do disassembly and assembly in datastore * Disable deletion of components (if registry doesn't know about them and files may or may not exist what could it possibly do?)
The code itself is still there for now but is turned off.
This indicates that the DatasetRef was not directly associated with a component in the registry but is associated with the composite. Also add some tests.
No description provided.