DM-39198: fix storage class handling in datastore, as used by QuantumBackedButler #838

Merged
merged 3 commits on May 17, 2023
1 change: 1 addition & 0 deletions doc/changes/DM-39198.bugfix.md
@@ -0,0 +1 @@
Fix bugs in storage class conversion in `FileDatastore`, as used by `QuantumBackedButler`.
22 changes: 16 additions & 6 deletions python/lsst/daf/butler/datastores/fileDatastore.py
@@ -1108,7 +1108,9 @@ def _write_in_memory_to_artifact(self, inMemoryDataset: Any, ref: DatasetRef) ->
Information describing the artifact written to the datastore.
"""
# May need to coerce the in memory dataset to the correct
# python type.
# python type, but first we need to make sure the storage class
# reflects the one defined in the data repository.
ref = self._cast_storage_class(ref)
inMemoryDataset = ref.datasetType.storageClass.coerce_type(inMemoryDataset)

location, formatter = self._prepare_for_put(inMemoryDataset, ref)
@@ -1561,15 +1563,20 @@ def _mexists(
existence : `dict` of [`DatasetRef`, `bool`]
Mapping from dataset to boolean indicating existence.
"""
# Need a mapping of dataset_id to dataset ref since the API
# works with dataset_id
id_to_ref = {ref.getCheckedId(): ref for ref in refs}
# Make a mapping from refs with the internal storage class to the given
# refs that may have a different one. We'll use the internal refs
# throughout this method and convert back at the very end.
internal_ref_to_input_ref = {self._cast_storage_class(ref): ref for ref in refs}
Member

This is needed because the graph doesn't include datastore records, so we are in "trustGetRequest" mode in QBB execution, have to determine the output file name ourselves, and therefore need the correct storage class to find the correct template/file extension?
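
(As a rough, purely illustrative sketch of that point — the extension table and function below are hypothetical, not the real daf_butler formatter/template machinery: without datastore records the datastore has to predict the artifact name itself, and the storage class is what selects the formatter and hence the file extension.)

```python
# Hypothetical illustration only: in "trust" mode there are no stored records,
# so the file name must be predicted from the template plus the storage class
# defined in the repository, which determines the formatter and extension.
STORAGE_CLASS_EXTENSIONS = {
    "DataFrame": ".parq",
    "ArrowTable": ".parq",
    "SourceCatalog": ".fits",
}


def predicted_artifact_name(dataset_id: str, storage_class: str) -> str:
    """Guess the artifact file name from the dataset ID and storage class."""
    return f"{dataset_id}{STORAGE_CLASS_EXTENSIONS.get(storage_class, '.pickle')}"


print(predicted_artifact_name("a1b2c3", "DataFrame"))  # a1b2c3.parq
```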

Member Author

In the particular QBB context, yes, and I think in execution butler too (which is weird, but maybe historical and irrelevant). But I can't claim I've made sense of it all; it seems like sometimes we're expecting butler to guarantee that certain refs passed to Datastore already use the registry storage class, and in other cases we're expecting Datastore to fix them itself via the callback passed to set_retrieve_dataset_type_method. What I've tried to do here is just be super defensive so the Datastore can handle what's thrown at it, and I bet that means we've got a few code paths where we ensure we've got the registry storage class on the same ref multiple times.

I really think the right fix is to make Butler.datastore private (which means giving ctrl_mpexec an alternative to calling butler.datastore.mexists) and then rework the Datastore API so it always gets registry storage classes in one argument and overrides in another argument, with the butler (either full or QBB) making sure those methods are called correctly. It'd also help a lot to modify Datastore APIs that really only need the dataset ID or ID + component (not the full ref) to take just that.
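
To make the defensive pattern concrete, here is a minimal sketch (illustrative only — `Ref` and the two callables are simplified stand-ins for `DatasetRef`, the registry storage-class lookup, and the existence check; this is not the actual daf_butler code):

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from typing import Callable, Iterable


@dataclass(frozen=True)
class Ref:
    """Simplified stand-in for lsst.daf.butler.DatasetRef."""

    dataset_id: str
    storage_class: str


def mexists_defensive(
    refs: Iterable[Ref],
    registry_storage_class: Callable[[Ref], str],
    artifact_exists: Callable[[Ref], bool],
) -> dict[Ref, bool]:
    """Check existence with registry storage classes internally, but key the
    result by the refs the caller actually passed in."""
    # Normalize every incoming ref to the repository-defined storage class,
    # remembering which caller-provided ref it came from.
    internal_to_input = {
        replace(ref, storage_class=registry_storage_class(ref)): ref for ref in refs
    }
    # Do all the real work with the internal refs ...
    existence = {internal: artifact_exists(internal) for internal in internal_to_input}
    # ... and convert back to the caller's refs only at the very end.
    return {internal_to_input[internal]: ok for internal, ok in existence.items()}


caller_ref = Ref("a1b2c3", "DataFrame")  # caller's storage class override
result = mexists_defensive(
    [caller_ref],
    registry_storage_class=lambda ref: "ArrowTable",  # what the repo defines
    artifact_exists=lambda ref: True,
)
print(result[caller_ref])  # True -- result keyed by the caller's ref
```

The same round trip — normalize to the internal storage class, do the work, convert back at the end — is what the `_mexists` change above does with `internal_ref_to_input_ref`.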

Member

It didn't use to matter because this was always getting refs that came from Registry: graphs used unresolved refs and butler.put used unresolved refs. Now that everything is resolved and there is no guarantee Registry even knows about a ref, things are a bit more exciting.


# Need a mapping of dataset_id to (internal) dataset ref since some
# internal APIs work with dataset_id.
id_to_ref = {ref.getCheckedId(): ref for ref in internal_ref_to_input_ref}

# Set of all IDs we are checking for.
requested_ids = set(id_to_ref.keys())

# The records themselves. Could be missing some entries.
records = self._get_stored_records_associated_with_refs(refs)
records = self._get_stored_records_associated_with_refs(id_to_ref.values())

dataset_existence = self._process_mexists_records(
id_to_ref, records, True, artifact_existence=artifact_existence
@@ -1586,7 +1593,10 @@
)
)

return dataset_existence
return {
internal_ref_to_input_ref[internal_ref]: existence
for internal_ref, existence in dataset_existence.items()
}

def _mexists_check_expected(
self, refs: Sequence[DatasetRef], artifact_existence: Optional[Dict[ResourcePath, bool]] = None