DM-28650: Add Butler.transfer_from #523
@@ -0,0 +1,3 @@
+Add ``butler transfer-datasets`` command-line tool and associated ``Butler.transfer_from()`` API.
+
+This can be used to transfer datasets between different butlers, with the caveat that dimensions and dataset types must be pre-defined in the receiving butler repository.
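A minimal usage sketch, not taken from the PR itself: it assumes the Butler-level `transfer_from()` mirrors the `Datastore.transfer_from()` signature shown in the diff below, and the repository paths and dataset type name are placeholders.

```python
# Illustrative only: the paths, the "calexp" dataset type, and the exact
# Butler.transfer_from() signature are assumptions, not PR code.
from lsst.daf.butler import Butler

source = Butler("/path/to/source_repo")
target = Butler("/path/to/target_repo", writeable=True)

# Dimensions and dataset types must already be defined in the target
# repository (the caveat noted in the changelog above).
refs = set(source.registry.queryDatasets("calexp", collections=...))

# "auto" lets the datastore pick a concrete transfer mode; "direct" is
# rejected because the source datastore still controls those files.
target.transfer_from(source, refs, transfer="auto")
```

The ``butler transfer-datasets`` command named in the changelog wraps the same operation for command-line use.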
@@ -364,9 +364,34 @@ def getStoredItemsInfo(self, ref: DatasetIdRef) -> List[StoredFileInfo]:
         # Look for the dataset_id -- there might be multiple matches
         # if we have disassembled the dataset.
-        records = list(self._table.fetch(dataset_id=ref.id))
+        records = self._table.fetch(dataset_id=ref.id)
         return [StoredFileInfo.from_record(record) for record in records]

+    def _get_stored_records_associated_with_refs(self,
+                                                 refs: Iterable[DatasetIdRef]
+                                                 ) -> Dict[DatasetId, List[StoredFileInfo]]:
+        """Retrieve all records associated with the provided refs.
+
+        Parameters
+        ----------
+        refs : iterable of `DatasetIdRef`
+            The refs for which records are to be retrieved.
+
+        Returns
+        -------
+        records : `dict` of [`DatasetId`, `list` of `StoredFileInfo`]
+            The matching records indexed by the ref ID. The number of entries
+            in the dict can be smaller than the number of requested refs.
+        """
+        records = self._table.fetch(dataset_id=[ref.id for ref in refs])
+
+        # Uniqueness is dataset_id + component so can have multiple records
+        # per ref.
+        records_by_ref = defaultdict(list)
+        for record in records:
+            records_by_ref[record["dataset_id"]].append(StoredFileInfo.from_record(record))
+        return records_by_ref
+
     def _refs_associated_with_artifacts(self, paths: List[Union[str, ButlerURI]]) -> Dict[str,
                                                                                           Set[DatasetId]]:
         """Return paths and associated dataset refs.

@@ -381,7 +406,7 @@ def _refs_associated_with_artifacts(self, paths: List[Union[str, ButlerURI]]) ->
         mapping : `dict` of [`str`, `set` [`DatasetId`]]
             Mapping of each path to a set of associated database IDs.
         """
-        records = list(self._table.fetch(path=[str(path) for path in paths]))
+        records = self._table.fetch(path=[str(path) for path in paths])
         result = defaultdict(set)
         for row in records:
             result[row["path"]].add(row["dataset_id"])
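The grouping done in `_get_stored_records_associated_with_refs()` above is a plain `defaultdict` pattern: every row that shares a `dataset_id` (one row per component of a disassembled composite) lands in the same list. A minimal standalone sketch, using illustrative row dicts rather than the real `StoredFileInfo` records:

```python
from collections import defaultdict
from typing import Any, Dict, List

Row = Dict[str, Any]

def group_rows_by_dataset_id(rows: List[Row]) -> Dict[Any, List[Row]]:
    # One list per dataset_id; a disassembled composite yields one row
    # per component, all sharing the same dataset_id.
    grouped: Dict[Any, List[Row]] = defaultdict(list)
    for row in rows:
        grouped[row["dataset_id"]].append(row)
    return grouped

rows = [
    {"dataset_id": 1, "component": "image"},
    {"dataset_id": 1, "component": "mask"},
    {"dataset_id": 2, "component": None},
]
assert dict(group_rows_by_dataset_id(rows)) == {1: rows[:2], 2: rows[2:]}
```

Because the result is keyed by ID rather than aligned with the input refs, refs with no stored records simply do not appear in the dict, which is exactly the property `transfer_from()` uses below to compute `missing_ids`.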
@@ -1723,6 +1748,131 @@ def emptyTrash(self, ignore_errors: bool = True) -> None:
             else:
                 raise

+    @transactional
+    def transfer_from(self, source_datastore: Datastore, refs: Iterable[DatasetRef],
+                      local_refs: Optional[Iterable[DatasetRef]] = None,
+                      transfer: str = "auto") -> None:
+        # Docstring inherited
+        if type(self) is not type(source_datastore):
+            raise TypeError(f"Datastore mismatch between this datastore ({type(self)}) and the "
+                            f"source datastore ({type(source_datastore)}).")
+
+        # Be explicit for mypy
+        if not isinstance(source_datastore, FileDatastore):
+            raise TypeError("Can only transfer to a FileDatastore from another FileDatastore, not"
+                            f" {type(source_datastore)}")
+
+        # Stop early if "direct" transfer mode is requested. That would
+        # require that the URI inside the source datastore should be stored
+        # directly in the target datastore, which seems unlikely to be useful
+        # since at any moment the source datastore could delete the file.
+        if transfer == "direct":
+            raise ValueError("Can not transfer from a source datastore using direct mode since"
+                             " those files are controlled by the other datastore.")
+
+        # We will go through the list multiple times so must convert
+        # generators to lists.
+        refs = list(refs)
+
+        if local_refs is None:
+            local_refs = refs
+        else:
+            local_refs = list(local_refs)
+
+        # In order to handle disassembled composites the code works
+        # at the records level since it can assume that internal APIs
+        # can be used.
+        # - If the record already exists in the destination this is assumed
+        #   to be okay.
+        # - If there is no record but the source and destination URIs are
+        #   identical no transfer is done but the record is added.
+        # - If the source record refers to an absolute URI currently assume
+        #   that that URI should remain absolute and will be visible to the
+        #   destination butler. May need to have a flag to indicate whether
+        #   the dataset should be transferred. This will only happen if
+        #   the detached Butler has had a local ingest.
+
+        # What we really want is all the records in the source datastore
+        # associated with these refs. Or derived ones if they don't exist
+        # in the source.
+        source_records = source_datastore._get_stored_records_associated_with_refs(refs)
+
+        # The source dataset_ids are the keys in these records
+        source_ids = set(source_records)
+        log.debug("Number of datastore records found in source: %d", len(source_ids))
+
+        # The not None check is to appease mypy
+        requested_ids = set(ref.id for ref in refs if ref.id is not None)
+        missing_ids = requested_ids - source_ids
+
+        # Missing IDs can be okay if that datastore has allowed
+        # gets based on file existence. Should we transfer what we can
+        # or complain about it and warn?
+        if missing_ids and not source_datastore.trustGetRequest:
+            raise ValueError(f"Some datasets are missing from source datastore {source_datastore}:"
+                             f" {missing_ids}")
+
+        # Need to map these missing IDs to a DatasetRef so we can guess
+        # the details.
+        if missing_ids:
+            log.info("Number of expected datasets missing from source datastore records: %d",
+                     len(missing_ids))
+            id_to_ref = {ref.id: ref for ref in refs if ref.id in missing_ids}
+
+            for missing in missing_ids:
+                expected = self._get_expected_dataset_locations_info(id_to_ref[missing])
+                source_records[missing].extend(info for _, info in expected)
+
+        # See if we already have these records
+        target_records = self._get_stored_records_associated_with_refs(local_refs)
+
+        # The artifacts to register
+        artifacts = []
+
+        # Refs that already exist
+        already_present = []
+
+        # Now can transfer the artifacts
+        for source_ref, target_ref in zip(refs, local_refs):
+            if target_ref.id in target_records:
+                # Already have an artifact for this.
+                already_present.append(target_ref)
+                continue
+
+            # mypy needs to know these are always resolved refs
+            for info in source_records[source_ref.getCheckedId()]:
+                source_location = info.file_location(source_datastore.locationFactory)
+                target_location = info.file_location(self.locationFactory)
+                if source_location == target_location:
+                    # Either the dataset is already in the target datastore
+                    # (which is how execution butler currently runs) or
+                    # it is an absolute URI.
+                    if source_location.pathInStore.isabs():
+                        # Just because we can see the artifact when running
+                        # the transfer doesn't mean it will be generally
+                        # accessible to a user of this butler. For now warn
+                        # but assume it will be accessible.
+                        log.warning("Transfer request for an outside-datastore artifact has been found at %s",
+                                    source_location)

Review comment:
I'm not sure whether this should be an exception or silent, but warning seems like it's still going to be a problem if it's user error, while being annoying if the user knows what they are doing. Do we want a flag so the user can say, "trust me on this"?

Reply:
In my head this scenario was extremely unlikely because it only happens with ingested datasets and not those created by processing. Since the transfer does not (yet?) copy dimension records over and the receiving registry won't ever have seen these raw files, it seems like we aren't really going to get this far in the code. If we turn this into the default mode for transferring content from one butler to another (and only use export/import for cases where you can't see both butlers from your one python process) then yes, we need the code higher up to insert missing dimension records and then ask the user whether absolute URIs should be transferred or left as-is. I was putting in a warning there as a stop-gap for a scenario where I'm not expecting it to happen but I don't want to raise an exception immediately -- but a warning will generally prompt people to ask us what the warning means.
+                else:
+                    # Need to transfer it to the new location.
+                    # Assume we should always overwrite. If the artifact
+                    # is there this might indicate that a previous transfer
+                    # was interrupted but was not able to be rolled back
+                    # completely (e.g. pre-emption) so follow Datastore default
+                    # and overwrite.
+                    target_location.uri.transfer_from(source_location.uri, transfer=transfer,
+                                                      overwrite=True, transaction=self._transaction)
+
+                artifacts.append((target_ref, info))
+
+        self._register_datasets(artifacts)
+
+        if already_present:
+            n_skipped = len(already_present)
+            log.info("Skipped transfer of %d dataset%s already present in datastore", n_skipped,
+                     "" if n_skipped == 1 else "s")
+
     @transactional
     def forget(self, refs: Iterable[DatasetRef]) -> None:
         # Docstring inherited.
Review comment:
Not sure it's a job for this ticket, but I do think we want a public API for this. It's not just an implementation detail; it's a behavioral change that's important to a lot of higher-level code.

Reply:
It's a transitory problem to a certain extent. Things break if you try to use UUID in an old registry, and if I knew that I could strip the IDs before calling the import. The question is whether we actually care about that kind of migration. One question I have is whether you are thinking of this as a long-term API or a quick hack. I could easily see a "private" attribute on the dataset manager that we retire when we retire integer IDs. Do you envisage a more public API on `Registry` that will return the `dataset_id_type`, even though in the long term that would presumably be the type of `DatasetRef.id`?

Reply:
I think totally public - things in pipe_base and maybe obs_base do care (thinking especially about Quantum and QuantumGraph because I'm working on those now). I imagine we would probably deprecate this API when we deprecate `int` support, but that depends on how much we want to hedge against changing again in the future. If we wanted to keep it around forever, we might want to think about whether there should be some other flags, in addition to the type; those might last longer: `has_globally_unique_ids`, `supports_deterministic_ids`. Of course, if we expect both of those to always be true forever after `int` is gone, then maybe just the `int` vs `uuid` type flag is enough.

Reply:
Okay. I'll do this on another ticket because it will also lead to a cleanup of obs_base ingest-raws.
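A hypothetical sketch of the public API floated in this thread, not code from the PR: only the names `dataset_id_type`, `has_globally_unique_ids`, and `supports_deterministic_ids` come from the comments above; everything else is invented for illustration.

```python
# Hypothetical sketch of the Registry-level introspection discussed in
# the review thread; class and helper names here are made up.
import uuid
from dataclasses import dataclass

@dataclass(frozen=True)
class DatasetIdTypeInfo:
    """Capability record a Registry could expose."""
    dataset_id_type: type             # int today, uuid.UUID after migration
    has_globally_unique_ids: bool     # flag suggested in the review
    supports_deterministic_ids: bool  # flag suggested in the review

# An integer-ID registry might report:
legacy = DatasetIdTypeInfo(int, False, False)
# while a UUID-backed registry might report:
migrated = DatasetIdTypeInfo(uuid.UUID, True, True)

def accepts_uuid_refs(info: DatasetIdTypeInfo) -> bool:
    # Callers could use this to strip dataset IDs before handing refs
    # to an older registry, the migration problem raised above.
    return info.dataset_id_type is uuid.UUID
```

The flags would let higher-level code (pipe_base, obs_base) branch on capabilities rather than on the concrete ID type, which is why the review suggests they might outlive the `int` vs `uuid` distinction itself.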