DM-21451: Remove DatabaseDict and vectorize Datastore/Butler ingest #197
Conversation
This looks good in general. My main concern is that I'm not entirely clear we have a story for cases where some files are not ingested due to constraints. Previously this triggered DatasetTypeNotSupportedError, but I can't see where that happens now (maybe it does and I've missed it).
"""Add an opaque (to the `Registry`) table for use by a `Datastore` or | ||
other data repository client. | ||
|
||
Opaque table record can be added via `insertOpaqueData`, retreived via |
Typo: retreived
# A "move" is sometimes a "copy" | ||
moveIsCopy = False | ||
if transfer == "move" and os.path.isabs(path): |
You decided against keeping this logic?
Yeah. At first I was just going to move it around, but it was a lot messier after the split into two methods, and it wasn't clear to me we had a real need for it.
notAcceptedCounter = 0
# Filter down to just datasets the chained datastore's own
# configuration accepts.
okForParent: List[FileDataset] = [dataset for dataset in datasets
If okForParent is empty, shouldn't we raise DatasetTypeNotSupportedError?
Raising that exception (in ingest) is now done exclusively by Datastore.ingest, which will check that the set of datasets actually ingested matches the set of datasets that the caller asked to be ingested, and raise if that isn't true. That's because we need _prepIngest to still return the list of datasets it can ingest even when it can't ingest all of them, and it can't do that if it raises.
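The split described here can be sketched roughly as follows. This is a hypothetical simplification, not the real daf_butler implementation: only the names FileDataset and DatasetTypeNotSupportedError come from the discussion, and the rest is illustrative.

```python
from collections import namedtuple

# Stand-in for the real daf_butler FileDataset; fields are simplified.
FileDataset = namedtuple("FileDataset", ["path", "datasetType"])


class DatasetTypeNotSupportedError(RuntimeError):
    """Raised when some requested datasets could not be ingested."""


class SketchDatastore:
    """Hypothetical datastore illustrating the _prepIngest/ingest split."""

    def __init__(self, acceptableTypes):
        self._acceptable = set(acceptableTypes)

    def _prepIngest(self, *datasets):
        # Never raises: returns the subset of datasets it can ingest,
        # even when that subset is incomplete (or empty).
        return [d for d in datasets if d.datasetType in self._acceptable]

    def ingest(self, *datasets):
        prepared = self._prepIngest(*datasets)
        # Only the top-level ingest compares what was requested with
        # what was accepted, and raises on any mismatch.
        if len(prepared) != len(datasets):
            rejected = [d for d in datasets if d not in prepared]
            raise DatasetTypeNotSupportedError(
                f"{len(rejected)} dataset(s) rejected by constraints"
            )
        # ... actually ingest `prepared` here ...
        return prepared
```

The key point is that _prepIngest stays side-effect-free and total, so composite datastores can call it speculatively, while the raise happens in exactly one place.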
I see that now. Thanks.
if moveIsCopy:
    dstransfer = "copy"
if constraints is not None:
    okForChild: List[FileDataset] = [dataset for dataset in okForParent
I'm a bit concerned that we've lost any visibility into whether any of the supplied datasets were ingested or not. Don't we worry if a dataset was rejected by all the child datastores because of constraints? As written it seems we worry about transfer modes but not constraints. If I ask 10 datasets to be ingested and only 9 are accepted how do I tell that? Previously ingest worked if a dataset was accepted by at least one datastore (and then we had the issue of deciding what to do if it was only accepted by an ephemeral datastore).
I think that's all still true (see reply at #197 (comment)). Unlike the constraint-based rejections, transfer-mode rejections mean that all datasets would be rejected by the nested datastore, so we don't have to return anything and we can still use the raise-and-catch logic that was in place before.
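The distinction between the two rejection paths could be sketched like this (all class and function names here are illustrative, not the real daf_butler API): a transfer-mode rejection affects every dataset handed to a child datastore, so the child can simply raise and the chained parent can catch and skip it, whereas constraint rejections are per-dataset and have to come back as a filtered list.

```python
class TransferModeError(NotImplementedError):
    """Hypothetical: a child datastore cannot do this transfer mode at all."""


class ChildStore:
    """Illustrative child datastore with a transfer-mode whitelist and
    a per-dataset constraint predicate."""

    def __init__(self, name, transfers, accepts):
        self.name = name
        self.transfers = set(transfers)
        self.accepts = accepts  # predicate applied to each dataset

    def prep(self, datasets, transfer):
        if transfer not in self.transfers:
            # Transfer-mode rejection applies to *every* dataset, so
            # raising (and letting the parent catch it) loses nothing.
            raise TransferModeError(f"{self.name}: no {transfer!r} support")
        # Constraint rejection is per-dataset: filter, don't raise.
        return [d for d in datasets if self.accepts(d)]


def chainedPrep(children, datasets, transfer):
    """Collect per-child ingest plans, skipping children that reject
    the transfer mode outright."""
    plans = {}
    for child in children:
        try:
            plans[child.name] = child.prep(datasets, transfer)
        except TransferModeError:
            continue  # raise-and-catch: this child ingests nothing
    return plans
```

A child skipped for its transfer mode simply contributes no plan, while constraint filtering leaves a (possibly shorter) list whose length the caller can still inspect.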
# Docstring inherited from GenericBaseDatastore
records = list(self.registry.fetchOpaqueData(self._tableName, dataset_id=ref.id))
if len(records) == 0:
    raise KeyError("Unable to retrieve formatter associated with Dataset {}".format(ref.id))
I don't think that error message should be referring to a formatter.
I've just moved it here from 797a3df#diff-4b3b7a0376839e94233cd3a2bddd157bL112, but I think I agree. "Unable to retrieve location associated with Dataset {}" instead?
# Docstring inherited from Datastore._prepIngest.
filtered = []
for dataset in datasets:
    if not self.constraints.isAcceptable(dataset.ref):
Again how do we know that some of the datasets were not ingested?
(I think this is already answered above)
# self.registry internally. Probably need to add
# transactions to DatabaseDict to do better than that.
self.addStoredItemInfo(ref, itemInfo)
for ref, itemInfo in zip(refs, itemInfos):
I don't think zip is going to complain if the number of refs is not equal to the number of infos. Would it be safer to accept a sequence of tuples instead?
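To illustrate the pitfall: zip() silently truncates to the shorter input, so a length mismatch between refs and item infos would go unnoticed. A hypothetical guard (the function name is made up for illustration) makes the mismatch an explicit error:

```python
def pairRecords(refs, itemInfos):
    """Pair each ref with its info, refusing silently-truncating zips.

    Hypothetical helper: zip() alone stops at the shorter sequence
    without complaint; checking lengths first (or, on Python 3.10+,
    using zip(..., strict=True)) turns the mismatch into an error.
    """
    if len(refs) != len(itemInfos):
        raise ValueError(
            f"Got {len(refs)} refs but {len(itemInfos)} item infos"
        )
    return list(zip(refs, itemInfos))
```

Accepting a single sequence of (ref, info) tuples, as suggested, removes the possibility of mismatch at this layer entirely.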
Previously if a single test created two datastores they shared a registry.
We're about to start using it for ingest, so the old name no longer made sense.
This is safer than zipping over two sequences, which doesn't complain if they have different lengths, and no harder for callers to provide.
Force-pushed from d3e9270 to 68464b4.
(and switch to f-strings)