DM-21451: Remove DatabaseDict and vectorize Datastore/Butler ingest #197
Conversation
This looks good in general. My main concern is that I'm not entirely clear we have a story for cases where some files are not ingested due to constraints. Previously this triggered DatasetTypeNotSupportedError, but I can't see where that happens now (maybe it does and I've missed it).
"""Add an opaque (to the `Registry`) table for use by a `Datastore` or | ||
other data repository client. | ||
|
||
Opaque table record can be added via `insertOpaqueData`, retreived via |
Typo: retreived
# A "move" is sometimes a "copy" | ||
moveIsCopy = False | ||
if transfer == "move" and os.path.isabs(path): |
You decided against keeping this logic?
Yeah. At first I was just going to move it around, but it was a lot messier after the split into two methods, and it wasn't clear to me we had a real need for it.
notAcceptedCounter = 0
# Filter down to just datasets the chained datastore's own
# configuration accepts.
okForParent: List[FileDataset] = [dataset for dataset in datasets
If okForParent is empty, shouldn't we raise DatasetTypeNotSupportedError?
Raising that exception (in ingest) is now done exclusively by Datastore.ingest, which will check that the set of datasets actually ingested matches the set of datasets that the caller asked to be ingested, and raise if that isn't true. That's because we need _prepIngest to still return the list of datasets it can ingest even when it can't ingest all of them, and it can't do that if it raises.
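The split described here can be sketched roughly as follows. This is a hypothetical simplification, not the real daf_butler implementation: only the names FileDataset and DatasetTypeNotSupportedError come from the discussion, and the rest is illustrative.

```python
from collections import namedtuple

# Stand-in for the real daf_butler FileDataset; fields are simplified.
FileDataset = namedtuple("FileDataset", ["path", "datasetType"])


class DatasetTypeNotSupportedError(RuntimeError):
    """Raised when some requested datasets could not be ingested."""


class SketchDatastore:
    """Hypothetical datastore illustrating the _prepIngest/ingest split."""

    def __init__(self, acceptableTypes):
        self._acceptable = set(acceptableTypes)

    def _prepIngest(self, *datasets):
        # Never raises: returns the subset of datasets it can ingest,
        # even when that subset is incomplete (or empty).
        return [d for d in datasets if d.datasetType in self._acceptable]

    def ingest(self, *datasets):
        prepared = self._prepIngest(*datasets)
        # Only the top-level ingest compares what was requested with
        # what was accepted, and raises on any mismatch.
        if len(prepared) != len(datasets):
            rejected = [d for d in datasets if d not in prepared]
            raise DatasetTypeNotSupportedError(
                f"{len(rejected)} dataset(s) rejected by constraints"
            )
        # ... actually ingest `prepared` here ...
        return prepared
```

The key point is that _prepIngest stays side-effect-free and total, so composite datastores can call it speculatively, while the raise happens in exactly one place.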
I see that now. Thanks.
if moveIsCopy:
    dstransfer = "copy"
if constraints is not None:
    okForChild: List[FileDataset] = [dataset for dataset in okForParent
I'm a bit concerned that we've lost any visibility into whether any of the supplied datasets were ingested or not. Don't we worry if a dataset was rejected by all the child datastores because of constraints? As written it seems we worry about transfer modes but not constraints. If I ask 10 datasets to be ingested and only 9 are accepted how do I tell that? Previously ingest worked if a dataset was accepted by at least one datastore (and then we had the issue of deciding what to do if it was only accepted by an ephemeral datastore).
I think that's all still true (see reply at #197 (comment)). Unlike the constraint-based rejections, transfer-mode rejections mean that all datasets would be rejected by the nested datastore, so we don't have to return anything and we can still use the raise-and-catch logic that was in place before.
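The distinction between the two rejection paths could be sketched like this (all class and function names here are illustrative, not the real daf_butler API): a transfer-mode rejection affects every dataset handed to a child datastore, so the child can simply raise and the chained parent can catch and skip it, whereas constraint rejections are per-dataset and have to come back as a filtered list.

```python
class TransferModeError(NotImplementedError):
    """Hypothetical: a child datastore cannot do this transfer mode at all."""


class ChildStore:
    """Illustrative child datastore with a transfer-mode whitelist and
    a per-dataset constraint predicate."""

    def __init__(self, name, transfers, accepts):
        self.name = name
        self.transfers = set(transfers)
        self.accepts = accepts  # predicate applied to each dataset

    def prep(self, datasets, transfer):
        if transfer not in self.transfers:
            # Transfer-mode rejection applies to *every* dataset, so
            # raising (and letting the parent catch it) loses nothing.
            raise TransferModeError(f"{self.name}: no {transfer!r} support")
        # Constraint rejection is per-dataset: filter, don't raise.
        return [d for d in datasets if self.accepts(d)]


def chainedPrep(children, datasets, transfer):
    """Collect per-child ingest plans, skipping children that reject
    the transfer mode outright."""
    plans = {}
    for child in children:
        try:
            plans[child.name] = child.prep(datasets, transfer)
        except TransferModeError:
            continue  # raise-and-catch: this child ingests nothing
    return plans
```

A child skipped for its transfer mode simply contributes no plan, while constraint filtering leaves a (possibly shorter) list whose length the caller can still inspect.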
# Docstring inherited from GenericBaseDatastore
records = list(self.registry.fetchOpaqueData(self._tableName, dataset_id=ref.id))
if len(records) == 0:
    raise KeyError("Unable to retrieve formatter associated with Dataset {}".format(ref.id))
I don't think that error message should be referring to a formatter.
I've just moved it here from 797a3df#diff-4b3b7a0376839e94233cd3a2bddd157bL112, but I think I agree. "Unable to retrieve location associated with Dataset {}" instead?
# Docstring inherited from Datastore._prepIngest.
filtered = []
for dataset in datasets:
    if not self.constraints.isAcceptable(dataset.ref):
Again how do we know that some of the datasets were not ingested?
(I think this is already answered above)
# self.registry internally. Probably need to add
# transactions to DatabaseDict to do better than that.
self.addStoredItemInfo(ref, itemInfo)
for ref, itemInfo in zip(refs, itemInfos):
I don't think zip is going to complain if the number of refs is not equal to the number of infos. Would it be safer to accept a sequence of tuples instead?
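To illustrate the pitfall: zip() silently truncates to the shorter input, so a length mismatch between refs and item infos would go unnoticed. A hypothetical guard (the function name is made up for illustration) makes the mismatch an explicit error:

```python
def pairRecords(refs, itemInfos):
    """Pair each ref with its info, refusing silently-truncating zips.

    Hypothetical helper: zip() alone stops at the shorter sequence
    without complaint; checking lengths first (or, on Python 3.10+,
    using zip(..., strict=True)) turns the mismatch into an error.
    """
    if len(refs) != len(itemInfos):
        raise ValueError(
            f"Got {len(refs)} refs but {len(itemInfos)} item infos"
        )
    return list(zip(refs, itemInfos))
```

Accepting a single sequence of (ref, info) tuples, as suggested, removes the possibility of mismatch at this layer entirely.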
Previously if a single test created two datastores they shared a registry.
We're about to start using it for ingest, so the old name no longer made sense.
This is safer than zipping over two sequences, which doesn't complain if they have different lengths, and no harder for callers to provide.
Force-pushed from d3e9270 to 68464b4.
(and switch to f-strings)