DM-21849: overhaul collections system #236

TallJimbo · 2020-03-04T14:42:22Z

There are a few mostly-separate projects here (and a lot of miscellaneous prep work), combined in one branch to yield only one set of disruptive changes downstream (rather than four). I've tried hard to keep the commits for those projects separate, but sadly GitHub's chronological rather than topological ordering makes that particularly hard to see. Here are the commits ordered topologically and grouped by project:

Various cleanups to existing code and other prep work (recommended reviewer @timj):

91929ab
837bf0d
b58233d
8be92eb (note that this refactor is extended significantly in a1abbc1)
35d718f
40748ec
b1599a7
90180bf
8815607
2cfca3f
2ede6d0
832e86d

Reimplement collections in Registry à la prototype, and make runs a type of collection (recommended reviewer @andy-slac):

Move multi-collection support into Registry and Butler, add CHAINED collections (no preference for reviewer):

timj

Sorry it took me so long to make my way through this huge change. It looks good to me, although I got a bit derailed in the low-level query classes.

timj · 2020-03-09T19:46:37Z

python/lsst/daf/butler/_butler.py

-        collection will be used for input lookups as well; if not, it must have
-        the same value as ``run``.
+        Name of the run datasets should be output to.  If the run
+        does not exist, it will be created.  If ``collections`` is `None`, it


Maybe say explicitly here that if run is not set, it is a read-only butler?

timj · 2020-03-09T20:45:21Z

python/lsst/daf/butler/_butler.py

+            refs = list(refs)
+            for ref in refs:
+                # isComponent isn't actually reliable, because the dataset type
+                # name isn't the whole story, and that's mildly concerning


If the dataset type name is A.b how do you get that if it isn't a component?

I was thinking of the converse, in the deferred-virtual-composite case, e.g. we write jointcal_wcs as a standalone dataset and later use it as the wcs component of some Exposure dataset.

timj · 2020-03-09T20:50:01Z

python/lsst/daf/butler/_butler.py

+                        # not.  This is consistent with the fact that we want
+                        # to ignore already-removed-from-datastore datasets
+                        # anyway.
+                        if self.datastore.exists(ref):


I still worry that this hides problems with composites since non-existence might be a bug.

Agreed; I was just trying to handle the common case here, and defer a real solution to DM-23671.

timj · 2020-03-09T21:25:16Z

python/lsst/daf/butler/_butler.py

+            The names of a `~CollectionType.TAGGED` collections to associate
+            the dataset with, overriding ``self.tags``.  These collections
+            must have already been added to the `Registry`.
+        collection : `str` or `False`, optional


There is no collection argument.

timj · 2020-03-10T00:35:34Z

python/lsst/daf/butler/registry/_registry.py

+        collectionFieldName = self._collections.getCollectionForeignKeyName()
+        collectionRecord = self._collections.find(collection)
+        if collectionRecord.type is CollectionType.RUN:
+            raise TypeError(f"Collection '{collection}' has type {collectionRecord.type.name}, not TAGGED.")


This error message seems odd, since you are checking that it IS RUN, not that it is not TAGGED. This is going to get really confusing if extra collection types turn up. Can you test for "is not TAGGED" instead?
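
A minimal sketch of the suggested check, so that the condition and the message agree whatever other collection types exist. The enum here is a stand-in with the three types named in this PR, and `require_tagged` is a hypothetical helper, not the Registry code:

```python
import enum


class CollectionType(enum.Enum):
    # Stand-in for the real enum; only the names mentioned in this PR.
    RUN = enum.auto()
    TAGGED = enum.auto()
    CHAINED = enum.auto()


def require_tagged(name: str, type_: CollectionType) -> None:
    """Raise unless the collection is TAGGED (hypothetical helper)."""
    # Test for the one accepted type rather than one rejected type, so the
    # message stays accurate if more collection types are added later.
    if type_ is not CollectionType.TAGGED:
        raise TypeError(f"Collection '{name}' has type {type_.name}, not TAGGED.")
```

With this form a CHAINED collection is rejected with the same message, which the original `is CollectionType.RUN` test would have let through.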

timj · 2020-03-10T20:18:35Z

python/lsst/daf/butler/registry/tests/_registry.py

+from ..wildcards import DatasetTypeRestriction
+
+
+DATA_DIR = os.path.normpath(os.path.join(os.path.split(__file__)[0],


I think os.path.dirname() is preferred over os.path.split()[0].
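
For reference, the two spellings are equivalent: `os.path.dirname(p)` returns the first element of `os.path.split(p)`. A small sketch of the suggested form follows; `data_dir` and its arguments are hypothetical names used only for illustration:

```python
import os.path


def data_dir(module_file: str, relative: str) -> str:
    """Resolve ``relative`` against the directory containing ``module_file``."""
    # dirname() states the intent directly; split()[0] yields the same string.
    return os.path.normpath(os.path.join(os.path.dirname(module_file), relative))
```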

timj · 2020-03-10T20:20:40Z

python/lsst/daf/butler/registry/tests/_registry.py

+
+
+DATA_DIR = os.path.normpath(os.path.join(os.path.split(__file__)[0],
+                                         "../../../../../../tests/data/registry"))


It might be a little easier to let the tests specify the data dir themselves and pass it in to this class?
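
One possible shape for this, sketched under assumptions: the shared tests live in a mix-in that reads a class attribute, and each concrete test case sets that attribute itself. All class, attribute, and file names below are hypothetical:

```python
import os.path
import unittest


class RegistryTestsSketch:
    """Hypothetical mix-in holding shared tests that read from ``dataDir``."""

    dataDir: str  # supplied by each concrete test case

    def dataPath(self, filename: str) -> str:
        """Return the path of a data file inside ``dataDir``."""
        return os.path.join(self.dataDir, filename)


class ExampleRegistryTestCase(RegistryTestsSketch, unittest.TestCase):
    # The concrete test decides where its data lives; no "../../.." in
    # the shared module.
    dataDir = os.path.join("tests", "data", "registry")

    def testDataPath(self):
        path = self.dataPath("example.yaml")  # hypothetical file name
        self.assertEqual(os.path.dirname(path), self.dataDir)
```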

timj · 2020-03-10T21:08:50Z

python/lsst/daf/butler/registry/wildcards.py

+        if expression is ...:
+            if not allowAny:
+                raise TypeError("This expression may not be unconstrained.")
+            return ...


The doc string says this should return None.

Yeah, in an earlier (squashed away) version of this I was converting ... to None when going from public to internal APIs, and then I realized it was a bad idea to have two different sentinel types to represent the same concept. Docs fixed.
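
A simplified sketch of the single-sentinel convention described here, where `...` means "unconstrained" on both the public and internal side. The function name and the handling of non-sentinel inputs are assumptions for illustration, not the code in wildcards.py:

```python
from typing import Any


def normalize(expression: Any, *, allowAny: bool = True) -> Any:
    """Return ``...`` for an unconstrained expression, else a list of names."""
    if expression is ...:
        if not allowAny:
            raise TypeError("This expression may not be unconstrained.")
        # Pass the sentinel through unchanged instead of mapping it to None.
        return ...
    if isinstance(expression, str):
        return [expression]
    return list(expression)
```

Callers then need only one test, `result is ...`, to detect the unconstrained case.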

timj · 2020-03-10T21:13:30Z

python/lsst/daf/butler/registry/wildcards.py

+
+
+class DatasetTypeRestriction:
+    """An immutable set-like object that represents a restriction on the


Inherit from collections.abc.Set?

Can't; it can't always define __len__ (maybe other limitations, but at least that). I'll add a comment to that effect.
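
To illustrate the limitation: `collections.abc.Set` declares `__contains__`, `__iter__`, and `__len__` as abstract, so a subclass that leaves `__len__` undefined cannot be instantiated. The classes below are hypothetical and unrelated to the real `DatasetTypeRestriction` implementation:

```python
import collections.abc


class AbstractSketch(collections.abc.Set):
    """Hypothetical: subclasses Set but leaves __len__ undefined."""

    def __contains__(self, item):
        return True

    def __iter__(self):
        return iter(())


class AnyNameSketch:
    """Hypothetical plain class: supports ``in`` without claiming a size."""

    def __contains__(self, item):
        # Matches every name, so no meaningful length exists.
        return True
```

Instantiating `AbstractSketch` raises `TypeError`, while `AnyNameSketch` supports membership tests without having to report a length.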

timj · 2020-03-10T23:29:30Z

python/lsst/daf/butler/script/validateButlerConfiguration.py

@@ -102,7 +97,7 @@ def validateButlerConfiguration(root, datasetTypes=None, ignore=None, quiet=Fals

    # The collection does not matter for validation but if a run is specified
    # in the configuration then it must be consistent with this collection


I think this comment is now completely irrelevant.

timj · 2020-03-12T21:04:22Z

@andy-slac can you please take a look at the commits you were assigned so we can move towards merging?

andy-slac · 2020-03-12T21:47:26Z

Sorry, I missed that, will look at it ASAP.

andy-slac

I checked the three commits that were specifically assigned to me; they look OK, with a small bunch of comments.

andy-slac · 2020-03-12T21:51:17Z

python/lsst/daf/butler/registry/_collectionType.py

+    """Datasets can be associated with and removed from ``TAGGED`` collections
+    arbitrarily.
+
+    Within a particular run, there may only be one dataset with a particular


run? Should it be a "tagged collection"?

andy-slac · 2020-03-12T21:54:02Z