DM-15214: extend pre-flight with output datasets information #78

andy-slac · 2018-08-24T20:08:10Z

Pre-flight API has changed to support extra information about existing output (and input) datasets; it now returns pre-made DatasetRefs instead of raw columns (via new class PreFlightUnitsRow).

Implemented support for specifying multiple input collections and per-dataset collection overrides (new PreFlightCollections class).

Pre-flight ginormous query has been extended to support priority-based search of input datasets in multiple collections (see docstring in sqlPreFlight).
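
To illustrate the idea of a priority-based search with per-dataset overrides, here is a minimal in-memory sketch. It is not the SQL in sqlPreFlight and not the PreFlightCollections API: `find_first`, `search_order_for`, the dictionary layout, and the collection names are all hypothetical, chosen only to show that the first collection in the search order that holds a matching dataset wins.

```python
from typing import Dict, Hashable, List, Optional, Tuple

# Hypothetical stand-in for the registry:
# collection name -> {(dataset type name, data ID): dataset}
CollectionData = Dict[str, Dict[Tuple[str, Hashable], str]]


def search_order_for(dataset_type: str,
                     defaults: List[str],
                     overrides: Dict[str, List[str]]) -> List[str]:
    """Return the ordered collections to search for one dataset type.

    A per-dataset-type override, if present, replaces the default order.
    """
    return overrides.get(dataset_type, defaults)


def find_first(data: CollectionData,
               search_order: List[str],
               dataset_type: str,
               data_id: Hashable) -> Optional[Tuple[str, str]]:
    """Return (collection, dataset) from the highest-priority collection
    that contains the dataset, or None if no collection has it."""
    for name in search_order:
        dataset = data.get(name, {}).get((dataset_type, data_id))
        if dataset is not None:
            return name, dataset
    return None


# Made-up example data: the same "flat" exists in two collections.
registry = {
    "calib": {("flat", 1): "flat-from-calib"},
    "raw": {("flat", 1): "flat-from-raw", ("raw", 1): "raw-1"},
}
defaults = ["calib", "raw"]
overrides = {"flat": ["raw"]}

default_hit = find_first(registry, defaults, "flat", 1)
override_hit = find_first(
    registry, search_order_for("flat", defaults, overrides), "flat", 1)
missing = find_first(registry, defaults, "bias", 1)
```

With the default order the "flat" dataset comes from the first collection listed; with the override it comes from the collection named in the override instead.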

timj · 2018-08-24T20:09:53Z

python/lsst/daf/butler/core/preFlight.py

+
+        Returns
+        -------
+        `list` of `str`, names of input collections.


Returns are meant to be written as:

    Returns
    -------
    collections : `list` of `str`
        Names of input collections

Do we have to give a name to the returned item, or can I just write

    Returns
    -------
    `list` of `str`
        Names of input collections

❔

https://developer.lsst.io/python/numpydoc.html?highlight=numpydoc#py-docstring-returns suggests a dummy name is needed.
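
For reference, a small sketch of the named form in a complete docstring. The function `getInputCollections` and its body are hypothetical and only serve to carry the docstring; they are not the method under review.

```python
def getInputCollections(defaults):
    """Return the input collections to search.

    Parameters
    ----------
    defaults : `list` of `str`
        Names of the default input collections (hypothetical parameter).

    Returns
    -------
    collections : `list` of `str`
        Names of input collections.
    """
    # Return a copy so callers cannot modify the caller-supplied list.
    return list(defaults)


collections = getInputCollections(["calib", "raw"])
```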

TallJimbo · 2018-08-28T23:47:57Z

python/lsst/daf/butler/core/preFlight.py

+    dataId : `dict`
+        Maps DataUnit link name to its corresponding value.
+    datasetRefs : `dict`
+        Maps `DatasetType` to its corresponding `DatasetRef`.


Are the keys actually DatasetType instances, or the str names associated with them?

They are DatasetType instances. This is probably not a hard requirement; I can switch to names if that is more natural.

If DatasetType instances are more convenient for you, that's fine. I hadn't realized they were usable as dict keys, but it looks like they are.
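
A minimal sketch of the two keying options, assuming only that the key class is hashable and compares by value. `FakeDatasetType` is a hypothetical stand-in and not the real `DatasetType`; the reference strings are made up.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FakeDatasetType:
    """Hypothetical stand-in for DatasetType.

    frozen=True (with the default eq=True) makes instances hashable and
    comparable by value, which is what dict keys need.
    """
    name: str
    dataUnits: Tuple[str, ...]


calexp = FakeDatasetType("calexp", ("visit", "sensor"))

# Keyed by instance, as in the docstring under review.
datasetRefs = {calexp: "ref-to-calexp"}

# An equal (but distinct) instance finds the same entry.
same = FakeDatasetType("calexp", ("visit", "sensor"))
found = datasetRefs[same]

# Switching to name keys is a one-line transformation.
refsByName = {datasetType.name: ref for datasetType, ref in datasetRefs.items()}
```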

TallJimbo · 2018-08-28T23:52:52Z

python/lsst/daf/butler/core/registry.py

+            The `list` of `DatasetTypes <DatasetType>` whose DataUnits will
+            be included in the returned column set. Output is limited by the
+            the Datasets of these DatasetTypes which already exist in the
+            registry.


"limited by" -> "limited to"

TallJimbo · 2018-08-28T23:55:51Z

python/lsst/daf/butler/core/registry.py

+            The `list` of `DatasetTypes <DatasetType>` whose DataUnits will
+            be included in the returned column set. It is expected that
+            Datasets for these DatasetTypes do not exist in the registry,
+            but presently this is not checked.


I think selectDataUnits would be more generally useful outside of preflight (e.g. for selecting subsets for export/transfer) if we never actually check for futureDatasetTypes not existing in the registry.
In fact, it might be most useful to have this just take a list of tuples of DataUnits, rather than a list of DatasetTypes. I think that's all preflight uses this for, right?

Indeed, not checking futureDatasetTypes can be added as an option.

For preflight we still need to associate units with DatasetType names, and the query that we run also uses DatasetType names to find existing input Datasets, so we need some DatasetType information there, maybe just a name.

Sounds like it would make the most sense to have another argument that just adds more explicit DataUnits to the query, while leaving this argument as it is.

OK, and I'll leave that for future tickets 🙂

TallJimbo · 2018-08-29T00:09:54Z