DM-16539: overhaul calibration dimensions and refactor selectDimensions implementations #135

TallJimbo · 2019-02-21T00:10:59Z

No description provided.

andy-slac

Great job untangling preflight mess! I did a quick scrolling through it and I need to look at it more but here is a small set of comments that I have so far.

andy-slac · 2019-02-27T18:15:07Z

config/schema.yaml

+        name: valid_last
+        type: datetime
+        doc: >
+          TAI timestamp of last exposure included in the range (inclusive).


I think usual case for intervals/ranges is to make second timepoint exclusive, it makes some operations on ranges simpler (e.g. easier to do [t0, t1), [t1, t2) than [t0, t1-something], [t1, t2]). Though if you are only interested in exposure timepoints (i.e. you have reasonable/predictable gaps) inclusive can work fine too.

For this ticket I'd like to keep it inclusive to retain the previous logic (or, at least, the old as-documented logic - thinking more about the changes to the unit tests I had to make, I suppose this may actually be a change to the behavior to bring it in line with the documentation).

Anyhow, I think I did have some concrete use-case reasons for why I thought an inclusive upper bound would be better here, and I think it did involve the fact that there are always gap between exposures. But I'm not sure I can reconstruct all of that thinking right now, and I'm happy to revisit it in the future.

andy-slac · 2019-02-27T18:30:24Z

config/schema.yaml

+        src: instrument
+        tgt: Instrument.instrument
+
+    ExposureCalibrationLabelJoin:


Is this view used by anything, I can't find it on this PR except in unit test?

Yes, it definitely is, and having it just appear in the unit tests makes sense; this should now behave pretty much just like any of the spatial join views (which also should appear by name only in the tests), but is computed from temporal overlap rather than spatial overlap.

andy-slac · 2019-02-27T18:40:39Z

python/lsst/daf/butler/core/registry.py

-        transformation from ``neededDatasetTypes`` to ``futureDatasetTypes``,
-        restricted by existing data and filter expression.
+        transformation from ``requiredDatasetTypes`` to ``optional
+        DatasetTypes``, restricted by existing data and filter expression.


``required`` to ``optional`` DatasetTypes?

andy-slac · 2019-02-27T18:57:46Z

python/lsst/daf/butler/core/registry.py

+            be included in the returned column set. Datasets of these types
+            may or may not existin the registry.
+        prerequisite : iterable of `DatasetType` or `str`
+            `DatasetType`s that should not constrain the query results, but


`DatasetType`s -> `DatasetTypes <DatasetType>`?

or just drop backticks entirely.

andy-slac · 2019-02-27T18:58:33Z

python/lsst/daf/butler/core/registry.py

+            must be present for all result rows.  These are included with
+            a LEFT OUTER JOIN, but the results are checked for NULL.  Unlike
+            regular inputs, prerequisite inputs lookups may be deferred
+            (see the documentaiton ``defer`` argument).


typo in documentaiton.
Which defer argument?

Removed; I had copied this from the docs for MultipleDatasetQueryBuilder, but the defer option there is controlled by configuration in SqlRegistry, not passed directly by the user, so I've instead just said that deferral is something that Registry implementations may or may not do.

andy-slac · 2019-02-27T23:54:59Z

python/lsst/daf/butler/sql/queryBuilder.py

+
+        Parameters
+        ----------
+        managed :  ResultColumnsManager.ManagedRow`


backtick is missing

andy-slac · 2019-02-27T23:58:36Z

python/lsst/daf/butler/sql/resultColumnsManager.py

+            already in the selected columns, this method does nothing.
+        datasetType : `DatasetType`, optional
+            If not `None`, this link corresponds to a "per-DatasetType"
+            dimension in a querym for this `DatasetType`.


andy-slac · 2019-02-28T00:00:46Z

python/lsst/daf/butler/sql/resultColumnsManager.py

+                return
+            self._indicesForDimensionLinks[link] = len(self._columns)
+        else:
+            if link in self._indicesForPerDatasetTypeDimensionLinks:


link in self._indicesForPerDatasetTypeDimensionLinks[datasetType]?

andy-slac · 2019-02-28T00:19:52Z

python/lsst/daf/butler/sql/resultColumnsManager.py

+            Parameters
+            ----------
+            datasetType : `DatasetType`, optional
+


Extra blank line

andy-slac · 2019-02-28T00:29:58Z

python/lsst/daf/butler/sql/singleDatasetQueryBuilder.py

+        ----------
+        managed : `ResultsColumnsManager.ManagedRow`
+            Intermediate result row object to convert.
+        expandDataId : `True`


andy-slac

Looks OK (though non-trivial for sure). Few minor comments.

andy-slac · 2019-02-28T17:53:40Z

python/lsst/daf/butler/sql/queryBuilder.py

+    tables : `dict`
+        Mapping of table names to `sqlalchemy.Table` instances.
+    dimensions : `DimensionGraph`
+        All Dimensions included in the query.


Parameters section needs an update

andy-slac · 2019-02-28T18:05:04Z

python/lsst/daf/butler/sql/queryBuilder.py

+            if selectable is None:
+                raise ValueError(f"No table or clause providing link '{column}' in this query.")
+        try:
+            return selectable.c[column]


I think I have to blame mostly myself for this, but after reading our code I strongly prefer columns to single-letter c. No need to change it now, maybe just try to use it more in the future.

I've switched to columns for all of the code in the new sql package. I'll leave the rest of the SqlRegistry code as-is for now.

andy-slac · 2019-02-28T18:30:53Z

python/lsst/daf/butler/sql/queryBuilder.py

+            query = query.where(whereSql)
+        _LOG.debug("building query: %s",
+                   query.compile(bind=self.registry._connection.engine,
+                                 compile_kwargs={"literal_binds": True}))


Maybe wrap it into if _LOG.isEnabledFor(logging.DEBUG): to avoid calling query.compile if it's not needed.

andy-slac · 2019-02-28T18:48:41Z

python/lsst/daf/butler/sql/queryBuilder.py

+        Notes
+        -----
+        A query row that include disjoint regions is filtered out, resulting
+        in `None` being returned.


If this is filtered by region overlap then I would assume that query is defined in a way to return just one "true" row (with regions overlapping) but there may be other "false" rows (without overlap). With that assumption I think you cannot do fetchone but instead it should iterate and return first row that passes filter.

andy-slac · 2019-02-28T19:25:43Z

python/lsst/daf/butler/sql/singleDatasetQueryBuilder.py

+
+    @classmethod
+    def fromCollections(cls, registry, datasetType, collections, addResultColumns=True):
+        """Construct a builder that searches a single collection for the


single collection -> multiple collections

andy-slac · 2019-02-28T19:30:53Z

python/lsst/daf/butler/sql/singleDatasetQueryBuilder.py

+                DS.linkN = DSG.linkN)
+        """
+        if len(collections) == 1:
+            return cls.fromSingleCollection(registry, datasetType, collections[0])


Do you want to pass addResultColumns argument too?

andy-slac · 2019-02-28T19:58:27Z

python/lsst/daf/butler/sql/multipleDatasetQueryBuilder.py

+
+    A multiple-dataset query returns a sequence of `MultipleDatasetQueryRow`
+    instances; each instance will have a unique `DataId`, but the
+    `DatasetRef`s in `datasetRefs` are not necessarily unique.  For example,


Double backticks around datasetRefs.

andy-slac · 2019-02-28T20:54:17Z

python/lsst/daf/butler/sql/multipleDatasetQueryBuilder.py

+                selectable = data.builder.findSelectableForLink(link)
+                column = selectable.c[link]
+                expr.append(column == dataId[link])
+            ref = data.builder.executeOne(whereSql=and_(*expr))


One possible optimization that I was thinking about when doing deferred queries is to reduce number of queries by writing queries to return multiple datasetRefs (e.g. all datasets in output collection if that is not too large) and doing in-memory join (or lookup). I have a feeling that doing individual queries here will be the main bottleneck for large graphs, if that is true we may consider re-implementing this piece of code.

The other code path that does include these queries in the big one instead of deferring them is still alive, and should Just Work for databases other than SQLite (where we need to defer to avoid the 64-term join limit), and I'm hoping we can avoid scaling problems with SQLite by just not using SQLite at scale.

Right now we also always do other follow-up queries to obtain component datasets for composites, of course, but that's a separate issue.

andy-slac · 2019-03-01T03:24:51Z

python/lsst/daf/butler/registries/sqlRegistry.py

@@ -41,7 +41,8 @@
 from ..core.config import Config
 from ..core.dimensions import DataId
 from .sqlRegistryDatabaseDict import SqlRegistryDatabaseDict
-from .sqlPreFlight import SqlPreFlight
+from ..sql import MultipleDatasetQueryBuilder
+

 __all__ = ("SqlRegistryConfig", "SqlRegistry")


Should be moved above all imports

andy-slac · 2019-03-01T03:29:41Z

python/lsst/daf/butler/instrument.py

@@ -183,3 +184,35 @@ def updateVisitEntryFromObsInfo(dataId, obsInfo):
        exposure_time=obsInfo.exposure_time.to_value("s"),
    )
    return dataId
+
+
+def addUnboundedCalibrationLabel(registry, instrumentName):


Don't you want it added to __all__?

This somewhat-generalizes the case of VisitDetectorRegion, which should only be included in a query when the region it provides is actually needed. The current query logic identifies this case based on it being the only DimensionJoin that is a region holder, but I'd like to start being explicit about that.

Old doc was a copy/paste error.

This separates the big preflight query into several QueryBuilder classes that delegate to each other in various ways. This makes the logic a bit easier to follow, at least at the highest level, by better encapsulating the lower-level bits. It also (I hope) is a start towards adding simpler Registry APIs for simpler queries, though that will need to wait for another ticket This does change the behavior of selectDimensions slightly: the full data IDs generated are now expanded to include implied dependencies. As those data IDs aren't actually used outside unit tests, this shouldn't break anything, and it's arguably better behavior anyway.

Unlike ExposureRange (but like all other dimensions) CalibrationLabel has a single link column (not counting dependencies), and is related to other Dimensions via a comparatively normal DimensionJoin view.

The major new functionality here is the "prerequisite" DatasetType category for selectDimensions (now selectMultipleDatasetTypes), but this also restructures a lot of the QueryBuilder code and tests, in some cases to try to anticipate future Registry methods that query for single DatasetTypes.

TallJimbo force-pushed the tickets/DM-16539 branch 13 times, most recently from c7095e8 to 16f0b2a Compare February 27, 2019 01:46

andy-slac reviewed Feb 28, 2019

View reviewed changes

andy-slac approved these changes Mar 1, 2019

View reviewed changes

TallJimbo added 14 commits March 5, 2019 13:46

Temporarily remove ExposureRange dimension.

eaaf3e1

Make it possible to get implied links from DimensionElements.

d35df7f

Clean up comments in __init__.py.

024d651

Doc fix for Registry.expandDataId.

5ce9945

Old doc was a copy/paste error.

Fix __all__ line for SqlRegistry.

9d65962

Add support for per-DatasetType dimensions in queries.

d7ff1f9

Add CalibrationLabel dimension to replace ExposureRange.

a01f807

Unlike ExposureRange (but like all other dimensions) CalibrationLabel has a single link column (not counting dependencies), and is related to other Dimensions via a comparatively normal DimensionJoin view.

Rewrite and re-enable calibration dimensions query tests.

49293b2

Pass dataset type name to gen2convert KeyHandlers.

f2b2bee

Refactor and update gen2convert handling of calibration data IDs.

55456e7

Switch from ExposureRange to CalibrationLabel keys in file template.

36b8975

TallJimbo force-pushed the tickets/DM-16539 branch from 16f0b2a to 2223847 Compare March 5, 2019 19:05

TallJimbo merged commit 8481d27 into master Mar 6, 2019

TallJimbo deleted the tickets/DM-16539 branch March 6, 2019 21:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DM-16539: overhaul calibration dimensions and refactor selectDimensions implementations #135

DM-16539: overhaul calibration dimensions and refactor selectDimensions implementations #135

TallJimbo commented Feb 21, 2019

andy-slac left a comment

andy-slac Feb 27, 2019

TallJimbo Mar 5, 2019

andy-slac Feb 27, 2019

TallJimbo Mar 5, 2019

andy-slac Feb 27, 2019

andy-slac Feb 27, 2019

andy-slac Feb 27, 2019

TallJimbo Mar 5, 2019

andy-slac Feb 27, 2019

andy-slac Feb 27, 2019

andy-slac Feb 28, 2019

andy-slac Feb 28, 2019

andy-slac Feb 28, 2019

andy-slac left a comment

andy-slac Feb 28, 2019

andy-slac Feb 28, 2019

TallJimbo Mar 5, 2019

andy-slac Feb 28, 2019

andy-slac Feb 28, 2019

andy-slac Feb 28, 2019

andy-slac Feb 28, 2019

andy-slac Feb 28, 2019

andy-slac Feb 28, 2019

TallJimbo Mar 5, 2019

andy-slac Mar 1, 2019

andy-slac Mar 1, 2019

DM-16539: overhaul calibration dimensions and refactor selectDimensions implementations #135

DM-16539: overhaul calibration dimensions and refactor selectDimensions implementations #135

Conversation

TallJimbo commented Feb 21, 2019

andy-slac left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

andy-slac left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment