
DM-31725: rewrite queries subpackage, via new dependency on daf_relation #759

Merged
merged 25 commits into main from tickets/DM-31725 on Jan 6, 2023

Conversation

TallJimbo
Member

@TallJimbo TallJimbo commented Dec 2, 2022

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

@TallJimbo TallJimbo force-pushed the tickets/DM-31725 branch 3 times, most recently from f0abe6e to 76750ef Compare December 2, 2022 20:24
@codecov

codecov bot commented Dec 2, 2022

Codecov Report

Base: 85.27% // Head: 84.87% // decreases project coverage by 0.40% ⚠️

Coverage data is based on head (100a12c) compared to base (03715f1).
Patch coverage: 80.33% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #759      +/-   ##
==========================================
- Coverage   85.27%   84.87%   -0.41%     
==========================================
  Files         261      267       +6     
  Lines       34728    34993     +265     
  Branches     5868     5980     +112     
==========================================
+ Hits        29615    29699      +84     
- Misses       3865     4009     +144     
- Partials     1248     1285      +37     
Impacted Files Coverage Δ
python/lsst/daf/butler/registry/_registry.py 72.82% <ø> (ø)
python/lsst/daf/butler/registry/managers.py 87.12% <ø> (-0.10%) ⬇️
python/lsst/daf/butler/script/_associate.py 60.00% <ø> (ø)
python/lsst/daf/butler/script/_pruneDatasets.py 97.67% <ø> (ø)
python/lsst/daf/butler/script/pruneCollection.py 84.61% <ø> (ø)
python/lsst/daf/butler/script/queryDataIds.py 81.48% <ø> (ø)
python/lsst/daf/butler/script/queryDatasets.py 84.00% <ø> (ø)
tests/test_cliCmdAssociate.py 90.00% <ø> (ø)
tests/test_cliCmdPruneDatasets.py 98.14% <ø> (ø)
...thon/lsst/daf/butler/registry/dimensions/skypix.py 61.11% <41.17%> (-15.56%) ⬇️
... and 63 more


@TallJimbo TallJimbo force-pushed the tickets/DM-31725 branch 8 times, most recently from 3cc7e71 to 271f36e Compare December 6, 2022 18:29
@timj
Member

timj commented Dec 6, 2022

To confirm what I said on Jira: lsst-daf-relation will have to be added as a dependency in pyproject.toml once that package ends up on PyPI. (which I see happened just after I wrote this)

Contributor

@andy-slac andy-slac left a comment


Looks good, but this is all very new to me, so there are probably some dumb questions. I had hoped we would end up with less code in daf_butler, but apparently the line count cannot go down. 🙂

python/lsst/daf/butler/core/_column_type_info.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/registry/queries/_readers.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/registry/queries/_readers.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/registry/queries/_query.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/registry/queries/_query.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/registry/queries/_query.py Outdated Show resolved Hide resolved
Member Author

@TallJimbo TallJimbo left a comment


I believe I've addressed all review comments but two:

  • reworking CollectionWildcard to make require_ordered less fragile;
  • adding a test for inclusive vs. exclusive bounds in range literals.

But I'm going to rebase on main and resolve some conflicts before I start in on those.

record = reader.read(row)
cache[record.dataId] = record
self._cache = cache
return cast(Mapping[DataCoordinate, DimensionRecord], self._cache)
Member Author


👍 I want to revisit some of the caching logic at a higher level, too, to deal with problems with cache expiration on rollback like DM-35498. I think that problem is broader than just the collections issue it mentions, and the approach taken here of allowing caches to be empty is the way out, so I'm repurposing that ticket to expand its scope.

Comment on lines +112 to +116
join : `~lsst.daf.relation.Join`
Join operation to use when the implementation is an actual join.
When a true join is being simulated by other relation operations,
this object's `~lsst.daf.relation.Join.min_columns` and
`~lsst.daf.relation.Join.max_columns` should still be respected.
Member Author


I've defined the interface this way to allow the skypix dimension storage to simulate joins by actually applying a column-calculation relation operation that adds the region columns. That behaves like a join in this case because we know the virtual skypix table has to have a superset of the skypix indices appearing in any other table, and there are never any other columns in those virtual tables besides index and the region.

Comment on lines 242 to 248
def digestTables(self) -> Iterable[sqlalchemy.schema.Table]:
"""Return tables used for schema digest.

Returns
-------
tables : `Iterable` [ `sqlalchemy.schema.Table` ]
Possibly empty set of tables for schema digest calculations.
relation : `lsst.daf.relation.Relation`
New relation.
Member Author


I think this must have been copy-pasted into the wrong place, but looking a little closer, I think the annotation should really be list[sqlalchemy.schema.Table], since we do += on it and that's not legal on general iterables. Interesting that mypy didn't complain about that; I wonder if it can't do arithmetic operators for some reason.
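The mismatch can be demonstrated with plain Python (a minimal sketch; `str` stands in for `sqlalchemy.schema.Table`, and `digest_tables` is a hypothetical stand-in for the real method):

```python
from collections.abc import Iterable


def digest_tables() -> Iterable[str]:
    # Annotated as Iterable, but the runtime value is always a list.
    return ["table_a"]


# The Iterable annotation also permits values that don't support +=,
# such as a generator, so `tables += other` at the call site is not
# actually safe under that annotation:
gen: Iterable[str] = (name for name in ["table_a"])
try:
    gen += ["table_b"]  # type: ignore[operator]
except TypeError:
    pass  # generators have no __iadd__ or __add__

# Narrowing the return annotation to list[str] makes += well-typed:
tables: list[str] = ["table_a"]
tables += ["table_b"]
assert tables == ["table_a", "table_b"]
```

This is why narrowing the annotation to `list[...]` (matching the runtime type) is the safer fix than keeping the general `Iterable`.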


detector = cast(Dimension, self.universe["detector"])
if detector in dataId:
relation = visit_definition_storage.join(relation, Join(), context)
Member Author


That would be totally fine at runtime. But for MyPy I'd also need to do cast(DatabaseDimensionRecordStorage, visit_definition_storage) since DimensionRecordStorage doesn't have make_relation.

Comment on lines 53 to 55
context : `.queries.SqlQueryContext`
Object that manages relation engines and database-side state
(e.g. temporary tables) for the query.
Member Author


Docs got out of date; in some of the intermediate commits this did have to be a SqlQueryContext, and it looks like I updated the annotation after that but not the docs.

python/lsst/daf/butler/registry/queries/_query.py Outdated Show resolved Hide resolved

def visitRangeLiteral(self, start: int, stop: int, stride: int | None, node: Node) -> VisitorResult:
# Docstring inherited.
return ColumnContainer.range_literal(range(start, stop, stride or 1))
Member Author


Good catch! I hadn't realized our stop here was inclusive; I'll add a +1 here as well as a unit test.

This adds a dependency on daf_relation and specializations of a few of
its types for daf_butler.  This doesn't actually do anything yet.
These will eventually form the backbone of the new query system; for
now they're unused stubs.
These are not yet used anywhere, and they partially duplicate
functionality in the Query.extract* methods, which will soon be
replaced by these new readers.

Add relation-row readers for DimensionRecords and expanded data IDs.
These are not used yet, but they will be soon (and we'll be adding
more, too, as needed).
@TallJimbo TallJimbo force-pushed the tickets/DM-31725 branch 2 times, most recently from 7cac757 to 934bab7 Compare January 5, 2023 18:51
This is the first of several commits that replace pieces of Registry
query code with new versions that are based on the new daf_relation
package.  The real value of that switch won't be realized until the
query system as a whole is replaced (and in particular the Query class
itself is wholly rewritten), and in the meantime there will be a fair
amount of transitional code tying the new and old interfaces together.

This moves the former DatasetRecordStorage.find implementation into
SqlRegistry, since it can now delegate to
DatasetRecordStorage.make_relation.  And once we have a QueryBackend
implementation for RemoteRegistry, we can move it further into the
base class Registry and use the same implementation for all
registries.
This gives the queries.expressions package a single entry point, which
is annoying in its unit tests' imports but I think a very nice win for
code organization overall, in addition to setting up this aspect of
the query system for the future.

The commit also includes a few minor (but broad) style changes:
renaming dataId->data_id arguments in some internal interfaces, and
using just the empty string for nonexistent expressions instead of
allowing either None or the empty string (at the level of type
annotations, anyway).

It also includes one small behavior change: instead of check=False
turning off all expression analysis (which disabled some important
checks, and made it hard to reason about the behavior in their
absence), it now just permits expressions to leave dimensions
"orphaned" by not providing their governor.  In internal interfaces
the argument has been renamed to allow_orphans to reflect this.

Finally, this still includes some transitional code that will
ultimately be removed when Relation and Predicate are more fully
integrated with the query system.  We still have two versions
(i.e. new and old) of many many things.

Restore the inclusive stop for range literals in expressions.

Needs a test.
These exception messages reflect parsing an order-by str sequence in
the context of a single DimensionElement, but logic that will land
shortly instead parses these in the context of multiple elements (i.e.
a data ID query).  I'll restore these tests and the slightly better
error messages they represent after the daf_relation overhaul is
complete.
Most of this ticket is a major overhaul of QueryBuilder and the
DimensionRecordStorage classes to use relations throughout, with
conversion back to the old SimpleQuery system now happening only at
the boundary between QueryBuilder and Query.

Ideally, that'd be plenty for one commit, but doing that work revealed
some preexisting problems with our dimension caching system and how we
test for data ID validity in a few different contexts.  But I couldn't
fix those in earlier commits without the relation machinery introduced
on this one, and I couldn't get that new relation machinery to pass
tests without fixing them.  So here they are.

The first problem was that all the previous cleverness about
populating the CachingDimensionRecordStorage cache one row at a time
never made sense: it's much more efficient to just fetch all rows for
cached dimensions whenever any of them are requested, since by
definition we only configure caching for dimensions whose record count
is small enough that we can easily hold them all in memory, and
probably small enough that fetching all of them is still dominated by
per-query overheads.

Changing that fixes a bug in which QueryDimensionRecordStorage (used
exclusively for 'band') happily returned nonexistent dimension records
if you gave it a nonexistent data ID, and hence you could expand a bad
data ID and not notice.  Or, more precisely, this commit sidesteps
that still-existing problem, because 'band' is actually configured to
use CachingDimensionRecordStorage on top of
QueryDimensionRecordStorage, and now CachingDimensionRecordStorage
prevents the bad behavior.  In a few commits I plan to remove the
fetch method from all of those storage objects, and then the bug will
really fully be gone.

Anyhow, fixing that bug creates new problems, in the form of a
Registry test that was actually checking for the wrong behavior - we
*can* tell when a query with a bad band value is doomed, because we
try to expand the data ID up front and (now) that fails.  But we don't
want an exception for this - we want a doomed query.  So that
snowballed into additional logic changes involving exceptions
vs. doomed queries.

Last but not least, this modifies CachingDimensionRecordStorage and
GovernorDimensionRecordStorage to allow them to have an unset (None)
cache, and removes the refresh methods for populating those caches.
Calling higher-level refresh methods now actually unsets the caches,
forcing them to be repopulated on first use.  It didn't actually make
sense to have both refresh and clearCaches before, but we need
clearCaches in order to have something we can call at transaction
rollback without hitting the database; this is exactly the problem
identified (with collection caching) on DM-35498.  Also, having
clearCaches make the cache an empty container was a bug waiting to
happen for GovernorDimensionRecordStorage, since that (usually
incorrectly) would say that there are no governor dimension values in
the database.
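The unset-vs-empty distinction and the fetch-all strategy described above can be sketched as a thin cache layer (names are illustrative, not the real `CachingDimensionRecordStorage` API):

```python
class FakeNestedStorage:
    """Stand-in for a wrapped record storage object (hypothetical)."""

    def __init__(self, records):
        self.records = records
        self.fetch_count = 0

    def fetch_all(self):
        self.fetch_count += 1
        return list(self.records)


class CachingStorage:
    """Sketch of a cache that is either fully populated or unset."""

    def __init__(self, nested):
        self._nested = nested
        self._cache = None  # None means "never fetched", not "no records"

    def find(self, key):
        if self._cache is None:
            # Fetch everything at once; cached dimensions are small by
            # configuration, so one bulk query beats per-row queries.
            self._cache = {r["id"]: r for r in self._nested.fetch_all()}
        # A miss here now reliably means the record does not exist.
        return self._cache.get(key)

    def clear_cache(self):
        # Safe to call at transaction rollback: no database access, and
        # the next find() repopulates.  Setting {} instead would wrongly
        # claim the database holds no records at all.
        self._cache = None


nested = FakeNestedStorage([{"id": "g"}, {"id": "r"}])
storage = CachingStorage(nested)
assert storage.find("g") == {"id": "g"}
assert storage.find("r") == {"id": "r"}
assert nested.fetch_count == 1  # both lookups served by one bulk fetch
storage.clear_cache()
assert storage.find("g") == {"id": "g"}
assert nested.fetch_count == 2  # repopulated after rollback
```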
This simplifies its join implementation by allowing it to delegate to
the target's join method.  It'll also help with some new methods in
future commits.
This is a full rewrite of Query - it's now a concrete final class
instead of a hierarchy (extensibility is provided by the QueryContext and
QueryBackend interfaces it delegates to), and it's all backed by
Relation instead of interacting with SQLAlchemy objects at all.

That's important for future client-server butler work, where the
expectation is that Query can be used more-or-less as-is with new
QueryContext and QueryBackend implementations (with the caveat that an
abstract interface rarely survives the definition of its second-ever
derived class unscathed).

It's also where Relation really starts to show its power in making the
query system _conceptually_ simpler, even if there's still a ton of
code: before, when applying a new operation (e.g. sorting) to an
existing query, we basically had to try to reconstruct all of the
parameters that went into the original query, then modify them
according to the new operation (and any that had been applied since).
Effectively every Query factory method had to know about all of the
things that could go into constructing a Query.  Now, we can sometimes
just append a new relation operation to the existing tree, and even
the more complex cases can be expressed as relatively straightforward
operations on that existing tree.
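The "append an operation to the tree" idea can be illustrated with a toy relation tree (the names below are illustrative, not the daf_relation API):

```python
from dataclasses import dataclass


# Each operation wraps an existing tree instead of reconstructing the
# parameters that went into the original query.
@dataclass(frozen=True)
class Leaf:
    rows: tuple


@dataclass(frozen=True)
class Sorted:
    base: object
    key: str


def evaluate(tree):
    """Walk the tree and produce result rows."""
    if isinstance(tree, Leaf):
        return list(tree.rows)
    if isinstance(tree, Sorted):
        return sorted(evaluate(tree.base), key=lambda row: row[tree.key])
    raise TypeError(tree)


# Applying a new operation (e.g. sorting) is just wrapping the tree:
query = Leaf(rows=({"visit": 2}, {"visit": 1}))
query = Sorted(base=query, key="visit")
assert evaluate(query) == [{"visit": 1}, {"visit": 2}]
```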

That gives us the backbone we'll need for real query composability;
it's not quite here yet only because I'm focusing now on
reimplementing existing interfaces rather than defining newer more
powerful ones.  But it's a pretty small jump from here to a lot of
things we've wanted for a long time, especially QG generation,
including:

- data ID set uploads, implemented as iteration-engine leaf relations that
  get transferred to the SQL engine;

- adding new dimensions to an existing query with customization of the
  spatiotemporal relationships to use, which is just a frontend to
  QueryBackend.make_dimension_relation;

- vectorizing calibration and other prerequisite lookups, which is a
  little more code in `Query.find_datasets` to add temporal joins.
And the previously thorny problem of how to represent the results
  of that - which need to involve rows that represent both a
  DatasetRef and dimensions that aren't part of that ref's DatasetType
  - is neatly solved by direct iteration over a Query instead of
  DatasetQueryResults.

On that note, the other big change here is that pretty much all of the
functionality in the QueryResults classes has moved to Query; the
QueryResults classes are just thin facades now, and
DimensionRecordQueryResults now delegates directly to Query (as its
data ID and dataset siblings do) instead of the query-factory stuff it
had before.

The changes in tests here mostly reflect the fact that using relation
improves our ability to do diagnostics on queries that return no
results as well as our ability to see when a query is doomed before
execution.  I'm a bit worried the diagnostics in some cases could also
be getting a bit less readable to end-users with this change, though,
because they're coming from daf_relation now and use its terminology.
But I think they'll still carry the important pieces of information
and they'll be much more precise, so let's see how they work in the
wild before trying to intercept and rewrite them in daf_butler.
All previous uses went away with the Query rewrite.
This requires passing a SqlQueryContext down from the Registry itself
through a few layers.  That'll be increasingly common as we move more
read operations to rely on relation-returning methods to reduce the
SQL-aware interface surface.
This was only used by DimensionRecordStorage.fetch and
DatasetRecordStorage.[de]certify, and both of those have now been
switched to use relations instead.
All usage now fully replaced by relation.
This avoids a problem with dimension record cache consistency that was
introduced in the cache rework a few commits back, when we started
considering the cache more trustworthy.

Unfortunately there's no cost-free fix: that cache rework fixed other
bugs with cache consistency, so we can't just drop it.  The best
solution seems to be admitting that expandDataIds cannot be _relied_
upon to test for dimension record existence; this has always been the
case in general, but it may have been impossible to trigger via the
default dimension configuration before.

See code comments for additional rationale.
We were using += on the results of these calls, which only works for
lists or tuples, not general iterables (and it was always lists at
runtime).

Not sure why MyPy didn't complain; maybe it doesn't check arithmetic
operators?
@TallJimbo TallJimbo merged commit 25462af into main Jan 6, 2023
@TallJimbo TallJimbo deleted the tickets/DM-31725 branch January 6, 2023 15:03