DM-31725: rewrite queries subpackage, via new dependency on daf_relation #759
Conversation
Force-pushed from f0abe6e to 76750ef
Codecov Report: Base: 85.27% // Head: 84.87% // Decreases project coverage by -0.41%.
Additional details and impacted files:
@@ Coverage Diff @@
## main #759 +/- ##
==========================================
- Coverage 85.27% 84.87% -0.41%
==========================================
Files 261 267 +6
Lines 34728 34993 +265
Branches 5868 5980 +112
==========================================
+ Hits 29615 29699 +84
- Misses 3865 4009 +144
- Partials 1248 1285 +37
Force-pushed from 3cc7e71 to 271f36e
To confirm what I said on Jira:
Looks good, but this is all very new to me, so there are probably some dumb questions. I hoped that we would have less code in daf_butler, but apparently line count cannot go down. 🙂
I believe I've addressed all review comments but two:

- reworking `CollectionWildcard` to make `require_ordered` less fragile;
- adding a test for inclusive vs. exclusive bounds in range literals.

But I'm going to rebase on main and resolve some conflicts before I start in on those.
record = reader.read(row)
cache[record.dataId] = record
self._cache = cache
return cast(Mapping[DataCoordinate, DimensionRecord], self._cache)
👍 I want to revisit some of the caching logic at a higher level, too, to deal with problems with cache expiration on rollback like DM-35498, since I think that's broader than just the collections issue it mentions and I think the approach taken here of allowing caches to be empty is the way out. I'm repurposing that ticket to expand its scope to this now.
join : `~lsst.daf.relation.Join`
    Join operation to use when the implementation is an actual join.
    When a true join is being simulated by other relation operations,
    this object's `~lsst.daf.relation.Join.min_columns` and
    `~lsst.daf.relation.Join.max_columns` should still be respected.
I've defined the interface this way to allow the skypix dimension storage to simulate joins by actually applying a column-calculation relation operation that adds the region columns. That behaves like a join in this case because we know the virtual skypix table has to have a superset of the skypix indices appearing in any other table, and there are never any other columns in those virtual tables besides index and the region.
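The idea can be caricatured with a toy example (the `htm7` column name and `compute_region` function below are illustrative stand-ins, not real daf_butler/daf_relation API): because the virtual skypix table is exactly a mapping from index to region with no other columns, a join against it can be simulated by calculating the region column from the index column that the rows already carry.

```python
def compute_region(index: int) -> str:
    # Placeholder for a real pixelization lookup (skypix index -> sky region).
    return f"region-{index}"

rows = [{"htm7": 10, "visit": 1}, {"htm7": 11, "visit": 2}]

# Equivalent to joining the virtual {index -> region} skypix table on "htm7":
# every index is guaranteed to be present there, so this never drops rows.
joined = [{**row, "htm7.region": compute_region(row["htm7"])} for row in rows]
```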
def digestTables(self) -> Iterable[sqlalchemy.schema.Table]:
    """Return tables used for schema digest.

    Returns
    -------
    tables : `Iterable` [ `sqlalchemy.schema.Table` ]
        Possibly empty set of tables for schema digest calculations.
    relation : `lsst.daf.relation.Relation`
        New relation.
I think this must have been copy-pasted into the wrong place, but looking a little closer, I think the annotation should really be `list[sqlalchemy.schema.Table]`, since we do `+=` on it and that's not legal on general iterables. Interesting that mypy didn't complain about that; I wonder if it can't do arithmetic operators for some reason.
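The runtime half of that observation is easy to demonstrate: `+=` happily extends a list in place (list's `__iadd__` accepts any iterable), but on a generic iterable such as a generator it falls back to `+`, which generators don't define, so `Iterable` is too weak an annotation for a value we augment-assign to.

```python
from typing import Iterable

def make_iterable() -> Iterable[int]:
    return (i for i in range(3))  # a generator is a perfectly valid Iterable

acc: list[int] = [0]
acc += make_iterable()  # fine: list.__iadd__ accepts any iterable

gen = make_iterable()
failed = False
try:
    gen += [99]  # falls back to gen + [99]; generators define no __add__
except TypeError:
    failed = True
```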
detector = cast(Dimension, self.universe["detector"])
if detector in dataId:
    relation = visit_definition_storage.join(relation, Join(), context)
That would be totally fine at runtime. But for MyPy I'd also need to do `cast(DatabaseDimensionRecordStorage, visit_definition_storage)`, since `DimensionRecordStorage` doesn't have `make_relation`.
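For illustration, the pattern looks like this (toy classes standing in for the real storage hierarchy): `typing.cast` narrows the static type for MyPy without performing any runtime check or conversion.

```python
from typing import cast

class DimensionRecordStorage:
    """Toy stand-in for the base class, which lacks make_relation."""

class DatabaseDimensionRecordStorage(DimensionRecordStorage):
    def make_relation(self) -> str:
        return "relation"

storage: DimensionRecordStorage = DatabaseDimensionRecordStorage()
# storage.make_relation() would be a MyPy error on the base type;
# cast() tells the type checker what we know, at zero runtime cost.
relation = cast(DatabaseDimensionRecordStorage, storage).make_relation()
```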
context : `.queries.SqlQueryContext`
    Object that manages relation engines and database-side state
    (e.g. temporary tables) for the query.
Docs got out of date; in some of the intermediate commits this did have to be a `SqlQueryContext`, and it looks like I updated the annotation after that but not the docs.
def visitRangeLiteral(self, start: int, stop: int, stride: int | None, node: Node) -> VisitorResult:
    # Docstring inherited.
    return ColumnContainer.range_literal(range(start, stop, stride or 1))
Good catch! I hadn't realized our `stop` here was inclusive; I'll add a `+1` here as well as a unit test.
Force-pushed from 05a0332 to c52226a
This adds a dependency on daf_relation and specializations of a few of its types for daf_butler. This doesn't actually do anything yet.
These will eventually form the backbone of the new query system; for now they're unused stubs.
Add relation-row readers for DimensionRecords and expanded data IDs. These are not yet used anywhere, and they partially duplicate functionality in the Query.extract* methods, which will soon be replaced by these new readers.
These are not used yet, but they will be soon (and we'll be adding more, too, as needed).
Force-pushed from 7cac757 to 934bab7
This is the first of several commits that replace pieces of Registry query code with new versions that are based on the new daf_relation package. The real value of that switch won't be realized until the query system as a whole is replaced (and in particular the Query class itself is wholly rewritten), and in the meantime there will be a fair amount of transitional code tying the new and old interfaces together. This moves the former DatasetRecordStorage.find implementation into SqlRegistry, since it can now delegate to DatasetRecordStorage.make_relation. And once we have a QueryBackend implementation for RemoteRegistry, we can move it further into the base class Registry and use the same implementation for all registries.
This gives the queries.expressions package a single entry point, which is annoying in its unit tests' imports but I think a very nice win for code organization overall, in addition to setting up this aspect of the query system for the future. The commit also includes a few minor (but broad) style changes: renaming dataId->data_id arguments in some internal interfaces, and using just the empty string for nonexistent expressions instead of allowing either None or the empty string (at the level of type annotations, anyway). It also includes one small behavior change: instead of check=False turning off all expression analysis (which disabled some important checks, and made it hard to reason about the behavior in their absence), it now just permits expressions to leave dimensions "orphaned" by not providing their governor. In internal interfaces the argument has been renamed to allow_orphans to reflect this. Finally, this still includes some transitional code that will ultimately be removed when Relation and Predicate are more fully integrated with the query system; we still have two versions (i.e. new and old) of many, many things. This also restores inclusive stops for range literals in expressions; that still needs a test.
These exception messages reflect parsing an order-by str sequence in the context of a single DimensionElement, but logic that will land shortly instead parses these in the context of multiple elements (i.e. a data ID query). I'll restore these tests and the slightly better error messages they represent after the daf_relation overhaul is complete.
Most of this ticket is a major overhaul of QueryBuilder and the DimensionRecordStorage classes to use relations throughout, with conversion back to the old SimpleQuery system now happening only at the boundary between QueryBuilder and Query. Ideally, that'd be plenty for one commit, but doing that work revealed some preexisting problems with our dimension caching system and how we test for data ID validity in a few different contexts. But I couldn't fix those in earlier commits without the relation machinery introduced on this one, and I couldn't get that new relation machinery to pass tests without fixing them. So here they are.

The first problem was that all the previous cleverness about populating the CachingDimensionRecordStorage cache one row at a time never made sense: it's much more efficient to just fetch all rows for cached dimensions whenever any of them are requested, since by definition we only configure caching for dimensions whose record count is small enough that we can easily hold them all in memory, and probably small enough that fetching all of them is still dominated by per-query overheads. Changing that fixes a bug in which QueryDimensionRecordStorage (used exclusively for 'band') happily returned nonexistent dimension records if you gave it a nonexistent data ID, and hence you could expand a bad data ID and not notice. Or, more precisely, this commit sidesteps that still-existing problem, because 'band' is actually configured to use CachingDimensionRecordStorage on top of QueryDimensionRecordStorage, and now CachingDimensionRecordStorage prevents the bad behavior. In a few commits I plan to remove the fetch method from all of those storage objects, and then the bug will really fully be gone.

Anyhow, fixing that bug creates new problems, in the form of a Registry test that was actually checking for the wrong behavior - we *can* tell when a query with a bad band value is doomed, because we try to expand the data ID up front and (now) that fails. But we don't want an exception for this - we want a doomed query. So that snowballed into additional logic changes involving exceptions vs. doomed queries.

Last but not least, this modifies CachingDimensionRecordStorage and GovernorDimensionRecordStorage to allow them to have an unset (None) cache, and removes the refresh methods for populating those caches. Calling higher-level refresh methods now actually unsets the caches, forcing them to be repopulated on first use. It didn't actually make sense to have both refresh and clearCaches before, but we need clearCaches in order to have something we can call at transaction rollback without hitting the database; this is exactly the problem identified (with collection caching) on DM-35498. Also, having clearCaches make the cache an empty container was a bug waiting to happen for GovernorDimensionRecordStorage, since that (usually incorrectly) would say that there are no governor dimension values in the database.
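The unset-versus-empty distinction described above can be sketched like this (illustrative names, not the real storage classes): clearing sets the cache to None without touching the database, repopulation happens lazily on first use, and an empty dict then unambiguously means "no rows exist" rather than "not fetched yet".

```python
from typing import Callable, Mapping

class CachingStorageSketch:
    """Toy cache where None means 'unset', never 'empty meaning no rows'."""

    def __init__(self, fetch_all: Callable[[], Mapping[int, str]]):
        self._fetch_all = fetch_all
        self._cache: dict[int, str] | None = None  # None = unset; {} = truly empty

    def clear_caches(self) -> None:
        # Safe to call at transaction rollback: no database access here.
        self._cache = None

    def records(self) -> Mapping[int, str]:
        if self._cache is None:
            # Fetch everything at once; cached dimensions are small by design.
            self._cache = dict(self._fetch_all())
        return self._cache
```

One fetch serves all subsequent lookups until the cache is explicitly cleared, at which point the next access repopulates it.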
This simplifies its join implementation by allowing it to delegate to the target's join method. It'll also help with some new methods in future commits.
This is a full rewrite of Query - it's now a concrete final class instead of a hierarchy (extensibility is provided by the QueryContext and QueryBackend interfaces it delegates to), and it's all backed by Relation instead of interacting with SQLAlchemy objects at all. That's important for future client-server butler work, where the expectation is that Query can be used more-or-less as-is with new QueryContext and QueryBackend implementations (with the caveat that an abstract interface rarely survives the definition of its second-ever derived class unscathed). It's also where Relation really starts to show its power in making the query system _conceptually_ simpler, even if there's still a ton of code: before, when applying a new operation (e.g. sorting) to an existing query, we basically had to try to reconstruct all of the parameters that went into the original query, then modify them according to the new operation (and any that had been applied since). Effectively every Query factory method had to know about all of the things that could go into constructing a Query. Now, we can sometimes just append a new relation operation to the existing tree, and even the more complex cases can be expressed as relatively straightforward operations on that existing tree. That gives us the backbone we'll need for real query composability; it's not quite here yet only because I'm focusing now on reimplementing existing interfaces rather than defining newer, more powerful ones.
But it's a pretty small jump from here to a lot of things we've wanted for a long time, especially QG generation, including:

- data ID set uploads, implemented as iteration-engine leaf relations that get transferred to the SQL engine;
- adding new dimensions to an existing query with customization of the spatiotemporal relationships to use, which is just a frontend to QueryBackend.make_dimension_relation;
- vectorizing calibration and other prerequisite lookups, which is a little more code in `Query.find_datasets` to add temporal joins.

And the previously thorny problem of how to represent the results of that - which need to involve rows that represent both a DatasetRef and dimensions that aren't part of that ref's DatasetType - is neatly solved by direct iteration over a Query instead of DatasetQueryResults. On that note, the other big change here is that pretty much all of the functionality in the QueryResults classes has moved to Query; the QueryResults classes are just thin facades now, and DimensionRecordQueryResults now delegates directly to Query (as its data ID and dataset siblings do) instead of the query-factory stuff it had before. The changes in tests here mostly reflect the fact that using relations improves our ability to do diagnostics on queries that return no results, as well as our ability to see when a query is doomed before execution. I'm a bit worried the diagnostics in some cases could also be getting a bit less readable to end-users with this change, though, because they're coming from daf_relation now and use its terminology. But I think they'll still carry the important pieces of information and they'll be much more precise, so let's see how they work in the wild before trying to intercept and rewrite them in daf_butler.
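The composability point can be caricatured with a tiny immutable operation tree (a toy, not daf_relation's actual API): applying a new operation such as a sort wraps the existing tree in a new node, rather than reconstructing every parameter that went into the original query.

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Rows:
    """Leaf relation: literal rows held in memory."""
    rows: tuple

@dataclass(frozen=True)
class Sort:
    """Operation node that wraps an existing relation tree."""
    base: Any
    key: str

def evaluate(tree: Any) -> list:
    # Walk the tree; each node only knows how to transform its base's rows.
    if isinstance(tree, Rows):
        return list(tree.rows)
    if isinstance(tree, Sort):
        return sorted(evaluate(tree.base), key=lambda row: row[tree.key])
    raise TypeError(f"unknown node: {tree!r}")

query = Rows(rows=({"visit": 2}, {"visit": 1}))
sorted_query = Sort(query, key="visit")  # append an operation; don't rebuild
```

Because nodes are immutable, the original `query` is untouched and both trees remain independently evaluable.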
This reverts commit 771d589.
All previous uses went away with the Query rewrite.
This requires passing a SqlQueryContext down from the Registry itself through a few layers. That'll be increasingly common as we move more read operations to rely on relation-returning methods to reduce the SQL-aware interface surface.
This was only used by DimensionRecordStorage.fetch and DatasetRecordStorage.[de]certify, and both of those have now been switched to use relations instead.
All usage now fully replaced by relation.
This avoids a problem with dimension record cache consistency that was introduced in the cache rework a few commits back, when we started considering the cache more trustworthy. Unfortunately there's no cost-free fix: that cache rework fixed other bugs with cache consistency, so we can't just drop it. The best solution seems to be admitting that expandDataIds cannot be _relied_ upon to test for dimension record existence; this has always been the case in general, but it may have been impossible to trigger via the default dimension configuration before. See code comments for additional rationale.
We were using += on the results of these calls, which only works for lists or tuples, not general iterables (and it was always lists at runtime). Not sure why MyPy didn't complain; maybe it doesn't check arithmetic operators?
Force-pushed from 934bab7 to 100a12c
Checklist
doc/changes