
DM-31725: rewrite queries subpackage, via new dependency on daf_relation #759

Merged
merged 25 commits into main from tickets/DM-31725 on Jan 6, 2023

Conversation

TallJimbo
Member

@TallJimbo TallJimbo commented Dec 2, 2022

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

@TallJimbo TallJimbo force-pushed the tickets/DM-31725 branch 3 times, most recently from f0abe6e to 76750ef Compare December 2, 2022 20:24
@codecov

codecov bot commented Dec 2, 2022

Codecov Report

Base: 85.27% // Head: 84.87% // decreases project coverage by 0.40% ⚠️

Coverage data is based on head (100a12c) compared to base (03715f1).
Patch coverage: 80.33% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #759      +/-   ##
==========================================
- Coverage   85.27%   84.87%   -0.41%     
==========================================
  Files         261      267       +6     
  Lines       34728    34993     +265     
  Branches     5868     5980     +112     
==========================================
+ Hits        29615    29699      +84     
- Misses       3865     4009     +144     
- Partials     1248     1285      +37     
Impacted Files Coverage Δ
python/lsst/daf/butler/registry/_registry.py 72.82% <ø> (ø)
python/lsst/daf/butler/registry/managers.py 87.12% <ø> (-0.10%) ⬇️
python/lsst/daf/butler/script/_associate.py 60.00% <ø> (ø)
python/lsst/daf/butler/script/_pruneDatasets.py 97.67% <ø> (ø)
python/lsst/daf/butler/script/pruneCollection.py 84.61% <ø> (ø)
python/lsst/daf/butler/script/queryDataIds.py 81.48% <ø> (ø)
python/lsst/daf/butler/script/queryDatasets.py 84.00% <ø> (ø)
tests/test_cliCmdAssociate.py 90.00% <ø> (ø)
tests/test_cliCmdPruneDatasets.py 98.14% <ø> (ø)
...thon/lsst/daf/butler/registry/dimensions/skypix.py 61.11% <41.17%> (-15.56%) ⬇️
... and 63 more


@TallJimbo TallJimbo force-pushed the tickets/DM-31725 branch 8 times, most recently from 3cc7e71 to 271f36e Compare December 6, 2022 18:29
@timj
Member

timj commented Dec 6, 2022

To confirm what I said on Jira: lsst-daf-relation will have to be added as a dependency in pyproject.toml once that package ends up on PyPI. (which I see happened just after I wrote this)

Contributor

@andy-slac andy-slac left a comment


Looks good, but this is all very new to me, so there are probably some dumb questions. I had hoped we would end up with less code in daf_butler, but apparently the line count cannot go down. 🙂

python/lsst/daf/butler/core/_column_type_info.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/registry/queries/_readers.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/registry/queries/_readers.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/registry/queries/_query.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/registry/queries/_query.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/registry/queries/_query.py Outdated Show resolved Hide resolved
Member Author

@TallJimbo TallJimbo left a comment


I believe I've addressed all review comments but two:

  • reworking CollectionWildcard to make require_ordered less fragile;
  • adding a test for inclusive vs. exclusive bounds in range literals.

But I'm going to rebase on main and resolve some conflicts before I start in on those.

record = reader.read(row)
cache[record.dataId] = record
self._cache = cache
return cast(Mapping[DataCoordinate, DimensionRecord], self._cache)
Member Author


👍 I want to revisit some of the caching logic at a higher level, too, to deal with problems with cache expiration on rollback like DM-35498. I think that problem is broader than just the collections issue it mentions, and the approach taken here of allowing caches to be empty is the way out, so I'm repurposing that ticket to expand its scope.

Comment on lines +112 to +116
join : `~lsst.daf.relation.Join`
Join operation to use when the implementation is an actual join.
When a true join is being simulated by other relation operations,
this object's `~lsst.daf.relation.Join.min_columns` and
`~lsst.daf.relation.Join.max_columns` should still be respected.
Member Author


I've defined the interface this way to allow the skypix dimension storage to simulate joins by actually applying a column-calculation relation operation that adds the region columns. That behaves like a join in this case because we know the virtual skypix table has to have a superset of the skypix indices appearing in any other table, and there are never any other columns in those virtual tables besides index and the region.

Comment on lines 242 to 248
def digestTables(self) -> Iterable[sqlalchemy.schema.Table]:
"""Return tables used for schema digest.

Returns
-------
tables : `Iterable` [ `sqlalchemy.schema.Table` ]
Possibly empty set of tables for schema digest calculations.
relation : `lsst.daf.relation.Relation`
New relation.
Member Author


I think this must have been copy-pasted into the wrong place, but looking a little closer, I think the annotation should really be list[sqlalchemy.schema.Table], since we do += on it and that's not legal on general iterables. Interesting that mypy didn't complain about that; I wonder if it can't do arithmetic operators for some reason.
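The mismatch can be demonstrated with plain Python (a minimal sketch; `str` stands in for `sqlalchemy.schema.Table`, and `digest_tables` is a hypothetical stand-in for the real method):

```python
from collections.abc import Iterable


def digest_tables() -> Iterable[str]:
    # Annotated as Iterable, but the runtime value is always a list.
    return ["table_a"]


# The Iterable annotation also permits values that don't support +=,
# such as a generator, so `tables += other` at the call site is not
# actually safe under that annotation:
gen: Iterable[str] = (name for name in ["table_a"])
try:
    gen += ["table_b"]  # type: ignore[operator]
except TypeError:
    pass  # generators have no __iadd__ or __add__

# Narrowing the return annotation to list[str] makes += well-typed:
tables: list[str] = ["table_a"]
tables += ["table_b"]
assert tables == ["table_a", "table_b"]
```

This is why narrowing the annotation to `list[...]` (matching the runtime type) is the safer fix than keeping the general `Iterable`.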


detector = cast(Dimension, self.universe["detector"])
if detector in dataId:
relation = visit_definition_storage.join(relation, Join(), context)
Member Author


That would be totally fine at runtime. But for MyPy I'd also need to do cast(DatabaseDimensionRecordStorage, visit_definition_storage) since DimensionRecordStorage doesn't have make_relation.

Comment on lines 53 to 55
context : `.queries.SqlQueryContext`
Object that manages relation engines and database-side state
(e.g. temporary tables) for the query.
Member Author


Docs got out of date; in some of the intermediate commits this did have to be a SqlQueryContext, and it looks like I updated the annotation after that but not the docs.

python/lsst/daf/butler/registry/queries/_query.py Outdated Show resolved Hide resolved

def visitRangeLiteral(self, start: int, stop: int, stride: int | None, node: Node) -> VisitorResult:
# Docstring inherited.
return ColumnContainer.range_literal(range(start, stop, stride or 1))
Member Author


Good catch! I hadn't realized our stop here was inclusive; I'll add a +1 here as well as a unit test.

This adds a dependency on daf_relation and specializations of a few of
its types for daf_butler.  This doesn't actually do anything yet.
These will eventually form the backbone of the new query system; for
now they're unused stubs.
These are not yet used anywhere, and they partially duplicate
functionality in the Query.extract* methods, which will soon be
replaced by these new readers.

Add relation-row readers for DimensionRecords and expanded data IDs.
These are not used yet, but they will be soon (and we'll be adding
more, too, as needed).
@TallJimbo TallJimbo force-pushed the tickets/DM-31725 branch 2 times, most recently from 7cac757 to 934bab7 Compare January 5, 2023 18:51
This is the first of several commits that replace pieces of Registry
query code with new versions that are based on the new daf_relation
package.  The real value of that switch won't be realized until the
query system as a whole is replaced (and in particular the Query class
itself is wholly rewritten), and in the meantime there will be a fair
amount of transitional code tying the new and old interfaces together.

This moves the former DatasetRecordStorage.find implementation into
SqlRegistry, since it can now delegate to
DatasetRecordStorage.make_relation.  And once we have a QueryBackend
implementation for RemoteRegistry, we can move it further into the
base class Registry and use the same implementation for all
registries.
This gives the queries.expressions package a single entry point, which
is annoying in its unit tests' imports but I think a very nice win for
code organization overall, in addition to setting up this aspect of
the query system for the future.

The commit also includes a few minor (but broad) style changes:
renaming dataId->data_id arguments in some internal interfaces, and
using just the empty string for nonexistent expressions instead of
allowing either None or the empty string (at the level of type
annotations, anyway).

It also includes one small behavior change: instead of check=False
turning off all expression analysis (which disabled some important
checks, and made it hard to reason about the behavior in their
absence), it now just permits expressions to leave dimensions
"orphaned" by not providing their governor.  In internal interfaces
the argument has been renamed to allow_orphans to reflect this.

Finally, this still includes some transitional code that will
ultimately be removed when Relation and Predicate are more fully
integrated with the query system.  We still have two versions
(i.e. new and old) of many many things.

Restore the inclusive stop for range literals in expressions.

Needs a test.
These exception messages reflect parsing an order-by str sequence in
the context of a single DimensionElement, but logic that will land
shortly instead parses these in the context of multiple elements (i.e.
a data ID query).  I'll restore these tests and the slightly better
error messages they represent after the daf_relation overhaul is
complete.
Most of this ticket is a major overhaul of QueryBuilder and the
DimensionRecordStorage classes to use relations throughout, with
conversion back to the old SimpleQuery system now happening only at
the boundary between QueryBuilder and Query.

Ideally, that'd be plenty for one commit, but doing that work revealed
some preexisting problems with our dimension caching system and how we
test for data ID validity in a few different contexts.  But I couldn't
fix those in earlier commits without the relation machinery introduced
on this one, and I couldn't get that new relation machinery to pass
tests without fixing them.  So here they are.

The first problem was that all the previous cleverness about
populating the CachingDimensionRecordStorage cache one row at a time
never made sense: it's much more efficient to just fetch all rows for
cached dimensions whenever any of them are requested, since by
definition we only configure caching for dimensions whose record count
is small enough that we can easily hold them all in memory, and
probably small enough that fetching all of them is still dominated by
per-query overheads.

Changing that fixes a bug in which QueryDimensionRecordStorage (used
exclusively for 'band') happily returned nonexistent dimension records
if you gave it a nonexistent data ID, and hence you could expand a bad
data ID and not notice.  Or, more precisely, this commit sidesteps
that still-existing problem, because 'band' is actually configured to
use CachingDimensionRecordStorage on top of
QueryDimensionRecordStorage, and now CachingDimensionRecordStorage
prevents the bad behavior.  In a few commits I plan to remove the
fetch method from all of those storage objects, and then the bug will
really fully be gone.

Anyhow, fixing that bug creates new problems, in the form of a
Registry test that was actually checking for the wrong behavior - we
*can* tell when a query with a bad band value is doomed, because we
try to expand the data ID up front and (now) that fails.  But we don't
want an exception for this - we want a doomed query.  So that
snowballed into additional logic changes involving exceptions
vs. doomed queries.

Last but not least, this modifies CachingDimensionRecordStorage and
GovernorDimensionRecordStorage to allow them to have an unset (None)
cache, and removes the refresh methods for populating those caches.
Calling higher-level refresh methods now actually unsets the caches,
forcing them to be repopulated on first use.  It didn't actually make
sense to have both refresh and clearCaches before, but we need
clearCaches in order to have something we can call at transaction
rollback without hitting the database; this is exactly the problem
identified (with collection caching) on DM-35498.  Also, having
clearCaches make the cache an empty container was a bug waiting to
happen for GovernorDimensionRecordStorage, since that (usually
incorrectly) would say that there are no governor dimension values in
the database.
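The unset-vs-empty distinction and the fetch-all strategy described above can be sketched as a thin cache layer (names are illustrative, not the real `CachingDimensionRecordStorage` API):

```python
class FakeNestedStorage:
    """Stand-in for a wrapped record storage object (hypothetical)."""

    def __init__(self, records):
        self.records = records
        self.fetch_count = 0

    def fetch_all(self):
        self.fetch_count += 1
        return list(self.records)


class CachingStorage:
    """Sketch of a cache that is either fully populated or unset."""

    def __init__(self, nested):
        self._nested = nested
        self._cache = None  # None means "never fetched", not "no records"

    def find(self, key):
        if self._cache is None:
            # Fetch everything at once; cached dimensions are small by
            # configuration, so one bulk query beats per-row queries.
            self._cache = {r["id"]: r for r in self._nested.fetch_all()}
        # A miss here now reliably means the record does not exist.
        return self._cache.get(key)

    def clear_cache(self):
        # Safe to call at transaction rollback: no database access, and
        # the next find() repopulates.  Setting {} instead would wrongly
        # claim the database holds no records at all.
        self._cache = None


nested = FakeNestedStorage([{"id": "g"}, {"id": "r"}])
storage = CachingStorage(nested)
assert storage.find("g") == {"id": "g"}
assert storage.find("r") == {"id": "r"}
assert nested.fetch_count == 1  # both lookups served by one bulk fetch
storage.clear_cache()
assert storage.find("g") == {"id": "g"}
assert nested.fetch_count == 2  # repopulated after rollback
```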
This simplifies its join implementation by allowing it to delegate to
the target's join method.  It'll also help with some new methods in
future commits.
This is a full rewrite of Query - it's now a concrete final class
instead of a hierarchy (extensibility is provided by the QueryContext and
QueryBackend interfaces it delegates to), and it's all backed by
Relation instead of interacting with SQLAlchemy objects at all.

That's important for future client-server butler work, where the
expectation is that Query can be used more-or-less as-is with new
QueryContext and QueryBackend implementations (with the caveat that an
abstract interface rarely survives the definition of its second-ever
derived class unscathed).

It's also where Relation really starts to show its power in making the
query system _conceptually_ simpler, even if there's still a ton of
code: before, when applying a new operation (e.g. sorting) to an
existing query, we basically had to try to reconstruct all of the
parameters that went into the original query, then modify them
according to the new operation (and any that had been applied since).
Effectively every Query factory method had to know about all of the
things that could go into constructing a Query.  Now, we can sometimes
just append a new relation operation to the existing tree, and even
the more complex cases can be expressed as relatively straightforward
operations on that existing tree.
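The "append an operation to the tree" idea can be illustrated with a toy relation tree (the names below are illustrative, not the daf_relation API):

```python
from dataclasses import dataclass


# Each operation wraps an existing tree instead of reconstructing the
# parameters that went into the original query.
@dataclass(frozen=True)
class Leaf:
    rows: tuple


@dataclass(frozen=True)
class Sorted:
    base: object
    key: str


def evaluate(tree):
    """Walk the tree and produce result rows."""
    if isinstance(tree, Leaf):
        return list(tree.rows)
    if isinstance(tree, Sorted):
        return sorted(evaluate(tree.base), key=lambda row: row[tree.key])
    raise TypeError(tree)


# Applying a new operation (e.g. sorting) is just wrapping the tree:
query = Leaf(rows=({"visit": 2}, {"visit": 1}))
query = Sorted(base=query, key="visit")
assert evaluate(query) == [{"visit": 1}, {"visit": 2}]
```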

That gives us the backbone we'll need for real query composability;
it's not quite here yet only because I'm focusing now on
reimplementing existing interfaces rather than defining newer more
powerful ones.  But it's a pretty small jump from here to a lot of
things we've wanted for a long time, especially QG generation,
including:

- data ID set uploads, implemented as iteration-engine leaf relations that
  get transferred to the SQL engine;

- adding new dimensions to an existing query with customization of the
  spatiotemporal relationships to use, which is just a frontend to
  QueryBackend.make_dimension_relation;

- vectorizing calibration and other prerequisite lookups, which is a
  little more code in `Query.find_datasets` to add temporal joins.
And the previously thorny problem of how to represent the results
  of that - which need to involve rows that represent both a
  DatasetRef and dimensions that aren't part of that ref's DatasetType
  - is neatly solved by direct iteration over a Query instead of
  DatasetQueryResults.

On that note, the other big change here is that pretty much all of the
functionality in the QueryResults classes has moved to Query; the
QueryResults classes are just thin facades now, and
DimensionRecordQueryResults now delegates directly to Query (as its
data ID and dataset siblings do) instead of the query-factory stuff it
had before.

The changes in tests here mostly reflect the fact that using relation
improves our ability to do diagnostics on queries that return no
results as well as our ability to see when a query is doomed before
execution.  I'm a bit worried the diagnostics in some cases could also
be getting a bit less readable to end-users with this change, though,
because they're coming from daf_relation now and use its terminology.
But I think they'll still carry the important pieces of information
and they'll be much more precise, so let's see how they work in the
wild before trying to intercept and rewrite them in daf_butler.
All previous uses went away with the Query rewrite.
This requires passing a SqlQueryContext down from the Registry itself
through a few layers.  That'll be increasingly common as we move more
read operations to rely on relation-returning methods to reduce the
SQL-aware interface surface.
This was only used by DimensionRecordStorage.fetch and
DatasetRecordStorage.[de]certify, and both of those have now been
switched to use relations instead.
All usage now fully replaced by relation.
This avoids a problem with dimension record cache consistency that was
introduced in the cache rework a few commits back, when we started
considering the cache more trustworthy.

Unfortunately there's no cost-free fix: that cache rework fixed other
bugs with cache consistency, so we can't just drop it.  The best
solution seems to be admitting that expandDataIds cannot be _relied_
upon to test for dimension record existence; this has always been the
case in general, but it may have been impossible to trigger via the
default dimension configuration before.

See code comments for additional rationale.
We were using += on the results of these calls, which only works for
lists or tuples, not general iterables (and it was always lists at
runtime).

Not sure why MyPy didn't complain; maybe it doesn't check arithmetic
operators?
@TallJimbo TallJimbo merged commit 25462af into main Jan 6, 2023
@TallJimbo TallJimbo deleted the tickets/DM-31725 branch January 6, 2023 15:03