DM-41158: (mostly) implement QueryDriver for DirectButler #915

TallJimbo · 2023-11-30T20:01:08Z

I've inspected all of the missed-coverage lines and they fall into two categories:

- they need DatasetRef or DataCoordinate result support to be tested (which I'm deferring to another ticket);
- they're really trivial error cases that I don't really think are worth the effort.

Checklist

- ran Jenkins
- ~~added a release note for user-visible changes to doc/changes~~ (N/A)

codecov · 2023-11-30T20:11:47Z

Codecov Report

Attention: Patch coverage is 87.63736%, with 180 lines in your changes missing coverage. Please review.

Project coverage is 88.95%. Comparing base (c0af174) to head (8a03dd5).

| Files | Patch % | Lines |
|---|---|---|
| ...hon/lsst/daf/butler/direct_query_driver/_driver.py | 83.49% | 42 Missing and 26 partials ⚠️ |
| .../butler/registry/datasets/byDimensions/_storage.py | 47.05% | 24 Missing and 12 partials ⚠️ |
| ...t/daf/butler/direct_query_driver/_query_builder.py | 87.42% | 11 Missing and 10 partials ⚠️ |
| ...thon/lsst/daf/butler/registry/dimensions/static.py | 84.52% | 7 Missing and 6 partials ⚠️ |
| ...thon/lsst/daf/butler/queries/expression_factory.py | 87.17% | 9 Missing and 1 partial ⚠️ |
| .../daf/butler/direct_query_driver/_postprocessing.py | 88.05% | 3 Missing and 5 partials ⚠️ |
| ...lsst/daf/butler/direct_query_driver/_query_plan.py | 94.44% | 5 Missing and 2 partials ⚠️ |
| .../butler/direct_query_driver/_sql_column_visitor.py | 96.74% | 2 Missing and 2 partials ⚠️ |
| python/lsst/daf/butler/tests/butler_queries.py | 97.89% | 2 Missing and 2 partials ⚠️ |
| python/lsst/daf/butler/name_shrinker.py | 62.50% | 3 Missing ⚠️ |

... and 4 more

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #915      +/-   ##
==========================================
- Coverage   88.97%   88.95%   -0.03%     
==========================================
  Files         329      338       +9     
  Lines       42631    44014    +1383     
  Branches     8743     9071     +328     
==========================================
+ Hits        37933    39151    +1218     
- Misses       3445     3549     +104     
- Partials     1253     1314      +61


dhirving

Looks good to me, just a bunch of nitpicks you can take or leave. (Like last time, GitHub is hiding some of the comments because there are too many.)

I really appreciated your extensive commenting in the hairier parts of query construction... I'm not going to claim that I fully understand all of the details, but I'm definitely closer than I was before.

Looks like it will be pretty easy to wrap the client/server version around this.

dhirving · 2024-04-01T17:07:16Z

python/lsst/daf/butler/column_spec.py

@@ -79,11 +80,14 @@ class _BaseColumnSpec(pydantic.BaseModel, ABC):
        description="Whether the column may be ``NULL``.",
    )

-    def to_sql_spec(self, **kwargs: Any) -> ddl.FieldSpec:
+    def to_sql_spec(self, name_shrinker: NameShrinker | None = None, **kwargs: Any) -> ddl.FieldSpec:


Should this nameshrinker argument be non-optional? I would think that if it is ever needed, it will always be needed... otherwise different parts of the code can think names are different. Just creates the potential for certain columns to work with some types of queries/operations but not others.

TallJimbo · 2024-04-04

I prefer to do that at a higher level, because the set of names that need shrinking is limited to only those constructed from dataset type names, and that keeps name-shrinker arguments from sprouting up in many more places.

dhirving · 2024-04-01T17:16:35Z

python/lsst/daf/butler/direct_query_driver/_driver.py

+        # complicated to implement anyway.
+        #
+        # We start a transaction rather than just opening a connection both
+        # to establish consistency guarantees throughout the query context and


What consistency guarantees do we get from the transaction? I thought Butler transactions were always READ COMMITTED, which is the same as what we'd get without the explicit transaction.

TallJimbo · 2024-04-04

Yeah, there was some lazy thinking going on here; I had been mostly thinking about consistency of things in caches (which you reminded me I hadn't actually done in another thread) and started conflating that with actual transaction consistency (which as you note here depends on isolation level). I'll drop the first clause of this sentence and just keep the second.

dhirving · 2024-04-01T17:32:48Z

tests/test_query_direct_sqlite.py

+TESTDIR = os.path.abspath(os.path.dirname(__file__))
+
+
+class DirectButlerSQLiteTests(ButlerQueryTests, unittest.TestCase):


Should there be another instance of these tests for Postgres as well?

TallJimbo · 2024-04-04

Yes, I meant to do that before I put it up for review and apparently I forgot.

But having added it now, I see some failures in cursor/temp-table lifetimes. I wasn't able to figure it out quickly, so I've marked the tests that use temp tables as expected failures and will try to fix them on DM-43697.

dhirving · 2024-04-01T18:07:11Z

python/lsst/daf/butler/direct_query_driver/_driver.py

+        self._universe = universe
+        self._defaults = defaults
+        self._materializations: dict[
+            qt.MaterializationKey, tuple[sqlalchemy.Table, frozenset[str], Postprocessing]


Making this tuple a NamedTuple or frozen dataclass would make it self-documenting and less fiddly to interact with.
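A minimal sketch of the frozen-dataclass option, for illustration only. The class name and field names below are invented (the diff only shows the tuple's element types), and `Any` stands in for `sqlalchemy.Table` and `Postprocessing` so the snippet runs without those imports.

```python
from __future__ import annotations

import dataclasses
from typing import Any


@dataclasses.dataclass(frozen=True)
class MaterializationState:
    """Hypothetical replacement for the bare 3-tuple stored per key."""

    # Stand-in for sqlalchemy.Table (assumption: the temporary table).
    table: Any
    # The frozenset[str] element of the tuple; the name is a guess.
    names: frozenset[str]
    # Stand-in for the Postprocessing object.
    postprocessing: Any


# Callers read fields by name instead of unpacking by position.
materializations: dict[str, MaterializationState] = {}
materializations["key"] = MaterializationState(
    table="tmp_table", names=frozenset({"a"}), postprocessing=None
)
print(materializations["key"].names)
```

Freezing the dataclass keeps the immutability the tuple had, so an entry cannot be modified in place after it is stored.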

dhirving · 2024-04-01T18:12:30Z

python/lsst/daf/butler/direct_query_driver/_driver.py

+        assert self._exit_stack is not None
+        while self._cursors:
+            _, cursor = self._cursors.popitem()
+            cursor.close(exc_type, exc_value, traceback)


It might be a good idea to push this cursor closing onto the exit stack instead of directly calling close... that way the rest of cleanup will still occur even if closing a cursor raises an exception.

TallJimbo · 2024-04-04

I tried something like that first, but got stuck being unable to pop a cursor out of the exit stack by name when it could be closed due to normal termination of iteration. But you're right that there's a resource management gap here. Adding a manual try/except here is nontrivial because we do want to re-raise at the end, so I'll punt this to DM-43697.

dhirving · 2024-04-05

I was just thinking of calling exit_stack.push(cursor) right here instead of calling cursor.close()... push() registers __exit__ without calling __enter__.
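A small sketch of the `push()` idea, under stated assumptions: `FakeCursor` and `close_all` are hypothetical stand-ins (the real cursor type is not shown in this thread), and a local `ExitStack` is used here where the driver would use its existing one. `contextlib.ExitStack.push` is the only real API involved.

```python
from contextlib import ExitStack


class FakeCursor:
    """Hypothetical stand-in for a result cursor; only __exit__ matters here."""

    def __init__(self, name, log, fail=False):
        self.name = name
        self.log = log
        self.fail = fail

    def __exit__(self, exc_type, exc_value, traceback):
        self.log.append(self.name)
        if self.fail:
            raise RuntimeError(f"could not close {self.name}")
        return False


def close_all(cursors, log):
    """Close every cursor, letting the remaining cleanup run if one raises."""
    with ExitStack() as stack:
        # Represents whatever other cleanup the stack already holds.
        stack.callback(log.append, "other cleanup")
        while cursors:
            _, cursor = cursors.popitem()
            # push() registers cursor.__exit__ without calling __enter__.
            stack.push(cursor)
    # Leaving the with-block unwinds in reverse order of registration; an
    # exception from one __exit__ is re-raised only after the rest have run.
```

With this shape the last exception raised still propagates to the caller, which matches the "we do want to re-raise at the end" requirement without a manual try/except.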

dhirving · 2024-04-01T22:21:31Z

python/lsst/daf/butler/direct_query_driver/_driver.py

+                    else:
+                        result.append((record, self.managers.datasets.getCollectionSummary(record)))
+
+        recurse(collections)


I think this same operation is already implemented with somewhat fewer queries at self.managers.datasets.fetch_summaries. The keys of the returned dict aren't CollectionRecord but it could easily be changed so it is -- it's operating on CollectionRecord internally.

TallJimbo · 2024-04-04

I'll look into this on DM-43698. We might want to replace both with that recursive CTE at the same time.

dhirving · 2024-04-02T17:53:43Z

python/lsst/daf/butler/direct_query_driver/_query_builder.py

+    Keys are "logical tables" - dimension element names or dataset type names.
+    """
+
+    special: dict[str, sqlalchemy.ColumnElement[Any]] = dataclasses.field(default_factory=dict)


I'm slightly concerned about the way assignments are made to this throughout the code... seems like there is opportunity for one chunk of code to accidentally clobber something added by another chunk.

I didn't spot any cases where that is actually occurring, though.

TallJimbo · 2024-04-04

Yeah, I really struggled with how much to encapsulate QueryBuilder and QueryJoiner state throughout this ticket, and after a couple of flip-flops I made myself stop in order to keep making progress overall. So there's room for improvement, and if you spot a way to improve it sometime while you're working nearby, feel free to give it a try; you may spot a possibility I didn't.

dhirving · 2024-04-05

The only obvious idea I had was a dict that only allows one write to each key by default, so you at least have to explicitly specify clobbering. But yeah, it's not an easy problem to solve.
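One way the write-once dict could look, as a rough sketch rather than a proposal for the actual `special` field. `WriteOnceDict` and its `overwrite` method are hypothetical names; nothing like this exists in the PR.

```python
from collections.abc import Iterator, MutableMapping
from typing import TypeVar

K = TypeVar("K")
V = TypeVar("V")


class WriteOnceDict(MutableMapping[K, V]):
    """Mapping that refuses to reassign a key unless asked explicitly."""

    def __init__(self) -> None:
        self._data: dict[K, V] = {}

    def __getitem__(self, key: K) -> V:
        return self._data[key]

    def __setitem__(self, key: K, value: V) -> None:
        # Plain assignment is allowed only for keys that are not yet present.
        if key in self._data:
            raise ValueError(f"Key {key!r} is already set; use overwrite().")
        self._data[key] = value

    def overwrite(self, key: K, value: V) -> None:
        """Replace a value deliberately, making the clobbering explicit."""
        self._data[key] = value

    def __delitem__(self, key: K) -> None:
        del self._data[key]

    def __iter__(self) -> Iterator[K]:
        return iter(self._data)

    def __len__(self) -> int:
        return len(self._data)
```

Because `MutableMapping.update` is implemented in terms of `__setitem__`, bulk updates get the same guard for free.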

dhirving · 2024-04-02T17:57:41Z

python/lsst/daf/butler/direct_query_driver/_sql_column_visitor.py

+    def expect_timespan(self, expression: qt.ColumnExpression) -> TimespanDatabaseRepresentation:
+        return cast(TimespanDatabaseRepresentation, expression.visit(self))


You might consider adding isinstance asserts to these expect_ functions instead of casting... might save some debugging time if something of an unexpected type creeps in.
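To illustrate the difference, here is a sketch with a hypothetical `Timespan` class standing in for `TimespanDatabaseRepresentation`, and plain functions standing in for the visitor's `expect_` methods. `typing.cast` does nothing at runtime, so a wrong type passes through silently; the assert fails at the point where the wrong value first appears.

```python
from typing import cast


class Timespan:
    """Hypothetical stand-in for TimespanDatabaseRepresentation."""


def expect_timespan_cast(value: object) -> Timespan:
    # cast() only informs the type checker; no check happens at runtime.
    return cast(Timespan, value)


def expect_timespan_checked(value: object) -> Timespan:
    # The assert narrows the type for the checker and verifies it at runtime.
    assert isinstance(value, Timespan), f"Expected Timespan, got {type(value).__name__}."
    return value
```

One caveat: asserts are skipped when Python runs with `-O`, so this is a debugging aid rather than input validation.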

dhirving · 2024-04-02T18:03:24Z

python/lsst/daf/butler/direct_query_driver/_query_builder.py

+    """
+
+    def __post_init__(self) -> None:
+        assert not (self.distinct and self.group_by), "At most one of distinct and group_by can be set."


This check should probably be in select() instead of __post_init__... most of the assignments to distinct and group_by occur after initialization.
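A toy example of why the placement matters, assuming a mutable dataclass whose attributes are assigned after construction. `Builder` and the SQL strings it returns are hypothetical; only the `distinct`/`group_by` names and the assertion message come from the diff.

```python
import dataclasses


@dataclasses.dataclass
class Builder:
    """Hypothetical, much-reduced stand-in for the query builder."""

    distinct: bool = False
    group_by: list[str] = dataclasses.field(default_factory=list)

    def select(self) -> str:
        # Checking here sees the final state, including assignments made
        # after __init__, which a __post_init__ check would never see.
        assert not (self.distinct and self.group_by), (
            "At most one of distinct and group_by can be set."
        )
        sql = "SELECT DISTINCT x FROM t" if self.distinct else "SELECT x FROM t"
        if self.group_by:
            sql += " GROUP BY " + ", ".join(self.group_by)
        return sql


builder = Builder()  # a __post_init__ check would pass trivially here
builder.distinct = True
print(builder.select())
```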

dhirving · 2024-04-02T18:19:18Z

python/lsst/daf/butler/direct_query_driver/_query_builder.py

+    included in raw SQL results.
+    """
+
+    name_shrinker: NameShrinker | None = None


I wonder if it would be easier to add a name_shrinker property to Database instead of worrying about creating them and transferring records between them here.

TallJimbo · 2024-04-04

I'm not sure; I'd need to look for places where we might want the NameShrinker without having the Database it came from. If we don't have any such places (or they're easy to address some other way), this sounds like a good idea. But one for another ticket (I'll put it on DM-43697 for now).

dhirving · 2024-04-04T18:25:54Z

I'd like to start working on the client/server version of this early next week, while it's still fresh in my mind. There's absolutely nothing in these review comments above the level of a petty nitpick. So since you've got a lot on your plate at the moment I think it would be fine to merge this down as-is.

Query already has 'constraint_dimensions' (the full dimensions that can be used for e.g. WHERE expressions), which is different from what 'dimensions' should have meant (the dimensions of the output objects) on the QueryResults classes.

TallJimbo force-pushed the tickets/DM-41158 branch 2 times, most recently from 00946bf to c53102e Compare November 30, 2023 20:02

TallJimbo force-pushed the tickets/DM-41158 branch 6 times, most recently from 67bdb5a to 0e68108 Compare December 11, 2023 21:17

TallJimbo force-pushed the tickets/DM-41158 branch from 0e68108 to 232b56d Compare December 17, 2023 22:30

TallJimbo force-pushed the tickets/DM-41158 branch from a9ca7d0 to cfd5b2d Compare December 27, 2023 21:37

TallJimbo force-pushed the tickets/DM-41158 branch from cfd5b2d to 8f5d995 Compare January 10, 2024 16:21

TallJimbo force-pushed the tickets/DM-41158 branch 6 times, most recently from e090ce4 to 2c640a6 Compare January 31, 2024 19:34

TallJimbo force-pushed the tickets/DM-41158 branch 2 times, most recently from 9b11866 to d47954f Compare March 1, 2024 18:41

TallJimbo force-pushed the tickets/DM-41158 branch from d47954f to 18ba8ac Compare March 4, 2024 21:59

TallJimbo force-pushed the tickets/DM-41158 branch 2 times, most recently from ab12144 to 1e537b4 Compare March 19, 2024 14:52

TallJimbo force-pushed the tickets/DM-41158 branch 2 times, most recently from f894111 to 94f0a3e Compare March 27, 2024 16:57

TallJimbo changed the title ~~DM-41158: Prototype a daf_relation-free relation tree for queries~~ DM-41158: (mostly) implement QueryDriver for DirectButler Mar 27, 2024

TallJimbo force-pushed the tickets/DM-41158 branch 2 times, most recently from 716f283 to f629cea Compare March 28, 2024 01:29

TallJimbo marked this pull request as ready for review March 28, 2024 02:08

TallJimbo force-pushed the tickets/DM-41158 branch from cc2dc67 to c3b43a4 Compare March 28, 2024 03:17

TallJimbo force-pushed the tickets/DM-41158 branch from c3b43a4 to c1778cd Compare March 28, 2024 13:49

dhirving approved these changes Apr 2, 2024

TallJimbo added 10 commits April 4, 2024 16:57

Move NameShrinker out of registry and add container methods.

61f438d

Accept NameShrinker in ColumnSpec.to_sql_spec.

6ebc934

Guard against long field names in dimensions manager.

2504bcf

It is very unlikely that we'll ever trip this guard.

Default QueryTree argument to Query constructor.

ece2431

This reduces the need to import more stuff from the queries subpackage in butlers.

Fix truncated docstring in queries.tree.ColumnSet.

3c04617

Make queries.ExpressionFactory friendlier to type checkers.

53092e0

Previous annotations rejected a lot of typical usage that would almost always be fine (because most DimensionElements are Dimensions and are hence usable directly as values).

Fix doc typo that inverted meaning in Database interface.

08639c7

Guard against skypix regions in result specs.

c73556e

SkyPix regions should always be computed on the fly in the client.

Export InvalidQueryError at package scope.

0015626

TallJimbo force-pushed the tickets/DM-41158 branch from 8305fc9 to d5b742e Compare April 4, 2024 20:58

TallJimbo and others added 13 commits April 5, 2024 10:51

Implement QueryDriver for DirectButler.

2546a02

Implement Butler._query.

b76888c

Fix typo in comment.

722ae4c

Don't lift 'queries' symbols into package scope (yet).

1c3729e

They're not public, and some of them clash with Registry symbols.

Fix incorrect query-result type imports.

b29f78a

This test wanted the result objects from lsst.daf.butler.registry.queries for use in annotations, but it was getting them from lsst.daf.butler.queries and we didn't notice because we don't run MyPy on tests.

Reduce duplication in query result processing.

2ca5451

Co-authored-by: David H. Irving <david.irving@noirlab.edu>

Open caching context in query context.

3852316

Fix incorrect code comment about query transactions.

b8d4454

Use dataclass for DirectQueryDriver materialization state.

f045ba3

Drop redundant cursor close.

98fe0c7

Defer guards on QueryBuilder distinct vs. group_by to make them useful.

e4b6b13

Assert on rather than cast SqlColumnVisitor types.

f5a5b56

Add test for DirectQueryDriver with PostgreSQL.

8a03dd5

This has some failures that I've marked as expected until we try to fix them again on DM-43697.

TallJimbo force-pushed the tickets/DM-41158 branch from d5b742e to 8a03dd5 Compare April 5, 2024 14:52

TallJimbo merged commit d56bc62 into main Apr 5, 2024
16 of 18 checks passed

TallJimbo deleted the tickets/DM-41158 branch April 5, 2024 16:22
