Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-41158: (mostly) implement QueryDriver for DirectButler #915

Merged
merged 23 commits into from Apr 5, 2024

Conversation

TallJimbo
Copy link
Member

@TallJimbo TallJimbo commented Nov 30, 2023

I've inspected all of the missed-coverage lines and they fall into two categories:

  • they need DatasetRef or DataCoordinate result support to be tested (which I'm deferring to another ticket);
  • they're really trivial error cases that I don't really think are worth the effort.

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes (N/A)

@TallJimbo TallJimbo force-pushed the tickets/DM-41158 branch 2 times, most recently from 00946bf to c53102e Compare November 30, 2023 20:02
Copy link

codecov bot commented Nov 30, 2023

Codecov Report

Attention: Patch coverage is 87.63736% with 180 lines in your changes are missing coverage. Please review.

Project coverage is 88.95%. Comparing base (c0af174) to head (8a03dd5).

Files Patch % Lines
...hon/lsst/daf/butler/direct_query_driver/_driver.py 83.49% 42 Missing and 26 partials ⚠️
.../butler/registry/datasets/byDimensions/_storage.py 47.05% 24 Missing and 12 partials ⚠️
...t/daf/butler/direct_query_driver/_query_builder.py 87.42% 11 Missing and 10 partials ⚠️
...thon/lsst/daf/butler/registry/dimensions/static.py 84.52% 7 Missing and 6 partials ⚠️
...thon/lsst/daf/butler/queries/expression_factory.py 87.17% 9 Missing and 1 partial ⚠️
.../daf/butler/direct_query_driver/_postprocessing.py 88.05% 3 Missing and 5 partials ⚠️
...lsst/daf/butler/direct_query_driver/_query_plan.py 94.44% 5 Missing and 2 partials ⚠️
.../butler/direct_query_driver/_sql_column_visitor.py 96.74% 2 Missing and 2 partials ⚠️
python/lsst/daf/butler/tests/butler_queries.py 97.89% 2 Missing and 2 partials ⚠️
python/lsst/daf/butler/name_shrinker.py 62.50% 3 Missing ⚠️
... and 4 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #915      +/-   ##
==========================================
- Coverage   88.97%   88.95%   -0.03%     
==========================================
  Files         329      338       +9     
  Lines       42631    44014    +1383     
  Branches     8743     9071     +328     
==========================================
+ Hits        37933    39151    +1218     
- Misses       3445     3549     +104     
- Partials     1253     1314      +61     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@TallJimbo TallJimbo force-pushed the tickets/DM-41158 branch 6 times, most recently from 67bdb5a to 0e68108 Compare December 11, 2023 21:17
@TallJimbo TallJimbo force-pushed the tickets/DM-41158 branch 6 times, most recently from e090ce4 to 2c640a6 Compare January 31, 2024 19:34
@TallJimbo TallJimbo force-pushed the tickets/DM-41158 branch 2 times, most recently from 9b11866 to d47954f Compare March 1, 2024 18:41
@TallJimbo TallJimbo force-pushed the tickets/DM-41158 branch 2 times, most recently from ab12144 to 1e537b4 Compare March 19, 2024 14:52
@TallJimbo TallJimbo force-pushed the tickets/DM-41158 branch 2 times, most recently from f894111 to 94f0a3e Compare March 27, 2024 16:57
@TallJimbo TallJimbo changed the title DM-41158: Prototype a daf_relation-free relation tree for queries DM-41158: (mostly) implement QueryDriver for DirectButler Mar 27, 2024
@TallJimbo TallJimbo force-pushed the tickets/DM-41158 branch 2 times, most recently from 716f283 to f629cea Compare March 28, 2024 01:29
@TallJimbo TallJimbo marked this pull request as ready for review March 28, 2024 02:08
Copy link
Contributor

@dhirving dhirving left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, just a bunch of nitpicks you can take or leave. (Like last time GitHub is hiding some of the comments because there are too many.)

I really appreciated your extensive commenting in the hairier parts of query construction... I'm not going to claim that I fully understand all of the details but definitely closer than I was before.

Looks like it will be pretty easy to wrap the client/server version around this.

@@ -79,11 +80,14 @@ class _BaseColumnSpec(pydantic.BaseModel, ABC):
description="Whether the column may be ``NULL``.",
)

def to_sql_spec(self, **kwargs: Any) -> ddl.FieldSpec:
def to_sql_spec(self, name_shrinker: NameShrinker | None = None, **kwargs: Any) -> ddl.FieldSpec:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this nameshrinker argument be non-optional? I would think that if it is ever needed, it will always be needed... otherwise different parts of the code can think names are different. Just creates the potential for certain columns to work with some types of queries/operations but not others.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to do that at a higher level, because the set of names that need shrinking is limited to only those constructed from dataset type names, and that keeps name-shrinker arguments from sprouting up in many more places.

# complicated to implement anyway.
#
# We start a transaction rather than just opening a connection both
# to establish consistency guarantees throughout the query context and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What consistency guarantees do we get from the transaction? I thought Butler transactions were always READ COMMITTED, which is the same as what we'd get without the explicit transaction.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, there was some lazy thinking going on here; I had been mostly thinking about consistency of things in caches (which you reminded me I hadn't actually done in another thread) and started conflating that with actual transaction consistency (which as you note here depends on isolation level). I'll drop the first clause of this sentence and just keep the second.

TESTDIR = os.path.abspath(os.path.dirname(__file__))


class DirectButlerSQLiteTests(ButlerQueryTests, unittest.TestCase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be another instance of these tests for Postgres as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I meant to do that before I put it up for review and apparently I forgot.

But having added it now, I see some failures in cursor/temp-table lifetimes. I wasn't able to figure it quickly, so I've marked the tests that use temp tables as expected failures and will try to fix them on DM-43697.

self._universe = universe
self._defaults = defaults
self._materializations: dict[
qt.MaterializationKey, tuple[sqlalchemy.Table, frozenset[str], Postprocessing]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making this tuple a NamedTuple or frozen dataclass would make it self-documenting and less fiddly to interact with.

assert self._exit_stack is not None
while self._cursors:
_, cursor = self._cursors.popitem()
cursor.close(exc_type, exc_value, traceback)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be a good idea to push this cursor closing onto the exit stack instead of directly calling close... that way the rest of cleanup will still occur even if closing a cursor raises an exception.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried something like that first, but got stuck being unable to pop a cursor out of the exit stack by name when it could be closed due to normal termination of iteration. But you're right that there's a resource management gap here. Adding a manual try/except here is nontrivial because we do want to re-raise at the end, so I'll punt this to DM-43697.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just thinking call exit_stack.push(cursor) right here instead of calling cursor.close()... push() calls __exit__ without calling __enter__.

else:
result.append((record, self.managers.datasets.getCollectionSummary(record)))

recurse(collections)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this same operation is already implemented with somewhat fewer queries at self.managers.datasets.fetch_summaries. The keys of the returned dict aren't CollectionRecord but it could easily be changed so it is -- it's operating on CollectionRecord internally.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll look into this on DM-43698. We might want to replace both with that recursive CTE at the same time.

Keys are "logical tables" - dimension element names or dataset type names.
"""

special: dict[str, sqlalchemy.ColumnElement[Any]] = dataclasses.field(default_factory=dict)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm slightly concerned about the way assignments are made to this throughout the code... seems like there is opportunity for one chunk of code to accidentally clobber something added by another chunk.

I didn't spot any cases where that is actually occurring, though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I really struggled with how much to encapsulate QueryBuilder and QueryJoiner state throughout this ticket, and after a couple of flip-flops I made myself stop in order to keep making progress overall. So there's room for improvement, and if you spot a way to improve it sometime while you're working nearby, feel free to give it a try; you may spot a possibility I didn't.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only obvious idea I had was a dict that only allows one write to each key by default, so you at least have to explicitly specify clobbering. But yeah it's not an easy problem to solve.

Comment on lines 262 to 263
def expect_timespan(self, expression: qt.ColumnExpression) -> TimespanDatabaseRepresentation:
return cast(TimespanDatabaseRepresentation, expression.visit(self))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might consider adding instanceof asserts to these expect_ functions instead of casting... might save some debugging time if something of an unexpected type creeps in.

"""

def __post_init__(self) -> None:
assert not (self.distinct and self.group_by), "At most one of distinct and group_by can be set."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check should probably be in select() instead of post_init... most of the assignments to distinct and group_by occur after initialization.

included in raw SQL results.
"""

name_shrinker: NameShrinker | None = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be easier to add a name_shrinker property to Database instead of worrying about creating them and transferring records between them here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure; I'd need to look for places where we might want the NameShrinker without having the Database it came from. If we don't have any such places (or they're easy to address some other way), this sounds like a good idea. But one for another ticket (I'll put it on DM-43697 for now).

@dhirving
Copy link
Contributor

dhirving commented Apr 4, 2024

I'd like to start working on the client/server version of this early next week, while it's still fresh in my mind. There's absolutely nothing in these review comments above the level of a petty nitpick. So since you've got a lot on your plate at the moment I think it would be fine to merge this down as-is.

It is very unlikely that we'll ever trip this guard.
This reduces the need to import more stuff from the queries subpackage
in butlers.
Query already has 'constraint_dimensions' (the full dimensions that
can be used for e.g. WHERE expressions), which is different from
what 'dimensions' should have meant (the dimensions of the output
objects) on the QueryResults classes.
Previous annotations rejected a lot of typical usage that would almost
always be fine (because most DimensionElements are Dimensions and are
hence usable directly as values).
SkyPix regions should always be computed on the fly in the client.
TallJimbo and others added 13 commits April 5, 2024 10:51
They're not public, and some of them clash with Registry symbols.
This test wanted the result objects from

   lsst.daf.butler.registry.queries

for use in annotations, but it was getting them from

   lsst.daf.butler.queries

and we didn't notice because we don't run MyPy on tests.
Co-authored-by: David H. Irving <david.irving@noirlab.edu>
This has some failures that I've marked as expected until we try to fix
them again on DM-43697.
@TallJimbo TallJimbo merged commit d56bc62 into main Apr 5, 2024
16 of 18 checks passed
@TallJimbo TallJimbo deleted the tickets/DM-41158 branch April 5, 2024 16:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants