
DM-42740: rewrite new query system interfaces and division of responsibilities #966

Merged: 7 commits merged into main from tickets/DM-42740 on Mar 4, 2024

Conversation

@TallJimbo TallJimbo (Member) commented Feb 26, 2024

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes (not yet user-visible)

@TallJimbo TallJimbo force-pushed the tickets/DM-42740 branch 2 times, most recently from 5d3bae1 to 8e48a44, on February 26, 2024 16:30

codecov bot commented Feb 26, 2024

Codecov Report

Attention: Patch coverage is 96.13527%, with 120 lines in your changes missing coverage. Please review.

Project coverage is 88.88%. Comparing base (6308b79) to head (cdc963a).

Files Patch % Lines
python/lsst/daf/butler/_butler.py 9.67% 28 Missing ⚠️
python/lsst/daf/butler/queries/overlaps.py 79.16% 16 Missing and 9 partials ⚠️
python/lsst/daf/butler/queries/visitors.py 83.90% 11 Missing and 3 partials ⚠️
tests/test_query_interface.py 98.57% 6 Missing and 8 partials ⚠️
python/lsst/daf/butler/queries/driver.py 85.88% 2 Missing and 10 partials ⚠️
.../lsst/daf/butler/queries/_dataset_query_results.py 91.13% 7 Missing ⚠️
python/lsst/daf/butler/dimensions/_record_set.py 16.66% 5 Missing ⚠️
python/lsst/daf/butler/queries/convert_args.py 94.73% 1 Missing and 4 partials ⚠️
python/lsst/daf/butler/arrow_utils.py 75.00% 2 Missing ⚠️
python/lsst/daf/butler/queries/result_specs.py 98.41% 1 Missing and 1 partial ⚠️
... and 5 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #966      +/-   ##
==========================================
+ Coverage   88.53%   88.88%   +0.35%     
==========================================
  Files         313      329      +16     
  Lines       40184    42374    +2190     
  Branches     8407     8741     +334     
==========================================
+ Hits        35575    37663    +2088     
- Misses       3390     3457      +67     
- Partials     1219     1254      +35     


@dhirving dhirving (Contributor) left a comment

I think this should work pretty well. It pushes a bit more complexity to the client than I would prefer -- mostly in join_dataset_search where it forces the client to do work deciding what dimensions will need to be involved in the query instead of just recording what dataset types the client asked for and sending it to the server. However, since the dimensions are fairly high-level logical things inherent to the way Butler works I suspect this won't cause many backwards-compatibility issues. The serialized data sent in QueryTree and the ResultSpec classes is quite straightforward and simple, so that's a win.

I don't see anything here that would be harmful to merge in its current form. To the extent that you decide to make changes in response to these comments it can probably happen in a later (and hopefully much smaller!) PR. (Also note there are 26 comments here -- sometimes GitHub likes to hide some when it goes over 10.)

Comment on lines +103 to +108
storage_class_name: str | None
"""Name of the storage class to use when returning `DatasetRef` results.

May be `None` if the dataset is only used as a constraint or to return
columns that do not include a full dataset type.
"""
dhirving (Contributor):

Just FYI, the strategy I've been using so far on StorageClass is that the client never sends the server anything about a StorageClass -- the server sends back a ref with the default storage class for the dataset type and the client overrides it if required. This theoretically lets us support clients loading data using custom/experimental storage classes not known to the server.

I don't think it's harmful to have this get sent to the server and ignored, though at first glance I would have expected this bit of data to be attached to the SingleTypeDatasetQueryResults object and not in this part of the QueryTree... since the dataset types provided in the call to datasets() define the output from the query, not any intermediate joins.

TallJimbo (Member Author):

I struggled with this a lot myself. I added it because:

  • if Query.join_dataset_search can take a DatasetType and the user wants dataset results, it'll be confusing if they passed in a storage class override initially and then we forgot it just because we didn't have a place to put it;
  • I wanted Query.join_dataset_search to be able to take DatasetType as well as str, both so Query.datasets could delegate to it and so I didn't have to document why it was unusual in that respect.

I think the real problem is that DatasetType has a storage class, and that's a problem in lots of other places, but of course it's tremendously useful in probably even more other places and having a DatasetTypeWithoutStorageClass public class is its own problem. And that's all thoroughly water under the bridge.

In any case, yes, I think the server can ignore this (or just blindly copy it into Parquet files, so the client doesn't have to edit them); the important thing is that the client doesn't forget it, and I think it's slightly better (even if that means sending it to the server) than out on its own in Query (and it would be Query, not SingleTypeDatasetResults, to serve its current purpose).

dhirving (Contributor):

> if Query.join_dataset_search can take a DatasetType and the user wants dataset results, it'll be confusing if they passed in a storage class override initially and then we forgot it just because we didn't have a place to put it

Don't they already have to pass the dataset type again when they call datasets() to declare that they want refs back for this dataset type? To me it makes sense that whatever you pass into datasets is the authoritative thing, since that's where you're declaring the output format.

> not SingleTypeDatasetResults, to serve its current purpose).

It's possible I'm misunderstanding what is going on with this StorageClass (or maybe all the effects that a StorageClass can have?) To me it looks like you could just pass in the StorageClass when you create the SingleTypeDatasetQueryResults, and call ref.overrideStorageClass if needed just before yielding from __iter__. If this doesn't work I'd like to know because I'm using a similar pattern a few places in RemoteButler.
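
A sketch of that pattern (the constructor signature and the driver call are hypothetical; DatasetRef.overrideStorageClass is the existing method):

from collections.abc import Iterator

from lsst.daf.butler import DatasetRef, StorageClass


class SingleTypeDatasetQueryResults:
    def __init__(self, driver, spec, storage_class: StorageClass | None = None):
        self._driver = driver
        self._spec = spec
        self._storage_class = storage_class

    def __iter__(self) -> Iterator[DatasetRef]:
        for ref in self._driver.execute(self._spec):
            if self._storage_class is not None:
                # Apply the client-side override just before handing the
                # ref to the caller; the server never needs to see it.
                ref = ref.overrideStorageClass(self._storage_class)
            yield ref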

TallJimbo (Member Author):

In almost all cases, that's right - it makes more sense to pass the storage class in later. It's really quite an edge case I'm worried about here:

  • user calls Query.join_dataset_search with a DatasetType instance that has a storage class that is not the same as the default one for that dataset type in the repository;
  • user calls Query.datasets with a str dataset type name that has no storage class information. Do we respect their previous storage class override?

Maybe it would be better to raise an exception or emit a warning if the DatasetType passed to join_dataset_search has a storage class that isn't the same as the repository one (since we need to look up the dataset type definition in the repository at that point anyway, to get the dimensions), saying that the storage class passed in there will be ignored.
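
That check might look something like this (all names here are illustrative, not the final API):

from lsst.daf.butler import DatasetType


def join_dataset_search(self, dataset_type: str | DatasetType, collections=None):
    if isinstance(dataset_type, DatasetType):
        # We have to look up the repository definition anyway to get the
        # dimensions, so reject storage class overrides at the same time.
        repo_definition = self._driver.get_dataset_type(dataset_type.name)
        if dataset_type.storageClass_name != repo_definition.storageClass_name:
            raise ValueError(
                f"Storage class override {dataset_type.storageClass_name!r} for "
                f"{dataset_type.name!r} would be ignored in a query join; "
                "pass it to Query.datasets() instead."
            )
    ...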

TallJimbo (Member Author):

Done on DM-43146: DatasetSearch no longer has storage_class_name. Instead join_dataset_search raises if you give it a storage class override.

Comment on lines +53 to +79
class DatasetQueryResults(CountableQueryBase, Iterable[DatasetRef]):
"""A query for `DatasetRef` results."""

@abstractmethod
def by_dataset_type(self) -> Iterator[SingleTypeDatasetQueryResults]:
"""Group results by dataset type.

Returns
-------
iter : `~collections.abc.Iterator` [ `SingleTypeDatasetQueryResults` ]
An iterator over `DatasetQueryResults` instances that are each
responsible for a single dataset type.
"""
raise NotImplementedError()

@property
@abstractmethod
def has_dimension_records(self) -> bool:
"""Whether all data IDs in this iterable contain dimension records."""
raise NotImplementedError()

@abstractmethod
def with_dimension_records(self) -> DatasetQueryResults:
"""Return a results object for which `has_dimension_records` is
`True`.
"""
raise NotImplementedError()
dhirving (Contributor):

I'm not sure this DatasetQueryResults base class is necessary. If Query.datasets always returned a ChainedDatasetQueryResults (potentially of length 1, and presumably renamed to DatasetQueryResults; see the sketch after this list) then:

  • The DatasetQueryResults base class could be removed, leaving the reader with less of a multiple-inheritance tangle to puzzle out (there's at least one diamond in the current setup.)
  • There would be fewer code paths to go down -- users and maintainers wouldn't have to worry about potentially varying behavior between the 'direct' case and the 'chained' case. All test coverage would be covering the same code paths.
  • by_dataset_type could be removed from SingleTypeDatasetQueryResults (and possibly other methods with a bit of restructuring that becomes possible when the interfaces of SingleTypeDatasetQueryResults and ChainedDatasetQueryResults don't have to be identical.)
  • SingleTypeDatasetQueryResults could potentially become private to this file (if all available methods were forwarded from ChainedDatasetQueryResults, which is probably desirable for consistency anyway.)
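
A rough sketch of that flattening (hypothetical and simplified; the real classes carry more state than this):

from collections.abc import Iterator

from lsst.daf.butler import DatasetRef


class DatasetQueryResults:
    """What ChainedDatasetQueryResults might become after the rename."""

    def __init__(self, by_dataset_type: list) -> None:
        self._by_dataset_type = by_dataset_type

    def __iter__(self) -> Iterator[DatasetRef]:
        # Flatten the per-type results into a single stream of refs.
        for single_type_results in self._by_dataset_type:
            yield from single_type_results

    def by_dataset_type(self) -> Iterator:
        # Now the only place this method needs to exist.
        return iter(self._by_dataset_type)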

TallJimbo (Member Author):

I consider the single-type results to be the important thing and the chain to be secondary, because I expect most queries to be for a specific dataset type. But I very much like the idea in one of your other comments of just dropping the support for querying more than one dataset type at a time in favor of the user writing their own loop over dataset types (let's discuss that further on that other thread).

TallJimbo (Member Author):

Done on DM-43146.

python/lsst/daf/butler/queries/tree/_column_literal.py (comment thread resolved; outdated)

expression_type: Literal["float"] = "float"

value: float
dhirving (Contributor):

JSON encoders have a long history of doing whatever they feel like with floating point precision. I'm not sure what Pydantic/Python's versions do... have you confirmed that they guarantee they will round-trip values with sufficient precision?

TallJimbo (Member Author):

I have not. Thankfully float fields are pretty rare in our schema and I can't think of anything that would require high precision, but it would be nice to know if we're losing anything here.
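
For what it's worth, a quick empirical check of the Python side could look like this (a sketch assuming Pydantic v2; it says nothing about non-Python consumers of the JSON):

import json
import random
from typing import Literal

import pydantic


class FloatLiteral(pydantic.BaseModel):
    expression_type: Literal["float"] = "float"
    value: float


# Python's json module serializes floats with repr(), which since
# Python 3.1 emits the shortest string that parses back to exactly the
# same double, so finite values should round-trip losslessly.
random.seed(0)
for _ in range(10_000):
    x = random.uniform(-1e300, 1e300)
    assert json.loads(json.dumps(x)) == x
    serialized = FloatLiteral(value=x).model_dump_json()
    assert FloatLiteral.model_validate_json(serialized).value == x
# Caveat: NaN and +/-inf are not valid strict JSON and need separate handling.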

Comment on lines +96 to +103
def precedence(self) -> int:
"""Operator precedence for this operation.

Lower values bind more tightly, so parentheses are needed when printing
an expression where an operand has a higher value than the expression
itself.
"""
raise NotImplementedError()
dhirving (Contributor):

I'm a little surprised to see precedence here... I would have expected that to be resolved at the higher level where we parse the input expressions, so that we just have a tree of expressions that could be evaluated directly here without worrying about ordering.

Is this only used for debug stringifications? If it were me, I would just always put in the parentheses... less complexity, and it makes the tree structure explicit (which is often what you're trying to debug if you're looking at these?)

TallJimbo (Member Author):

It is just about stringification and it might now be overkill. I added it back when Predicate was a flexible tree of binary AND and OR operators, and A AND (B AND (C AND (D AND E))) happened a lot and did get in the way of readability. But now that Predicate always maintains conjunctive normal form (ANDs-of-ORs-[of NOT]) we don't need it to solve that particular problem. Let me see how it goes if I rip it out.

TallJimbo (Member Author):

Done (on DM-43146): we now just add parentheses for any kind of nested expression (but do try to avoid them on literals, direct references to columns, and predicates).
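
As a toy illustration of that rule (these node types are stand-ins, not the actual butler expression classes):

from dataclasses import dataclass


@dataclass
class Literal:
    value: object


@dataclass
class Column:
    name: str


@dataclass
class BinaryOp:
    operator: str
    lhs: object
    rhs: object


def stringify(node) -> str:
    """Parenthesize every nested operation; leave leaf nodes bare."""
    match node:
        case Literal(value=value):
            return repr(value)
        case Column(name=name):
            return name
        case BinaryOp(operator=op, lhs=lhs, rhs=rhs):
            return f"({stringify(lhs)} {op} {stringify(rhs)})"
    raise TypeError(f"unexpected node: {node!r}")


# stringify(BinaryOp("+", Column("a"), BinaryOp("*", Column("b"), Literal(2))))
# yields '(a + (b * 2))'.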

Comment on lines +339 to +341
# TODO: string length matches the one defined in the
# CollectionManager implementations; we need to find a way to
# avoid hard-coding the value in multiple places.
dhirving (Contributor):

Why not just declare a constant and reference it in both places?
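
For example (the constant's name, home, and value here are all placeholders):

import pydantic
import sqlalchemy

COLLECTION_NAME_MAX_LENGTH = 64  # placeholder value


class CollectionName(pydantic.BaseModel):
    # The query-tree model validates against the shared constant ...
    name: str = pydantic.Field(max_length=COLLECTION_NAME_MAX_LENGTH)


# ... and the table definition uses the same constant, so the two can
# never silently drift apart.
metadata = sqlalchemy.MetaData()
collection_table = sqlalchemy.Table(
    "collection",
    metadata,
    sqlalchemy.Column("name", sqlalchemy.String(COLLECTION_NAME_MAX_LENGTH)),
)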

TallJimbo (Member Author):

I think I just forgot to come back to this and do it later on the branch; will do. I think I was just hung up on where to put the constant.

TallJimbo (Member Author):

Done on DM-43146, and a good thing, too - I had actually gotten the number wrong here before.

@TallJimbo TallJimbo (Member Author) left a comment

Here's a round of replies about changes I now plan to make but haven't made yet.

I'm not really accustomed to leaving even small review-response changes to another ticket, but this is such a behemoth PR that I can imagine that working better for the reviewer, so I'll leave that up to you. I'll start working on those in strictly separate commits, and I can put them on some other branch or push them here later.

I do think I want to make a number of those changes before getting back to implementing the driver, because I think they may actually make that easier, or at least differently hard.

@@ -101,48 +101,59 @@ Python API reference

.. automodapi:: lsst.daf.butler
:no-main-docstr:
:no-inherited-members:
TallJimbo (Member Author):

I can try to remove it and see if it's just warnings we get, but I know it's at least possible to get hard errors in the doc-build from the Sphinx bug this works around, and if that happens I'll have to put it back in at least a few places. From the slack conversation that led to this commit it sounds like if we can finally upgrade our doc tooling to a newer Sphinx it may go away.

python/lsst/daf/butler/queries/_base.py (comment thread resolved)
self._tree, order_by=convert_order_by_args(self.dimensions, self._get_datasets(), *args)
)

def limit(self, limit: int | None = None, offset: int = 0) -> Self:
TallJimbo (Member Author):

I don't know of any direct use cases, and I'd be happy to remove it. @timj , @andy-slac , do you have a reason we need offset support, or was it added originally just because it felt reasonable when we added limit?

Comment on lines +184 to +186
def by_dataset_type(self) -> Iterator[SingleTypeDatasetQueryResults]:
# Docstring inherited.
return iter(self._by_dataset_type)
TallJimbo (Member Author):

👍 to dropping ChainedDatasetQueryResults. That will simplify the mess of base classes a lot, and I do think the long term plan is to run those queries separately.

We would then need a separate method or methods for some important summary queries, like "which dataset types exist at all in these collections". I had originally planned to add those in the future as convenience methods on Query that would delegate to ChainedDatasetQueryResults.by_dataset_type() and then call any on each (and hence not require any new QueryDriver methods). But having a driver method dedicated to those summaries is not really a burden, and it could save us from having to bring the CollectionSummary objects down to the client at all (I'll have to work through that possibility to be sure) since that was largely about optimizing those "dataset types in collections" summary queries. (The other reason to bring them down to the client is that caching them there may be easier than caching them on the server, but that's also very much a "maybe".)
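
The convenience-method version of that summary query might look like this (a sketch; the per-type any() and dataset_type accessors are assumed here, not confirmed API):

def existing_dataset_types(query, dataset_type_names, collections):
    """Return the names of dataset types that have at least one dataset
    in the given collections, via a client-side loop over per-type
    results (a dedicated driver summary method could replace this).
    """
    results = query.datasets(dataset_type_names, collections=collections)
    return [
        single.dataset_type.name
        for single in results.by_dataset_type()
        if single.any()
    ]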

I'll start with deleting ChainedDatasetQueryResults and whatever base classes that renders useless on this branch before I merge. If that goes well I'll stop there and put summarization methods on another ticket, but I think I might want to prototype those out before I dive back into implementing the driver, to make sure I don't want to delete or change anything in the lower levels that I should have kept.


python/lsst/daf/butler/queries/visitors.py (comment thread resolved)
@dhirving dhirving (Contributor) commented Mar 1, 2024

> I'm not really accustomed to leaving even small review-response changes to another ticket, but this is such a behemoth PR that I can imagine that working better for the reviewer - so I'll leave that up to you.

🤷 Probably makes sense to clean up some of the little stuff like the wrong comments before merging, but to me it makes sense to get this version merged down before removing the ChainedDatasetQueryResults or any changes like that.

For a new feature like this where the impact on users is basically nil until we make it public, there's basically no downside to merging early and merging often :) (Other than that our development workflow is kind of annoying with the need to generate a new JIRA ticket to make a PR.)

Benefits to you of having this code merged:

  • You have a record of how the design has evolved as you tweak it, and you can more easily backtrack a change or resurrect a useful utility function because there is a linear history on a single branch to look through to find it.
  • You don't need to rebase it anymore, and any new changes people make will take into account the code you've written here.
  • You reduce the risk of subtle breakage from rebasing, because any rebasing will happen in smaller chunks.
  • Having the rough framework merged makes it easier to delegate parts to other people.

Benefits to the rest of the team of having this code merged:

  • We can make changes to it (e.g. as I move the exception hierarchy around this week I can see which ones you've used here and make decisions accordingly.)
  • It's easier to learn and think about it, because it's visible in the same branches we're working on.
  • We don't have to wonder which of your side branches we should be looking at, because we can just see the progress on main.

This keeps our doc build from trying to parse upstream docstrings that
may not be numpydoc.  I'm sure we don't have to use it in all of our
modules, but I don't have much of an opinion about whether including
inherited members is generally useful, so consistency wins.
This was a very minor issue; running mypy against daf_butler without
daf_relation setup will now cause mypy to complain.  But of course it's
rare for us to do that because more than mypy breaks.
I'm prioritizing how this appears when built-in connections are
printed (which always calls repr) over the "print like you'd construct
it" guideline.
@TallJimbo TallJimbo (Member Author) commented Mar 4, 2024

Ok, I've fixed a few of the doc/comment issues and I'm punting the rest to DM-43146.

This includes:

- replacing the Query and *QueryResults ABCs with concrete classes
  that delegate to another QueryDriver ABC;

- substantially reworking the public Query and *QueryResults
  interfaces, mostly to minimize the number of different ways to do
  various things (and hence limit complexity);

- adding a large suite of Pydantic models that can describe complex
  under-construction queries, allowing us to send them over the wire
  in RemoteButler.

Because QueryDriver doesn't have any concrete implementations yet,
this change means Butler._query no longer works at all (previously it
delegated to the old registry.queries system).  A QueryDriver
implementation for DirectButler has been largely implemented on
another branch and will be added later.

For now, the only tests are those that rely on a mocked QueryDriver
(or don't require one at all).  These are in two files:

- test_query_interfaces.py tests the public interface objects,
  including the semi-public Pydantic models;

- test_query_utilities.py tests some utility classes (ColumnSet and
  OverlapsVisitor) that are expected to be used by all driver
  implementations to establish some behavioral invariants.

There is already substantial duplication with code in
lsst.daf.butler.registry.queries, and that will get worse when a
direct-SQL driver class is added.  Eventually the plan is to retire
almost all of lsst.daf.butler.registry.queries (except the
string-expression parser, which we'll move later) making the public
registry query interfaces delegate to lsst.daf.butler.queries instead,
but that will require both getting the latter fully functional and
RFC'ing the removal of some things we have no intention of doing in
the new system.

Fix outdated docs for Butler._query_datasets.

Fix docstrings on column literal value attributes.

DO NOT MERGE; SQUASH.

Make implementation-centric docstring into a comment.

DO NOT MERGE; SQUASH.
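
The delegation structure described in the main commit message above, as a skeletal sketch (the class names follow the commit message; everything else is illustrative):

from abc import ABC, abstractmethod


class QueryDriver(ABC):
    """Interface each butler implementation provides; the concrete
    Query and *QueryResults classes hold no backend logic themselves.
    """

    @abstractmethod
    def execute(self, result_spec, tree):
        """Run a query described by a serializable tree and result spec."""
        raise NotImplementedError()


class Query:
    """Concrete public class: methods build up a Pydantic QueryTree and
    defer all execution to the driver, so the same class can serve both
    DirectButler and RemoteButler.
    """

    def __init__(self, driver: QueryDriver, tree):
        self._driver = driver
        self._tree = tree

    def datasets(self, dataset_type, collections):
        result_spec = ...  # build a Pydantic result-spec model here
        return self._driver.execute(result_spec, self._tree)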
@TallJimbo TallJimbo merged commit c40c3fe into main Mar 4, 2024
18 checks passed
@TallJimbo TallJimbo deleted the tickets/DM-42740 branch March 4, 2024 17:25