
DM-42740: rewrite new query system interfaces and division of responsibilities #966

Merged: 7 commits merged into main from tickets/DM-42740 on Mar 4, 2024

Conversation

@TallJimbo TallJimbo (Member) commented Feb 26, 2024

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes (not yet user-visible)

@TallJimbo TallJimbo force-pushed the tickets/DM-42740 branch 2 times, most recently from 5d3bae1 to 8e48a44, on February 26, 2024 16:30

codecov bot commented Feb 26, 2024

Codecov Report

Attention: Patch coverage is 96.13527%, with 120 lines in your changes missing coverage. Please review.

Project coverage is 88.88%. Comparing base (6308b79) to head (cdc963a).

Files Patch % Lines
python/lsst/daf/butler/_butler.py 9.67% 28 Missing ⚠️
python/lsst/daf/butler/queries/overlaps.py 79.16% 16 Missing and 9 partials ⚠️
python/lsst/daf/butler/queries/visitors.py 83.90% 11 Missing and 3 partials ⚠️
tests/test_query_interface.py 98.57% 6 Missing and 8 partials ⚠️
python/lsst/daf/butler/queries/driver.py 85.88% 2 Missing and 10 partials ⚠️
.../lsst/daf/butler/queries/_dataset_query_results.py 91.13% 7 Missing ⚠️
python/lsst/daf/butler/dimensions/_record_set.py 16.66% 5 Missing ⚠️
python/lsst/daf/butler/queries/convert_args.py 94.73% 1 Missing and 4 partials ⚠️
python/lsst/daf/butler/arrow_utils.py 75.00% 2 Missing ⚠️
python/lsst/daf/butler/queries/result_specs.py 98.41% 1 Missing and 1 partial ⚠️
... and 5 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #966      +/-   ##
==========================================
+ Coverage   88.53%   88.88%   +0.35%     
==========================================
  Files         313      329      +16     
  Lines       40184    42374    +2190     
  Branches     8407     8741     +334     
==========================================
+ Hits        35575    37663    +2088     
- Misses       3390     3457      +67     
- Partials     1219     1254      +35     


@dhirving dhirving (Contributor) left a comment

I think this should work pretty well. It pushes a bit more complexity to the client than I would prefer -- mostly in join_dataset_search where it forces the client to do work deciding what dimensions will need to be involved in the query instead of just recording what dataset types the client asked for and sending it to the server. However, since the dimensions are fairly high-level logical things inherent to the way Butler works I suspect this won't cause many backwards-compatibility issues. The serialized data sent in QueryTree and the ResultSpec classes is quite straightforward and simple, so that's a win.

I don't see anything here that would be harmful to merge in its current form. To the extent that you decide to make changes in response to these comments it can probably happen in a later (and hopefully much smaller!) PR. (Also note there are 26 comments here -- sometimes GitHub likes to hide some when it goes over 10.)

Comment on lines +103 to +108
storage_class_name: str | None
"""Name of the storage class to use when returning `DatasetRef` results.

May be `None` if the dataset is only used as a constraint or to return
columns that do not include a full dataset type.
"""
dhirving (Contributor):

Just FYI, the strategy I've been using so far on StorageClass is that the client never sends the server anything about a StorageClass -- the server sends back a ref with the default storage class for the dataset type and the client overrides it if required. This theoretically lets us support clients loading data using custom/experimental storage classes not known to the server.

I don't think it's harmful to have this get sent to the server and ignored, though at first glance I would have expected this bit of data to be attached to the SingleTypeDatasetQueryResults object and not in this part of the QueryTree... since the dataset types provided in the call to datasets() define the output from the query, not any intermediate joins.

TallJimbo (Member Author):

I struggled with this a lot myself. I added it because:

  • if Query.join_dataset_search can take a DatasetType and the user wants dataset results, it'll be confusing if they passed in a storage class override initially and then we forgot it just because we didn't have a place to put it;
  • I wanted Query.join_dataset_search to be able to take DatasetType as well as str, both so Query.datasets could delegate to it and so I didn't have to document why it was unusual in that respect.

I think the real problem is that DatasetType has a storage class, and that's a problem in lots of other places, but of course it's tremendously useful in probably even more other places and having a DatasetTypeWithoutStorageClass public class is its own problem. And that's all thoroughly water under the bridge.

In any case, yes, I think the server can ignore this (or just blindly copy it into Parquet files, so the client doesn't have to edit them); the important thing is that the client doesn't forget it, and I think it's slightly better (even if that means sending it to the server) than out on its own in Query (and it would be Query, not SingleTypeDatasetResults, to serve its current purpose).

dhirving (Contributor):

> if Query.join_dataset_search can take a DatasetType and the user wants dataset results, it'll be confusing if they passed in a storage class override initially and then we forgot it just because we didn't have a place to put it

Don't they already have to pass the dataset type again when they call datasets() to declare that they want refs back for this dataset type? To me it makes sense that whatever you pass into datasets is the authoritative thing, since that's where you're declaring the output format.

> not SingleTypeDatasetResults, to serve its current purpose).

It's possible I'm misunderstanding what is going on with this StorageClass (or maybe all the effects that a StorageClass can have?) To me it looks like you could just pass in the StorageClass when you create the SingleTypeDatasetQueryResults, and call ref.overrideStorageClass if needed just before yielding from __iter__. If this doesn't work I'd like to know because I'm using a similar pattern a few places in RemoteButler.
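
A sketch of that pattern (the constructor signature and the driver call are hypothetical; DatasetRef.overrideStorageClass is the existing method):

from collections.abc import Iterator

from lsst.daf.butler import DatasetRef, StorageClass


class SingleTypeDatasetQueryResults:
    def __init__(self, driver, spec, storage_class: StorageClass | None = None):
        self._driver = driver
        self._spec = spec
        self._storage_class = storage_class

    def __iter__(self) -> Iterator[DatasetRef]:
        for ref in self._driver.execute(self._spec):
            if self._storage_class is not None:
                # Apply the client-side override just before handing the
                # ref to the caller; the server never needs to see it.
                ref = ref.overrideStorageClass(self._storage_class)
            yield ref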

TallJimbo (Member Author):

In almost all cases, that's right - it makes more sense to pass the storage class in later. It's really quite an edge case I'm worried about here:

  • user calls Query.join_dataset_search with a DatasetType instance that has a storage class that is not the same as the default one for that dataset type in the repository;
  • user calls Query.datasets with a str dataset type name that has no storage class information. Do we respect their previous storage class override?

Maybe it would be better to raise an exception or emit a warning if the DatasetType passed to join_dataset_search has a storage class that isn't the same as the repository one (since we need to look up the dataset type definition in the repository at that point anyway, to get the dimensions), saying that the storage class passed in there will be ignored.
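
That check might look something like this (all names here are illustrative, not the final API):

from lsst.daf.butler import DatasetType


def join_dataset_search(self, dataset_type: str | DatasetType, collections=None):
    if isinstance(dataset_type, DatasetType):
        # We have to look up the repository definition anyway to get the
        # dimensions, so reject storage class overrides at the same time.
        repo_definition = self._driver.get_dataset_type(dataset_type.name)
        if dataset_type.storageClass_name != repo_definition.storageClass_name:
            raise ValueError(
                f"Storage class override {dataset_type.storageClass_name!r} for "
                f"{dataset_type.name!r} would be ignored in a query join; "
                "pass it to Query.datasets() instead."
            )
    ...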

TallJimbo (Member Author):

Done on DM-43146: DatasetSearch no longer has storage_class_name. Instead join_dataset_search raises if you give it a storage class override.

Comment on lines +53 to +79
class DatasetQueryResults(CountableQueryBase, Iterable[DatasetRef]):
"""A query for `DatasetRef` results."""

@abstractmethod
def by_dataset_type(self) -> Iterator[SingleTypeDatasetQueryResults]:
"""Group results by dataset type.

Returns
-------
iter : `~collections.abc.Iterator` [ `SingleTypeDatasetQueryResults` ]
An iterator over `DatasetQueryResults` instances that are each
responsible for a single dataset type.
"""
raise NotImplementedError()

@property
@abstractmethod
def has_dimension_records(self) -> bool:
"""Whether all data IDs in this iterable contain dimension records."""
raise NotImplementedError()

@abstractmethod
def with_dimension_records(self) -> DatasetQueryResults:
"""Return a results object for which `has_dimension_records` is
`True`.
"""
raise NotImplementedError()
dhirving (Contributor):

I'm not sure this DatasetQueryResults base class is necessary. If Query.datasets always returned a ChainedDatasetQueryResults (potentially of length 1, and presumably renamed to DatasetQueryResults; see the sketch after this list) then:

  • The DatasetQueryResults base class could be removed, leaving the reader with less of a multiple-inheritance tangle to puzzle out (there's at least one diamond in the current setup.)
  • There would be fewer code paths to go down -- users and maintainers wouldn't have to worry about potentially varying behavior between the 'direct' case and the 'chained' case. All test coverage would be covering the same code paths.
  • by_dataset_type could be removed from SingleTypeDatasetQueryResults (and possibly other methods with a bit of restructuring that becomes possible when the interfaces of SingleTypeDatasetQueryResults and ChainedDatasetQueryResults don't have to be identical.)
  • SingleTypeDatasetQueryResults could potentially become private to this file (if all available methods were forwarded from ChainedDatasetQueryResults, which is probably desirable for consistency anyway.)
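
A rough sketch of that flattening (hypothetical and simplified; the real classes carry more state than this):

from collections.abc import Iterator

from lsst.daf.butler import DatasetRef


class DatasetQueryResults:
    """What ChainedDatasetQueryResults might become after the rename."""

    def __init__(self, by_dataset_type: list) -> None:
        self._by_dataset_type = by_dataset_type

    def __iter__(self) -> Iterator[DatasetRef]:
        # Flatten the per-type results into a single stream of refs.
        for single_type_results in self._by_dataset_type:
            yield from single_type_results

    def by_dataset_type(self) -> Iterator:
        # Now the only place this method needs to exist.
        return iter(self._by_dataset_type)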

TallJimbo (Member Author):

I consider the single-type results to be the important thing and the chain to be secondary, because I expect most queries to be for a specific dataset type. But I very much like the idea in one of your other comments of just dropping the support for querying more than one dataset type at a time in favor of the user writing their own loop over dataset types (let's discuss that further on that other thread).

TallJimbo (Member Author):

Done on DM-43146.

python/lsst/daf/butler/queries/tree/_column_literal.py (comment thread resolved; outdated)

expression_type: Literal["float"] = "float"

value: float
dhirving (Contributor):

JSON encoders have a long history of doing whatever they feel like with floating point precision. I'm not sure what Pydantic/Python's versions do... have you confirmed that they guarantee they will round-trip values with sufficient precision?

TallJimbo (Member Author):

I have not. Thankfully float fields are pretty rare in our schema and I can't think of anything that would require high precision, but it would be nice to know if we're losing anything here.
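
For what it's worth, a quick empirical check of the Python side could look like this (a sketch assuming Pydantic v2; it says nothing about non-Python consumers of the JSON):

import json
import random
from typing import Literal

import pydantic


class FloatLiteral(pydantic.BaseModel):
    expression_type: Literal["float"] = "float"
    value: float


# Python's json module serializes floats with repr(), which since
# Python 3.1 emits the shortest string that parses back to exactly the
# same double, so finite values should round-trip losslessly.
random.seed(0)
for _ in range(10_000):
    x = random.uniform(-1e300, 1e300)
    assert json.loads(json.dumps(x)) == x
    serialized = FloatLiteral(value=x).model_dump_json()
    assert FloatLiteral.model_validate_json(serialized).value == x
# Caveat: NaN and +/-inf are not valid strict JSON and need separate handling.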

Comment on lines +96 to +103
def precedence(self) -> int:
"""Operator precedence for this operation.

Lower values bind more tightly, so parentheses are needed when printing
an expression where an operand has a higher value than the expression
itself.
"""
raise NotImplementedError()
dhirving (Contributor):

I'm a little surprised to see precedence here... I would have expected that to be resolved at the higher level where we parse the input expressions, so that we just have a tree of expressions that could be evaluated directly here without worrying about ordering.

Is this only used for debug stringifications? If it were me, I would just always put in the parentheses... less complexity, and it makes the tree structure explicit (which is often what you're trying to debug if you're looking at these?)

TallJimbo (Member Author):

It is just about stringification and it might now be overkill. I added it back when Predicate was a flexible tree of binary AND and OR operators, and A AND (B AND (C AND (D AND E))) happened a lot and did get in the way of readability. But now that Predicate always maintains conjunctive normal form (ANDs-of-ORs-[of NOT]) we don't need it to solve that particular problem. Let me see how it goes if I rip it out.

TallJimbo (Member Author):

Done (on DM-43146): we now just add parentheses for any kind of nested expression (but do try to avoid them on literals, direct references to columns, and predicates).
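
As a toy illustration of that rule (these node types are stand-ins, not the actual butler expression classes):

from dataclasses import dataclass


@dataclass
class Literal:
    value: object


@dataclass
class Column:
    name: str


@dataclass
class BinaryOp:
    operator: str
    lhs: object
    rhs: object


def stringify(node) -> str:
    """Parenthesize every nested operation; leave leaf nodes bare."""
    match node:
        case Literal(value=value):
            return repr(value)
        case Column(name=name):
            return name
        case BinaryOp(operator=op, lhs=lhs, rhs=rhs):
            return f"({stringify(lhs)} {op} {stringify(rhs)})"
    raise TypeError(f"unexpected node: {node!r}")


# stringify(BinaryOp("+", Column("a"), BinaryOp("*", Column("b"), Literal(2))))
# yields '(a + (b * 2))'.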

Comment on lines +339 to +341
# TODO: string length matches the one defined in the
# CollectionManager implementations; we need to find a way to
# avoid hard-coding the value in multiple places.
dhirving (Contributor):

Why not just declare a constant and reference it in both places?
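
For example (the constant's name, home, and value here are all placeholders):

import pydantic
import sqlalchemy

COLLECTION_NAME_MAX_LENGTH = 64  # placeholder value


class CollectionName(pydantic.BaseModel):
    # The query-tree model validates against the shared constant ...
    name: str = pydantic.Field(max_length=COLLECTION_NAME_MAX_LENGTH)


# ... and the table definition uses the same constant, so the two can
# never silently drift apart.
metadata = sqlalchemy.MetaData()
collection_table = sqlalchemy.Table(
    "collection",
    metadata,
    sqlalchemy.Column("name", sqlalchemy.String(COLLECTION_NAME_MAX_LENGTH)),
)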

TallJimbo (Member Author):

I think I just forgot to come back to this and do it later on the branch; will do. I think I was just hung up on where to put the constant.

TallJimbo (Member Author):

Done on DM-43146, and a good thing, too - I had actually gotten the number wrong here before.

@TallJimbo TallJimbo (Member Author) left a comment

Here's a round of replies about changes I now plan to make but haven't made yet.

I'm not really accustomed to leaving even small review-response changes to another ticket, but this is such a behemoth PR that I can imagine that working better for the reviewer, so I'll leave that up to you. I'll start working on those in strictly separate commits, and I can put them on some other branch or push them here later.

I do think I want to make a number of those changes before getting back to implementing the driver, because I think they may actually make that easier, or at least differently hard.

@@ -101,48 +101,59 @@ Python API reference

.. automodapi:: lsst.daf.butler
:no-main-docstr:
:no-inherited-members:
TallJimbo (Member Author):

I can try to remove it and see if it's just warnings we get, but I know it's at least possible to get hard errors in the doc-build from the Sphinx bug this works around, and if that happens I'll have to put it back in at least a few places. From the slack conversation that led to this commit it sounds like if we can finally upgrade our doc tooling to a newer Sphinx it may go away.

python/lsst/daf/butler/queries/_base.py (comment thread resolved)
self._tree, order_by=convert_order_by_args(self.dimensions, self._get_datasets(), *args)
)

def limit(self, limit: int | None = None, offset: int = 0) -> Self:
TallJimbo (Member Author):

I don't know of any direct use cases, and I'd be happy to remove it. @timj , @andy-slac , do you have a reason we need offset support, or was it added originally just because it felt reasonable when we added limit?

Comment on lines +184 to +186
def by_dataset_type(self) -> Iterator[SingleTypeDatasetQueryResults]:
# Docstring inherited.
return iter(self._by_dataset_type)
TallJimbo (Member Author):

👍 to dropping ChainedDatasetQueryResults. That will simplify the mess of base classes a lot, and I do think the long term plan is to run those queries separately.

We would then need a separate method or methods for some important summary queries, like "which dataset types exist at all in these collections". I had originally planned to add those in the future as convenience methods on Query that would delegate to ChainedDatasetQueryResults.by_dataset_type() and then call any on each (and hence not require any new QueryDriver methods). But having a driver method dedicated to those summaries is not really a burden, and it could save us from having to bring the CollectionSummary objects down to the client at all (I'll have to work through that possibility to be sure) since that was largely about optimizing those "dataset types in collections" summary queries. (The other reason to bring them down to the client is that caching them there may be easier than caching them on the server, but that's also very much a "maybe".)
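
The convenience-method version of that summary query might look like this (a sketch; the per-type any() and dataset_type accessors are assumed here, not confirmed API):

def existing_dataset_types(query, dataset_type_names, collections):
    """Return the names of dataset types that have at least one dataset
    in the given collections, via a client-side loop over per-type
    results (a dedicated driver summary method could replace this).
    """
    results = query.datasets(dataset_type_names, collections=collections)
    return [
        single.dataset_type.name
        for single in results.by_dataset_type()
        if single.any()
    ]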

I'll start with deleting ChainedDatasetQueryResults and whatever base classes that renders useless on this branch before I merge. If that goes well I'll stop there and put summarization methods on another ticket, but I think I might want to prototype those out before I dive back into implementing the driver, to make sure I don't want to delete or change anything in the lower levels that I should have kept.


python/lsst/daf/butler/queries/visitors.py (comment thread resolved)
@dhirving dhirving (Contributor) commented Mar 1, 2024

> I'm not really accustomed to leaving even small review-response changes to another ticket, but this is such a behemoth PR that I can imagine that working better for the reviewer - so I'll leave that up to you.

🤷 Probably makes sense to clean up some of the little stuff like the wrong comments before merging, but to me it makes sense to get this version merged down before removing the ChainedDatasetQueryResults or any changes like that.

For a new feature like this where the impact on users is basically nil until we make it public, there's basically no downside to merging early and merging often :) (Other than that our development workflow is kind of annoying with the need to generate a new JIRA ticket to make a PR.)

Benefits to you of having this code merged:

  • You have a record of how the design has evolved as you tweak it, and you can more easily backtrack a change or resurrect a useful utility function because there is a linear history on a single branch to look through to find it.
  • You don't need to rebase it anymore, and any new changes people make will take into account the code you've written here.
  • You reduce the risk of subtle breakage from rebasing, because any rebasing will happen in smaller chunks.
  • Having the rough framework merged makes it easier to delegate parts to other people.

Benefits to the rest of the team of having this code merged:

  • We can make changes to it (e.g. as I move the exception hierarchy around this week I can see which ones you've used here and make decisions accordingly.)
  • It's easier to learn and think about it, because it's visible in the same branches we're working on.
  • We don't have to wonder which of your side branches we should be looking at, because we can just see the progress on main.

This keeps our doc build from trying to parse upstream docstrings that
may not be numpydoc.  I'm sure we don't have to use it in all of our
modules, but I don't have much of an opinion about whether including
inherited members is generally useful, so consistency wins.
This was a very minor issue; running mypy against daf_butler without
daf_relation setup will now cause mypy to complain.  But of course it's
rare for us to do that because more than mypy breaks.
I'm prioritizing how this appears when built-in connections are
printed (which always calls repr) over the "print like you'd construct
it" guideline.
@TallJimbo TallJimbo (Member Author) commented Mar 4, 2024

Ok, I've fixed a few of the doc/comment issues and I'm punting the rest to DM-43146.

This includes:

- replacing the Query and *QueryResults ABCs with concrete classes
  that delegate to another QueryDriver ABC;

- substantially reworking the public Query and *QueryResults
  interfaces, mostly to minimize the number of different ways to do
  various things (and hence limit complexity);

- adding a large suite of Pydantic models that can describe complex
  under-construction queries, allowing us to send them over the wire
  in RemoteButler.

Because QueryDriver doesn't have any concrete implementations yet,
this change means Butler._query no longer works at all (previously it
delegated to the old registry.queries system).  A QueryDriver
implementation for DirectButler has been largely implemented on
another branch and will be added later.

For now, the only tests are those that rely on a mocked QueryDriver
(or don't require one at all).  These are in two files:

- test_query_interfaces.py tests the public interface objects,
  including the semi-public Pydantic models;

- test_query_utilities.py tests some utility classes (ColumnSet and
  OverlapsVisitor) that are expected to be used by all driver
  implementations to establish some behavioral invariants.

There is already substantial duplication with code in
lsst.daf.butler.registry.queries, and that will get worse when a
direct-SQL driver class is added.  Eventually the plan is to retire
almost all of lsst.daf.butler.registry.queries (except the
string-expression parser, which we'll move later) making the public
registry query interfaces delegate to lsst.daf.butler.queries instead,
but that will require both getting the latter fully functional and
RFC'ing the removal of some things we have no intention of doing in
the new system.

Fix outdated docs for Butler._query_datasets.

Fix docstrings on column literal value attributes.

DO NOT MERGE; SQUASH.

Make implementation-centric docstring into a comment.

DO NOT MERGE; SQUASH.
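
The delegation structure described in the main commit message above, as a skeletal sketch (the class names follow the commit message; everything else is illustrative):

from abc import ABC, abstractmethod


class QueryDriver(ABC):
    """Interface each butler implementation provides; the concrete
    Query and *QueryResults classes hold no backend logic themselves.
    """

    @abstractmethod
    def execute(self, result_spec, tree):
        """Run a query described by a serializable tree and result spec."""
        raise NotImplementedError()


class Query:
    """Concrete public class: methods build up a Pydantic QueryTree and
    defer all execution to the driver, so the same class can serve both
    DirectButler and RemoteButler.
    """

    def __init__(self, driver: QueryDriver, tree):
        self._driver = driver
        self._tree = tree

    def datasets(self, dataset_type, collections):
        result_spec = ...  # build a Pydantic result-spec model here
        return self._driver.execute(result_spec, self._tree)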
@TallJimbo TallJimbo merged commit c40c3fe into main Mar 4, 2024
18 checks passed
@TallJimbo TallJimbo deleted the tickets/DM-42740 branch March 4, 2024 17:25