DM-42740: rewrite new query system interfaces and division of responsibilities #966
Conversation
Force-pushed 5d3bae1 to 8e48a44.
Force-pushed 8e48a44 to 97d3e88.
I think this should work pretty well. It pushes a bit more complexity to the client than I would prefer -- mostly in `join_dataset_search`, where it forces the client to decide which dimensions will need to be involved in the query, instead of just recording what dataset types the client asked for and sending that to the server. However, since dimensions are fairly high-level logical things inherent to the way Butler works, I suspect this won't cause many backwards-compatibility issues. The serialized data sent in `QueryTree` and the `ResultSpec` classes is quite straightforward and simple, so that's a win.
I don't see anything here that would be harmful to merge in its current form. To the extent that you decide to make changes in response to these comments it can probably happen in a later (and hopefully much smaller!) PR. (Also note there are 26 comments here -- sometimes GitHub likes to hide some when it goes over 10.)
```python
storage_class_name: str | None
"""Name of the storage class to use when returning `DatasetRef` results.

May be `None` if the dataset is only used as a constraint or to return
columns that do not include a full dataset type.
"""
```
Just FYI, the strategy I've been using so far on `StorageClass` is that the client never sends the server anything about a `StorageClass` -- the server sends back a ref with the default storage class for the dataset type, and the client overrides it if required. This theoretically lets us support clients loading data using custom/experimental storage classes not known to the server.

I don't think it's harmful to have this get sent to the server and ignored, though at first glance I would have expected this bit of data to be attached to the `SingleTypeDatasetQueryResults` object and not in this part of the `QueryTree`... since the dataset types provided in the call to `datasets()` define the output from the query, not any intermediate joins.
I struggled with this a lot myself. I added it because:

- if `Query.join_dataset_search` can take a `DatasetType` and the user wants dataset results, it'll be confusing if they passed in a storage class override initially and then we forgot it just because we didn't have a place to put it;
- I wanted `Query.join_dataset_search` to be able to take `DatasetType` as well as `str`, both so `Query.datasets` could delegate to it and so I didn't have to document why it was unusual in that respect.
I think the real problem is that `DatasetType` has a storage class, and that's a problem in lots of other places, but of course it's tremendously useful in probably even more other places, and having a `DatasetTypeWithoutStorageClass` public class is its own problem. And that's all thoroughly water under the bridge.

In any case, yes, I think the server can ignore this (or just blindly copy it into Parquet files, so the client doesn't have to edit them); the important thing is that the client doesn't forget it, and I think it's slightly better here (even if that means sending it to the server) than out on its own in `Query` (and it would be `Query`, not `SingleTypeDatasetResults`, to serve its current purpose).
> if `Query.join_dataset_search` can take a `DatasetType` and the user wants dataset results, it'll be confusing if they passed in a storage class override initially and then we forgot it just because we didn't have a place to put it

Don't they already have to pass the dataset type again when they call `datasets()` to declare that they want refs back for this dataset type? To me it makes sense that whatever you pass into `datasets` is the authoritative thing, since that's where you're declaring the output format.
> not `SingleTypeDatasetResults`, to serve its current purpose)
It's possible I'm misunderstanding what is going on with this `StorageClass` (or maybe all the effects that a `StorageClass` can have?). To me it looks like you could just pass in the `StorageClass` when you create the `SingleTypeDatasetQueryResults`, and call `ref.overrideStorageClass` if needed just before yielding from `__iter__`. If this doesn't work I'd like to know, because I'm using a similar pattern in a few places in RemoteButler.
In almost all cases, that's right - it makes more sense to pass the storage class in later. It's really quite an edge case I'm worried about here:

- user calls `Query.join_dataset_search` with a `DatasetType` instance that has a storage class that is not the same as the default one for that dataset type in the repository;
- user calls `Query.datasets` with a `str` dataset type name that has no storage class information.

Do we respect their previous storage class override?
Maybe it would be better to raise an exception or emit a warning if the `DatasetType` passed to `join_dataset_search` has a storage class that isn't the same as the repository one (since we need to look up the dataset type definition in the repository at that point anyway, to get the dimensions), saying that the storage class passed in there will be ignored.
Done on DM-43146: `DatasetSearch` no longer has `storage_class_name`. Instead `join_dataset_search` raises if you give it a storage class override.
```python
class DatasetQueryResults(CountableQueryBase, Iterable[DatasetRef]):
    """A query for `DatasetRef` results."""

    @abstractmethod
    def by_dataset_type(self) -> Iterator[SingleTypeDatasetQueryResults]:
        """Group results by dataset type.

        Returns
        -------
        iter : `~collections.abc.Iterator` [ `SingleTypeDatasetQueryResults` ]
            An iterator over `DatasetQueryResults` instances that are each
            responsible for a single dataset type.
        """
        raise NotImplementedError()

    @property
    @abstractmethod
    def has_dimension_records(self) -> bool:
        """Whether all data IDs in this iterable contain dimension records."""
        raise NotImplementedError()

    @abstractmethod
    def with_dimension_records(self) -> DatasetQueryResults:
        """Return a results object for which `has_dimension_records` is
        `True`.
        """
        raise NotImplementedError()
```
I'm not sure this `DatasetQueryResults` base class is necessary. If `Query.datasets` always returned a `ChainedDatasetQueryResults` (potentially of length 1, and presumably renamed to `DatasetQueryResults`) then:

- The `DatasetQueryResults` base class could be removed, leaving the reader with less of a multiple-inheritance tangle to puzzle out (there's at least one diamond in the current setup).
- There would be fewer code paths to go down -- users and maintainers wouldn't have to worry about potentially varying behavior between the 'direct' case and the 'chained' case. All test coverage would be covering the same code paths.
- `by_dataset_type` could be removed from `SingleTypeDatasetQueryResults` (and possibly other methods, with a bit of restructuring that becomes possible when the interfaces of `SingleTypeDatasetQueryResults` and `ChainedDatasetQueryResults` don't have to be identical).
- `SingleTypeDatasetQueryResults` could potentially become private to this file (if all available methods were forwarded from `ChainedDatasetQueryResults`, which is probably desirable for consistency anyway).
I consider the single-type results to be the important thing and the chain to be secondary, because I expect most queries to be for a specific dataset type. But I very much like the idea in one of your other comments of just dropping the support for querying more than one dataset type at a time in favor of the user writing their own loop over dataset types (let's discuss that further on that other thread).
Done on DM-43146.
```python
expression_type: Literal["float"] = "float"

value: float
```
JSON encoders have a long history of doing whatever they feel like with floating point precision. I'm not sure what Pydantic/Python's versions do... have you confirmed that they guarantee they will round-trip values with sufficient precision?
I have not. Thankfully `float` fields are pretty rare in our schema and I can't think of anything that would require high precision, but it would be nice to know if we're losing anything here.
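As a data point, CPython's own `json` module has round-tripped finite floats exactly since Python 3.1, because it serializes via the shortest-round-trip `repr`; whether Pydantic adds any lossy step of its own would still need checking. A quick self-contained check:

```python
import json
import math
import random

# Since Python 3.1, repr(float) (and hence json.dumps) emits the shortest
# string that round-trips exactly, so encode/decode is lossless for finite
# values.
random.seed(42)
for _ in range(10_000):
    x = random.uniform(-1e300, 1e300)
    assert json.loads(json.dumps(x)) == x

# Edge cases worth checking explicitly:
for x in (0.1, 1e-308, math.pi, 2**53 + 1.0):
    assert json.loads(json.dumps(x)) == x

# Non-finite values are the real hazard: json.dumps emits non-standard
# tokens (NaN, Infinity) that other JSON parsers may reject.
print(json.dumps(float("inf")))
```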
```python
def precedence(self) -> int:
    """Operator precedence for this operation.

    Lower values bind more tightly, so parentheses are needed when printing
    an expression where an operand has a higher value than the expression
    itself.
    """
    raise NotImplementedError()
```
I'm a little surprised to see precedence here... I would have expected that to be resolved at the higher level where we parse the input expressions, so that here we just have a tree of expressions that could be evaluated directly without worrying about ordering.

Is this only used for debug stringifications? If it were me I would just always put in the parentheses... less complexity, and it makes the tree structure explicit (which is often what you're trying to debug if you're looking at these?)
It is just about stringification, and it might now be overkill. I added it back when `Predicate` was a flexible tree of binary AND and OR operators, and `A AND (B AND (C AND (D AND E)))` happened a lot and did get in the way of readability. But now that `Predicate` always maintains conjunctive normal form (ANDs of ORs [of NOTs]) we don't need it to solve that particular problem. Let me see how it goes if I rip it out.
Done (on DM-43146): we now just add parentheses for any kind of nested expression (but do try to avoid them on literals, direct references to columns, and predicates).
```python
# TODO: string length matches the one defined in the
# CollectionManager implementations; we need to find a way to
# avoid hard-coding the value in multiple places.
```
Why not just declare a constant and reference it in both places?
I think I just forgot to come back to this and do it later on the branch; will do. I think I was just hung up on where to put the constant.
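The fix itself is small once a home for the constant is chosen; a hypothetical sketch (module name, constant name, and the value 64 are all invented):

```python
# limits.py (hypothetical shared module): one home for the size constant.
COLLECTION_NAME_LENGTH = 64  # invented value for illustration

# Both the SQL table definition and the serialization model would then
# reference the same constant instead of hard-coding the number twice, e.g.:
#
#   sqlalchemy.Column("name", sqlalchemy.String(COLLECTION_NAME_LENGTH))
#   name: str = pydantic.Field(max_length=COLLECTION_NAME_LENGTH)


def validate_collection_name(name: str) -> str:
    """Stand-in validator showing the constant reused on the model side."""
    if len(name) > COLLECTION_NAME_LENGTH:
        raise ValueError(f"Collection name longer than {COLLECTION_NAME_LENGTH}")
    return name


assert validate_collection_name("HSC/defaults") == "HSC/defaults"
```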
Done on DM-43146, and a good thing, too - I had actually gotten the number wrong here before.
Here's a round of replies about changes I now plan to make but haven't made yet.
I'm not really accustomed to leaving even small review-response changes to another ticket, but this is such a behemoth PR that I can imagine that working better for the reviewer - so I'll leave that up to you. I'll start working on those on strictly separate commits, and I can put them on some other branch or push them here later.
I do think I want to make a number of those changes before getting back to implementing the driver, because I think they may actually make that easier, or at least differently hard.
```diff
@@ -101,48 +101,59 @@ Python API reference
 .. automodapi:: lsst.daf.butler
    :no-main-docstr:
    :no-inherited-members:
```
I can try to remove it and see if it's just warnings we get, but I know it's at least possible to get hard errors in the doc build from the Sphinx bug this works around, and if that happens I'll have to put it back in at least a few places. From the Slack conversation that led to this commit, it sounds like the workaround may become unnecessary if we can finally upgrade our doc tooling to a newer Sphinx.
```python
            self._tree, order_by=convert_order_by_args(self.dimensions, self._get_datasets(), *args)
        )

    def limit(self, limit: int | None = None, offset: int = 0) -> Self:
```
I don't know of any direct use cases, and I'd be happy to remove it. @timj, @andy-slac, do you have a reason we need `offset` support, or was it added originally just because it felt reasonable when we added `limit`?
```python
def by_dataset_type(self) -> Iterator[SingleTypeDatasetQueryResults]:
    # Docstring inherited.
    return iter(self._by_dataset_type)
```
👍 to dropping `ChainedDatasetQueryResults`. That will simplify the mess of base classes a lot, and I do think the long-term plan is to run those queries separately.

We would then need a separate method or methods for some important summary queries, like "which dataset types exist at all in these collections". I had originally planned to add those in the future as convenience methods on `Query` that would delegate to `ChainedDatasetQueryResults.by_dataset_type()` and then call `any` on each (and hence not require any new `QueryDriver` methods). But having a driver method dedicated to those summaries is not really a burden, and it could save us from having to bring the `CollectionSummary` objects down to the client at all (I'll have to work through that possibility to be sure), since that was largely about optimizing those "dataset types in collections" summary queries. (The other reason to bring them down to the client is that caching them there may be easier than caching them on the server, but that's also very much a "maybe".)

I'll start with deleting `ChainedDatasetQueryResults` and whatever base classes that renders useless on this branch before I merge. If that goes well I'll stop there and put summarization methods on another ticket, but I think I might want to prototype those out before I dive back into implementing the driver, to make sure I don't want to delete or change anything in the lower levels that I should have kept.
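A dedicated driver-level summary method along those lines might look like this sketch (all names invented; the real `QueryDriver` interface may well differ):

```python
from collections.abc import Iterable, Mapping


class FakeQueryDriver:
    """Stand-in driver with a dedicated summary method, so the client never
    needs the full CollectionSummary objects."""

    def __init__(self, datasets_by_collection: Mapping[str, set[str]]):
        # collection name -> dataset type names present in it
        self._datasets_by_collection = dict(datasets_by_collection)

    def dataset_type_names(self, collections: Iterable[str]) -> set[str]:
        """Return the dataset types that exist at all in these collections."""
        found: set[str] = set()
        for collection in collections:
            found |= self._datasets_by_collection.get(collection, set())
        return found


driver = FakeQueryDriver({"run1": {"bias", "flat"}, "run2": {"flat", "raw"}})
assert driver.dataset_type_names(["run1", "run2"]) == {"bias", "flat", "raw"}
```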
🤷 Probably makes sense to clean up some of the little stuff like the wrong comments before merging, but to me it makes sense to get this version merged down before removing `ChainedDatasetQueryResults` or any changes like that. For a new feature like this, where the impact on users is basically nil until we make it public, there's basically no downside to merging early and merging often :) (Other than that our development workflow is kind of annoying, with the need to generate a new JIRA ticket to make a PR.) Benefits to you of having this code merged:

Benefits to the rest of the team of having this code merged:
This keeps our doc build from trying to parse upstream docstrings that may not be numpydoc. I'm sure we don't have to use it in all of our modules, but I don't have much of an opinion about whether including inherited members is generally useful, so consistency wins.
This was a very minor issue; running mypy against daf_butler without daf_relation setup will now cause mypy to complain. But of course it's rare for us to do that because more than mypy breaks.
I'm prioritizing how this appears when built-in connections are printed (which always calls repr) over the "print like you'd construct it" guideline.
Force-pushed 714931b to 3b92d1c.
Ok, I've fixed a few of the doc/comment issues and I'm punting the rest to DM-43146.
Force-pushed 3b92d1c to b2a5479.
This includes:

- replacing the Query and *QueryResults ABCs with concrete classes that delegate to another QueryDriver ABC;
- substantially reworking the public Query and *QueryResults interfaces, mostly to minimize the number of different ways to do various things (and hence limit complexity);
- adding a large suite of Pydantic models that can describe complex under-construction queries, allowing us to send them over the wire in RemoteButler.

Because QueryDriver doesn't have any concrete implementations yet, this change means Butler._query no longer works at all (previously it delegated to the old registry.queries system). A QueryDriver implementation for DirectButler has been largely implemented on another branch and will be added later.

For now, the only tests are those that rely on a mocked QueryDriver (or don't require one at all). These are in two files:

- test_query_interfaces.py tests the public interface objects, including the semi-public Pydantic models;
- test_query_utilities.py tests some utility classes (ColumnSet and OverlapsVisitor) that are expected to be used by all driver implementations, to establish some behavioral invariants.

There is already substantial duplication with code in lsst.daf.butler.registry.queries, and that will get worse when a direct-SQL driver class is added. Eventually the plan is to retire almost all of lsst.daf.butler.registry.queries (except the string-expression parser, which we'll move later), making the public registry query interfaces delegate to lsst.daf.butler.queries instead, but that will require both getting the latter fully functional and RFC'ing the removal of some things we have no intention of doing in the new system.

Fix outdated docs for Butler._query_datasets.

Fix docstrings on column literal value attributes. DO NOT MERGE; SQUASH.

Make implementation-centric docstring into a comment. DO NOT MERGE; SQUASH.
Force-pushed b2a5479 to cdc963a.
Checklist

- added a release note for user-visible changes to doc/changes (not yet user-visible)