Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-32403: Add support for order_by/limit to query results #601

Merged
merged 5 commits into from Nov 16, 2021

Conversation

andy-slac
Copy link
Contributor

@andy-slac andy-slac commented Nov 10, 2021

SQL registry method queryDimensionRecords now returns special iterable
object instead of plain iterator. Iterables returned from queryDataIds
and queryDimensionRecords have new methods order_by() and limit(). Same
two methods were added to Query classes. Extended unit tests for query
methods to check this new functionality.

Note that remote registry does not support this functionality (yet)

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

@codecov
Copy link

codecov bot commented Nov 10, 2021

Codecov Report

Merging #601 (0501a6f) into master (3fdd349) will increase coverage by 0.10%.
The diff coverage is 92.26%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #601      +/-   ##
==========================================
+ Coverage   83.54%   83.64%   +0.10%     
==========================================
  Files         241      241              
  Lines       30316    30624     +308     
  Branches     4528     4591      +63     
==========================================
+ Hits        25326    25614     +288     
- Misses       3795     3812      +17     
- Partials     1195     1198       +3     
Impacted Files Coverage Δ
python/lsst/daf/butler/_butler.py 78.51% <ø> (ø)
python/lsst/daf/butler/core/_topology.py 79.06% <ø> (+0.77%) ⬆️
python/lsst/daf/butler/core/config.py 91.94% <ø> (ø)
python/lsst/daf/butler/core/datasets/type.py 81.25% <ø> (ø)
...hon/lsst/daf/butler/core/dimensions/_coordinate.py 83.43% <ø> (ø)
.../butler/core/dimensions/_dataCoordinateIterable.py 71.16% <ø> (ø)
...ython/lsst/daf/butler/core/dimensions/_database.py 90.71% <ø> (ø)
...ython/lsst/daf/butler/core/dimensions/_elements.py 76.98% <ø> (+0.79%) ⬆️
...ython/lsst/daf/butler/core/dimensions/_governor.py 89.83% <ø> (ø)
python/lsst/daf/butler/core/dimensions/_graph.py 81.64% <ø> (ø)
... and 42 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3fdd349...0501a6f. Read the comment docs.

@andy-slac andy-slac force-pushed the tickets/DM-32403 branch 3 times, most recently from 6c92507 to f00059c Compare November 11, 2021 03:21
Copy link
Member

@TallJimbo TallJimbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I noted in one or two of the line comments, this has a pretty big limitation, in that you can only order by metadata fields if something else has happened to ensure the table that provides those fields has been joined into the query, and this is frequently not the case, especially when datasets are included in the query. In fact, I think with the way queryDimensionRecords works, even metadata fields from the dimension element whose records are being queried are not available for use in order_by, because order_by acts on a subquery generated by queryDataIds that may not join in that table.

The quick solution to this problem is to be less ambitious: remove support for ordering by metadata fields, because ordering by dimension keys should work fine. While I'm sure our colleagues really want to order_by(["day_obs", "seq_num"]), we can tell them to order_by(["exposure"]) as a workaround and they'll get the same order with a better chance of good index usage anyway.

Full support for metadata-field ordering would require moving this logic from Query to QuerySummary and QueryBuilder:

  • Add a mapping similar to QueryWhereClause.columns to QuerySummary, probably via an analogous nested struct that processes the user's order_by argument and stores the columns it needs. The important thing is that QuerySummary.mustHaveTableJoined include any DimensionElement whose metadata fields are included in the order_by list, and something in QuerySummary should remember the order_by list itself so it can be used later.

  • Move the logic in _OrderByHandler.query_columns to [something called by] QueryBuilder.finish, operating on QueryBuilder._elements, and then passing the DirectQuery constructor just the list of sqlalchemy columns to order by.

  • Make Query.order_by return a new Query by calling Query.makeBuilder with an augmented QuerySummary that includes the order_by request, use that to make a QueryBuilder and use that to make a new Query to return.

I am not sure it is worth taking the more difficult/complete road right now, as I think most needs can be met with dimension-key ordering. But this is tied up with the question of how to handle RemoteRegistry - at least a bit - and I'll bring that up on Jira since that's where that conversation started.

python/lsst/daf/butler/datastores/chainedDatastore.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/datastores/fileDatastore.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/core/timespan.py Show resolved Hide resolved
python/lsst/daf/butler/registry/queries/_query.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/registry/queries/_query.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/registry/queries/_query.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/registry/queries/_query.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/registry/queries/_query.py Outdated Show resolved Hide resolved
python/lsst/daf/butler/registry/queries/_query.py Outdated Show resolved Hide resolved
@andy-slac
Copy link
Contributor Author

I think with the way queryDimensionRecords works, even metadata fields from the dimension element whose records are being queried are not available for use in order_by, because order_by acts on a subquery generated by queryDataIds that may not join in that table.

I see that queryDimensionRecords works by doing queryDataIds first and then selecting records that match returned DataIds, I don't think there are sub-query involved there. In my dumb implementation of queryDimensionRecords I order queryDataIds results and then re-order records based on that ordering.

@andy-slac
Copy link
Contributor Author

And I think my general naive idea was that to order something based on metadata its corresponding dimension have to be in the query already. If you want to order based on metadata from some other dimension, we'd have to include that dimension into join, but that would be a different query with potentially different constraints?

@TallJimbo
Copy link
Member

TallJimbo commented Nov 11, 2021

And I think my general naive idea was that to order something based on metadata its corresponding dimension have to be in the query already. If you want to order based on metadata from some other dimension, we'd have to include that dimension into join, but that would be a different query with potentially different constraints?

I think we resolved this question at the middleware standup, but for posterity, I am specifically worried about the case where the dimension is in the query, but as an optimization we did not include its table, either because a dataset with that dimension is in the query, or because some dependent dimension is in the query. The user won't be aware of those optimizations and at present can't control them, so it would be surprising for their order_by to stop working when they are adding new constraints to the query but aren't changing the dimensions.

@andy-slac andy-slac force-pushed the tickets/DM-32403 branch 3 times, most recently from a8b36c1 to e252dde Compare November 12, 2021 05:40
@andy-slac
Copy link
Contributor Author

@TallJimbo, I'm now trying to do the right thing and follow your recipe:

Make Query.order_by return a new Query by calling Query.makeBuilder with an augmented QuerySummary that includes the order_by request, use that to make a QueryBuilder and use that to make a new Query to return.

but without much success. Query.makeBuilder does make a builder, but that builder does not have _elements populated (naturally). To populate it I'd have to join all dimensions again calling finish(True) but I don't think this is the correct thing to do here as it will join them with the already existing query (sort of creating self-join). Do I need to re-implement makeBuilder to do something different (my guess it's not really possible to fill _elements from existing query).

@andy-slac andy-slac force-pushed the tickets/DM-32403 branch 2 times, most recently from 48200f1 to 3a0cf92 Compare November 16, 2021 01:51
@TallJimbo
Copy link
Member

Ah, very good point; I was imaging that makeBuilder was much smarter than it actually is. And I don't think that can really be fixed without changing much more than just makeBuilder; the problem is that a Query just doesn't carry around enough information to reconstruct it. At this point I think it'd be better to go back to your original approach and just document the limitations.

@andy-slac
Copy link
Contributor Author

andy-slac commented Nov 16, 2021

@TallJimbo, I think I managed to resolve the problem by delaying the construction of a Query until the latest possible moment when order_by and limit are already known. It needed a sort of another level of indirection by passing a factory method instead of Query object to a Result constructor. Could you re-review 35f3ab0 20ea772 to see if it's reasonable or can be improved?

For NULL timespans I think one possible option is to assume begin/end as 0 for them so that they sort before any actual timespan, that should make things work as expected for people looking for the most recent values. For advanced handling we could filter NULLs using user expressions (not sure that it is supported right now though).

SQL registry method queryDimensionRecords now returns special iterable
object instead of plain iterator. Iterables returned from queryDataIds
and queryDimensionRecords have new methods order_by() and limit(). Same
two methods were added to Query classes. Extended unit tests for query
methods to check this new functionality.

Note that remote registry does not support this functionality (yet).
Copy link
Member

@TallJimbo TallJimbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 New version looks good; the extra indirection is pragmatic way to get the functionality in without a major rewrite of the queries subpackage, even though it does add some complexity that makes doing that rewrite feel a bit more important to me (to be clear, most of the problems were preexisting by far).

As for timespans, I'm pretty happy with any ordering. My main concern was actually with the endpoint SQLALchemy accessors on TimespanDatabaseRepresentation themselves, and in particular whether we can define consistent ordered comparison operations on those in the presence of NULL. I think that only matters if we also add support for them in the boolean expression language, so all this ticket does is make that seem a bit easier (when it's actually quite hard). So perhaps we should just add a disclaimer to the docs or comments on those attributes saying that Timespan methods should always be used for comparisons in boolean expressions, and attributes should not be exposed to the expression language in the future without a lot of attention to detail.

@andy-slac
Copy link
Contributor Author

I have added this note to lower method (and similar to upper):

    Notes
    -----
    If database holds ``NULL`` for a timespan then the returned expression
    should evaluate to 0. Main purpose of this and `upper` method is to use
    them in generating SQL, in particular ORDER BY clause, to guarantee a
    predictable ordering. It may potentially be used for transforming
    boolean user expressions into SQL, but it will likely require extra
    attention to ordering issues.

@andy-slac andy-slac merged commit b15e845 into master Nov 16, 2021
@andy-slac andy-slac deleted the tickets/DM-32403 branch November 16, 2021 23:07
@timj timj mentioned this pull request Jan 12, 2023
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants