DM-32403: Add support for order_by/limit to query results #601

andy-slac · 2021-11-10T19:32:42Z

SQL registry method queryDimensionRecords now returns special iterable
object instead of plain iterator. Iterables returned from queryDataIds
and queryDimensionRecords have new methods order_by() and limit(). Same
two methods were added to Query classes. Extended unit tests for query
methods to check this new functionality.

Note that remote registry does not support this functionality (yet)

Checklist

ran Jenkins
added a release note for user-visible changes to doc/changes

codecov · 2021-11-10T19:47:23Z

Codecov Report

Merging #601 (0501a6f) into master (3fdd349) will increase coverage by 0.10%.
The diff coverage is 92.26%.

@@            Coverage Diff             @@
##           master     #601      +/-   ##
==========================================
+ Coverage   83.54%   83.64%   +0.10%     
==========================================
  Files         241      241              
  Lines       30316    30624     +308     
  Branches     4528     4591      +63     
==========================================
+ Hits        25326    25614     +288     
- Misses       3795     3812      +17     
- Partials     1195     1198       +3

Impacted Files	Coverage Δ
python/lsst/daf/butler/_butler.py	`78.51% <ø> (ø)`
python/lsst/daf/butler/core/_topology.py	`79.06% <ø> (+0.77%)`	⬆️
python/lsst/daf/butler/core/config.py	`91.94% <ø> (ø)`
python/lsst/daf/butler/core/datasets/type.py	`81.25% <ø> (ø)`
...hon/lsst/daf/butler/core/dimensions/_coordinate.py	`83.43% <ø> (ø)`
.../butler/core/dimensions/_dataCoordinateIterable.py	`71.16% <ø> (ø)`
...ython/lsst/daf/butler/core/dimensions/_database.py	`90.71% <ø> (ø)`
...ython/lsst/daf/butler/core/dimensions/_elements.py	`76.98% <ø> (+0.79%)`	⬆️
...ython/lsst/daf/butler/core/dimensions/_governor.py	`89.83% <ø> (ø)`
python/lsst/daf/butler/core/dimensions/_graph.py	`81.64% <ø> (ø)`
... and 42 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3fdd349...0501a6f. Read the comment docs.

TallJimbo

As I noted in one or two of the line comments, this has a pretty big limitation, in that you can only order by metadata fields if something else has happened to ensure the table that provides those fields has been joined into the query, and this is frequently not the case, especially when datasets are included in the query. In fact, I think with the way queryDimensionRecords works, even metadata fields from the dimension element whose records are being queried are not available for use in order_by, because order_by acts on a subquery generated by queryDataIds that may not join in that table.

The quick solution to this problem is to be less ambitious: remove support for ordering by metadata fields, because ordering by dimension keys should work fine. While I'm sure our colleagues really want to order_by(["day_obs", "seq_num"]), we can tell them to order_by(["exposure"]) as a workaround and they'll get the same order with a better chance of good index usage anyway.

Full support for metadata-field ordering would require moving this logic from Query to QuerySummary and QueryBuilder:

Add a mapping similar to QueryWhereClause.columns to QuerySummary, probably via an analogous nested struct that processes the user's order_by argument and stores the columns it needs. The important thing is that QuerySummary.mustHaveTableJoined include any DimensionElement whose metadata fields are included in the order_by list, and something in QuerySummary should remember the order_by list itself so it can be used later.
Move the logic in _OrderByHandler.query_columns to [something called by] QueryBuilder.finish, operating on QueryBuilder._elements, and then passing the DirectQuery constructor just the list of sqlalchemy columns to order by.
Make Query.order_by return a new Query by calling Query.makeBuilder with an augmented QuerySummary that includes the order_by request, use that to make a QueryBuilder and use that to make a new Query to return.

I am not sure it is worth taking the more difficult/complete road right now, as I think most needs can be met with dimension-key ordering. But this is tied up with the question of how to handle RemoteRegistry - at least a bit - and I'll bring that up on Jira since that's where that conversation started.

python/lsst/daf/butler/datastores/chainedDatastore.py

python/lsst/daf/butler/datastores/fileDatastore.py

python/lsst/daf/butler/core/timespan.py

python/lsst/daf/butler/registry/queries/_query.py

python/lsst/daf/butler/registry/queries/_results.py

andy-slac · 2021-11-11T20:56:26Z

I think with the way queryDimensionRecords works, even metadata fields from the dimension element whose records are being queried are not available for use in order_by, because order_by acts on a subquery generated by queryDataIds that may not join in that table.

I see that queryDimensionRecords works by doing queryDataIds first and then selecting records that match returned DataIds, I don't think there are sub-query involved there. In my dumb implementation of queryDimensionRecords I order queryDataIds results and then re-order records based on that ordering.

andy-slac · 2021-11-11T21:02:29Z

And I think my general naive idea was that to order something based on metadata its corresponding dimension have to be in the query already. If you want to order based on metadata from some other dimension, we'd have to include that dimension into join, but that would be a different query with potentially different constraints?

TallJimbo · 2021-11-11T21:37:46Z

And I think my general naive idea was that to order something based on metadata its corresponding dimension have to be in the query already. If you want to order based on metadata from some other dimension, we'd have to include that dimension into join, but that would be a different query with potentially different constraints?

I think we resolved this question at the middleware standup, but for posterity, I am specifically worried about the case where the dimension is in the query, but as an optimization we did not include its table, either because a dataset with that dimension is in the query, or because some dependent dimension is in the query. The user won't be aware of those optimizations and at present can't control them, so it would be surprising for their order_by to stop working when they are adding new constraints to the query but aren't changing the dimensions.

andy-slac · 2021-11-15T00:04:51Z

@TallJimbo, I'm now trying to do the right thing and follow your recipe:

Make Query.order_by return a new Query by calling Query.makeBuilder with an augmented QuerySummary that includes the order_by request, use that to make a QueryBuilder and use that to make a new Query to return.

but without much success. Query.makeBuilder does make a builder, but that builder does not have _elements populated (naturally). To populate it I'd have to join all dimensions again calling finish(True) but I don't think this is the correct thing to do here as it will join them with the already existing query (sort of creating self-join). Do I need to re-implement makeBuilder to do something different (my guess it's not really possible to fill _elements from existing query).

TallJimbo · 2021-11-16T02:46:05Z

Ah, very good point; I was imaging that makeBuilder was much smarter than it actually is. And I don't think that can really be fixed without changing much more than just makeBuilder; the problem is that a Query just doesn't carry around enough information to reconstruct it. At this point I think it'd be better to go back to your original approach and just document the limitations.

andy-slac · 2021-11-16T04:23:11Z

@TallJimbo, I think I managed to resolve the problem by delaying the construction of a Query until the latest possible moment when order_by and limit are already known. It needed a sort of another level of indirection by passing a factory method instead of Query object to a Result constructor. Could you re-review ~~35f3ab0~~ 20ea772 to see if it's reasonable or can be improved?

For NULL timespans I think one possible option is to assume begin/end as 0 for them so that they sort before any actual timespan, that should make things work as expected for people looking for the most recent values. For advanced handling we could filter NULLs using user expressions (not sure that it is supported right now though).

SQL registry method queryDimensionRecords now returns special iterable object instead of plain iterator. Iterables returned from queryDataIds and queryDimensionRecords have new methods order_by() and limit(). Same two methods were added to Query classes. Extended unit tests for query methods to check this new functionality. Note that remote registry does not support this functionality (yet).

TallJimbo

👍 New version looks good; the extra indirection is pragmatic way to get the functionality in without a major rewrite of the queries subpackage, even though it does add some complexity that makes doing that rewrite feel a bit more important to me (to be clear, most of the problems were preexisting by far).

As for timespans, I'm pretty happy with any ordering. My main concern was actually with the endpoint SQLALchemy accessors on TimespanDatabaseRepresentation themselves, and in particular whether we can define consistent ordered comparison operations on those in the presence of NULL. I think that only matters if we also add support for them in the boolean expression language, so all this ticket does is make that seem a bit easier (when it's actually quite hard). So perhaps we should just add a disclaimer to the docs or comments on those attributes saying that Timespan methods should always be used for comparisons in boolean expressions, and attributes should not be exposed to the expression language in the future without a lot of attention to detail.

andy-slac · 2021-11-16T20:28:34Z

I have added this note to lower method (and similar to upper):

    Notes
    -----
    If database holds ``NULL`` for a timespan then the returned expression
    should evaluate to 0. Main purpose of this and `upper` method is to use
    them in generating SQL, in particular ORDER BY clause, to guarantee a
    predictable ordering. It may potentially be used for transforming
    boolean user expressions into SQL, but it will likely require extra
    attention to ordering issues.

andy-slac force-pushed the tickets/DM-32403 branch from f01e54f to 402c62a Compare November 10, 2021 19:33

andy-slac force-pushed the tickets/DM-32403 branch 3 times, most recently from 6c92507 to f00059c Compare November 11, 2021 03:21

TallJimbo reviewed Nov 11, 2021

View reviewed changes

andy-slac force-pushed the tickets/DM-32403 branch 3 times, most recently from a8b36c1 to e252dde Compare November 12, 2021 05:40

andy-slac force-pushed the tickets/DM-32403 branch from e252dde to 7b052e3 Compare November 16, 2021 01:00

Fix few typos in comments/docstrings

11151e1

andy-slac force-pushed the tickets/DM-32403 branch 2 times, most recently from 48200f1 to 3a0cf92 Compare November 16, 2021 01:51

andy-slac added 3 commits November 15, 2021 20:24

Add orderby/limit/offset options to cli commands.

f89880b

Add release notes and documentation for new methods

230751b

andy-slac force-pushed the tickets/DM-32403 branch from 3a0cf92 to 230751b Compare November 16, 2021 04:25

TallJimbo approved these changes Nov 16, 2021

View reviewed changes

Improve handling of NULL timespans

0501a6f

andy-slac force-pushed the tickets/DM-32403 branch from bc65bb0 to 0501a6f Compare November 16, 2021 20:27

andy-slac merged commit b15e845 into master Nov 16, 2021

andy-slac deleted the tickets/DM-32403 branch November 16, 2021 23:07

timj mentioned this pull request Jan 12, 2023

DM-30439: Refactor DataCoordinate #769

Draft

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DM-32403: Add support for order_by/limit to query results #601

DM-32403: Add support for order_by/limit to query results #601

andy-slac commented Nov 10, 2021 •

edited

codecov bot commented Nov 10, 2021 •

edited

TallJimbo left a comment

andy-slac commented Nov 11, 2021

andy-slac commented Nov 11, 2021

TallJimbo commented Nov 11, 2021 •

edited

andy-slac commented Nov 15, 2021

TallJimbo commented Nov 16, 2021

andy-slac commented Nov 16, 2021 •

edited

TallJimbo left a comment •

edited

andy-slac commented Nov 16, 2021

DM-32403: Add support for order_by/limit to query results #601

DM-32403: Add support for order_by/limit to query results #601

Conversation

andy-slac commented Nov 10, 2021 • edited

Checklist

codecov bot commented Nov 10, 2021 • edited

Codecov Report

TallJimbo left a comment

Choose a reason for hiding this comment

andy-slac commented Nov 11, 2021

andy-slac commented Nov 11, 2021

TallJimbo commented Nov 11, 2021 • edited

andy-slac commented Nov 15, 2021

TallJimbo commented Nov 16, 2021

andy-slac commented Nov 16, 2021 • edited

TallJimbo left a comment • edited

Choose a reason for hiding this comment

andy-slac commented Nov 16, 2021

andy-slac commented Nov 10, 2021 •

edited

codecov bot commented Nov 10, 2021 •

edited

TallJimbo commented Nov 11, 2021 •

edited

andy-slac commented Nov 16, 2021 •

edited

TallJimbo left a comment •

edited