DM-32403: Add support for order_by/limit to query results #601
Codecov Report
@@            Coverage Diff             @@
##           master     #601      +/-   ##
==========================================
+ Coverage   83.54%   83.64%   +0.10%
==========================================
  Files         241      241
  Lines       30316    30624     +308
  Branches     4528     4591      +63
==========================================
+ Hits        25326    25614     +288
- Misses       3795     3812      +17
- Partials     1195     1198       +3
As I noted in one or two of the line comments, this has a pretty big limitation, in that you can only order by metadata fields if something else has happened to ensure the table that provides those fields has been joined into the query, and this is frequently not the case, especially when datasets are included in the query. In fact, I think with the way `queryDimensionRecords` works, even metadata fields from the dimension element whose records are being queried are not available for use in `order_by`, because `order_by` acts on a subquery generated by `queryDataIds` that may not join in that table.
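To make the limitation concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The tables and columns (`exposure`, `dataset`, `day_obs`, `seq_num`) are a simplified, hypothetical stand-in and not the actual registry schema. It only illustrates the general SQL behavior: ordering by a metadata column fails unless the table that holds it has been joined into the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical, simplified tables; not the real registry schema.
    CREATE TABLE exposure (id INTEGER PRIMARY KEY, day_obs INTEGER, seq_num INTEGER);
    CREATE TABLE dataset (dataset_id INTEGER PRIMARY KEY, exposure INTEGER);
    INSERT INTO exposure VALUES (101, 20210102, 1), (102, 20210101, 2), (103, 20210101, 1);
    INSERT INTO dataset VALUES (1, 101), (2, 102), (3, 103);
    """
)

# The dimension key is available from the dataset table alone, so the
# exposure table is not joined and its metadata columns are not visible.
try:
    conn.execute("SELECT exposure FROM dataset ORDER BY day_obs, seq_num").fetchall()
    order_by_metadata_failed = False
except sqlite3.OperationalError:
    order_by_metadata_failed = True

# Once the exposure table is joined explicitly, the same ordering works.
by_metadata = [
    row[0]
    for row in conn.execute(
        "SELECT d.exposure FROM dataset d JOIN exposure e ON e.id = d.exposure "
        "ORDER BY e.day_obs, e.seq_num"
    )
]
```

In this toy setup the first query raises because `day_obs` is not a column of anything in the FROM clause, which is the same kind of failure a user would hit when an optimization drops the dimension table from the query.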
The quick solution to this problem is to be less ambitious: remove support for ordering by metadata fields, because ordering by dimension keys should work fine. While I'm sure our colleagues really want to `order_by(["day_obs", "seq_num"])`, we can tell them to `order_by(["exposure"])` as a workaround and they'll get the same order with a better chance of good index usage anyway.
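A small sketch of why the workaround can give the same order. It assumes exposure IDs increase monotonically with `(day_obs, seq_num)`; the `make_id` packing below is purely hypothetical and chosen only so that the assumption holds in this toy example.

```python
from typing import NamedTuple


class ExposureRecord(NamedTuple):
    id: int
    day_obs: int
    seq_num: int


def make_id(day_obs: int, seq_num: int) -> int:
    # Hypothetical packing; any scheme that is monotonic in
    # (day_obs, seq_num) would behave the same way here.
    return day_obs * 100_000 + seq_num


records = [
    ExposureRecord(make_id(d, s), d, s)
    for d, s in [(20210102, 1), (20210101, 7), (20210101, 2)]
]

# Ordering by the two metadata fields...
by_metadata = sorted(records, key=lambda r: (r.day_obs, r.seq_num))
# ...matches ordering by the single dimension key under the assumption above.
by_key = sorted(records, key=lambda r: r.id)
same_order = by_key == by_metadata
```

If an instrument assigns exposure IDs in some other way, the two orderings are not guaranteed to agree.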
Full support for metadata-field ordering would require moving this logic from `Query` to `QuerySummary` and `QueryBuilder`:
- Add a mapping similar to `QueryWhereClause.columns` to `QuerySummary`, probably via an analogous nested struct that processes the user's order_by argument and stores the columns it needs. The important thing is that `QuerySummary.mustHaveTableJoined` include any `DimensionElement` whose metadata fields are included in the `order_by` list, and something in `QuerySummary` should remember the `order_by` list itself so it can be used later.
- Move the logic in `_OrderByHandler.query_columns` to [something called by] `QueryBuilder.finish`, operating on `QueryBuilder._elements`, and then passing the `DirectQuery` constructor just the list of sqlalchemy columns to order by.
- Make `Query.order_by` return a new `Query` by calling `Query.makeBuilder` with an augmented `QuerySummary` that includes the `order_by` request, use that to make a `QueryBuilder` and use that to make a new `Query` to return.
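The three items above can be sketched with toy classes. Everything here is a hypothetical, heavily simplified stand-in that only borrows the names from the discussion: the real `QuerySummary`, `QueryBuilder` and `Query` have different constructors and work with sqlalchemy columns rather than strings, and `METADATA_OWNER` and `OrderByClause` are invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, replace

# Hypothetical lookup from metadata field name to the element that owns it.
METADATA_OWNER = {"day_obs": "exposure", "seq_num": "exposure"}


@dataclass(frozen=True)
class OrderByClause:
    """Toy nested struct that remembers the user's order_by list (first item)."""

    terms: tuple[str, ...] = ()

    @property
    def elements(self) -> frozenset[str]:
        # Elements whose tables hold the metadata fields being ordered on.
        return frozenset(METADATA_OWNER[t] for t in self.terms if t in METADATA_OWNER)


@dataclass(frozen=True)
class QuerySummary:
    dimensions: frozenset[str]
    order_by: OrderByClause = OrderByClause()

    @property
    def must_have_table_joined(self) -> frozenset[str]:
        # A real summary would also include elements needed by the WHERE clause.
        return self.order_by.elements


@dataclass(frozen=True)
class Query:
    summary: QuerySummary
    joined_tables: frozenset[str] = frozenset()
    order_by_columns: tuple[str, ...] = ()

    def make_builder(self, summary: QuerySummary) -> QueryBuilder:
        return QueryBuilder(summary)

    def order_by(self, terms: list[str]) -> Query:
        # Third item: augment the summary and rebuild instead of mutating.
        summary = replace(self.summary, order_by=OrderByClause(tuple(terms)))
        return self.make_builder(summary).finish()


@dataclass
class QueryBuilder:
    summary: QuerySummary

    def finish(self) -> Query:
        # Second item: resolve order_by terms to columns once joins are known.
        joined = self.summary.must_have_table_joined
        columns = tuple(
            f"{METADATA_OWNER[t]}.{t}" if t in METADATA_OWNER else t
            for t in self.summary.order_by.terms
        )
        return Query(self.summary, joined, columns)


base = Query(QuerySummary(frozenset({"exposure", "detector"})))
ordered = base.order_by(["day_obs", "seq_num"])
```

The point of the sketch is the data flow: the ordering request travels through the summary, so the builder knows which tables must be joined before any columns are resolved, and the original `base` query is left untouched.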
I am not sure it is worth taking the more difficult/complete road right now, as I think most needs can be met with dimension-key ordering. But this is tied up with the question of how to handle RemoteRegistry - at least a bit - and I'll bring that up on Jira since that's where that conversation started.
I see that |
And I think my general naive idea was that to order something based on metadata, its corresponding dimension has to be in the query already. If you want to order based on metadata from some other dimension, we'd have to include that dimension in the join, but that would be a different query with potentially different constraints?
I think we resolved this question at the middleware standup, but for posterity, I am specifically worried about the case where the dimension is in the query, but as an optimization we did not include its table, either because a dataset with that dimension is in the query, or because some dependent dimension is in the query. The user won't be aware of those optimizations and at present can't control them, so it would be surprising for their order_by to stop working when they are adding new constraints to the query but aren't changing the dimensions.
@TallJimbo, I'm now trying to do the right thing and follow your recipe:
but without much success. |
Ah, very good point; I was imagining that |
@TallJimbo, I think I managed to resolve the problem by delaying the construction of a
For NULL timespans, I think one possible option is to assume begin/end as 0 for them so that they sort before any actual timespan; that should make things work as expected for people looking for the most recent values. For advanced handling we could filter NULLs using user expressions (not sure that it is supported right now though).
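A minimal sketch of the "treat NULL as 0" idea, again with `sqlite3` and a hypothetical `visit` table whose `ts_begin`/`ts_end` columns stand in for timespan endpoints. `COALESCE` is used here only as one possible way to express the substitution in SQL; it is not necessarily how the registry does it. Spelling the substitution out also removes any dependence on the backend's default NULL placement, which differs between databases (for ascending order, SQLite puts NULLs first and PostgreSQL puts them last).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical table; ts_begin/ts_end stand in for timespan endpoints.
    CREATE TABLE visit (name TEXT, ts_begin INTEGER, ts_end INTEGER);
    INSERT INTO visit VALUES ('a', 100, 200), ('b', NULL, NULL), ('c', 300, 400);
    """
)

# NULL endpoints are treated as 0, so they sort before any actual timespan.
ascending = [
    row[0]
    for row in conn.execute("SELECT name FROM visit ORDER BY COALESCE(ts_begin, 0), name")
]

# Someone looking for the most recent value sorts descending and takes one row;
# the NULL timespan cannot shadow the real ones.
most_recent = conn.execute(
    "SELECT name FROM visit ORDER BY COALESCE(ts_begin, 0) DESC LIMIT 1"
).fetchone()[0]
```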
SQL registry method queryDimensionRecords now returns a special iterable object instead of a plain iterator. Iterables returned from queryDataIds and queryDimensionRecords have new methods order_by() and limit(). The same two methods were added to Query classes. Extended unit tests for query methods to check this new functionality. Note that remote registry does not support this functionality (yet).
👍 New version looks good; the extra indirection is a pragmatic way to get the functionality in without a major rewrite of the queries subpackage, even though it does add some complexity that makes doing that rewrite feel a bit more important to me (to be clear, most of the problems were preexisting by far).
As for timespans, I'm pretty happy with any ordering. My main concern was actually with the endpoint SQLAlchemy accessors on `TimespanDatabaseRepresentation` themselves, and in particular whether we can define consistent ordered comparison operations on those in the presence of NULL. I think that only matters if we also add support for them in the boolean expression language, so all this ticket does is make that seem a bit easier (when it's actually quite hard). So perhaps we should just add a disclaimer to the docs or comments on those attributes saying that `Timespan` methods should always be used for comparisons in boolean expressions, and attributes should not be exposed to the expression language in the future without a lot of attention to detail.
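The difficulty with NULL endpoints comes from SQL's three-valued logic, which this small `sqlite3` sketch illustrates. The table `t` and its column are hypothetical, and this is generic SQL behavior rather than anything specific to `TimespanDatabaseRepresentation`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Any ordered comparison with NULL yields NULL (None in Python), and so does
# its negation, so "a < b" and "NOT (a < b)" are not complementary.
less, greater_equal, negated = conn.execute(
    "SELECT NULL < 5, NULL >= 5, NOT (NULL < 5)"
).fetchone()

conn.executescript(
    """
    -- Hypothetical table with one NULL endpoint.
    CREATE TABLE t (ts_begin INTEGER);
    INSERT INTO t VALUES (100), (NULL), (300);
    """
)
below = conn.execute("SELECT COUNT(*) FROM t WHERE ts_begin < 150").fetchone()[0]
not_below = conn.execute("SELECT COUNT(*) FROM t WHERE NOT (ts_begin < 150)").fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```

The row with the NULL endpoint is dropped by both the comparison and its negation, which is the kind of surprise a boolean expression language would have to handle explicitly if raw endpoints were exposed to it.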
I have added this note to
|
SQL registry method `queryDimensionRecords` now returns special iterable object instead of plain iterator. Iterables returned from `queryDataIds` and `queryDimensionRecords` have new methods `order_by()` and `limit()`. Same two methods were added to `Query` classes. Extended unit tests for query methods to check this new functionality.
Note that remote registry does not support this functionality (yet)
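As a rough illustration of why an iterable object is needed rather than a plain iterator, here is a toy results class with chainable `order_by()` and `limit()`. The class `ToyQueryResults` is hypothetical and works on in-memory dicts. The real result classes apply ordering and limits in the database query, and their exact signatures may differ from this sketch.

```python
from __future__ import annotations

from typing import Any, Iterable, Iterator


class ToyQueryResults:
    """Hypothetical stand-in for the iterable returned by the query methods."""

    def __init__(self, rows: Iterable[dict[str, Any]]):
        self._rows = list(rows)
        self._order_by = ()
        self._limit = None

    def order_by(self, *keys: str) -> ToyQueryResults:
        # Only record the request; nothing is evaluated until iteration.
        self._order_by = keys
        return self

    def limit(self, n: int) -> ToyQueryResults:
        self._limit = n
        return self

    def __iter__(self) -> Iterator[dict[str, Any]]:
        rows = self._rows
        if self._order_by:
            rows = sorted(rows, key=lambda r: tuple(r[k] for k in self._order_by))
        if self._limit is not None:
            rows = rows[: self._limit]
        return iter(rows)


results = ToyQueryResults([{"exposure": 3}, {"exposure": 1}, {"exposure": 2}])
first_two = list(results.order_by("exposure").limit(2))
```

A plain iterator has nowhere to hang these methods and may already be partly consumed, whereas an iterable object can defer execution until all modifiers have been applied.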
Checklist
doc/changes