DM-33164: Reimplement `order_by` method for query results #630

andy-slac · 2022-01-11T04:34:01Z

DataCoordinateQueryResults is now using window function row_number() ordering rank for DataIds query, this simplifies handling of the ordering columns and generation of the materialized query. queryDimensionRecords() is doing ordering in-memory by using an ordering function that compares attributes of DimensionRecord instances.

Checklist

ran Jenkins
added a release note for user-visible changes to doc/changes

codecov · 2022-01-11T04:47:21Z

Codecov Report

Merging #630 (4df5dc9) into main (294c620) will decrease coverage by 0.00%.
The diff coverage is 86.50%.

@@            Coverage Diff             @@
##             main     #630      +/-   ##
==========================================
- Coverage   84.12%   84.11%   -0.01%     
==========================================
  Files         237      237              
  Lines       30075    30138      +63     
  Branches     4996     5014      +18     
==========================================
+ Hits        25301    25351      +50     
- Misses       3635     3643       +8     
- Partials     1139     1144       +5

Impacted Files	Coverage Δ
.../butler/registry/queries/expressions/categorize.py	`80.45% <62.50%> (-6.85%)`	⬇️
...ython/lsst/daf/butler/registry/queries/_results.py	`82.41% <81.81%> (-0.64%)`	⬇️
python/lsst/daf/butler/core/simpleQuery.py	`94.54% <100.00%> (+1.52%)`	⬆️
...ython/lsst/daf/butler/registry/queries/_builder.py	`89.38% <100.00%> (-0.51%)`	⬇️
python/lsst/daf/butler/registry/queries/_query.py	`83.75% <100.00%> (+0.33%)`	⬆️
...ython/lsst/daf/butler/registry/queries/_structs.py	`90.38% <100.00%> (+0.74%)`	⬆️
python/lsst/daf/butler/registry/tests/_registry.py	`99.00% <100.00%> (+<0.01%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 294c620...4df5dc9. Read the comment docs.

TallJimbo

Looks good; code looks cleaner overall, I think, in addition to fixing the bug.

But we should probably file a follow-up ticket to actually move the dimension record sorting into the database, and in the meantime caution people like @mfisherlevine that if they want an efficient ORDER BY ... LIMIT 1 on a dimension records query, they should really use queryDataIds, not queryDimensionRecords, at least for now. I will also take a look at moving dimension record sorting into the DB on DM-31725, whenever I can actually get back to it, and that may be the best time to address it if deeper structural changes are needed.

andy-slac · 2022-01-13T17:53:17Z

I think even now (and in unpatched code) LIMIT 1 should work fine, as it is applied in queryDataIds (which is called by queryDimensionRecords). Sorting 1 record in memory should be OK. The problem with the current in-memory sort is when people don't use LIMIT, then it needs to sort potentially large number of records in memory. Even that is not a big issue with this patch, at least until we reach millions of records per query.

`DataCoordinateQueryResults` is now using window function `row_number()` ordering rank for DataIds query, this simplifies handling of the ordering columns and generation of the materialized query. `queryDimensionRecords()` is doing ordering in-memory by using an ordering function that compares attributes of `DimensionRecord` instances.

TallJimbo · 2022-01-13T18:15:53Z

Oh, that makes me realize that I didn't check whether the in-memory sorting of dimension records happens before or after the LIMIT 1 would be replied. If it happens after, doesn't that mean that we'd get a completely arbitrary single record, rather than the first one in the sort order?

andy-slac · 2022-01-13T18:25:35Z

Both LIMIT and ORDER BY apply to a sub-query generated by queryDataId, so limit happens after ordering. And we have to sort returned records again in memory because parent query does not/ cannot preserve order from sub-query. Here is an example of a full query:

SELECT
    main_20210215.exposure.instrument,
    main_20210215.exposure.id,
    -- .. bunch of other columns
    main_20210215.exposure.zenith_angle,
    main_20210215.exposure.timespan
FROM
    main_20210215.exposure
    JOIN (
        SELECT
            main_20210215.physical_filter.band AS band,
            main_20210215.physical_filter.instrument AS instrument,
            main_20210215.physical_filter.name AS physical_filter,
            main_20210215.exposure.id AS exposure,
            row_number() OVER (
                ORDER BY
                    main_20210215.exposure.day_obs ASC
            ) AS _orderby
        FROM
            (
                SELECT
                    main_20210215.dataset_tags_00000002.instrument AS instrument,
                    main_20210215.dataset_tags_00000002.detector AS detector,
                    main_20210215.dataset_tags_00000002.exposure AS exposure
                FROM
                    main_20210215.dataset_tags_00000002
                WHERE
                    main_20210215.dataset_tags_00000002.dataset_type_id = %(dataset_type_id_1)s
                    AND main_20210215.dataset_tags_00000002.collection_name = %(collection_name_1)s
            ) AS raw
            JOIN main_20210215.exposure ON raw.instrument = main_20210215.exposure.instrument
            AND raw.exposure = main_20210215.exposure.id
            JOIN main_20210215.physical_filter ON raw.instrument = main_20210215.physical_filter.instrument
            AND main_20210215.exposure.instrument = main_20210215.physical_filter.instrument
            AND main_20210215.exposure.physical_filter = main_20210215.physical_filter.name
        WHERE
            main_20210215.exposure.day_obs > %(param_1)s
            AND raw.instrument = %(instrument_1)s
            AND main_20210215.exposure.instrument = %(instrument_2)s
            AND main_20210215.physical_filter.instrument = %(instrument_3)s
        ORDER BY
            _orderby
        LIMIT
            %(param_2)s
    ) AS c ON main_20210215.exposure.instrument = c.instrument
    AND main_20210215.exposure.id = c.exposure

TallJimbo · 2022-01-13T18:42:50Z

Ah, that makes sense, thanks. And I'm happy with that; I could imagine it giving the query optimizer some trouble sometimes, but it's hard to do better without much bigger changes.

andy-slac changed the title ~~Reimplement order_by method for query results (DM-33164)~~ DM-33164: Reimplement order_by method for query results Jan 11, 2022

andy-slac force-pushed the tickets/DM-33164 branch 2 times, most recently from 8d7bf43 to 5cde0aa Compare January 11, 2022 04:46

andy-slac force-pushed the tickets/DM-33164 branch 5 times, most recently from 2754147 to 90acda6 Compare January 12, 2022 03:03

TallJimbo approved these changes Jan 13, 2022

View reviewed changes

andy-slac force-pushed the tickets/DM-33164 branch from 90acda6 to 4df5dc9 Compare January 13, 2022 17:55

andy-slac merged commit 8f7140c into main Jan 13, 2022

andy-slac deleted the tickets/DM-33164 branch January 13, 2022 18:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DM-33164: Reimplement `order_by` method for query results #630

DM-33164: Reimplement `order_by` method for query results #630

andy-slac commented Jan 11, 2022 •

edited

codecov bot commented Jan 11, 2022 •

edited

TallJimbo left a comment

andy-slac commented Jan 13, 2022

TallJimbo commented Jan 13, 2022

andy-slac commented Jan 13, 2022 •

edited

TallJimbo commented Jan 13, 2022

DM-33164: Reimplement order_by method for query results #630

DM-33164: Reimplement order_by method for query results #630

Conversation

andy-slac commented Jan 11, 2022 • edited

Checklist

codecov bot commented Jan 11, 2022 • edited

Codecov Report

TallJimbo left a comment

Choose a reason for hiding this comment

andy-slac commented Jan 13, 2022

TallJimbo commented Jan 13, 2022

andy-slac commented Jan 13, 2022 • edited

TallJimbo commented Jan 13, 2022

DM-33164: Reimplement `order_by` method for query results #630

DM-33164: Reimplement `order_by` method for query results #630

andy-slac commented Jan 11, 2022 •

edited

codecov bot commented Jan 11, 2022 •

edited

andy-slac commented Jan 13, 2022 •

edited