Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-33164: Reimplement order_by method for query results #630

Merged
merged 1 commit into from Jan 13, 2022

Conversation

andy-slac
Copy link
Contributor

@andy-slac andy-slac commented Jan 11, 2022

DataCoordinateQueryResults is now using window function row_number() ordering rank for DataIds query, this simplifies handling of the ordering columns and generation of the materialized query. queryDimensionRecords() is doing ordering in-memory by using an ordering function that compares attributes of DimensionRecord instances.

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

@andy-slac andy-slac changed the title Reimplement order_by method for query results (DM-33164) DM-33164: Reimplement order_by method for query results Jan 11, 2022
@andy-slac andy-slac force-pushed the tickets/DM-33164 branch 2 times, most recently from 8d7bf43 to 5cde0aa Compare January 11, 2022 04:46
@codecov
Copy link

codecov bot commented Jan 11, 2022

Codecov Report

Merging #630 (4df5dc9) into main (294c620) will decrease coverage by 0.00%.
The diff coverage is 86.50%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #630      +/-   ##
==========================================
- Coverage   84.12%   84.11%   -0.01%     
==========================================
  Files         237      237              
  Lines       30075    30138      +63     
  Branches     4996     5014      +18     
==========================================
+ Hits        25301    25351      +50     
- Misses       3635     3643       +8     
- Partials     1139     1144       +5     
Impacted Files Coverage Δ
.../butler/registry/queries/expressions/categorize.py 80.45% <62.50%> (-6.85%) ⬇️
...ython/lsst/daf/butler/registry/queries/_results.py 82.41% <81.81%> (-0.64%) ⬇️
python/lsst/daf/butler/core/simpleQuery.py 94.54% <100.00%> (+1.52%) ⬆️
...ython/lsst/daf/butler/registry/queries/_builder.py 89.38% <100.00%> (-0.51%) ⬇️
python/lsst/daf/butler/registry/queries/_query.py 83.75% <100.00%> (+0.33%) ⬆️
...ython/lsst/daf/butler/registry/queries/_structs.py 90.38% <100.00%> (+0.74%) ⬆️
python/lsst/daf/butler/registry/tests/_registry.py 99.00% <100.00%> (+<0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 294c620...4df5dc9. Read the comment docs.

@andy-slac andy-slac force-pushed the tickets/DM-33164 branch 5 times, most recently from 2754147 to 90acda6 Compare January 12, 2022 03:03
Copy link
Member

@TallJimbo TallJimbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good; code looks cleaner overall, I think, in addition to fixing the bug.

But we should probably file a follow-up ticket to actually move the dimension record sorting into the database, and in the meantime caution people like @mfisherlevine that if they want an efficient ORDER BY ... LIMIT 1 on a dimension records query, they should really use queryDataIds, not queryDimensionRecords, at least for now. I will also take a look at moving dimension record sorting into the DB on DM-31725, whenever I can actually get back to it, and that may be the best time to address it if deeper structural changes are needed.

@andy-slac
Copy link
Contributor Author

I think even now (and in unpatched code) LIMIT 1 should work fine, as it is applied in queryDataIds (which is called by queryDimensionRecords). Sorting 1 record in memory should be OK. The problem with the current in-memory sort is when people don't use LIMIT, then it needs to sort potentially large number of records in memory. Even that is not a big issue with this patch, at least until we reach millions of records per query.

`DataCoordinateQueryResults` is now using window function `row_number()`
ordering rank for DataIds query, this simplifies handling of the
ordering columns and generation of the materialized query.
`queryDimensionRecords()` is doing ordering in-memory by using an
ordering function that compares attributes of `DimensionRecord`
instances.
@TallJimbo
Copy link
Member

Oh, that makes me realize that I didn't check whether the in-memory sorting of dimension records happens before or after the LIMIT 1 would be replied. If it happens after, doesn't that mean that we'd get a completely arbitrary single record, rather than the first one in the sort order?

@andy-slac
Copy link
Contributor Author

andy-slac commented Jan 13, 2022

Both LIMIT and ORDER BY apply to a sub-query generated by queryDataId, so limit happens after ordering. And we have to sort returned records again in memory because parent query does not/ cannot preserve order from sub-query. Here is an example of a full query:

SELECT
    main_20210215.exposure.instrument,
    main_20210215.exposure.id,
    -- .. bunch of other columns
    main_20210215.exposure.zenith_angle,
    main_20210215.exposure.timespan
FROM
    main_20210215.exposure
    JOIN (
        SELECT
            main_20210215.physical_filter.band AS band,
            main_20210215.physical_filter.instrument AS instrument,
            main_20210215.physical_filter.name AS physical_filter,
            main_20210215.exposure.id AS exposure,
            row_number() OVER (
                ORDER BY
                    main_20210215.exposure.day_obs ASC
            ) AS _orderby
        FROM
            (
                SELECT
                    main_20210215.dataset_tags_00000002.instrument AS instrument,
                    main_20210215.dataset_tags_00000002.detector AS detector,
                    main_20210215.dataset_tags_00000002.exposure AS exposure
                FROM
                    main_20210215.dataset_tags_00000002
                WHERE
                    main_20210215.dataset_tags_00000002.dataset_type_id = %(dataset_type_id_1)s
                    AND main_20210215.dataset_tags_00000002.collection_name = %(collection_name_1)s
            ) AS raw
            JOIN main_20210215.exposure ON raw.instrument = main_20210215.exposure.instrument
            AND raw.exposure = main_20210215.exposure.id
            JOIN main_20210215.physical_filter ON raw.instrument = main_20210215.physical_filter.instrument
            AND main_20210215.exposure.instrument = main_20210215.physical_filter.instrument
            AND main_20210215.exposure.physical_filter = main_20210215.physical_filter.name
        WHERE
            main_20210215.exposure.day_obs > %(param_1)s
            AND raw.instrument = %(instrument_1)s
            AND main_20210215.exposure.instrument = %(instrument_2)s
            AND main_20210215.physical_filter.instrument = %(instrument_3)s
        ORDER BY
            _orderby
        LIMIT
            %(param_2)s
    ) AS c ON main_20210215.exposure.instrument = c.instrument
    AND main_20210215.exposure.id = c.exposure

@TallJimbo
Copy link
Member

Ah, that makes sense, thanks. And I'm happy with that; I could imagine it giving the query optimizer some trouble sometimes, but it's hard to do better without much bigger changes.

@andy-slac andy-slac merged commit 8f7140c into main Jan 13, 2022
@andy-slac andy-slac deleted the tickets/DM-33164 branch January 13, 2022 18:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants