
Improve performance of token sort #3823

Merged: 1 commit merged into main on Jan 27, 2024
Conversation

codyebberson (Member)

We found a particularly poorly performing query today.

The FHIR query was Observation?_sort=code: all Observations sorted by the "code" search parameter.

This was run in a Medplum project with over 3 million Observation resources.

That FHIR query translated to the following SQL:

SELECT "Observation"."id", "Observation"."content"
FROM "Observation"
INNER JOIN "Observation_Token" AS "T1"
ON ("Observation"."id" = "T1"."resourceId" AND "T1"."code" = $1 AND "T1"."code" = $2)
WHERE "Observation"."deleted" = $3 
AND "Observation"."compartments" IS NOT NULL 
AND "Observation"."compartments" && ARRAY[$4,$5]::UUID[]
GROUP BY "Observation"."id" 
ORDER BY MIN("T1"."value") 
LIMIT ?

The first red flag here is ORDER BY MIN("T1"."value") - when using an aggregate function like this, Postgres cannot take advantage of the index, and the performance tanks.
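To make the index issue concrete, here is a small sketch using SQLite as a stand-in for Postgres (the planners differ, but the effect is the same); the table and index names are hypothetical:

```python
import sqlite3

# Minimal lookup-table shape: an index on (code, value) supports a plain
# ORDER BY value for a fixed code, but not ORDER BY MIN(value) over groups.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tok (resourceId TEXT, code TEXT, value TEXT);
    CREATE INDEX tok_code_value ON tok (code, value);
""")

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail).
    return "\n".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Plain ORDER BY on the indexed column: the planner can walk the index in order.
indexed = plan("SELECT resourceId FROM tok WHERE code = 'c' ORDER BY value LIMIT 20")

# ORDER BY MIN(value): the aggregate values only exist after grouping,
# so the engine must materialize and sort them separately.
aggregated = plan(
    "SELECT resourceId FROM tok WHERE code = 'c' "
    "GROUP BY resourceId ORDER BY MIN(value) LIMIT 20"
)

print(indexed)     # uses the index; no separate sort step for the ORDER BY
print(aggregated)  # needs a temp B-tree to sort the aggregated values
```

The same logic is why Postgres abandons the index in the original query: the sort key is computed per group, not stored in any index.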

Why use an aggregate function in the first place? When joining on lookup tables, we must return only one row per resource, even when the lookup table contains multiple matching rows for that resource.
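A tiny reproduction of the duplicate-row problem, again using SQLite for illustration; "obs" stands in for the resource table and "tok" for the token lookup table (names and values are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE obs (id TEXT PRIMARY KEY);
    CREATE TABLE tok (resourceId TEXT, value TEXT);
    INSERT INTO obs VALUES ('obs-1');
    -- One Observation can have several token rows (e.g. multiple codings).
    INSERT INTO tok VALUES ('obs-1', '1234-5'), ('obs-1', '8480-6');
""")

# A naive join returns one output row per matching token row.
naive = conn.execute(
    "SELECT obs.id FROM obs JOIN tok ON obs.id = tok.resourceId"
).fetchall()
print(naive)  # [('obs-1',), ('obs-1',)] -- the resource appears twice

# Grouping with an aggregate sort key collapses it back to one row per resource.
grouped = conn.execute(
    "SELECT obs.id FROM obs JOIN tok ON obs.id = tok.resourceId "
    "GROUP BY obs.id ORDER BY MIN(tok.value)"
).fetchall()
print(grouped)  # [('obs-1',)]
```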

This PR converts this into the following query:

SELECT "Observation"."id", "Observation"."content" 
FROM "Observation" 
INNER JOIN (
  SELECT DISTINCT ON ("Observation_Token"."resourceId") "Observation_Token"."resourceId", "Observation_Token"."value" 
  FROM "Observation_Token"
  WHERE "Observation_Token"."code" = $1
) AS "T1" ON "Observation"."id" = "T1"."resourceId" 
WHERE "Observation"."deleted" = $2 
AND "Observation"."compartments" IS NOT NULL 
AND "Observation"."compartments" && ARRAY[$3,$4]::UUID[]
ORDER BY "T1"."value" 
LIMIT ?

This moves the "only one resource" logic into the subquery rather than in the outer query.
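DISTINCT ON is Postgres-specific, but its semantics can be sketched in plain Python: sort by (resourceId, value), then keep the first row of each resourceId group. The rows below are hypothetical:

```python
from itertools import groupby

rows = [("obs-1", "b"), ("obs-2", "x"), ("obs-1", "a")]

# ORDER BY "resourceId", "value" ...
rows.sort()

# ... then DISTINCT ON ("resourceId") keeps the first row per resourceId.
deduped = [next(group) for _, group in groupby(rows, key=lambda r: r[0])]
print(deduped)  # [('obs-1', 'a'), ('obs-2', 'x')] -- one row per resource
```

This is why the subquery's ORDER BY matters: it determines which value survives for each resource, and the surviving value is then a plain sortable column in the outer query.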

Postgres 10+ is quite smart about aggregating this: https://www.percona.com/blog/sql-optimizations-in-postgresql-in-vs-exists-vs-any-all-vs-join/

medplum=# EXPLAIN
medplum-# SELECT "Observation"."id", "Observation"."content"
medplum-# FROM "Observation"
medplum-# INNER JOIN "Observation_Token" AS "T1" ON ("Observation"."id" = "T1"."resourceId" AND "T1"."code" = 'code')
medplum-# WHERE ("Observation"."deleted" = false AND ("Observation"."compartments" IS NOT NULL AND "Observation"."compartments" && ARRAY['88bf1215-fe40-4dc6-80cb-a9540ddd98a5']::UUID[]))
medplum-# GROUP BY "Observation"."id"
medplum-# ORDER BY MIN("T1"."value")
medplum-# LIMIT 20;
                                                                                QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=38491.98..38492.03 rows=20 width=1237)
   ->  Sort  (cost=38491.98..38559.82 rows=27133 width=1237)
         Sort Key: (min("T1".value))
         ->  Finalize GroupAggregate  (cost=32851.46..37769.98 rows=27133 width=1237)
               Group Key: "Observation".id
               ->  Gather Merge  (cost=32851.46..37318.77 rows=35976 width=1237)
                     Workers Planned: 2
                     ->  Partial GroupAggregate  (cost=31851.44..32166.23 rows=17988 width=1237)
                           Group Key: "Observation".id
                           ->  Sort  (cost=31851.44..31896.41 rows=17988 width=1214)
                                 Sort Key: "Observation".id
                                 ->  Parallel Hash Join  (cost=6209.40..21049.66 rows=17988 width=1214)
                                       Hash Cond: ("Observation".id = "T1"."resourceId")
                                       ->  Parallel Seq Scan on "Observation"  (cost=0.00..14694.65 rows=11305 width=1205)
                                             Filter: ((NOT deleted) AND (compartments IS NOT NULL) AND (compartments && '{88bf1215-fe40-4dc6-80cb-a9540ddd98a5}'::uuid[]))
                                       ->  Parallel Hash  (cost=5622.84..5622.84 rows=46925 width=25)
                                             ->  Parallel Seq Scan on "Observation_Token" "T1"  (cost=0.00..5622.84 rows=46925 width=25)
                                                   Filter: (code = 'code'::text)
(18 rows)

medplum=# EXPLAIN
medplum-# SELECT
medplum-#     "Observation"."id",
medplum-#     "Observation"."content"
medplum-# FROM
medplum-#     "Observation"
medplum-# INNER JOIN (
medplum(#     SELECT DISTINCT ON ("resourceId")
medplum(#         "resourceId",
medplum(#         "value"
medplum(#     FROM
medplum(#         "Observation_Token"
medplum(#     WHERE
medplum(#         "code" = 'code'
medplum(#     ORDER BY
medplum(#         "resourceId",
medplum(#         "value" -- or whatever criteria you want for choosing the row
medplum(# ) AS "T1" ON "Observation"."id" = "T1"."resourceId"
medplum-# WHERE
medplum-#     "Observation"."deleted" = false
medplum-#     AND "Observation"."compartments" IS NOT NULL
medplum-#     AND "Observation"."compartments" && ARRAY['88bf1215-fe40-4dc6-80cb-a9540ddd98a5']::UUID[]
medplum-# ORDER BY
medplum-#     "T1"."value"
medplum-# LIMIT 20;
                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=37930.03..37930.08 rows=20 width=1214)
   ->  Sort  (cost=37930.03..37979.31 rows=19713 width=1214)
         Sort Key: "T1".value
         ->  Hash Join  (cost=22123.49..37405.47 rows=19713 width=1214)
               Hash Cond: ("Observation".id = "T1"."resourceId")
               ->  Seq Scan on "Observation"  (cost=0.00..15210.76 rows=27133 width=1205)
                     Filter: ((NOT deleted) AND (compartments IS NOT NULL) AND (compartments && '{88bf1215-fe40-4dc6-80cb-a9540ddd98a5}'::uuid[]))
               ->  Hash  (cost=21480.69..21480.69 rows=51424 width=25)
                     ->  Subquery Scan on "T1"  (cost=20403.34..21480.69 rows=51424 width=25)
                           ->  Unique  (cost=20403.34..20966.45 rows=51424 width=25)
                                 ->  Sort  (cost=20403.34..20684.90 rows=112621 width=25)
                                       Sort Key: "Observation_Token"."resourceId", "Observation_Token".value
                                       ->  Seq Scan on "Observation_Token"  (cost=0.00..8258.81 rows=112621 width=25)
                                             Filter: (code = 'code'::text)
(14 rows)

Analysis from ChatGPT:

To determine which query will be more performant, let's analyze and compare the key components of their execution plans:

Query 1 (with GROUP BY and ORDER BY MIN("T1"."value"))

  1. Cost Estimate: The total cost estimate ranges from 38491.98 to 38559.82. This is relatively high, indicating a potentially more expensive operation.

  2. Sort and GroupAggregate: It performs a sort operation for the GROUP BY and then a GroupAggregate. These operations, especially on large datasets, can be computationally expensive.

  3. Parallel Execution: The query uses parallel processing (Parallel Seq Scan and Parallel Hash). Parallel processing can improve performance on systems with multiple CPU cores.

Query 2 (with DISTINCT ON in a Subquery)

  1. Cost Estimate: The total cost estimate ranges from 37930.03 to 37979.31, which is slightly lower than the first query.

  2. Hash Join and Seq Scan: This query also performs a hash join, but it relies on a subquery with a DISTINCT ON operation. This approach is generally more efficient than a GroupAggregate for selecting distinct rows.

  3. Subquery Scan and Unique: The subquery applies a Unique operation after sorting, which is typically more efficient than grouping and aggregating.

Performance Comparison and Insights

  • Cost Estimates: The cost estimates for the second query are slightly lower than the first, suggesting it might be more efficient. However, the difference is not substantial.

  • Operation Complexity: The first query's use of GroupAggregate is typically more expensive than the second query's use of Unique after sorting in the subquery.

  • Parallelism: Only the first query's plan actually uses parallel workers here; the second plan runs serially and is still estimated to be cheaper.

Conclusion

Based on the execution plans, the second query (with DISTINCT ON in the subquery) is likely to be more performant than the first query, primarily due to the more efficient handling of distinct values and potentially less costly operations.

@codyebberson codyebberson requested a review from a team as a code owner January 27, 2024 01:09
@reshmakh reshmakh added the search Features and fixes related to search label Jan 27, 2024
@reshmakh reshmakh added this to the January 31st, 2024 milestone Jan 27, 2024
@codyebberson codyebberson merged commit 9c35dc2 into main Jan 27, 2024
18 of 19 checks passed
@codyebberson codyebberson deleted the cody-token-sort-performance branch January 27, 2024 17:05
medplumbot added a commit that referenced this pull request Jan 31, 2024
Fixes #3794 - MeasureReport.period search (#3850)
Extra check for vmcontext bots (#3863)
Add and use vite-plugin-turbosnap (#3849)
Downgrade chromatic (#3848)
Repo sql fixes for cockroachdb (#3844)
Remove Health Gorilla from medplum-demo-bots (#3845)
fix-3815 cache presigned s3 binary urls (#3834)
Use tsvector index for token text search (#3791)
rate limit should return `OperationOutcome` (#3843)
Add global var "module" to vm context bots (#3842)
Fix lookup table tsv indexes (#3841)
Always use estimate count first (#3840)
Disambiguate getClient (#3839)
Fix invalid mermaid graph in diagnostic catalog docs (#3836)
fix-3809 race condition in Subscription extension fhir-path-criteria-expression %previous value lookup (#3810)
Fix Sonar code smells: mark React props readonly (#3832)
RDS proxy (#3827)
Fixed lookup tables in migration generator (#3830)
Fixed deprecated jest matchers (#3831)
Update README.md (#3828)
Update fhir-basics.md (#3829)
Case study content and images (#3820)
Added rdsReaderInstanceType and RDS upgrade docs (#3826)
Dependency upgrades (#3825)
Separate search popup menus for 'text' and 'token' (#3824)
Improve performance of token sort (#3823)
Additional logging (#3790)
Fix calendar input button style (#3817)
Don't add _total default in SearchControl (#3818)
Dark mode (#3814)
Fixes #3812 - FHIR profile cache bug (#3813)
Document using medplum client to integrate with external FHIR servers (#3811)
Use specific advisory locks (#3805)
Nested transactions (#3788)
Fix signin page on graphiql (#3802)
fix(heartbeat): start heartbeat on first bind to sub (#3793)
Fix async job tests (#3795)
Document using vm context bots (#3784)
Refactored access policy docs based on customer feedback (#3785)
Support Redis TLS config from Env (#3787)
feat(subscriptions): add `heartbeat` for WS subs (#3740)
Update Bot metrics (#3763)
github-merge-queue bot pushed a commit that referenced this pull request Jan 31, 2024, with the same list of commits as above.