-
-
Notifications
You must be signed in to change notification settings - Fork 931
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore: fix missing index on courier list count #3002
Merged
Merged
Changes from 1 commit
Commits
Show all changes
11 commits
Select commit
Hold shift + click to select a range
6069360
chore: fix missing index on courier list count
jonas-jonas cf156c7
chore: add reproduction test for `nid` escape
jonas-jonas f7fe751
chore: fix list courier messages index
jonas-jonas 99ef939
chore: update ory/x
jonas-jonas e70720a
chore: fix index syntax
jonas-jonas cb66eaf
chore: fix index ordering
jonas-jonas aaf49e5
chore: add status and recipient indices
jonas-jonas aa80826
Merge remote-tracking branch 'origin/master' into fix/missingIndexOnC…
jonas-jonas c6f8da6
Merge remote-tracking branch 'origin/master' into fix/missingIndexOnC…
jonas-jonas 815a593
chore: remove `id` from indices
jonas-jonas 31cc37b
Merge remote-tracking branch 'origin/master' into fix/missingIndexOnC…
jonas-jonas File filter
Filter by extension
Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
There are no files selected for viewing
3 changes: 3 additions & 0 deletions
3
persistence/sql/migrations/sql/20230104193739000000_courier_list_index.down.sql
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
DROP INDEX courier_messages_status_idx ON courier_messages; | ||
|
||
DROP INDEX courier_messages_status_idx ON courier_messages; |
3 changes: 3 additions & 0 deletions
3
persistence/sql/migrations/sql/20230104193739000000_courier_list_index.up.sql
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
CREATE INDEX courier_messages_status_idx ON courier_messages (status); | ||
|
||
CREATE INDEX courier_messages_status_idx ON courier_messages (recipient); | ||
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indices are a way to tell the database which queries it should expect, so that it can optimize the lookup for these. That's why indices should always include all columns for a specific query. Optimally, indices are built in such a way that they can be re-used elsewhere.
Queries can benefit from an index even if they only filter a prefix of its columns. For example, if you create an index of columns (A, B, C), queries filtering (A) or (A, B) can use the index, so you don't need to also index (A).
The more variety a column has, the better it is in the index, because it avoids hot spots. So always put the column with the highest "entropy" first:
In the example below, if recipient is not required, you should be able to change the query to
so that you don't need to add another index (please verify this first though).
Furthermore, you're creating the same index name twice, which will result in an error.
This comment was marked as outdated.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have now done some more testing, and benchmarking on a CRDB instance.
The following results have been gathered using
EXPLAIN (opt, verbose)
on CRDB.To cover all possible queries in the "ListMessages" method with an index, we would need 4 indexes:
CREATE INDEX … (nid, created_at DESC, id)
if neitherstatus
norrecipient
is suppliedCREATE INDEX … (nid, status, created_at DESC, id)
if onlystatus
is suppliedCREATE INDEX … (nid, recipient, created_at DESC, id)
if onlyrecipient
is suppliedCREATE INDEX … (nid, status, recipient, created_at DESC, id)
if both are suppliedIIUC, this would probably hinder the performance, because it would make writes slower. It is also not scalable, if we want to add a filter for say
template_type
in the future.I'd propose to only create
CREATE INDEX … (nid, created_at DESC, id)
to cover the case where neitherstatus
norrecipient
is supplied, as that is probably the most common case. Additionally, we can add separate indexes onstatus
andrecipient
:CREATE INDEX ... (status)
(and(recipient)
respectively).I have looked at the query plans for the cases and these are the results:
no filter
cost=14.82
status = 3
cost=18.47
recipient = 'x@ory.sh'
cost=18.47
status = 3 AND recipient = 'x@ory.sh'
cost=18.48
(Without the separate indices on
status
andrecipient
the cost would becost=21.37
for the queries filtering for them.)Another improvement would be to use the
STORING
clause in the index (CREATE INDEX … (nid, created_at DESC, id) STORING (body, template_data, etc...)
), to avoid the index join with the tablecourier_messages
and return the data from the index directly, which would bring the cost for the query with no filters down to9.37
. But this requires us to define the columns in the index definition, which is possible, but IMO out of the scope of this PR. Let me know how you want me to proceed here.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And to add, since I didn't it include in the comments yet: Adding
recipient != ''
to the query and creating the index withCREATE INDEX ... (nid, recipient, status, created_at DESC, id)
does have an effect:recipient != '' AND status != -1
cost=17.32
recipient != '' AND status = 3
cost=17.32
recipient = 'x@ory.sh' AND status != -1
cost=18.48
status = 3 AND recipient = 'x@ory.sh'
cost=14.87
So in essence, it slows down the query with no filters, and speeds up the others a bit. IMO that's not really what we want, because it's more likely that users want to see just any message, until we have proper filtering, using a search engine. But do let me know what you think!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the expected query pattern here, realistically? The
status
column has low entropy and, if combined with a search forrecipient
, a scan through the result set is not a problem. A sequential scan isn't a problem if the set to scan is small. Combined indices beyond these:CREATE INDEX … (nid, created_at DESC, id)
CREATE INDEX … (nid, status, created_at DESC, id)
CREATE INDEX … (nid, recipient, created_at DESC, id)
probably don't give us anything useful.
I recommend you check if those indices get used in
ListMessages
(even if more filters are active) and call it a day.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @alnr. From my POV, users would initially go to the email delivery view (where we show this data) with no filters active. Then they might filter for
Abandoned
messages and might search for a specific recipient. So we definitely need an index on the query with no filters, might need an index on thestatus
query, and might need one on therecipient
filter query. While possible, a query with both astatus
and arecipient
filter is not that useful, so probably not to be expected.I have settled on creating
CREATE INDEX … (nid, created_at DESC, id)
and indices forstatus
(already exists) andrecipient
(new in this PR).That way, we get these costs:
no filter
cost=14.82
status = 3
cost=18.47
recipient = 'x@ory.sh'
cost=18.47
status = 3 AND recipient = 'x@ory.sh'
cost=18.48
If we additionally create:
CREATE INDEX … (nid, status, created_at DESC, id)
CREATE INDEX … (nid, recipient, created_at DESC, id)
We get
cost=14.82
on the queries that have either of the fields, but not for both, which would be fine, IMO. Question is, what is the overhead for writes, if we have all 3 of these indices active? I'd say we will do many more writes to this table, than this admin query is executed.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updating 3 indices shouldn't be a problem. I'd say go for it and revisit if there's a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I added the 2 missing indices to the PR. Should be good to go now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I understand, we always have these query components:
If the user applies filters, they also add
recipient
,status
, or both. Thus, this should handle all query cases:Given that you put recipient first:
Please note that
created_at DESC
has no effect, as we explicitly useORDER BY created_at DESC
, which is not using the index for sorting but instead scans all ranges and orders them. Typically, we sort stuff ASC in indiciesThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my testing, I can now say, that we need to put
recipient
andstatus
first, beforecreated_at
in the index, otherwise the DB refuses to use the index in the query. Also see below ondesc
vsasc
in the index.That is not true from my testing on a CRDB, as you can see here:
CREATE INDEX ... (nid, created_at); // this index is not used in the query plan and cost = 17.27
CREATE INDEX ... (nid, created_at DESC); // index is used in the query plan and cost = 14.82
But indeed,
id
does not seem to be needed. I'll remove it.