Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: fix missing index on courier list count #3002

Merged
merged 11 commits into from
Jan 10, 2023
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
DROP INDEX courier_messages_status_idx ON courier_messages;

DROP INDEX courier_messages_status_idx ON courier_messages;
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
CREATE INDEX courier_messages_status_idx ON courier_messages (status);

CREATE INDEX courier_messages_status_idx ON courier_messages (recipient);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indices are a way to tell the database which queries it should expect, so that it can optimize the lookup for these. That's why indices should always include all columns for a specific query. Optimally, indices are built in such a way that they can be re-used elsewhere.

Queries can benefit from an index even if they only filter a prefix of its columns. For example, if you create an index of columns (A, B, C), queries filtering (A) or (A, B) can use the index, so you don't need to also index (A).

The more variety a column has, the better it is in the index, because it avoids hot spots. So always put the column with the highest "entropy" first:

Suggested change
CREATE INDEX courier_messages_status_idx ON courier_messages (recipient);
CREATE INDEX courier_messages_recipient_status_idx ON courier_messages (nid, recipient, status);

In the example below, if recipient is not required, you should be able to change the query to

	if filter.Recipient != "" {
 		q = q.Where("recipient=?", filter.Recipient)
 	} else {
 		q = q.Where("recipient != ?", "")
  }

so that you don't need to add another index (please verify this first though).

Furthermore, you're creating the same index name twice, which will result in an error.

This comment was marked as outdated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have now done some more testing, and benchmarking on a CRDB instance.

The following results have been gathered using EXPLAIN (opt, verbose) on CRDB.

To cover all possible queries in the "ListMessages" method with an index, we would need 4 indexes:

  • CREATE INDEX … (nid, created_at DESC, id) if neither status nor recipient is supplied
  • CREATE INDEX … (nid, status, created_at DESC, id) if only status is supplied
  • CREATE INDEX … (nid, recipient, created_at DESC, id) if only recipient is supplied
  • CREATE INDEX … (nid, status, recipient, created_at DESC, id) if both are supplied

IIUC, this would probably hinder the performance, because it would make writes slower. It is also not scalable, if we want to add a filter for say template_type in the future.

I'd propose to only create CREATE INDEX … (nid, created_at DESC, id) to cover the case where neither status nor recipient is supplied, as that is probably the most common case. Additionally, we can add separate indexes on status and recipient: CREATE INDEX ... (status) (and (recipient) respectively).

I have looked at the query plans for the cases and these are the results:

 filter  cost
 no filter  cost=14.82
 status = 3  cost=18.47
 recipient = 'x@ory.sh'  cost=18.47
 status = 3 AND recipient = 'x@ory.sh'  cost=18.48

(Without the separate indices on status and recipient the cost would be cost=21.37 for the queries filtering for them.)

Another improvement would be to use the STORING clause in the index (CREATE INDEX … (nid, created_at DESC, id) STORING (body, template_data, etc...)), to avoid the index join with the table courier_messages and return the data from the index directly, which would bring the cost for the query with no filters down to 9.37. But this requires us to define the columns in the index definition, which is possible, but IMO out of the scope of this PR. Let me know how you want me to proceed here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And to add, since I didn't it include in the comments yet: Adding recipient != '' to the query and creating the index with CREATE INDEX ... (nid, recipient, status, created_at DESC, id) does have an effect:

filter cost
recipient != '' AND status != -1 cost=17.32
recipient != '' AND status = 3 cost=17.32
recipient = 'x@ory.sh' AND status != -1 cost=18.48
status = 3 AND recipient = 'x@ory.sh' cost=14.87

So in essence, it slows down the query with no filters, and speeds up the others a bit. IMO that's not really what we want, because it's more likely that users want to see just any message, until we have proper filtering, using a search engine. But do let me know what you think!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the expected query pattern here, realistically? The status column has low entropy and, if combined with a search for recipient, a scan through the result set is not a problem. A sequential scan isn't a problem if the set to scan is small. Combined indices beyond these:

  • CREATE INDEX … (nid, created_at DESC, id)
  • CREATE INDEX … (nid, status, created_at DESC, id)
  • CREATE INDEX … (nid, recipient, created_at DESC, id)

probably don't give us anything useful.

I recommend you check if those indices get used in ListMessages (even if more filters are active) and call it a day.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alnr. From my POV, users would initially go to the email delivery view (where we show this data) with no filters active. Then they might filter for Abandoned messages and might search for a specific recipient. So we definitely need an index on the query with no filters, might need an index on the status query, and might need one on the recipient filter query. While possible, a query with both a status and a recipient filter is not that useful, so probably not to be expected.

I have settled on creating CREATE INDEX … (nid, created_at DESC, id) and indices for status (already exists) and recipient (new in this PR).

That way, we get these costs:

 filter  cost
 no filter  cost=14.82
 status = 3  cost=18.47
 recipient = 'x@ory.sh'  cost=18.47
 status = 3 AND recipient = 'x@ory.sh'  cost=18.48

If we additionally create:

  • CREATE INDEX … (nid, status, created_at DESC, id)
  • CREATE INDEX … (nid, recipient, created_at DESC, id)

We get cost=14.82 on the queries that have either of the fields, but not for both, which would be fine, IMO. Question is, what is the overhead for writes, if we have all 3 of these indices active? I'd say we will do many more writes to this table, than this admin query is executed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updating 3 indices shouldn't be a problem. I'd say go for it and revisit if there's a problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I added the 2 missing indices to the PR. Should be good to go now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand, we always have these query components:

  • nid, created_at

If the user applies filters, they also add recipient, status, or both. Thus, this should handle all query cases:

CREATE INDEX … (nid ASC, created_at ASC, recipient ASC, status ASC)
CREATE INDEX … (nid ASC, created_at ASC, status ASC)

Given that you put recipient first:

	q := p.GetConnection(ctx).Where("nid=?", p.NetworkID(ctx))
	if filter.Recipient != "" {
 		q = q.Where("recipient=?", filter.Recipient)
 	}
	if filter.Status != nil {
		q = q.Where("status=?", *filter.Status)
	}

Please note that created_at DESC has no effect, as we explicitly use ORDER BY created_at DESC, which is not using the index for sorting but instead scans all ranges and orders them. Typically, we sort stuff ASC in indicies

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my testing, I can now say, that we need to put recipient and status first, before created_at in the index, otherwise the DB refuses to use the index in the query. Also see below on desc vs asc in the index.

Please note that created_at DESC has no effect, as we explicitly use ORDER BY created_at DESC, which is not using the index for sorting but instead scans all ranges and orders them. Typically, we sort stuff ASC in indicies

That is not true from my testing on a CRDB, as you can see here:

  • CREATE INDEX ... (nid, created_at); // this index is not used in the query plan and cost = 17.27
  • CREATE INDEX ... (nid, created_at DESC); // index is used in the query plan and cost = 14.82

But indeed, id does not seem to be needed. I'll remove it.

10 changes: 5 additions & 5 deletions persistence/sql/persister_courier.go
Original file line number Diff line number Diff line change
Expand Up @@ -45,6 +45,11 @@ func (p *Persister) ListMessages(ctx context.Context, filter courier.ListCourier
q = q.Where("recipient=?", filter.Recipient)
}

count, err := q.Count(&courier.Message{})
if err != nil {
return nil, 0, nil, sqlcon.HandleError(err)
}

opts = append(opts, keysetpagination.WithDefaultToken(new(courier.Message).DefaultPageToken()))
opts = append(opts, keysetpagination.WithDefaultSize(10))
opts = append(opts, keysetpagination.WithColumn("created_at", "DESC"))
Expand All @@ -56,11 +61,6 @@ func (p *Persister) ListMessages(ctx context.Context, filter courier.ListCourier
return nil, 0, nil, sqlcon.HandleError(err)
}

count, err := q.Count(&courier.Message{})
if err != nil {
return nil, 0, nil, sqlcon.HandleError(err)
}

messages, nextPage := keysetpagination.Result(messages, paginator)
return messages, int64(count), nextPage, nil
}
Expand Down