Track queries using Postgres internal query ID #339
It looks like query IDs are globally unique, since the relation ID of referenced tables is used when generating the fingerprint:
https://github.com/postgres/postgres/blob/fc7852c6cb89a5384e0b4ad30874de92f63f88be/src/backend/utils/misc/queryjumble.c#L285

The question is: are duplicate query IDs across databases (that were for entirely different queries) a big enough issue that we need to track which database the query ID came from in QueryIdentityMap?
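For illustration (not from this PR), here is a minimal sketch of how one could observe the computed queryid directly: on Postgres 14+, with compute_query_id enabled, EXPLAIN (VERBOSE) includes a Query Identifier line. The connection string and database name are placeholders.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	// Placeholder connection string; point this at any Postgres 14+ database.
	db, err := sql.Open("postgres", "postgres://localhost/db1?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Ask Postgres to compute query IDs for this session.
	if _, err := db.Exec("SET compute_query_id = on"); err != nil {
		log.Fatal(err)
	}

	// EXPLAIN (VERBOSE) prints a "Query Identifier:" line when
	// compute_query_id is on.
	rows, err := db.Query("EXPLAIN (VERBOSE) SELECT 1")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var line string
		if err := rows.Scan(&line); err != nil {
			log.Fatal(err)
		}
		fmt.Println(line)
	}
}
```

Running the same statement against a second database should print the same identifier for SELECT 1, since no relation OIDs enter the jumble, while a query referencing a table would get a different queryid per database.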
Checking pg_stat_statements in our own database, it looks like we have ~0.15% dupes. That's pretty small, but maybe not small enough that we can just ignore it. In our case, though, all the dupes are in one database. What's the impact of collisions here? Would we associate stats with the incorrect query text?
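A hedged sketch of the kind of check described here: counting queryid values that appear in more than one pg_stat_statements row. The driver and connection string are assumptions, and note (as comes up later in the thread) that grouping by queryid alone also counts the same statement recorded under different userid or dbid values.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", "postgres://localhost/postgres?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Count queryids that occur in more than one row, and the total
	// number of distinct queryids.
	var duped, total int
	err = db.QueryRow(`
		SELECT count(*) FILTER (WHERE n > 1), count(*)
		  FROM (SELECT queryid, count(*) AS n
		          FROM pg_stat_statements
		         GROUP BY queryid) grouped
	`).Scan(&duped, &total)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%d of %d query IDs appear more than once (%.2f%%)\n",
		duped, total, 100*float64(duped)/float64(total))
}
```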
Yes, the stats could end up being associated with the wrong query in that case. But we only fall back to this identity map for queries that have churned out of pg_stat_statements since the last full snapshot, so it may not be a big deal.
So you think altogether it's unlikely enough that we can just ignore it? I'm a little reluctant to go with that (because if this does bite you, it will be basically impossible to figure out), but given the odds, it doesn't seem unreasonable, and I don't have a great idea on how to surface the problem. (And I don't think it's worth talking about it as a potential problem, since it's so unlikely).
Trying to think through this, there are two distinct cases:

1. Queries on two different databases having the same queryid: e.g. SELECT 1 would have the same queryid despite running on different databases (since it doesn't involve any table OID that'd be database-specific).
2. A hash collision with the Postgres queryid, i.e. two entirely unrelated queries in two different databases having the same queryid (if it was the same database we wouldn't know about it, since they'd just end up being a single entry in pgss).

If I understand the conversation so far, you're talking about (2) here, though I'm not sure whether the 0.15% statistic that @msakrejda referenced is for actual hash collisions? (I'd be surprised; I'm assuming it's more likely these are actually identical queries.)

Assuming the hash collision case (2), what would happen today:

Now assuming that in a future iteration of this PR we actually check the cache first and don't fingerprint if needed, the problem here would be that we are now associating stats with the wrong query in the case of a collision.

I'm 50/50 on whether this is a problem in practice. Hash collisions can occur (and the standard Postgres hash function used in pgss isn't the best hash function either), but it does seem like a quite unlikely edge case, and we are not in a security-sensitive context here, the way I look at it.

The one thing I would be worried about from a security perspective is if we consider caching query text here as well (or if this mapping is relied on elsewhere to get the text), because doing a simple mapping here could cause query text to cross a security boundary (in the pganalyze app you may be restricted to viewing a single database on a server).

Maybe it's better to just use the full triplet (i.e. (database_id, role_id, queryid)) for the mapping here, to avoid any accidental errors?
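A minimal sketch of the triplet-keyed map floated above; the type and field names are illustrative, not the collector's actual definitions, and the uint64 fingerprint value is an assumption.

```go
package statements

// Oid mirrors a Postgres object identifier (pg_stat_statements exposes
// dbid and userid as oid, and queryid as bigint).
type Oid uint32

// queryIdentityKey scopes a queryid by database and role, so two
// different queries that happen to collide on queryid across databases
// can never share an entry.
type queryIdentityKey struct {
	DatabaseOid Oid
	RoleOid     Oid
	QueryID     int64
}

// QueryIdentityMap maps the triplet to an internal query fingerprint
// (represented here as a plain uint64 for illustration).
type QueryIdentityMap map[queryIdentityKey]uint64

func (m QueryIdentityMap) Lookup(dbOid, roleOid Oid, queryID int64) (uint64, bool) {
	fp, ok := m[queryIdentityKey{dbOid, roleOid, queryID}]
	return fp, ok
}
```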
Changing to a composite key would necessarily mean increasing the cache size limit. The collector already has an unresolved memory consumption issue, and I worry that increasing the cache size further is going to make that worse.
I'll work on building a standalone benchmark script when I'm back from holidays. There might be an argument to drop PostgresStatementKey across the board in favor of the query ID on its own.
With cache size limit, you mean the 100k entries limit on the map?

I'm not sure that'd need to be raised if the key also includes the database ID: the overall number of entries shouldn't actually increase much, unless we have hash collisions (which would no longer be the same entry), or the same query without table references runs on multiple databases.

Adding the role ID, on the other hand, may have such an effect (e.g. imagine a workload using short-lived roles), so there may be an argument for only doing (database_id, queryid) without the role ID.

I don't think we can drop it across the board, because of the aforementioned issue with the same queryid having different text on different databases, and that being security-relevant (e.g. imagine db1 is for tenant1 and has a query comment like SELECT 1 /* user.email: tenant1@example.com */, while db2 has SELECT 1 /* user.email: tenant2@example.com */).

Therefore we'd need to keep the PostgresStatementKey use for any maps mapping to query texts at the very least, and anything related to statistics is also problematic (since there you need to count each database separately).
They were queries with identical queryids occurring multiple times in pg_stat_statements. However, investigating further, it looks like this is all due to different records across separate userids in pg_stat_statements. If I filter down to a single user, we have no collisions.
Good point.
FWIW, indexing into an array (that contains a map per database) looks to be significantly faster than using a composite key: https://gist.github.com/seanlinsley/dd7b2bf8d09b6ba710d27044794f86c5#file-map_test-go-L182
But for this PR, it doesn't seem like QueryIdentityMap needs to be scoped per database.
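The shape of the comparison in the linked gist, sketched under assumptions (the gist contains the actual benchmark): a composite-key map hashes the whole struct on every lookup, while a slice of per-database maps turns the database dimension into a plain index so only the int64 queryid gets hashed.

```go
package main

import "fmt"

type statementKey struct {
	dbIdx   int
	queryID int64
}

func main() {
	const numDBs = 4

	// Approach 1: a single map with a composite key; each lookup hashes
	// the full (dbIdx, queryID) struct.
	composite := make(map[statementKey]string)
	composite[statementKey{dbIdx: 1, queryID: 42}] = "SELECT ..."

	// Approach 2: a slice with one map per database; the database lookup
	// becomes an offset instead of part of the hashed key.
	perDB := make([]map[int64]string, numDBs)
	for i := range perDB {
		perDB[i] = make(map[int64]string)
	}
	perDB[1][42] = "SELECT ..."

	fmt.Println(composite[statementKey{1, 42}], perDB[1][42])
}
```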