
Track queries using Postgres internal query ID #339

Closed (16 commits, not merged)

Conversation

@seanlinsley (Member) commented Nov 30, 2022:

This PR adds an internal query ID cache to improve our ability to match query stats to the correct fingerprint, even if that query happens to churn out of pg_stat_statements by the time the next full snapshot rolls around.

This also adds improved tracking of truncated queries using pg_stat_activity.query_id in Postgres 14 and above.

```go
	LastSeen time.Time
}

type QueryIdentityMap map[int64]QueryIdentity
```
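A minimal sketch of how a cache like this might be consulted as a fallback, per the PR description. The diff context is truncated, so every field and function name other than `LastSeen` and `QueryIdentityMap` is hypothetical, and `fingerprint` merely stands in for the collector's real fingerprinting:

```go
package main

import (
	"fmt"
	"time"
)

// QueryIdentity is a guess at the shape implied by the truncated diff
// context; only LastSeen is visible there, and Fingerprint is assumed.
type QueryIdentity struct {
	Fingerprint string
	LastSeen    time.Time
}

type QueryIdentityMap map[int64]QueryIdentity

// fingerprint stands in for the collector's real query fingerprinting.
func fingerprint(text string) string { return "fp:" + text }

// resolveFingerprint sketches the fallback described in the PR: use the
// fingerprint computed from the pg_stat_statements query text when it is
// available, and only consult the cache for queries that have churned out
// of pg_stat_statements since the last full snapshot.
func resolveFingerprint(cache QueryIdentityMap, queryID int64, text string, haveText bool) (string, bool) {
	if haveText {
		return fingerprint(text), true
	}
	id, ok := cache[queryID]
	return id.Fingerprint, ok
}

func main() {
	cache := QueryIdentityMap{42: {Fingerprint: "fp:SELECT 1", LastSeen: time.Now()}}
	// The query has churned out of pg_stat_statements, so we have no text:
	fp, ok := resolveFingerprint(cache, 42, "", false)
	fmt.Println(fp, ok)
}
```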
@seanlinsley (Member, Author) commented:

It looks like query IDs are globally unique, since the relation ID of referenced tables is used when generating the fingerprint:

https://github.com/postgres/postgres/blob/fc7852c6cb89a5384e0b4ad30874de92f63f88be/src/backend/utils/misc/queryjumble.c#L285

The question is: are duplicate query IDs across databases (that were for entirely different queries) a big enough issue that we need to track which database the query ID came from in QueryIdentityMap?

@msakrejda (Contributor) commented:

Checking pg_stat_statements in our own database, it looks like we have ~0.15% dupes. That's pretty small, but maybe not small enough that we can just ignore it. In our case, though, all the dupes are in one database. What's the impact of collisions here? We may associate stats with the incorrect query text?

@seanlinsley (Member, Author) commented:

Yes, the stats could end up being associated with the wrong query in that case. But we only fall back to this identity map for queries that have churned out of pg_stat_statements since the last full snapshot, so it may not be a big deal.

@msakrejda (Contributor) commented:

So you think altogether it's unlikely enough that we can just ignore it? I'm a little reluctant to go with that (because if this does bite you, it will be basically impossible to figure out), but given the odds, it doesn't seem unreasonable, and I don't have a great idea on how to surface the problem. (And I don't think it's worth talking about it as a potential problem, since it's so unlikely).

@lfittl (Member) commented:

Trying to think through this:

  1. Queries on two different databases legitimately having the same queryid, e.g. `SELECT 1` would produce the same queryid on both, since it doesn't involve any table OID that would be database-specific

  2. A hash collision with the Postgres queryid, i.e. two entirely unrelated queries in two different databases having the same queryid (if it was the same database we wouldn't know about it, since they'd just end up being a single entry in pgss)

If I understand the conversation so far, you're talking about (2) here, though I'm not sure whether the 0.15% statistic that @msakrejda referenced is for actual hash collisions? (I'd be surprised; I'm assuming it's more likely these are actually identical queries.)

Assuming the hash collision case (2), what would happen today:

  1. database A has queryid 1, and querytext X
  2. database B also has queryid 1, but querytext Y
  3. we loop through the stat statements entries, fingerprinting each based on the querytext, and end up with two distinct query fingerprints (and thus associate to the correct query entry in pganalyze)

Now assuming that a future iteration of this PR actually checks the cache first (and skips fingerprinting on a cache hit), the problem would be that we then associate stats with the wrong query in the collision case.

I'm 50/50 on whether this is a problem in practice. Hash collisions could occur (and the standard Postgres hash function used in pgss isn't the best hash function either), but it does seem a quite unlikely edge case, and the way I look at it, we are not in a security-sensitive context here.
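For a rough sense of those odds (my own back-of-the-envelope, not from the thread): Postgres query IDs are 64-bit hashes, so the birthday bound gives the chance of at least one collision among n distinct queries as roughly n(n-1)/2^65:

```go
package main

import (
	"fmt"
	"math"
)

// collisionProbability is the birthday-bound approximation for at least one
// collision among n distinct values hashed into a 64-bit space.
func collisionProbability(n float64) float64 {
	return n * (n - 1) / (2 * math.Exp2(64))
}

func main() {
	// Even at 100k distinct queries on a server, a true queryid hash
	// collision is vanishingly unlikely:
	fmt.Printf("%.1e\n", collisionProbability(100_000)) // about 2.7e-10
}
```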

The one thing I would be worried about from a security perspective is in case we consider caching query text here as well (or if this mapping is relied on elsewhere to get the text), because doing a simple mapping here could cause query text to cross a security boundary (in the pganalyze app you may be restricted to viewing a single database on a server).

Maybe it's better to just use the full triplet (i.e. (database_id, role_id, queryid)) for the mapping here, to avoid any accidental errors?
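As a sketch, the triplet key could look like the following. The field names are illustrative; the collector's actual PostgresStatementKey definition may differ:

```go
package main

import "fmt"

// statementKey is an illustrative composite key, not the collector's actual
// PostgresStatementKey definition.
type statementKey struct {
	DatabaseOid uint64
	RoleOid     uint64
	QueryID     int64
}

func main() {
	// With the composite key, the same queryid on two databases maps to two
	// distinct entries, so query text never crosses the database boundary:
	texts := map[statementKey]string{}
	texts[statementKey{DatabaseOid: 1, RoleOid: 10, QueryID: 100}] = "SELECT 1 /* user.email: tenant1@example.com */"
	texts[statementKey{DatabaseOid: 2, RoleOid: 10, QueryID: 100}] = "SELECT 1 /* user.email: tenant2@example.com */"
	fmt.Println(len(texts)) // 2
}
```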

@seanlinsley (Member, Author) commented:

Changing to a composite key would necessarily mean increasing the cache size limit. The collector already has an unresolved memory consumption issue, and I worry that increasing the cache size further is going to make that worse.

@seanlinsley (Member, Author) commented:

I'll work on building a standalone benchmark script when I'm back from holidays. There might be an argument to drop PostgresStatementKey across the board in favor of the query ID on its own.

@lfittl (Member) commented Dec 16, 2022:

> Changing to a composite key would necessarily mean increasing the cache size limit.

By the cache size limit, do you mean the 100k-entry limit on the map?

I'm not sure that'd need to be raised if the key also includes the database ID: the overall number of entries shouldn't actually increase much, unless we have hash collisions (which would now become separate entries), or the same query without table references runs on multiple databases.

Adding the role ID on the other hand may have such an effect (e.g. imagine a workload using short-lived roles), so there may be an argument for only doing (database_id, queryid) without the role ID.

> There might be an argument to drop PostgresStatementKey across the board in favor of the query ID on its own.

I don't think we can drop it across the board because of the aforementioned issue with the same queryid having different text on different databases, and that being security-relevant (e.g. imagine db1 is for tenant1, and has a query comment like SELECT 1 /* user.email: tenant1@example.com */ and db2 has SELECT 1 /* user.email: tenant2@example.com */)

Therefore we'd need to keep the PostgresStatementKey use for any maps mapping to query texts at the very least, and anything related to statistics is also problematic (since there you need it to count each database separately).

@msakrejda (Contributor) commented:

> If I understand the conversation so far, you're talking about (2) here, though I'm not sure whether the 0.15% statistic that @msakrejda referenced is for actual hash collisions? (I'd be surprised; I'm assuming it's more likely these are actually identical queries.)

They were queries with identical queryids occurring multiple times in pg_stat_statements. However, investigating further, it looks like this is all due to different records across separate userids in pg_stat_statements. If I filter down to a single user, we have no collisions.

@seanlinsley (Member, Author) commented Dec 28, 2022:

> Therefore we'd need to keep the PostgresStatementKey use for any maps mapping to query texts at the very least, and anything related to statistics is also problematic (since there you need it to count each database separately).

Good point.

FWIW, indexing into an array (that contains a map per database) looks to be significantly faster than using a composite key: https://gist.github.com/seanlinsley/dd7b2bf8d09b6ba710d27044794f86c5#file-map_test-go-L182

(screenshot of benchmark results, Dec 28, 2022)

But for this PR, it doesn't seem like QueryIdentityMap needs to be scoped per database.
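The linked gist isn't reproduced here, but the shape of such a comparison might look something like this (my own sketch using Go's `testing.Benchmark` harness, not the gist's code):

```go
package main

import (
	"fmt"
	"testing"
)

type compositeKey struct {
	db  int32
	qid int64
}

// benchComposite inserts b.N entries into a single map with a composite
// (database, queryid) key.
func benchComposite(b *testing.B) {
	m := make(map[compositeKey]int64)
	for i := 0; i < b.N; i++ {
		m[compositeKey{db: int32(i % 8), qid: int64(i)}] = int64(i)
	}
}

// benchPerDatabase inserts b.N entries into a per-database slice of maps,
// each keyed by queryid alone.
func benchPerDatabase(b *testing.B) {
	ms := make([]map[int64]int64, 8)
	for d := range ms {
		ms[d] = make(map[int64]int64)
	}
	for i := 0; i < b.N; i++ {
		ms[i%8][int64(i)] = int64(i)
	}
}

func main() {
	fmt.Println("composite key:", testing.Benchmark(benchComposite))
	fmt.Println("per-database: ", testing.Benchmark(benchPerDatabase))
}
```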

@seanlinsley seanlinsley changed the title Track query IDs to reduce impact of pg_stat_statement pruning Improve tracking of truncated and churned queries using Postgres internal query ID Dec 9, 2022
@seanlinsley seanlinsley marked this pull request as ready for review December 9, 2022 20:12
@seanlinsley seanlinsley requested review from a team, lfittl and msakrejda December 9, 2022 20:12
(resolved review thread on runner/full.go)
@msakrejda (Contributor) left a review:

I'm not sure what to do about the risk of dupes here. The chance is very small, but the consequences seem potentially very confusing, and there's not a great way for us to diagnose this, right? I guess we could log warnings if count(queryid) differs from count(distinct queryid) but I don't think we can make them actionable. Or we could disable this behavior in those cases? Or should we make this configurable via collector setting? I think even then, we'd need to highlight collisions somehow, since this could get pretty confusing.

Thoughts?
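The count(queryid) vs. count(distinct queryid) idea above could be done after scanning rows rather than in SQL. A hypothetical helper (not collector code): once rows are filtered to a single database and role, any repeated queryid would indicate an actual collision worth logging.

```go
package main

import "fmt"

// duplicateQueryIDs reports queryids that appear more than once in a scan of
// pg_stat_statements rows. After filtering to one dbid and userid, repeats
// would indicate an actual hash collision. Hypothetical helper, not collector code.
func duplicateQueryIDs(ids []int64) []int64 {
	seen := map[int64]int{}
	for _, id := range ids {
		seen[id]++
	}
	var dupes []int64
	for id, n := range seen {
		if n > 1 {
			dupes = append(dupes, id)
		}
	}
	return dupes
}

func main() {
	fmt.Println(duplicateQueryIDs([]int64{1, 2, 2, 3})) // [2]
}
```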


(resolved review thread on output/transform/util.go)
@lfittl (Member) left a review:

Overall this looks like the right direction. There is a concurrency issue here, though: an activity snapshot accesses server.PrevState without holding server.StateMutex.

I think we should resolve that by just giving this its own variable and mutex on the Server struct, and then lock/unlock the mutex as needed.

Review threads (resolved): input/postgres/statements.go (×2), output/transform/util.go (×2), runner/full.go
Commits pushed:
- cache time.Now() when called in a loop
- store query identities on server
- add RWMutex to prevent concurrent writes (while still allowing concurrent reads)
@seanlinsley (Member, Author) commented:

Okay, this is ready for another round of review. Note I used an RWMutex so compact snapshots can read from the map without blocking each other (since only the full snapshot updates it).
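The locking scheme described here can be sketched as follows. The struct and field names are illustrative, not the collector's actual Server type: full snapshots take the write lock to update the map, while compact snapshots take read locks and proceed concurrently.

```go
package main

import (
	"fmt"
	"sync"
)

// server is an illustrative stand-in for the collector's Server struct,
// holding the query identity map behind its own RWMutex.
type server struct {
	mu         sync.RWMutex
	identities map[int64]string
}

// update is called from the full snapshot path: exclusive write lock.
func (s *server) update(queryID int64, fingerprint string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.identities[queryID] = fingerprint
}

// lookup is called from compact snapshots: shared read lock, so concurrent
// readers never block one another.
func (s *server) lookup(queryID int64) (string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	fp, ok := s.identities[queryID]
	return fp, ok
}

func main() {
	s := &server{identities: map[int64]string{}}
	s.update(1, "fp1")
	fp, ok := s.lookup(1)
	fmt.Println(fp, ok)
}
```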

Review threads (resolved): input/postgres/statements.go (×3), output/transform/util.go, runner/full.go, state/postgres_statement.go (×2)
@lfittl (Member) left a review:

Nice - I think this looks good!

One more minor comment re: a missing mutex unlock, but otherwise I think this is good to ship.

(resolved review thread on input/postgres/statements.go)
@seanlinsley seanlinsley changed the title Improve tracking of truncated and churned queries using Postgres internal query ID Track queries using Postgres internal query ID Feb 27, 2023
@seanlinsley (Member, Author) commented:

Due to the high rate of churn, this query ID cache is ineffective in practice. Reducing churn requires changes to how applications write their queries. We've added an in-app message to explain what customers can do:

(screenshot of the in-app message, Jun 22, 2023)

@seanlinsley seanlinsley deleted the track-query-identity-persistently branch June 22, 2023 16:27
4 participants