
Overhead from the pg_stat_kcache extension #41

Closed

vitabaks opened this issue Jan 18, 2024 · 7 comments
@vitabaks

vitabaks commented Jan 18, 2024

Please take a look at the following results of a synthetic (read-only) pgbench test, which we ran on these servers: c3-standard-176 (176 vCPU Intel, 704 GB memory) and c3d-standard-360 (360 vCPU AMD, 1440 GB memory).

We observe degradation once the number of clients exceeds 100:

[c1|workload_pgbench] 2024-01-05 04:03:30 tps = 12177
[c50|workload_pgbench] 2024-01-05 04:09:01 tps = 565902
[c100|workload_pgbench] 2024-01-05 04:14:34 tps = 696601
[c150|workload_pgbench] 2024-01-05 04:20:07 tps = 402113
[c200|workload_pgbench] 2024-01-05 04:25:40 tps = 314192
[c250|workload_pgbench] 2024-01-05 04:31:14 tps = 289959
[c300|workload_pgbench] 2024-01-05 04:36:47 tps = 281305
[c350|workload_pgbench] 2024-01-05 04:42:20 tps = 285510
[c400|workload_pgbench] 2024-01-05 04:47:53 tps = 277419

Analyzing the wait event profile (based on pg_wait_sampling), we see the number of LWLock:pg_stat_kcache wait events grow as the client count increases, until pg_stat_kcache eventually becomes the TOP-1 event in the profile (a sketch for pulling such a profile follows the list below):

1 client:

  • No 'pg_stat_kcache' wait event

50 clients:

  • 'pg_stat_kcache' wait event is present

100 clients:

  • 'pg_stat_kcache' wait event in TOP-5

150 clients:

  • 'pg_stat_kcache' wait event is TOP-1
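
For reference, here is a minimal sketch (not part of the original report) of pulling such a top-waits profile via libpq, assuming pg_wait_sampling is installed and exposes its pg_wait_sampling_profile view; the connection string is a placeholder:

```c
/* Build: gcc top_waits.c -o top_waits -I$(pg_config --includedir) -lpq */
#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn *conn = PQconnectdb("dbname=postgres");

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    /* Aggregate sampled wait events across all backends. */
    PGresult *res = PQexec(conn,
        "SELECT event_type, event, sum(count) AS samples "
        "  FROM pg_wait_sampling_profile "
        " WHERE event IS NOT NULL "
        " GROUP BY event_type, event "
        " ORDER BY samples DESC LIMIT 10");

    if (PQresultStatus(res) != PGRES_TUPLES_OK)
    {
        fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
        PQclear(res);
        PQfinish(conn);
        return 1;
    }

    for (int i = 0; i < PQntuples(res); i++)
        printf("%-10s %-35s %s\n",
               PQgetvalue(res, i, 0),    /* event_type, e.g. LWLock    */
               PQgetvalue(res, i, 1),    /* event, e.g. pg_stat_kcache */
               PQgetvalue(res, i, 2));   /* number of samples          */

    PQclear(res);
    PQfinish(conn);
    return 0;
}
```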

In the attachment you will find artifacts including settings, postgres stats, logs, and more:

@NikolayS

NikolayS commented Jan 18, 2024

Thanks @vitabaks for posting! Worth noting that this comes from our (postgres.ai) new bot activities; it's built on top of GPT-4 Turbo with lots of additional components: https://twitter.com/samokhvalov/status/1743151620555477083

@anayrat raised a couple of questions in https://twitter.com/Adrien_nayrat/status/1744288348217151991

The whole pipeline is here; we collect 70+ artifacts for each iteration, browsable here (or just see the .zip provided above, it's the same). The non-AI part of the automation is here. We can quickly reproduce things if additional checks are needed, but it should be straightforward on any machine.

It's also interesting that pgss (pg_stat_statements) demonstrates noticeable overhead at this scale for trivial pgbench workloads as well: https://twitter.com/postgres_ai/status/1747690825709215793. Obviously, this is contention when updating the stats for a single query record. But the pgss overhead is much, much lower than pgsk's, so the question is: why such a significant difference?
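
For context, a simplified sketch (hypothetical names, not the actual pg_stat_statements source) of the counter-update pattern commonly described for pgss: the hash table is locked in shared mode, and each entry carries its own spinlock that is held only for a handful of increments, which may be why its contention stays lower than a single exclusive LWLock would:

```c
#include "postgres.h"
#include "storage/spin.h"

/* Hypothetical, simplified stand-in for a per-query stats entry. */
typedef struct StatsEntry
{
    int64   calls;              /* number of executions           */
    double  total_exec_time;    /* cumulative execution time, ms  */
    slock_t mutex;              /* protects only the fields above */
} StatsEntry;

/*
 * Called once per query execution. The hash lookup that found `entry`
 * is assumed to run under a shared (not exclusive) LWLock, so backends
 * only serialize here, on the per-entry spinlock, and only when they
 * execute the same query: exactly the single-record pgbench case
 * discussed above.
 */
static void
update_entry_counters(StatsEntry *entry, double exec_time)
{
    volatile StatsEntry *e = entry;

    SpinLockAcquire(&e->mutex);     /* held only for a few increments */
    e->calls += 1;
    e->total_exec_time += exec_time;
    SpinLockRelease(&e->mutex);
}
```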

@vitabaks vitabaks changed the title Overhead from the pg_kstat_cache extension Overhead from the pg_stat_kcache extension Jan 18, 2024
@rjuju rjuju self-assigned this Jan 19, 2024
@rjuju
Member

rjuju commented Jan 19, 2024

Hi @vitabaks

I'm assuming the bottleneck comes from the internal lock that protects the array where we store the queryid for each backend, in case parallel workers are launched. That lock was initially added as a precaution, but there shouldn't be any risk of concurrent modification while reading the value, so I don't think it's necessary. The lock should have been harmless, but with a high client count I can indeed see how it would affect performance.
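
To illustrate the idea, a minimal sketch with hypothetical names (not the actual pg_stat_kcache code): before the change, every set of a backend's queryid went through an exclusive LWLock; the branch turns that into a plain store into the slot:

```c
#include "postgres.h"
#include "storage/lwlock.h"

/*
 * Hypothetical, simplified version of the shared structure discussed
 * above: one queryid slot per backend, readable by parallel workers.
 */
typedef struct SharedQueryIds
{
    LWLock *lock;                               /* the lock being removed */
    uint64  queryids[FLEXIBLE_ARRAY_MEMBER];    /* one slot per backend   */
} SharedQueryIds;

static SharedQueryIds *shared;

/* Before: every executor hook invocation serializes on one LWLock. */
static void
set_queryid_locked(int backend_id, uint64 queryid)
{
    LWLockAcquire(shared->lock, LW_EXCLUSIVE);
    shared->queryids[backend_id] = queryid;
    LWLockRelease(shared->lock);
}

/*
 * After: a plain store. Only the owning backend writes its own slot,
 * and, per the reasoning above, there is no risk of concurrent
 * modification while readers fetch the value, so the lock is dropped.
 */
static void
set_queryid_lockless(int backend_id, uint64 queryid)
{
    shared->queryids[backend_id] = queryid;
}
```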

Can you try with the "remove_queryids_lock" branch that I just pushed? https://github.com/powa-team/pg_stat_kcache/tree/remove_queryids_lock

@vitabaks
Author

vitabaks commented Jan 19, 2024

> Can you try with the "remove_queryids_lock" branch that I just pushed? https://github.com/powa-team/pg_stat_kcache/tree/remove_queryids_lock

@rjuju Thanks for the quick response.

I tested your patch (on c3d-standard-360); here is the result:

Without the patch:

[c200_kcache_release|workload_pgbench] 2024-01-19 09:12:46 tps = 358295
  • 'pg_stat_kcache' wait event is TOP-1

With the patch:

[c200_kcache_remove_queryids_lock|workload_pgbench] 2024-01-19 09:23:11 tps = 976167
  • No 'pg_stat_kcache' wait event
  • 'Timeout:SpinDelay' remains, but (as it turned out) this relates to the pg_stat_statements extension.

Artifacts: https://gitlab.com/postgres-ai/postgresql-consulting/tests-and-benchmarks/-/jobs/5965632917/artifacts/browse/ARTIFACTS/

Conclusions

By eliminating LWLock:pg_stat_kcache, we achieved a performance increase of more than 2.7x (976167 / 358295 ≈ 2.72). In other words, the patched run shows a 172.5% increase in TPS over the unpatched one, well over double the transaction throughput.

@rjuju
Member

rjuju commented Jan 19, 2024

@vitabaks thanks for the testing! And it's great news that removing the lock is enough to eliminate the overhead.

I just merged the commit into the main branch. I'll wait a bit, just in case, and do a release early next week.

@anayrat
Member

anayrat commented Jan 19, 2024

Thanks, guys!

@vitabaks
Author

Thanks!

@rjuju
Member

rjuju commented Jan 24, 2024

I just released version 2.2.3! Thanks again for the report and testing the patch!

@rjuju rjuju closed this as completed Jan 24, 2024