runtime: aggressive gc assist with many goroutines #56966
This is an extraction of a private thread with @mknyszek.
While working with a customer that is running CockroachDB, we found a class of workloads that are severely (and reproducibly) impacted by GC assist. The situation occurs when an app opens many (O(10k)) SQL connections to CockroachDB and applies a moderately memory-intensive workload (O(1KB) reads and writes @ 30k qps). We've found that this leads to severe tail latency blips (p50=2ms, p99=70ms) and have pinpointed the effect to GC assist.
Increasing GOGC does improve tail latency. However, the improvement comes from running fewer GCs. When GC was running, the impact on tail latency appears to be about the same.
Go execution traces show GC assist kick in across workload goroutines almost immediately (within a few ms) after the background GC process starts. It then consumes the majority of on-cpu time on these goroutines for the duration of the background GC duration.
Here is a collection of gctraces from different variants of the test using
An interesting note is that the investigations began when we noticed higher tail latency when the same workload (30k qps) was split across more SQL connections. CockroachDB maintains two goroutines per active connection. An early test found the following:
The easiest way to reproduce is using CockroachDB's internal cluster orchestration tool called
If roachprod is not an option, than the steps are:
I'm happy to help get these reproduction steps working in other environments.
The text was updated successfully, but these errors were encountered:
AFAICT what's going on here is that the pacer is built to schedule riiiight on the edge of no assists in the steady-state, to avoid starting earlier than it absolutely has to. However, when there's some noise (I see that the amount of goroutine stacks vary a good bit, but there's likely also other noise) the pacer may constantly be either ahead or behind of where it needs to be. The amount of assists here is not super high, it's just that it hovers around 2-7%.
There are two ways we can fix this. (1) is to bound the latency of assists, but that would require a bigger change. (2) is to modify the pacer to always start a little early hedge against noise. This could make assists consistently 0% of GC CPU time, but comes at the cost of a slightly longer GC cycle. This might lead to a slightly higher frequency of GC cycles in some cases, slightly more write barriers executed, and slightly more memory used for floating garbage. However, I suspect that the amount of hedging required to resolve this issue is small enough that it won't meaningfully change performance otherwise (especially when things aren't very noisy).
My main idea is to have the pacer always use a slightly higher cons/mark ratio. For example, instead of taking the average of the last 2 cycles as a prediction for the next one like it currently does, take the maximum of the two, or weight the larger cons/mark more heavily. (This might be too noisy, and maybe we want to track 3 GC cycles or something, but it's a decent starting point.)
Thanks @mknyszek! Hedging against noise to avoid GC assist makes sense to me. There's a high, direct penalty to starting late and then pushing assist work onto foreground goroutines. Meanwhile, there's a lower, indirect penalty to starting early and running GC for slightly longer.
There's a tension here between reliable/predictable performance and absolute performance. Different systems will want to live at different spots on that spectrum. I'm hopeful a small degree of hedging will meaningfully move us in the direction of predictability without compromising much on absolute performance.
I'm happy to test out fixes on this workload to see if they help.
Sorry for the delay! I got a bit caught up with other things, and after some more thought and a little experimentation, I think indeed this idea
is indeed too noisy.
I agree. I suspect aggregate performance metrics like throughput tend to be less sensitive to relatively small changes in pacing, so a pure win (or close to one) feels achievable here.
I'm going to be out on vacation for a few days, just wanted to leave a ping here that I hadn't forgotten about it, and that it's still on my radar. :)
Thank you! I'll ping you when I have something more concrete. I have a couple more ideas I'm playing with, I'll aim to get back to you next week (I'm out for the rest of this week).
Okay, it's been a full month and I'm back. I experimented with a whole lot of things (The test case I'm using here is uniformly random noise in the measured cons/mark ratio), and I think the best hedging mechanism is in fact a maximum, it's just that a 2 cycle window is pretty bad. A 5 cycle window is pretty good, but it's not very responsive when the application suddenly goes idle; that's 5 consecutive cycles where the pacer is going to be too conservative. Maybe that's fine though. When you're idle these things tend to not matter so much, and it's a constant factor hit. I'm going to try benchmarking this ASAP; I'll send a CL if it looks OK.