runtime: aggressive gc assist with many goroutines #56966

Open
nvanbenschoten opened this issue Nov 28, 2022 · 5 comments
@nvanbenschoten
Contributor

This is an extraction of a private thread with @mknyszek.

While working with a customer that is running CockroachDB, we found a class of workloads that are severely (and reproducibly) impacted by GC assist. The situation occurs when an app opens many (O(10k)) SQL connections to CockroachDB and applies a moderately memory-intensive workload (O(1KB) reads and writes @ 30k qps). We've found that this leads to severe tail latency blips (p50=2ms, p99=70ms) and have pinpointed the effect to GC assist.

This effect is present with go1.17.11 and go1.19.1. It is also present on master with 4179552. However, the degraded tail latency disappears if GC assist is disabled by commenting out this line.

name \ p99(ms)  17.txt     17-noassist.txt  19.txt     19-noassist.txt  19-4179552.txt
kv95            54.5 ±46%         2.5 ± 0%  73.4 ±43%         3.9 ±96%       41.9 ± 5%

Increasing GOGC does improve tail latency. However, the improvement comes from running fewer GCs; while a GC cycle was running, the impact on tail latency appeared to be about the same.

name \ p99(ms)  17-gogc-300.txt  19-gogc-300.txt  19-4179552-gogc-300.txt
kv95                  44.2 ±47%        18.8 ±24%               22.6 ±131%
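
For reference, GOGC can be raised either via the environment variable or programmatically with runtime/debug.SetGCPercent; a minimal sketch of the latter (not part of the reproduction above):

package main

import "runtime/debug"

func main() {
	// Equivalent to starting the process with GOGC=300: the heap may grow to
	// roughly 4x the live set before the next collection starts, so GC cycles
	// run less often, but each individual cycle is paced the same way.
	debug.SetGCPercent(300)

	// ... start the server / run the workload ...
}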

Go execution traces show GC assist kicking in across workload goroutines almost immediately (within a few ms) after the background GC starts. Assist work then consumes the majority of on-CPU time on these goroutines for the duration of the background GC cycle.

[Screenshot (2022-11-28): execution trace showing GC assist dominating on-CPU time across workload goroutines while the background GC runs]

Here is a collection of gctraces from different variants of the test using GODEBUG=gctrace=1,gcpacertrace=1:

go1.17_gogc_100.txt
go1.17_gogc_300.txt
go1.17_gogc_900.txt
go1.19_gogc_100.txt
go1.19_gogc_300.txt
go1.19_gogc_900.txt
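
As an aside, on Go 1.20 and later the assist cost can also be read directly from runtime/metrics rather than inferred from gctrace output. A minimal sketch, assuming the /cpu/classes/gc/mark/assist:cpu-seconds metric is available (it is not in the go1.17/go1.19 builds used above):

package main

import (
	"fmt"
	"runtime/metrics"
	"time"
)

func main() {
	// Cumulative CPU time spent in GC assists (Go 1.20+ only).
	samples := []metrics.Sample{{Name: "/cpu/classes/gc/mark/assist:cpu-seconds"}}
	for range time.Tick(10 * time.Second) {
		metrics.Read(samples)
		if samples[0].Value.Kind() == metrics.KindFloat64 {
			fmt.Printf("GC assist CPU so far: %.2fs\n", samples[0].Value.Float64())
		}
	}
}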

An interesting note is that the investigations began when we noticed higher tail latency when the same workload (30k qps) was split across more SQL connections. CockroachDB maintains two goroutines per active connection. An early test found the following:

vCPUs           Go GC    SQL connections  p99 latency (ms)
30 (1 socket)   Default  512              2.0
30 (1 socket)   Default  10,000           28.3
60 (2 sockets)  Default  512              3.1
60 (2 sockets)  Default  10,000           67.1

To reproduce

The easiest way to reproduce is using CockroachDB's internal cluster orchestration tool called roachprod. With that tool, the reproduction steps are:

export CLUSTER="${USER}-test"
roachprod create $CLUSTER -n2 --gce-machine-type='c2-standard-30' --local-ssd=false
roachprod stage $CLUSTER release v22.2.0-rc.3
roachprod start $CLUSTER:1
roachprod run   $CLUSTER:2 -- "./cockroach workload run kv --init --read-percent=95 --min-block-bytes=1024 --max-block-bytes=1024 --concurrency=10000 --max-rate=30000 --ramp=1m --duration=5m {pgurl:1}"

If roachprod is not an option, then the steps are:

  1. create a pair of c2-standard-30 VM instances
  2. stage a CockroachDB binary on each
  3. start CockroachDB on the first VM
  4. run the following from the second VM:
./cockroach workload run kv --init --read-percent=95 --min-block-bytes=1024 --max-block-bytes=1024 --concurrency=10000 --max-rate=30000 --ramp=1m --duration=5m 'postgresql://root@<INSERT VM1 HOSTNAME HERE>:26257?sslmode=disable'

I'm happy to help get these reproduction steps working in other environments.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Nov 28, 2022
@mknyszek mknyszek self-assigned this Nov 28, 2022
@mknyszek mknyszek added this to the Go1.21 milestone Nov 28, 2022
@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 28, 2022
@mknyszek
Contributor

Thank you!

AFAICT what's going on here is that the pacer is built to schedule right on the edge of no assists in the steady-state, to avoid starting earlier than it absolutely has to. However, when there's some noise (I see that the amount of goroutine stack memory varies a good bit, but there's likely also other noise), the pacer may constantly be either ahead of or behind where it needs to be. The amount of assist work here is not super high; it just hovers around 2-7%.

There are two ways we can fix this. (1) is to bound the latency of assists, but that would require a bigger change. (2) is to modify the pacer to always start a little early to hedge against noise. This could make assists consistently 0% of GC CPU time, but it comes at the cost of a slightly longer GC cycle. That might mean a slightly higher frequency of GC cycles in some cases, slightly more write barriers executed, and slightly more memory used for floating garbage. However, I suspect that the amount of hedging required to resolve this issue is small enough that it won't meaningfully change performance otherwise (especially when things aren't very noisy).

My main idea is to have the pacer always use a slightly higher cons/mark ratio. For example, instead of taking the average of the last 2 cycles as a prediction for the next one like it currently does, take the maximum of the two, or weight the larger cons/mark more heavily. (This might be too noisy, and maybe we want to track 3 GC cycles or something, but it's a decent starting point.)
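
A rough sketch of the two variants, purely for illustration (this is not the runtime's pacer code, and the names are made up):

package main

import (
	"fmt"
	"math"
)

// consMarkEstimate predicts the cons/mark ratio for the next cycle from the
// last two measured cycles by taking their maximum: plan for the worse
// (more allocation-heavy) of the two, so a single noisy-low measurement can't
// make the GC start too late and push work onto assists.
func consMarkEstimate(prev, last float64) float64 {
	return math.Max(prev, last)
}

// consMarkEstimateWeighted is the softer variant: weight the larger
// measurement more heavily rather than taking a strict maximum.
// The 0.75/0.25 split is arbitrary, chosen only for this sketch.
func consMarkEstimateWeighted(prev, last float64) float64 {
	hi, lo := math.Max(prev, last), math.Min(prev, last)
	return 0.75*hi + 0.25*lo
}

func main() {
	prev, last := 0.8, 1.2 // two hypothetical noisy measurements
	fmt.Println("average (current behavior):", (prev+last)/2)
	fmt.Println("maximum:                   ", consMarkEstimate(prev, last))
	fmt.Println("weighted toward the larger:", consMarkEstimateWeighted(prev, last))
}

Either estimate starts the next mark phase earlier than the plain average whenever the two cycles disagree, which is the hedge described above.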

@nvanbenschoten
Contributor Author

Thanks @mknyszek! Hedging against noise to avoid GC assist makes sense to me. There's a high, direct penalty to starting late and then pushing assist work onto foreground goroutines. Meanwhile, there's a lower, indirect penalty to starting early and running GC for slightly longer.

There's a tension here between reliable/predictable performance and absolute performance. Different systems will want to live at different spots on that spectrum. I'm hopeful a small degree of hedging will meaningfully move us in the direction of predictability without compromising much on absolute performance.

I'm happy to test out fixes on this workload to see if they help.

@mknyszek
Contributor

Sorry for the delay! I got a bit caught up with other things, and after some more thought and a little experimentation, I think this idea

For example, instead of taking the average of the last 2 cycles as a prediction for the next one like it currently does, take the maximum of the two, or weight the larger cons/mark more heavily. (This might be too noisy, and maybe we want to track 3 GC cycles or something, but it's a decent starting point.)

is indeed too noisy.

I'm hopeful a small degree of hedging will meaningfully move us in the direction of predictability without compromising much on absolute performance.

I agree. I suspect aggregate performance metrics like throughput tend to be less sensitive to relatively small changes in pacing, so a pure win (or close to one) feels achievable here.

I'm going to be out on vacation for a few days; I just wanted to leave a ping here to say that I haven't forgotten about this and that it's still on my radar. :)

I'm happy to test out fixes on this workload to see if they help.

Thank you! I'll ping you when I have something more concrete. I have a couple more ideas I'm playing with; I'll aim to get back to you next week (I'm out for the rest of this week).

@nvanbenschoten
Contributor Author

Thanks for the update @mknyszek. Enjoy your vacation!

@mknyszek
Contributor

Okay, it's been a full month and I'm back. I experimented with a whole lot of things (the test case I'm using here is uniformly random noise in the measured cons/mark ratio), and I think the best hedging mechanism is in fact a maximum; it's just that a 2-cycle window is pretty bad. A 5-cycle window is pretty good, but it's not very responsive when the application suddenly goes idle: that's 5 consecutive cycles where the pacer is going to be too conservative. Maybe that's fine, though. When the application is idle these things tend to matter less, and it's a constant-factor hit. I'm going to try benchmarking this ASAP; I'll send a CL if it looks OK.
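
For illustration only, here is a sketch of the windowed maximum described above (made-up names, not the actual change): keep the last five measured cons/mark ratios and pace against their maximum.

package main

import "fmt"

// consMarkWindow keeps the last five measured cons/mark ratios and reports
// their maximum, which is what the pacer would plan the next cycle against.
// A wider window hedges better against noise, but it takes several cycles to
// "forget" a high measurement once the application goes idle.
type consMarkWindow struct {
	vals [5]float64
	n    int // slots filled so far
	i    int // next slot to overwrite
}

func (w *consMarkWindow) record(consMark float64) {
	w.vals[w.i] = consMark
	w.i = (w.i + 1) % len(w.vals)
	if w.n < len(w.vals) {
		w.n++
	}
}

func (w *consMarkWindow) estimate() float64 {
	best := 0.0
	for _, v := range w.vals[:w.n] {
		if v > best {
			best = v
		}
	}
	return best
}

func main() {
	var w consMarkWindow
	// A busy phase followed by the application going idle: the estimate stays
	// elevated for several cycles after the measured ratio drops.
	for _, m := range []float64{1.1, 1.4, 1.2, 0.2, 0.2, 0.2, 0.2, 0.2} {
		w.record(m)
		fmt.Printf("measured %.1f -> plan for %.1f\n", m, w.estimate())
	}
}

The trailing idle measurements show the responsiveness trade-off: the estimate stays at the busy-phase maximum for several cycles before it comes back down.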
