runtime: long GC STW pauses (≥80ms) #19378
What version of Go are you using (`go version`)?
Here is another execution trace:
@obeattie, thanks for the execution traces. Unfortunately, I spent a while looking at them and wasn't able to glean much about the actual problem.
It's interesting that these are always periodic GCs and that your heap is under 4 MB. Looking through sweep termination, I could make guesses about what's going on, but they would all be shots in the dark. Background sweeping has long since finished before we enter sweep termination, so I don't think the sweeping part of sweep termination is the problem.
Could you collect a pprof profile? Given how idle your program generally is, I suspect you would get some samples under runtime.forcegchelper that may give us a better idea of what's holding it up. You might want to crank up the profile sampling rate with `runtime.SetCPUProfileRate`.
So sorry for leaving this open for so long with no follow-up.
We have been able to mitigate the impact of this issue somewhat, but it is still happening. Our investigations were hampered by some infrastructure problems, which we resolved in the last week, so we should be able to look into this more deeply now.
My current thinking is that some behaviour of the Linux scheduler is involved. The fact that the GC times are so closely clustered around 100ms – which is the default CFS scheduling period (`cpu.cfs_period_us`) – seems significant.
I got interested in your issue and decided to try to see if I could prove or disprove your theory about CFS before digging into it from a Go point of view.
I wrote a little program that does network transactions (talking with itself via localhost) and dies as soon as it sees one transaction that takes more than 80ms. It is here: https://play.golang.org/p/7ZXSoCeb18
(Put it in a directory called "server" and do "go build" on it.)
Running "./server" does not trigger the long transactions, even with significant other load on the machine. (See the -spin argument for how I was generating load.)
But it is possible to trigger the 80 ms transactions by putting the server inside of a cgroup which is using cpu.cfs_quota_us. To see it in action:
In another window, do the following:
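Roughly, assuming cgroup v1 with the cpu controller mounted at /sys/fs/cgroup/cpu and root access (the group name gctest is illustrative):

```shell
# Create a cgroup and give it a CFS bandwidth limit: with the default
# 100ms period, a 50ms quota means the group gets at most half a CPU.
sudo mkdir /sys/fs/cgroup/cpu/gctest
echo 100000 | sudo tee /sys/fs/cgroup/cpu/gctest/cpu.cfs_period_us
echo 50000  | sudo tee /sys/fs/cgroup/cpu/gctest/cpu.cfs_quota_us

# Move the running server into the cgroup and watch its latencies climb.
echo $(pidof server) | sudo tee /sys/fs/cgroup/cpu/gctest/tasks
```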
As quota approaches period, the max latency goes up, eventually reaching 80ms and the log.Fatal triggers.
If quota is at 200% of period (i.e. quota = 200000, period = 100000), you can even run "./server -spin" next to "./server" and you'll end up with max latencies of 50ms. (I can't quite explain why that is; I expected the max latency to still be 100 ms in that case, maybe I just got lucky.)
What's happening is that when the quota is exhausted, all processes in the cgroup are starved of time slices until the next period starts. If period is 100ms, you can eventually expect to get unlucky and have one transaction take up to 100ms.
The solution is to use a lower period, so that the worst case wait to finish any given transaction is lower. This config allows the spinner and the server to coexist with max latencies of 50 ms, at least on my machine:
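For example, assuming the same cgroup v1 layout and an existing group at /sys/fs/cgroup/cpu/gctest (illustrative names), one can shrink the period while keeping the same quota:period ratio:

```shell
# Same CPU share as quota=200000/period=100000, but a throttled task now
# waits at most ~10ms (one period) instead of up to ~100ms for more quota.
echo 10000 | sudo tee /sys/fs/cgroup/cpu/gctest/cpu.cfs_period_us
echo 20000 | sudo tee /sys/fs/cgroup/cpu/gctest/cpu.cfs_quota_us
```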
Are you running your server inside of Docker, managed by Kubernetes? If so, there's still some work to do to figure out how to get your cgroups configured with a shorter period.
The second URL seems to indicate that Kubernetes will not allow you to change the period. :(
@jeffallen Thank you so much for contributing your time to investigate this. Your findings match mine, and indeed these problems are seen when running under Kubernetes with CFS quotas enforced. I will need to work out some way to change the period under k8s…
Based on these findings I'm now confident that this is not a Go bug and I'm going to close the ticket. I think that fixing #19497 would be very helpful to anyone facing similar issues; debugging behaviour like this is somewhat like trying to find a needle in a haystack at the moment.
Thank you to everyone who's helped with this!