After upgrading our production servers to a version built with Go 1.17.2, we saw a radically different GC pause profile (compared to Go 1.16.6):
What's plotted here is the per-minute change in PauseTotalNs. The marker corresponds to the time of the rollout. Note that the change did not show up straight away, only once the load kicked in in the morning.
Zooming in shows "bursts" among otherwise normal sub-millisecond runs:
To confirm this is a regression, we built the same code with Go 1.16.6 and deployed it to one of the nodes. We also restarted another node at the same time, but left it running 1.17.2. Comparing the graphs for the two nodes shows an identical profile, except for the bursts, which only occur on the node running 1.17.2:
(green represents 1.16, yellow -- 1.17).
Here is what it looks like in the trace (we've managed to get a couple of examples):
The root cause is not immediately obvious to me, but I suspect there is a rare race condition which sometimes prevents STW from completing. Note that there is one event recorded during STW. Its end stack trace shows runtime.selectgo:327. However, in another example there is also an event recorded shortly before the end of STW, but it just says "proc stop".
We haven't been able to create a reproducible case for it. It looks like this only happens when the heap grows to a few tens of GB, and then the frequency (but not the size) of the bursts depends on the load.
If there is any additional info required please let me know.
What did you expect to see?
Normal, sub-millisecond GC pauses.
What did you see instead?
Random bursts of up to 100ms.
However in another example there is also an event recorded shortly before the end of STW, but it just says "proc stop".
This in particular does sound like the potential problem you describe. Specifically STW has to get all Ps to stop, so each P should have a "proc stop" event before STW continues. (Note that a P may already be stopped, so "proc stop" could be before STW). If a P takes a long time to stop, then that would block everything else.
I'd love to look at the trace you've collected to see if this is the case and what else is going on, if you think that is something you can share.