
runtime: gc pause bursts after upgrading from 1.16 to 1.17 #49542

Open
dop251 opened this issue Nov 12, 2021 · 7 comments

What version of Go are you using (go version)?

$ go version
go version go1.17.2 linux/amd64

Does this issue reproduce with the latest release?

N/A

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/dop/.cache/go-build"
GOENV="/home/dop/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/dop/pkg/mod"
GOOS="linux"
GOPATH="/home/dop/projects"
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.17.2"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1526968537=/tmp/go-build -gno-record-gcc-switches"

What did you do?

After upgrading our production servers to a binary built with go 1.17.2, we saw a radically different GC pause profile compared to go 1.16.6:

[image: per-minute change in PauseTotalNs, showing a step change at the rollout]

What's plotted here is the change in PauseTotalNs measured every minute. The marker corresponds to the time of the rollout. Note that the bursts did not start straight away, but only once the load kicked in in the morning.

Zooming in showed there are "bursts" among otherwise normal sub-millisecond runs:

[image: zoomed-in graph showing isolated pause bursts among otherwise sub-millisecond runs]

To confirm this is a regression, we did a build of the same code using go 1.16.6 and deployed it to one of the nodes. We also restarted another node at the same time, but left it running 1.17.2. Comparing the graphs for the two nodes shows identical profiles, except for the bursts, which occur only on the node running 1.17.2:

[image: pause graphs for the 1.16.6 and 1.17.2 nodes side by side]

(green represents 1.16, yellow represents 1.17).

Here is what it looks like in the trace (we've managed to get a couple of examples):

[image: execution trace showing a prolonged STW period]

The root cause is not immediately obvious to me, but I suspect there is a rare race condition which sometimes prevents the STW phase from completing. Note that there is one event recorded during the STW; its end stack trace shows runtime.selectgo:327. However, in another example there is also an event recorded shortly before the end of the STW, but it just says "proc stop".

We haven't been able to create a reproducible case for it. It appears to happen only once the heap grows to a few tens of GB, and then the frequency (but not the size) of the bursts depends on the load.

If there is any additional info required please let me know.

What did you expect to see?

Normal, sub-millisecond GC pauses.

What did you see instead?

Random bursts of up to 100ms.

@cagedmantis cagedmantis changed the title GC pause bursts after upgrading from 1.16 to 1.17 runtime: gc pause bursts after upgrading from 1.16 to 1.17 Nov 12, 2021
@cagedmantis cagedmantis added this to the Backlog milestone Nov 12, 2021

prattmic (Member) commented Nov 12, 2021

@dop251 in your trace this is a forced GC (runtime.forcegchelper). Does that apply to all of the examples you've captured in the execution tracer?

dop251 (Author) commented Nov 12, 2021

I have seen 3 examples and yes, it is runtime.forcegchelper in all of them.

prattmic (Member) commented Nov 12, 2021

Thanks, one more question:

However in another example there is also an event recorded shortly before the end of STW, but it just says "proc stop".

This in particular does sound like the potential problem you describe. Specifically, STW has to get all Ps to stop, so each P should have a "proc stop" event before the STW continues. (Note that a P may already be stopped, in which case its "proc stop" event could come before the STW.) If a P takes a long time to stop, that would block everything else.

I'd love to look at the trace you've collected to see if this is the case and what else is going on, if you think that is something you can share.

dop251 (Author) commented Nov 12, 2021

Is there a way to share it privately?

prattmic (Member) commented Nov 12, 2021

Sure, feel free to email me mpratt AT google.com.

dop251 (Author) commented Nov 12, 2021

Shared a Google Drive folder with you.
