Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: gc pause bursts after upgrading from 1.16 to 1.17 #49542

Open
dop251 opened this issue Nov 12, 2021 · 7 comments
Open

runtime: gc pause bursts after upgrading from 1.16 to 1.17 #49542

dop251 opened this issue Nov 12, 2021 · 7 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@dop251
Copy link

dop251 commented Nov 12, 2021

What version of Go are you using (go version)?

$ go version
go version go1.17.2 linux/amd64

Does this issue reproduce with the latest release?

N/A

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/dop/.cache/go-build"
GOENV="/home/dop/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/dop/pkg/mod"
GOOS="linux"
GOPATH="/home/dop/projects"
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.17.2"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1526968537=/tmp/go-build -gno-record-gcc-switches"

What did you do?

After upgrading our production servers with a version built with go 1.17.2 we saw a radically different GC pause profile (comparing to go 1.16.6):

image

What's plotted here is a change in PauseTotalNs measured every minute. The marker corresponds to the time of the rollout. Note that it did not start straight away, only when the load kicked in in the morning.

Zooming in showed there are "bursts" among otherwise normal sub-millisecond runs:

image

To confirm this is a regression we did a build of the same code using go 1.16.6 and deployed it to one of the nodes. Also, we restarted another node at the same time, but left it running with 1.17.2. Comparing the graphs for the two nodes shows identical profile, except for the bursts which only occur on the node running 1.17.2:

image

(green represents 1.16, yellow -- 1.17).

Here is what it looks like in the trace (we've managed to get a couple of examples):

image

The root cause is not immediately obvious to me, but I suspect there is a rare race condition which sometimes prevents STW to complete. Note, there is one event recorded during STW. In the end stack trace it shows runtime.selectgo:327. However in another example there is also an event recorded shortly before the end of STW, but it just says "proc stop".

We haven't been able to create a reproducible case for it. It looks like this only happens when heap size grows to a few tens of GB and then the frequency (but not the size) of the bursts depends on the load.

If there is any additional info required please let me know.

What did you expect to see?

Normal, sub-millisecond GC pauses.

What did you see instead?

Random bursts of up to 100ms.

@cagedmantis cagedmantis changed the title GC pause bursts after upgrading from 1.16 to 1.17 runtime: gc pause bursts after upgrading from 1.16 to 1.17 Nov 12, 2021
@cagedmantis cagedmantis added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 12, 2021
@cagedmantis cagedmantis added this to the Backlog milestone Nov 12, 2021
@cagedmantis
Copy link
Contributor

cagedmantis commented Nov 12, 2021

@prattmic
Copy link
Member

prattmic commented Nov 12, 2021

@dop251 in your trace this is a forced GC (runtime.forcegchelper). Does that apply to all of the examples you've captured in the execution tracer?

@dop251
Copy link
Author

dop251 commented Nov 12, 2021

I have seen 3 examples and yes, it is runtime.forcegchelper in all of them.

@prattmic
Copy link
Member

prattmic commented Nov 12, 2021

Thanks, one more question:

However in another example there is also an event recorded shortly before the end of STW, but it just says "proc stop".

This in particular does sound like the potential problem you describe. Specifically STW has to get all Ps to stop, so each P should have a "proc stop" event before STW continues. (Note that a P may already be stopped, so "proc stop" could be before STW). If a P takes a long time to stop, then that would block everything else.

I'd love to look at the trace you've collected to see if this is the case and what else is going on, if you think that is something you can share.

@dop251
Copy link
Author

dop251 commented Nov 12, 2021

Is there a way to share it privately?

@prattmic
Copy link
Member

prattmic commented Nov 12, 2021

Sure, feel free to email me mpratt AT google.com.

@dop251
Copy link
Author

dop251 commented Nov 12, 2021

Shared a Google Drive folder with you.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
Status: Triage Backlog
Development

No branches or pull requests

4 participants