
runtime: network blips during concurrent GC #20457

Closed

petermattis opened this issue May 22, 2017 · 11 comments

@petermattis commented May 22, 2017

What version of Go are you using (go version)?

go version go1.8.1 darwin/amd64

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/pmattis/Development/go"
GORACE=""
GOROOT="/Users/pmattis/Development/go-1.8"
GOTOOLDIR="/Users/pmattis/Development/go-1.8/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/qc/fpqpgdqd167c70dtc6840xxh0000gn/T/go-build955668609=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"
PKG_CONFIG="pkg-config"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"

What did you do?

While investigating tail latencies in CockroachDB, I noticed network blips during concurrent GC. The trace below is from a single-node cockroach cluster with a single client sending it requests as fast as possible. Note that the load is modest. My test machine is ~75% idle while gathering these traces.

screen shot 2017-05-22 at 11 25 20 am

trace.8.out.zip

Every time GC runs, there are no networking events for ~10ms which is quite close to the length of the GC run itself. This behavior occurs very quickly after starting the cluster and load generator. The heap size is modest (O(100 MB)).

Here is a zoom in on the first GC run from the above trace:

screen shot 2017-05-22 at 12 54 18 pm

In addition to the blip while GC is running, there is a smaller blip after each GC run while sweeping. The image below shows G611 being called on to assist in sweep work from 62ms-69ms in the trace.

screen shot 2017-05-22 at 1 16 31 pm

Note that I'm running both the cockroach process and the load generator (kv) on the same machine. Are the GC runs in the cockroach process starving the kv process of CPU? Running both processes with GOMAXPROCS=4 (I'm using an 8-core machine) shows much better behavior.

screen shot 2017-05-22 at 11 25 58 am

trace.4.out.zip


@ianlancetaylor (Contributor) commented May 23, 2017

I note that the sysmon goroutine polls the network every 10ms if nothing else has polled. Normally findrunnable will poll the network whenever it doesn't have another goroutine to run. Perhaps during the GC there is always another goroutine to run (to do GC work) and therefore we fall back to the sysmon polling.

@petermattis (Author) commented May 23, 2017

I forgot to add that I instrumented the Go runtime and can see lots of calls to netpoll via gcDrain -> pollWork. I also tried lowering the max sleep time in sysmon to 1ms, but that had no effect. Perhaps the 10ms is some OS-level time slice on Darwin. I'll try to verify this same behavior on Linux.

@petermattis (Author) commented May 23, 2017

Linux shows similar behavior:

screen shot 2017-05-22 at 8 56 14 pm

trace.linux.out.zip

I'm not sure what to make of the 3 idle procs (5, 6 and 7).

@josharian (Contributor) commented May 23, 2017

Probably unrelated, but #20307 also observed that GC had a detrimental effect on goroutine scheduling.

@josharian (Contributor) commented May 23, 2017

@petermattis if you can reproduce easily, would you mind trying out tip (and maybe 1.7.5) to see whether anything has changed?

@josharian added this to the Go1.9Maybe milestone May 23, 2017

@petermattis (Author) commented May 23, 2017

@josharian Tip as of go version devel +5e79787 Mon May 22 13:44:12 2017 +0000 darwin/amd64:

screen shot 2017-05-22 at 9 43 24 pm

trace.tip.out.zip

The networking blip is still there during GC, but it is no longer precisely 10ms. The network blips during sweep seem to be much reduced:

screen shot 2017-05-22 at 9 45 38 pm

@petermattis (Author) commented May 24, 2017

I'm reasonably convinced that what I'm seeing is CPU starvation of my load generator. I tweaked the output of GODEBUG=gctrace=1 to compute the amount of CPU used by GC during the concurrent mark phase:

gc 203 @28.154s 0/98%: 0.007+9.5+0.039 ms clock, 0.060+0.21/18/56+0.31 ms cpu, 82->82->42 MB, 84 MB goal, 8 P

The 0/98% indicates that GC has used 0% of the CPU over the lifetime of the process, but 98% of the CPU during the 9.5ms of concurrent marking. Using otherwise-idle CPU for GC is reasonable on the one hand, but it makes the Go process a poor neighbor.

I tried disabling idle GC workers, which significantly reduced the CPU spikes during GC but made latencies much worse:

screen shot 2017-05-24 at 9 49 48 am

trace.noidlegc.out.zip

In the above trace, G377 is running on Proc 5 when GC starts. G377 then blocks on network IO and the 2 dedicated GC workers get started. But there are only 2 Procs running and there does not seem to be a mechanism to kick off another one (or perhaps I disabled that mechanism in my hack to disable idle GC workers). Adding a wakep call after a dedicated GC worker is started fixes this issue. This gets latencies to a better place:

screen shot 2017-05-24 at 9 58 33 am

trace.noidlegc2.out.zip

CPU usage during marking is much lower than stock go1.8.1:

gc 201 @30.960s 1/29%: 0.025+22+0.080 ms clock, 0.20+11/41/0+0.64 ms cpu, 87->88->45 MB, 89 MB goal, 8 P

The downside to removing the idle GC workers is that the GC periods are about twice as long.

Seems like there is a high-level question to be answered about how much CPU GC should use when the process is not CPU limited.

@bradfitz (Member) commented May 24, 2017

Dup of #17969 ("aggressive GC completion is disruptive to co-tenants") then?

@petermattis (Author) commented May 24, 2017

Yep, that seems like the same issue I'm seeing.

@aclements (Member) commented Jul 20, 2017

> But there are only 2 Procs running and there does not seem to be a mechanism to kick off another one (or perhaps I disabled that mechanism in my hack to disable idle GC workers).

You're right that at the moment it's possible for not all idle Ps to help with GC. This is #14179 (which we fixed once, but the fix was broken, so then we unfixed it).

> Seems like there is a high-level question to be answered about how much CPU GC should use when the process is not CPU limited.

Yes. :) But I don't know what the answer is. Ideally we'd get the OS scheduler involved, since it's the only thing that knows what CPU resources are really idle. But (right now) we need global coordination between the GC workers, so we can't just run them at OS idle priority.

Thanks for all the traces and debugging! Closing as a dup of #17969.

@aclements closed this Jul 20, 2017

@golang locked and limited conversation to collaborators Jul 20, 2018
