Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: concurrent phase of GC causing full request stalls #17341

Closed
zviadm opened this issue Oct 4, 2016 · 5 comments

Comments

Projects
None yet
5 participants
@zviadm
Copy link

commented Oct 4, 2016

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

Go 1.7
Also tried with: GO: fd29628

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0"
CXX="g++"
CGO_ENABLED="1"

What did you do?

If possible, provide a recipe for reproducing the error.
A complete runnable program is good.
A link on play.golang.org is best.

Wrote a benchmark to showcase the issue using native net/http library:
https://gist.github.com/zviadm/18263037d4b9b624fc8a5b4cb7005bb4

What this benchmark does is, create a simple http server, and then create lots of connections to it and perform very simple requests to it from those connections/goroutines. This simulates having an http server and lots of clients where most of them are idle.

What did you expect to see?

When run using something like:
$: ulimit -Sn 500000
$: GODEBUG=gctrace=1 ./net_gctest -port=5002 -nroutines=30000 -sleep_t=3s -max_latency=30ms &> benchout.txt

Expect to see it doing 10k QPS, with no stalls or tail latencies.

What did you see instead?

Sample output: https://gist.github.com/zviadm/82e9b899833db9dc741f9a6400a09d6c

gc 25 @222.337s 0%: 0.34+165+1.3 ms clock, 11+477/1307/2896+43 ms cpu, 1411->1415->723 MB, 1447 MB goal, 32 P
gc 26 @237.844s 0%: 0.30+165+1.4 ms clock, 9.8+504/1306/3076+47 ms cpu, 1411->1415->723 MB, 1447 MB goal, 32 P
gc 27 @253.378s 0%: 0.30+162+1.2 ms clock, 9.4+485/1286/2993+39 ms cpu, 1411->1415->723 MB, 1447 MB goal, 32 P

even though, STW phase of GC is <2ms, we see lots of stalls of >30ms while concurrent phase of the GC runs. pretty much all requests stall during that time period.

This is a pretty serious issue because that basically means if you have lots of idle connections (or potentially just go routines?) in a server, it causes huge tail latency spikes. And in real programs GC frequency will be higher, it can happen once every 5-10 seconds and now these spikes easily show up in p95/p99 latency times.

Note that the cpu utilization is very low throughout the test. If we perform same test with much smaller number of go routines, say only 200 or 300, it can do 200k QPS without any STALLs on that same machine.

@bradfitz bradfitz changed the title Concurrent phase of GC causing full request stalls. runtime: concurrent phase of GC causing full request stalls Oct 4, 2016

@zviadm

This comment has been minimized.

Copy link
Author

commented Oct 4, 2016

Maybe related to:
#15847
#16293

But it seems like issue remains with current 'master' too, so seems like something else

@quentinmit quentinmit added this to the Go1.8 milestone Oct 4, 2016

@quentinmit

This comment has been minimized.

Copy link
Contributor

commented Oct 5, 2016

@zviadm Can you send an execution trace? There are several known issues around this, which we're working on for Go 1.8. From the trace we should be able to tell which one(s) are affecting you.

https://golang.org/pkg/runtime/trace/

@cespare

This comment has been minimized.

Copy link
Contributor

commented Oct 5, 2016

I reproduced this using the given gist. Log output here.

Relevant bit of the trace:

screen_20161005100414

I attached the full trace here. For me, the pause appears in the split 6/8 (15.636125645s-19.117813135s).

Edit: I originally posted a trace using Go 1.7. I updated the log and trace using head (dce0df2) instead.

@aclements

This comment has been minimized.

Copy link
Member

commented Oct 5, 2016

Thanks! It's clear from the trace that this is #16528. Closing as a duplicate.

@zviadm, if you subscribe to #16528, it would be great to get feedback on future fixes. You can also try CL 30031 from that issue, though based on my own tests that just shifts the pauses around (albeit to an easier-to-fix place, which is progress in some sense :)

@aclements aclements closed this Oct 5, 2016

@zviadm

This comment has been minimized.

Copy link
Author

commented Oct 5, 2016

Thanks for quick responses, I have subscribed to the other task. Will try to see what happens with that CL too, just as a data point.

@golang golang locked and limited conversation to collaborators Oct 5, 2017

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
You can’t perform that action at this time.