runtime: new goroutines can spend excessive time in morestack #18138

Open
petermattis opened this Issue Dec 1, 2016 · 25 comments

petermattis commented Dec 1, 2016

What version of Go are you using (go version)?

go version devel +41908a5 Thu Dec 1 02:54:21 2016 +0000 darwin/amd64 a.k.a go1.8beta1

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/pmattis/Development/go"
GORACE=""
GOROOT="/Users/pmattis/Development/go-1.8"
GOTOOLDIR="/Users/pmattis/Development/go-1.8/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/qc/fpqpgdqd167c70dtc6840xxh0000gn/T/go-build385423377=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"
PKG_CONFIG="pkg-config"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"

What did you do?

A recent change to github.com/cockroachdb/cockroach replaced a synchronous call with one wrapped in a goroutine. This small change resulted in a significant slowdown in some benchmarks. The slowdown was traced to additional time being spent in runtime.morestack. The problematic goroutines all hit a single gRPC entrypoint, Server.Batch, and the code paths that fan out from this entrypoint tend to use an excessive amount of stack due to an over-reliance on passing and returning by value instead of using pointers. Typical calls use 16-32 KB of stack.

The expensive part of runtime.morestack is the adjustment of existing pointers on the stack. Because stack growth is incremental (the stack doubles at each step), I can see the stack growing in 4 steps from 2 KB to 32 KB. So we experimented with a hack to pre-grow the stack. Voila, the performance penalty of the change disappeared:

name               old time/op  new time/op  delta
KVInsert1_SQL-8     339µs ± 2%   312µs ± 1%   -7.89%  (p=0.000 n=10+10)
KVInsert10_SQL-8    485µs ± 2%   471µs ± 1%   -2.81%  (p=0.000 n=10+10)
KVInsert100_SQL-8  1.36ms ± 0%  1.35ms ± 0%   -0.95%  (p=0.000 n=10+10)
KVUpdate1_SQL-8     535µs ± 1%   487µs ± 1%   -9.02%   (p=0.000 n=10+9)
KVUpdate10_SQL-8    777µs ± 1%   730µs ± 1%   -6.03%   (p=0.000 n=10+9)
KVUpdate100_SQL-8  2.69ms ± 1%  2.66ms ± 1%   -1.16%  (p=0.000 n=10+10)
KVDelete1_SQL-8     479µs ± 1%   429µs ± 2%  -10.43%   (p=0.000 n=9+10)
KVDelete10_SQL-8    676µs ± 1%   637µs ± 1%   -5.80%    (p=0.000 n=9+9)
KVDelete100_SQL-8  2.23ms ± 5%  2.18ms ± 4%     ~     (p=0.105 n=10+10)
KVScan1_SQL-8       216µs ± 5%   179µs ± 1%  -17.12%  (p=0.000 n=10+10)
KVScan10_SQL-8      233µs ± 1%   201µs ± 1%  -13.76%  (p=0.000 n=10+10)
KVScan100_SQL-8     463µs ± 1%   437µs ± 0%   -5.64%   (p=0.000 n=10+8)

The old numbers were gathered using go1.8beta1; the new numbers are from go1.8beta1 with the hack to pre-grow the stack. The hack is a call to a growStack function at the beginning of Server.Batch:

var growStackGlobal = false

//go:noinline
func growStack() {
	// Goroutine stacks currently start at 2 KB in size. The code paths through
	// the storage package often need a stack that is 32 KB in size. The stack
	// growth is mildly expensive, making it useful to trick the runtime into
	// growing the stack early. Since goroutine stacks start at 2 KB and double
	// in size each time they grow, placing a 16 KB object on the stack early in
	// the lifetime of a goroutine forces the runtime to use a 32 KB stack for
	// the goroutine.
	var buf [16 << 10] /* 16 KB */ byte
	if growStackGlobal {
		// Make sure the compiler doesn't optimize away buf.
		for i := range buf {
			buf[i] = byte(i)
		}
	}
}
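
For illustration, here is roughly how the hack is wired in; the handler below is a placeholder standing in for the real Server.Batch entrypoint and its request/response types, not the actual CockroachDB code:

// Batch is the gRPC entrypoint that the problematic goroutines pass through.
// Calling growStack first forces the stack to 32 KB while it is still nearly
// empty (and therefore cheap to copy), so the deep call paths below do not
// hit runtime.morestack once the stack is full of frames.
func (s *Server) Batch(ctx context.Context, req *BatchRequest) (*BatchResponse, error) {
	growStack()
	return s.batchInternal(ctx, req)
}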

The question here is whether this is copacetic, and also to alert the runtime folks that there is a performance opportunity here. Note that the growStackGlobal check is not currently necessary, but I wanted to future-proof against the compiler deciding that buf is unused and optimizing it away.

Longer term, the stack usage under server.Batch should be reduced on our side. I'm guessing that we could get the stack usage down to 8-16 KB without too many contortions. But even with such reductions, being able to pre-grow the stack for a goroutine looks beneficial.

@bradfitz bradfitz added this to the Go1.9 milestone Dec 1, 2016

bradfitz commented Dec 1, 2016

@aclements

aclements commented Dec 1, 2016

We've seen this a few times now. I'm not sure what the right answer is. My best thought so far is that the runtime could keep track of when particular go statements always lead to stack growth right away (for some value of "right away" and "always") and learn to start goroutines from that site with a larger stack. Of course, it would be hard to make this behavior predictable, but perhaps it would still be less surprising than the current behavior. If the runtime did learn to start a goroutine with a larger stack, it would still need a signal to learn if the stack should get smaller again, but we could do that efficiently by allocating the larger stack but setting the stack bounds to something smaller. Then the runtime could still observe whether or not the stack needs to grow, but the actual growth would be basically free until it reached the size of the allocation.

@randall77, thoughts, ideas?

/cc @RLH
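
(A purely illustrative sketch of the bookkeeping such a heuristic would need — a table keyed by the go statement's call site that remembers how large goroutines spawned there tend to get. None of these names exist in the runtime; a real implementation would live inside the scheduler.)

package stacksize

import "sync"

// learned maps the PC of a go statement to the stack size (in bytes) that
// goroutines created there have historically grown to.
var learned sync.Map // map[uintptr]uintptr

// startSizeFor returns the stack size to use for a goroutine created at
// callerPC, falling back to the normal 2 KB default.
func startSizeFor(callerPC uintptr) uintptr {
	if v, ok := learned.Load(callerPC); ok {
		return v.(uintptr)
	}
	return 2 << 10
}

// observeFinalSize records the size a goroutine's stack reached, so later
// goroutines from the same site can start closer to what they will need.
// Shrinking back down would be detected as described above: allocate the
// larger stack but set the bounds smaller, and watch whether growth occurs.
func observeFinalSize(callerPC, size uintptr) {
	learned.Store(callerPC, size)
}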

mrjrieke commented Dec 1, 2016

I like @petermattis's idea of being able to hint the stack size on a per-goroutine basis, although this implies developers have the know-how to identify and provide accurate size estimates. Could this be done with a compiler directive?

bradfitz commented Dec 1, 2016

We don't want compiler directives in code. We have some used by the runtime out of necessity, but they're gross. Go prefers simplicity over tons of knobs.

petermattis commented Dec 1, 2016

Yes, please just make my code magically faster as you've been doing for the last several Go releases.

mrjrieke commented Dec 1, 2016

I generally agree with not having compiler directives ... magic is nice, although they (compiler directives) do exist even in Go. It's an interesting opportunity either way you decide.

mrjrieke commented Dec 2, 2016

@bradfitz, your comment prompted me to look for the go guiding principles ( https://golang.org/doc/faq#principles). Thanks @adg as well for nicely worded principles.

gopherbot commented Jun 8, 2017

CL https://golang.org/cl/45142 mentions this issue.

aclements commented Jun 8, 2017

@petermattis (or anyone who has a good reproducer for this), would you be able to try https://go-review.googlesource.com/45142? It's a trivial hack, but it might actually do the trick. I haven't benchmarked it on anything, so it may also slow things down.

@aclements aclements modified the milestones: Go1.10Early, Go1.9 Jun 8, 2017

@aclements aclements self-assigned this Jun 8, 2017

petermattis commented Jun 9, 2017

@aclements I'll try and test either tomorrow or next week.

petermattis commented Jun 13, 2017

@aclements Applying that patch to go1.8.3 resulted in no benefit (this is with the growStack hack disabled):

~/Development/go/src/github.com/cockroachdb/cockroach/pkg/sql master benchstat out.old out.new
name                old time/op  new time/op  delta
KV/Insert1_SQL-8     363µs ± 3%   369µs ± 2%  +1.43%  (p=0.043 n=10+10)
KV/Insert10_SQL-8    583µs ± 0%   581µs ± 1%    ~     (p=0.113 n=10+9)
KV/Insert100_SQL-8  2.05ms ± 0%  2.05ms ± 1%    ~     (p=0.912 n=10+10)
KV/Update1_SQL-8     578µs ± 1%   577µs ± 1%    ~     (p=0.968 n=9+10)
KV/Update10_SQL-8    913µs ± 1%   914µs ± 1%    ~     (p=0.931 n=9+9)
KV/Update100_SQL-8  3.80ms ± 1%  3.87ms ± 5%  +1.90%  (p=0.019 n=10+10)
KV/Delete1_SQL-8     517µs ± 2%   518µs ± 2%    ~     (p=0.912 n=10+10)
KV/Delete10_SQL-8    813µs ± 2%   809µs ± 1%    ~     (p=0.280 n=10+10)
KV/Delete100_SQL-8  3.22ms ± 2%  3.26ms ± 3%    ~     (p=0.052 n=10+10)
KV/Scan1_SQL-8       217µs ± 1%   216µs ± 0%    ~     (p=0.090 n=9+10)
KV/Scan10_SQL-8      238µs ± 0%   238µs ± 1%    ~     (p=0.122 n=10+8)
KV/Scan100_SQL-8     454µs ± 0%   455µs ± 1%    ~     (p=0.089 n=10+10)

It's surprising to me that this didn't have any effect. Compare this to the growStack hack mentioned earlier:

~/Development/go/src/github.com/cockroachdb/cockroach/pkg/sql master benchstat out.old out.grow-stack
name                old time/op  new time/op  delta
KV/Insert1_SQL-8     363µs ± 3%   331µs ± 2%   -8.82%  (p=0.000 n=10+10)
KV/Insert10_SQL-8    583µs ± 0%   561µs ± 1%   -3.80%  (p=0.000 n=10+10)
KV/Insert100_SQL-8  2.05ms ± 0%  2.03ms ± 0%   -0.88%  (p=0.000 n=10+8)
KV/Update1_SQL-8     578µs ± 1%   532µs ± 1%   -7.94%  (p=0.000 n=9+10)
KV/Update10_SQL-8    913µs ± 1%   872µs ± 1%   -4.47%  (p=0.000 n=9+9)
KV/Update100_SQL-8  3.80ms ± 1%  3.75ms ± 1%   -1.36%  (p=0.000 n=10+10)
KV/Delete1_SQL-8     517µs ± 2%   458µs ± 2%  -11.54%  (p=0.000 n=10+10)
KV/Delete10_SQL-8    813µs ± 2%   765µs ± 1%   -5.91%  (p=0.000 n=10+10)
KV/Delete100_SQL-8  3.22ms ± 2%  3.16ms ± 1%   -2.01%  (p=0.000 n=10+10)
KV/Scan1_SQL-8       217µs ± 1%   194µs ± 1%  -10.44%  (p=0.000 n=9+10)
KV/Scan10_SQL-8      238µs ± 0%   216µs ± 1%   -9.36%  (p=0.000 n=10+10)
KV/Scan100_SQL-8     454µs ± 0%   431µs ± 1%   -4.92%  (p=0.000 n=10+9)

josharian commented Jun 13, 2017

CL 43150 might help a little here.

aclements commented Jun 13, 2017

Sorry, I made a silly mistake in CL 45142. Would you mind trying the new version of that CL?

petermattis commented Jun 14, 2017

With your updated patch against go-tip (f363817) there is an improvement:

~/Development/go/src/github.com/cockroachdb/cockroach/pkg/sql master benchstat out.old out.new
name              old time/op  new time/op  delta
KV/Scan1_SQL-8     243µs ± 1%   224µs ± 0%  -7.57%  (p=0.000 n=9+9)
KV/Scan10_SQL-8    263µs ± 0%   247µs ± 0%  -6.20%  (p=0.000 n=9+10)
KV/Scan100_SQL-8   463µs ± 0%   444µs ± 0%  -4.05%  (p=0.000 n=10+10)

But the improvement is still not as good as the growStack hack:

~/Development/go/src/github.com/cockroachdb/cockroach/pkg/sql master benchstat out.new out.grow-stack
name              old time/op  new time/op  delta
KV/Scan1_SQL-8     224µs ± 0%   219µs ± 0%  -2.24%  (p=0.000 n=9+9)
KV/Scan10_SQL-8    247µs ± 0%   240µs ± 1%  -2.59%  (p=0.000 n=10+10)
KV/Scan100_SQL-8   444µs ± 0%   439µs ± 0%  -1.06%  (p=0.000 n=10+9)

There is a little more performance if we increase the initial stack size to 32 KB:

~/Development/go/src/github.com/cockroachdb/cockroach/pkg/sql master benchstat out.old out.new2
name              old time/op  new time/op  delta
KV/Scan1_SQL-8     243µs ± 1%   209µs ± 1%  -13.76%  (p=0.000 n=9+9)
KV/Scan10_SQL-8    263µs ± 0%   232µs ± 2%  -11.61%  (p=0.000 n=9+10)
KV/Scan100_SQL-8   463µs ± 0%   445µs ± 4%   -3.86%  (p=0.000 n=10+9)

Interestingly, all of these timings are lower than with go1.8.3.
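
For reference, the "increase the initial stack size" experiment amounts to bumping the runtime's minimum stack size constant; roughly the following against a go1.8-era src/runtime/stack.go (the exact location differs between Go versions):

--- a/src/runtime/stack.go
+++ b/src/runtime/stack.go
-	_StackMin = 2048
+	_StackMin = 32768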

petermattis commented Jun 14, 2017

Interestingly, all of these timings are lower than with go1.8.3.

Nothing to see here. This appears to be due to a change in our code between what I tested earlier today and now.

@bradfitz bradfitz modified the milestones: Go1.10Early, Go1.10 Jun 14, 2017

petermattis commented Jun 16, 2017

I did some more testing of this patch and the performance improvements carry over to production settings. morestack disappears from profiles. Note that this is using a version of the patch with a 32 KB initial stack size.

petermattis commented Aug 13, 2017

It is early in the 1.10 cycle and I wanted to bring this issue forward again. See cockroachdb/cockroach#17242 for a graph showing the benefit of a larger initial stack size.

petermattis commented Oct 11, 2017

Is there any update on this issue? A larger initial goroutine stack size provides a nice performance boost for our system.

richardartoul commented Jun 7, 2018

@petermattis We ran into a similar issue in production where adding a single extra function call (a few lines of code, nothing heavy) increased our end-to-end latency by a few seconds. We narrowed the issue down to an increase in runtime.morestack. We ended up resolving it with this worker pool: https://github.com/m3db/m3x/blob/master/sync/pooled_worker_pool.go
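
A minimal sketch of the pooled-worker idea (long-lived goroutines whose stacks pay the growth cost once and are then reused); this is illustrative only, not the m3x implementation:

// WorkerPool runs submitted functions on a fixed set of goroutines, so the
// stack of each worker grows once and subsequent work items never trigger
// runtime.morestack again.
type WorkerPool struct {
	work chan func()
}

func NewWorkerPool(size int) *WorkerPool {
	p := &WorkerPool{work: make(chan func())}
	for i := 0; i < size; i++ {
		go func() {
			for fn := range p.work {
				fn()
			}
		}()
	}
	return p
}

// Go runs fn on one of the pooled goroutines, blocking until a worker is free.
func (p *WorkerPool) Go(fn func()) {
	p.work <- fn
}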

I noticed that you said:

I can see the stack growing in 4 steps from 2 KB to 32 KB.

Did you have any way of programmatically measuring when the stack grows? I would love to be able to prove in production that the additional function we added was enough to push the stack over the edge into one more copy, but all I could do was make inferences from the CPU profiles.

josharian commented Jun 7, 2018

If you can reproduce in a non-production environment, you might want to try tip of Go 1.11. There were a few improvements made this cycle that could help a bit.

Again, in a non-production environment, you could simply add a println("grow") to the implementation of morestack in the runtime. In a production environment, you might want to atomically increment a package variable instead and read it out atomically from another goroutine.
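
A sketch of the counter approach (the increment would live inside a locally patched runtime's newstack; it is shown here as an ordinary package only to illustrate the counter-and-reader pattern):

package stackcount

import "sync/atomic"

// growths is incremented by the patched runtime on every stack growth.
var growths uint64

// Record is the call the patched runtime would make from newstack.
func Record() { atomic.AddUint64(&growths, 1) }

// Count can be read from any goroutine, e.g. logged periodically or exported
// as a metric, to see how often stacks are growing in production.
func Count() uint64 { return atomic.LoadUint64(&growths) }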

You might also want to file an issue requesting a built-in pprof profile for stack size changes, although even if approved that couldn’t land until 1.12.

richardartoul commented Jun 7, 2018

@josharian I'm happy to request a built-in pprof profile for stack size changes if you think that's possible. The main issue with the current CPU profiles is that you can see it's happening, but you can't trace what is triggering it.

josharian commented Jun 7, 2018

Please file an issue, and we can discuss feasibility there. If you cc me I’ll pull in other relevant folks. Thanks!

petermattis commented Jun 8, 2018

Did you have any way of programmatically measuring when the stack grows? I would love to be able to prove in production that the additional function we added was enough to push the stack over the edge into one more copy, but all I could do was make inferences from the CPU profiles.

@richardartoul I used the following patch (against go1.9):

diff --git a/src/runtime/stack.go b/src/runtime/stack.go
index 525d0b14c1..4615defa67 100644
--- a/src/runtime/stack.go
+++ b/src/runtime/stack.go
@@ -910,6 +910,8 @@ func round2(x int32) int32 {
        return 1 << s
 }

+var debugStack = gogetenv("GOSTACK") == "1"
+
 // Called from runtime·morestack when more stack is needed.
 // Allocate larger stack and relocate to new stack.
 // Stack growth is multiplicative, for constant amortized cost.
@@ -937,6 +939,14 @@ func newstack(ctxt unsafe.Pointer) {
        // morestack so it has the necessary write barrier.
        gp.sched.ctxt = ctxt

+       if debugStack {
+               oldsize := gp.stack.hi - gp.stack.lo
+               newsize := oldsize * 2
+               print("runtime: newstack: ", oldsize, " -> ", newsize, "\n")
+               morebuf := thisg.m.morebuf
+               traceback(morebuf.pc, morebuf.sp, morebuf.lr, morebuf.g.ptr())
+       }
+
        if thisg.m.curg.throwsplit {
                // Update syscallsp, syscallpc in case traceback uses them.
                morebuf := thisg.m.morebuf
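
With this patch applied to the Go tree, running the binary with GOSTACK=1 makes the runtime print each growth as "runtime: newstack: <old> -> <new>" followed by a traceback of the goroutine that triggered it, which makes it straightforward to attribute the growth to a particular call path.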

richardartoul commented Jun 8, 2018

@petermattis Awesome, thanks! Will report back.

@josharian I will file an issue soon; I just want to gather all my evidence, as it's kind of scattered around right now.

josharian commented Jun 8, 2018

Don't worry about evidence or a big write-up. For this, a few sentences and a reference back to this issue suffice.
