runtime: new goroutines can spend excessive time in morestack #18138

Open · petermattis opened this issue Dec 1, 2016 · 44 comments

@petermattis commented Dec 1, 2016

What version of Go are you using (go version)?

go version devel +41908a5 Thu Dec 1 02:54:21 2016 +0000 darwin/amd64 a.k.a go1.8beta1

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/pmattis/Development/go"
GORACE=""
GOROOT="/Users/pmattis/Development/go-1.8"
GOTOOLDIR="/Users/pmattis/Development/go-1.8/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/qc/fpqpgdqd167c70dtc6840xxh0000gn/T/go-build385423377=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"
PKG_CONFIG="pkg-config"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"

What did you do?

A recent change to github.com/cockroachdb/cockroach replaced a synchronous call with one wrapped in a goroutine. This small change resulted in a significant slowdown in some benchmarks. The slowdown was traced to additional time being spent in runtime.morestack. The problematic goroutines are all hitting a single gRPC entrypoint, Server.Batch, and the code paths that fan out from this entrypoint tend to use an excessive amount of stack due to an over-reliance on passing and returning by value instead of using pointers. Typical calls use 16-32 KB of stack.

The expensive part of runtime.morestack is the adjustment of existing pointers on the stack. And due to the incremental nature of the stack growth, I can see the stack growing in 4 steps from 2 KB to 32 KB. So we experimented with a hack to pre-grow the stack. Voila, the performance penalty of the change disappeared:

name               old time/op  new time/op  delta
KVInsert1_SQL-8     339µs ± 2%   312µs ± 1%   -7.89%  (p=0.000 n=10+10)
KVInsert10_SQL-8    485µs ± 2%   471µs ± 1%   -2.81%  (p=0.000 n=10+10)
KVInsert100_SQL-8  1.36ms ± 0%  1.35ms ± 0%   -0.95%  (p=0.000 n=10+10)
KVUpdate1_SQL-8     535µs ± 1%   487µs ± 1%   -9.02%   (p=0.000 n=10+9)
KVUpdate10_SQL-8    777µs ± 1%   730µs ± 1%   -6.03%   (p=0.000 n=10+9)
KVUpdate100_SQL-8  2.69ms ± 1%  2.66ms ± 1%   -1.16%  (p=0.000 n=10+10)
KVDelete1_SQL-8     479µs ± 1%   429µs ± 2%  -10.43%   (p=0.000 n=9+10)
KVDelete10_SQL-8    676µs ± 1%   637µs ± 1%   -5.80%    (p=0.000 n=9+9)
KVDelete100_SQL-8  2.23ms ± 5%  2.18ms ± 4%     ~     (p=0.105 n=10+10)
KVScan1_SQL-8       216µs ± 5%   179µs ± 1%  -17.12%  (p=0.000 n=10+10)
KVScan10_SQL-8      233µs ± 1%   201µs ± 1%  -13.76%  (p=0.000 n=10+10)
KVScan100_SQL-8     463µs ± 1%   437µs ± 0%   -5.64%   (p=0.000 n=10+8)

The old numbers were gathered using go1.8beta1 and the new numbers using go1.8beta1 with the hack to pre-grow the stack. The hack is a call at the beginning of server.Batch to a growStack method:

var growStackGlobal = false

//go:noinline
func growStack() {
	// Goroutine stacks currently start at 2 KB in size. The code paths through
	// the storage package often need a stack that is 32 KB in size. The stack
	// growth is mildly expensive making it useful to trick the runtime into
	// growing the stack early. Since goroutine stacks grow in multiples of 2 and
	// start at 2 KB in size, by placing a 16 KB object on the stack early in the
	// lifetime of a goroutine we force the runtime to use a 32 KB stack for the
	// goroutine.
	var buf [16 << 10] /* 16 KB */ byte
	if growStackGlobal {
		// Make sure the compiler doesn't optimize away buf.
		for i := range buf {
			buf[i] = byte(i)
		}
	}
}

The question here is whether this is copacetic, and also to alert the runtime folks that there is a performance opportunity here. Note that growStackGlobal is not currently necessary, but I wanted to future-proof against the compiler deciding that buf is unnecessary.

Longer term, the stack usage under server.Batch should be reduced on our side. I'm guessing that we could get the stack usage down to 8-16 KB without too many contortions. But even with such reductions, being able to pre-grow the stack for a goroutine looks beneficial.
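
For clarity, this is how the hack is wired up: the call sits at the very top of the function the new goroutine starts in (server.Batch in our case), so the stack reaches its full size while it is still nearly empty and cheap to copy, and the deep call paths that follow no longer trigger runtime.morestack. A minimal sketch, with handleBatch as a hypothetical stand-in for the real handler:

go func() {
	growStack()   // grow the stack to 32 KB while it is still nearly empty and cheap to copy
	handleBatch() // hypothetical stand-in for the stack-hungry call paths
}()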

@bradfitz added this to the Go1.9 milestone Dec 1, 2016

@aclements (Member) commented Dec 1, 2016

We've seen this a few times now. I'm not sure what the right answer is. My best thought so far is that the runtime could keep track of when particular go statements always lead to stack growth right away (for some value of "right away" and "always") and learn to start goroutines from that site with a larger stack. Of course, it would be hard to make this behavior predictable, but perhaps it would still be less surprising than the current behavior. If the runtime did learn to start a goroutine with a larger stack, it would still need a signal to learn if the stack should get smaller again, but we could do that efficiently by allocating the larger stack but setting the stack bounds to something smaller. Then the runtime could still observe whether or not the stack needs to grow, but the actual growth would be basically free until it reached the size of the allocation.

@randall77, thoughts, ideas?

/cc @RLH

@mrjrieke commented Dec 1, 2016

I like @petermattis' idea of being able to hint the stack size on a per-goroutine basis, although this implies developers have the know-how to identify and provide size estimates accurately. Could this be done with a compiler directive?

@bradfitz (Contributor) commented Dec 1, 2016

We don't want compiler directives in code. We have some used by the runtime out of necessity, but they're gross. Go prefers simplicity over tons of knobs.

@petermattis (Author) commented Dec 1, 2016

Yes, please just make my code magically faster as you've been doing for the last several Go releases.

@mrjrieke commented Dec 1, 2016

I generally agree with not having compiler directives ... magic is nice, although they (compiler directives) do exist even in Go. It's an interesting opportunity either way you decide.

@mrjrieke commented Dec 2, 2016

@bradfitz, your comment prompted me to look for the Go guiding principles (https://golang.org/doc/faq#principles). Thanks @adg as well for the nicely worded principles.

@gopherbot commented Jun 8, 2017

CL https://golang.org/cl/45142 mentions this issue.

@aclements (Member) commented Jun 8, 2017

@petermattis (or anyone who has a good reproducer for this), would you be able to try https://go-review.googlesource.com/45142? It's a trivial hack, but it might actually do the trick. I haven't benchmarked it on anything, so it may also slow things down.

@aclements added this to the Go1.10Early milestone Jun 8, 2017
@aclements removed this from the Go1.9 milestone Jun 8, 2017
@aclements self-assigned this Jun 8, 2017

@petermattis (Author) commented Jun 9, 2017

@aclements I'll try and test either tomorrow or next week.

@petermattis (Author) commented Jun 13, 2017

@aclements Applying that patch to go1.8.3 resulted in no benefit (this is with the growStack hack disabled):

~/Development/go/src/github.com/cockroachdb/cockroach/pkg/sql master benchstat out.old out.new
name                old time/op  new time/op  delta
KV/Insert1_SQL-8     363µs ± 3%   369µs ± 2%  +1.43%  (p=0.043 n=10+10)
KV/Insert10_SQL-8    583µs ± 0%   581µs ± 1%    ~     (p=0.113 n=10+9)
KV/Insert100_SQL-8  2.05ms ± 0%  2.05ms ± 1%    ~     (p=0.912 n=10+10)
KV/Update1_SQL-8     578µs ± 1%   577µs ± 1%    ~     (p=0.968 n=9+10)
KV/Update10_SQL-8    913µs ± 1%   914µs ± 1%    ~     (p=0.931 n=9+9)
KV/Update100_SQL-8  3.80ms ± 1%  3.87ms ± 5%  +1.90%  (p=0.019 n=10+10)
KV/Delete1_SQL-8     517µs ± 2%   518µs ± 2%    ~     (p=0.912 n=10+10)
KV/Delete10_SQL-8    813µs ± 2%   809µs ± 1%    ~     (p=0.280 n=10+10)
KV/Delete100_SQL-8  3.22ms ± 2%  3.26ms ± 3%    ~     (p=0.052 n=10+10)
KV/Scan1_SQL-8       217µs ± 1%   216µs ± 0%    ~     (p=0.090 n=9+10)
KV/Scan10_SQL-8      238µs ± 0%   238µs ± 1%    ~     (p=0.122 n=10+8)
KV/Scan100_SQL-8     454µs ± 0%   455µs ± 1%    ~     (p=0.089 n=10+10)

Surprisingly (to me), this didn't have any effect. Compare this to the growStack hack mentioned earlier:

~/Development/go/src/github.com/cockroachdb/cockroach/pkg/sql master benchstat out.old out.grow-stack
name                old time/op  new time/op  delta
KV/Insert1_SQL-8     363µs ± 3%   331µs ± 2%   -8.82%  (p=0.000 n=10+10)
KV/Insert10_SQL-8    583µs ± 0%   561µs ± 1%   -3.80%  (p=0.000 n=10+10)
KV/Insert100_SQL-8  2.05ms ± 0%  2.03ms ± 0%   -0.88%  (p=0.000 n=10+8)
KV/Update1_SQL-8     578µs ± 1%   532µs ± 1%   -7.94%  (p=0.000 n=9+10)
KV/Update10_SQL-8    913µs ± 1%   872µs ± 1%   -4.47%  (p=0.000 n=9+9)
KV/Update100_SQL-8  3.80ms ± 1%  3.75ms ± 1%   -1.36%  (p=0.000 n=10+10)
KV/Delete1_SQL-8     517µs ± 2%   458µs ± 2%  -11.54%  (p=0.000 n=10+10)
KV/Delete10_SQL-8    813µs ± 2%   765µs ± 1%   -5.91%  (p=0.000 n=10+10)
KV/Delete100_SQL-8  3.22ms ± 2%  3.16ms ± 1%   -2.01%  (p=0.000 n=10+10)
KV/Scan1_SQL-8       217µs ± 1%   194µs ± 1%  -10.44%  (p=0.000 n=9+10)
KV/Scan10_SQL-8      238µs ± 0%   216µs ± 1%   -9.36%  (p=0.000 n=10+10)
KV/Scan100_SQL-8     454µs ± 0%   431µs ± 1%   -4.92%  (p=0.000 n=10+9)

@josharian (Contributor) commented Jun 13, 2017

CL 43150 might help a little here.

@aclements (Member) commented Jun 13, 2017

Sorry, I made a silly mistake in CL 45142. Would you mind trying the new version of that CL?

@petermattis (Author) commented Jun 14, 2017

With your updated patch against go-tip (f363817) there is an improvement:

~/Development/go/src/github.com/cockroachdb/cockroach/pkg/sql master benchstat out.old out.new
name              old time/op  new time/op  delta
KV/Scan1_SQL-8     243µs ± 1%   224µs ± 0%  -7.57%  (p=0.000 n=9+9)
KV/Scan10_SQL-8    263µs ± 0%   247µs ± 0%  -6.20%  (p=0.000 n=9+10)
KV/Scan100_SQL-8   463µs ± 0%   444µs ± 0%  -4.05%  (p=0.000 n=10+10)

But the improvement is still not as good as the growStack hack:

~/Development/go/src/github.com/cockroachdb/cockroach/pkg/sql master benchstat out.new out.grow-stack
name              old time/op  new time/op  delta
KV/Scan1_SQL-8     224µs ± 0%   219µs ± 0%  -2.24%  (p=0.000 n=9+9)
KV/Scan10_SQL-8    247µs ± 0%   240µs ± 1%  -2.59%  (p=0.000 n=10+10)
KV/Scan100_SQL-8   444µs ± 0%   439µs ± 0%  -1.06%  (p=0.000 n=10+9)

There is a little more performance if we increase the initial stack size to 32 KB:

~/Development/go/src/github.com/cockroachdb/cockroach/pkg/sql master benchstat out.old out.new2
name              old time/op  new time/op  delta
KV/Scan1_SQL-8     243µs ± 1%   209µs ± 1%  -13.76%  (p=0.000 n=9+9)
KV/Scan10_SQL-8    263µs ± 0%   232µs ± 2%  -11.61%  (p=0.000 n=9+10)
KV/Scan100_SQL-8   463µs ± 0%   445µs ± 4%   -3.86%  (p=0.000 n=10+9)

Interestingly, all of these timings are lower than with go1.8.3.

@petermattis (Author) commented Jun 14, 2017

Interestingly, all of these timings are lower than with go1.8.3.

Nothing to see here. This appears to be due to a change in our code between what I tested earlier today and now.

@bradfitz removed this from the Go1.10Early milestone Jun 14, 2017
@bradfitz added this to the Go1.10 milestone Jun 14, 2017

@petermattis (Author) commented Jun 16, 2017

I did some more testing of this patch and the performance improvements carry over to production settings; morestack disappears from profiles. Note this is using a version of the patch with a 32 KB initial stack size.

@petermattis (Author) commented Aug 13, 2017

It is early in the 1.10 cycle and I wanted to bring this issue forward again. See cockroachdb/cockroach#17242 for a graph showing the benefit of a larger initial stack size.

@petermattis (Author) commented Oct 11, 2017

Is there any update on this issue? A larger initial goroutine stack size provides a nice performance boost for our system.

@rsc removed this from the Go1.14 milestone Oct 9, 2019
@rsc added this to the Backlog milestone Oct 9, 2019

adtac added a commit to adtac/grpc-go that referenced this issue Nov 22, 2019
Currently (go1.13.4), the default stack size for newly spawned
goroutines is 2048 bytes. This is insufficient when processing gRPC
requests, as we often require more than 4 KiB of stack. This causes the
Go runtime to call runtime.morestack at least twice per RPC, which
causes performance to suffer needlessly, as stack reallocations require
all sorts of internal work such as changing pointers to point to new
addresses.

See golang/go#18138 for more details.

Since this stack growth is guaranteed to happen at least twice per RPC,
reusing goroutines gives us two wins:

  1. The stack is already grown to 8 KiB after the first RPC, so
     subsequent RPCs do not call runtime.morestack.
  2. We eliminate the need to spawn a new goroutine for each request
     (even though they're relatively inexpensive).

Performance improves across the board. The improvement is especially
visible in small, unary requests as the overhead of stack reallocation
is higher, percentage-wise. QPS is up anywhere between 3% and 5%
depending on the number of concurrent RPC requests in flight. Latency is
down ~3%. There is even a 1% decrease in memory footprint in some cases,
though that is an unintended but happy coincidence.

unary-networkMode_none-bufConn_false-keepalive_false-benchTime_1m0s-trace_false-latency_0s-kbps_0-MTU_0-maxConcurrentCalls_8-reqSize_1B-respSize_1B-compressor_off-channelz_false-preloader_false
               Title       Before        After Percentage
            TotalOps      2613512      2701705     3.37%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op      8657.00      8654.17    -0.03%
           Allocs/op       173.37       173.28     0.00%
             ReqT/op    348468.27    360227.33     3.37%
            RespT/op    348468.27    360227.33     3.37%
            50th-Lat    174.601µs    167.378µs    -4.14%
            90th-Lat    233.132µs    229.087µs    -1.74%
            99th-Lat     438.98µs    441.857µs     0.66%
             Avg-Lat    183.263µs     177.26µs    -3.28%

@marcogrecopriolo commented Mar 27, 2020

Just a heads-up that this is an issue for Couchbase's N1QL as well.
Personally, a directive would work well for N1QL (we only have one entry point where we need to run with a different stack size), but I understand the reluctance to use explicit controls.
OTOH, having to build a whole infrastructure of goroutine workers seems a high price to pay for one very small ideological sin (one directive).
Maybe we could have a little map of frequently used entry points holding each point's average stack size on exit?

@uluyol (Contributor) commented Mar 27, 2020

Not sure if this issue was noted anywhere. Here is a sample program that has enough static information to avoid any calls to morestack but in fact observes multiple stack growths:

package main

var shouldSet = false
var c = make(chan bool)
var x [16384]byte

func main() {
	go f32()
	<-c
	println("exit")
}

//go:noinline
func f32() {
	var buf [32]byte
	if shouldSet {
		// Make sure the compiler doesn't optimize away buf.
		for i := range buf {
			buf[i] = byte(i)
		}
		copy(x[:], buf[:])
	}
	f64()
}

[...]


//go:noinline
func f16384() {
	var buf [16384]byte
	if shouldSet {
		// Make sure the compiler doesn't optimize away buf.
		for i := range buf {
			buf[i] = byte(i)
		}
		copy(x[:], buf[:])
	}
	println("done")
	c <- true
}

Each function fX allocates a variable of X bytes on the stack and unconditionally calls another function f(2*X) up to 16KB. When go f32() is called, the compiler should have enough information to allocate a large-enough stack frame up front (it knows that f32 needs to allocate stack space for f64, f128, and so on). What we see instead are multiple calls to newstack:

runtime: newstack: 2048 -> 4096
runtime.(*mcentral).grow(0x10f6b98, 0x0)
	/Users/m/dev/go/src/runtime/mcentral.go:264 +0x13d
runtime.(*mcentral).cacheSpan(0x10f6b98, 0x1402200)
	/Users/m/dev/go/src/runtime/mcentral.go:106 +0x2bc
runtime.(*mcache).refill(0x1121108, 0xe)
	/Users/m/dev/go/src/runtime/mcache.go:138 +0x84
runtime.(*mcache).nextFree(0x1121108, 0xe, 0x1402300, 0x15fffff, 0xc00002e5c0)
	/Users/m/dev/go/src/runtime/malloc.go:867 +0x87
runtime.mallocgc(0x60, 0x10700e0, 0xc00002e601, 0x49)
	/Users/m/dev/go/src/runtime/malloc.go:1047 +0x792
runtime.newobject(0x10700e0, 0x100a024)
	/Users/m/dev/go/src/runtime/malloc.go:1176 +0x38
runtime.acquireSudog(0xc000044048)
	/Users/m/dev/go/src/runtime/proc.go:344 +0x281
runtime.chanrecv(0xc000044000, 0x0, 0xc000000101, 0x101204e)
	/Users/m/dev/go/src/runtime/chan.go:551 +0x223
runtime.chanrecv1(0xc000044000, 0x0)
	/Users/m/dev/go/src/runtime/chan.go:433 +0x2b
runtime.gcenable()
	/Users/m/dev/go/src/runtime/mgc.go:216 +0x95
runtime.main()
	/Users/m/dev/go/src/runtime/proc.go:167 +0x115
runtime.goexit()
	/Users/m/dev/go/src/runtime/asm_amd64.s:1374 +0x1
runtime: newstack: 2048 -> 4096
main.f512()
	/Users/m/dev/go/nomorestack/main.go:75 +0x86
main.f256()
	/Users/m/dev/go/nomorestack/main.go:62 +0x86
main.f128()
	/Users/m/dev/go/nomorestack/main.go:49 +0x83
main.f64()
	/Users/m/dev/go/nomorestack/main.go:36 +0x80
main.f32()
	/Users/m/dev/go/nomorestack/main.go:23 +0x5e
created by main.main
	/Users/m/dev/go/nomorestack/main.go:8 +0x35
runtime: newstack: 4096 -> 8192
main.f1024()
	/Users/m/dev/go/nomorestack/main.go:88 +0x91
main.f512()
	/Users/m/dev/go/nomorestack/main.go:75 +0x86
main.f256()
	/Users/m/dev/go/nomorestack/main.go:62 +0x86
main.f128()
	/Users/m/dev/go/nomorestack/main.go:49 +0x83
main.f64()
	/Users/m/dev/go/nomorestack/main.go:36 +0x80
main.f32()
	/Users/m/dev/go/nomorestack/main.go:23 +0x5e
created by main.main
	/Users/m/dev/go/nomorestack/main.go:8 +0x35
runtime: newstack: 8192 -> 16384
main.f2048()
	/Users/m/dev/go/nomorestack/main.go:101 +0x81
main.f1024()
	/Users/m/dev/go/nomorestack/main.go:88 +0x91
main.f512()
	/Users/m/dev/go/nomorestack/main.go:75 +0x86
main.f256()
	/Users/m/dev/go/nomorestack/main.go:62 +0x86
main.f128()
	/Users/m/dev/go/nomorestack/main.go:49 +0x83
main.f64()
	/Users/m/dev/go/nomorestack/main.go:36 +0x80
main.f32()
	/Users/m/dev/go/nomorestack/main.go:23 +0x5e
created by main.main
	/Users/m/dev/go/nomorestack/main.go:8 +0x35
runtime: newstack: 16384 -> 32768
main.f4096()
	/Users/m/dev/go/nomorestack/main.go:114 +0x97
main.f2048()
	/Users/m/dev/go/nomorestack/main.go:101 +0x81
main.f1024()
	/Users/m/dev/go/nomorestack/main.go:88 +0x91
main.f512()
	/Users/m/dev/go/nomorestack/main.go:75 +0x86
main.f256()
	/Users/m/dev/go/nomorestack/main.go:62 +0x86
main.f128()
	/Users/m/dev/go/nomorestack/main.go:49 +0x83
main.f64()
	/Users/m/dev/go/nomorestack/main.go:36 +0x80
main.f32()
	/Users/m/dev/go/nomorestack/main.go:23 +0x5e
created by main.main
	/Users/m/dev/go/nomorestack/main.go:8 +0x35
runtime: newstack: 32768 -> 65536
main.f8192()
	/Users/m/dev/go/nomorestack/main.go:127 +0x97
main.f4096()
	/Users/m/dev/go/nomorestack/main.go:114 +0x97
main.f2048()
	/Users/m/dev/go/nomorestack/main.go:101 +0x81
main.f1024()
	/Users/m/dev/go/nomorestack/main.go:88 +0x91
main.f512()
	/Users/m/dev/go/nomorestack/main.go:75 +0x86
main.f256()
	/Users/m/dev/go/nomorestack/main.go:62 +0x86
main.f128()
	/Users/m/dev/go/nomorestack/main.go:49 +0x83
main.f64()
	/Users/m/dev/go/nomorestack/main.go:36 +0x80
main.f32()
	/Users/m/dev/go/nomorestack/main.go:23 +0x5e
created by main.main
	/Users/m/dev/go/nomorestack/main.go:8 +0x35
done
exit

Unfortunately, I don't have any idea of how much fixing this would help in practice.

@gopherbot commented Mar 27, 2020

Change https://golang.org/cl/225800 mentions this issue: runtime: grow stack more than 2x if the new frame is large

@marcogrecopriolo commented May 6, 2020

Is there any way https://golang.org/cl/225800 could be merged in 1.14.x?

Between pooling goroutines and growing the stack with a dummy function call, the second is the better option (there is too much contention operating the goroutine queues). However, guessing the stack size is a bit of a dark art: too small and you achieve nothing, too big and you get a huge bottleneck in stackcacherefill().

If we could allocate a nice round number in one go, without having to worry about whether it is slightly too big, it would make our lives easier: at least we wouldn't have to experiment to find the right stack size.

Also, if you want numbers from a real-life high-throughput service: untreated, this issue is eating up 12% of our overall CPU time.
Hacking the stack grows our throughput by 6% and still needlessly uses 6% CPU time.
Pooling goroutines on a partitioned queue (as many fragments as cores) removes the newstack() CPU load, but the contention in managing the queues is such that we improve throughput by 1% or thereabouts.

Starting the goroutine with a larger stack size would be best.

@aclements (Member) commented May 6, 2020

@marcogrecopriolo, I'm not sure I quite understand:

Hacking the stack grows our throughput by 6% and still needlessly uses 6% CPU time.

Do you mean that CL 225800 increases your throughput by 6%, or something else?

Also - if you want numbers from a real life high throughput service: untreated, this is issue is eating up 12% of our overall CPU time.

That is interesting. Can you tell from profile stacks that this is definitely happening right after new goroutines start?

Starting the goroutine with a larger stack size would be best.

This is a very tricky trade-off, since many systems that create large numbers of goroutines benefit from the small starting stacks. There's no simple answer here, I'm afraid. :( CL 225800 has basically no downside, so I would love to know if it actually does help.

@marcogrecopriolo commented May 6, 2020

What I mean is that if I use code like this to fire my goroutine:

// MB-38469 / go issue 18138 initial goroutine stack too small
//go:noinline
func primeStack() {
	const _STACK_BUF_SIZE = 512 // 128 multiples
	var buf [_STACK_BUF_SIZE]int64

	// force the compiler to allocate buf
	for i := 127; i < _STACK_BUF_SIZE; i += 128 {
		buf[i] = int64(i)
	}

	_ = stackTop(buf[_STACK_BUF_SIZE-1])
}

//go:noinline
func stackTop(v int64) int64 {
	return v
}

func execOp(op Operator, context *Context, parent value.Value) {
	primeStack()
	op.RunOnce(context, parent)
}

// fork operator
func (this *base) fork(op Operator, context *Context, parent value.Value) {
	if op.getBase().inline {
		this.switchPhase(_NOTIME)
		op.RunOnce(context, parent)
		this.switchPhase(_EXECTIME)
	} else {
		go execOp(op, context, parent)
		// go op.RunOnce(context, parent)
	}
}

(go op.RunOnce(context, parent) is how I would normally call it), and size STACK_BUF_SIZE roughly right, I get better throughput and lower the cost of newstack().
In my case 512 is giving me +6% throughput and half the cost in newstack().

If I get STACK_BUF_SIZE wrong, I get a huge bottleneck in stackcacherefill() and a drop in throughput. In my case, STACK_BUF_SIZE set to 1024 gets me a 10% throughput drop and newstack() taking twice as much CPU time, mostly spent in stackcacherefill() piling up on some lock.

I would need to spend more time finding the optimal value for STACK_BUF_SIZE (I know it should be lower than 512), but it takes me more than an hour to run my performance test rig, and even then, somebody might actually increase their stack need down the line and render my testing useless.

If I could have 225800 merged, then I wouldn't need to worry about contention in stackcacherefill(), I'd just size my stack to a reasonable amount and never worry about it ever again.

As a side note, I know we want to avoid compiler directives, but all of the above is just a poor man's way to say

//go:stack=4196
go op.RunOnce(context, parent)

Which would be cleaner.
That would give the option of creating goroutines with larger stacks, where specifically required and when growing stacks is known to be a cost, and systems that create large numbers of goroutines would still benefit from small stacks as usual.

@marcogrecopriolo commented May 6, 2020

Sorry, I should clarify: from extensive profiling I know that newstack() is only called in primeStack(), so yes, it only happens at goroutine starts.

gopherbot pushed a commit that referenced this issue May 7, 2020
We might as well grow the stack at least as large as we'll need for
the frame that is calling morestack. It doesn't help with the
lots-of-small-frames case, but it may help a bit with the
few-big-frames case.

Update #18138

Change-Id: I1f49c97706a70e20b30433cbec99a7901528ea52
Reviewed-on: https://go-review.googlesource.com/c/go/+/225800
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
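
A rough sketch of the sizing rule this commit describes (an illustrative approximation, not the runtime's actual code): keep doubling the stack, but don't stop until the frame that triggered morestack also fits.

// nextStackSize approximates the idea in CL 225800: grow by at least 2x,
// and by more if the calling frame is larger than the space a single
// doubling would add. The exact bookkeeping in the runtime differs.
func nextStackSize(oldSize, frameSize uintptr) uintptr {
	newSize := oldSize * 2
	for newSize-oldSize < frameSize {
		newSize *= 2
	}
	return newSize
}

With a 2 KB stack and a 20 KB frame, this jumps straight to a 32 KB stack instead of stepping through 4, 8, and 16 KB first.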

@roudkerk (Contributor) commented Jul 27, 2021

Rather than a compiler directive, would it be possible to add functions to the runtime package

GoWithStackHint(f func(), stackSize int)
StackSize() int

The first could be used to start goroutines with statically determined stack size, for example

runtime.GoWithStackHint(func() { op.RunOnce(context, parent) }, 4096)

But it would also be possible to write a helper that dynamically chooses stack size based on the history of final stack sizes for goroutines created with that helper instance. For example

var runner = smartstacksize.New()

...

  runner.Go(func() { op.RunOnce(context, parent) })
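
A sketch of what such a helper could look like, assuming the two proposed runtime functions above existed (they do not today, so this is purely illustrative): the Runner remembers the largest final stack size it has observed and passes it as the hint for subsequent goroutines.

package smartstacksize

import (
	"runtime"
	"sync/atomic"
)

// Runner starts goroutines with a stack-size hint learned from the final
// stack sizes of earlier goroutines it started.
type Runner struct {
	hint int64 // current hint in bytes
}

func New() *Runner {
	return &Runner{hint: 2048} // start from the default 2 KB stack
}

func (r *Runner) Go(f func()) {
	hint := atomic.LoadInt64(&r.hint)
	// runtime.GoWithStackHint and runtime.StackSize are the proposed API
	// above, not functions that exist in the runtime today.
	runtime.GoWithStackHint(func() {
		f()
		if final := int64(runtime.StackSize()); final > atomic.LoadInt64(&r.hint) {
			atomic.StoreInt64(&r.hint, final)
		}
	}, int(hint))
}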

CAFxX added a commit to CAFxX/go that referenced this issue Aug 13, 2021
Goroutines can spend significant time in morestack/newstack if the dynamic
call tree is large, or if a specific call tree makes heavy use of the stack
to allocate a significant amount of space for temporary variables.

This CL adds a simple stack size predictor based on the pc of the go statement
that starts each goroutine. This approach is predicated on the assumption that
a specific go statement in the program will mostly result in the goroutine
executing the same dynamic call tree (more precisely, dynamic call trees with
similar stack sizing requirements).

The way it works is by embedding in each P a small prediction table with 32 slots.
When a go statement is executed, the pc of the go statement is hashed with a per-P
seed to pick one of the slots. Each slot is a single byte containing a simple
running estimator. The result of the estimation is _StackMin << n, with n in the
range 0 to 15 (i.e. 2KB to 64MB), and it is used to allocate a stack of the
appropriate size. In newstack, called from morestack when we need to allocate a
new stack, we record the highest stack size (highwater) used by each goroutine
(this highwater is stored in the g struct, but thanks to existing padding this
additional 1-byte field does not cause the g struct to increase in size). When a
goroutine exits, the estimator for the pc of the statement that started that
goroutine is updated using the highwater value recorded by the exiting goroutine.

The current estimation scheme is not precise for multiple reasons.
First, multiple pcs could map to the same slot in the per-P table. This is not a
significant problem if we assume that the conflicting pcs are not executed with
the same frequency, as in this case the estimator will still converge, albeit more
slowly, to the correct value. Furthermore, each P uses a different seed when
hashing the pc, and the seeds themselves are periodically reset (currently, as
a result of GC, although this is not the only available option).
Second, the highwater mechanism partially relies on stackguard0; the current
runtime sometimes mangles stackguard0 (e.g. when a goroutine needs to be
preempted), and this can lead to the highwater of a goroutine being lower than it
should have been. This also should not lead to problems, apart from the estimator
taking longer to converge to the true value.

The stack size prediction mechanism is disabled by default, and can be enabled by
setting GOEXPERIMENT=predictstacksize.

This CL is currently in a PoC state. It passes all tests locally, and shows
significant promise in the included benchmark, where enabling stack size
prediction leads to a doubling of the performance for medium-large stack sizes.
It is still missing tests for the new feature pending discussion about the
proposed approach.

DO NOT SUBMIT

Fixes golang#18138

Change-Id: Id2f617e39bbd7ed969d35e1f231ab61c207fa572
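
The prediction table described above can be sketched in isolation as follows. This is a standalone illustration of the shape of the mechanism, not the CL's code: only one table is shown rather than one per P, and the update rule (nudge the stored exponent one step toward the observed one) is an assumed simplification.

const (
	stackMin  = 2 << 10 // mirrors _StackMin (2 KB)
	tableSize = 32
)

type predictor struct {
	seed  uint64
	slots [tableSize]uint8 // exponent n; the predicted size is stackMin << n
}

func (p *predictor) slot(pc uintptr) *uint8 {
	h := (uint64(pc) ^ p.seed) * 0x9e3779b97f4a7c15 // simple multiplicative hash
	return &p.slots[h>>59]                          // top 5 bits select one of 32 slots
}

// predict returns the stack size to use for a goroutine started at pc.
func (p *predictor) predict(pc uintptr) int {
	return stackMin << *p.slot(pc)
}

// update records that a goroutine started at pc needed highwater bytes of
// stack at its peak, and nudges that slot's estimate toward it.
func (p *predictor) update(pc uintptr, highwater int) {
	var want uint8
	for want < 15 && stackMin<<want < highwater {
		want++
	}
	s := p.slot(pc)
	switch {
	case want > *s:
		*s++
	case want < *s:
		*s--
	}
}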

@gopherbot commented Aug 13, 2021

Change https://golang.org/cl/341990 mentions this issue: runtime: predict stack sizing

@CAFxX (Contributor) commented Aug 18, 2021

Forgot to mention it back here, but I have a CL up for early review that basically implements something similar to what @aclements suggested. It uses a fairly conservative approach, so in this early incarnation it may still leave a bit of performance on the table, but in the tests so far it seems to work well enough in practice to be useful. If anyone wants to run their own benchmarks and report back, that would be great (you need to build from that CL and then set GOEXPERIMENT=predictstacksize). The upside is that it requires no knobs, annotations, or code changes.

@gopherbot commented Aug 28, 2021

Change https://golang.org/cl/345889 mentions this issue: runtime: measure stack usage; start stacks larger if needed

@randall77 (Contributor) commented Aug 28, 2021

I wrote up an idea I had about starting goroutines with a larger-than-minimum stack size, for this issue. The doc is here.
The immediate impetus for this doc was another attempt to fix that issue in CL 341990, but generally these ideas have been sloshing around my head for a while.
Comments welcome. I have a first stab at an implementation in CL 345889.

@go101 commented Oct 2, 2021

Improve on @uluyol's idea by setting the initial stack size of a new goroutine to any 2^n size:

func startRoutine() {
	// Use a dummy anonymous function to enlarge stack.
	func(x *interface{}) {
		type _ int // avoid being inlined
		if x != nil {
			*x = [128 << 20]byte{} // initial 256M stack
		}
	}(nil)
	
	// ... do work load
}

[update]: a demo: https://play.golang.org/p/r3t_OXxTvt7

@go101 commented Oct 2, 2021

Rather than a compiler directive, would it be possible to add functions to the runtime package

It would be great to add a runtime/debug.SetCurrentGoroutineOption function:

type GoroutineOption int
func SetCurrentGoroutineOption(key GoroutineOption, value int) {...}

const (
	StackSizeOfTheNextSpawndGoroutine GoroutineOption = iota
	PriorityCasesInTheNextSelectBlock
	...
)
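
Hypothetical usage, if such an API were added (neither the function nor the option constant exists in the runtime today):

// Ask the runtime to give the next goroutine spawned from this one a 32 KB stack.
debug.SetCurrentGoroutineOption(debug.StackSizeOfTheNextSpawndGoroutine, 32<<10)
go op.RunOnce(context, parent)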
