runtime: new goroutines can spend excessive time in morestack #18138
Comments
/cc @aclements @randall77 |
We've seen this a few times now. I'm not sure what the right answer is. My best thought so far is that the runtime could keep track of when particular go statements always lead to stack growth right away (for some value of "right away" and "always") and learn to start goroutines from that site with a larger stack. Of course, it would be hard to make this behavior predictable, but perhaps it would still be less surprising than the current behavior. If the runtime did learn to start a goroutine with a larger stack, it would still need a signal to learn if the stack should get smaller again, but we could do that efficiently by allocating the larger stack but setting the stack bounds to something smaller. Then the runtime could still observe whether or not the stack needs to grow, but the actual growth would be basically free until it reached the size of the allocation. @randall77, thoughts, ideas? /cc @RLH |
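To illustrate that last trick, here is a hypothetical sketch of "allocate big, advertise small bounds": growth that stays within the real allocation just widens the bounds, with no copying. All names and fields here are invented for illustration; they are not the runtime's actual data structures.

```go
// The stack grows downward from hi toward lo; function prologues check
// against the advertised bounds, not the underlying allocation.
type stackSketch struct {
	lo, hi  uintptr // advertised bounds checked on function entry
	allocLo uintptr // true bottom of the allocation, allocLo <= lo
}

// growInPlace handles a bounds overflow by widening the advertised bounds,
// as long as the requested size still fits the underlying allocation.
// The runtime observes that growth was needed, but pays no copy cost.
func growInPlace(s *stackSketch, need uintptr) bool {
	if s.hi-need >= s.allocLo {
		s.lo = s.hi - need // record the new, larger advertised size
		return true
	}
	return false // exceeded the allocation: fall back to copy-and-grow
}
```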
I like @petermattis's idea of being able to hint stack size on a per-goroutine basis, although this implies developers have the know-how to identify and provide size estimates accurately. Could this be done with a compiler directive? |
We don't want compiler directives in code. We have some used by the runtime out of necessity, but they're gross. Go prefers simplicity over tons of knobs. |
Yes, please just make my code magically faster as you've been doing for the last several Go releases. |
I generally agree with not having compiler directives ... magic is nice, although they (compiler directives) do exist even in Go. It's an interesting opportunity either way you decide. |
@bradfitz, your comment prompted me to look for the Go guiding principles (https://golang.org/doc/faq#principles). Thanks @adg as well for the nicely worded principles. |
CL https://golang.org/cl/45142 mentions this issue. |
@petermattis (or anyone who has a good reproducer for this), would you be able to try https://go-review.googlesource.com/45142? It's a trivial hack, but it might actually do the trick. I haven't benchmarked it on anything, so it may also slow things down. |
@aclements I'll try and test either tomorrow or next week. |
@aclements Applying that patch to go1.8.3 resulted in no benefit (this is with the …).

Surprising to me that this didn't have any effect. Compare this to the …. |
CL 43150 might help a little here. |
Sorry, I made a silly mistake in CL 45142. Would you mind trying the new version of that CL? |
With your updated patch against go-tip (f363817) there is an improvement: …

But the improvement is still not as good as the ….

There is a little more performance if we increase the initial stack size to 32 KB: …

Interestingly, all of these timings are lower than with go1.8.3. |
Nothing to see here. This appears to be due to a change in our code between what I tested earlier today and now. |
I did some more testing of this patch and the performance improvements carry over to production settings. |
It is early in the 1.10 cycle and I wanted to bring this issue forward again. See cockroachdb/cockroach#17242 for a graph showing the benefit of a larger initial stack size. |
Is there any update on this issue? A larger initial goroutine stack size provides a nice performance boost for our system. |
Change https://golang.org/cl/341990 mentions this issue: |
Goroutines can spend significant time in morestack/newstack if the dynamic call tree is large, or if a specific call tree makes heavy use of the stack to allocate significant amounts of space for temporary variables. This CL adds a simple stack size predictor based on the pc of the go statement that starts each goroutine. This approach is predicated on the assumption that a specific go statement in the program will mostly result in the resulting goroutine executing the same dynamic call tree (more precisely, dynamic call trees with similar stack sizing requirements).

It works by embedding in each P a small prediction table with 32 slots. When a go statement is executed, the pc of the go statement is hashed with a per-P seed to pick one of the slots. Each slot is a single byte containing a simple running estimator. The result of the estimation is _StackMin << n, with n in the range 0 to 15 (i.e. 2KB to 64MB), and it is used to allocate a stack of the appropriate size. In newstack, called from morestack when we need to allocate a new stack, we record the highest stack size used by each goroutine (the highwater mark is stored in the g struct, but thanks to existing padding this additional 1-byte field does not cause the g struct to increase in size). When a goroutine exits, the estimator for the pc of the statement that started that goroutine is updated using the highwater value recorded by the exiting goroutine.

The current estimation scheme is not precise, for multiple reasons. First, multiple pcs could map to the same slot in the per-P table. This is not a significant problem if we assume that the conflicting pcs are not executed with the same frequency: in this case the estimator will still converge, albeit more slowly, to the correct value. Furthermore, each P uses a different seed when hashing the pc, and the seeds themselves are periodically reset (currently, as a result of GC, although this is not the only available option). Second, the highwater mechanism partially relies on stackguard0; the current runtime sometimes mangles stackguard0 (e.g. when a goroutine needs to be preempted), and this can lead to the recorded highwater of a goroutine being lower than it should have been. This also should not cause problems, apart from the estimator taking longer to converge to the true value.

The stack size prediction mechanism is disabled by default, and can be enabled by setting GOEXPERIMENT=predictstacksize. The plan is to eventually enable it by default, and later remove the experiment altogether.

This CL is currently in a PoC state. It passes all tests locally, and shows significant promise in the included benchmark, where enabling stack size prediction leads to a doubling of the performance for medium-large stack sizes. It is still missing tests for the new feature, pending discussion about the proposed approach.

DO NOT SUBMIT

Fixes golang#18138
Change-Id: Id2f617e39bbd7ed969d35e1f231ab61c207fa572
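To make the mechanism concrete, here is a condensed sketch of such a per-P predictor; the names and the update rule are illustrative, not the CL's actual code.

```go
const (
	stackMin  = 2 << 10 // 2 KB minimum stack
	slotCount = 32      // prediction slots per P
	maxShift  = 15      // stackMin << 15 = 64 MB cap
)

type stackPredictor struct {
	seed  uintptr          // per-P, periodically reset
	slots [slotCount]uint8 // running estimates, stored as shift amounts
}

func (p *stackPredictor) slot(pc uintptr) *uint8 {
	return &p.slots[(pc^p.seed)%slotCount]
}

// predict picks the initial stack size for a goroutine started at pc.
func (p *stackPredictor) predict(pc uintptr) uintptr {
	return uintptr(stackMin) << *p.slot(pc)
}

// update nudges the estimate toward the highwater shift recorded by a
// goroutine (started at pc) when it exits.
func (p *stackPredictor) update(pc uintptr, highwater uint8) {
	s := p.slot(pc)
	switch {
	case highwater > *s && *s < maxShift:
		*s++ // observed usage above estimate: grow it
	case highwater < *s:
		*s-- // decay, so colliding or shrinking sites converge back down
	}
}
```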
Forgot to mention it back here, but I have a CL up for early review that basically implements something similar to what @aclements suggested. It uses a fairly conservative approach, so in this early incarnation it may still leave a bit of performance on the table, but in the tests so far it seems to work well enough in practice to be useful. It would be great if anyone could run their own benchmarks and report back (you need to build from that CL and then set `GOEXPERIMENT=predictstacksize`). |
Change https://golang.org/cl/345889 mentions this issue: |
I wrote up an idea I had about starting goroutines with a larger-than-minimum stack size, for this issue. The doc is here. |
Improving on @uluyol's idea: the initial stack size of a new goroutine can be set to any 2^n size by forcing a large stack frame when it starts:

```go
func startRoutine() {
	// Use a dummy anonymous function to enlarge the stack.
	func(x *interface{}) {
		type _ int // avoid being inlined
		if x != nil {
			*x = [128 << 20]byte{} // initial 256M stack
		}
	}(nil)
	// ... do the workload
}
```

[update] A demo: https://play.golang.org/p/r3t_OXxTvt7 |
It would be great to add a

```go
type GoroutineOption int

func SetCurrentGoroutineOption(key GoroutineOption, value int) {...}

const (
	StackSizeOfTheNextSpawndGoroutine GoroutineOption = iota
	PriorityCasesInTheNextSelectBlock
	...
)
```
|
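Hypothetical usage of such an API, purely to illustrate the proposal (none of these functions exist in the standard library, and `serveRequest` is a stand-in):

```go
func handler() {
	// Hint that the next spawned goroutine should start with a 64 KB stack.
	SetCurrentGoroutineOption(StackSizeOfTheNextSpawndGoroutine, 64<<10)
	go serveRequest() // starts with the hinted stack size
}
```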
It would be good if Go would support tail call elimination. Then at least in some cases, the stack wouldn't grow as much as it does now. |
A straightforward implementation of tail call elimination would lead to difficulties with …. See also #22624. |
"Proper" tail recursion requires the full removal of the caller's stack frame when a function is call in tail position. If that is too extreme, maybe it's possible to only remove the part of the stack frame that is not needed for |
See #36067 |
It looks like a new implementation will be adopted in Go 1.19.
Yes. Of course, it matters what exactly you mean by "efficient". The new behavior uses a bit more space for a bit less CPU. Starting new goroutines with the average stack size will waste at most a factor of 2 in space (in addition to the at-most factor of 2 we already waste by growing stacks using power-of-two sizes). In exchange, we get less work on goroutine startup. It is conceivable that the tradeoff is not advantageous for some programs. I suspect it will be a net win for most. I'm curious if anyone has an example where it is not the right thing to do. |
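For anyone wanting to observe the space half of that tradeoff, the runtime/metrics package exposes a stack-memory gauge; a minimal probe (illustrative, not from this thread) might look like:

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	// Memory reserved for goroutine stacks; sampling this under load on
	// different Go versions shows the space cost of larger initial stacks.
	s := []metrics.Sample{{Name: "/memory/classes/heap/stacks:bytes"}}
	metrics.Read(s)
	fmt.Println("stack memory (bytes):", s[0].Value.Uint64())
}
```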
I'm a little worried that this change will introduce more unpredictable/random factors into Go program performance. |
Is this still a problem after Go 1.19?
I have not heard any complaints. |
FWIW, historic average stack usage makes absolutely no difference to Couchbase's N1QL engine. For us, if a compiler directive is not acceptable, at least having some compiler analysis that determines the initial stack size for a goroutine (you have sample code that you can use for testing from myself and other contributors) would work much better than the status quo. |
FWIW I believe average stack sizes are not a great approach; stack sizes based on the initial PC would have been much better. In a classical server example, where each request is serviced by one or more goroutines, the stack size is not a random occurrence, but a function of the work done by the individual goroutine. In our case, all execution operators use a lot of stack, because they do the bulk of the work; everything else does not. To put this in context, on my performance testing cluster we serve about 200K requests a second and spawn 5 execution operators, which end up having to grow the stack about 1M times a second.

Also for context: Informix's Online Dynamic Server, vintage 1996, used exactly the same multithreaded model adopted by golang (we had to code the context switches by hand though), and exactly the same execution approach as Couchbase's N1QL.

I have tried goroutine pooling, but it is quite the kerfuffle, and doesn't work as well as manually forcing an initial stack growth. |
I agree that average stack size isn't likely to work well for many situations.

For the RPC servers I have looked at, for each request the RPC framework spawns one or two trivial background goroutines which will drag down the average size. So the main goroutine for each request still needs to resize its stack.

I wonder whether PGO could be taught to smartly select initial stack sizes: https://go.dev/doc/pgo. But adapting at runtime would certainly be more convenient. |
Just chiming in that I'm still working around this issue in my performance-sensitive applications manually, for the same reasons mentioned above: an average over the entire application is not reflective of the needs of my critical path. |
What version of Go are you using (`go version`)?

go version devel +41908a5 Thu Dec 1 02:54:21 2016 +0000 darwin/amd64, a.k.a. go1.8beta1

What operating system and processor architecture are you using (`go env`)?

What did you do?

A recent change to github.com/cockroachdb/cockroach replaced a synchronous call with one wrapped in a goroutine. This small change resulted in a significant slowdown in some benchmarks. The slowdown was traced to additional time being spent in `runtime.morestack`. The problematic goroutines are all hitting a single gRPC entrypoint `Server.Batch`, and the code paths that fan out from this entrypoint tend to use an excessive amount of stack due to an over-reliance on passing and returning by value instead of using pointers. Typical calls use 16-32 KB of stack.

The expensive part of `runtime.morestack` is the adjustment of existing pointers on the stack, and due to the incremental nature of the stack growth, I can see the stack growing in 4 steps from 2 KB to 32 KB. So we experimented with a hack to pre-grow the stack. Voila, the performance penalty of the change disappeared: `old` are benchmarks gathered using go1.8beta1 and `new` are on go1.8beta1 with the hack to pre-grow the stack. The hack is a call at the beginning of `server.Batch` to a `growStack` method:
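The snippet itself was not captured above; a minimal sketch of the pattern the report describes might look like the following (the 16 KB size and exact body are assumptions, not the verbatim CockroachDB code):

```go
var growStackGlobal = false

//go:noinline
func growStack() {
	// A large local forces the stack to grow to the target size in one
	// step, instead of several incremental copy-and-adjust steps.
	var buf [16 << 10]byte
	// Reference buf behind a global flag so the compiler cannot prove it
	// dead and elide the stack allocation.
	if growStackGlobal {
		for i := range buf {
			buf[i] = byte(i)
		}
	}
}
```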
The question here is whether this is copacetic, and also to alert the runtime folks that there is a performance opportunity here. Note that `growStackGlobal` is not currently necessary, but I wanted to future-proof against the compiler deciding that `buf` is not necessary.

Longer term, the stack usage under `server.Batch` should be reduced on our side. I'm guessing that we could get the stack usage down to 8-16 KB without too many contortions. But even with such reductions, being able to pre-grow the stack for a goroutine looks beneficial.