New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: use parent goroutine's stack for new goroutines #27345

Open
josharian opened this Issue Aug 29, 2018 · 5 comments

Comments

Projects
None yet
6 participants
@josharian
Contributor

josharian commented Aug 29, 2018

This is a speculative idea from discussion with @aclements. Jotting it down so it doesn't get lost and to open discussion.

Spinning off many goroutines that grow their stack and then exit can be a source of performance pain. See e.g. #18138. One approach to fixing this is tracking and using statistics about stack growth behavior per spawn site.

Another possible approach is to:

  • switch immediately in the scheduler from parent goroutine to spawned goroutine
  • use the (free part of) the parent's stack as the spawn's stack
  • relocate the spawn's stack as needed

The idea here is that the parent goroutine's stack is likely to have plenty of free space, so if the spawn is short-lived, we can avoid doing any stack allocation for it at all, much less multiple stack growths.

If the spawn lives long enough that the parent goroutine starts running again, then this is worse. So it might be worth tracking some basic statistics on the fly to decide whether to do this.

@josharian josharian added this to the Unplanned milestone Aug 29, 2018

@aclements

This comment has been minimized.

Member

aclements commented Sep 6, 2018

One common pattern where this may be problematic is where a parent spawns a goroutine and blocks on receiving the result of that goroutine from an unbuffered channel. In the basic form of this design, this would always force a stack-rip. A few ways I can think to get around this:

  1. Encourage people to use a single-element channel instead. Not so great for existing code.
  2. Statically recognize this "send and exit" pattern and compile it into a combined operation that does them in the opposite order (the difference is unobservable). The value would still have to be buffered somewhere until the parent reached the channel receive. This may be tricky because there's no way to bound the amount of buffering necessary.
  3. Let the parent keep running until it blocks (like today), and then start running the new goroutine on the parent's stack. We'd need somewhere to stash the goroutine's argument frame. Here we could at least bound the amount of buffering for the argument frames, since we could always fall back to spawning a full goroutine.

For solution 2 or 3, we could always do something like _defer objects, which are manually allocated and freed by the runtime.

Cilk (which inspired this idea) structurally enforces something like solution 2, but the spawning code always has to pre-arrange a place to put the result.

@cherrymui

This comment has been minimized.

Contributor

cherrymui commented Sep 6, 2018

How about allocating the child goroutine's stack in the same allocation but not right next to the parent's stack? That is, suppose the parent has 100K allocated stack space, say (base, base+100K), and is currently using 60K. When it spawns the child, we can put the child's stack at, say, base+90K, and adjust the parent's stack bound. When either the parent's stack grows to 90K, or the child's stack grows to 10K, (or the parent fires up more goroutines which uses up the stack), relocate the stack.

This way, both goroutines are runnable (if not otherwise blocked), and the scheduler can choose whatever appropriate.

@CAFxX

This comment has been minimized.

Contributor

CAFxX commented Sep 7, 2018

When it spawns the child, we can put the child's stack at, say, base+90K, and adjust the parent's stack bound.

If and when #25999 (comment) materializes then this could even be done precisely by knowing how much stack space both parent and child actually require.

@RLH

This comment has been minimized.

Contributor

RLH commented Sep 9, 2018

@aclements

This comment has been minimized.

Member

aclements commented Sep 14, 2018

This would also bound the number of goroutines the scheduler would have to
manage as well as a lot of other nice properties discussed in the Cilk
papers. Go should preserve these bounds if we do this.

I think it's worth striving for similar bounds, but Cilk was able to achieve these bounds strictly only because it had a very strict fork/join model of parallelism, where the parent function couldn't even return until all of its child threads had returned/exited. Because of Go's ad hoc parallelism, the best we could do (which would still be quite good) is to say, if your program is structured like X, then it will have some space and time bounds.

Let the parent keep running until it blocks (like today), and then start running the new goroutine on the parent's stack. We'd need somewhere to stash the goroutine's argument frame. Here we could at least bound the amount of buffering for the argument frames, since we could always fall back to spawning a full goroutine.

@pjweinb pointed out to me today that we could switch to the child when the parent blocks or tries to start a second child goroutine. Bounding it to a single pending goroutine makes the bookkeeping much simpler and may go towards ensuring Cilk-like space bounds. We could also combine this with @cherrymui's idea, which would give us a place to put the pending argument frame: when the parent go's the child, we immediately assign most of the parent's remaining stack to the child, put the goroutine's argument frame in its section of the stack, shrink the parent's stack bounds, and let the parent keep running. When the parent blocks, tries to start another child, or encounters its stack bound, we switch to the child (without moving the parent's stack at that time).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment