Background
The following pattern appears fairly frequently in our code and in others':
goroutine1:
    ch1 <- data      // (1)
    result = <-ch2   // (2)
goroutine2:
    data = <-ch1     // (3)
    // do work...
    ch2 <- result    // (4)
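For reference, here is a minimal runnable sketch of the same pattern. The function name roundTrips, the int payload, and the "double it" work are placeholders chosen for illustration; only the send/receive ordering (1)-(4) comes from the pattern above.

```go
package main

import "fmt"

// roundTrips sends n values to a worker goroutine and waits for each reply,
// mirroring steps (1)-(4) of the pattern. It returns the sum of the replies.
// The name and the doubling "work" are illustrative assumptions.
func roundTrips(n int) int {
	ch1 := make(chan int) // requests: goroutine1 -> goroutine2
	ch2 := make(chan int) // results:  goroutine2 -> goroutine1

	// goroutine2: receive, do work, send the result back.
	go func() {
		for data := range ch1 { // (3)
			ch2 <- data * 2 // do work, then (4)
		}
	}()

	// goroutine1: a blocking send immediately followed by a blocking receive.
	sum := 0
	for i := 1; i <= n; i++ {
		ch1 <- i     // (1)
		sum += <-ch2 // (2)
	}
	close(ch1) // lets goroutine2 exit its range loop
	return sum
}

func main() {
	fmt.Println(roundTrips(1000))
}
```

Each loop iteration is one full round trip, so a tight loop like this is the kind of workload where the wake-up cost described below is paid on every iteration.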
The scheduler exhibits two different behaviors, depending on whether goroutine2 is busy and there are available Ps.
- If goroutine2 is busy or there are no idle Ps, then the behavior is fine. The item will be enqueued in the channel, goroutine2 is marked as runnable if needed, and eventually goroutine1 will yield.
- If goroutine2 is not busy and there are idle Ps, then the behavior is sub-optimal. The operation in (1) will mark goroutine2 as runnable, and wake up some idle P via a relatively expensive system call [1]. Ultimately the wake will likely result in an IPI to wake an idle core, if there are any. The next P will be scheduled and a race between (2) and (3) ensues.
In the second case, if the woken P successfully steals the now-runnable goroutine2, i.e. (3) happens first, then goroutine2 will start executing on the new P. If the P wakes but does not successfully steal goroutine2, i.e. (4) happens first and goroutine2 is run locally, then a large number of cycles are wasted on the wake. Either way, this dance happens again with the result. In both cases, we spend a large number of cycles and interprocessor coordination costs on what should be a goroutine context switch.
There are further problems caused by this: it introduces unnecessary work stealing and bounces goroutines between system threads and cores, leading to locality inefficiencies.
Ideal schedule
With an oracle, the ideal schedule after (1) would be:
- If goroutine2 is running or there are no idle Ps, enqueue only (current behavior).
- If goroutine1 will not block, or its P has other goroutines in its local runqueue, wake idle Ps (current behavior).
- If goroutine1 will block immediately, and there are no other goroutines in its P's local runqueue, do not wake up any other Ps. goroutine2 will be executed by the current P immediately after goroutine1 blocks.
In essence, we want to yield goroutine1's time to goroutine2 in this case, or at least avoid all the wasted signaling overhead. To put it another way: if goroutine1's P is about to run out of work because goroutine1 blocks, then that P fills the role of the "idle P" far more efficiently.
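The three cases above can be written down as a small decision function. This is only a sketch of the oracle's choice, not scheduler code: the type and field names (wakeDecision, oracleState, and so on) are invented for illustration, and the inputs are exactly the facts a real scheduler does not have at the time of the send.

```go
package main

import "fmt"

// wakeDecision is a hypothetical name for what the oracle chooses after (1).
type wakeDecision int

const (
	enqueueOnly   wakeDecision = iota // make goroutine2 runnable, nothing else
	wakeIdleP                         // make goroutine2 runnable and wake an idle P
	runOnCurrentP                     // no wake; the current P runs goroutine2 next
)

// oracleState holds facts only an oracle would know when the send happens.
// All fields are illustrative assumptions, not runtime state.
type oracleState struct {
	receiverRunning bool // goroutine2 is currently executing
	idlePs          int  // number of idle Ps in the system
	senderWillBlock bool // goroutine1 blocks immediately after the send
	localRunqLen    int  // other goroutines queued on the sender's P
}

// idealSchedule applies the three rules in the order they are listed.
func idealSchedule(s oracleState) wakeDecision {
	if s.receiverRunning || s.idlePs == 0 {
		return enqueueOnly // current behavior
	}
	if !s.senderWillBlock || s.localRunqLen > 0 {
		return wakeIdleP // current behavior
	}
	// The sender blocks and its P has nothing else to do, so that P
	// can pick up goroutine2 without any cross-core signaling.
	return runOnCurrentP
}

func main() {
	s := oracleState{idlePs: 3, senderWillBlock: true}
	fmt.Println(idealSchedule(s) == runOnCurrentP)
}
```

Only the last branch differs from what happens today; the proposal below is an attempt to approximate senderWillBlock at compile time.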
Proposal
It may be possible to specifically optimize for this case in the compiler, just as certain loop patterns are optimized.
In the case where a blocking channel send is immediately followed by a blocking channel receive, I propose an optimization that tries to avoid these scheduler round trips.
Here's a rough sketch of the idea:
- runqput returns a bool that indicates whether the newly placed G is the only item on the queue. (Alternatively we could just check the runq length below.)
- goready takes an additional parameter "deferwake" which skips the wake operation if true. By default this will be false everywhere, which implements current behavior.
- chansend accepts a similar "deferwake" parameter. This is plumbed through to send, and will be AND'ed with the result of runqput. The deferwake parameter will be passed as true if the compiler detects a blocking receive immediately following the blocking send statement (or possibly in the same block, see below).
- chanrecv also accepts a "deferwake" parameter, which will be set to true only when preceded by a call to chansend with deferwake also set to true. If this is true AND the current goroutine will not yield as a result of the recv AND the current runqueue length > 0 AND there are idle Ps, at this point we can call wakeup.
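To make the plumbing concrete, here is a toy model of the sketch above. It is emphatically not the runtime: toyP and its methods only borrow the names runqput, goready, chansend and chanrecv, their signatures are invented, goroutines are plain strings, and the AND with the result of runqput is folded into goready for brevity. It only counts how many wake operations one send+receive round trip would issue.

```go
package main

import "fmt"

// toyP is a toy stand-in for a scheduler P (an assumption for illustration).
type toyP struct {
	runq   []string // local run queue, goroutines identified by name
	idlePs int      // idle Ps elsewhere in the system
	wakes  int      // number of (expensive) wake operations issued
}

// runqput enqueues g and reports whether it is the only item on the queue.
func (p *toyP) runqput(g string) bool {
	p.runq = append(p.runq, g)
	return len(p.runq) == 1
}

// wake models waking an idle P; it is a no-op if none are idle.
func (p *toyP) wake() {
	if p.idlePs > 0 {
		p.wakes++
	}
}

// goready makes g runnable. The wake is skipped only when deferwake is
// set AND g is the sole entry on the local queue.
func (p *toyP) goready(g string, deferwake bool) {
	only := p.runqput(g)
	if deferwake && only {
		return
	}
	p.wake()
}

// chansend models a send that readies a waiting receiver.
func (p *toyP) chansend(receiver string, deferwake bool) {
	p.goready(receiver, deferwake)
}

// chanrecv models the receive that follows the send. willBlock says
// whether the current goroutine will yield as a result of the receive.
func (p *toyP) chanrecv(willBlock, deferwake bool) {
	if deferwake && !willBlock && len(p.runq) > 0 && p.idlePs > 0 {
		p.wake() // the guess was wrong: issue the deferred wake now
		return
	}
	if willBlock && len(p.runq) > 0 {
		p.runq = p.runq[1:] // the current P runs the next local goroutine
	}
}

// roundTripWakes runs one send+receive on a P with an empty queue and
// one idle P elsewhere, and returns the number of wakes issued.
func roundTripWakes(deferwake, recvWillBlock bool) int {
	p := &toyP{idlePs: 1}
	p.chansend("goroutine2", deferwake)
	p.chanrecv(recvWillBlock, deferwake)
	return p.wakes
}

func main() {
	fmt.Println("current behavior:         ", roundTripWakes(false, true))
	fmt.Println("deferwake, receive blocks:", roundTripWakes(true, true))
	fmt.Println("deferwake, receive ready: ", roundTripWakes(true, false))
}
```

In this model the common case (the receive blocks) issues no wake at all, while the case where the receive does not block falls back to a single wake issued slightly later than it would be today.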
Rejected alternatives
I thought about this problem a few years ago when it caused issues. In the past, I considered the possibility of a different channel operator. Something like:
ch1 <~ data
This operator would write to the channel and immediately yield to the other goroutine, if it was not already running (otherwise would fall back to the existing channel behavior). Using this operator in the above situation would make it much more efficient in general.
However, this is a language change, and confusing to users. When do you use which operator? It would be good to have the effect of this optimization out of the box.
Extensions
- This optimization may apply to other kinds of wakes. I consider only channels today.
- The optimization could be extended to cases where a blocking channel receive appears after the blocking send in the same block, not necessarily as the subsequent statement.
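To illustrate the second extension, the two source shapes look roughly like this. The helper names (doubler, adjacent, sameBlock) and the intervening statement are made up for the example; whether the compiler could safely treat the second shape like the first would depend on how much work sits between the send and the receive.

```go
package main

import "fmt"

// doubler plays the role of goroutine2 (illustrative only).
func doubler(req <-chan int, resp chan<- int) {
	for v := range req {
		resp <- v * 2
	}
}

// adjacent is the shape the base proposal targets: the blocking receive
// is the statement immediately after the blocking send.
func adjacent(req chan<- int, resp <-chan int, v int) int {
	req <- v
	return <-resp
}

// sameBlock is the shape the extension would cover: the receive is in
// the same block as the send, but another statement comes in between.
func sameBlock(req chan<- int, resp <-chan int, v int) (result, local int) {
	req <- v
	local = v + 1 // unrelated work between the send and the receive
	result = <-resp
	return result, local
}

func main() {
	req, resp := make(chan int), make(chan int)
	go doubler(req, resp)

	fmt.Println(adjacent(req, resp, 3))
	result, local := sameBlock(req, resp, 5)
	fmt.Println(result, local)
	close(req)
}
```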
[1] https://github.com/golang/go/blob/master/src/runtime/proc.go#L665