runtime: optimization to reduce P churn #32113
The following is a fairly frequent pattern that appears in our code and others:
The scheduler exhibits two different behaviors, depending on whether goroutine2 is busy and there are available Ps.
In the second case, if the P wakes and successfully steals the now runnable goroutine2, i.e. (3) happens first, then it will start executing on the new P. Unfortunately, the whole dance will happen again with the result. If the P wakes but does not successfully steal the now runnable goroutine2, i.e. (4) happens first and goroutine2 is run locally, then a large number of cycles are wasted. Either way, this dance happens again with the result. In both cases, we spend a large number of cycles and interprocessor co-ordination costs for what should be a goroutine context switch.
These are further problems caused by this, as it will introduce unnecessary work stealing and bouncing of goroutines between system threads and cores. (Leading to locality inefficiencies.)
With an oracle, the ideal schedule after (1) would be:
In essence, we want to yield the goroutine1's time to goroutine2 in this case, or at least avoid all the wasted signaling overhead. To put it another way: if goroutine1's P will block, then it fills the role of the "idle P" far more efficiently.
It may be possible to specifically optimize for this case in the compiler, just as certain loop patterns are optimized.
In the case where a blocking channel send is immediately followed by a blocking channel receive, I propose an optimization that tries to avoid these scheduler round trips.
Here's a rough sketch of the idea:
I thought about this problem a few years ago when it caused issues. In the past, I considered the possibility of a different channel operator. Something like:
ch1 <~ data
This operator would write to the channel and immediately yield to the other goroutine, if it was not already running (otherwise would fall back to the existing channel behavior). Using this operator in the above situation would make it much more efficient in general.
However, this is a language change, and confusing to users. When do you use which operator? It would be good to have the effect of this optimization out of the box.
The text was updated successfully, but these errors were encountered:
What about just enforcing a minimum delay between when a G is created and when it can be stolen? That gives the local P time to finish the spawning G (finish = either done or block) and pick up the new G itself.
The delay would be on the order of the overhead to move a G between processors (sys calls, cache warmup, etc.)
The tricky part is to not even wake the remote P when the goroutine is queued. We want a timer somehow that can be cancelled if the G is started locally.
Yes, most of the waste is generated by the wakeup call itself. Ensuring that the other P does not steal the G is probably a minor improvement, but you're still going to waste a ton of cycles (maybe even doing these wake ups twice -- on (1) and (4)).
I think using a timer gets much trickier. This is the reason I have limited the proposal to compiler-identified sequences of "chansend(block=true); chanrecv(block=true)" calls. It's possible that the system thread could be pre-empted between those calls, but if the system is busy (though Ps in this process may still be idle) it's probably even more valuable to not waste useless cycles.
This has come up repeatedly. Obviously it is easy to recognize and fuse
It's harder to see that in more complex code that would benefit from the optimization, though. We've fiddled with heuristics in the runtime to try to wait a little bit before stealing a G from a P, and so on. Probably more tuning is needed.
It's unclear this needs to be a proposal, unless you are proposing a language change, and it sounds like you've backed away from that.
The way forward with a suggestion like this is to try implementing it and see how much of an improvement (and how general of an improvement) it yields.
I backed away from a language change proposal based on the assumption that it would likely not be accepted. My personal preference would be to have an operation like <~ that immediately switches to the other goroutine if currently waiting. (And behaves like a normal channel operation if busy.) But I realize that the existence of this operator might be confusing.
I think it's unclear how much of a impact this would have in general. This is probably just be a tiny optimization that doesn't matter in the general case, but can help in a few very specific ones. For us, it might let us structure some goroutine interactions much more efficiently.
I hacked something together, and it seems like there's a decent effect on microbenchmarks at least (unless I screwed something up).
The system time is telling @ 20x, and the extra 14% in CPU usage is indicative of an additional P waking up with nothing to do. (Or maybe it occasionally successfully steals the goroutine, which is also bad.)
Assuming this small optimization is readily acceptable -- what's the best way to group those operations and transform the channel calls? The runtime bits are straight-forward, but any up front guidance on the compiler side is appreciated. Otherwise, I'm just planning to call a specialized scan in walkstmt list, but maybe there's a better way.
I've started looking into this. I've got a very naive implementation (probably very similar to Adin's) to use with his microbenchmark.
Fixed time (
Fixed iterations (
I've included both since the the different fixed dimensions change the interpretation. e.g., the first case has higher cycles after because it is simply able to do a lot more work. And it still does nearly double the iterations in 30% less CPU time (== far less time stalled)!
This certainly looks worthwhile from the micro-benchmark perspective. The questions remaining to me are if we can efficiently and reliably detect these scenarios, and if they affect many programs.
Neither @amscanne nor my prototype change any language syntax or APIs. Rather, the compiler detects channel send followed immediately by channel receive and rather than calling the typical
Both prototypes are rudimentary and would probably hurt performance for many programs due to poor decisions and would need more refinement.