runtime: make chan-based generators faster #8903
In some cases parallelism is not needed or desirable: fine-grained chan-based generators/iterators, or a "coroutine-like" approach where you have several independent stacks and manually switch between them. Sync channels substantially limit parallelism, to the point where pipeline-like parallelization does not work. But they do not eliminate it entirely, because after an operation both producer and consumer are runnable. The current implementation handles this "contact bounce" under-parallelization very inefficiently: goroutines are constantly scheduled and descheduled, threads are constantly parked and unparked, work migrates between threads, etc.

The idea is to limit parallelism on sync channels even more aggressively, so that fine-grained communication over sync channels can be very efficient, but without eliminating concurrency or affecting semantics.

Implementation:

1. Each P has an additional local work queue, let's call it a Domain. Work from a Domain can't be stolen by other P's. A Domain has a maximum size of 4 goroutines.

2. When a goroutine unblocks another goroutine on a sync chan, it adds the other goroutine to its own Domain (the Domain of the current P) instead of putting it on the normal work queue. This effectively limits parallelism. We can use Domains in other situations, but that is outside the scope of this bug.

3. When a goroutine blocks, P schedules one of the goroutines from the Domain (if it's not empty). In some circumstances (e.g. a goroutine is preempted due to time slice exhaustion, or blocks in a syscall) P can decide to just move all goroutines in the Domain to the normal work queue.

Then, *before* a goroutine blocks on a sync chan, it saves the Hchan and elem pointers in G and switches to another goroutine in the Domain *once*. Or, if it sees that a goroutine in the Domain wants to perform the pairing operation on the same chan, it executes the operation with that other goroutine.
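The queueing rules in steps 1-3 can be illustrated with a toy model. This is only a sketch, not the runtime code from the prototype CL: the `P` and `G` types, the `ready`, `next`, and `flush` method names, and the choice to fall back to the normal queue when the Domain is full are all assumptions made for illustration (the proposal does not say what happens on overflow).

```go
package main

import "fmt"

// maxDomain mirrors the limit of 4 goroutines stated in the proposal.
const maxDomain = 4

// G is a hypothetical stand-in for a goroutine descriptor.
type G struct{ id int }

// P is a hypothetical stand-in for a processor with two local queues.
type P struct {
	runq   []*G // normal work queue; other Ps may steal from it
	domain []*G // Domain; never stolen by other Ps
}

// ready models step 2: the running goroutine has unblocked g on a sync chan.
// Assumption: when the Domain is full, g goes to the normal work queue.
func (p *P) ready(g *G) {
	if len(p.domain) < maxDomain {
		p.domain = append(p.domain, g)
		return
	}
	p.runq = append(p.runq, g)
}

// next models step 3: the running goroutine blocked, so pick the next one,
// preferring the Domain over the normal work queue.
func (p *P) next() *G {
	if len(p.domain) > 0 {
		g := p.domain[0]
		p.domain = p.domain[1:]
		return g
	}
	if len(p.runq) > 0 {
		g := p.runq[0]
		p.runq = p.runq[1:]
		return g
	}
	return nil
}

// flush models the break-up case (preemption, syscall): every goroutine in
// the Domain moves to the normal work queue, where it can be stolen again.
func (p *P) flush() {
	p.runq = append(p.runq, p.domain...)
	p.domain = nil
}

func main() {
	p := &P{}
	for i := 1; i <= 5; i++ {
		p.ready(&G{id: i})
	}
	fmt.Println(len(p.domain), len(p.runq)) // 4 1
	fmt.Println(p.next().id)                // 1
	p.flush()
	fmt.Println(len(p.domain), len(p.runq)) // 0 4
}
```

The model leaves out the direct hand-off described in the last paragraph above (saved Hchan and elem pointers, pairing with a goroutine already in the Domain), since that depends on channel internals.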
This allows tight communication over sync channels to be implemented at the cost of just a few function calls, and without severe degradation with GOMAXPROCS>1. Communication domains form and break up dynamically. Program semantics are not affected.

Here is a prototype implementation: https://golang.org/cl/74180043/

I've benchmarked it on a real application where chan-based generators were abandoned for the aforementioned reasons: https://groups.google.com/d/msg/golang-dev/0IElw_BbTrk/cGHMdNoHGQEJ

The change provides a 17.5x speedup (and that's not the limit, as runtime.switchto can be made significantly more efficient).
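For reference, the kind of fine-grained chan-based generator this issue is about looks roughly like the following. It is a minimal sketch and is not taken from the benchmarked application; `squares` and `sumSquares` are hypothetical names. Every value crosses an unbuffered (sync) channel, so each iteration involves one producer/consumer hand-off.

```go
package main

import "fmt"

// squares is a chan-based generator: it yields 1*1, 2*2, ..., n*n over an
// unbuffered channel and closes the channel when it is done.
func squares(n int) <-chan int {
	ch := make(chan int) // unbuffered, i.e. a sync channel
	go func() {
		defer close(ch)
		for i := 1; i <= n; i++ {
			ch <- i * i // blocks until the consumer receives the value
		}
	}()
	return ch
}

// sumSquares consumes the generator like an iterator.
func sumSquares(n int) int {
	total := 0
	for v := range squares(n) {
		total += v
	}
	return total
}

func main() {
	fmt.Println(sumSquares(4)) // 1 + 4 + 9 + 16 = 30
}
```

The work per item here is tiny compared with the cost of scheduling, which is why the producer/consumer switch dominates in such code.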