cmd/compile,runtime: optimize channel filling idiom #51553
Comments
Is this really a common idiom?
No. Also, why?
I've seen this in a few pieces of code where the user wants a semaphore of size N. They'll fill a buffered channel of size N, and then acquiring the semaphore is a channel receive operation, and releasing it is a channel send operation. Although that's somewhat unnecessary; one can use an empty buffered channel of size N, and swap the channel operations so that an acquire is a channel send, and a release is a channel receive. It's just perhaps less obvious to use a channel this way, if you conceptually think of the "acquire semaphore" operation as a "get".
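A sketch of both shapes, for concreteness (names are made up; both give a counting semaphore of size n):

package main

func main() {
	const n = 3

	// Receive-to-acquire: pre-fill a buffered channel of size n;
	// acquiring a slot is a receive, releasing it is a send.
	sem := make(chan struct{}, n)
	for i := 0; i < n; i++ {
		sem <- struct{}{}
	}
	<-sem             // acquire
	sem <- struct{}{} // release

	// Send-to-acquire: the same semaphore with an empty channel and
	// the operations swapped; the filling loop disappears entirely.
	sem2 := make(chan struct{}, n)
	sem2 <- struct{}{} // acquire
	<-sem2             // release
}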
Also cc @bcmills, since he's given significant thought to concurrency idioms using channels.
FWIW I see this pop up most frequently where work created upfront needs to be distributed to a bounded pool of workers. If it makes the discussion easier I can remove all mentions of this being an idiom.
FWIW, you can modify the idiom somewhat to make it intuitive in the other direction: send-to-acquire is analogous to a “lockout–tagout” mechanism, where sending to the channel adds your lock and receiving from the channel removes it.
An uncontended lock/unlock is typically inexpensive to begin with. That said, it seems to me that this is a special case of inlining and escape analysis. If we know that the channel has not yet escaped (to any closures or other goroutines), then we can inline the channel-send operation and notice that its lock/unlock operations (a) always take the fast (uncontended) path and (b) always cancel out. Then ordinary optimizations (inlining, constant-folding, dead-code elimination, and so on) should be sufficient to eliminate the lock/unlock operations entirely.
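Roughly, a send on such a channel could reduce to plain buffer stores. A sketch (field names loosely follow the runtime's hchan, but this is simplified illustrative code, not anything the compiler emits today):

package chanopt

// hchan is a simplified stand-in for the runtime's channel header;
// the real runtime.hchan uses an unsafe ring buffer and a runtime mutex.
type hchan struct {
	buf    []int // buffered elements (concrete type for illustration)
	sendx  int   // index of the next slot to send into
	qcount int   // number of elements currently buffered
}

// sendFast is what `c <- v` could reduce to for a non-escaping channel
// with known buffer space: the lock (always uncontended) and unlock
// cancel out, leaving only the stores.
func sendFast(c *hchan, v int) {
	// lock(&c.lock)   // always the fast path; eliminated
	c.buf[c.sendx] = v
	c.sendx = (c.sendx + 1) % len(c.buf)
	c.qcount++
	// unlock(&c.lock) // cancels the lock above; eliminated
}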
Yes, compared to a contended one. I think we can agree though that compared to no lock at all it's not exactly inexpensive:
package chanfill_test

import (
	"sync"
	"testing"
)

func BenchmarkChanFilling(b *testing.B) {
	n := 10
	src := make([]int, n)
	// Fill a fresh buffered channel one element at a time:
	// every send takes and releases the channel lock.
	b.Run("channel", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			ch := make(chan int, n)
			for j := range src {
				ch <- src[j]
			}
		}
	})
	// The same per-element copy, with an explicit (uncontended)
	// mutex around each store.
	b.Run("mutex", func(b *testing.B) {
		var m sync.Mutex
		for i := 0; i < b.N; i++ {
			dst := make([]int, n)
			for j := range src {
				m.Lock()
				dst[j] = src[j]
				m.Unlock()
			}
		}
	})
	// Baseline: the same copy with no synchronization at all.
	b.Run("plain", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			dst := make([]int, n)
			for j := range src {
				dst[j] = src[j]
			}
		}
	})
}
That would be ideal, as such powerful analysis capabilities and speculative optimizations would apply to a ton of other situations... I'm not sure, though, how well something like that would fit with the current compiler.
Are there examples where filling a new channel is a performance bottleneck? The small extra setup cost will quickly become irrelevant if there are significant channel operations. Edit: enabling ongoing batch transfers into/out of a channel likely has performance advantages (less so for just setup).
This is wandering a bit afield, but: we could make copy work on channels, so that copy(c, s) would be equivalent to

for _, t := range s {
	c <- t
}

and would yield len(s).

(But I also want copy to work on maps, NaNs or no NaNs, so I'm perhaps a bit copy-happy.)
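In today's Go the proposed behavior can be written out as a helper (chanCopy is a made-up name; the built-in copy does not accept channel arguments):

package chancopy

// chanCopy is a stand-in for the proposed copy(c, s): it sends every
// element of s to c, blocking as needed, and yields len(s).
func chanCopy(c chan<- int, s []int) int {
	for _, t := range s {
		c <- t
	}
	return len(s)
}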
This (using copy) ...
While it would have been better in my view if this was just a compiler transformation that required no changes to the language spec, I'm fine with pivoting to a proposal to extend copy.

Just FTR, I recently prototyped a limited version of the runtime bits to support bulk sends, and it's up to 20x faster in the uncontended case, depending on the number of elements.

@josharian I have a question regarding your proposed semantics: why blocking instead of non-blocking? If they are non-blocking, it is possible (albeit not very ergonomic) to turn them into blocking versions... but if they are blocking, they can't be easily turned into non-blocking ones. Or are you thinking that ...
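To make the blocking/non-blocking point concrete: assuming a hypothetical non-blocking bulk send that transfers whatever currently fits and reports the count, the blocking form layers on top of it, but not the other way around. A sketch (trySendBulk is made up):

package chanbulk

// trySendBulk stands in for a hypothetical non-blocking bulk send:
// it transfers as many elements of s as fit right now and returns the
// count. (Written with select/default per element purely to pin down
// the semantics; a runtime version would take the lock once.)
func trySendBulk(c chan<- int, s []int) int {
	for i, v := range s {
		select {
		case c <- v:
		default:
			return i
		}
	}
	return len(s)
}

// sendAll recovers the blocking form from the non-blocking one:
// bulk-send what fits, then do one blocking send to wait for space,
// and repeat until everything has been transferred.
func sendAll(c chan<- int, s []int) {
	for len(s) > 0 {
		s = s[trySendBulk(c, s):]
		if len(s) > 0 {
			c <- s[0] // block until there is room again
			s = s[1:]
		}
	}
}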
In the following snippet, the compiler could recognize that the channel cannot be accessed concurrently by anything else, and optimize the whole loop to a straight copy of the contents of the slice into the channel's send queue[1]. The main benefit is that this would avoid a lock/unlock for each element, reducing CPU utilization.
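The loop in question has the same shape as the channel case in the benchmark above:

package chanfill

// fill is the channel-filling idiom under discussion: a fresh buffered
// channel is filled element by element, taking and releasing the
// channel lock once per send.
func fill(src []int) chan int {
	ch := make(chan int, len(src))
	for i := range src {
		ch <- src[i]
	}
	return ch
}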
A potentially better alternative (that I haven't really thought through in terms of compliance with the spec) is to provide a way (e.g. via copy), or otherwise to allow the compiler, to send/receive batches of multiple elements (regardless of whether the channel is already in use by multiple goroutines), as long as there is space in the channel and as long as the sending/receiving code does not perform any other synchronization operation. This approach would still amortize the lock/unlock cost across multiple elements and would likely apply to more cases than just the specific one above; a sketch follows below.

I see this idiom pop up most frequently where work created upfront needs to be distributed to a bounded pool of workers and, more generally, to build iterator patterns (even though performance is often cited as problematic). It is also used in some cases to create semaphores.
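A sketch of that batched alternative, using a made-up sendMany helper over a simplified channel (the real runtime.hchan differs; this only shows how one lock/unlock pair is amortized over the whole batch):

package chanbatch

import "sync"

// ch is a simplified stand-in for a buffered channel; the real
// runtime.hchan uses an unsafe ring buffer and a runtime mutex.
type ch struct {
	mu     sync.Mutex
	buf    []int
	sendx  int
	qcount int
}

// sendMany copies as many elements of s as currently fit under a
// single lock/unlock pair and returns how many were sent, instead of
// locking once per element.
func (c *ch) sendMany(s []int) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	n := 0
	for _, v := range s {
		if c.qcount == len(c.buf) {
			break // buffer full; the caller may block or retry
		}
		c.buf[c.sendx] = v
		c.sendx = (c.sendx + 1) % len(c.buf)
		c.qcount++
		n++
	}
	return n
}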
Footnotes
1. Going one step further, if the slice itself can be proven to be dead after the loop, the copy of the slice contents and the allocation of the channel send queue could be skipped and the channel could adopt the slice itself as its send queue, turning an O(n) operation into O(1).