
cmd/compile,runtime: optimize channel filling idiom #51553

Open
CAFxX opened this issue Mar 9, 2022 · 12 comments
Labels: NeedsDecision, Performance
@CAFxX (Contributor) commented Mar 9, 2022

In the following snippet, the compiler could recognize that the channel cannot be accessed concurrently by anything else, and optimize the whole loop into a straight copy of the slice's contents into the channel's send queue [1]:

ch := make(chan T, len(sliceT))
for _, e := range sliceT {
  ch <- e
}

The main benefit is that this would avoid a lock/unlock for each element, reducing CPU utilization.

A potentially better alternative (one I haven't fully thought through in terms of compliance with the spec) is to provide a way (e.g. via copy), or to otherwise allow the compiler, to send/receive batches of multiple elements (regardless of whether the channel is already in use by multiple goroutines), as long as there is space in the channel and the sending/receiving code performs no other synchronization operation. This approach would still amortize the lock/unlock cost across multiple elements, and it would likely apply to more cases than just the specific one above.

I see this idiom pop up most frequently where work created upfront needs to be distributed to a bounded pool of workers and, more generally, in iterator patterns (even though their performance is often cited as problematic). It is also used in some cases to create semaphores.

Footnotes

  [1] Going one step further: if the slice itself can be proven dead after the loop, both the copy of the slice contents and the allocation of the channel's send queue could be skipped, and the channel could adopt the slice itself as its send queue, turning an O(n) operation into O(1).
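To make the worker-pool use case concrete, here is a minimal runnable sketch (the `process` function and the sum-of-squares job are illustrative, not from the issue): the jobs channel is filled up front with exactly the idiom above, then drained by a bounded pool of goroutines.

```go
package main

import (
	"fmt"
	"sync"
)

// process distributes pre-created work to a bounded pool of workers
// via a buffered channel filled up front (hypothetical example job:
// summing the squares of the inputs).
func process(jobs []int, workers int) int {
	ch := make(chan int, len(jobs))
	for _, j := range jobs { // the filling idiom under discussion
		ch <- j
	}
	close(ch) // no more work; workers exit when the buffer drains

	var (
		mu    sync.Mutex
		total int
		wg    sync.WaitGroup
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range ch {
				mu.Lock()
				total += j * j
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return total
}

func main() {
	fmt.Println(process([]int{1, 2, 3, 4}, 2)) // 1+4+9+16 = 30
}
```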

@ianlancetaylor (Contributor) commented Mar 9, 2022

Is this really a common idiom?

@ALTree ALTree added Performance NeedsDecision labels Mar 9, 2022
@ALTree ALTree added this to the Unplanned milestone Mar 9, 2022
@robpike (Contributor) commented Mar 9, 2022

No. Also, why?

@mvdan (Member) commented Mar 9, 2022

I've seen this in a few pieces of code where the user wants a semaphore of size N. They'll fill a buffered channel of size N, and then acquiring the semaphore is a channel receive operation, and releasing it is a channel send operation.

Although that's somewhat unnecessary: one can use an empty buffered channel of size N and swap the channel operations, so that an acquire is a channel send and a release is a channel receive. It's just perhaps less obvious to use a channel this way, if you conceptually think of the "acquire semaphore" operation as a "get".
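For readers unfamiliar with the two styles, here is a minimal side-by-side sketch (illustrative only; `newFilledSem` is a hypothetical helper name):

```go
package main

import "fmt"

// newFilledSem builds the pre-filled style of semaphore:
// acquire = receive, release = send.
func newFilledSem(n int) chan struct{} {
	sem := make(chan struct{}, n)
	for i := 0; i < n; i++ {
		sem <- struct{}{} // the filling loop this issue is about
	}
	return sem
}

func main() {
	// Style 1: pre-filled channel of size N.
	sem1 := newFilledSem(3)
	<-sem1             // acquire
	sem1 <- struct{}{} // release

	// Style 2: empty buffered channel of size N, operations swapped.
	sem2 := make(chan struct{}, 3)
	sem2 <- struct{}{} // acquire
	<-sem2             // release

	fmt.Println(len(sem1), len(sem2)) // 3 0
}
```

Style 2 avoids the filling loop entirely, which is one reason the thread questions how common the idiom really is.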

@mvdan (Member) commented Mar 9, 2022

Also cc @bcmills, since he's given significant thought to concurrency idioms using channels.

@CAFxX (Contributor, Author) commented Mar 9, 2022

FWIW I see this pop up most frequently where work created upfront needs to be distributed to a bounded pool of workers.

If it makes the discussion easier I can remove all mentions of this being an idiom.

@bcmills (Member) commented Mar 9, 2022

@mvdan

It's just perhaps less obvious to use a channel this way, if you conceptually think of the "acquire semaphore" operation as a "get".

FWIW, you can modify the idiom somewhat to make it intuitive in the other direction: send-to-acquire is analogous to a “lockout–tagout” mechanism, where sending to the channel adds your lock and receiving from the channel removes it.

@bcmills (Member) commented Mar 9, 2022

@CAFxX

The main benefit is that this would avoid a lock/unlock for each element.

An uncontended lock/unlock is typically inexpensive to begin with.

That said, it seems to me that this is a special case of inlining and escape analysis. If we know that the channel has not yet escaped (to any closures or other goroutines), then we can inline the channel-send operation and notice that its lock/unlock operations (a) always take the fast (uncontended) path and (b) always cancel out. Then ordinary optimizations (inlining, constant-folding, dead-code elimination, and so on) should be sufficient to eliminate the lock/unlock operations entirely.

@CAFxX (Contributor, Author) commented Mar 10, 2022

An uncontended lock/unlock is typically inexpensive to begin with.

Yes, compared to a contended one. I think we can agree, though, that compared to no lock at all it's not exactly inexpensive:

name                    time/op
ChanFilling/channel-16   272ns ± 3%
ChanFilling/mutex-16     161ns ± 3%
ChanFilling/plain-16    30.1ns ± 3%
func BenchmarkChanFilling(b *testing.B) {
	n := 10
	src := make([]int, n)
	b.Run("channel", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			ch := make(chan int, n)
			for i := range src {
				ch <- src[i]
			}
		}
	})
	b.Run("mutex", func(b *testing.B) {
		var m sync.Mutex
		for i := 0; i < b.N; i++ {
			dst := make([]int, n)
			for i := range src {
				m.Lock()
				dst[i] = src[i]
				m.Unlock()
			}
		}
	})
	b.Run("plain", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			dst := make([]int, n)
			for i := range src {
				dst[i] = src[i]
			}
		}
	})
}

That said, it seems to me that this is a special case of inlining and escape analysis. If we know that the channel has not yet escaped (to any closures or other goroutines), then we can inline the channel-send operation and notice that its lock/unlock operations (a) always take the fast (uncontended) path and (b) always cancel out. Then ordinary optimizations (inlining, constant-folding, dead-code elimination, and so on) should be sufficient to eliminate the lock/unlock operations entirely.

That would be ideal, as such powerful analysis capabilities and speculative optimizations would apply to a ton of other situations... not sure though of how well something like that would fit with the current compiler.

@mpx (Contributor) commented Mar 10, 2022

Are there examples where filling a new channel is a performance bottleneck?

The small extra setup cost will quickly become irrelevant if there are significant channel operations afterwards. If new channels are repeatedly created and discarded, then allocations will probably have a larger effect on performance (assuming the channels escape, since they need to be passed to other goroutines).

Edit: enabling ongoing batch transfers into/out of a channel likely has performance advantages (less so for setup alone).

@josharian (Contributor) commented Mar 22, 2022

This is wandering a bit afield, but: We could make copy work on slices/channels.

n := copy(chan T, []T) would block until all elements from the slice have been sent on the channel; n would contain the length of the slice. That is, it would be equivalent to:

for _, t := range s {
  c <- t
}
yield len(s)

n := copy([]T, chan T) would block until as many elements of the channel as possible have been received and copied into the slice; n would contain an element count. That is, it would be equivalent to:

n := 0
for n < len(s) {
  t, ok := <-c
  if !ok {
    break
  }
  s[n] = t
  n++
}
yield n

(But I also want copy to work on maps, NaNs or no NaNs, so I'm perhaps a bit copy-happy.)
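The proposed copy forms can't be written in Go today, but the semantics described above can be emulated with ordinary helper functions (the names `copyToChan` and `copyFromChan` are purely illustrative, not part of the proposal):

```go
package main

import "fmt"

// copyToChan emulates the proposed copy(chan T, []T): it blocks until
// every element of s has been sent on c, and returns len(s).
func copyToChan(c chan<- int, s []int) int {
	for _, t := range s {
		c <- t
	}
	return len(s)
}

// copyFromChan emulates the proposed copy([]T, chan T): it receives
// until s is full or c is closed, and returns the element count.
func copyFromChan(s []int, c <-chan int) int {
	n := 0
	for n < len(s) {
		t, ok := <-c
		if !ok {
			break
		}
		s[n] = t
		n++
	}
	return n
}

func main() {
	c := make(chan int, 4)
	sent := copyToChan(c, []int{1, 2, 3})
	close(c)
	dst := make([]int, 4)
	got := copyFromChan(dst, c)
	fmt.Println(sent, got, dst) // 3 3 [1 2 3 0]
}
```

The built-in version could do this in bulk under the channel lock, which is where the performance win discussed in this issue would come from.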

@robpike (Contributor) commented Mar 23, 2022

This (using copy) has come up before. The performance variability with data type is concerning, but maybe it's OK.

@CAFxX (Contributor, Author) commented Apr 17, 2022

While it would have been better in my view if this was just a compiler transformation that required no changes to the language spec, I'm fine with pivoting to a proposal to extend copy.

Just FTR I recently prototyped a limited version of the runtime bits to support bulk sends and it's up to 20x faster in the uncontended case, depending on the number of elements.

@josharian I have a question regarding your proposed semantics: why blocking instead of non-blocking? If the operations are non-blocking, it is possible (albeit not very ergonomic) to turn them into blocking versions, but if they are blocking they can't easily be turned into non-blocking ones. Or are you thinking that copy should also be usable as a case in a select, to be able to leverage the default case for non-blocking operations?
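For reference, the point that non-blocking operations compose into blocking ones (but not vice versa) can be illustrated with today's select/default on single elements (a sketch; `trySend` is a hypothetical helper, not a proposed API):

```go
package main

import "fmt"

// trySend is a non-blocking send built with select/default. A blocking
// send can be layered on top of a non-blocking primitive (e.g. by
// retrying), whereas a blocking primitive cannot easily be made
// non-blocking, which is the asymmetry raised in the question above.
func trySend(c chan int, v int) bool {
	select {
	case c <- v:
		return true
	default:
		return false
	}
}

func main() {
	c := make(chan int, 1)
	fmt.Println(trySend(c, 1)) // true: buffer has space
	fmt.Println(trySend(c, 2)) // false: buffer full, returns instead of blocking
}
```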
