cmd/compile: function's interface (send on channel vs normal return) should not affect quality of compilation of function's logic #63332
Comments
I think this is a consequence of variables being sent to a channel effectively having their address taken, disabling register allocation for them.
cc @golang/compiler
The following hack, using an unnamed enclosing function, also works and is 4 times faster than the original:

```go
// 4 times faster version of sumpar1:
func sumpar1(numiter int, c chan<- float64) {
	sum := func() float64 {
		sum := 0.0
		for i := 0; i < numiter; i++ {
			sum += 1.0
		}
		return sum
	}()
	c <- sum
}

// original sumpar1:
func sumpar1(numiter int, c chan<- float64) {
	sum := 0.0
	for i := 0; i < numiter; i++ {
		sum += 1.0
	}
	c <- sum
}
```

Also, I tried some more meaningful logic (instead of adding a constant
I've been thinking for a little bit that maybe we shouldn't mark variables addrtaken due to runtime calls at all, at least for SSA-able types. Instead, if we need to pass a value by address to the runtime and it's not already addrtaken, we allocate a temporary variable to spill into it just for the runtime call.
And the following simple trick of creating a new (and otherwise useless) variable also gives the 4x speedup:

```go
// 4 times faster version of sumpar1 (by creating sendsum):
func sumpar1(in []float64, start int, end int, c chan<- float64) {
	sum := 0.0
	for i := start; i < end; i++ {
		sum += in[i]
	}
	sendsum := sum
	c <- sendsum
}

// original sumpar1:
func sumpar1(numiter int, c chan<- float64) {
	sum := 0.0
	for i := 0; i < numiter; i++ {
		sum += 1.0
	}
	c <- sum
}

// 4 times faster aggregation
func (sp *special) sumseq(in []float64, start int, end int) float64 {
	sum := 0.0
	for i := start; i < end; i++ {
		sum += in[i]
	}
	sp.sum = sum
	return sp.sum
}

// compared to the following:
func (sp *special) sumseq(in []float64, start int, end int) float64 {
	sp.sum = 0.0
	for i := start; i < end; i++ {
		sp.sum += in[i]
	}
	return sp.sum
}

// this runs 4 times slower than the original version below
func sumseq(numiter int) float64 {
	sum := new(float64)
	for i := 0; i < numiter; i++ {
		*sum += 1.0
	}
	return *sum
}

// original version of sumseq
func sumseq(numiter int) float64 {
	sum := 0.0
	for i := 0; i < numiter; i++ {
		sum += 1.0
	}
	return sum
}
```
@amanvm No, the change I suggested would only help the first case. The second and third cases require pointer alias analysis, which we don't currently do much of. E.g., in case two, it's possible through the use of package unsafe for … Case 3 seems like it should be possible to optimize if we had a pass that looked for memory-resident variables that could be lifted to SSA form.
Change https://go.dev/cl/541715 mentions this issue:
@mdempsky Thanks for the fix! Does the patch fix just case 1? If yes, do you think it is a good idea to open other issues for case 2 and/or 3?
@0-issue Correct, only the first issue is fixed. Filing issues for the other two is fine. I'd suggest filing them as two separate issues, because they're going to need different fixes. Please link back to this issue if you do. Thanks.
@mdempsky I saw a release (1.21.5) happen since this fix, but the behavior is the same. Did this go into that release? If not, which release will have this fix? Thanks!
This change wasn't backported; it will only be available in the next major release (1.22.0).
It's in now, and works great! Thanks @mdempsky
What version of Go are you using (`go version`)?

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (`go env`)?

`go env` output

What did you do?
Background: This quest started with an observed anomaly: a simple "add 1.0 one billion times" program ran slower when the work was distributed between 2 parallel goroutines (each doing half the work) than when a single goroutine did all billion additions. There is no interaction between the independent goroutines while they compute, so this isn't about synchronization penalties. Nor are new memory pages touched in the loop, so it isn't about memory either.
The loop in `sumpar1` is 4 times slower than the one in `sumseq`, even though the loops' logic is identical. The only difference is that `sumpar1` sends the computed value on a channel instead of returning it normally. This observation isn't about the delay added by the channel send itself; it's about the quality of the code compiled for the loop in each function.

This behavior can be bypassed for now with an ugly hack. For a function like `sumpar1` that sends the result of a tight loop on a channel, break it into two parts: a function like `sumseq` that runs the tight loop and returns the value normally, and a wrapper function `sumpar2` that just calls `sumseq` and sends its return value on the channel. You can see the performance results in the output listed below the complete code (`sumpar2` is 4 times faster than `sumpar1`). Update: using an anonymous function or a useless variable, as mentioned in the comments (#63332 (comment), #63332 (comment)), also achieves the 4x speedup and would be more maintainable until the compiler addresses the issue.

Why might fixing this be worthwhile? a) the wrapping hack described in the previous paragraph is quite ugly, b) it makes Go difficult to reason about if such optimizations have to be done manually, and c) this might be low-hanging fruit that could help existing codebases. It seems like this could be an easy fix in Go's compiler if optimization were first run on the "business logic" and the interface (channel vs. normal function return) were glued on afterwards, though I am not a compiler expert.
Complete code (count.go):
Output:
What did you expect to see?
The compilation quality of a function's "business logic" should not depend on whether its output is sent on a channel or returned normally.
What did you see instead?
Sending the result on a channel degrades the compiled code for the function's logic.
Below is the assembly of the loop in `sumseq` and `sumpar1` on an ARM64 machine. Similar degradation in the assembly is seen on other architectures (though I don't have a way to benchmark them):

sumseq:

sumpar1: