Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: function's interface (send on channel vs normal return) should not affect quality of compilation of function's logic #63332

Closed
0-issue opened this issue Oct 2, 2023 · 12 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@0-issue
Copy link

0-issue commented Oct 2, 2023

What version of Go are you using (go version)?

$ go version
go version go1.21.1 darwin/arm64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=''
GOARCH='arm64'
GOBIN=''
GOCACHE='/Users/amanvm/Library/Caches/go-build'
GOENV='/Users/amanvm/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/amanvm/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='darwin'
GOPATH='/Users/amanvm/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/opt/homebrew/Cellar/go/1.21.1/libexec'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/opt/homebrew/Cellar/go/1.21.1/libexec/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.21.1'
GCCGO='gccgo'
AR='ar'
CC='cc'
CXX='c++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/4v/hg2cqzyx3rd9hv7kx2_7b8w40000gn/T/go-build72951666=/tmp/go-build -gno-record-gcc-switches -fno-common'

What did you do?

Background: This quest started with an observed anomaly that a simple "add 1.0 billion times" program performed slower when work was distributed among 2 parallel goroutines (each doing half the work) compared to a single one doing billion additions. There isn't any interaction between the independent goroutines while they do the computations, so this isn't about synchronization penalties. Nor are new memory pages being touched in the loop, so it isn't about memory either.

Loop in sumpar1 is 4 times slower than sumseq, even though there isn't any difference in loop's logic. The only way they vary is that sumpar1 returns value computed in the function on a channel instead of a normal function return. This observation isn't about delay added by writing to channel itself, it's about the quality of loop compiled in both the functions.

This behavior can be bypassed for now with an ugly hack. For a function like sumpar1 that sends result of a tight loop on a channel, break it into two parts: a function that does the tight loop and returns the value normally like sumseq, and create a wrapper function sumpar2 that just instantiates sumseq and returns it's value on channel. You can see the performance results in the listing below entire code (sumpar2 is 4 times faster than sumpar1). Update: using an anonymous function or a useless variable like mentioned in comments: #63332 (comment), #63332 (comment) also achieves 4x speedup, and would be more maintainable till compiler addresses the issue.

Why is fixing this might be worthwhile? a) the hack of wrapping mentioned in previous para is quite ugly, b) it makes it difficult to reason about go if such optimizations are to be manually done, c) this might be a low-hanging fruit that could help existing codebases. Seems like this could be an easy fix in go's compiler if the optimization is first run on "business logic" and then the interface (channel vs normal function return) is glued to it later. Though I am not a compiler expert.

func sumseq(numiter int) float64 {
	sum := 0.0
	for i := 0; i < numiter; i++ {
		sum += 1.0
	}
	return sum
}

func sumpar1(numiter int, c chan<- float64) {
	sum := 0.0
	for i := 0; i < numiter; i++ {
		sum += 1.0
	}
	c <- sum
}

func sumpar2(numiter int, c chan<- float64) {
	c <- sumseq(numiter)
}

Complete code (count.go):

package main

import (
	"fmt"
	"time"
)

const (
	numiterations = 1_000_000_000
	numworkers    = 2
)

func sumseq(numiter int) float64 {
	sum := 0.0
	for i := 0; i < numiter; i++ {
		sum += 1.0
	}
	return sum
}

func sumpar1(numiter int, c chan<- float64) {
	sum := 0.0
	for i := 0; i < numiter; i++ {
		sum += 1.0
	}
	c <- sum
}

func sumpar2(numiter int, c chan<- float64) {
	c <- sumseq(numiter)
}

func main() {
	c := make(chan float64, numworkers)

	// distribute counting among #numworkers parallel routines
	// using sumpar1
	start := time.Now()
	result := 0.0
	for i := 0; i < numworkers; i++ {
		go sumpar1(numiterations/numworkers, c)
	}
	for i := 0; i < numworkers; i++ {
		result += <-c
	}
	delta := time.Since(start)
	fmt.Printf("sumpar1 result: %f took: %v\n", result, delta)

	// distribute counting among #numworkers parallel routines
	// using sumpar2
	start = time.Now()
	result = 0.0
	for i := 0; i < numworkers; i++ {
		go sumpar2(numiterations/numworkers, c)
	}
	for i := 0; i < numworkers; i++ {
		result += <-c
	}
	delta = time.Since(start)
	fmt.Printf("sumpar2 result: %f took: %v\n", result, delta)

	// do all counting in single routine
	// using sumseq
	start = time.Now()
	result = sumseq(numiterations)
	delta = time.Since(start)
	fmt.Printf(" sumseq result: %f took: %v\n", result, delta)
}

Output:

sumpar1 result: 1000000000.000000 took: 1.780064625s
sumpar2 result: 1000000000.000000 took: 484.627042ms
 sumseq result: 1000000000.000000 took: 940.366166ms

What did you expect to see?

The compilation quality of a function's "business logic" should not be affected if output is returned on channel vs returned normally.

What did you see instead?

Returning on channel affects compilation quality of logic in function.

Check the assembly of the loop for sumseq and sumpar1 on an ARM64 machine below. Similar degradation in assembly is seen on other architectures (though I don't have a way to benchmark them):

sumseq:

	0x000c 00012 (count.go:15)	ADD	$1, R1, R1
	0x0010 00016 (count.go:16)	FMOVD	$(1.0), F1
	0x0014 00020 (count.go:16)	FADDD	F1, F0, F0
	0x0018 00024 (count.go:15)	CMP	R1, R0
	0x001c 00028 (count.go:15)	BGT	12

sumpar1:

	0x0028 00040 (count.go:24)	FMOVD	main.sum-8(SP), F0
	0x002c 00044 (count.go:24)	FMOVD	$(1.0), F1
	0x0030 00048 (count.go:24)	FADDD	F1, F0, F0
	0x0034 00052 (count.go:24)	FMOVD	F0, main.sum-8(SP)
	0x0038 00056 (count.go:23)	ADD	$1, R2, R2
	0x003c 00060 (count.go:23)	CMP	R2, R0
	0x0040 00064 (count.go:23)	BGT	40
@randall77
Copy link
Contributor

I think this is a consequence of variables being sent to a channel effectively having their address taken, disabling register allocation for them.
We should fix this. Probably a vestige of old bits of the compiler.

@randall77 randall77 added this to the Unplanned milestone Oct 2, 2023
@prattmic
Copy link
Member

prattmic commented Oct 3, 2023

cc @golang/compiler

@prattmic prattmic added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Oct 3, 2023
@0-issue
Copy link
Author

0-issue commented Oct 3, 2023

The following hack of unnamed enclosed function also works, and is 4 times faster than original sumpar1 (copied below):

// 4 times faster version of sumpar1:
func sumpar1(numiter int, c chan<- float64) {
	sum := func() float64 {
		sum := 0.0
		for i := 0; i < numiter; i++ {
			sum += 1.0
		}
		return sum
	}()
	c <- sum
}

// original sumpar1:
func sumpar1(numiter int, c chan<- float64) {
	sum := 0.0
	for i := 0; i < numiter; i++ {
		sum += 1.0
	}
	c <- sum
}

Also, I tried some more meaningful logic (instead of adding a constant 1.0), like accumulating a billion preloaded random float64 values from memory, and the observed speed difference is similar. Of course the speedup will go down with increased logic per iteration, but tight loops doing one or two arithmetic ops are out there.

@mdempsky
Copy link
Contributor

mdempsky commented Oct 3, 2023

I think this is a consequence of variables being sent to a channel effectively having their address taken, disabling register allocation for them.

I've been thinking for a little bit that maybe we shouldn't mark variables addrtaken due to runtime calls at all, at least for SSA-able types. Instead, if we need to pass a value by address to the runtime and it's not already addrtaken, we allocate a temporary variable to spill into it just for the runtime call.

@0-issue
Copy link
Author

0-issue commented Oct 5, 2023

And this following simple trick (below) of creating a new (and useless) variable sendsum works too... Similar behavior is seen where loop aggregates like sum are pre-allocated in memory, for instance if sum were a member of a struct on which a method like subpar was called. @mdempsky the change you have in mind, will it optimize all of the cases 1., 2., 3., below? Is there a guiding document for producing performant code in go?

  1. Create a useless variable sendsum, first assign sum to it, and then send sendsum over channel:
// 4 times faster version of sumpar1 (by creating sendsum):
func sumpar1(in []float64, start int, end int, c chan<- float64) {
	sum := 0.0
	for i := start; i < end; i++ {
		sum += in[i]
	}
	sendsum := sum
	c <- sendsum
}

// original sumpar1:
func sumpar1(numiter int, c chan<- float64) {
	sum := 0.0
	for i := 0; i < numiter; i++ {
		sum += 1.0
	}
	c <- sum
}
  1. Example without channel (method and struct member), and normal function returns:
// 4 times faster aggregation
func (sp *special) sumseq(in []float64, start int, end int) float64 {
	sum := 0.0
	for i := start; i < end; i++ {
		sum += in[i]
	}
	sp.sum = sum
	return sp.sum
}

// compared to the following:
func (sp *special) sumseq(in []float64, start int, end int) float64 {
	sp.sum = 0.0
	for i := start; i < end; i++ {
		sp.sum += in[i]
	}
	return sp.sum
}
  1. Lastly, go compiler does not optimize pointer references within loop:
// this runs 4 times slower than the original version below
func sumseq(numiter int) float64 {
	sum := new(float64)
	for i := 0; i < numiter; i++ {
		*sum += 1.0
	}
	return *sum
}

// original version of sumseq
func sumseq(numiter int) float64 {
	sum := 0.0
	for i := 0; i < numiter; i++ {
		sum += 1.0
	}
	return sum
}

@mdempsky
Copy link
Contributor

mdempsky commented Oct 9, 2023

@amanvm No, the change I suggested would only help the first case.

The second and third cases require pointer alias analysis, which we don't currently do much of. E.g., in case two, it's possible through the use of package unsafe for sp.sum and in[start:end] to alias.

Case 3 seems like it should be possible to optimize if we had a pass that looked for memory-resident variables that could be lifted to SSA form.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Oct 9, 2023
@mdempsky mdempsky self-assigned this Nov 6, 2023
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/541715 mentions this issue: cmd/compile/internal/walk: copy SSA-able variables

@0-issue
Copy link
Author

0-issue commented Nov 21, 2023

@mdempsky Thanks for the fix! Does the patch fix just case 1? If yes, do you think it is a good idea to open other issues for case 2 and/or 3?

@mdempsky
Copy link
Contributor

@0-issue Correct, only the first issue is fixed.

Filing issues for the other two is fine. I'd suggest filing them as two separate issues, because they're going to need different fixes. Please link back to this issue if you do.

Thanks.

@0-issue
Copy link
Author

0-issue commented Dec 7, 2023

@mdempsky I saw a release (1.21.5) happen since this fix, but the behavior is the same. Did this go in that release, if not which release would have this fix? Thanks!

@seankhliao
Copy link
Member

seankhliao commented Dec 7, 2023

This change wasn't backported, it'll only be available in the next minor release (1.22.0)

@0-issue
Copy link
Author

0-issue commented Mar 6, 2024

It's in now, and works great! Thanks @mdempsky

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
None yet
Development

No branches or pull requests

6 participants