Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: fluent interfaces doing new allocations in go 1.19 #57434

Open
CannibalVox opened this issue Dec 21, 2022 · 29 comments
Open

cmd/compile: fluent interfaces doing new allocations in go 1.19 #57434

CannibalVox opened this issue Dec 21, 2022 · 29 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@CannibalVox
Copy link

What version of Go are you using (go version)?

$ go version
go version go1.19.4 windows/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
set GO111MODULE=
set GOARCH=amd64
set GOBIN=
set GOCACHE=C:\Users\Stephen\AppData\Local\go-build
set GOENV=C:\Users\Stephen\AppData\Roaming\go\env
set GOEXE=.exe
set GOEXPERIMENT=
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOINSECURE=
set GOMODCACHE=C:\msys64\mingw64\pkg\mod
set GONOPROXY=
set GONOSUMDB=
set GOOS=windows
set GOPATH=C:/msys64/mingw64
set GOPRIVATE=
set GOPROXY=https://proxy.golang.org,direct
set GOROOT=C:\Program Files\Go
set GOSUMDB=sum.golang.org
set GOTMPDIR=
set GOTOOLDIR=C:\Program Files\Go\pkg\tool\windows_amd64
set GOVCS=
set GOVERSION=go1.19.4
set GCCGO=gccgo
set GOAMD64=v1
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=1
set GOMOD=C:\Users\Stephen\projects\vkngwrapper\math\go.mod
set GOWORK=C:\Users\Stephen\projects\vkngwrapper\go.work
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -mthreads -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=C:\msys64\tmp\go-build13049274=/tmp/go-build -gno-record-gcc-switches

What did you do?

Ran the following benchmarks:

package math

import "testing"

var output int

type Object struct {
	Val int
}

func (o *Object) Initialize() *Object {
	o.Val = 5
	return o
}

func (o *Object) Update() *Object {
	o.Val = o.Val + 1
	return o
}

func BenchmarkNoAllocs(b *testing.B) {
	b.ReportAllocs()

	for i := 0; i < b.N; i++ {
		var obj Object
		obj.Initialize()
		obj.Update()
		output = obj.Val
	}
}

func BenchmarkAllocs(b *testing.B) {
	b.ReportAllocs()

	for i := 0; i < b.N; i++ {
		var obj Object
		obj.Initialize().Update()
		output = obj.Val
	}
}

What did you expect to see?

I expected neither benchmark to perform any allocations.

What did you see instead?

BenchmarkAllocs allocated obj onto the heap, but BenchmarkNoAllocs does not:

$ go test -test.bench=. -test.run xx .
goos: windows
goarch: amd64
pkg: github.com/vkngwrapper/math
cpu: AMD Ryzen 7 5800X 8-Core Processor
BenchmarkNoAllocs-16                            1000000000               0.2160 ns/op          0 B/op          0 allocs/op
BenchmarkAllocs-16                              143329258                8.381 ns/op           8 B/op          1 allocs/op
PASS

However, it does not do this in 1.18.9:

$ go1.18.9 test -test.bench=. -test.run xx .
goos: windows
goarch: amd64
pkg: github.com/vkngwrapper/math
cpu: AMD Ryzen 7 5800X 8-Core Processor
BenchmarkNoAllocs-16                            1000000000               0.4301 ns/op          0 B/op          0 allocs/op
BenchmarkAllocs-16                              1000000000               0.4271 ns/op          0 B/op          0 allocs/op
PASS
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Dec 21, 2022
@dr2chase
Copy link
Contributor

I just did the obvious experiment, and it is not unified IR. Still slow:

GOEXPERIMENT=nounified go test foo_test.go -bench Bench
goos: darwin
goarch: amd64
cpu: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
BenchmarkNoAllocs-8   	1000000000	         0.3096 ns/op	       0 B/op	       0 allocs/op
BenchmarkAllocs-8     	75144219	        13.55 ns/op	       8 B/op	       1 allocs/op
PASS

@mrjrieke
Copy link

Good find.

@dr2chase dr2chase self-assigned this Dec 21, 2022
@dr2chase
Copy link
Contributor

It looks like a bug in escape analysis.

GOEXPERIMENT=nounified GOSSAFUNC=BenchmarkAllocs go test -c -o foo_19 -gcflags -m=2 foo_test.go
# command-line-arguments [command-line-arguments.test]
./foo_test.go:11:6: can inline (*Object).Initialize with cost 6 as: method(*Object) func() *Object { o.Val = 5; return o }
./foo_test.go:16:6: can inline (*Object).Update with cost 9 as: method(*Object) func() *Object { o.Val = o.Val + 1; return o }
./foo_test.go:21:6: can inline BenchmarkNoAllocs with cost 51 as: func(*testing.B) { (*testing.B).ReportAllocs(b); for loop }
./foo_test.go:22:16: inlining call to testing.(*B).ReportAllocs
./foo_test.go:26:17: inlining call to (*Object).Initialize
./foo_test.go:27:13: inlining call to (*Object).Update
./foo_test.go:32:6: can inline BenchmarkAllocs with cost 52 as: func(*testing.B) { (*testing.B).ReportAllocs(b); for loop }
./foo_test.go:33:16: inlining call to testing.(*B).ReportAllocs
./foo_test.go:37:17: inlining call to (*Object).Initialize
./foo_test.go:37:26: inlining call to (*Object).Update
./foo_test.go:11:7: parameter o leaks to ~r0 with derefs=0:
./foo_test.go:11:7:   flow: ~r0 = o:
./foo_test.go:11:7:     from return o (return) at ./foo_test.go:13:2
./foo_test.go:11:7: leaking param: o to result ~r0 level=0
./foo_test.go:16:7: parameter o leaks to ~r0 with derefs=0:
./foo_test.go:16:7:   flow: ~r0 = o:
./foo_test.go:16:7:     from return o (return) at ./foo_test.go:18:2
./foo_test.go:16:7: leaking param: o to result ~r0 level=0
./foo_test.go:21:24: b does not escape
./foo_test.go:36:7: obj escapes to heap:
./foo_test.go:36:7:   flow: o = &obj:
./foo_test.go:36:7:     from obj (address-of) at ./foo_test.go:37:6
./foo_test.go:36:7:     from o := obj (assign-pair) at ./foo_test.go:37:17
./foo_test.go:36:7:   flow: ~R0 = o:
./foo_test.go:36:7:     from ~R0 = o (assign-pair) at ./foo_test.go:37:17
./foo_test.go:36:7:   flow: .autotmp_3 = ~R0:
./foo_test.go:36:7:     from .autotmp_3 := ~R0 (assign) at ./foo_test.go:37:26
./foo_test.go:32:22: b does not escape
./foo_test.go:36:7: moved to heap: obj

@CannibalVox
Copy link
Author

It's possible this was already fixed? A redditor cannot repro in latest 1.20-

go version devel go1.20-fadd77c05b Wed Dec 21 20:18:10 2022 +0000 linux/amd64

BenchmarkNoAllocs-8  1000000000  0.6611 ns/op  0 B/op  0 allocs/op
BenchmarkAllocs-8    1000000000  0.6606 ns/op  0 B/op  0 allocs/op

It occurs to me that when I was asked about "latest release" in the PR template, they might have meant this version and not 1.19.4- my apologies.

@dr2chase dr2chase added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Dec 22, 2022
@cherrymui
Copy link
Member

I just did the obvious experiment, and it is not unified IR

Quite the opposite. Unified IR actually fixes it. At tip, unified IR doesn't allocate, non-unified allocates. The IR between the two looks slightly different. Non-unified generates an autotmp for the result of the first call, which is the autotmp_3 you saw in the escape analysis diagnostics. The 1.18 AST doesn't have that autotmp (more like the unified IR version). The 1.19 one does.

The escape analysis definitely looks weird. flow: .autotmp_3 = ~R0 assigning to autotmp shouldn't cause it to escape, and it doesn't print any more lines. Not sure why it thinks obj escapes to heap...

@dr2chase
Copy link
Contributor

.autotmp_3 is to carry the result of one inlined call to another (visible with -m=5). I'm inclined to poke at this a bit more to see why it went wrong, it really should not have.

./foo_test.go:36:7:[2] BenchmarkAllocs stmt: var obj Object
./foo_test.go:36:7:[2] BenchmarkAllocs stmt: obj = <nil>
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: ~R0
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: .autotmp_3 := ~R0
./foo_test.go:37:17:[2] BenchmarkAllocs stmt: io := obj
./foo_test.go:37:17:[2] BenchmarkAllocs stmt: var io *Object
./foo_test.go:37:17:[2] BenchmarkAllocs stmt: <node inlmark>
./foo_test.go:37:17:[2] BenchmarkAllocs stmt: io.Val = 5
./foo_test.go:37:17:[2] BenchmarkAllocs stmt: ~R0 = io; goto .i1
./foo_test.go:37:17:[2] BenchmarkAllocs stmt: ~R0 = io
./foo_test.go:37:17:[2] BenchmarkAllocs stmt: var ~R0 *Object
./foo_test.go:37:17:[2] BenchmarkAllocs stmt: goto .i1
./foo_test.go:37:17:[2] BenchmarkAllocs stmt: .i1: 
./foo_test.go:37:17:.i1:  non-looping label
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: uo := .autotmp_3
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: var uo *Object
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: <node inlmark>
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: uo.Val = uo.Val + 1
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: ~R0 = uo; goto .i2
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: ~R0 = uo
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: var ~R0 *Object
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: goto .i2
./foo_test.go:37:26:[2] BenchmarkAllocs stmt: .i2: 
./foo_test.go:37:26:.i2:  non-looping label
./foo_test.go:38:10:[2] BenchmarkAllocs stmt: output = obj.Val
./foo_test.go:36:7: obj escapes to heap:
./foo_test.go:36:7:   flow: io = &obj:
./foo_test.go:36:7:     from obj (address-of) at ./foo_test.go:37:6
./foo_test.go:36:7:     from io := obj (assign-pair) at ./foo_test.go:37:17
./foo_test.go:36:7:   flow: ~R0 = io:
./foo_test.go:36:7:     from ~R0 = io (assign-pair) at ./foo_test.go:37:17
./foo_test.go:36:7:   flow: .autotmp_3 = ~R0:
./foo_test.go:36:7:     from .autotmp_3 := ~R0 (assign) at ./foo_test.go:37:26
./foo_test.go:32:22: b does not escape
./foo_test.go:36:7: moved to heap: obj

@cuonglm
Copy link
Member

cuonglm commented Dec 22, 2022

This is introduced in https://go-review.googlesource.com/c/go/+/392834, and Unified IR also has this problem until https://go-review.googlesource.com/c/go/+/421821.

We need to ensure non-Name call in subexpression call is also rewritten.

@gopherbot
Copy link

Change https://go.dev/cl/459295 mentions this issue: cmd/compile: ensure non-Name call in subexpression call is rewritten

@cuonglm
Copy link
Member

cuonglm commented Dec 22, 2022

@dr2chase @cherrymui Is performance regression legal for backport?

@cherrymui
Copy link
Member

What do you mean by a "non-Name call"? A call to a func value that is not a named function? In this case it doesn't have a func value. Or the method selection expression is one? That would be usually considered as a named function call, though. I'm not sure I understand how this connects with escape analysis.

@cherrymui
Copy link
Member

I think a significant performance regression would qualify a backport. But I'm not sure this one is significant. It would need to see an impact on real code, instead of synthetic benchmarks.

@cuonglm
Copy link
Member

cuonglm commented Dec 22, 2022

What do you mean by a "non-Name call"? A call to a func value that is not a named function? In this case it doesn't have a func value. Or the method selection expression is one? That would be usually considered as a named function call, though. I'm not sure I understand how this connects with escape analysis.

"non-Name call" is a function call where the call expression is not a ONAME node. In (*Object).Initialize(obj).Update(), after peeling a way the ODOTMETH Update, we have (*Object).Initialize(obj) which is a call. RewriteNonNameCall see that a call and attempt to assign it to a temp.

I think if the call have no side-effect argument, then we should not apply RewriteNonNameCall. The purpose of that function is making sure the oder of evaluation in f()(g()), in case of f()(), we can skip the rewriting.

I think a significant performance regression would qualify a backport. But I'm not sure this one is significant. It would need to see an impact on real code, instead of synthetic benchmarks.

I think this affects https://groups.google.com/g/golang-dev/c/8blU7CnuhZ4/m/d21xuN8hAQAJ?utm_medium=email&utm_source=footer

@cherrymui
Copy link
Member

cherrymui commented Dec 22, 2022

where the call expression is not a ONAME node

Sorry, I must still have missed something. When would a call expression be an ONAME node? I'd think a call expression would be an OCALL or OCALLXXX node. Or you mean the function being called? (*Object).Initialize(obj) is not returning a func value, though, but a *Object.

Back off a level. The AST is not malformed. But the escape analysis behave weirdly, at least it looks like so. I'd think we want to understand why the escape analysis doesn't work as expected. Is it fundamentally not able to handle the current AST? (Then changing the AST would probably be the right answer.) Or it is something else? Thanks.

I think this affects https://groups.google.com/g/golang-dev/c/8blU7CnuhZ4/m/d21xuN8hAQAJ?utm_medium=email&utm_source=footer

That looks like a regression from 1.19 to 1.20, though. We may want to address it before 1.20 release. But no backport involved.

@cuonglm
Copy link
Member

cuonglm commented Dec 22, 2022

Sorry, I must still have missed something. When would a call expression be an ONAME node? I'd think a call expression would be an OCALL or OCALLXXX node. Or you mean the function being called?

Sorry, I mean the callee expression. In f()(), the callee expression is f(), which is a non-Name node.

(*Object).Initialize(obj) is not returning a func value, though, but a *Object.

Yeah, I am talking at AST level. (*Object).Initialize(obj) is the callee expression of the call expression (*Object).Initialize(obj).Update()

Back off a level. The AST is not malformed. But the escape analysis behave weirdly, at least it looks like so. I'd think we want to understand why the escape analysis doesn't work as expected. Is it fundamentally not able to handle the current AST? (Then changing the AST would probably be the right answer.) Or it is something else?

I think it's reasonable for escape analysis to behave like that, because we now introduce new temp var for (*Object).Initialize(obj) call. Before CL 392834, escape analysis would see (*Object).Initialize(obj).Update() call. But now it would see:

tmp := (*Object).Initialize(obj)
tmp.Update()

We have a special case for ONEW inside RewriteNonNameCall, which will help a better escape analysis if we not rewriting at all.

@cherrymui
Copy link
Member

(*Object).Initialize(obj) is the callee expression of the call expression (*Object).Initialize(obj).Update()

Sorry, I must still have missed something... If by "callee expression" you mean the function being called, in this case it would be (*Object).Initialize(obj).Update, instead of (*Object).Initialize(obj).

I think it's reasonable for escape analysis to behave like that

I don't think so, sorry. In fact, if I rewrite the source code of BenchmarkAllocs to

                tmp := obj.Initialize()
                tmp.Update()

it actually does not allocate. Also not allocate with

                tmp := (*Object).Initialize(&obj)
                tmp.Update()

@cuonglm
Copy link
Member

cuonglm commented Dec 23, 2022

Sorry, I must still have missed something... If by "callee expression" you mean the function being called, in this case it would be (*Object).Initialize(obj).Update, instead of (*Object).Initialize(obj).

It is! In case of ODOTMETH, ODOTINTER, OMETHVALUE, we peel away the selector, so we have (*Object).Initialize(obj). That being said, for x.g(), it's a Name call, because after peeling away selector in callee expression, we have x left. But for x.f().g(), we have x.f() left, which is a non-Name call.

I don't think so, sorry. In fact, if I rewrite the source code of BenchmarkAllocs to

Hmm, sounds like there's actual mismatch between emitting autotmp vs hand written one. In case of hand written one, there's a ODCL node for tmp at the top of for body. Why in case of emitting autotmp, we haven't:

tmp := Temp((*np).Type())
as := ir.NewAssignStmt(base.Pos, tmp, *np)
as.Def = true

So you're right that fixing the AST (emitting the ODCL for autotmp) will fix the problem.

@dr2chase
Copy link
Contributor

I still think this is incorrect behavior by escape analysis, but it is taking me a while to figure out exactly where it goes wrong. I've been stepping through the compiler in a debugger since yesterday (as time/focused-attention permit). But thanks for all the investigation, this will tell me where to be especially attentive.

Also, everything gets inlined. This is in code that no longer contains real calls, though there are INLCALLs. Here's the pruned test case:

package foo

var output int

type Object struct {
	Val int
}

type B struct {
	N int
}

func (io *Object) Initialize() *Object {
	io.Val = 5
	return io
}

func (uo *Object) Update() *Object {
	uo.Val = uo.Val + 1
	return uo
}

func BenchmarkAllocs(b *B) {
	for i := 0; i < b.N; i++ {
		var obj Object
		obj.Initialize().Update()
		output = obj.Val
	}
}

@cuonglm
Copy link
Member

cuonglm commented Dec 23, 2022

Also, everything gets inlined. This is in code that no longer contains real calls,

Yeah, but all discussion above happens during old typecheck pass, so before any inlining can happen.

@dr2chase
Copy link
Contributor

The escape analysis runs on inlined code. That's where the decision is made wrong, in my opinion. If this AST is actually malformed, we need some way of rejecting it. Are autotmps required to have ODCL nodes, or would it just solve this problem for us? I just now went looking at uses of typecheck.Temp and I don't see them getting declared; rather than making all those changes (about 50), I'd like to see if this is an easy change in escape analysis.

@cuonglm
Copy link
Member

cuonglm commented Dec 23, 2022

Are autotmps required to have ODCL nodes, or would it just solve this problem for us? I just now went looking at uses of typecheck.Temp and I don't see them getting declared; rather than making all those changes (about 50), I'd like to see if this is an easy change in escape analysis.

No, autotmps do not require to ave ODCL node, but emitting ODCL will help escape analysis record the loop depth correctly.

rather than making all those changes (about 50), I'd like to see if this is an easy change in escape analysis.

What do you mean "50"? Is this the diff in CL 459295 or something else? If you refer to the diff, you may want to look at latest patchset, which is just one line change.

@dr2chase
Copy link
Contributor

The problem is loopdepth, but I think this is a bug in escape analysis -- anytime there is an autotmp used in a loop, its loopdepth will be wrong unless it is declared, and so far, none of them are declared.

"50" refers to places in the compiler source code where we create temporaries; any of these with a pointer type is conceivably a source of such an escape analysis hiccup. And, also, adding all these declarations for temporaries where none existed before might cause other problems. I think the heuristic to try is that the loop level of an autotmp is the minimum level at which it is assigned; that is where it would have been declared.

@cuonglm
Copy link
Member

cuonglm commented Dec 23, 2022

And, also, adding all these declarations for temporaries where none existed before might cause other problems.

I'm AFK, but there's a place in compiler where we emit ODCL for autotmp, I cant find it offhand, though.

@cuonglm
Copy link
Member

cuonglm commented Dec 24, 2022

I think the heuristic to try is that the loop level of an autotmp is the minimum level at which it is assigned; that is where it would have been declared.

Hmm, I think it can apply not just for autotmp, but any local ONAME that appear in the LHS of the assignment. If as.Def is true, we should record the loopdepth.

@seankhliao seankhliao changed the title cmd/compile: Fluent interfaces doing new allocations in go 1.19 cmd/compile: fluent interfaces doing new allocations in go 1.19 Dec 24, 2022
@dr2chase
Copy link
Contributor

dr2chase commented Dec 25, 2022

ONAMEs can have assignments at any depth. It's important to record the least-nested visibility because that is a form of escape (many-to-one). For autotmps, we do seem to declare them, and we do seem to declare them at varying depth (i.e., I logged this stuff complete with stack traces) and it "seems" to get them right for 1.20. I think for 1.19, this is just a minor performance regression and we live with it. Arguably, as the keeper of many benchmarks, I should add this microbenchmark, since it would be nice not to regress again.

@cherrymui
Copy link
Member

Sorry for the late reply. I agree with @dr2chase that targeting the escape analysis is the right thing to do here. And the heuristic of using the minimum level where an autotmp is assigned is reasonable to me (unless it is otherwise declared).

@cuonglm
Copy link
Member

cuonglm commented Dec 28, 2022

Sorry for the late reply. I agree with @dr2chase that targeting the escape analysis is the right thing to do here. And the heuristic of using the minimum level where an autotmp is assigned is reasonable to me (unless it is otherwise declared).

Agree, but if we're going to backport, then emitting ODCL seems safer. Otherwise, we can fix escape analysis in 1.21 cycle.

@gopherbot
Copy link

Change https://go.dev/cl/459496 mentions this issue: cmd/compile: record loopdepth for autotmp if declared in assignment

@cherrymui
Copy link
Member

Besides the escape analysis change, I think the escape analysis diagnostics also needs some improvement. If it is the assignment .autotmp_3 = ~R0 causing the escape due to loop depth, it would be good to mention loop depth. Otherwise it is very confusing why this assignment causes problem.

@mknyszek mknyszek added this to the Go1.21 milestone Dec 28, 2022
@cuonglm
Copy link
Member

cuonglm commented Dec 29, 2022

Besides the escape analysis change, I think the escape analysis diagnostics also needs some improvement. If it is the assignment .autotmp_3 = ~R0 causing the escape due to loop depth, it would be good to mention loop depth. Otherwise it is very confusing why this assignment causes problem.

With -m=3, loop depth is included in diagnostics:

p.go:26:26:[2] BenchmarkAllocs stmt: .autotmp_3 := ~R0

[2] -> loop depth is 2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
Status: All-But-Submitted
Development

No branches or pull requests

7 participants