
cmd/compile: go statement desugaring results in additional allocations #46349

Closed
qmuntal opened this issue May 24, 2021 · 25 comments

@qmuntal (Contributor) commented May 24, 2021

What version of Go are you using (go version)?

go version devel go1.17-cca23a7 Sat May 22 00:51:17 2021 +0000 windows/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
set GO111MODULE=
set GOARCH=amd64
set GOBIN=
set GOCACHE=C:\Users\***\AppData\Local\go-build
set GOENV=C:\Users\***\AppData\Roaming\go\env
set GOEXE=.exe
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOINSECURE=
set GOMODCACHE=C:\Users\***\go\pkg\mod
set GONOPROXY=*.hp.com
set GONOSUMDB=*.hp.com
set GOOS=windows
set GOPATH=C:\Users\***\go
set GOPRIVATE=*.hp.com
set GOPROXY=direct
set GOROOT=C:\Users\***\sdk\gotip
set GOSUMDB=off
set GOTMPDIR=
set GOTOOLDIR=C:\Users\***\sdk\gotip\pkg\tool\windows_amd64
set GOVCS=
set GOVERSION=devel go1.17-cca23a7 Sat May 22 00:51:17 2021 +0000
set GCCGO=gccgo
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=1
set GOMOD=C:\Users\***\Documents\code\go3mf-thumbnailer\go.mod
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0 -fdebug-prefix-map=C:\Users\***\AppData\Local\Temp\go-build497339627=/tmp/go-build -gno-record-gcc-switches

What did you do?

package main

import (
  "sync"
  "testing"
)

func BenchmarkCreate(b *testing.B) {
  count := 10
  for n := 0; n < b.N; n++ {
    var wg sync.WaitGroup
    wg.Add(count)
    for i := 0; i < count; i++ {
      go func() {
        wg.Done()
      }()
    }
    wg.Wait()
  }
}

gotip test -run AAAA -bench BenchmarkCreate -benchmem

What did you expect to see?

Same result as with go1.16.3 and gotip+GOEXPERIMENT=noregabi:

goos: windows
goarch: amd64
pkg: main
cpu: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
BenchmarkCreate-8         211135              6922 ns/op              16 B/op          1 allocs/op
PASS
ok      main      2.048s

What did you see instead?

The allocation count grows linearly with the size of the loop:

goos: windows
goarch: amd64
pkg: main
cpu: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
BenchmarkCreate-8         183998              7023 ns/op             176 B/op         11 allocs/op
PASS
ok      main            2.575s

Notice that the number of allocations is directly proportional to the number of iterations inside the loop.

Context

I found this regression while trying gotip on a private codebase. The provided benchmark is just a minimal synthetic example that reproduces what I'm seeing in the real code.

This is a memory profile snapshot for this benchmark, generated using go tool pprof -alloc_objects mem.out:

         .          .      6:)
         .          .      7:
         .          .      8:func BenchmarkCreate(b *testing.B) {
         .          .      9:   count := 10
         .          .     10:   for n := 0; n < b.N; n++ {
    131074     131074     11:           var wg sync.WaitGroup
         .          .     12:           wg.Add(count)
         .          .     13:           for i := 0; i < count; i++ {
   1507351    1507351     14:                   go func() {
         .          .     15:                           wg.Done()
         .          .     16:                   }()
         .          .     17:           }
         .          .     18:           wg.Wait()
@mknyszek (Contributor) commented May 24, 2021

Thanks for the report.

I believe this is due to a known change in Go 1.17 related to go and defer 'desugaring,' that may result in more allocations in some cases. You mention that you saw this in a larger codebase, but could you tell us more about what you saw there? Specifically, is there a significant performance regression in that case?

CC @thanm @cherrymui

@mknyszek mknyszek added this to the Go1.17 milestone May 24, 2021

@mknyszek (Contributor) commented May 24, 2021

Not marking this a release blocker until we have more information, and also because to some degree this is expected; for the most part we've found the benefits to outweigh the costs by a significant margin.

@mknyszek mknyszek changed the title allocation regression when using regabi in tip cmd/compile: go statement desugaring results in additional allocations May 24, 2021

@qmuntal (Contributor, Author) commented May 24, 2021

I believe this is due to a known change in Go 1.17 related to go and defer 'desugaring,' that may result in more allocations in some cases. You mention that you saw this in a larger codebase, but could you tell us more about what you saw there? Specifically, is there a significant performance regression in that case?

I'm seeing a huge increase in allocations in multiple parts of the codebase that are called from within a loop an unbounded number of times. In some realistic, non-worst-case benchmarks I'm seeing a 1000% allocation increase.

The codebase that I'm benchmarking parses 3D software-generated files, such as point clouds or 3D meshes, which can have millions of 3D points. Until now the parsing algorithm performed O(1) allocations with respect to the number of vectors, but with gotip it is O(N).

On the other hand, I'm also seeing a 12% speed gain in most of the benchmarks, which is much appreciated when processing big files. Still, I'm not sure this speed gain is realistic, as there are more allocations, so I expect the GC to eat into part of this improvement.

@mknyszek (Contributor) commented May 24, 2021

@qmuntal I think we need more details as to the nature of these heap allocations. Do they all involve a go statement? Or do they look like something else (like defer statements)?

You can use the go build flag -gcflags=-m=2 to get inlining and escape analysis output at build time, to help you understand what the new sources of allocation are.

@qmuntal (Contributor, Author) commented May 24, 2021

@mknyszek Not all the offending lines use go or defer. Here are some examples:

val, err := strconv.ParseUint(string(a.Value), 10, 24) // a.Value is a byte slice
if d.OnStart != nil {
  // t is a complex struct holding a byte slice reference
  d.OnStart(t) // <- Offending!
}
wn := runtime.GOMAXPROCS(0)
var wg sync.WaitGroup
wg.Add(wn)
for wi := 0; wi < wn; wi++ {
  go func(wi int) { // <- Offending!
    // ...
  }(wi)
}
// This comes from https://tip.golang.org/src/mime/mediatype.go?s=3622:3707#152
params = make(map[string]string)

I will do the inlining and escape analysis tomorrow morning (CET); I'm running out of time now.

Thanks for jumping into this.

@mknyszek (Contributor) commented May 24, 2021

CC @mdempsky @dr2chase for escape analysis and even more compiler regabi expertise.

@cherrymui (Contributor) commented May 24, 2021

Yes, with the register ABI more go'd closures need allocation now. But I'm not sure whether it is really a problem for performance. We already need to allocate the new goroutine's stack (which is not counted as a memory allocation in the profile, but it's there). Now we need an extra allocation for the closure (which is typically a much smaller allocation). As you posted, the overall performance is improved.

We could do something clever, e.g. allocating the closure on the new goroutine's stack, but there is complexity in that. I'm not sure it's worth doing; we could consider it if it turns out to be a real problem.

@cherrymui (Contributor) commented May 24, 2021

The escape analysis isn't really relevant here. A go statement runs asynchronously, so in general we always need to allocate its closure, except when it is not a real closure, i.e. it doesn't capture anything. With the register ABI, more closures become "real closures", in that we need to capture arguments as well.

@mknyszek (Contributor) commented May 24, 2021

Right, it's not really relevant to the desugaring; I looped in the escape analysis folks in reaction to @qmuntal's claim that lots of other non-go, non-defer sites now end up allocating. But it's still unclear to me where the new allocations are coming from.

@qmuntal (Contributor, Author) commented May 24, 2021

My codebase is memory intensive and I've always tried to reduce allocations in order to gain performance indirectly. In this case, I haven't seen any performance regression, just many more allocations everywhere, so if you say this is the expected behavior when migrating to regabi then it's OK with me. If in the future I see a real performance issue here I can report back with more data.

I'll let you decide what to do with this issue; I can either close it or provide more data to better understand where the new allocations are coming from.

@mknyszek (Contributor) commented May 24, 2021

In that case, I'll at least move this issue out of the Go 1.17 milestone.

@mknyszek mknyszek removed this from the Go1.17 milestone May 24, 2021
@mknyszek mknyszek added this to the Backlog milestone May 24, 2021

@qmuntal (Contributor, Author) commented May 24, 2021

Yes, with the register ABI more go'd closures need allocation now. But I'm not sure whether it is really a problem for performance. We already need to allocate the new goroutine's stack (which is not counted as a memory allocation in the profile, but it's there). Now we need an extra allocation for the closure (which is typically a much smaller allocation)

@cherrymui would it make sense to also hide the allocations for that closure from the profile? I tend to hunt down allocations for the sake of performance and from go1.17 the memory profiles will have much more noise due to these additional small allocations that presumably don't have a big performance impact and may be unavoidable.

For example, in my codebase the following line allocates in go1.17 and does not allocate in go1.16:

val, err := strconv.ParseUint(string(a.Value), 10, 24) // allocation!

In go1.16 I would have been worried because the allocation would probably be a.Value escaping to the heap, which is bad, but in go1.17 this allocation is inevitable and irrelevant because it is related to the... register abi?

@cherrymui (Contributor) commented May 24, 2021

would it make sense to also hide the allocations for that closure from the profile?

I think we probably don't want to do that. Under the hood it is just a regular Go allocation, and go realClosure() already allocated in the past and was counted in the profile.

In general, while the memory profile counts all allocations, not all of them have the same impact on performance (CPU and memory usage).

val, err := strconv.ParseUint(string(a.Value), 10, 24) // allocation!

In go1.16 I would have been worried because the allocation would probably be a.Value escaping to the heap, which is bad, but in go1.17 this allocation is inevitable and irrelevant because it is related to the... register abi?

At a quick look it is not obvious to me that this is register ABI related. (The only change expected due to the register ABI is for go statements, and perhaps defer statements inside a loop, which are rare.) Maybe there is a real issue here. I could take a careful look.

@cherrymui (Contributor) commented May 24, 2021

val, err := strconv.ParseUint(string(a.Value), 10, 24) // allocation!

Could you explain more about this? Could you show the output of -gcflags=-m on this line, for both Go 1.16 and tip?

For a simple example I ran, there is an allocation for the []byte-to-string conversion, and it is the same with Go 1.16 and tip. The underlying a.Value does not necessarily escape, also the same with Go 1.16 and tip.

@zikaeroh (Contributor) commented May 25, 2021

I also saw a change in allocations (some 10 extra) in my benchmarks stemming from regabidefer when it was stabilized a while ago, but I didn't report it because I couldn't figure out how to determine where the allocations were coming from with a pprof diff, and performance was still largely better with regabi enabled (so probably not too impactful in my case). I tried removing code and found it a little confusing; it seemed to be coming from deferring an opencensus trace span end call, but I couldn't reproduce it outside my application, so I gave up.

-gcflags=-m for me is identical between regabi and noregabi. -m=2 is (regabi left, noregabi right):

235,281c235,247
< <autogenerated>:1: parameter .this leaks to {heap} with derefs=1:
< <autogenerated>:1:   flow: {heap} = *.this:
< <autogenerated>:1:     from .this.file (dot of pointer) at <autogenerated>:1
< <autogenerated>:1:     from .this.file.close() (call parameter) at <autogenerated>:1
< <autogenerated>:1: leaking param content: .this
< <autogenerated>:1: .this does not escape
< <autogenerated>:1: parameter .this leaks to {heap} with derefs=1:
< <autogenerated>:1:   flow: {heap} = *.this:
< <autogenerated>:1:     from .this.Common (dot of pointer) at <autogenerated>:1
< <autogenerated>:1:     from .this.Common (address-of) at <autogenerated>:1
< <autogenerated>:1:     from .this.Common.RunDefaultServeMux() (call parameter) at <autogenerated>:1
< <autogenerated>:1: leaking param content: .this
< <autogenerated>:1: .this does not escape
< <autogenerated>:1: parameter .this leaks to {heap} with derefs=1:
< <autogenerated>:1:   flow: {heap} = *.this:
< <autogenerated>:1:     from .this.Common (dot of pointer) at <autogenerated>:1
< <autogenerated>:1:     from .this.Common (address-of) at <autogenerated>:1
< <autogenerated>:1:     from .this.Common.RunDefaultServeMux() (call parameter) at <autogenerated>:1
< <autogenerated>:1: leaking param content: .this
< <autogenerated>:1: .this does not escape
< <autogenerated>:1: parameter .this leaks to {heap} with derefs=1:
< <autogenerated>:1:   flow: {heap} = *.this:
< <autogenerated>:1:     from .this.Common (dot of pointer) at <autogenerated>:1
< <autogenerated>:1:     from .this.Common (address-of) at <autogenerated>:1
< <autogenerated>:1:     from .this.Common.RunDefaultServeMux() (call parameter) at <autogenerated>:1
< <autogenerated>:1: leaking param content: .this
< <autogenerated>:1: .this does not escape
< <autogenerated>:1: parameter .this leaks to {heap} with derefs=1:
< <autogenerated>:1:   flow: {heap} = *.this:
< <autogenerated>:1:     from .this.Common (dot of pointer) at <autogenerated>:1
< <autogenerated>:1:     from .this.Common (address-of) at <autogenerated>:1
< <autogenerated>:1:     from .this.Common.RunDefaultServeMux() (call parameter) at <autogenerated>:1
< <autogenerated>:1: leaking param content: .this
< <autogenerated>:1: .this does not escape
< <autogenerated>:1: parameter .this leaks to {heap} with derefs=1:
< <autogenerated>:1:   flow: {heap} = *.this:
< <autogenerated>:1:     from .this.Common (dot of pointer) at <autogenerated>:1
< <autogenerated>:1:     from .this.Common (address-of) at <autogenerated>:1
< <autogenerated>:1:     from .this.Common.RunDefaultServeMux() (call parameter) at <autogenerated>:1
< <autogenerated>:1: leaking param content: .this
< <autogenerated>:1: .this does not escape
< <autogenerated>:1: parameter .this leaks to {heap} with derefs=1:
< <autogenerated>:1:   flow: {heap} = *.this:
< <autogenerated>:1:     from .this.Common (dot of pointer) at <autogenerated>:1
< <autogenerated>:1:     from .this.Common (address-of) at <autogenerated>:1
< <autogenerated>:1:     from .this.Common.RunDefaultServeMux() (call parameter) at <autogenerated>:1
< <autogenerated>:1: leaking param content: .this
---
> <autogenerated>:1: .this does not escape
> <autogenerated>:1: .this does not escape
> <autogenerated>:1: .this does not escape
> <autogenerated>:1: .this does not escape
> <autogenerated>:1: .this does not escape
> <autogenerated>:1: .this does not escape
> <autogenerated>:1: .this does not escape
> <autogenerated>:1: .this does not escape
> <autogenerated>:1: .this does not escape
> <autogenerated>:1: .this does not escape
> <autogenerated>:1: .this does not escape
> <autogenerated>:1: .this does not escape
> <autogenerated>:1: .this does not escape

But, maybe there's something else I can inspect.

If it's of interest and I'm not sidetracking things, I'm happy to run whatever is needed. I was a little surprised to see more allocations when regabi was enabled.

@qmuntal (Contributor, Author) commented May 25, 2021

Could you explain more about this? Could you show the output of -gcflags=-m on this line, for both Go 1.16 and tip?

For a simple example I run, there is an allocation for the []byte to string conversion, and it is the same with Go 1.16 and tip. The underlying a.Value does not necessarily escape, also same with Go 1.16 and tip.

I was wrong to say that this line did not allocate in Go 1.16 but did in Go 1.17: in both versions the line allocates and string(a.Value) escapes to the heap. Yet there are more allocations on this line than in Go 1.16.

This is the output of -gcflags=-m for both versions: string(a.Value) escapes to heap

I found another strange thing that could be related to this issue. When comparing inuse_objects with regabi disabled and enabled, the former has 5461 objects allocated and the latter 70619. Almost all the new allocations come from runtime.doaddtimer, which is indirectly triggered by runtime.bgscavenge.

This is an SVG dump from go tool pprof:

inuse_objects.zip

@cherrymui (Contributor) commented May 25, 2021

@zikaeroh Thanks for the information. If I'm reading it correctly, the difference in the -m=2 output above should not make a difference in allocations, at least by itself.

Maybe the extra allocations you saw came from go'd (or deferred in a loop) closures, which is expected.

@cherrymui (Contributor) commented May 25, 2021

@qmuntal

Yet there are more allocations in this line than in go 1.16.

On that line? Could you post the assembly output for that line, using -gcflags=-S, for 1.16 and tip? And maybe provide more context about that line of code?

Almost all the new allocations come from runtime.doaddtimer which is indirectly triggered by runtime.bgscavenge.

I'd be surprised if this were register ABI related. Are you comparing 1.16 vs. tip, or register ABI off (GOEXPERIMENT=none) vs. on? Thanks.

@qmuntal (Contributor, Author) commented May 25, 2021

I'd be surprised if this were register ABI related. Are you comparing 1.16 vs. tip, or register ABI off (GOEXPERIMENT=none) vs. on? Thanks.

I'm comparing register ABI off (GOEXPERIMENT=noregabi) vs on, both in tip.

@qmuntal (Contributor, Author) commented May 25, 2021

On that line? Could you post the assembly output for that line, using -gcflags=-S, for 1.16 and tip? And maybe provide more context about that line of code?

This line of code is in a public repo, so you can check it yourself: https://github.com/HPInc/go3mf/blob/master/decoder.go#L462

These are the assembly dumps:

Go tip

decoder.go:462)      MOVQ    "".a+248(SP), R8
decoder.go:462)      MOVQ    "".a+256(SP), CX
decoder.go:462)      XORL    AX, AX
decoder.go:462)      MOVQ    R8, BX
decoder.go:462)      PCDATA  $1, $4
decoder.go:462)      CALL    runtime.slicebytetostring(SB)
decoder.go:462)      MOVL    $10, CX
decoder.go:462)      MOVL    $24, DI
decoder.go:462)      CALL    strconv.ParseUint(SB)

Go 1.16

decoder.go:462)      MOVQ    "".a+256(SP), AX
decoder.go:462)      MOVQ    "".a+264(SP), DI
decoder.go:462)      MOVQ    $0, (SP)
decoder.go:462)      MOVQ    AX, 8(SP)
decoder.go:462)      MOVQ    DI, 16(SP)
decoder.go:462)      PCDATA  $1, $1
decoder.go:462)      CALL    runtime.slicebytetostring(SB)
decoder.go:462)      MOVQ    24(SP), AX
decoder.go:462)      MOVQ    32(SP), CX
decoder.go:462)      MOVQ    AX, (SP)
decoder.go:462)      MOVQ    CX, 8(SP)
decoder.go:462)      MOVQ    $10, 16(SP)
decoder.go:462)      MOVQ    $24, 24(SP)
decoder.go:462)      CALL    strconv.ParseUint(SB)
decoder.go:462)      MOVQ    32(SP), AX

Edited: Headers swapped.

@thanm (Contributor) commented May 25, 2021

That second excerpt (labeled "Go tip") does not look like Go tip to me-- it looks like Go tip plus GOEXPERIMENT=noregabi. All of the values being passed to subroutines are on the stack.

@qmuntal (Contributor, Author) commented May 25, 2021

That second excerpt (labeled "Go tip") does not look like Go tip to me-- it looks like Go tip plus GOEXPERIMENT=noregabi. All of the values being passed to subroutines are on the stack.

You are partially right, I swapped the headers... I will edit the post to fix it. Thanks!

@cherrymui (Contributor) commented May 26, 2021

@qmuntal thanks for posting the assembly. From an allocation perspective they look the same; it is unclear to me where the extra allocation comes from.

@networkimprov commented May 27, 2021

@gopherbot remove WaitingForInfo

@gopherbot commented Jul 4, 2021

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

@gopherbot gopherbot closed this Jul 4, 2021