Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: under what circumstances does 1.18 generate significantly larger binaries than 1.17 when compiling the same source code? #52270

Open
adriancable opened this issue Apr 11, 2022 · 11 comments
Labels
NeedsInvestigation

Comments

@adriancable
Copy link

@adriancable adriancable commented Apr 11, 2022

What version of Go are you using (go version)?

1.17 vs 1.18

What operating system and processor architecture are you using (go env)?

linux/arm

Details

Regarding output binary size, the go 1.18 release notes only talk about changes (relative to 1.17) that potentially reduce size of output objects, due to linker optimisations etc. But under what circumstances would we expect 1.18 to produce significantly larger binaries than 1.17 for the same code?

Example: we have a medium-sized project that runs on an embedded linux/arm system. In this environment both performance (we do a lot of crypto stuff) and size are important. Output binary sizes for 1.17 vs 1.18 are as follows:

ldflags | 1.17      | 1.18
--------+-----------+-----------
none    | 9548064   | 10586552
-w -s   | 6684672   | 7340032

Both with no build flags and with -w -s we see a binary size increase of around 10% between 1.17 and 1.18, which is quite significant. (Binaries produced by 1.17 and 1.18 are similarly performant for us, so these extra bytes are 'for nothing'.)

I am not sure whether this is strictly a 'compiler bug' or if there are circumstances where 1.18 simply produces more bloated code than 1.17. If that is true, it would be extremely useful to know what those circumstances are, so we can work around them, not migrate to 1.18 for our build process, etc.

Any light and wisdom on this would be much appreciated!

-Adrian

@adriancable adriancable changed the title cmd/compile: under what circumstances does 1.18 generate significantly larger binaries than 1.17? cmd/compile: under what circumstances does 1.18 generate significantly larger binaries than 1.17 when compiling the same source code? Apr 11, 2022
@randall77
Copy link
Contributor

@randall77 randall77 commented Apr 11, 2022

Possibly it is related to inlining changes, try compiling with -gcflags=-l in both versions and see if that size delta persists.

Otherwise, I'm not aware of any changes that would cause this. Maybe try running nm on both and diffing the set of functions. That will let us know if it is extra functions (maybe something wrong with dead function elimination) or larger individual functions. If the latter, maybe try to find a function that got a lot larger.
Could also be something else, like extra traceback info, but I'd start with those checks.

@adriancable
Copy link
Author

@adriancable adriancable commented Apr 11, 2022

Regarding -gcflags=-l, size is now 9546457 (1.17) vs. 10585621 (1.18), which is a minimal increase (< 0.1%) both for 1.17 and 1.18 relative to the no build flags case. So it isn't to do with inlining in this case.

Regarding dead function elimination, there may be something there:

root@builder:~/x# nm main-1.17 | wc -l
10807
root@builder:~/x# nm main-1.18 | wc -l
11434

So it does look like there are more functions at least. I will put something together diff-ish to dig into the differences and report back.

@adriancable
Copy link
Author

@adriancable adriancable commented Apr 11, 2022

@randall77 - I looked in depth at the diff of the nm. There is something 'suspect' happening with anonymous functions I think.

For example in 1.18 I am seeing things like:

...
crypto/rand.altGetRandom
crypto/rand.batched
crypto/rand.batched.func1
crypto/rand.(*devReader).Read
crypto/rand.(*devReader).Read.func1
crypto/rand.(*devReader).Read.func2
crypto/rand.getRandomBatch
...

whereas in the same place for 1.17 it looks like:

crypto/rand.altGetRandom
crypto/rand.batched
crypto/rand.batched.func1
crypto/rand.(*devReader).Read
crypto/rand.getRandomBatch

My understanding is that .func1, .func2 etc. are anonymous functions, so I am confused why 1.18 should be emitting (a lot) more of these than 1.17 from the same source.

A back of the envelope calculation suggests this is responsible for most of the increase.

Update: comparing sizes of emitted functions in 1.17 vs 1.18, the code for 1.18 is also consistently larger than 1.17, but this is typically on the order of 1-3%. So most of the total size increase in the output binary in 1.18 does indeed seem to be due to the emission of the additional 'mystery' anonymous functions.

Update 2: looking at the source for crypto/rand.(*devReader).Read, which is in go/src/crypto/rand/rand_unix.go, there are no anonymous functions in there at all. So I am even less sure what 1.18 is doing here.

@randall77
Copy link
Contributor

@randall77 randall77 commented Apr 11, 2022

Indeed, both crypto/rand.(*devReader).Read.func1 and crypto/rand.(*devReader).Read.func2 are unreferenced by any code in the binary except themselves.

At tip the names changed to crypto/rand.(*reader).Read.func{1,2}, but the same issue occurs.

@cherryyz @thanm

This program, when compiled with GOOS=linux GOARCH=arm GOARM=7 demonstrates the problem at tip.

package main

import "crypto/rand"

func main() {
	var b [64]byte
	n, err := rand.Read(b[:])
	println(n, err)
}
% GOOS=linux GOARCH=arm GOARM=7 go build ~/gowork/issue52270.go
% nm ./issue52270 | grep reader | grep crypto
0007f2b0 T crypto/rand.(*reader).Read
0007f728 T crypto/rand.(*reader).Read.func1
0007f6d4 T crypto/rand.(*reader).Read.func2
000a8050 R go.itab.*crypto/rand.reader,io.Reader
0007f83c T type..eq.crypto/rand.reader
% objdump -d ./issue52270 | grep crypto | grep reader | grep func
crypto/rand.(*reader).Read.func2:
   7f6dc:	07 00 00 9a 	bls	#28 <crypto/rand.(*reader).Read.func2+0x2c>
   7f6ec:	06 00 00 1a 	bne	#24 <crypto/rand.(*reader).Read.func2+0x38>
   7f708:	f1 ff ff ea 	b	#-60 <crypto/rand.(*reader).Read.func2>
   7f718:	f4 ff ff 1a 	bne	#-48 <crypto/rand.(*reader).Read.func2+0x1c>
   7f724:	f1 ff ff ea 	b	#-60 <crypto/rand.(*reader).Read.func2+0x1c>
crypto/rand.(*reader).Read.func1:
   7f730:	07 00 00 9a 	bls	#28 <crypto/rand.(*reader).Read.func1+0x2c>
   7f740:	06 00 00 1a 	bne	#24 <crypto/rand.(*reader).Read.func1+0x38>
   7f75c:	f1 ff ff ea 	b	#-60 <crypto/rand.(*reader).Read.func1>
   7f76c:	f4 ff ff 1a 	bne	#-48 <crypto/rand.(*reader).Read.func1+0x1c>
   7f778:	f1 ff ff ea 	b	#-60 <crypto/rand.(*reader).Read.func1+0x1c>

All 5 of those references are from inside the functions in question.

I think this has to do with the defers in crypto/rand.(*reader), as the bodies of func1 and func2 just call mutex.Unlock and timer.Stop, respectively, which are the 2 things deferred in crypto/rand.(*reader). I understand that we have to generate wrappers for defers for ABI switching - maybe we're doing this for arm as well, even if it isn't strictly needed?
Presumably the references keeping these functions alive are from the open-coded defer info.

@adriancable
Copy link
Author

@adriancable adriancable commented Apr 11, 2022

@randall77 - thanks for the very complete analysis of the issue. I am still wondering what changed in 1.18 to make this problem happen, when in 1.17 and below it did not.

@thanm
Copy link
Contributor

@thanm thanm commented Apr 11, 2022

@Randall7 this code size increase looks to be a side effect of the register ABI work. AIUI we now enable defer wrappers everywhere (even for architectures like arm and 386 where args are still passed entirely on the stack), since it greatly simplifies the runtime defer processing code. If we didn't do this, we'd have to continue to maintain the legacy version of runtime.deferproc (and related funcs) that have special handling for arguments of deferred func calls.

You wrote: "Indeed, both crypto/rand.(*devReader).Read.func1 and crypto/rand.(*devReader).Read.func2 are unreferenced by any code in the binary except themselves."

This part I don't think I understand -- when I build with "go build -ldflags=-dumpdep" I see this:

...
crypto/rand.(*reader).Read -> crypto/rand.(*reader).Read.func1
crypto/rand.(*reader).Read -> crypto/rand.(*reader).Read.func2
...

In other words, the *.func1 and *.func2 wrappers are being kept alive because they are referenced by crypto/rand.(*reader).Read (which is expected).

@adriancable
Copy link
Author

@adriancable adriancable commented Apr 11, 2022

@thanm - that's interesting, but I wonder if that is the full explanation here? From the Go 1.18 release notes, there is this, which presumably is talking about the change you are referring to:

Go 1.17 implemented a new way of passing function arguments and results using registers instead of the stack on 64-bit x86 architecture on selected operating systems. Go 1.18 expands the supported platforms to include 64-bit ARM (GOARCH=arm64), big- and little-endian 64-bit PowerPC (GOARCH=ppc64, ppc64le), as well as 64-bit x86 architecture (GOARCH=amd64) on all operating systems. On 64-bit ARM and 64-bit PowerPC systems, benchmarking shows typical performance improvements of 10% or more.

But reading this makes it sounds like the architectural change actually happened in 1.17, and 1.18 simply extends that to more platforms.

Yet the effect I reported only occurs in 1.18, not 1.17. If this is the same effect, it's not clear to me how 1.17 was able to handle the register ABI changes for amd64 without requiring the defer wrappers.

Since deferred calls are normally quite short, individual function wrappers for every one of them would plausibly be significant in terms of overhead and this is what we are seeing.

@randall77
Copy link
Contributor

@randall77 randall77 commented Apr 11, 2022

This part I don't think I understand -- when I build with "go build -ldflags=-dumpdep" I see this:

...
crypto/rand.(*reader).Read -> crypto/rand.(*reader).Read.func1
crypto/rand.(*reader).Read -> crypto/rand.(*reader).Read.func2
...
In other words, the *.func1 and *.func2 wrappers are being kept alive because they are referenced by crypto/rand.(*reader).Read (which is expected).

Never mind, the references are there, it just isn't obvious from the disassembly.
It does a load from another part of the function's instructions:

   7f350:       50 23 9f e5     ldr     r2, [pc, #848]

That the referenced instruction stream location is:

   7f6a8:       28 f7 07 00     andeq   pc, r7, r8, lsr #14

That address, 0x7f728 is the start of crypto/rand.(*reader).Read.func1. That function isn't mentioned by name in the disassembly because the disassembler doesn't understand data in code sections :(

@dmitshur dmitshur added the NeedsInvestigation label Apr 11, 2022
@adriancable
Copy link
Author

@adriancable adriancable commented Apr 13, 2022

Has it been decided whether the register ABI will be ported to 32-bit ARM? This is still probably the most popular 'mid-range' embedded architecture out there and I think go is being used a lot here.

If the register ABI won't be ported to 32-bit ARM, then this change in 1.18 make me a little sad, because it means that a change to the compiler has been made (according to the release notes, for the sake of performance) which adds ~10% bloat to binaries on a platform that won't ever get that performance gain. Of course most changes involve trade-offs including edge cases that suffer even if gains are net positive on average, but in this case for 32-bit ARM it would be 'all suffering' with no gain.

@randall77
Copy link
Contributor

@randall77 randall77 commented Apr 13, 2022

I don't think anyone is currently working on, or planning to work on, regabi for 32-bit arm. We of course welcome contributions if anyone wants to work on it.

@adriancable
Copy link
Author

@adriancable adriancable commented Apr 13, 2022

@randall77 - thanks for the info, I fully understand it isn't possible to please everyone. I hope that (even in the absence of register ABI for 32-bit ARM) that a way will be found moving forward to avoid passing the 'bloat' of unsupported register ABI onto 32-bit ARM nonetheless.

But even if not, I still think the world is a better place with Golang. :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation
Projects
None yet
Development

No branches or pull requests

4 participants