Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: builds not reproducible on single-cpu vs multiple-cpu machines #38068

Closed
mjl- opened this issue Mar 25, 2020 · 26 comments
Closed

cmd/compile: builds not reproducible on single-cpu vs multiple-cpu machines #38068

mjl- opened this issue Mar 25, 2020 · 26 comments

Comments

@mjl-
Copy link

@mjl- mjl- commented Mar 25, 2020

What version of Go are you using (go version)?

$ go version
go1.14.1

Also tested with all versions from go1.13 and up, same problem.

Does this issue reproduce with the latest release?

Latest release, yes. Haven't tried tip.

What operating system and processor architecture are you using (go env)?

I ran into this when building the same code on many different machines: freebsd12 freebsd/386, debian10-stable linux/386, ubuntu18lts linux/amd64, debian-unstable linux/amd64, macos-latest darwin/amd64.

What did you do?

I built the same code on all the machines, with options that should result in the exact same binary, but got different results on same of the machines. I tracked it down to the machines with a single cpu creating a different binary. Building with GO19CONCURRENTCOMPILATION=0 on the multi-cpu machines resulted in binaries identical to the ones produced by the single-cpu machines.

Here is an example to run on a multi-cpu machine (I used linux/amd64, but should not make a difference):

git clone https://github.com/golang/tools
cd tools
git checkout a49f79bcc2246daebe8647db377475ffc1523d7b  # latest at time of writing, no special meaning
cd cmd/present

# build
GO111MODULE=on CGO_ENABLED=0 GOOS=linux GOARCH=amd64 $HOME/sdk/go1.14.1/bin/go build -trimpath -a -ldflags=-buildid=

sha256sum present
# e14b34025c0e9bc8256f05a09d245449cfd382c2014806bd1d676e7b0794a89f  present

cp present present.morecpu  # keep for diff, later on

# build, now with GO19CONCURRENTCOMPILATION=0
GO19CONCURRENTCOMPILATION=0 GO111MODULE=on CGO_ENABLED=0 GOOS=linux GOARCH=amd64 $HOME/sdk/go1.14.1/bin/go build -trimpath -a -ldflags=-buildid=

sha256sum present
# 44e21497225e8d88500b008ec961d64907ca496104a18802aaee817397c4fb11  present

What did you expect to see?

The exact same binary.

What did you see instead?

Different binaries.

Details

I found this problem by building on a single-cpu VM, where NumCPU is 1. That probably prevents concurrency during compiles. GO19CONCURRENTCOMPILATION=0 disables concurrent compiles. The compiler and linker use runtime.NumCPU() in a few places, and perhaps GO19CONCURRENTCOMPILATION=0 isn't enough but just hides the symptoms.

FYI, so far I always got the same binary on machines with multiple cpu's, whether 2 cpu's or more, like 8 cpu's. But perhaps that's just because a high level of parallelism isn't reached.

The problem does not manifest for very simple programs (e.g. goimports in the same git checkout). Perhaps because there isn't enough to parallelize.

When building with -ldflags=-w, omitting the DWARF symbol table, the two build commands produce the same output again. I looked into that because of the diff below. I've seen earlier "reproducible build commands" that included -ldflags="-s -w". I don't expect those to be required to get reproducible builds.

$ diff <(objdump -x present.morecpu) <(objdump -x present)
2,3c2,3
< present.morecpu:     file format elf64-x86-64
< present.morecpu
---
> present:     file format elf64-x86-64
> present
60c60
<  18 .zdebug_loc   00338d92  0000000001049464  0000000001049464  00c19464  2**0
---
>  18 .zdebug_loc   00338d92  0000000001049457  0000000001049457  00c19457  2**0
62c62
<  19 .zdebug_ranges 0010d750  00000000010e2de1  00000000010e2de1  00cb2de1  2**0
---
>  19 .zdebug_ranges 0010d750  00000000010e2dd6  00000000010e2dd6  00cb2dd6  2**0
@bcmills bcmills added this to the Go1.15 milestone Mar 25, 2020
@bcmills
Copy link
Member

@bcmills bcmills commented Mar 25, 2020

Tentatively milestoning as a Go 1.15 release blocker: @jayconrod has been working to iron out the remaining reproducibility bugs, and it would be really unfortunate to miss that target by this one issue.

(CC @matloob)

@bcmills bcmills changed the title cmd/go: builds not reproducible on single-cpu vs multiple-cpu machines cmd/compile: builds not reproducible on single-cpu vs multiple-cpu machines Mar 25, 2020
@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Mar 25, 2020

CC @thanm because DWARF.

@thanm
Copy link
Member

@thanm thanm commented Mar 25, 2020

I'll take a look.

@thanm thanm self-assigned this Mar 25, 2020
@thanm
Copy link
Member

@thanm thanm commented Mar 25, 2020

I can reproduce the problem pretty easily. Issue is not with .debug_loc/.debug_range but with .debug_info, I am working on identifying a root cause.

@thanm
Copy link
Member

@thanm thanm commented Mar 26, 2020

This is a pretty interesting bug to work on. Normally when debugging such problems "-S" and "-m" are your friends, but those are not an option when the concurrent back end is turned on.

One of the functions where I can see the problem in the cmd/present build is the method encoding/asn1.BitString.At. This function winds up with different DWARF depending on whether the parallel back end is on.

What's interesting about this method is that it has a value receiver, and the compiler decides at some point that it's going to generated a pointer receiver wrapper (in genwrapper), e.g. encoding/asn1.(*BitString).At.

When the parallel back end is not on, the sequence of events is:

== ARGS: -c=1
| compile "".BitString.At
| dwarfGen for "".BitString.At
| genwrapper creates "".(*BitString).At
| inline "".BitString.At into "".(*BitString).At
| compile "".(*BitString).At
| dwarfGen for "".(*BitString).At

When the parallel back end runs, the sequence of events is:

== ARGS: -c=8
genwrapper creates "".(*BitString).At
inline "".BitString.At into "".(*BitString).At
<in parallel>
| compile "".BitString.At         |    compile "".(*BitString).At
| dwarfGen for "".BitString.At    |    dwarfGen for "".(*BitString).At

At first glance this doesn't look so bad, but it turns out that there is actually IR sharing going on between the two routines: their fn.Dcl lists point to the same nodes for params, variables, etc). This means that the actions of the inliner are going to wind up making perturbations that are going to be visible the DWARF gen code, which is bad.

I'm still thinking about the best way to fix this; I need to fig into the wrapper generation code to understand exactly why/how there is sharing and what the story is there.

@thanm
Copy link
Member

@thanm thanm commented Apr 1, 2020

Oops, got the wrong github user. Trying again: @jeremyfaller

@aclements
Copy link
Member

@aclements aclements commented Apr 16, 2020

Ping @thanm. Any progress on this?

@thanm
Copy link
Member

@thanm thanm commented Apr 16, 2020

No progress to report, I have been bogged down with other work.

I have some ideas on how to fix this, but I am still trying to come up with a good solution.

@josharian
Copy link
Contributor

@josharian josharian commented Apr 16, 2020

One large scale architectural approach is to try to end the lifetime of gc.Nodes at ssa construction time. That would involve building replacement built-for-purpose data structures from the nodes which then get used throughout ssa, dwarf, and everything else downstream. This would also be one big helpful step towards eliminating nodes entirely.

@thanm
Copy link
Member

@thanm thanm commented May 15, 2020

I am finally getting back to looking at this bug again. More detail below.

Background: when the initial version of DWARF inline support in Go 1.11 rolled out, at that point inlining of calls in wrappers was disabled (the comment at the time said that there was a bad interaction with indexed export data, with a TODO to turn it back on). In addition, wrapper generation happened very late -- long after the main part of the compile was done.

In the 2018 time frame (this was for Go 1.13), wrapper inlining was enabled again in this commit:

fa913a36a2 cmd/compile/internal/gc: inline autogenerated (*T).M wrappers

A little later in the Go 1.13 development cycle we added stack objects:

cbafcc55e8 cmd/compile,runtime: implement stack objects

This added a section of code to the "compile" function that included a call to dtypesym. The relevant code looks like this:

	// Make sure type syms are declared for all types that might
	// be types of stack objects. We need to do this here
	// because symbols must be allocated before the parallel
	// phase of the compiler.
	for _, n := range fn.Func.Dcl {
		switch n.Class() {
		case PPARAM, PPARAMOUT, PAUTO:
			if livenessShouldTrack(n) && n.Name.Addrtaken() {
				dtypesym(n.Type)
				// Also make sure we allocate a linker symbol
				// for the stack object data, for the same reason.
				if fn.Func.lsym.Func.StackObjects == nil {
					fn.Func.lsym.Func.StackObjects = Ctxt.Lookup(fn.Func.lsym.Name + ".stkobj")
				}
			}
		}
	}

	if compilenow() {
		compileSSA(fn, 0)
	} else {
		compilequeue = append(compilequeue, fn)
	}

What's interesting here is that dtypesym can trigger a call to genwrapper, which in turn calls "compile". This means that wrapper generation happens much earlier, and is intermixed with compilation of "regular" functions.

So let's say that the order of functions in xtop is:

BitString.At
XYZ (some random function that uses *BitString)
ABC (unrelated function)

In the non-concurrent scenario, the order of operations is:

1. function "compile" is invoked on BitString.At
   -> compile invokes compileSSA, which generates DWARF
2. function "compile" invoked on XYZ
   -> calls dtypesym(*BitString)
       -> calls genwrapper for (*BitString).At()
          -> genwrapper calls inlcalls
	     *** inliner inlines BitString.At into (*BitString).At()
	  -> genwrapper invokes "compile" on (*BitString).At()
	     -> compile invokes compileSSA, which generates DWARF
3. function "compile" is invoked on ABC
   -> compile invokes compileSSA, which generates DWARF

The important thing about sequence above is that at the point where we generate DWARF for BitString.At, it has not been inlined. Hence according to the DWARF rules, we emit a single subprogram DIE and move on.

Now the concurrent scenario. The order of operations here is:

1. function "compile" is invoked on BitString.At
   -> this adds BitString.At to compilequeue, but no SSA or DWARF gen
2. function "compile" invoked on XYZ
   -> calls dtypesym(*BitString)
       -> calls genwrapper for (*BitString).At()
          -> genwrapper causes inlcalls
	     *** inliner inlines BitString.At into (*BitString).At()
	  -> genwrapper invokes "compile" on (*BitString).At()
	     -> (*BitString).At() added to compile queue
3. function "compile" is invoked on ABC
4. enter parallel phase
   DWARF-gen, SSA happen in parallel for all functions in the compile queue

Key thing to note here is that when a parallel worker picks up BitString.At and compiles it, at that point in time BitString.At has been inlined, meaning that according to DWARF we need to emit an abstract subprogram DIE and a concrete DIE that refers back to it.

What's triggering the actual code difference is the code in the compiler's assembleInlines() function, which treats inlined and non-inlined functions differently with respect to the order it assigns to the variable/param children of the function. In the non-inlined case, it can just the order in the function's Dcl list. In the inlined case, it has to do a lot of work to assign variables to the correct inlined instance, and as a result the child ordering comes out differently.

In terms of a fix, I can think of a couple of options.

The simplest thing would be to just turn off inlining of wrappers. This would be the simplest option, but also undesirable given that we would lose whatever performance gain we got when it was turned on in 1.13.

A second option would be to change assembleInlines() in dwinl.go to look not at whether a function was actually inlined, but whether it was a inlinable in the first place (e.g. marked as an inline candidate). This would require capturing the pre-inline Dcl list at an earlier stage. There is also some risk in that the ordering of child variable DIEs in subprogram DIEs would be perturbed (this is probably pretty minor). In addition, this doesn't simplify things much -- things are already complicated enough in assembleInlines().

A third option would be to delay wrapper compilation. The idea here would be that when 'genwrapper' is invoked, it captures its inputs and stores them to a list, then when the main parallel phase is complete, we can drain the list, generating the wrappers and doing the compilation then. The potential drawback here is that we'd lose some parallelism in the compiler (would need to benchmark here).

@thanm
Copy link
Member

@thanm thanm commented May 15, 2020

Also a fourth option: change the non-concurrent path so that we generate all wrappers before compiling any inlinable functions. That might actually be a good way to go.

@josharian
Copy link
Contributor

@josharian josharian commented May 15, 2020

The simplest thing would be to just turn off inlining of wrappers.

I think this is what we should do for 1.15. It is simple and low risk. We have significant other compiler performance gains this cycle to offset any losses here. We should benchmark of course, but IIRC, this mainly matters for a handful of packages containing a great many types.

We can then return to a proper but higher risk fix for 1.16. Your fourth option sounds pretty good to me, but I haven't looked at the code in any detail in a long time.

@thanm
Copy link
Member

@thanm thanm commented May 15, 2020

@josharian thanks for that. I'll work up CLs for options 1 and 4 and do some experimentation.

I also need to figure out how to write a good test for this bug. Although I was able to reproduce it pretty easily on my laptop, it's very timing sensitive (on my workstation, which has more cores, the problem doesn't trigger for me using the submitter's recipe).

@gopherbot
Copy link

@gopherbot gopherbot commented May 15, 2020

Change https://golang.org/cl/234177 mentions this issue: cmd/compile: disable inlining into method wrappers (for now)

@gopherbot
Copy link

@gopherbot gopherbot commented May 15, 2020

Change https://golang.org/cl/234178 mentions this issue: cmd/compile: delay inlinable method compilation for -c=1

@josharian
Copy link
Contributor

@josharian josharian commented May 15, 2020

Both those CLs look pretty reasonable to me. Having seen them, I’m weakly inclined towards the deeper fix.

@randall77
Copy link
Contributor

@randall77 randall77 commented May 15, 2020

I'd like to propose a 5th option. Not sure if this would actually fix the issue, but...

Before compilation, can we iterate through xtop, find all the wrappers that need to be generated, and generate them? Including making any inline decisions for them.
Then we're not generating new functions to compile while compiling, which has always been a thorny source of nondeterminism.

@thanm
Copy link
Member

@thanm thanm commented May 15, 2020

@randall77 yes, I think doing that would also fix the problem. I can send a CL for that as well. I wasn't aware that we frown on adding functions during compilation... if that's one of our goals then I think it would definitely be better to go with your option.

@randall77
Copy link
Contributor

@randall77 randall77 commented May 15, 2020

I think we just try to keep as many complicated bits out of the parallel phase as possible. We've historically had trouble with symbols, errors, etc. showing up in the parallel phase and having to jump through hoops to serialize them deterministically. The less we do in the parallel phase, the better, at least from a code maintenance perspective (i.e. as long as it is a tiny fraction of compile time).

@thanm
Copy link
Member

@thanm thanm commented May 15, 2020

OK. Just to be clear, the damage in this specific bug is being done before the parallel phase starts-- the bad behavior in question can also be triggered by "go tool compile -d compilelater" with the concurrent back end turned off.

@thanm
Copy link
Member

@thanm thanm commented May 15, 2020

@randall77 I made an attempt at your suggestion, but it seems more complicated than I thought.

In particular, with this code in "compile()":

	// Make sure type syms are declared for all types that might
	// be types of stack objects. We need to do this here
	// because symbols must be allocated before the parallel
	// phase of the compiler.
	for _, n := range fn.Func.Dcl {
		switch n.Class() {
		case PPARAM, PPARAMOUT, PAUTO:
			if livenessShouldTrack(n) && n.Name.Addrtaken() {
				dtypesym(n.Type)
				// Also make sure we allocate a linker symbol
				// for the stack object data, for the same reason.
				if fn.Func.lsym.Func.StackObjects == nil {
					fn.Func.lsym.Func.StackObjects = Ctxt.Lookup(fn.Func.lsym.Name + ".stkobj")
				}
			}
		}
	}

Note the "n.Name.Addrtaken()" in the if guarding the dtypesym call. This property is set by the type checker, but also in the "order" phase. If I hoist the code above out into a loop prior to the compilation loop (as you suggested) it's going to see a smaller set of addr-taken variables, due to the fact that "order" hasn't run yet.

With this in mind I think it might be better to stick to option 4.

@randall77
Copy link
Contributor

@randall77 randall77 commented May 15, 2020

I'm not sure what that code has to do with my suggestion. I'm not suggesting that that code be moved.

I'm not suggesting we compile the wrappers early, just generate them.
I guess that would require removing the funccompile call at the end of genwrapper, and replacing that with adding the function to the xtop list instead?

@thanm
Copy link
Member

@thanm thanm commented May 15, 2020

not suggesting we compile the wrappers early, just generate them.

This is what I tried to do. The problem is that without running "order" you don't have an accurate picture of "address taken". Here is a CL with my attempt (the new functionality is guarded by an environment variable):

https://go-review.googlesource.com/c/go/+/234115

I think this is what you were suggesting, but maybe I am missing something? In particular see the "PROBLEM" comment in "funcgenwrappers".

@gopherbot
Copy link

@gopherbot gopherbot commented May 18, 2020

Change https://golang.org/cl/234417 mentions this issue: cmd/compile: add test case for issue 38068

@thanm
Copy link
Member

@thanm thanm commented May 19, 2020

Discussed in the compiler/runtime meeting today. The plan is move forward with https://go-review.googlesource.com/c/go/+/234178 as opposed to the other candidate fixes.

@gopherbot gopherbot closed this in fed33d7 May 21, 2020
gopherbot pushed a commit that referenced this issue May 29, 2020
New test case for issue 38068, which deals with build reproducibility:
do a pair of compilations, the first with the concurrent back end
turned on, and the second with -c=1, then check to make sure we get
the same output (using a test case that triggers late inlining into
wrapper methods).

Updates #38068.

Change-Id: I4afaf78898706a66985f09d18f6f6f29876c9017
Reviewed-on: https://go-review.googlesource.com/c/go/+/234417
Reviewed-by: Keith Randall <khr@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@golang golang locked and limited conversation to collaborators May 21, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
9 participants