
cmd/compile: builds not reproducible on single-cpu vs multiple-cpu machines #38068

Open
mjl- opened this issue Mar 25, 2020 · 6 comments
@mjl- mjl- commented Mar 25, 2020

What version of Go are you using (go version)?

$ go version
go1.14.1

I also tested all versions from go1.13 onward; same problem.

Does this issue reproduce with the latest release?

Latest release, yes. Haven't tried tip.

What operating system and processor architecture are you using (go env)?

I ran into this when building the same code on many different machines: freebsd12 freebsd/386, debian10-stable linux/386, ubuntu18lts linux/amd64, debian-unstable linux/amd64, macos-latest darwin/amd64.

What did you do?

I built the same code on all the machines, with options that should result in the exact same binary, but got different results on some of the machines. I tracked it down to the single-CPU machines producing a different binary. Building with GO19CONCURRENTCOMPILATION=0 on the multi-CPU machines resulted in binaries identical to the ones produced by the single-CPU machines.

Here is an example to run on a multi-CPU machine (I used linux/amd64, but it should not make a difference):

git clone https://github.com/golang/tools
cd tools
git checkout a49f79bcc2246daebe8647db377475ffc1523d7b  # latest at time of writing, no special meaning
cd cmd/present

# build
GO111MODULE=on CGO_ENABLED=0 GOOS=linux GOARCH=amd64 $HOME/sdk/go1.14.1/bin/go build -trimpath -a -ldflags=-buildid=

sha256sum present
# e14b34025c0e9bc8256f05a09d245449cfd382c2014806bd1d676e7b0794a89f  present

cp present present.morecpu  # keep for diff, later on

# build, now with GO19CONCURRENTCOMPILATION=0
GO19CONCURRENTCOMPILATION=0 GO111MODULE=on CGO_ENABLED=0 GOOS=linux GOARCH=amd64 $HOME/sdk/go1.14.1/bin/go build -trimpath -a -ldflags=-buildid=

sha256sum present
# 44e21497225e8d88500b008ec961d64907ca496104a18802aaee817397c4fb11  present

What did you expect to see?

The exact same binary.

What did you see instead?

Different binaries.

Details

I found this problem by building on a single-cpu VM, where NumCPU is 1. That probably prevents concurrency during compiles. GO19CONCURRENTCOMPILATION=0 disables concurrent compiles. The compiler and linker use runtime.NumCPU() in a few places, and perhaps GO19CONCURRENTCOMPILATION=0 isn't enough but just hides the symptoms.

FYI, so far I have always gotten the same binary on machines with multiple CPUs, whether 2 or 8. But perhaps that's just because a high enough level of parallelism isn't reached.

The problem does not manifest for very simple programs (e.g. goimports in the same git checkout), perhaps because there isn't enough to parallelize.

When building with -ldflags=-w, which omits the DWARF debugging information, the two build commands produce the same output again. I looked into that because of the diff below. I've seen earlier "reproducible build" command lines that included -ldflags="-s -w"; I wouldn't expect those flags to be required for reproducible builds.

$ diff <(objdump -x present.morecpu) <(objdump -x present)
2,3c2,3
< present.morecpu:     file format elf64-x86-64
< present.morecpu
---
> present:     file format elf64-x86-64
> present
60c60
<  18 .zdebug_loc   00338d92  0000000001049464  0000000001049464  00c19464  2**0
---
>  18 .zdebug_loc   00338d92  0000000001049457  0000000001049457  00c19457  2**0
62c62
<  19 .zdebug_ranges 0010d750  00000000010e2de1  00000000010e2de1  00cb2de1  2**0
---
>  19 .zdebug_ranges 0010d750  00000000010e2dd6  00000000010e2dd6  00cb2dd6  2**0

@bcmills bcmills added this to the Go1.15 milestone Mar 25, 2020
@bcmills bcmills commented Mar 25, 2020

Tentatively milestoning as a Go 1.15 release blocker: @jayconrod has been working to iron out the remaining reproducibility bugs, and it would be really unfortunate to miss that target by this one issue.

(CC @matloob)

@bcmills bcmills changed the title cmd/go: builds not reproducible on single-cpu vs multiple-cpu machines cmd/compile: builds not reproducible on single-cpu vs multiple-cpu machines Mar 25, 2020
@ianlancetaylor ianlancetaylor commented Mar 25, 2020

CC @thanm because DWARF.

@thanm thanm commented Mar 25, 2020

I'll take a look.

@thanm thanm self-assigned this Mar 25, 2020
@thanm thanm commented Mar 25, 2020

I can reproduce the problem pretty easily. The issue is not with .debug_loc/.debug_ranges but with .debug_info; I am working on identifying a root cause.

@thanm thanm commented Mar 26, 2020

This is a pretty interesting bug to work on. Normally when debugging such problems "-S" and "-m" are your friends, but those are not an option when the concurrent back end is turned on.

One of the functions where I can see the problem in the cmd/present build is the method encoding/asn1.BitString.At. This function winds up with different DWARF depending on whether the parallel back end is on.

What's interesting about this method is that it has a value receiver, and the compiler decides at some point that it's going to generate a pointer receiver wrapper (in genwrapper), e.g. encoding/asn1.(*BitString).At.
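
To make the value-receiver/wrapper relationship concrete, here is a minimal Go sketch. The BitString type below is a simplified stand-in, not the actual encoding/asn1 source, and the commented-out wrapper only illustrates conceptually what genwrapper emits; it is not compiler output.

package main

// Simplified stand-in for encoding/asn1.BitString (not the real source).
type BitString struct {
	Bytes     []byte
	BitLength int
}

// At has a value receiver, like encoding/asn1.BitString.At.
func (b BitString) At(i int) int {
	if i < 0 || i >= b.BitLength {
		return 0
	}
	return int(b.Bytes[i/8]>>(7-uint(i)%8)) & 1
}

// So that *BitString also has At in its method set, genwrapper conceptually
// emits a pointer-receiver wrapper along these lines, and the inliner then
// inlines the value method into it:
//
//	func (b *BitString) At(i int) int { return (*b).At(i) }

func main() {
	// Storing a *BitString in an interface that requires At dispatches
	// through the compiler-generated (*BitString).At wrapper.
	var v interface{ At(int) int } = &BitString{Bytes: []byte{0xA0}, BitLength: 3}
	_ = v.At(1)
}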

When the parallel back end is not on, the sequence of events is:

== ARGS: -c=1
| compile "".BitString.At
| dwarfGen for "".BitString.At
| genwrapper creates "".(*BitString).At
| inline "".BitString.At into "".(*BitString).At
| compile "".(*BitString).At
| dwarfGen for "".(*BitString).At

When the parallel back end runs, the sequence of events is:

== ARGS: -c=8
genwrapper creates "".(*BitString).At
inline "".BitString.At into "".(*BitString).At
<in parallel>
| compile "".BitString.At         |    compile "".(*BitString).At
| dwarfGen for "".BitString.At    |    dwarfGen for "".(*BitString).At

At first glance this doesn't look so bad, but it turns out that there is actually IR sharing going on between the two routines: their fn.Dcl lists point to the same nodes (params, variables, etc.). This means that the actions of the inliner wind up making perturbations that are visible to the DWARF gen code, which is bad.

I'm still thinking about the best way to fix this; I need to dig into the wrapper generation code to understand exactly why/how there is sharing and what the story is there.
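
For illustration, here is a toy sketch of the aliasing hazard. The Node and Func types are hypothetical, not the real cmd/compile IR: two functions whose Dcl slices point at the same nodes, so a change made while processing one function is observed when generating DWARF for the other, and whether it is seen depends on when the passes run relative to each other.

package main

import "fmt"

// Toy stand-ins for compiler IR; not the real cmd/compile types.
type Node struct{ Name string }

type Func struct {
	Sym string
	Dcl []*Node // declared params/locals
}

func main() {
	// The wrapper's Dcl aliases the same *Node values as the original
	// function, analogous to the sharing introduced via genwrapper.
	recv := &Node{Name: "b"}
	orig := &Func{Sym: "BitString.At", Dcl: []*Node{recv}}
	wrap := &Func{Sym: "(*BitString).At", Dcl: []*Node{recv}}

	// An inliner-like pass perturbs the shared node while working on the
	// wrapper...
	wrap.Dcl[0].Name = "b.inlined"

	// ...and a DWARF-gen-like pass for the *original* function now sees the
	// perturbation. Whether it runs before or after that write depends on
	// scheduling, so the output differs between -c=1 and -c=N.
	fmt.Println(orig.Sym, "sees param name:", orig.Dcl[0].Name)
}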

@thanm thanm added NeedsFix and removed NeedsInvestigation labels Mar 31, 2020