cmd/go,os: build cache checksum errors in x/tools/cmd/callgraph.TestCallgraph on windows/arm64 #50706

Open
bcmills opened this issue Jan 20, 2022 · 12 comments
Labels
arch-arm64 NeedsInvestigation OS-Windows Tools

Member

@bcmills bcmills commented Jan 20, 2022

--- FAIL: TestCallgraph (11.99s)
    main_test.go:85: err: exit status 1: stderr: go build pkg: loading compiled Go files from cache: reading srcfiles list: cache entry not found: bad checksum
        
    main_test.go:100: got:
         <root> --> pkg.init
        pkg.main2 --> (pkg.D).f
        pkg.main --> pkg.main2
        pkg.main --> (pkg.C).f
        <root> --> pkg.main
        
    main_test.go:100: got:
         (*os.File).setDeadline --> (*os.File).checkValid
        (time.Time).IsZero --> (*time.Time).sec
        (time.Time).IsZero --> (*time.Time).nsec
        internal/poll.setDeadlineImpl --> (time.Time).IsZero
…
FAIL
FAIL	golang.org/x/tools/cmd/callgraph	12.869s

greplogs --dashboard -md -l -e 'reading srcfiles list: cache entry not found: bad checksum' --since=2021-01-01

2022-01-19T20:29:36-7c251d6-9de1ac6/windows-arm64-10
2021-11-02T15:54:27-058ed05-c3cb1ec/windows-arm64-10
2021-11-01T13:50:47-513e3fb-4a84298/windows-arm64-10
2021-10-14T17:38:39-e69ba9d-011fd00/windows-arm64-10
2021-09-14T02:53:17-384e5da-ee91bb8/windows-arm64-10

@gopherbot gopherbot added the Tools label Jan 20, 2022
@gopherbot gopherbot added this to the Unreleased milestone Jan 20, 2022
Member Author

@bcmills bcmills commented Jan 20, 2022

The "loading compiled Go files from cache" error string is a hapax legomenon in the Go project; it definitely comes from here, in cmd/go/internal/work:
https://cs.opensource.google/go/go/+/master:src/cmd/go/internal/work/exec.go;l=739;drc=2580d0e08d5e9f979b943758d3c49877fb2324cb
The "reading srcfiles list" prefix comes from here:
https://cs.opensource.google/go/go/+/master:src/cmd/go/internal/work/exec.go;l=1006;drc=master

The error appears to indicate file corruption in the cmd/go build cache, but I don't have any theories as to how that corruption is occurring or why it seems to affect only this one test on this one builder. The test understandably doesn't provide much detail on the sequence or timing of the go invocations it is running.

Given the failure mode, I think the bug is more likely in os, syscall, or cmd/go itself than in x/tools/cmd/callgraph. windows/arm64 is not a first-class port and lacks a longtest builder, so it may be that x/tools/cmd/callgraph is incidentally triggering an underlying bug in an interaction that is being skipped (or isn't covered at all) in the os and/or cmd/go tests.

We are also running a relatively old Windows 10 build (#48946, CC @golang/release, @zx2c4), so I can't rule out a bug in the underlying platform either.

Member Author

@bcmills bcmills commented Jan 20, 2022

This is a release-blocker via #11811, but given that this is not a first-class port and appears to be a platform-specific bug affecting only one test, I plan to add a test skip for this specific builder in x/tools/cmd/callgraph.TestCallgraph and then move this issue to the Backlog without investigating further.

If we also observe this failure mode on the new windows-arm64-11 builder once that is up and running, and/or if we upgrade windows/arm64 to a first-class port, we can reprioritize an investigation.

@bcmills bcmills removed this from the Unreleased milestone Jan 20, 2022
@bcmills bcmills added this to the Go1.18 milestone Jan 20, 2022
@bcmills bcmills changed the title from 'x/tools/cmd/callgraph: TestCallgraph failures with "cache entry not found: bad checksum" on windows-arm64-10' to 'cmd/go,os: build cache checksum errors in x/tools/cmd/callgraph.TestCallgraph on windows-arm64-10' Jan 20, 2022

@gopherbot gopherbot commented Jan 20, 2022

Change https://golang.org/cl/379734 mentions this issue: cmd/callgraph: skip TestCallgraph on the windows-arm64-10 builder

@heschi heschi added the NeedsFix label Jan 20, 2022
Contributor

@rsc rsc commented Jan 21, 2022

The 'bad checksum' means we read a file that was named for a sha256 hash and the content did not match that sha256.

Contributor

@rsc rsc commented Jan 21, 2022

The fact that this is only windows/arm64 and that we've seen absolutely no mentions of it on other systems or in other bug reports makes me feel okay with this not being a release-blocker. If there really is corruption, the content-addressed and checksum-checked nature of the cache means that the system is either failstop or works correctly. So far we are getting no reports of failstop other than this one.

Member Author

@bcmills bcmills commented Jan 21, 2022

I agree, but the failure rate for TestCallgraph is high enough that I think we should at least add that skip in the interim to avoid masking other failures on the builders.

Contributor

@rsc rsc commented Jan 23, 2022

t.Skips are always OK in my book.
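
For readers unfamiliar with builder-specific skips, the idea can be sketched roughly as follows. This is an illustration, not the code from the CL: the `flakyBuilders` map and the `skipReason` helper are hypothetical, and the sketch assumes the builder exposes its name through the `GO_BUILDER_NAME` environment variable.

```go
package main

import (
	"fmt"
	"os"
)

// flakyBuilders maps a builder name to the issue that explains the skip.
// The map is hypothetical; it exists only for this sketch.
var flakyBuilders = map[string]string{
	"windows-arm64-10": "golang/go#50706",
}

// skipReason returns a non-empty message when the test should be skipped
// on the named builder, and "" otherwise.
func skipReason(builder string) string {
	if issue, ok := flakyBuilders[builder]; ok {
		return fmt.Sprintf("skipping on %s builder due to %s", builder, issue)
	}
	return ""
}

func main() {
	// Inside a test function this would read:
	//   if msg := skipReason(os.Getenv("GO_BUILDER_NAME")); msg != "" {
	//       t.Skip(msg)
	//   }
	for _, b := range []string{os.Getenv("GO_BUILDER_NAME"), "windows-arm64-10", "linux-amd64"} {
		fmt.Printf("%q -> %q\n", b, skipReason(b))
	}
}
```

Keying the skip on the builder name rather than on GOOS/GOARCH keeps the test running everywhere else, including on any other windows/arm64 machine.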

gopherbot pushed a commit to golang/tools that referenced this issue Jan 24, 2022
We don't know whether this failure is due to a Go bug or a platform
bug, so we'll skip it on the one builder to reduce noise, but not the
GOOS/GOARCH as a whole. If we do not observe failures on other
windows/arm64 builders, we can perhaps chalk it up to a platform bug.
If we do observe failures on other builders, then we'll have more data
to investigate with.

For golang/go#50706

Change-Id: I52511dd4a5cff80953823d9cf901975ff4657457
Reviewed-on: https://go-review.googlesource.com/c/tools/+/379734
Trust: Bryan Mills <bcmills@google.com>
Reviewed-by: Daniel Martí <mvdan@mvdan.cc>
Trust: Daniel Martí <mvdan@mvdan.cc>
@bcmills bcmills removed this from the Go1.18 milestone Jan 26, 2022
@bcmills bcmills added this to the Backlog milestone Jan 26, 2022

@gopherbot gopherbot commented Jan 27, 2022

Change https://golang.org/cl/381314 mentions this issue: cmd/go: cache debugging

Contributor

@rsc rsc commented Jan 27, 2022

I uploaded https://go-review.googlesource.com/c/go/+/381314 just to have around if we need to patch it in to investigate this further. No intent to submit it.

Member Author

@bcmills bcmills commented Feb 24, 2022

> If we also observe this failure mode on the new windows-arm64-11 builder once that is up and running, and/or if we upgrade windows/arm64 to a first-class port, we can reprioritize an investigation.

Now observed on windows-arm64-11 as well:

greplogs --dashboard -md -l -e 'reading srcfiles list: cache entry not found: bad checksum' --since=2022-01-20

2022-02-17T17:36:57-cda4201-eaf0405/windows-arm64-11

@bcmills bcmills changed the title from 'cmd/go,os: build cache checksum errors in x/tools/cmd/callgraph.TestCallgraph on windows-arm64-10' to 'cmd/go,os: build cache checksum errors in x/tools/cmd/callgraph.TestCallgraph on windows/arm64' Feb 24, 2022
@bcmills bcmills added the NeedsInvestigation label Feb 24, 2022
@gopherbot gopherbot removed the NeedsFix label Feb 24, 2022
Member Author

@bcmills bcmills commented Apr 1, 2022

A couple more. Whatever the cause, this is not fixed in Windows 11.

greplogs --dashboard -md -l -e 'reading srcfiles list: cache entry not found: bad checksum' --since=2022-02-24

2022-04-01T20:25:27-153e30b-32ff9b5/windows-arm64-11
2022-04-01T17:19:22-cda13e2-df89f2b/windows-arm64-11


@gopherbot gopherbot commented Apr 4, 2022

Change https://go.dev/cl/397996 mentions this issue: cmd/callgraph: expand windows/arm64 skip to the whole platform

gopherbot pushed a commit to golang/tools that referenced this issue Apr 4, 2022
This test produces apparent file corruption on all of the
windows/arm64 builders. I suspect that this is a low-level bug (in
either the platform itself or the Go standard library on
windows/arm64).

Since windows/arm64 is not yet a first-class port, this test can be
skipped for now. However, if windows/arm64 becomes a first-class port
the underlying file-corruption bug should be investigated and fixed.

Updates golang/go#50706.

Change-Id: I0bc80cefee50895d40acc658286eb7ef8790493a
Reviewed-on: https://go-review.googlesource.com/c/tools/+/397996
Reviewed-by: Russ Cox <rsc@golang.org>
Trust: Bryan Mills <bcmills@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
gopls-CI: kokoro <noreply+kokoro@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
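
A platform-wide skip of the kind this commit describes might look roughly like the sketch below. The `affectedPlatform` helper is hypothetical and is not taken from the CL; it only shows the condition changing from a single builder name to the GOOS/GOARCH pair.

```go
package main

import (
	"fmt"
	"runtime"
)

// affectedPlatform reports whether the given GOOS/GOARCH pair is the one
// on which the cache checksum failures were observed.
func affectedPlatform(goos, goarch string) bool {
	return goos == "windows" && goarch == "arm64"
}

func main() {
	// Inside a test function this would read:
	//   if affectedPlatform(runtime.GOOS, runtime.GOARCH) {
	//       t.Skip("skipping due to suspected file corruption bug on windows/arm64 (golang/go#50706)")
	//   }
	fmt.Println("skip on this platform:", affectedPlatform(runtime.GOOS, runtime.GOARCH))
}
```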