runtime: darwin memory corruption? #22988

Open
bradfitz opened this Issue Dec 4, 2017 · 22 comments

@bradfitz
Member

bradfitz commented Dec 4, 2017

What is going on here?

https://storage.googleapis.com/go-build-log/5914916e/darwin-amd64-10_11_7e43df8a.log

ok  	internal/trace	0.572s
ok  	io	0.033s
ok  	io/ioutil	0.011s
ok  	log	0.009s
ok  	log/syslog	1.336s
ok  	math	0.009s
ok  	math/big	1.266s
ok  	math/bits	0.026s
ok  	math/cmplx	0.012s
ok  	math/rand	0.377s
ok  	mime	0.013s
ok  	mime/multipart	0.271s
ok  	mime/quotedprintable	0.101s
ok  	net	8.465s
# net/http
/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/http/request.go:535:23: bufio.size·3 declared and not used
FAIL	net/http [build failed]
2017/12/04 12:10:04 Failed: exit status 2


Error: tests failed: dist test failed: go_test:net/http: exit status 1

@bradfitz bradfitz added the NeedsFix label Dec 4, 2017

@bradfitz bradfitz added this to the Go1.10 milestone Dec 4, 2017

@bradfitz

Member

bradfitz commented Dec 4, 2017

@mdempsky

Member

mdempsky commented Dec 4, 2017

Very weird that it failed only on darwin-amd64.

I just saw this runtime flake on darwin-amd64: #22987

I wonder if there's a darwin-amd64 memory corruption issue. A trybot retry would be nice to see if the failure was a flake or not.

@bradfitz

Member

bradfitz commented Dec 4, 2017

darwin-amd64 is a pool of 10 physical machines running VMware, with a max of 20 VMs. It's possible that it's bad memory, but that wouldn't be my first guess.

@mdempsky

Member

mdempsky commented Dec 4, 2017

I was thinking software memory corruption, not hardware.

@bradfitz

Member

bradfitz commented Dec 4, 2017

Ah, indeed.

@mdempsky

Member

mdempsky commented Dec 4, 2017

The retry has passed darwin-amd64, so I'm thinking it's not a compiler issue.

@bradfitz

Member

bradfitz commented Dec 4, 2017

Reassign to @aclements for runtime memory corruption?

@mdempsky mdempsky assigned aclements and unassigned mdempsky Dec 4, 2017

@mdempsky mdempsky changed the title from cmd/compile: "bufio.size·3 declared and not used" to cmd/compile: "bufio.size·3 declared and not used" flake Dec 4, 2017

@bradfitz

Member

bradfitz commented Dec 6, 2017

@bradfitz bradfitz changed the title from cmd/compile: "bufio.size·3 declared and not used" flake to runtime: darwin memory corruption? Dec 6, 2017

@mdempsky

Member

mdempsky commented Dec 6, 2017

It looks like either clang or testp1 crashed with a segmentation fault without printing anything. Unfortunately, the test's logs don't make it possible to discern which happened.

However, either way something seems very broken: either clang (provided by the system) segfaulted, or the Go test program's SIGSEGV handler messed up.

@mdempsky

Member

mdempsky commented Dec 6, 2017

The darwin-amd64-10_11 builder has been pretty stable on build.golang.org. Are there any notable differences between the darwin-amd64-10_11 builders and trybots?

@bradfitz

Member

bradfitz commented Dec 7, 2017

@mdempsky, they're identical (same VM images). The only difference is that trybot runs shard out over 4 VMs, while build.golang.org runs shard out over 3 VMs.

@gopherbot

gopherbot commented Dec 8, 2017

Change https://golang.org/cl/83016 mentions this issue: runtime: reset write barrier buffer on all flush paths

@gopherbot

gopherbot commented Dec 8, 2017

Change https://golang.org/cl/83015 mentions this issue: runtime: mark heapBits.bits nosplit

gopherbot pushed a commit that referenced this issue Dec 11, 2017

runtime: mark heapBits.bits nosplit
heapBits.bits is used during bulkBarrierPreWrite via
heapBits.isPointer, which means it must not be preempted. If it is
preempted, several bad things can happen:

1. This could allow a GC phase change, and the resulting shear between
the barriers and the memory writes could result in a lost pointer.

2. Since bulkBarrierPreWrite uses the P's local write barrier buffer,
if it also migrates to a different P, it could try to append to the
write barrier buffer concurrently with another write barrier. This can
result in the buffer's next pointer skipping over its end pointer,
which results in a buffer overflow that can corrupt arbitrary other
fields in the Ps (or anything in the heap, really, but it'll probably
crash from the corrupted P quickly).

Fix this by marking heapBits.bits go:nosplit. This would be the
perfect use for a recursive no-preempt annotation (#21314).

This doesn't actually affect any binaries because this function was
always inlined anyway. (I discovered it when I was modifying heapBits
and make h.bits() no longer inline, which led to rampant crashes from
problem 2 above.)

Updates #22987 and #22988 (but doesn't fix because it doesn't actually
change the generated code).

Change-Id: I60ebb928b1233b0613361ac3d0558d7b1cb65610
Reviewed-on: https://go-review.googlesource.com/83015
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Reviewed-by: Rick Hudson <rlh@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>

gopherbot pushed a commit that referenced this issue Dec 11, 2017

runtime: reset write barrier buffer on all flush paths
Currently, wbBufFlush does nothing if the goroutine is dying on the
assumption that the system is crashing anyway and running the write
barrier may crash it even more. However, it fails to reset the
buffer's "next" pointer. As a result, if there are later write
barriers on the same P, the write barrier will overflow the write
barrier buffer and start corrupting other fields in the P or other
heap objects. Often, this corrupts fields in the next allocated P
since they tend to be together in the heap.

Fix this by always resetting the buffer's "next" pointer, even if
we're not doing anything with the pointers in the buffer.

Updates #22987 and #22988. (May fix; it's hard to say.)

Change-Id: I82c11ea2d399e1658531c3e8065445a66b7282b2
Reviewed-on: https://go-review.googlesource.com/83016
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rick Hudson <rlh@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
@aclements

Member

aclements commented Jan 18, 2018

I ran 430 runs on the darwin-amd64-10_11 builder. It turns out that after ~350 runs it starts reliably failing with

--- FAIL: TestFileInfoHeader (0.00s)
        tar_test.go:214: stat testdata/small.txt: no such file or directory
--- FAIL: TestPax (0.00s)
        writer_test.go:531: stat testdata/small.txt: no such file or directory
--- FAIL: TestPaxSymlink (0.00s)
        writer_test.go:571: stat testdata/small.txt: no such file or directory
--- FAIL: TestPaxNonAscii (0.00s)
        writer_test.go:611: stat testdata/small.txt: no such file or directory
--- FAIL: TestPaxXattrs (0.00s)
        writer_test.go:670: stat testdata/small.txt: no such file or directory
--- FAIL: TestPaxHeadersSorted (0.00s)
        writer_test.go:704: stat testdata/small.txt: no such file or directory
--- FAIL: TestUSTARLongName (0.00s)
        writer_test.go:751: stat testdata/small.txt: no such file or directory
FAIL
FAIL    archive/tar     0.102s
...
--- FAIL: TestGoRunDirs (0.02s)
        go_test.go:3542: running testgo [run x.go sub/sub.go]
        go_test.go:3542: standard error:
        go_test.go:3542: stat x.go: no such file or directory
                
        go_test.go:3542: testgo failed as expected: exit status 1
        go_test.go:3543: wrong output
        go_test.go:3543: pattern named files must all be in one directory; have ./ and sub/ not found in standard error
FAIL
FAIL    cmd/go  113.023s

But if we ignore those, the few other failures I got all look like completely plausible flakes (net timeouts, etc) and not like memory corruption.

@bradfitz (or anyone) are there significant differences between how the trybots run all.bash on darwin-amd64-10_11 and me just firing up a gomote and running all.bash in a loop?

@bradfitz

Member

bradfitz commented Jan 20, 2018

@bradfitz (or anyone) are there significant differences between how the trybots run all.bash on darwin-amd64-10_11 and me just firing up a gomote and running all.bash in a loop?

Well, they don't run all.bash. They run make.bash, then go tool dist test --list to gather the list of test names, and then, for each test, go tool dist test $DIST_TEST_NAME.

But that's basically what all.bash does anyway, so I can't imagine why they'd be different.

@aclements

Member

aclements commented Jan 21, 2018

I've been running the following at 23aefcd for the past few days:

stress -p 2 bash -c 'set -e; VM=$(gomote create darwin-amd64-10_11); function xxx { gomote destroy $VM; }; trap xxx EXIT; gomote push $VM; gomote run $VM go/src/all.bash'

Of 738 runs, there are a few that may be memory corruption:

##### ../test
# go run run.go -- fixedbugs/issue11326b.go
exit status 2
# command-line-arguments
2018/01/20 11:43:06 duplicate symbol  (types 64 and 45) in  and /private/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/pkg/darwin_amd64/runtime.a(_go_.o)

FAIL    fixedbugs/issue11326b.go        0.962s
--- FAIL: TestGenFlowGraph (7.30s)
        ssa_test.go:91: Failed: exit status 2:
                Out: 
                Stderr: # reflect
                panic: runtime error: invalid memory address or nil pointer dereference
                [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x10c21db]
                
                goroutine 1 [running]:
                cmd/internal/obj.(*objWriter).writeRef(0xc421903180, 0x8, 0x0)
                        /private/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/cmd/internal/obj/objfile.go:157 +0x2b
                cmd/internal/obj.(*objWriter).writeRefs(0xc421903180, 0xc4218bfa00)
                        /private/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/cmd/internal/obj/objfile.go:190 +0x6e
                cmd/internal/obj.WriteObjFile(0xc420316000, 0xc420b41240)
                        /private/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/cmd/internal/obj/objfile.go:110 +0x236
                cmd/compile/internal/gc.dumpobj1(0x7fff5fbffb3d, 0x23, 0x3)
                        /private/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/cmd/compile/internal/gc/obj.go:160 +0x41f
                cmd/compile/internal/gc.dumpobj()
                        /private/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/cmd/compile/internal/gc/obj.go:50 +0x51
                cmd/compile/internal/gc.Main(0x182c610)
                        /private/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/cmd/compile/internal/gc/main.go:683 +0x2b0e
                main.main()
                        /private/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/cmd/compile/main.go:49 +0x89
                
FAIL
FAIL    cmd/compile/internal/gc 41.904s
Building packages and commands for darwin/amd64.
# cmd/gofmt
panic: runtime error: index out of range

goroutine 1 [running]:
cmd/internal/dwarf.HasChildren(...)
        /private/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/cmd/internal/dwarf/dwarf.go:952
cmd/link/internal/ld.reversetree(0xc422fcdf30)
        /private/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/cmd/link/internal/ld/dwarf.go:315 +0x8a
cmd/link/internal/ld.reversetree(0x12f0870)
        /private/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/cmd/link/internal/ld/dwarf.go:316 +0x6d
cmd/link/internal/ld.dwarfgeneratedebugsyms(0xc42009a000)
        /private/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/cmd/link/internal/ld/dwarf.go:1730 +0xa43
cmd/link/internal/ld.(*Link).dodata(0xc42009a000)
        /private/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/cmd/link/internal/ld/data.go:1569 +0x3413
cmd/link/internal/ld.Main(0x12e9320, 0x10, 0x20, 0x1, 0x7, 0x10, 0x11f7729, 0x1b, 0x11f4bdf, 0x14, ...)
        /private/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/cmd/link/internal/ld/main.go:224 +0xb7f
main.main()
        /private/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/cmd/link/main.go:62 +0x277
go tool dist: FAILED: /private/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/pkg/tool/darwin_amd64/go_bootstrap install -gcflags=all= -ldflags=all= std cmd: exit status 2

There are also 2 instances of

pipe: too many open files
FAIL    net     10.017s

though I suspect that's just a network flake, plus 8 network timeouts, and 4 file system failures. The file system failures are interesting, since they manifest as missing files, but I think these are an infrastructure problem, since they include two bash: run.bash: No such file or directory failures.

@aclements

Member

aclements commented Jan 21, 2018

It's interesting that these are all in cmd/compile or cmd/link with the exception of https://storage.googleapis.com/go-build-log/7d48d376/darwin-amd64-10_11_f834bf71.log. It might be possible to reproduce this faster by just running lots of builds without tests.

@josharian

Contributor

josharian commented Jan 21, 2018

Given that these are all cmd/compile and cmd/link, it might be worth running a stress test with concurrent compilation off, to rule that out—or not. (A race condition in the compiler could in theory lead to object file data corruption that would cause a linker crash.)

@josharian

Contributor

josharian commented Jan 21, 2018

(And/or run a stress test with a race-enabled toolchain.)

@aclements

Member

aclements commented Jan 22, 2018

I believe this may be a hardware problem. I started another run, this time recording the host name of failures. I've reproduced three failures that appear to be memory corruption and all three of them happened on host Gos-Mac-104.local. I think my next step is going to be to get a gomote on that particular host and a gomote on a different box, run an all.bash loop, and see if there's a significant difference in failure rate between the two.

@aclements

Member

aclements commented Jan 23, 2018

I got two more failures that look like memory corruption, but weren't on Gos-Mac-104.local. However, out of the 41 hours I ran the loop for, the three failures on Gos-Mac-104.local happened in a 1.5 hour window, after which I believe Gos-Mac-104.local fell out of the VM pool and I got only 2 more failures in the remaining 30.5 hours of running.

So I'd still like to try stress testing on Gos-Mac-104.local. Unfortunately, I don't know how to get that host back.
