Building go crashes on 144 core/1+TB RAM CentOS7 machine #18551
The test that times out is: runtime_test.TestCrashDumpsAllThreads
Do you still have the a.exe running? Is it possible to obtain a backtrace of all running threads using gdb?
Are you sure this is related to the machine's size? Does this build fail the same way on a smaller machine with the same software?
No, it is not still running. I have something else running for a bit, but I can rerun this later and, if it ends up in the same state, try to grab the backtrace (but see below...). Can you give me instructions on getting the backtrace?
Not sure, but it seems related to this box. I was originally building successfully on a DigitalOcean 16GB instance without any trouble. I anecdotally believe that I've successfully built "in-house" on other systems, but since it occasionally succeeds on the big box I'm not really sure. It's also possible that the box has hardware issues (though I haven't seen problems with any other tools). I can find/use a smaller box with the same config and see what it does, though it'll take a day or two.
Find out the pid of the stray process, run
gdb -p $PID
and at the gdb prompt, run the following:
set logging file /tmp/gdb.log
set logging on
set height 0
thread apply all bt
quit
And then paste the output (in /tmp/gdb.log).
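Equivalently, the whole backtrace can be captured non-interactively with gdb's batch mode (a hedged one-liner version of the steps above):

gdb -p "$PID" -batch -ex 'set height 0' -ex 'thread apply all bt' > /tmp/gdb.log 2>&1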
You can also try running the test directly (after building with make.bash):
go test runtime -run=TestCrashDumpsAllThreads
Also note that Go 1.8 will be released in a month, so please also
give the latest master branch a try.
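A minimal sketch of trying master, assuming git access and an existing Go 1.4+ toolchain to bootstrap with:

git clone https://go.googlesource.com/go
cd go/src
GOROOT_BOOTSTRAP=$HOME/go1.4 ./all.bash   # builds master, then runs the full test suite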
Is it possible that whatever it's doing just takes more than the test timeout (and more than the 15 minutes that passed while I was writing up the issue), perhaps because of the size of the system?
On non-Intel systems I'd say yes. But on Intel, when this test hangs, adding more time won't help; it's either a real bug or a flaky test.
Between filing the issue and reading @minux's and @bradfitz's followups, I started a run with … I'll let it finish (because I'm curious), swap the working dir back, and see if I can catch that "leftover" process running.
NFS has been problematic in the past, IIRC.
That's generally how I feel about it... 🙁
The TestCrashDumpsAllThreads test (https://golang.org/src/runtime/crash_unix_test.go#L26) shouldn't take much time to run. If it hangs, it's usually some problem in the implementation of the crash dump (which involves some signal trickery). From the test log, the test timed out while waiting for the process to finish the crash dump:
goroutine 22861 [syscall, 2 minutes]:
syscall.Syscall6(0xf7, 0x1, 0x1718d, 0xc420474000, 0x1000004, 0x0, 0x0, 0xc42010bb07, 0xc420037500, 0xc42014cd00)
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/syscall/asm_linux_amd64.s:44 +0x5 fp=0xc42010baf0 sp=0xc42010bae8
os.(*Process).blockUntilWaitable(0xc42000d230, 0xe, 0xc42010bd1f, 0x1)
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/os/wait_waitid.go:28 +0xbc fp=0xc42010bb88 sp=0xc42010baf0
os.(*Process).wait(0xc42000d230, 0x506471, 0xc42000d244, 0x3)
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/os/exec_unix.go:22 +0xab fp=0xc42010bc18 sp=0xc42010bb88
os.(*Process).Wait(0xc42000d230, 0x0, 0x3, 0x0)
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/os/doc.go:49 +0x2b fp=0xc42010bc48 sp=0xc42010bc18
os/exec.(*Cmd).Wait(0xc420190160, 0x6f7f00, 0xc42010bd60)
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/os/exec/exec.go:434 +0x6d fp=0xc42010bcd8 sp=0xc42010bc48
runtime_test.TestCrashDumpsAllThreads(0xc42001a240)
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/runtime/crash_unix_test.go:99 +0x849 fp=0xc42010bf78 sp=0xc42010bcd8
testing.tRunner(0xc42001a240, 0x62a6c8)
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/testing/testing.go:610 +0x81 fp=0xc42010bfa0 sp=0xc42010bf78
runtime.goexit()
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc42010bfa8 sp=0xc42010bfa0
created by testing.(*T).Run
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/testing/testing.go:646 +0x2ec
Therefore the test has already successfully sent SIGQUIT to the process, but the process fails to stop, indicating something is wrong with signal handling in the runtime.
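To see the mechanism the test exercises in isolation: with GOTRACEBACK=crash, SIGQUIT makes the runtime dump every goroutine and thread and then abort, which is essentially what the test waits for. A minimal sketch (the spin program is made up just for this demo; assumes a working go on PATH):

cat > /tmp/spin.go <<'EOF'
package main

import "time"

func main() {
	// a few goroutines so there is something to dump
	for i := 0; i < 4; i++ {
		go func() {
			for {
				time.Sleep(time.Second)
			}
		}()
	}
	time.Sleep(time.Hour)
}
EOF
go build -o /tmp/spin /tmp/spin.go
GOTRACEBACK=crash /tmp/spin &
sleep 1
kill -QUIT $!   # runtime should print a traceback for every thread, then abort
wait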
Having gotten the snark out of my system, it's worth noting that even though I've moved the Jenkins working directory, Spack (unless you configure it otherwise) runs its unbundle, build, and install steps in a directory tree symlinked into /tmp (the spack-stage paths in the traceback above). The change will speed up some other parts of the job (recursively removing the tree takes forever in our NFS environment), but after consideration I don't think it will change the outcome of this issue. Still, it's a change and it's worth being up front about it.
[sorry for the delay, been down with a cold...] Last week I fired off a build with GOMAXPROCS=16 before I received the feedback above. The gist is here. I think it's different, but I'm guessing.... Unless there's new feedback, my next step will be to try to recreate the original crash and catch the errant process in gdb.
With one caveat, I have been unable to get the build to fail for go@1.8rc1. The caveat is that I seem to need to adjust the maximum number of user processes upward. The default seems to be 4096. Left as-is, I see this:
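The limit being adjusted is the shell's per-user process cap; a hedged sketch of inspecting and raising it in bash before the build (16384 is an arbitrary example value):

ulimit -u        # show the current max user processes (4096 here)
ulimit -u 16384  # raise it for this shell and its children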
If I … I'm inclined to close this and move on, but would appreciate feedback about whether there's something more useful to do here.
FWIW, go@1.7.5 and …
I've started seeing failures again. One thing that I've noticed is that the machine is very heavily loaded (another large compute-bound job is keeping all cores busy). I see no sign of the straggling process that I reported earlier. Here's a gist of the test failure: https://gist.github.com/hartzell/5d3433593da9d679107c40273c4242f9
If the Go process is set to time out after 3 minutes and the OS decides not to give the Go process any CPU for 3 minutes, I would not be surprised that the Go process times out.
Is there a way to extend the timeout? @minux mentioned above that the test itself was quick, and that if it took any appreciable time it was an indication that something was amiss.
It takes 16 seconds on my 2-core machine. Extending the timeout doesn't seem like the answer.
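For completeness: when the test binary is run directly rather than through all.bash, the per-binary timeout can be adjusted with the standard go test flag (a hedged example; the dist harness applies its own limits during all.bash):

go test -timeout 30m -run=TestCrashDumpsAllThreads runtime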
Hmmm. Would that scale linearly with core count? Would having 1.5TB of RAM (even if not all in use) make it take longer?
Reading through the test to look for timeouts etc. showed me a reference to GOMAXPROCS. For grins (aka grasping at straws) I just started a run with GOMAXPROCS=8.
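Presumably along these lines; a hedged reconstruction, since the exact invocation was elided:

GOMAXPROCS=8 bash all.bash   # caps the runtime's parallelism for the build and its tests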
It failed with GOMAXPROCS=8, but with a much shorter stack trace (which at least suggests that I turned a knob that was hooked up to something). https://gist.github.com/hartzell/6363949312627aa1d417124e1ac7b3fc I was back in that terminal session fairly quickly after it failed and did not see any build-related processes hanging around. Is there any way that I can either instrument this or provoke it into telling me something useful?
On rereading I was reminded that @minux pointed out that I can run the test "by hand". Using the build that just failed, I can't make it fail. And then, just to work on my lunch for a while (each iteration seems to take ~5 seconds, the load is ~160, most cores around 100%): Is there some way that running the build inside Spack (roughly analogous to Homebrew...) could be making this fail? They do a bit of IO redirection magic, but I don't think there's any deep voodoo.
You should not set GOROOT if you have built from source.
On Tue, 7 Feb 2017, 08:23, George Hartzell wrote:
***@***.*** go]$ GOROOT=`pwd` ./bin/go test runtime -run=TestCrashDumpsAllThreads
ok runtime 0.864s
***@***.*** go]$ GOROOT=`pwd` ./bin/go test runtime -run=TestCrashDumpsAllThreads
ok runtime 0.882s
***@***.*** go]$ uptime
12:46:07 up 17 days, 23:14, 1 user, load average: 142.63, 145.37, 146.39
***@***.*** go]$ while (true)
> do
> GOROOT=`pwd` ./bin/go test runtime -run=TestCrashDumpsAllThreads
> done
ok runtime 0.749s
ok runtime 0.830s
ok runtime 0.840s
ok runtime 0.678s
ok runtime 0.906s
ok runtime 0.917s
ok runtime 0.847s
ok runtime 0.727s
ok runtime 0.790s
ok runtime 0.785s
ok runtime 0.720s
ok runtime 1.013s
ok runtime 0.765s
ok runtime 0.850s
ok runtime 0.719s
ok runtime 1.070s
ok runtime 1.240s
ok runtime 0.900s
ok runtime 0.997s
ok runtime 0.864s
ok runtime 0.865s
ok runtime 0.982s
ok runtime 0.723s
ok runtime 0.904s
ok runtime 0.929s
ok runtime 0.955s
ok runtime 0.709s
ok runtime 0.779s
ok runtime 0.843s
ok runtime 0.860s
ok runtime 0.840s
ok runtime 0.887s
ok runtime 0.837s
ok runtime 0.773s
ok runtime 0.760s
ok runtime 1.047s
[edit to add comment about what I was trying to do...] @davecheney -- I'm not sure how else to do it. The Python code that did the build (via Spack) is here. I came along afterwards and did this:
Actually, I was trying to run the tests as @minux suggested; version is just a stand-in.... Setting GOROOT seemed like an expedient way to use the uninstalled tree full of stuff that I'd built, and it appears (Danger, Will Robinson...) to work. What should I have done?
If your CI builds go in a random directory and then moves the files after the build, it should set the variable GOROOT_FINAL to the final location.
On Tue, 7 Feb 2017, 11:56, George Hartzell wrote:
I came along afterwards and did this:
***@***.*** go]$ ./bin/go version
go: cannot find GOROOT directory: /tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.7.5-r2ql2ftjxnlu4esctfokszedfrooyf63
Sure. But the problem is that I'm poking at a build that failed, so nothing got copied anywhere, and I'm trying to run the intermediate product through one of the tests by hand to see if I can figure out the heisenbug.
Is it possible to take your build system out of the picture and build from
source directly? I'm really concerned that setting GOROOT introduces the
possibility of pointing the go tool at a different version of Go which has
a long history of causing weird bugs.
I can (and will) try to recreate the failure running the build by hand. That might be a useful simplification.

The build system does not use GOROOT (at least not in any way that I can suss out). The failures that happen there are almost certainly not related to GOROOT. The build does set GOROOT_FINAL to the location where the tree will be copied if the build succeeds. That seems to be consistent with the documentation for building from source.

The only use of GOROOT, which you've latched on to, was me trying to figure out how to follow @minux's instructions on how to run the failing test by hand. I take your point to heart and believe that the results I get when I use GOROOT to run the tests within the staging directory might be misleading.

Given that I have a tree that's been built with GOROOT_FINAL set, and that I am trying to debug a problem with that tree, what is my alternative to setting GOROOT?
I've been running the build by hand. I have …

The build just succeeded 3 for 3. That's not as comforting as it might be; there have been periods where it has worked via Spack before. The automated (Jenkins, Spack) build failed last night. The automated job has … The machine has a long-running job that is keeping all of the cores mostly busy; it was running yesterday and last night and is running now. Load is around 140. Plenty of free RAM.

The Spack build framework does a bit of magic with file handles (and possibly other things) to manage the tasks that it runs. @minux mentions that the failing test (or the thing being tested) involves "signal trickery". Is it possible that the Spack framework is doing something that's getting in the way? That seems fragile, and if it were, I'd expect that the Spack build would never succeed, but I'm grasping at straws, so....
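If the suspicion is that the harness blocks or ignores signals in a way the child inherits, one low-tech check is to compare the kernel's per-process signal state for a shell started under the Spack/Jenkins environment against a plain login shell (SigBlk/SigIgn/SigCgt are standard /proc fields; nothing here is Spack-specific):

grep -E '^Sig(Blk|Ign|Cgt)' /proc/$$/status   # run inside each environment and diff the masks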
I'm still casting about for a way to get a handle on this. As I've detailed above, the core of my test case is building go via spack, like so:
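A hedged reconstruction of the elided command, assuming a stock Spack checkout (the exact version spec may have differed):

spack install go   # per the report, this bottoms out in '/usr/bin/bash' 'all.bash'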
I run a build every night as a Jenkins job. The Jenkins master is a Docker container and the job runs in a slave executor on a real Linux system (no VM, no container). That build seems to nearly always fail. I sometimes check out a Spack tree and run the command by hand; that seems to generally work these days (though when I filed this issue, my initial test case with 1.7.3 failed when run by hand). I can also cd into the Spack clone that failed, e.g. last night, and run … I've moved the build from the 144-core machine onto a 24-core box. When building by hand (at least) I no longer need to … Is there anything about …
I've pulled the go build out into a separate job that just does … This builds a bootstrap go@1.4 and uses it to build 1.8. It crashes. Here is a gist of an example crash. It seems to be a timeout but does not mention … On the other hand, the next one (run without the …)
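For context, a hedged sketch of what such a two-stage job amounts to (all paths illustrative):

cd /build/go1.4-bootstrap/src && ./make.bash     # stage 1: build the 1.4 bootstrap toolchain
export GOROOT_BOOTSTRAP=/build/go1.4-bootstrap   # point the real build at it
cd /build/go1.8/src && ./all.bash                # stage 2: build 1.8 and run the full test suite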
I've reproduced this problem on a smaller machine and without involving Spack. Jenkins seems to be central to the problem. I've opened a new issue: #19203.
I happen to have access to a large system and am using it to build a variety of packages. My go builds almost always fail. Details below.
What version of Go are you using (go version)?
Bootstrapping Go from here: https://storage.googleapis.com/golang/go1.4-bootstrap-20161024.tar.gz
Go from here: https://storage.googleapis.com/golang/go1.7.4.src.tar.gz

What operating system and processor architecture are you using (go env)?

What did you do?
Building the go recipe in Spack like so:
(which builds and uses a bootstrapping system from the 1.4 snapshot). It boils down to '/usr/bin/bash' 'all.bash'.
I'm seeing this in a job running via Jenkins that simplifies to this:
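A hedged sketch of what that Jenkins job reduces to (clone location illustrative; the Spack repository is the one linked earlier in the thread):

git clone https://github.com/llnl/spack.git
cd spack
./bin/spack install go   # which in turn runs '/usr/bin/bash' 'all.bash' in the stage dir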
What did you expect to see?
A successful build.
What did you see instead?
A panic during testing. Full output (1500+ lines) in this gist.
Here are the first few lines:
Some time later (15 minutes after the test failed), I still have a go-build-related process running:
and Jenkins thinks that the job is still running.