Building go crashes on 144 core/1+TB RAM CentOS7 machine #18551

Closed
hartzell opened this issue Jan 6, 2017 · 33 comments
Labels: FrozenDueToAge, NeedsInvestigation

Comments
@hartzell

hartzell commented Jan 6, 2017

I happen to have access to a large system and am using it to build a variety of packages. My go builds almost always fail. Details below.

What version of Go are you using (go version)?

Bootstrapping go: from here: https://storage.googleapis.com/golang/go1.4-bootstrap-20161024.tar.gz

Go from here: https://storage.googleapis.com/golang/go1.7.4.src.tar.gz

What operating system and processor architecture are you using (go env)?

[hartzelg@rpbuchop001 go]$ go env
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH=""
GORACE=""
GOROOT="/isilon/Analysis/scratch/hartzelg/linuxbrew/Cellar/go/1.7.4/libexec"
GOTOOLDIR="/isilon/Analysis/scratch/hartzelg/linuxbrew/Cellar/go/1.7.4/libexec/pkg/tool/linux_amd64"
CC="/isilon/Analysis/scratch/hartzelg/linuxbrew/bin/gcc-5"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build188975734=/tmp/go-build -gno-record-gcc-switches"
CXX="/isilon/Analysis/scratch/hartzelg/linuxbrew/bin/g++-5"
CGO_ENABLED="1"

What did you do?

Building the go recipe in spack like so:

spack install go@1.7.4

(which builds and uses a bootstrapping system from the 1.4 snapshot).

It boils down to '/usr/bin/bash' 'all.bash'.

I'm seeing this in a job running via Jenkins that simplifies to this:

git clone https://github.com/llnl/spack
cd spack
export PATH=`pwd`/bin:$PATH
ulimit -u 8192
spack install go@1.7.4

What did you expect to see?

A successful build.

What did you see instead?

A panic during testing. Full output (1500+ lines) in this gist.

Here's the first few lines.

ok  	regexp	0.455s
ok  	regexp/syntax	1.000s
panic: test timed out after 3m0s

goroutine 22864 [running]:
panic(0x5be600, 0xc42017a010)
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/runtime/panic.go:500 +0x1a1 fp=0xc42010cf48 sp=0xc42010ceb8
testing.startAlarm.func1()
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/testing/testing.go:918 +0x10b fp=0xc42010cfb0 sp=0xc42010cf48
runtime.goexit()
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc42010cfb8 sp=0xc42010cfb0
created by time.goFunc
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/time/sleep.go:154 +0x44

goroutine 1 [chan receive, 2 minutes, locked to thread]:
runtime.gopark(0x629c68, 0xc420016598, 0x60cadd, 0xc, 0xc420110b17, 0x3)
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/runtime/proc.go:259 +0x13a fp=0xc420110b10 sp=0xc420110ae0
runtime.goparkunlock(0xc420016598, 0x60cadd, 0xc, 0xc420000117, 0x3)
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/runtime/proc.go:265 +0x5e fp=0xc420110b50 sp=0xc420110b10
runtime.chanrecv(0x5bc980, 0xc420016540, 0x0, 0xc420110c01, 0x47434c)
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/runtime/chan.go:496 +0x2df fp=0xc420110bd8 sp=0xc420110b50
runtime.chanrecv1(0x5bc980, 0xc420016540, 0x0)
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/runtime/chan.go:378 +0x35 fp=0xc420110c10 sp=0xc420110bd8
testing.(*T).Run(0xc42001a180, 0x611a7e, 0x18, 0x62a6c8, 0xc42011fc01)
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/testing/testing.go:647 +0x316 fp=0xc420110cb8 sp=0xc420110c10
testing.RunTests.func1(0xc42001a180)
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/testing/testing.go:793 +0x6d fp=0xc420110d08 sp=0xc420110cb8
testing.tRunner(0xc42001a180, 0xc420110de8)
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/testing/testing.go:610 +0x81 fp=0xc420110d30 sp=0xc420110d08
testing.RunTests(0x6295c0, 0x70f7c0, 0xc4, 0xc4, 0x43bf04)
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/testing/testing.go:799 +0x2f5 fp=0xc420110e18 sp=0xc420110d30
testing.(*M).Run(0xc420110ef8, 0x40f769)
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/testing/testing.go:743 +0x85 fp=0xc420110ea0 sp=0xc420110e18
runtime_test.TestMain(0xc420110ef8)
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/runtime/crash_test.go:27 +0x2f fp=0xc420110ef0 sp=0xc420110ea0
main.main()
	runtime/_test/_testmain.go:836 +0xc6 fp=0xc420110f58 sp=0xc420110ef0
runtime.main()
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/runtime/proc.go:183 +0x1f4 fp=0xc420110fb0 sp=0xc420110f58
runtime.goexit()
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc420110fb8 sp=0xc420110fb0

goroutine 17 [syscall, 2 minutes, locked to thread]:
runtime.goexit()
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc42010efb8 sp=0xc42010efb0

Some time later (15 minutes after the test failed), I still have a go-build-related process running:

hartzelg  94605  400  0.0   4060  1740 ?        Rl   14:33  86:51 /tmp/go-build307875912/a.exe

and Jenkins thinks that the job is still running.

@minux
Member

minux commented Jan 6, 2017 via email

@bradfitz
Contributor

bradfitz commented Jan 6, 2017

Are you sure this is related to the machine's size? Does this build fail the same way on a smaller machine with the same software?

@hartzell
Author

hartzell commented Jan 6, 2017

Do you still have the a.exe running? Is it possible to obtain a backtrace
of all running threads using gdb?

No, it is not still running. I have something else running for a bit, but I can rerun this later and, if it ends up in the same state, try to grab the backtrace (but see below...).

Can you give me instructions on getting the backtrace?
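
(For the record, the usual recipe is roughly the following — a sketch assuming gdb is installed on the box and attaching to your own processes is allowed; the PID is the one from the ps output above:)

# attach to the leftover test binary (PID taken from the ps output above)
gdb -p 94605
# inside gdb: dump every thread's stack, then let the process go again
(gdb) thread apply all bt
(gdb) detach
(gdb) quit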

@hartzell
Author

hartzell commented Jan 6, 2017

Are you sure this is related to the machine's size?

Not sure, but it seems related to this box. I was originally building on a 16GB Digital Ocean instance without any trouble. I anecdotally believe that I've built successfully "in-house" on other systems, but since it occasionally succeeds on the big box I'm not really sure.

It's also possible that the box has hardware issues (though I haven't seen any with other tools).

I can find/use a smaller box with the same config and see what it does though it'll take a day or two.

@minux
Member

minux commented Jan 6, 2017 via email

@hartzell
Author

hartzell commented Jan 6, 2017

The test that times out is: runtime_test.TestCrashDumpsAllThreads

Is it possible that whatever it's doing just takes more than the test timeout (and more than the 15 minutes that passed while I was writing up the issue), perhaps because of the size of the system?
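
(If it really is just slow, one experiment would be to run that single test by hand with a generous timeout and see whether it ever finishes — a sketch, assuming the staged tree from the stack trace above is still in place:)

cd /tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go
# run only the suspect test, with a much longer timeout than the build harness uses
GOROOT=`pwd` ./bin/go test -timeout 20m -run TestCrashDumpsAllThreads runtime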

@davecheney
Contributor

davecheney commented Jan 6, 2017 via email

@hartzell
Author

hartzell commented Jan 6, 2017

Between filing the issue and reading @minux's and @bradfitz's follow-ups, I started a run with GOMAXPROCS=16 as a grasping-at-straws debugging effort, and also switched the working directory from an NFS mount to something local (because debugging is more fun if you change multiple things at the same time, sigh... but seriously, I'm also trying to debug some infrastructure issues).

I'll let it finish (because I'm curious), swap the working dir back, and see if I can catch that "leftover" process running.

@bradfitz
Contributor

bradfitz commented Jan 6, 2017

NFS has been problematic in the past, IIRC.

@hartzell
Author

hartzell commented Jan 6, 2017

NFS has been problematic in the past, IIRC.

That's generally how I feel about it... 🙁

@minux
Member

minux commented Jan 6, 2017 via email

@hartzell
Author

hartzell commented Jan 6, 2017

Having gotten the snark out of my system, it's worth noting that even though I've moved the Jenkins working directory, Spack (unless you configure it otherwise) runs its unbundle, build and install steps in a directory tree symlinked into /tmp, which in my case is a 15TB local filesystem.

The change will speed up some other parts of the job (recursively removing the tree takes forever in our NFS environment) but after consideration I don't think it will change the outcome of this issue.

Still, it's a change and it's worth being up front about it.

@hartzell
Author

[sorry for the delay, been down with a cold...]

Last week I fired off a build w/ GOMAXPROCS=16 before I received the feedback above.

The gist is here.

I think it's different, but I'm guessing....

Unless there's new feedback my next step will be to try to recreate the original crash and catch the errant process in gdb.

@hartzell
Author

With one caveat, I have been unable to get the build to fail for go@1.8rc1.

The caveat is that I seem to need to adjust the maximum number of user processes upward. The default seems to be 4096. Left as-is, I see this:

##### ../test/bench/go1
testing: warning: no tests to run
PASS
ok  	_/tmp/hartzelg/spack-stage/spack-stage-D6YdOI/go/test/bench/go1	1.916s

##### ../test
runtime: failed to create new OS thread (have 14 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc

runtime stack:
runtime.throw(0x55a76e, 0x9)
	/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/panic.go:596 +0x95
runtime.newosproc(0xc42023e400, 0xc42012c000)
	/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/os_linux.go:163 +0x18c
runtime.newm(0x564320, 0x0)
	/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/proc.go:1628 +0x137
runtime.startTheWorldWithSema()
	/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/proc.go:1122 +0x1b6
runtime.systemstack(0xc420037300)
	/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/asm_amd64.s:327 +0x79
runtime.mstart()
	/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/proc.go:1132

goroutine 305 [running]:
runtime.systemstack_switch()
	/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/asm_amd64.s:281 fp=0xc42027ec98 sp=0xc42027ec90
runtime.gcStart(0x0, 0xc42055c000)
	/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/mgc.go:1070 +0x29d fp=0xc42027ecd0 sp=0xc42027ec98
runtime.mallocgc(0x80000, 0x52c080, 0xc420388001, 0xc42027edf0)
	/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/malloc.go:773 +0x48f fp=0xc42027ed70 sp=0xc42027ecd0
runtime.makeslice(0x52c080, 0x7fe00, 0x7fe00, 0x0, 0x0, 0x3fe00)
	/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/slice.go:54 +0x7b fp=0xc42027edc0 sp=0xc42027ed70
bytes.makeSlice(0x7fe00, 0x0, 0x0, 0x0)
	/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/bytes/buffer.go:201 +0x77 fp=0xc42027ee00 sp=0xc42027edc0
bytes.(*Buffer).ReadFrom(0xc420492230, 0x5f54c0, 0xc4203c20c8, 0x2aaaaacad028, 0xc420492230, 0x1)
	/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/bytes/buffer.go:173 +0x2ab fp=0xc42027ee70 sp=0xc42027ee00
io.copyBuffer(0x5f52c0, 0xc420492230, 0x5f54c0, 0xc4203c20c8, 0x0, 0x0, 0x0, 0x4c5fbb, 0xc420225600, 0x6c6961642f656301)
	/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/io/io.go:384 +0x2cb fp=0xc42027eed8 sp=0xc42027ee70
io.Copy(0x5f52c0, 0xc420492230, 0x5f54c0, 0xc4203c20c8, 0xc420492230, 0x302e312d32700101, 0xc420457fc8)
	/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/io/io.go:360 +0x68 fp=0xc42027ef38 sp=0xc42027eed8
os/exec.(*Cmd).writerDescriptor.func1(0x2d6b636170732f67, 0x6270722d656d6963)
	/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/os/exec/exec.go:254 +0x4d fp=0xc42027ef98 sp=0xc42027ef38
os/exec.(*Cmd).Start.func1(0xc420225600, 0xc4202f6020)
	/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/os/exec/exec.go:371 +0x27 fp=0xc42027efd0 sp=0xc42027ef98
runtime.goexit()
	/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc42027efd8 sp=0xc42027efd0
created by os/exec.(*Cmd).Start
	/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/os/exec/exec.go:372 +0x4e4

goroutine 1 [chan receive, 1 minutes]:
main.(*tester).runPending(0xc4201540f0, 0x0)
	/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/cmd/dist/test.go:970 +0x31f
main.(*tester).run(0xc4201540f0)
	/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/cmd/dist/test.go:214 +0x91a
main.cmdtest()
	/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/cmd/dist/test.go:42 +0x2bb
main.xmain()
	/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/cmd/dist/main.go:43 +0x233
main.main()
	/tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.8rc1-3xsr7krvb6myjjrepxa23xgd4ih6gk5t/src/cmd/dist/util.go:509 +0x283
[...]

If I set ulimit -u 8192, then I am 5 for 5 on successful builds of go@1.8rc1.
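
(For anyone else who trips over this: a quick sketch of checking and raising the limit. The limits.d file is where a stock CentOS 7 pins non-root users to 4096, if I remember the default location correctly.)

# show the current soft limit on user processes
ulimit -u
# raise it for this shell before kicking off the build
ulimit -u 8192
# on CentOS 7 the 4096 default usually comes from here
cat /etc/security/limits.d/20-nproc.conf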

I'm inclined to close this and move on, but would appreciate feedback about whether there's something more useful to be done here.

@bradfitz bradfitz added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 13, 2017
@bradfitz bradfitz added this to the Go1.9Maybe milestone Jan 13, 2017
@hartzell
Author

hartzell commented Feb 3, 2017

FWIW, go@1.7.5 and ulimit -u 8192 also seem to work.

@hartzell
Author

hartzell commented Feb 6, 2017

I've started seeing failures again. One thing that I've noticed is that the machine is very heavily loaded (another large compute-bound job is keeping all cores busy).

I see no sign of the straggling process that I reported earlier.

Here's a gist of the test failure.

https://gist.github.com/hartzell/5d3433593da9d679107c40273c4242f9

@RLH
Contributor

RLH commented Feb 6, 2017 via email

@hartzell
Author

hartzell commented Feb 6, 2017

Is there a way to extend the timeout? @minux mentioned above that the test itself was quick and that if it took any appreciable time, it was an indication that something was amiss.

@bradfitz
Contributor

bradfitz commented Feb 6, 2017

It takes 16 seconds on my 2 core machine. Extending the timeout doesn't seem like the answer.

@hartzell
Author

hartzell commented Feb 6, 2017

Hmmm. Would that scale linearly with core count? Would having 1.5TB of RAM (even if not all in use) make it take longer?

((16 seconds / 2 cores) * 144 cores) * (1 minute / 60 seconds) = 19.2 minutes

@hartzell
Author

hartzell commented Feb 6, 2017

Reading through the test to look for timeouts etc. showed me a reference to GOMAXPROCS. For grins (a.k.a. grasping at straws) I just started a run with export GOMAXPROCS=8 near the top, thinking that it might limit the number of threads/goroutines/procs/... that were involved.

@hartzell
Author

hartzell commented Feb 6, 2017

It failed with GOMAXPROCS=8, but with a much shorter stack trace (which at least suggests that I turned a knob that was hooked up to something).

https://gist.github.com/hartzell/6363949312627aa1d417124e1ac7b3fc

I was back in that terminal session fairly quickly after it failed and did not see any build-related processes hanging around.

Is there any way that I can either instrument this or provoke it into telling me something useful?

@hartzell
Author

hartzell commented Feb 6, 2017

On rereading I was reminded that @minux pointed out that I can run the test "by hand". Using the build that just failed, I can't make it fail.

[hartzelg@rpbuchop001 go]$ GOROOT=`pwd` ./bin/go test runtime -run=TestCrashDumpsAllThreads
ok  	runtime	0.864s
[hartzelg@rpbuchop001 go]$ GOROOT=`pwd` ./bin/go test runtime -run=TestCrashDumpsAllThreads
ok  	runtime	0.882s
[hartzelg@rpbuchop001 go]$ uptime
 12:46:07 up 17 days, 23:14,  1 user,  load average: 142.63, 145.37, 146.39

And then just to work on my lunch for a while (each iteration seems to take ~5 seconds, the load is ~160, most cores around 100%):

[hartzelg@rpbuchop001 go]$ while (true)
> do
> GOROOT=`pwd` ./bin/go test runtime -run=TestCrashDumpsAllThreads
> done
ok  	runtime	0.749s
ok  	runtime	0.830s
ok  	runtime	0.840s
ok  	runtime	0.678s
ok  	runtime	0.906s
ok  	runtime	0.917s
ok  	runtime	0.847s
ok  	runtime	0.727s
ok  	runtime	0.790s
ok  	runtime	0.785s
ok  	runtime	0.720s
ok  	runtime	1.013s
ok  	runtime	0.765s
ok  	runtime	0.850s
ok  	runtime	0.719s
ok  	runtime	1.070s
ok  	runtime	1.240s
ok  	runtime	0.900s
ok  	runtime	0.997s
ok  	runtime	0.864s
ok  	runtime	0.865s
ok  	runtime	0.982s
ok  	runtime	0.723s
ok  	runtime	0.904s
ok  	runtime	0.929s
ok  	runtime	0.955s
ok  	runtime	0.709s
ok  	runtime	0.779s
ok  	runtime	0.843s
ok  	runtime	0.860s
ok  	runtime	0.840s
ok  	runtime	0.887s
ok  	runtime	0.837s
ok  	runtime	0.773s
ok  	runtime	0.760s
ok  	runtime	1.047s

Is there some way that running the build inside Spack (roughly analogous to Homebrew...) could be making this fail? They do a bit of I/O redirection magic, but I don't think there's any deep voodoo.

@davecheney
Contributor

davecheney commented Feb 6, 2017 via email

@hartzell
Author

hartzell commented Feb 7, 2017

[edit to add comment about what I was trying to do...]

@davecheney -- I'm not sure how else to do it. The Python code that did the build (via Spack) is here.

I came along afterwards and did this:

[hartzelg@rpbuchop001 go]$ ./bin/go version
go: cannot find GOROOT directory: /tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.7.5-r2ql2ftjxnlu4esctfokszedfrooyf63

Actually, I was trying to run the tests as @minux suggested; version is just a stand-in...

Setting GOROOT seemed like an expedient way to use the uninstalled tree full of stuff that I'd built, and it appears (Danger Will Robinson...) to work.

What should I have done?

@davecheney
Contributor

davecheney commented Feb 7, 2017 via email

@hartzell
Author

hartzell commented Feb 7, 2017

Sure. But the problem is that I'm poking at a build that failed, so nothing got copied anywhere and I'm trying to run the intermediate product through one of the tests by hand to see if I can figure out the heisenbug.

@davecheney
Contributor

davecheney commented Feb 7, 2017 via email

@hartzell
Author

hartzell commented Feb 7, 2017

I can (and will) try to recreate the failure running the build by hand. That might be a useful simplification.

The build system does not use GOROOT (at least not in any way that I can suss out). The failures that happen there are almost certainly not related to GOROOT.

The build does set GOROOT_FINAL to the location where the tree will be copied if the build succeeds. That seems to be consistent with the documentation for building from source.

The only use of GOROOT, which you've latched on to, was me trying to figure out how to follow @minux's instructions on how to run the failing test by hand. I take your point to heart and believe that the results I get when I use GOROOT to run the tests within the staging directory might be misleading.

Given that I have a tree that's been built with GOROOT_FINAL set and that I am trying to debug a problem with that tree, what is my alternative to setting GOROOT?

@hartzell
Author

hartzell commented Feb 7, 2017

I've been running the build by hand.

I have ulimit -u 8192 set; otherwise I see this:

##### ../test
runtime: failed to create new OS thread (have 11 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc

The build just succeeded 3 for 3. That's not as comforting as it might be; there have been periods where it has worked via Spack before.

The automated (Jenkins, Spack) build failed last night. The automated job has export GOMAXPROCS=4 because I was playing with it and left it in. The failure was the same one I've been seeing, with the smaller stack trace that results from setting GOMAXPROCS.

The machine has a long-running job that is keeping all of the cores mostly busy; it was running yesterday and last night and is running now. Load is around 140. Plenty of free RAM.

The Spack build framework does a bit of magic with filehandles (and possibly other things) to manage the tasks that it runs.

@minux mentions that the failing test (or the thing being tested) involves "signal trickery". Is it possible that the Spack framework is doing something that's getting in the way?

That seems fragile, and if it were the case I'd expect the Spack build to never succeed, but I'm grasping at straws, so...
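
(For context on the "signal trickery", as I understand it: the test starts a helper binary with GOTRACEBACK=crash and sends it SIGQUIT, expecting the runtime to print a stack for every thread. Something like this pokes the same mechanism by hand; the binary name is just a placeholder:)

# run any Go binary with full crash dumps enabled, then hit it with SIGQUIT
GOTRACEBACK=crash ./some-go-binary &
kill -QUIT $!
# the runtime should dump all goroutine/thread stacks before dying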

@hartzell
Author

I'm still casting about for a way to get a handle on this.

As I've detailed above, the core of my test case is building go via spack, like so:

spack install go@1.8   # 1.8 has been released since I started this thread....

I run a build every night as a Jenkins job. The Jenkins master is a Docker container and the job runs in a slave executor on a real Linux system (no VM, no container). That build seems to nearly always fail.

I sometimes check out a Spack tree and run the command by hand; that seems to generally work these days (though when I filed this issue my initial test case w/ 1.7.3 failed when run by hand).

I can also cd into the Spack clone that failed e.g. last night and run spack install go@1.8 and it works consistently (since 1.8 at least).

I've moved the build from the 144-core machine onto a 24-core box. When building by hand (at least) I no longer need ulimit -u 8192. The Jenkins build still fails. Runs by hand still work.

Is there anything about runtime_test.TestCrashDumpsAllThreads that might interact badly with a Jenkins master->slave ssh session?
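
(One cheap experiment, since by-hand runs behave differently from the Jenkins ones: add something like this to the start of the Jenkins job step and diff it against the same commands in an interactive shell on the slave — a sketch, nothing Jenkins-specific assumed:)

# capture the limits and relevant environment the build actually runs under
ulimit -a
env | sort | egrep 'GO|PATH|TMP'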

@hartzell
Author

I've pulled the go build out into a separate job that just does:

#!/bin/sh
set -e
spack install go@1.8

This builds a bootstrap go@1.4 and uses it to build 1.8.

It crashes. Here is a gist of an example crash. It seems to be a timeout but does not mention TestCrashDumpsAllThreads as the earlier ones did.

On the other hand, the next one (run without the -e) does mention TestCrashDumpsAllThreads; gist here.

@hartzell
Author

I've reproduced this problem on a smaller machine and w/out involving Spack.

Jenkins seems to be central to the problem. I've opened a new issue: #19203.
