Building go crashes on 144 core/1+TB RAM CentOS7 machine #18551
The test that times out is: runtime_test.TestCrashDumpsAllThreads
Do you still have the a.exe running? Is it possible to obtain a backtrace of all running threads using gdb?
Are you sure this is related to the machine's size? Does this build fail the same way on a smaller machine with the same software?
No, it is not still running. I have something else running for a bit, but I can rerun this later and, if it ends up in the same state, try to grab the backtrace (but see below...). Can you give me instructions on getting the backtrace?
Not sure, but it seems related to this box. I was originally building successfully on a DigitalOcean 16GB instance without any trouble. I anecdotally believe that I've successfully built "in-house" on other systems, but since it occasionally succeeds on the big box I'm not really sure. It's also possible that the box has hardware issues (though I haven't seen problems with any other tools). I can find/use a smaller box with the same config and see what it does, though it'll take a day or two.
Find out the pid of the stray process, run
gdb -p $PID
and at the gdb prompt, run the following:
set logging file /tmp/gdb.log
set logging on
set height 0
thread apply all bt
quit
And then paste the output (in /tmp/gdb.log).
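Equivalently, the whole backtrace can be captured non-interactively with gdb's batch mode (a hedged one-liner version of the steps above):

gdb -p "$PID" -batch -ex 'set height 0' -ex 'thread apply all bt' > /tmp/gdb.log 2>&1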
You can also try running the test directly (after building with make.bash):
go test runtime -run=TestCrashDumpsAllThreads
Also note that Go 1.8 will be released in a month, so please also
give the latest master branch a try.
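A minimal sketch of trying master, assuming git access and an existing Go 1.4+ toolchain to bootstrap with:

git clone https://go.googlesource.com/go
cd go/src
GOROOT_BOOTSTRAP=$HOME/go1.4 ./all.bash   # builds master, then runs the full test suite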
Is it possible that whatever it's doing just takes more than the test timeout (and more than the 15 minutes that passed while I was writing up the issue), perhaps because of the size of the system?
On non-Intel systems I'd say yes. But on Intel, when this test hangs, adding more time won't help; it's either a real bug or a flaky test.
Between filing the issue and reading @minux's and @bradfitz's followups, I started a run with … I'll let it finish (because I'm curious), swap the working dir back, and see if I can catch that "leftover" process running.
NFS has been problematic in the past, IIRC.
That's generally how I feel about it... 🙁
The TestCrashDumpsAllThreads test (https://golang.org/src/runtime/crash_unix_test.go#L26) shouldn't take much time to run. If it hangs, it's usually some problem in the implementation of the crash dump (which involves some signal trickery). From the test log, the test timed out while waiting for the process to finish the crash dump:
goroutine 22861 [syscall, 2 minutes]:
syscall.Syscall6(0xf7, 0x1, 0x1718d, 0xc420474000, 0x1000004, 0x0, 0x0, 0xc42010bb07, 0xc420037500, 0xc42014cd00)
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/syscall/asm_linux_amd64.s:44 +0x5 fp=0xc42010baf0 sp=0xc42010bae8
os.(*Process).blockUntilWaitable(0xc42000d230, 0xe, 0xc42010bd1f, 0x1)
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/os/wait_waitid.go:28 +0xbc fp=0xc42010bb88 sp=0xc42010baf0
os.(*Process).wait(0xc42000d230, 0x506471, 0xc42000d244, 0x3)
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/os/exec_unix.go:22 +0xab fp=0xc42010bc18 sp=0xc42010bb88
os.(*Process).Wait(0xc42000d230, 0x0, 0x3, 0x0)
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/os/doc.go:49 +0x2b fp=0xc42010bc48 sp=0xc42010bc18
os/exec.(*Cmd).Wait(0xc420190160, 0x6f7f00, 0xc42010bd60)
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/os/exec/exec.go:434 +0x6d fp=0xc42010bcd8 sp=0xc42010bc48
runtime_test.TestCrashDumpsAllThreads(0xc42001a240)
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/runtime/crash_unix_test.go:99 +0x849 fp=0xc42010bf78 sp=0xc42010bcd8
testing.tRunner(0xc42001a240, 0x62a6c8)
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/testing/testing.go:610 +0x81 fp=0xc42010bfa0 sp=0xc42010bf78
runtime.goexit()
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/runtime/asm_amd64.s:2086 +0x1 fp=0xc42010bfa8 sp=0xc42010bfa0
created by testing.(*T).Run
	/tmp/hartzelg/spack-stage/spack-stage-ewK2oy/go/src/testing/testing.go:646 +0x2ec
Therefore the test has already successfully sent SIGQUIT to the process, but the process fails to stop, indicating something is wrong with signal handling in the runtime.
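To see the mechanism the test exercises in isolation: with GOTRACEBACK=crash, SIGQUIT makes the runtime dump every goroutine and thread and then abort, which is essentially what the test waits for. A minimal sketch (the spin program is made up just for this demo; assumes a working go on PATH):

cat > /tmp/spin.go <<'EOF'
package main

import "time"

func main() {
	// a few goroutines so there is something to dump
	for i := 0; i < 4; i++ {
		go func() {
			for {
				time.Sleep(time.Second)
			}
		}()
	}
	time.Sleep(time.Hour)
}
EOF
go build -o /tmp/spin /tmp/spin.go
GOTRACEBACK=crash /tmp/spin &
sleep 1
kill -QUIT $!   # runtime should print a traceback for every thread, then abort
wait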
Having gotten the snark out of my system, it's worth noting that even though I've moved the Jenkins working directory, Spack (unless you configure it otherwise) runs its unbundle, build, and install steps in a directory tree symlinked into /tmp (the spack-stage paths in the traceback above). The change will speed up some other parts of the job (recursively removing the tree takes forever in our NFS environment), but after consideration I don't think it will change the outcome of this issue. Still, it's a change and it's worth being up front about it.
[sorry for the delay, been down with a cold...] Last week I fired off a build with GOMAXPROCS=16 before I received the feedback above. The gist is here. I think it's different, but I'm guessing.... Unless there's new feedback, my next step will be to try to recreate the original crash and catch the errant process in gdb.
With one caveat, I have been unable to get the build to fail for go@1.8rc1. The caveat is that I seem to need to adjust the maximum number of user processes upward. The default seems to be 4096. Left as-is, I see this:
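The limit being adjusted is the shell's per-user process cap; a hedged sketch of inspecting and raising it in bash before the build (16384 is an arbitrary example value):

ulimit -u        # show the current max user processes (4096 here)
ulimit -u 16384  # raise it for this shell and its children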
If I … I'm inclined to close this and move on, but would appreciate feedback about whether there's something more useful to do here.
FWIW, go@1.7.5 and …
I've started seeing failures again. One thing that I've noticed is that the machine is very heavily loaded (another large compute-bound job is keeping all cores busy). I see no sign of the straggling process that I reported earlier. Here's a gist of the test failure: https://gist.github.com/hartzell/5d3433593da9d679107c40273c4242f9
If the Go process is set to time out after 3 minutes and the OS decides not to give the Go process any CPU for 3 minutes, I would not be surprised that the Go process times out.
Is there a way to extend the timeout? @minux mentioned above that the test itself was quick, and that if it took any appreciable time it was an indication that something was amiss.
It takes 16 seconds on my 2-core machine. Extending the timeout doesn't seem like the answer.
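For completeness: when the test binary is run directly rather than through all.bash, the per-binary timeout can be adjusted with the standard go test flag (a hedged example; the dist harness applies its own limits during all.bash):

go test -timeout 30m -run=TestCrashDumpsAllThreads runtime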
Hmmm. Would that scale linearly with core count? Would having 1.5TB of RAM (even if not all in use) make it take longer?
Reading through the test to look for timeouts etc. showed me a reference to GOMAXPROCS. For grins (aka grasping at straws) I just started a run with GOMAXPROCS=8.
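Presumably along these lines; a hedged reconstruction, since the exact invocation was elided:

GOMAXPROCS=8 bash all.bash   # caps the runtime's parallelism for the build and its tests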
It failed with GOMAXPROCS=8, but with a much shorter stack trace (which at least suggests that I turned a knob that was hooked up to something). https://gist.github.com/hartzell/6363949312627aa1d417124e1ac7b3fc I was back in that terminal session fairly quickly after it failed and did not see any build-related processes hanging around. Is there any way that I can either instrument this or provoke it into telling me something useful?
On rereading I was reminded that @minux pointed out that I can run the test "by hand". Using the build that just failed, I can't make it fail. And then, just to work on my lunch for a while (each iteration seems to take ~5 seconds, the load is ~160, most cores around 100%): Is there some way that running the build inside Spack (roughly analogous to Homebrew...) could be making this fail? They do a bit of IO redirection magic, but I don't think there's any deep voodoo.
You should not set GOROOT if you have built from source.
On Tue, 7 Feb 2017, 08:23, George Hartzell wrote:
***@***.*** go]$ GOROOT=`pwd` ./bin/go test runtime -run=TestCrashDumpsAllThreads
ok runtime 0.864s
***@***.*** go]$ GOROOT=`pwd` ./bin/go test runtime -run=TestCrashDumpsAllThreads
ok runtime 0.882s
***@***.*** go]$ uptime
12:46:07 up 17 days, 23:14, 1 user, load average: 142.63, 145.37, 146.39
***@***.*** go]$ while (true)
> do
> GOROOT=`pwd` ./bin/go test runtime -run=TestCrashDumpsAllThreads
> done
ok runtime 0.749s
ok runtime 0.830s
ok runtime 0.840s
ok runtime 0.678s
ok runtime 0.906s
ok runtime 0.917s
ok runtime 0.847s
ok runtime 0.727s
ok runtime 0.790s
ok runtime 0.785s
ok runtime 0.720s
ok runtime 1.013s
ok runtime 0.765s
ok runtime 0.850s
ok runtime 0.719s
ok runtime 1.070s
ok runtime 1.240s
ok runtime 0.900s
ok runtime 0.997s
ok runtime 0.864s
ok runtime 0.865s
ok runtime 0.982s
ok runtime 0.723s
ok runtime 0.904s
ok runtime 0.929s
ok runtime 0.955s
ok runtime 0.709s
ok runtime 0.779s
ok runtime 0.843s
ok runtime 0.860s
ok runtime 0.840s
ok runtime 0.887s
ok runtime 0.837s
ok runtime 0.773s
ok runtime 0.760s
ok runtime 1.047s
[edit to add comment about what I was trying to do...] @davecheney -- I'm not sure how else to do it. The Python code that did the build (via Spack) is here. I came along afterwards and did this:
Actually, I was trying to run the tests as @minux suggested; version is just a stand-in.... Setting GOROOT seemed like an expedient way to use the uninstalled tree full of stuff that I'd built, and it appears (Danger, Will Robinson...) to work. What should I have done?
If your CI builds go in a random directory and then moves the files after the build, it should set the variable GOROOT_FINAL to the final location.
On Tue, 7 Feb 2017, 11:56, George Hartzell wrote:
I came along afterwards and did this:
***@***.*** go]$ ./bin/go version
go: cannot find GOROOT directory: /tmp/hartzelg/spack-cime-rpbuchop001-working-dir/workspace/daily-build/spack/opt/spack/linux-centos7-x86_64/gcc-5.4.0/go-1.7.5-r2ql2ftjxnlu4esctfokszedfrooyf63
Sure. But the problem is that I'm poking at a build that failed, so nothing got copied anywhere, and I'm trying to run the intermediate product through one of the tests by hand to see if I can figure out the heisenbug.
Is it possible to take your build system out of the picture and build from
source directly? I'm really concerned that setting GOROOT introduces the
possibility of pointing the go tool at a different version of Go which has
a long history of causing weird bugs.
I can (and will) try to recreate the failure running the build by hand. That might be a useful simplification.

The build system does not use GOROOT (at least not in any way that I can suss out). The failures that happen there are almost certainly not related to GOROOT. The build does set GOROOT_FINAL to the location where the tree will be copied if the build succeeds. That seems to be consistent with the documentation for building from source.

The only use of GOROOT, which you've latched on to, was me trying to figure out how to follow @minux's instructions on how to run the failing test by hand. I take your point to heart and believe that the results I get when I use GOROOT to run the tests within the staging directory might be misleading.

Given that I have a tree that's been built with GOROOT_FINAL set, and that I am trying to debug a problem with that tree, what is my alternative to setting GOROOT?
I've been running the build by hand. I have …

The build just succeeded 3 for 3. That's not as comforting as it might be; there have been periods where it has worked via Spack before. The automated (Jenkins, Spack) build failed last night. The automated job has … The machine has a long-running job that is keeping all of the cores mostly busy; it was running yesterday and last night and is running now. Load is around 140. Plenty of free RAM.

The Spack build framework does a bit of magic with file handles (and possibly other things) to manage the tasks that it runs. @minux mentions that the failing test (or the thing being tested) involves "signal trickery". Is it possible that the Spack framework is doing something that's getting in the way? That seems fragile, and if it were, I'd expect that the Spack build would never succeed, but I'm grasping at straws, so....
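If the suspicion is that the harness blocks or ignores signals in a way the child inherits, one low-tech check is to compare the kernel's per-process signal state for a shell started under the Spack/Jenkins environment against a plain login shell (SigBlk/SigIgn/SigCgt are standard /proc fields; nothing here is Spack-specific):

grep -E '^Sig(Blk|Ign|Cgt)' /proc/$$/status   # run inside each environment and diff the masks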
I'm still casting about for a way to get a handle on this. As I've detailed above, the core of my test case is building go via spack, like so:
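A hedged reconstruction of the elided command, assuming a stock Spack checkout (the exact version spec may have differed):

spack install go   # per the report, this bottoms out in '/usr/bin/bash' 'all.bash'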
I run a build every night as a Jenkins job. The Jenkins master is a Docker container and the job runs in a slave executor on a real Linux system (no VM, no container). That build seems to nearly always fail. I sometimes check out a Spack tree and run the command by hand; that seems to generally work these days (though when I filed this issue, my initial test case with 1.7.3 failed when run by hand). I can also cd into the Spack clone that failed, e.g. last night, and run … I've moved the build from the 144-core machine onto a 24-core box. When building by hand (at least) I no longer need to … Is there anything about …
I've pulled the go build out into a separate job that just does … This builds a bootstrap go@1.4 and uses it to build 1.8. It crashes. Here is a gist of an example crash. It seems to be a timeout but does not mention … On the other hand, the next one (run without the …)
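For context, a hedged sketch of what such a two-stage job amounts to (all paths illustrative):

cd /build/go1.4-bootstrap/src && ./make.bash     # stage 1: build the 1.4 bootstrap toolchain
export GOROOT_BOOTSTRAP=/build/go1.4-bootstrap   # point the real build at it
cd /build/go1.8/src && ./all.bash                # stage 2: build 1.8 and run the full test suite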
I've reproduced this problem on a smaller machine and without involving Spack. Jenkins seems to be central to the problem. I've opened a new issue: #19203.
I happen to have access to a large system and am using it to build a variety of packages. My go builds almost always fail. Details below.
What version of Go are you using (go version)?
Bootstrapping Go from here: https://storage.googleapis.com/golang/go1.4-bootstrap-20161024.tar.gz
Go from here: https://storage.googleapis.com/golang/go1.7.4.src.tar.gz

What operating system and processor architecture are you using (go env)?

What did you do?
Building the go recipe in Spack like so:
(which builds and uses a bootstrapping system from the 1.4 snapshot). It boils down to '/usr/bin/bash' 'all.bash'.
I'm seeing this in a job running via Jenkins that simplifies to this:
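A hedged sketch of what that Jenkins job reduces to (clone location illustrative; the Spack repository is the one linked earlier in the thread):

git clone https://github.com/llnl/spack.git
cd spack
./bin/spack install go   # which in turn runs '/usr/bin/bash' 'all.bash' in the stage dir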
What did you expect to see?
A successful build.
What did you see instead?
A panic during testing. Full output (1500+ lines) in this gist.
Here are the first few lines:
Some time later (15 minutes after the test failed), I still have a go-build-related process running:
and Jenkins thinks that the job is still running.