
runtime: hangs in TestGdbBacktrace on linux #37405

Open
bcmills opened this issue Feb 24, 2020 · 15 comments

@bcmills (Member) commented Feb 24, 2020

2020-02-22T04:31:41-059a5ac/linux-mips64le-mengzhuo

goroutine 23401 [syscall, 11 minutes]:
syscall.Syscall6(0x1475, 0x1, 0x3b6f, 0xc000cad968, 0x1000004, 0x0, 0x0, 0x120153878, 0xc000cad960, 0x120080418)
	/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/syscall/asm_linux_mips64x.s:40 +0x10 fp=0xc000cad910 sp=0xc000cad908 pc=0x1200cc678
os.(*Process).blockUntilWaitable(0xc00001ab70, 0x0, 0x1200103ec, 0x0)
	/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/os/wait_waitid.go:31 +0x88 fp=0xc000cad9f8 sp=0xc000cad910 pc=0x1200e49e8
os.(*Process).wait(0xc00001ab70, 0x1202d17d0, 0x1202d17d8, 0x1202d17c8)
	/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/os/exec_unix.go:22 +0x4c fp=0xc000cada68 sp=0xc000cad9f8 pc=0x1200df43c
os.(*Process).Wait(...)
	/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/os/exec.go:125
os/exec.(*Cmd).Wait(0xc000ad8f20, 0x0, 0x0)
	/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/os/exec/exec.go:502 +0x68 fp=0xc000cadad8 sp=0xc000cada68 pc=0x120153e18
os/exec.(*Cmd).Run(0xc000ad8f20, 0xc00009aff0, 0xc000ad8f20)
	/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/os/exec/exec.go:340 +0x74 fp=0xc000cadaf8 sp=0xc000cadad8 pc=0x12015340c
os/exec.(*Cmd).CombinedOutput(0xc000ad8f20, 0x3, 0xc000cade78, 0xf, 0xf, 0xc000ad8f20)
	/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/os/exec/exec.go:562 +0xbc fp=0xc000cadb20 sp=0xc000cadaf8 pc=0x1201541dc
runtime_test.TestGdbBacktrace(0xc00022a5a0)
	/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/runtime/runtime-gdb_test.go:388 +0x6c4 fp=0xc000cadf80 sp=0xc000cadb20 pc=0x120202e64
testing.tRunner(0xc00022a5a0, 0x1202d2e00)
	/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/testing/testing.go:992 +0xf8 fp=0xc000cadfc8 sp=0xc000cadf80 pc=0x12010a978
runtime.goexit()
	/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/runtime/asm_mips64x.s:646 +0x4 fp=0xc000cadfc8 sp=0xc000cadfc8 pc=0x120084354
created by testing.(*T).Run
	/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/testing/testing.go:1043 +0x378
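
For reference, the hanging call has roughly the shape below. This is a minimal sketch, not the actual test code: the real TestGdbBacktrace builds a fixture binary and passes more gdb flags, and the file name and flags here are placeholders.

```go
package main_test

import (
	"os/exec"
	"testing"
)

// Sketch only: CombinedOutput blocks (ultimately in wait4, as in the
// traceback above) until gdb exits, so a wedged gdb hangs the test
// until the overall testing timeout fires.
func TestGdbSketch(t *testing.T) {
	cmd := exec.Command("gdb", "-nx", "-batch",
		"-ex", "run",
		"-ex", "backtrace",
		"./a.exe") // placeholder fixture binary
	got, err := cmd.CombinedOutput() // never returns if gdb wedges
	if err != nil {
		t.Fatalf("gdb error: %v\n%s", err, got)
	}
	t.Logf("%s", got)
}
```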

CC @dr2chase @aclements @mengzhuo

@bcmills added this to the Backlog milestone Feb 24, 2020

@mengzhuo (Contributor) commented Feb 24, 2020

The log shows that the runtime test failed due to a timeout:

FAIL	runtime	720.011s
FAIL

Maybe we should dump ps -ef output and the system load after a bot fails?

@gopherbot commented Apr 14, 2020

Change https://golang.org/cl/227811 mentions this issue: runtime: only shrink stack while preempting

@bcmills (Member, Author) commented Jan 25, 2021

This turns out not to be specific to the mips64le builder. See also #39228 (occasional failures instead of hangs).

2021-01-23T19:46:06-9897655/linux-amd64-sid
2020-11-02T03:03:16-0387bed/linux-386-softfloat
2020-06-25T12:02:38-334752d/linux-amd64-staticlockranking

@bcmills changed the title from "runtime: hang in TestGdbBacktrace on linux-mips64le-mengzhuo builder" to "runtime: hangs in TestGdbBacktrace on linux" Jan 25, 2021

@bcmills (Member, Author) commented Oct 14, 2021

@bcmills (Member, Author) commented Dec 8, 2021

greplogs --dashboard -md -l -e '(?m)panic: test timed out(?:.|\n)*os/exec\...Cmd.*\n.*\nruntime_test\.TestGdbBacktrace' --since=2021-10-15

2021-12-08T04:14:00-a19e72c/linux-amd64-buster
2021-11-16T17:13:33-29ec902/linux-386-softfloat
2021-11-13T00:49:51-530e320/solaris-amd64-oraclerel
2021-10-25T20:08:25-252324e/linux-386-stretch

@bcmills (Member, Author) commented Dec 8, 2021

Marking as release-blocker for Go 1.18 due to the regularity of failures, and especially the variety of builders on which failures have been observed. The large number of affected platforms suggests either a problem in the test (which we should fix) or a deep-rooted bug in gdb itself (for which we should either add a workaround or remove the infeasible test).

Note that many of the failures are on linux/386 and linux/amd64, which are first-class ports.

(CC @jeremyfaller)

@bcmills removed this from the Backlog milestone Dec 8, 2021
@bcmills added this to the Go1.18 milestone Dec 8, 2021

@aclements (Member) commented Dec 8, 2021

We could implement our own timeout in TestGdbBacktrace so it can fail cleanly and print the output it has so far from GDB.
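
A sketch of that idea, under assumed names (this is not the code that landed in the CLs below): start the command, race Wait against a timer, and on timeout kill the process and fail with whatever output was captured so far.

```go
package runtime_test

import (
	"bytes"
	"os/exec"
	"testing"
	"time"
)

// runWithTimeout is a hypothetical helper: it runs cmd and, if cmd does
// not finish within timeout, kills it and fails the test with the output
// captured so far instead of hanging until the testing deadline.
func runWithTimeout(t *testing.T, timeout time.Duration, cmd *exec.Cmd) []byte {
	var buf bytes.Buffer
	cmd.Stdout = &buf
	cmd.Stderr = &buf
	if err := cmd.Start(); err != nil {
		t.Fatalf("starting %v: %v", cmd.Args, err)
	}
	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()
	select {
	case err := <-done:
		if err != nil {
			t.Logf("%v: %v", cmd.Args, err)
		}
	case <-time.After(timeout):
		cmd.Process.Kill()
		<-done // reap the process
		t.Fatalf("%v timed out after %v; output so far:\n%s", cmd.Args, timeout, buf.Bytes())
	}
	return buf.Bytes()
}
```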

@gopherbot commented Dec 9, 2021

Change https://golang.org/cl/370703 mentions this issue: runtime: run gdb with a timeout for TestGdbBacktrace

@gopherbot commented Dec 9, 2021

Change https://golang.org/cl/370701 mentions this issue: testenv: abstract run-with-timeout into testenv

@gopherbot commented Dec 9, 2021

Change https://golang.org/cl/370702 mentions this issue: testenv: kill subprocess if SIGQUIT doesn't do it

@gopherbot commented Dec 9, 2021

Change https://golang.org/cl/370665 mentions this issue: runtime: use testenv.RunWithTimeout liberally

gopherbot pushed a commit that referenced this issue Dec 12, 2021
This lifts the logic to run a subcommand with a timeout in a test from
the runtime's runTestProg into testenv. The implementation is
unchanged in this CL. We'll improve it in a future CL.

Currently, tests that run subcommands usually just time out with no
useful output if the subcommand runs for too long. This is a step
toward improving that.

For #37405.

Change-Id: I2298770db516e216379c4c438e05d23cbbdda51d
Reviewed-on: https://go-review.googlesource.com/c/go/+/370701
Trust: Austin Clements <austin@google.com>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
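
From the caller's side, the abstracted helper would be used roughly as below. The signature (a testing.TB plus an *exec.Cmd, returning the combined output and an error) is an assumption inferred from the CL description, and internal/testenv is only importable from within the standard library tree.

```go
package runtime_test

import (
	"internal/testenv" // only available inside the Go standard library tree
	"os/exec"
	"testing"
)

// Usage sketch; the exact RunWithTimeout signature is assumed, not quoted.
func TestWithHelper(t *testing.T) {
	cmd := exec.Command("gdb", "-nx", "-batch", "-ex", "backtrace", "./a.exe")
	got, err := testenv.RunWithTimeout(t, cmd) // killed automatically if it wedges
	if err != nil {
		t.Fatalf("gdb error: %v\n%s", err, got)
	}
	t.Logf("%s", got)
}
```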
gopherbot pushed a commit that referenced this issue Dec 12, 2021
This makes testenv.RunWithTimeout first attempt to SIGQUIT the
subprocess to get a useful Go traceback, but if that doesn't work, it
sends a SIGKILL instead to make sure we tear down the subprocess. This
is potentially important for non-Go subprocesses.

For #37405.

Change-Id: I9e7e118dc5769ec3f45288a71658733bff30c9cd
Reviewed-on: https://go-review.googlesource.com/c/go/+/370702
Trust: Austin Clements <austin@google.com>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
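
The escalation described in this commit might look roughly like the sketch below (illustrative names, not the helper that landed): SIGQUIT first, so that a Go subprocess dumps its goroutines, then SIGKILL if the process is still running, since non-Go programs such as gdb may ignore SIGQUIT.

```go
package procutil

import (
	"os"
	"syscall"
	"time"
)

// stopProcess asks politely with SIGQUIT (a Go program responds with a
// goroutine dump), then escalates to SIGKILL if the process has not
// exited after grace. Sketch only; SIGKILL cannot be caught or ignored.
func stopProcess(p *os.Process, grace time.Duration) {
	p.Signal(syscall.SIGQUIT)
	done := make(chan struct{})
	go func() {
		p.Wait() // reap the child when it exits
		close(done)
	}()
	select {
	case <-done:
	case <-time.After(grace):
		p.Kill()
		<-done
	}
}
```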
gopherbot pushed a commit that referenced this issue Dec 12, 2021
This sometimes times out and we don't have any useful output for
debugging it. Hopefully this will help.

For #37405.

Change-Id: I79074e6fbb9bd16a864c651109a0acbfc8aa6cef
Reviewed-on: https://go-review.googlesource.com/c/go/+/370703
Trust: Austin Clements <austin@google.com>
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>

@aclements (Member) commented Jan 4, 2022

Hmm. TestGdbBacktrace hasn't failed since I landed the changes to add a clean timeout. greplogs -dashboard -md -l -e "TestGdbBacktrace" -since=2021-12-09 returns nothing.

@aclements (Member) commented Jan 11, 2022

Still no more failures. I'm not sure what to make of this. I'm pretty sure the timeout code I added works: as an experiment, I changed the test to run gdb on "sleep 10" with the timeout lowered to 1 second, and it did in fact kill gdb.
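
That experiment, roughly, reusing the hypothetical runWithTimeout sketch from earlier in this thread (not the actual diff):

```go
// Sanity-check sketch: point gdb at a command that sleeps longer than
// the timeout; runWithTimeout should kill it and fail after about 1s
// rather than letting the test hang.
cmd := exec.Command("gdb", "-nx", "-batch", "-ex", "run", "--args", "sleep", "10")
_ = runWithTimeout(t, 1*time.Second, cmd) // expected to fail the test via Fatalf
```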

@bcmills (Member, Author) commented Jan 11, 2022

Since we've at least made tangible progress on diagnosing the problem during the 1.18 cycle, I think it would be ok to move this back to the Backlog milestone and/or mark it WaitingForInfo while we wait for another repro.

It's unfortunate but not terribly surprising for flaky tests not to reproduce as often during the code freeze, because the rate of test runs (especially for fast and/or scalable builders) tends to be much higher during the active development window.
