Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os: on NetBSD, *Process.Wait sometimes deadlocks after cmd.Process.Signal returns "process already finished" #48789

Open
bcmills opened this issue Oct 5, 2021 · 5 comments

Comments

@bcmills
Copy link
Member

@bcmills bcmills commented Oct 5, 2021

I've noticed a recurring pattern in cmd/go test failures on NetBSD builders, and I believe that it indicates a bug in either os.Process or the kernel itself on that platform.

The cmd/go tests start processes running “in the background” using os/exec, and use a waitOrStop function to terminate any remaining processes at the conclusion of each test.

The waitOrStop function starts a goroutine, then blocks on cmd.Wait(). The background goroutine blocks until either the call to cmd.Wait completes or the Context is canceled, then sends a signal to the process. If that signal fails with os: process already finished, then we assume that the process actually has already finished, and the background goroutine simply blocks until the call to cmd.Wait (inevitably) returns.

That all happens here:

go/src/cmd/go/script_test.go

Lines 1164 to 1206 in a05a7d4

errc := make(chan error)
go func() {
select {
case errc <- nil:
return
case <-ctx.Done():
}
err := cmd.Process.Signal(interrupt)
if err == nil {
err = ctx.Err() // Report ctx.Err() as the reason we interrupted.
} else if err.Error() == "os: process already finished" {
errc <- nil
return
}
if killDelay > 0 {
timer := time.NewTimer(killDelay)
select {
// Report ctx.Err() as the reason we interrupted the process...
case errc <- ctx.Err():
timer.Stop()
return
// ...but after killDelay has elapsed, fall back to a stronger signal.
case <-timer.C:
}
// Wait still hasn't returned.
// Kill the process harder to make sure that it exits.
//
// Ignore any error: if cmd.Process has already terminated, we still
// want to send ctx.Err() (or the error from the Interrupt call)
// to properly attribute the signal that may have terminated it.
_ = cmd.Process.Kill()
}
errc <- err
}()
waitErr := cmd.Wait()
if interruptErr := <-errc; interruptErr != nil {
return interruptErr
}


What I'm seeing on (some of?) the NetBSD builders is that after cmd.Process.Signal fails with process already finished, the call to cmd.Wait continues to block, seemingly forever.

The relevant goroutine traces are:

goroutine 1536 [syscall, 8 minutes]:
syscall.Syscall6(0x42d7, 0x9117f20, 0x0, 0x8f1a5a0, 0x0, 0x0, 0x0)
	/tmp/workdir/go/src/syscall/asm_unix_386.s:44 +0x5 fp=0x9117ecc sp=0x9117ec8 pc=0x80bc1e5
syscall.wait4(0x42d7, 0x9117f20, 0x0, 0x8f1a5a0)
	/tmp/workdir/go/src/syscall/zsyscall_netbsd_386.go:35 +0x5b fp=0x9117f04 sp=0x9117ecc pc=0x80b8f3b
syscall.Wait4(0x42d7, 0x9117f44, 0x0, 0x8f1a5a0)
	/tmp/workdir/go/src/syscall/syscall_bsd.go:145 +0x3b fp=0x9117f28 sp=0x9117f04 pc=0x80b6abb
os.(*Process).wait(0x8e98150)
	/tmp/workdir/go/src/os/exec_unix.go:44 +0xf2 fp=0x9117f60 sp=0x9117f28 pc=0x80e2762
os.(*Process).Wait(...)
	/tmp/workdir/go/src/os/exec.go:132
os/exec.(*Cmd).Wait(0x8c7bc30)
	/tmp/workdir/go/src/os/exec/exec.go:507 +0x4d fp=0x9117f9c sp=0x9117f60 pc=0x817eedd
cmd/go_test.waitOrStop({0x874073c, 0x8dc2300}, 0x8c7bc30, {0x873f89c, 0x869bd38}, 0x6472661e3)
	/tmp/workdir/go/src/cmd/go/script_test.go:1203 +0x11a fp=0x9117fc0 sp=0x9117f9c pc=0x85800ea
cmd/go_test.(*testScript).startBackground.func1()
	/tmp/workdir/go/src/cmd/go/script_test.go:1143 +0x5f fp=0x9117ff0 sp=0x9117fc0 pc=0x857ff8f
runtime.goexit()
	/tmp/workdir/go/src/runtime/asm_386.s:1311 +0x1 fp=0x9117ff4 sp=0x9117ff0 pc=0x80ab241
created by cmd/go_test.(*testScript).startBackground
	/tmp/workdir/go/src/cmd/go/script_test.go:1142 +0x390

goroutine 1537 [chan send, 8 minutes]:
runtime.gopark(0x869b798, 0x8dbc430, 0xf, 0x16, 0x2)
	/tmp/workdir/go/src/runtime/proc.go:366 +0xf6 fp=0x90e8f04 sp=0x90e8ef0 pc=0x807f376
runtime.chansend(0x8dbc400, 0x87302f0, 0x1, 0x8580314)
	/tmp/workdir/go/src/runtime/chan.go:258 +0x2f4 fp=0x90e8f48 sp=0x90e8f04 pc=0x804db54
runtime.chansend1(0x8dbc400, 0x87302f0)
	/tmp/workdir/go/src/runtime/chan.go:144 +0x24 fp=0x90e8f60 sp=0x90e8f48 pc=0x804d854
cmd/go_test.waitOrStop.func1()
	/tmp/workdir/go/src/cmd/go/script_test.go:1176 +0x184 fp=0x90e8ff0 sp=0x90e8f60 pc=0x8580314
runtime.goexit()
	/tmp/workdir/go/src/runtime/asm_386.s:1311 +0x1 fp=0x90e8ff4 sp=0x90e8ff0 pc=0x80ab241
created by cmd/go_test.waitOrStop
	/tmp/workdir/go/src/cmd/go/script_test.go:1165 +0x10e

Note that goroutine 1537 is blocked at script_test.go:1176, which is the send on errc after cmd.Process.Signal fails with os: process already finished.

Goroutine 1536 is blocked at the call to cmd.Wait, which is itself blocked on syscall.Wait4.


The failure rate with these symptoms is fairly high: something on the order of 20 failures per month.

greplogs --dashboard -md -l -e '(?m)panic: test timed out.*(?:.*\n)*.*\[syscall, .* minutes\]:\n(?:.+\n\t.+\n)*syscall\.Wait.*\n\t.+\n(?:.+\n\t.+\n)*cmd/go_test\.waitOrStop'

2021-10-04T22:46:23-17674e2/netbsd-386-9_0
2021-10-04T18:15:09-9f8d558/netbsd-386-9_0
2021-09-30T19:56:06-eb9f090/netbsd-386-9_0
2021-09-29T15:23:27-aeb4fba/netbsd-amd64-9_0
2021-09-28T17:18:36-ff7b041/netbsd-amd64-9_0
2021-09-28T15:26:21-583eeaa/netbsd-386-9_0
2021-09-27T18:57:20-3d795ea/netbsd-amd64-9_0
2021-09-22T16:24:17-74ba70b/netbsd-amd64-9_0
2021-09-22T15:00:53-91c2318/netbsd-amd64-9_0
2021-09-21T20:39:31-48cf96c/netbsd-386-9_0
2021-09-20T23:04:13-d7e3e44/netbsd-arm64-bsiegert
2021-09-20T00:13:47-a83a558/netbsd-amd64-9_0
2021-09-17T19:32:44-74e384f/netbsd-amd64-9_0
2021-09-16T23:57:40-8d2a9c3/netbsd-386-9_0
2021-09-16T19:38:19-bcdc61d/netbsd-amd64-9_0
2021-09-10T17:11:39-5a4b9f9/netbsd-amd64-9_0
2021-09-09T16:32:28-a53e3d5/netbsd-amd64-9_0
2021-09-08T11:57:03-9295723/netbsd-386-9_0
2021-09-07T03:56:13-6226020/netbsd-386-9_0
2021-09-04T10:58:11-5ec298d/netbsd-386-9_0
2021-08-31T16:43:46-6815235/netbsd-386-9_0
2021-08-30T22:07:49-b06cfe9/netbsd-386-9_0
2021-08-27T05:13:44-2c60a99/netbsd-386-9_0
2021-08-24T22:23:12-54cdef1/netbsd-386-9_0
2021-08-23T21:22:58-8157960/netbsd-386-9_0
2021-08-23T21:22:58-8157960/netbsd-arm64-bsiegert
2021-08-22T21:43:43-1958582/netbsd-386-9_0
2021-08-20T03:25:17-c92c2c9/netbsd-386-9_0
2021-08-19T20:50:13-65074a4/netbsd-amd64-9_0
2021-08-18T21:19:22-c2bd9ee/netbsd-386-9_0
2021-08-18T20:11:28-165ebd8/netbsd-386-9_0
2021-08-17T16:22:15-cf12b0d/netbsd-386-9_0
2021-08-17T15:00:04-3001b0a/netbsd-386-9_0
2021-08-17T04:37:32-a304273/netbsd-386-9_0
2021-08-17T01:29:37-1951afc/netbsd-386-9_0
2021-08-16T18:44:38-56a919f/netbsd-386-9_0
2021-08-16T18:44:32-ff36d11/netbsd-amd64-9_0
2021-08-16T13:38:52-a192ef8/netbsd-386-9_0
2021-08-12T17:43:16-39634e7/netbsd-386-9_0
2021-08-09T20:06:35-f1dce31/netbsd-386-9_0
2021-08-06T16:51:12-70546f6/netbsd-386-9_0
2021-07-28T03:27:13-b39e0f4/netbsd-386-9_0
2021-07-15T20:39:22-0941dbc/netbsd-amd64-9_0
2021-06-29T16:57:13-3463852/netbsd-386-9_0
2021-06-28T20:51:30-956c81b/netbsd-386-9_0
2021-06-24T03:45:33-a9bb382/netbsd-386-9_0
2021-06-15T20:59:42-d77f4c0/netbsd-amd64-9_0
2021-06-11T20:31:30-16b5d76/netbsd-386-9_0
2021-06-09T17:11:44-df35ade/netbsd-386-9_0
2021-06-09T15:09:13-139e935/netbsd-386-9_0
2021-05-21T17:43:46-4fda54c/netbsd-arm64-bsiegert
2021-05-21T17:35:47-8876b9b/netbsd-386-9_0
2021-05-17T16:02:12-b1aff42/netbsd-386-9_0
2021-05-12T15:23:09-0388670/netbsd-386-9_0
2021-05-12T02:04:57-1a0ea1a/netbsd-386-9_0
2021-05-11T18:22:54-9b84814/netbsd-386-9_0
2021-05-10T13:16:56-2870259/netbsd-386-9_0
2021-05-06T16:00:55-6c591f7/netbsd-386-9_0
2020-09-19T05:13:19-ccf581f/netbsd-386-9_0

@bcmills
Copy link
Member Author

@bcmills bcmills commented Oct 5, 2021

@bsiegert, @coypoop: do you know who might be able to look into this on the NetBSD side?

@ianlancetaylor, @tklauser: is this possibly related to #13987?

@bcmills bcmills changed the title os: *Process.Wait sometimes deadlocks after cmd.Process.Signal fails with "process already finished" os: on NetBSD, *Process.Wait sometimes deadlocks after cmd.Process.Signal returns "process already finished" Oct 5, 2021
@bcmills
Copy link
Member Author

@bcmills bcmills commented Oct 5, 2021

#44801 may be related, in that it involves a hang in os.Process.Wait.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Oct 5, 2021

I see two possible sequences that could lead to this.

First, wait6(P_PID, p.Pid, nil, WEXITED|WNOWAIT, nil, nil) might return with no error, and then wait4(p.Pid, &status, 0, &rusage) hangs without returning.

Second, kill(p.Pid, SIGQUIT) might return ESRCH, but wait4(p.Pid, &status, 0, &rusage) might hang without returning.

Neither should be possible. The first case seems more likely. If we don't see a response from somebody familiar with NetBSD we should probably move the netbsd build tag from os/wait_wait6.go to os/wait_unimp.go. CC @tklauser since CL 315281 changed the os package to start using wait_wait6.go on NetBSD.

@tklauser
Copy link
Member

@tklauser tklauser commented Oct 6, 2021

The first case seems more likely to me too, given that the failure seems to have started occuring only after https://golang.org/cl/315281 was submitted on 2021-05-02.

I'll send a CL to move the netbsd build tag from os/wait_wait6.go to os/wait_unimp.go.

@gopherbot
Copy link

@gopherbot gopherbot commented Oct 6, 2021

Change https://golang.org/cl/354249 mentions this issue: os: don't use wait6 on netbsd

gopherbot pushed a commit that referenced this issue Oct 7, 2021
CL 315281 changed the os package use wait6 on netbsd. This seems to be
causing frequent test failures as reported in #48789. Revert that change
using wait6 on netbsd for now.

Updates #13987
Updates #16028
For #48789

Change-Id: Ieddffc65611c7f449971eaa8ed6f4299a5f742c2
Reviewed-on: https://go-review.googlesource.com/c/go/+/354249
Trust: Tobias Klauser <tobias.klauser@gmail.com>
Trust: Bryan C. Mills <bcmills@google.com>
Trust: Benny Siegert <bsiegert@gmail.com>
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Bryan C. Mills <bcmills@google.com>
Reviewed-by: Benny Siegert <bsiegert@gmail.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
4 participants