os: (*Process).Wait sometimes hangs on netbsd #50138

Open
bcmills opened this issue Dec 13, 2021 · 28 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-NetBSD

@bcmills
Member

bcmills commented Dec 13, 2021

greplogs --dashboard -md -l -e 'panic: test timed out.*\n\n(?:goroutine .*:\n(?:.+\n\t.+\n)+\n)*goroutine \d+ \[syscall, \d+ minutes\]:\n(?:.+\n\t.+\n)*os\.\(\*Process\)\.Wait(?:.*\n)+FAIL\s+cmd/link'

2021-12-12T06:14:07-9c6e8f6/netbsd-386-9_0-n2

goroutine 29 [syscall, 2 minutes]:
syscall.Syscall6(0x21a8, 0x892ee8c, 0x0, 0x8a2c1e0, 0x0, 0x0, 0x0)
	/tmp/workdir/go/src/syscall/asm_unix_386.s:43 +0x5 fp=0x892ee38 sp=0x892ee34 pc=0x80b9605
syscall.wait4(0x21a8, 0x892ee8c, 0x0, 0x8a2c1e0)
	/tmp/workdir/go/src/syscall/zsyscall_netbsd_386.go:34 +0x5b fp=0x892ee70 sp=0x892ee38 pc=0x80b737b
syscall.Wait4(0x21a8, 0x892eeb0, 0x0, 0x8a2c1e0)
	/tmp/workdir/go/src/syscall/syscall_bsd.go:144 +0x3b fp=0x892ee94 sp=0x892ee70 pc=0x80b558b
os.(*Process).wait(0x8a04660)
	/tmp/workdir/go/src/os/exec_unix.go:43 +0x82 fp=0x892eec8 sp=0x892ee94 pc=0x80de982
os.(*Process).Wait(...)
	/tmp/workdir/go/src/os/exec.go:132
os/exec.(*Cmd).Wait(0x8a18fd0)
	/tmp/workdir/go/src/os/exec/exec.go:507 +0x4d fp=0x892ef0c sp=0x892eec8 pc=0x816b07d
os/exec.(*Cmd).Run(0x8a18fd0)
	/tmp/workdir/go/src/os/exec/exec.go:341 +0x43 fp=0x892ef1c sp=0x892ef0c pc=0x816a463
os/exec.(*Cmd).CombinedOutput(0x8a18fd0)
	/tmp/workdir/go/src/os/exec/exec.go:567 +0x89 fp=0x892ef30 sp=0x892ef1c pc=0x816b549
cmd/link.TestContentAddressableSymbols(0x89290e0)
	/tmp/workdir/go/src/cmd/link/link_test.go:879 +0x136 fp=0x892ef9c sp=0x892ef30 pc=0x83824b6
testing.tRunner(0x89290e0, 0x842c054)
	/tmp/workdir/go/src/testing/testing.go:1410 +0x10d fp=0x892efe4 sp=0x892ef9c pc=0x813d19d
testing.(*T).Run.func1()
	/tmp/workdir/go/src/testing/testing.go:1457 +0x28 fp=0x892eff0 sp=0x892efe4 pc=0x813df78
runtime.goexit()
	/tmp/workdir/go/src/runtime/asm_386.s:1311 +0x1 fp=0x892eff4 sp=0x892eff0 pc=0x80ab211
created by testing.(*T).Run
	/tmp/workdir/go/src/testing/testing.go:1457 +0x36e

2021-10-29T18:34:24-903f313/netbsd-amd64-9_0
2021-10-01T15:59:38-e5ad363/netbsd-arm-bsiegert

goroutine 28 [syscall, 27 minutes]:
syscall.Syscall6(0x1c1, 0xd1f, 0xa09db4, 0x0, 0x9b27e0, 0x0, 0x0)
	/var/gobuilder/buildlet/go/src/syscall/asm_netbsd_arm.s:39 +0x8 fp=0xa09d5c sp=0xa09d58 pc=0x8d3f8
syscall.wait4(0xd1f, 0xa09db4, 0x0, 0x9b27e0)
	/var/gobuilder/buildlet/go/src/syscall/zsyscall_netbsd_arm.go:35 +0x54 fp=0xa09d94 sp=0xa09d5c pc=0x8a694
syscall.Wait4(0xd1f, 0xa09dd8, 0x0, 0x9b27e0)
	/var/gobuilder/buildlet/go/src/syscall/syscall_bsd.go:145 +0x3c fp=0xa09db8 sp=0xa09d94 pc=0x88c58
os.(*Process).wait(0x983290)
	/var/gobuilder/buildlet/go/src/os/exec_unix.go:44 +0x100 fp=0xa09df0 sp=0xa09db8 pc=0xb4f1c
os.(*Process).Wait(...)
	/var/gobuilder/buildlet/go/src/os/exec.go:132
os/exec.(*Cmd).Wait(0x98cc60)
	/var/gobuilder/buildlet/go/src/os/exec/exec.go:507 +0x50 fp=0xa09e2c sp=0xa09df0 pc=0x1482d0
os/exec.(*Cmd).Run(0x98cc60)
	/var/gobuilder/buildlet/go/src/os/exec/exec.go:341 +0x48 fp=0xa09e3c sp=0xa09e2c pc=0x147810
os/exec.(*Cmd).CombinedOutput(0x98cc60)
	/var/gobuilder/buildlet/go/src/os/exec/exec.go:567 +0x98 fp=0xa09e50 sp=0xa09e3c pc=0x14882c
cmd/link.TestIssue33979.func2({0x983200, 0x21}, {0x9ea0a0, 0x9, 0x9})
	/var/gobuilder/buildlet/go/src/cmd/link/link_test.go:199 +0x90 fp=0xa09ea8 sp=0xa09e50 pc=0x368e14
cmd/link.TestIssue33979.func3({0x9ea0a0, 0x9, 0x9})
	/var/gobuilder/buildlet/go/src/cmd/link/link_test.go:206 +0x60 fp=0xa09ecc sp=0xa09ea8 pc=0x368d5c
cmd/link.TestIssue33979(0x8834a0)
	/var/gobuilder/buildlet/go/src/cmd/link/link_test.go:239 +0x3bc fp=0xa09f98 sp=0xa09ecc pc=0x368790
testing.tRunner(0x8834a0, 0x41871c)
	/var/gobuilder/buildlet/go/src/testing/testing.go:1389 +0x118 fp=0xa09fe0 sp=0xa09f98 pc=0x1195d4
testing.(*T).Run.func1()
	/var/gobuilder/buildlet/go/src/testing/testing.go:1436 +0x30 fp=0xa09fec sp=0xa09fe0 pc=0x11a448
runtime.goexit()
	/var/gobuilder/buildlet/go/src/runtime/asm_arm.s:824 +0x4 fp=0xa09fec sp=0xa09fec pc=0x7d028
created by testing.(*T).Run
	/var/gobuilder/buildlet/go/src/testing/testing.go:1436 +0x3a0

2021-09-21T20:39:31-48cf96c/netbsd-arm-bsiegert
2021-09-14T14:27:57-181e8cd/netbsd-arm-bsiegert
2021-04-29T15:47:16-12eaefe/freebsd-amd64-11_4
2021-04-28T13:49:52-4fe324d/netbsd-386-9_0
2021-03-05T02:30:31-b62da08/netbsd-386-9_0
2021-02-19T00:40:05-95a44d2/netbsd-arm64-bsiegert
2019-09-04T21:52:18-aae0b5b/linux-ppc64le-power9osu

#44801 may be closely related.

Note that many of these failures are on architectures not believed to be affected by #49209.

@bsiegert, @coypoop: any ideas?

@bcmills bcmills added the NeedsInvestigation and OS-NetBSD labels Dec 13, 2021
@bcmills bcmills added this to the Backlog milestone Dec 13, 2021
@bcmills
Member Author

bcmills commented Dec 13, 2021

Could also be related to #48789.

@bcmills
Member Author

bcmills commented Dec 13, 2021

The stuck calls appear to be running go run, go tool link, etc. Since these are Go binaries, probably a good starting point would be to send the stuck processes SIGQUIT to try to get goroutine dumps, and then to send them SIGKILL if they still don't respond.

That would at least help us to determine whether the hang is in the subprocess or the parent process.

(I think @aclements and @mknyszek were working on retrofitting that logic to various tests?)

@cherrymui
Member

cherrymui commented Dec 14, 2021

The arm and arm64 ones may be due to slow machine.

Yeah, sending a SIGQUIT at timeout is probably a good idea.

@aclements
Member

aclements commented Dec 15, 2021

I have CL 370665 to apply timeouts to nearly every subprocess invocation in the runtime test (though wasn't planning to land that until the tree opens). These failures are all in cmd/link or cmd/link/internal/ld. I could roll a CL to use RunWithTimeout in those tests.

@bcmills
Member Author

bcmills commented Jan 7, 2022

Looks like the same failure mode in go/internal/gcimporter too: https://build.golang.org/log/0e1b9a393109ba16005d18ff9faca47c50728a8f

goroutine 167 [syscall, 2 minutes]:
syscall.Syscall6(0x1a4b, 0x8853cf0, 0x0, 0x8d7c320, 0x0, 0x0, 0x0)
	/tmp/workdir/go/src/syscall/asm_unix_386.s:43 +0x5 fp=0x8853c9c sp=0x8853c98 pc=0x80b4015
syscall.wait4(0x1a4b, 0x8853cf0, 0x0, 0x8d7c320)
	/tmp/workdir/go/src/syscall/zsyscall_netbsd_386.go:34 +0x5b fp=0x8853cd4 sp=0x8853c9c pc=0x80b28db
syscall.Wait4(0x1a4b, 0x8853d14, 0x0, 0x8d7c320)
	/tmp/workdir/go/src/syscall/syscall_bsd.go:144 +0x3b fp=0x8853cf8 sp=0x8853cd4 pc=0x80b269b
os.(*Process).wait(0x89d1530)
	/tmp/workdir/go/src/os/exec_unix.go:43 +0x82 fp=0x8853d2c sp=0x8853cf8 pc=0x80c9d32
os.(*Process).Wait(...)
	/tmp/workdir/go/src/os/exec.go:132
os/exec.(*Cmd).Wait(0x89549a0)
	/tmp/workdir/go/src/os/exec/exec.go:507 +0x4d fp=0x8853d70 sp=0x8853d2c pc=0x8139b7d
os/exec.(*Cmd).Run(0x89549a0)
	/tmp/workdir/go/src/os/exec/exec.go:341 +0x43 fp=0x8853d80 sp=0x8853d70 pc=0x8138f63
os/exec.(*Cmd).CombinedOutput(0x89549a0)
	/tmp/workdir/go/src/os/exec/exec.go:567 +0x89 fp=0x8853d94 sp=0x8853d80 pc=0x813a049
go/internal/gcimporter_test.compile(0x8801b30, {0x88c4300, 0x1e}, {0x88aaa80, 0xe}, {0x89d1500, 0x27})
	/tmp/workdir/go/src/go/internal/gcimporter/gcimporter_test.go:50 +0x28d fp=0x8853e18 sp=0x8853d94 pc=0x822216d
go/internal/gcimporter_test.TestImportTypeparamTests.func1(0x8801b30)
	/tmp/workdir/go/src/go/internal/gcimporter/gcimporter_test.go:201 +0x406 fp=0x8853f9c sp=0x8853e18 pc=0x8223a86
testing.tRunner(0x8801b30, 0x8970ca0)
	/tmp/workdir/go/src/testing/testing.go:1440 +0x10d fp=0x8853fe4 sp=0x8853f9c pc=0x8109a0d
testing.(*T).Run.func1()
	/tmp/workdir/go/src/testing/testing.go:1487 +0x28 fp=0x8853ff0 sp=0x8853fe4 pc=0x810a7e8
runtime.goexit()
	/tmp/workdir/go/src/runtime/asm_386.s:1311 +0x1 fp=0x8853ff4 sp=0x8853ff0 pc=0x80a8a51
created by testing.(*T).Run
	/tmp/workdir/go/src/testing/testing.go:1487 +0x36e

@bcmills
Member Author

bcmills commented Mar 10, 2022

greplogs --dashboard -md -l -e 'panic: test timed out.*\n\n(?:goroutine .*:\n(?:.+\n\t.+\n)+\n)*goroutine \d+ \[syscall, \d+ minutes\]:\n(?:.+\n\t.+\n)*os\.\(\*Process\)\.Wait(?:.*\n)+FAIL\s+cmd/link' --since=2022-01-08

2022-03-10T09:12:04-5a040c5/netbsd-amd64-9_0

@bcmills bcmills changed the title from "cmd/link: tests hanging in os.(*Process).Wait on netbsd builders" to "os: (*Process).Wait sometimes hangs on netbsd" Apr 6, 2022
@bcmills
Member Author

bcmills commented Apr 6, 2022

Broadening the regexp to search for os.(*Process).Wait generally, since this symptom does not appear to be specific to cmd/link as far as I can see.

greplogs --dashboard -md -l -e '\Anetbsd-(.*\n)*panic: test timed out.*\n\n(?:goroutine .*:\n(?:.+\n\t.+\n)+\n)*goroutine \d+ \[syscall, \d+ minutes\]:\n(?:.+\n\t.+\n)*os\.\(\*Process\)\.Wait' --since=2022-01-01
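
To make the shape of that expression easier to see, here is a small sketch that compiles it with Go's `regexp` package and applies it to a synthetic log. The sample log, the first-line format after `netbsd-`, and the `isWaitHang` helper are made up for illustration, and greplogs may apply the pattern with different flags than the defaults used here.

```go
package main

import (
	"fmt"
	"regexp"
)

// hangPattern is the broadened expression from the comment above, compiled
// with Go's regexp package using default flags.
var hangPattern = regexp.MustCompile(`\Anetbsd-(.*\n)*panic: test timed out.*\n\n(?:goroutine .*:\n(?:.+\n\t.+\n)+\n)*goroutine \d+ \[syscall, \d+ minutes\]:\n(?:.+\n\t.+\n)*os\.\(\*Process\)\.Wait`)

// sampleLog is a synthetic, heavily shortened log invented for this example.
const sampleLog = "netbsd-386-9_0 at 0000000\n" +
	"\n" +
	"panic: test timed out after 3m0s\n" +
	"\n" +
	"goroutine 29 [syscall, 2 minutes]:\n" +
	"syscall.Wait4(0x21a8, 0x892eeb0, 0x0, 0x8a2c1e0)\n" +
	"\t/tmp/workdir/go/src/syscall/syscall_bsd.go:144 +0x3b\n" +
	"os.(*Process).wait(0x8a04660)\n" +
	"\t/tmp/workdir/go/src/os/exec_unix.go:43 +0x82\n" +
	"os.(*Process).Wait(...)\n" +
	"\t/tmp/workdir/go/src/os/exec.go:132\n"

// isWaitHang reports whether a log starts with a netbsd builder name and
// contains a timed-out goroutine that sat in a syscall for minutes with
// os.(*Process).Wait on its stack.
func isWaitHang(log string) bool {
	return hangPattern.MatchString(log)
}

func main() {
	fmt.Println(isWaitHang(sampleLog)) // prints "true"

	// The \A anchor restricts matches to logs that begin with "netbsd-".
	fmt.Println(isWaitHang("linux-" + sampleLog[len("netbsd-"):])) // prints "false"
}
```

Compared with the earlier query, this version drops the trailing `FAIL\s+cmd/link` requirement and instead anchors on the builder name at the start of the log.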

@bcmills
Member Author

bcmills commented Apr 19, 2022

greplogs --dashboard -md -l -e '\Anetbsd-(.*\n)*panic: test timed out.*\n\n(?:goroutine .*:\n(?:.+\n\t.+\n)+\n)*goroutine \d+ \[syscall, \d+ minutes\]:\n(?:.+\n\t.+\n)*os\.\(\*Process\)\.Wait' --since=2022-04-06

2022-04-18T22:07:54-f49e802/netbsd-arm-bsiegert

  • 29 minutes, stuck in cmd/link/internal/ld.TestMemProfileCheck on go run of a test program that prints runtime.MemProfileRate and then exits.

@bcmills
Member Author

bcmills commented Apr 19, 2022

@bsiegert, @coypoop: given that os/exec is used pervasively, I can't easily filter out these failures during dashboard triage. I also don't see how Go on NetBSD could be usable in a production setting with what appears to be a pervasive deadlock in such a fundamental API. Is there something that can be done to move this forward on the NetBSD side?

@bsiegert
Contributor

bsiegert commented Apr 19, 2022

/cc @zoulasc @tklauser

@coypoop
Contributor

coypoop commented Apr 20, 2022

> The arm and arm64 ones may be due to slow machine.

Is this still a suspicion? I assume the netbsd/arm builder is extra slow.
I didn't try this, but netbsd/arm64's 32-bit compat might be sufficient for running arm32 tests.

@riastradh

riastradh commented Apr 20, 2022

Do you have steps to reproduce?

I'm having a little trouble following the initial report, because it seems to cover several operating systems and architectures. Does this happen every time on NetBSD, or on NetBSD/arm, or only sometimes, or what?

If it happens only sometimes, how long do successful test runs take on the machines where it fails?

@bcmills
Member Author

bcmills commented Apr 20, 2022

> > The arm and arm64 ones may be due to slow machine.
>
> Is this still a suspicion? I assume the netbsd/arm builder is extra slow.

I can't speak for @cherrymui, but given the similar failures on the -amd64 and -386 builders I believe this is likely an architecture-independent synchronization bug in either the NetBSD kernel or the Go os implementation for NetBSD.

@bcmills
Member Author

bcmills commented Apr 20, 2022

> Do you have steps to reproduce?

Unfortunately no. The failures listed above were found organically in the Go build dashboard — the repro rate is high enough to be significant but not high enough to reproduce on demand.

> I'm having a little trouble following the initial report, because it seems to cover several operating systems and architectures.

The freebsd failure in 2021-04-29T15:47:16-12eaefe was likely #46272, and hasn't been seen since that was addressed.
The linux-ppc64le failure was from 2019, and may have been an organic timeout. (It hasn't been a recurring pattern.)
The remainder AFAICT are on NetBSD, on varying architectures.

> Does this happen every time on NetBSD, or on NetBSD/arm, or only sometimes, or what?

Intermittently, on NetBSD across all of the architectures for which we have builders.

@bcmills
Member Author

bcmills commented Apr 20, 2022

(Note that we also tried to use wait6 for this on NetBSD, but had to roll it back because of other deadlocks on that OS — see #48789.)

@bcmills
Member Author

bcmills commented Apr 25, 2022

greplogs --dashboard -md -l -e '\Anetbsd-(.*\n)*panic: test timed out.*\n\n(?:goroutine .*:\n(?:.+\n\t.+\n)+\n)*goroutine \d+ \[syscall, \d+ minutes\]:\n(?:.+\n\t.+\n)*os\.\(\*Process\)\.Wait' --since=2022-04-19

@bcmills
Member Author

bcmills commented May 13, 2022

greplogs --dashboard -md -l -e '\Anetbsd-(.*\n)*panic: test timed out.*\n\n(?:goroutine .*:\n(?:.+\n\t.+\n)+\n)*goroutine \d+ \[syscall, \d+ minutes\]:\n(?:.+\n\t.+\n)*os\.\(\*Process\)\.Wait' --since=2022-04-23
2022-05-12T22:29:02-da0a6f4/netbsd-arm-bsiegert
2022-05-12T20:19:10-27ace7a/netbsd-arm-bsiegert

Very curious that these recent ones seem to occur in pairs. 🤔

(attn @golang/netbsd)

@bcmills
Member Author

bcmills commented May 16, 2022

Then again, the pairings might just be a coincidence.

greplogs --dashboard -md -l -e '\Anetbsd-(.*\n)*panic: test timed out.*\n\n(?:goroutine .*:\n(?:.+\n\t.+\n)+\n)*goroutine \d+ \[syscall, \d+ minutes\]:\n(?:.+\n\t.+\n)*os\.\(\*Process\)\.Wait' --since=2022-05-13
2022-05-13T19:45:43-ba8310c/netbsd-amd64-9_0

@bcmills
Member Author

bcmills commented May 31, 2022

greplogs -l -e '\Anetbsd-(.*\n)*panic: test timed out.*\n\n(?:goroutine .*:\n(?:.+\n\t.+\n)+\n)*goroutine \d+ \[syscall, \d+ minutes\]:\n(?:.+\n\t.+\n)*os\.\(\*Process\)\.Wait' --since=2022-05-14
2022-05-27T14:57:14-590b53f/netbsd-amd64-9_0 (on release-branch.go1.17)
2022-05-17T03:26:28-41b9d8c/netbsd-arm64-bsiegert (on the main branch)

@gopherbot

gopherbot commented May 31, 2022

Change https://go.dev/cl/409595 mentions this issue: dashboard: mark all current netbsd builders as affected by golang/go#50138

gopherbot pushed a commit to golang/build that referenced this issue May 31, 2022
…50138

Since a large fraction of Go tests invoke commands, this issue causes
noise on the builders that cannot be easily bypassed or filtered out.

Failures matching this issue have been observed on all four of the
current NetBSD builders. (The last such failure observed on a
non-NetBSD builder was on freebsd-amd64-11_4, and that builder is no
longer used; no matching failures have been observed on more recent
FreeBSD builders.)

Updates golang/go#50138.

Change-Id: Ied687a63a55407d19c5f1905e79111d302087937
Reviewed-on: https://go-review.googlesource.com/c/build/+/409595
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
Auto-Submit: Bryan Mills <bcmills@google.com>
@bcmills
Member Author

bcmills commented Sep 6, 2022

greplogs -l -e '\Anetbsd-(.*\n)*panic: test timed out.*\n\n(?:goroutine .*:\n(?:.+\n\t.+\n)+\n)*goroutine \d+ \[syscall, \d+ minutes\]:\n(?:.+\n\t.+\n)*os\.\(\*Process\)\.Wait' --since=2022-05-31 --details

2022-09-04T04:17:04-535fe2b/netbsd-386-9_0
2022-09-02T05:05:13-33c1ddd-8e35910/netbsd-amd64-9_0
2022-09-02T02:09:04-a330ca5/netbsd-386-9_0
2022-09-02T00:06:00-40cfaff-0592ce5/netbsd-amd64-9_0
2022-09-01T23:18:06-40cfaff-1280ae7/netbsd-amd64-9_0
2022-09-01T22:37:04-40cfaff-aa5ff29/netbsd-386-9_0
2022-09-01T18:08:05-f16be35-ef84141/netbsd-386-9_0
2022-09-01T15:27:30-6c10975-86e9e0e/netbsd-386-9_0
2022-09-01T03:24:42-49ab44d-91ef076/netbsd-amd64-9_0
2022-09-01T00:42:27-550e1f5-64b260d/netbsd-386-9_0
2022-08-31T23:52:00-550e1f5-ca634fa/netbsd-amd64-9_0
2022-08-31T22:22:43-550e1f5-e4b624e/netbsd-amd64-9_0
2022-08-31T22:10:52-550e1f5-33a7e5a/netbsd-386-9_0
2022-08-31T21:08:24-4ccc73c-ce77a46/netbsd-386-9_0
2022-08-31T21:08:24-4ccc73c-889d326/netbsd-amd64-9_0
2022-08-31T16:26:07-41c3a9b-d2d5929/netbsd-386-9_0
2022-08-31T01:16:54-248c34b-ee0e40a/netbsd-amd64-9_0
2022-08-30T21:30:15-248c34b-bd56cb9/netbsd-amd64-9_0
2022-08-30T00:49:19-248c34b-629ae1c/netbsd-amd64-9_0
2022-08-28T16:04:49-717a671-846c378/netbsd-386-9_0
2022-08-26T19:15:02-717a671-897ad2f/netbsd-amd64-9_0
2022-08-26T18:28:14-717a671-bf812b3/netbsd-386-9_0
2022-08-26T17:48:20-7f23307-296c40d/netbsd-amd64-9_0
2022-08-26T17:15:08-bc1d0d8/netbsd-amd64-9_0
2022-08-26T15:36:36-7c5e035-951d2c6/netbsd-386-9_0
2022-08-25T19:17:14-d35bb19-f64f12f/netbsd-386-9_0
2022-08-25T17:31:33-db6a62c-83b5fe6/netbsd-386-9_0
2022-08-25T14:56:18-db6a62c-e4be2ac/netbsd-amd64-9_0
2022-08-25T04:00:07-db6a62c-8c8429f/netbsd-386-9_0
2022-08-25T04:00:07-db6a62c-8c8429f/netbsd-amd64-9_0
2022-08-24T21:20:11-db6a62c-e4bed41/netbsd-amd64-9_0
2022-08-24T21:09:24-e4bed41/netbsd-386-9_0
2022-08-24T17:56:56-587a153-cfae70c/netbsd-386-9_0
2022-08-24T15:37:38-587a153-d5aa088/netbsd-amd64-9_0
2022-08-24T14:31:08-c837a30-f983a93/netbsd-amd64-9_0
2022-08-24T12:12:12-587a153-b5a9459/netbsd-amd64-9_0
2022-08-23T20:32:50-fc0d423/netbsd-386-9_0
2022-08-23T20:32:50-587a153-fc0d423/netbsd-386-9_0
2022-08-22T18:51:32-a726c9f/netbsd-386-9_0
2022-08-22T16:48:36-a10da77/netbsd-amd64-9_0
2022-08-19T16:55:03-e55fb40-7dad1d2/netbsd-386-9_0
2022-08-19T16:17:50-e55fb40-5729419/netbsd-386-9_0
2022-08-19T15:53:47-e55fb40-f324355/netbsd-amd64-9_0
2022-08-19T03:09:05-e55fb40-a409356/netbsd-386-9_0
2022-08-15T21:54:27-e3c2e4c/netbsd-386-9_0
2022-08-15T20:41:00-938e162-de0f4d1/netbsd-amd64-9_0
2022-08-15T20:02:31-938e162-7b45edb/netbsd-amd64-9_0
2022-08-15T19:17:20-8c83056-7b45edb/netbsd-amd64-9_0
2022-08-15T15:15:10-f1b1557/netbsd-386-9_0
2022-08-14T00:06:23-35f806b-59865f1/netbsd-amd64-9_0
2022-08-12T20:40:05-bebd890-2f6783c/netbsd-386-9_0
2022-08-12T18:15:28-bebd890-b6f87b0/netbsd-amd64-9_0
2022-08-12T12:39:26-88d981e-f67c766/netbsd-amd64-9_0
2022-08-12T00:04:29-c4ec74a-a5cd894/netbsd-386-9_0
2022-08-11T20:16:35-6c2e327/netbsd-amd64-9_0
2022-08-11T20:13:14-db84f53/netbsd-amd64-9_0
2022-08-11T17:53:50-37a81b6-a526ec1/netbsd-amd64-9_0
2022-08-11T16:19:14-37a81b6-2340d37/netbsd-amd64-9_0
2022-08-10T23:26:58-2e6ffd6/netbsd-amd64-9_0
2022-08-10T22:22:48-b2156b5-6b80b62/netbsd-amd64-9_0
2022-08-10T17:41:25-0ad49fd-f19f6c7/netbsd-amd64-9_0
2022-08-10T15:08:24-3950865-c81dfdd/netbsd-amd64-9_0
2022-08-09T14:33:24-0981d9f/netbsd-386-9_0
2022-08-09T14:12:01-92d58ea-662a729/netbsd-amd64-9_0
2022-08-09T11:28:56-92d58ea-0f8dffd/netbsd-amd64-9_0
2022-08-08T18:14:49-a34a97d/netbsd-amd64-9_0
2022-08-08T17:01:54-9b60852-0a86cd6/netbsd-amd64-9_0
2022-08-08T16:05:18-487b350/netbsd-386-9_0
2022-08-08T15:07:46-06d96ee-cd54ef1/netbsd-amd64-9_0
2022-08-08T14:11:19-6e9925c/netbsd-amd64-9_0
2022-08-08T14:11:09-3a9281f/netbsd-amd64-9_0
2022-08-08T06:16:59-06d96ee-0f6ee42/netbsd-386-9_0
2022-08-04T20:05:03-81c7dc4-39728f4/netbsd-386-9_0
2022-08-04T20:05:03-81c7dc4-39728f4/netbsd-amd64-9_0
2022-08-04T20:04:16-3519aa2-39728f4/netbsd-386-9_0
2022-08-04T19:57:25-763f65c-39728f4/netbsd-386-9_0
2022-08-04T17:05:18-3e0a503-fb1bfd4/netbsd-amd64-9_0
2022-08-04T15:50:11-3e0a503-fcdd099/netbsd-386-9_0
2022-08-04T15:50:11-3e0a503-44ff9bf/netbsd-amd64-9_0
2022-08-04T14:58:59-87f47bb-4345620/netbsd-386-9_0
2022-08-03T21:02:27-8b9a1fb-4345620/netbsd-386-9_0
2022-08-03T18:07:40-d08f5dc-fcdd099/netbsd-386-9_0
2022-08-02T18:51:22-74cee27/netbsd-amd64-9_0
2022-08-02T18:19:01-d025cce-be59153/netbsd-amd64-9_0
2022-08-02T18:16:22-10cb435-d723df7/netbsd-amd64-9_0
2022-08-02T18:07:14-4d0b383-d723df7/netbsd-386-9_0
2022-08-02T18:07:14-4d0b383-d723df7/netbsd-amd64-9_0
2022-08-02T16:05:48-4d0b383-f2a9f3e/netbsd-amd64-9_0
2022-07-06T19:34:44-fc07039/netbsd-amd64-9_0
2022-07-06T19:31:41-d69bac6-460a93b/netbsd-386-9_0

(90 matching logs)

@bsiegert
Contributor

bsiegert commented Sep 16, 2022

https://go-review.googlesource.com/c/go/+/315281/ from @tklauser seems highly relevant for this, and it was reverted in https://go-review.googlesource.com/c/go/+/354249 because it resulted in test failures (#48789). Tobias, are there any plans to move forward with that change again?

@bcmills
Member Author

bcmills commented Sep 19, 2022

@bsiegert, FWIW I think it would be ok to reapply CL 315281 at this point.

I think we have enough evidence by now to show that both wait4 and wait6 lead to deadlocks; since it seems to be broken either way, probably we should use whichever of those system calls will be easier to report and/or debug upstream.

@tklauser
Member

tklauser commented Sep 19, 2022

@bsiegert I haven't looked at it in detail after reverting https://go.dev/cl/315281 and AFAIK I was never able to reproduce the issue on my local NetBSD machine to investigate further.

Given Bryan's comment re. both variants deadlocking I can resend that CL though, so we at least shouldn't hit #13987 on NetBSD.

@bsiegert
Contributor

bsiegert commented Sep 19, 2022

Yes, please do!

@gopherbot

gopherbot commented Sep 19, 2022

Change https://go.dev/cl/431855 mentions this issue: os: use wait6 to avoid wait/kill race on netbsd

gopherbot pushed a commit that referenced this issue Sep 19, 2022
Resend of CL 315281 which was partially reverted by CL 354249 after the
original CL was suspected to cause test failures as reported in #48789.
It seems that both wait4 and wait6 lead to that particular deadlock, so
let's use wait6. That way we at least don't hit #13987 on netbsd.

Updates #13987
For #48789
For #50138

Change-Id: Iadc4a771217b7e9e821502e89afa07036e0dcb6f
Reviewed-on: https://go-review.googlesource.com/c/go/+/431855
Reviewed-by: Benny Siegert <bsiegert@gmail.com>
Auto-Submit: Tobias Klauser <tobias.klauser@gmail.com>
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@bcmills
Member Author

bcmills commented Sep 19, 2022

Huh. https://build.golang.org/log/23bd42ab6595d8c52430290c8bf5d7db3886229d appears to be after the above CL was merged, but it's still showing the hang in a call to Wait4 instead of Wait6. 🤔

That call is the one here:
https://cs.opensource.google/go/go/+/master:src/os/exec_unix.go;l=43;drc=a2baae6851a157d662dff7cc508659f66249698a

Is it possible that NetBSD's wait4 system call is getting confused by a similar PID-reuse race?

@bcmills
Member Author

bcmills commented Sep 19, 2022

Hmm, and that Wait4 is the same hang reported in #48789. So it seems that using Wait6 in blockUntilWaitable only changes the frequency, not the nature, of the hang.
