Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os/signal: test flaky on darwin #31264

Closed
bcmills opened this issue Apr 4, 2019 · 34 comments

Comments

@bcmills
Copy link
Member

commented Apr 4, 2019

The os/signal test seems to time out fairly frequently on the darwin-amd64-race builder.
It's not obvious to me whether the test is deadlocked or just slow.

Example: https://build.golang.org/log/7c20aa477d782f38c6184e7f8bce05c9648b57b3

panic: test timed out after 3m0s

goroutine 20 [running]:
testing.(*M).startAlarm.func1()
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:1332 +0x11c
created by time.goFunc
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/time/sleep.go:169 +0x52

goroutine 1 [chan receive]:
testing.(*T).Run(0xc0000c0000, 0x42154f9, 0x9, 0x421e888, 0x1)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:917 +0x68a
testing.runTests.func1(0xc0000c0000)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:1157 +0xa7
testing.tRunner(0xc0000c0000, 0xc000046db0)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:865 +0x163
testing.runTests(0xc00000e060, 0x435ace0, 0xb, 0xb, 0x0)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:1155 +0x522
testing.(*M).Run(0xc0000ba000, 0x0)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:1072 +0x2ea
main.main()
	_testmain.go:64 +0x21f

goroutine 4 [syscall]:
os/signal.signal_recv(0x4247580)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/sigqueue.go:139 +0x9f
os/signal.loop()
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_unix.go:23 +0x30
created by os/signal.init.0
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_unix.go:29 +0x4f

goroutine 10 [runnable]:
runtime.Gosched(...)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/proc.go:266
os/signal.signalWaitUntilIdle()
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/sigqueue.go:172 +0x31
os/signal.Stop(0xc00009c300)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal.go:196 +0x2f3
runtime.Goexit()
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/panic.go:503 +0xec
testing.(*common).FailNow(0xc0000c0300)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:609 +0x5b
testing.(*common).Fatalf(0xc0000c0300, 0x4218523, 0x16, 0xc0000e6d18, 0x1, 0x1)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:672 +0x9f
os/signal.waitSig(0xc0000c0300, 0xc00009c2a0, 0x4247580, 0x42436a8)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_test.go:32 +0x309
os/signal.testCancel(0xc0000c0300, 0xc00003b700)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_test.go:133 +0x1f9
os/signal.TestReset(0xc0000c0300)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_test.go:187 +0x3e
testing.tRunner(0xc0000c0300, 0x421e888)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:865 +0x163
created by testing.(*T).Run
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:916 +0x652

goroutine 8 [runnable]:
runtime.Gosched(...)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/proc.go:266
os/signal.signalWaitUntilIdle()
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/sigqueue.go:172 +0x31
os/signal.Stop(0xc00009c180)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal.go:196 +0x2f3
os/signal.TestStress.func1(0xc000022480, 0xc0000224e0)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_test.go:92 +0x194
created by os/signal.TestStress
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_test.go:79 +0xe1
FAIL	os/signal	180.028s

@bcmills bcmills added this to the Go1.13 milestone Apr 4, 2019

@bcmills

This comment has been minimized.

Copy link
Member Author

commented Apr 4, 2019

CC @ianlancetaylor for os/signal

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Apr 4, 2019

In the stack trace two goroutines are stuck in this loop in signalWaitUntilIdle (in runtime/sigqueue.go):

	for atomic.Load(&sig.delivering) != 0 {
		Gosched()
	}

sig.delivering can only be non-zero while executing sigsend, also in runtime/sigqueue.go. sigsend is invoked by the signal handler. There is no code in sigsend that blocks, and sigsend always returns with sig.delivering unchanged.

So this seems to be impossible. The failure presumably has something to do with the race detector, but I don't know why it would be Darwin specific.

@bradfitz

This comment has been minimized.

Copy link
Member

commented Apr 5, 2019

@dvyukov for any ideas.

@bcmills

This comment has been minimized.

Copy link
Member Author

commented Apr 5, 2019

Some thoughts:

  • It could be another manifestation of the race in #20748.
  • Perhaps the fact that it's Darwin-specific has something to do with the interaction with libc. (Do we invoke sigaction via libc on Darwin now?) I seem to recall that TSAN tends to defer signal delivery until the next libc call (see #18717), so if we're hitting the TSAN hooks for signal delivery and then the test is idling in a way that doesn't ping back into libc, that could deadlock it.
@bradfitz

This comment has been minimized.

Copy link
Member

commented Apr 5, 2019

Not just race, it turns out. @andybons notes that it's also here:

https://build.golang.org/log/65130fd6cf64710896cae0a569087f8fa8df66fc (darwin-amd64-10_14)

@bradfitz bradfitz changed the title os/signal: test flaky on darwin-amd64-race builder os/signal: test flaky on darwin-amd64 Apr 5, 2019

@bradfitz

This comment has been minimized.

Copy link
Member

commented Apr 5, 2019

And that last one is on the Go 1.12 release branch.

@randall77, @ianlancetaylor, should this block the Go 1.12.2 release?

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Apr 5, 2019

It's hard to believe that this is new in 1.12.2 compared to 1.12, so I don't see a reason to block the 1.12.2 release.

@rsc

This comment has been minimized.

Copy link
Contributor

commented May 14, 2019

Just saw this on my pre-Mojave Mac as well. It's too bad the stack trace does not include any signal-handling thread stacks. It looks for all the world like sig.delivering is non-zero and it would be very nice to see where exactly the signal-handling thread is.

panic: test timed out after 9m0s

goroutine 24 [running]:
testing.(*M).startAlarm.func1()
	/Users/rsc/go/src/testing/testing.go:1380 +0xdf
created by time.goFunc
	/Users/rsc/go/src/time/sleep.go:169 +0x44

goroutine 1 [chan receive]:
testing.(*T).Run(0xc000152100, 0x4167a64, 0x9, 0x4170e60, 0x4078a01)
	/Users/rsc/go/src/testing/testing.go:964 +0x377
testing.runTests.func1(0xc0000fe000)
	/Users/rsc/go/src/testing/testing.go:1205 +0x78
testing.tRunner(0xc0000fe000, 0xc000054e30)
	/Users/rsc/go/src/testing/testing.go:912 +0xbf
testing.runTests(0xc0000d4020, 0x426fca0, 0xb, 0xb, 0x0)
	/Users/rsc/go/src/testing/testing.go:1203 +0x2a7
testing.(*M).Run(0xc0000fa000, 0x0)
	/Users/rsc/go/src/testing/testing.go:1120 +0x160
main.main()
	_testmain.go:64 +0x135

goroutine 34 [syscall]:
os/signal.signal_recv(0x41931e0)
	/Users/rsc/go/src/runtime/sigqueue.go:139 +0x9f
os/signal.loop()
	/Users/rsc/go/src/os/signal/signal_unix.go:23 +0x22
created by os/signal.init.0
	/Users/rsc/go/src/os/signal/signal_unix.go:29 +0x41

goroutine 23 [runnable]:
runtime.Gosched(...)
	/Users/rsc/go/src/runtime/proc.go:269
os/signal.signalWaitUntilIdle()
	/Users/rsc/go/src/runtime/sigqueue.go:172 +0x30
os/signal.Stop(0xc0000be240)
	/Users/rsc/go/src/os/signal/signal.go:196 +0x18c
runtime.Goexit()
	/Users/rsc/go/src/runtime/panic.go:503 +0xec
testing.(*common).FailNow(0xc000152100)
	/Users/rsc/go/src/testing/testing.go:656 +0x39
testing.(*common).Fatalf(0xc000152100, 0x416ab68, 0x16, 0xc000043588, 0x1, 0x1)
	/Users/rsc/go/src/testing/testing.go:719 +0x90
os/signal.waitSig(0xc000152100, 0xc0000be1e0, 0x41931e0, 0x418f4e0)
	/Users/rsc/go/src/os/signal/signal_test.go:32 +0x247
os/signal.testCancel(0xc000152100, 0x2be993261bf00)
	/Users/rsc/go/src/os/signal/signal_test.go:133 +0x1ce
os/signal.TestReset(0xc000152100)
	/Users/rsc/go/src/os/signal/signal_test.go:187 +0x30
testing.tRunner(0xc000152100, 0x4170e60)
	/Users/rsc/go/src/testing/testing.go:912 +0xbf
created by testing.(*T).Run
	/Users/rsc/go/src/testing/testing.go:963 +0x350

goroutine 21 [runnable]:
runtime.Gosched(...)
	/Users/rsc/go/src/runtime/proc.go:269
os/signal.signalWaitUntilIdle()
	/Users/rsc/go/src/runtime/sigqueue.go:172 +0x30
os/signal.Stop(0xc000024120)
	/Users/rsc/go/src/os/signal/signal.go:196 +0x18c
os/signal.TestStress.func1(0xc000138120, 0xc000138180)
	/Users/rsc/go/src/os/signal/signal_test.go:92 +0x160
created by os/signal.TestStress
	/Users/rsc/go/src/os/signal/signal_test.go:79 +0xd3
FAIL	os/signal	540.197s
@bcmills

This comment has been minimized.

Copy link
Member Author

commented May 30, 2019

One on darwin-arm64-corellium (https://build.golang.org/log/89688609340e1b045e263677ed0bf9b0cdd84710):

panic: test timed out after 3m0s

goroutine 10 [running]:
testing.(*M).startAlarm.func1()
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/testing/testing.go:1382 +0xc0
created by time.goFunc
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/time/sleep.go:169 +0x38

goroutine 1 [chan receive]:
testing.(*T).Run(0x130112200, 0x10248ee96, 0x9, 0x1024981c0, 0x5ce80e01)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/testing/testing.go:961 +0x2b4
testing.runTests.func1(0x1300f4000)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/testing/testing.go:1207 +0x6c
testing.tRunner(0x1300f4000, 0x13003edf8)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/testing/testing.go:909 +0xa8
testing.runTests(0x1300b8060, 0x102596e20, 0xb, 0xb, 0x0)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/testing/testing.go:1205 +0x24c
testing.(*M).Run(0x1300e0000, 0x0)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/testing/testing.go:1122 +0x140
main.main()
	_testmain.go:66 +0x144

goroutine 19 [syscall]:
os/signal.signal_recv(0x1024bbc60)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/runtime/sigqueue.go:139 +0xb8
os/signal.loop()
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/os/signal/signal_unix.go:23 +0x18
created by os/signal.init.0
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/os/signal/signal_unix.go:29 +0x30

goroutine 9 [runnable]:
runtime.Gosched(...)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/runtime/proc.go:269
os/signal.signalWaitUntilIdle()
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/runtime/sigqueue.go:172 +0x2c
os/signal.Stop(0x130080360)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/os/signal/signal.go:196 +0x1a8
runtime.Goexit()
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/runtime/panic.go:503 +0xdc
testing.(*common).FailNow(0x130112200)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/testing/testing.go:653 +0x30
testing.(*common).Fatalf(0x130112200, 0x102491f08, 0x16, 0x13003dd78, 0x1, 0x1)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/testing/testing.go:716 +0x70
os/signal.waitSig(0x130112200, 0x130080300, 0x1024bbc60, 0x1024b7e80)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/os/signal/signal_test.go:32 +0x19c
os/signal.testCancel(0x130112200, 0x13003df00)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/os/signal/signal_test.go:133 +0x188
os/signal.TestReset(0x130112200)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/os/signal/signal_test.go:187 +0x24
testing.tRunner(0x130112200, 0x1024981c0)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/testing/testing.go:909 +0xa8
created by testing.(*T).Run
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/testing/testing.go:960 +0x298

goroutine 7 [runnable]:
runtime.Gosched(...)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/runtime/proc.go:269
os/signal.signalWaitUntilIdle()
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/runtime/sigqueue.go:172 +0x2c
os/signal.Stop(0x1300ca480)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/os/signal/signal.go:196 +0x1a8
os/signal.TestStress.func1(0x130128120, 0x130128180)
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/os/signal/signal_test.go:92 +0x110
created by os/signal.TestStress
	/tmp/workdir-host-darwin-arm64-corellium-ios/go/src/os/signal/signal_test.go:79 +0xd0
FAIL	os/signal	180.141s

@bcmills bcmills added the OS-Darwin label May 30, 2019

@bcmills bcmills changed the title os/signal: test flaky on darwin-amd64 os/signal: test flaky on darwin May 30, 2019

@bcmills

This comment has been minimized.

Copy link
Member Author

commented Jun 3, 2019

darwin-amd64-10_12:

panic: test timed out after 3m0s

goroutine 21 [running]:
testing.(*M).startAlarm.func1()
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:1382 +0xdf
created by time.goFunc
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/time/sleep.go:169 +0x44

goroutine 1 [chan receive]:
testing.(*T).Run(0xc00013c100, 0x4164e26, 0x9, 0x416e1f8, 0x4077c01)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:961 +0x377
testing.runTests.func1(0xc0000f2000)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:1207 +0x78
testing.tRunner(0xc0000f2000, 0xc000046e30)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:909 +0xbf
testing.runTests(0xc0000d0020, 0x426ec60, 0xb, 0xb, 0x0)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:1205 +0x2a7
testing.(*M).Run(0xc0000e6000, 0x0)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:1122 +0x160
main.main()
	_testmain.go:66 +0x135

goroutine 34 [syscall]:
os/signal.signal_recv(0x41903a0)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/sigqueue.go:139 +0x9f
os/signal.loop()
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_unix.go:23 +0x22
created by os/signal.init.0
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_unix.go:29 +0x41

goroutine 20 [runnable]:
runtime.Gosched(...)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/proc.go:269
os/signal.signalWaitUntilIdle()
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/sigqueue.go:172 +0x30
os/signal.Stop(0xc0000ae1e0)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal.go:196 +0x18c
runtime.Goexit()
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/panic.go:503 +0xec
testing.(*common).FailNow(0xc00013c100)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:653 +0x39
testing.(*common).Fatalf(0xc00013c100, 0x4167f1f, 0x16, 0xc00004bd88, 0x1, 0x1)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:716 +0x90
os/signal.waitSig(0xc00013c100, 0xc0000ae180, 0x41903a0, 0x418c628)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_test.go:32 +0x247
os/signal.testCancel(0xc00013c100, 0x5b15d2f4400)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_test.go:133 +0x1ce
os/signal.TestReset(0xc00013c100)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_test.go:187 +0x30
testing.tRunner(0xc00013c100, 0x416e1f8)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:909 +0xbf
created by testing.(*T).Run
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:960 +0x350

goroutine 5 [runnable]:
runtime.Gosched(...)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/proc.go:269
os/signal.signalWaitUntilIdle()
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/sigqueue.go:172 +0x30
os/signal.Stop(0xc0000b2180)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal.go:196 +0x18c
os/signal.TestStress.func1(0xc0000220c0, 0xc000022120)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_test.go:92 +0x160
created by os/signal.TestStress
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_test.go:79 +0xee
FAIL	os/signal	180.025s

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jun 11, 2019

This has been happening for a while. The first occurrence seems to be on June 18, 2018:

https://build.golang.org/log/d464d075b04b100b96b7e9b7614755c02efd9dfa

The first occurrence that is not on the race builder is from July 13, 2018, but is on 386:

https://build.golang.org/log/5a14c8368922d5d3c4a19e4e7aff37111d916bd3

The first occurrence on amd64, not on the race builder, is July 18, 2018:

https://build.golang.org/log/0666f3b1dfa397938a4e93fe89afb97bdd2769e7

That is an interesting one, because it shows time.Sleep(10 * time.Millisecond) as sleeping for 3 minutes. The next more typical occurrence on amd64, not on the race builder, is from August 20, 2018:

https://build.golang.org/log/924b174c273ad0f136081ef3d1d9751bada8253b

That shows the most common pattern: a goroutine started by TestStress is hanging waiting for a deferred call to Stop to complete, and the test TestReset is also hanging waiting for a deferred call to Stop to complete. The deferred call to Stop from TestReset is occurring because TestReset has failed, timed out waiting for a SIGWINCH to be delivered.

June 2018, when the problems started happening, is when the runtime package was changing to move some operations from direct system calls to library calls. E.g., https://golang.org/cl/118819, https://golang.org/cl/118736. Notably https://golang.org/cl/117176 was committed on June 12, 2018: "runtime: use libc for signal functions on iOS".

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jun 11, 2019

Actually a possibly more relevant CL is https://golang.org/cl/116875, committed June 12, 2018: "runtime: use libc's signal functions on Darwin".

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jun 12, 2019

Through sheer luck I triggered the problem on a gomote. The process appears to be burning CPU. I attached to it using lldb and got this stack trace:

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x00007fff60d11f06 libsystem_kernel.dylib`__psynch_mutexwait + 10
    frame #1: 0x00007fff60dc8d52 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 96
    frame #2: 0x00007fff60dc64cd libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 222
    frame #3: 0x000000000405f57c signal.test`runtime.pthread_mutex_lock_trampoline + 12
    frame #4: 0x000000000405cf8d signal.test`runtime.asmcgocall + 173
    frame #5: 0x000000000405f570 signal.test`runtime.pthread_mutex_init_trampoline + 32
    frame #6: 0x000000000404c909 signal.test`runtime.pthread_mutex_lock + 57
    frame #7: 0x000000000402b93a signal.test`runtime.semawakeup + 42
    frame #8: 0x000000000400b959 signal.test`runtime.notewakeup + 89
    frame #9: 0x0000000004044233 signal.test`runtime.sigsend + 275
    frame #10: 0x0000000004041e90 signal.test`runtime.sighandler + 288
    frame #11: 0x0000000004042f6c signal.test`runtime.sigtrampgo + 572
    frame #12: 0x000000000405f263 signal.test`runtime.sigtramp + 51
  * frame #13: 0x000000000402b93a signal.test`runtime.semawakeup + 42
    frame #14: 0x000000000400b959 signal.test`runtime.notewakeup + 89
    frame #15: 0x0000000004044233 signal.test`runtime.sigsend + 275
    frame #16: 0x0000000004041e90 signal.test`runtime.sighandler + 288
    frame #17: 0x0000000004042f6c signal.test`runtime.sigtrampgo + 572
    frame #18: 0x000000000405f263 signal.test`runtime.sigtramp + 51
    frame #19: 0x000000000400b959 signal.test`runtime.notewakeup + 89
    frame #20: 0x0000000004044233 signal.test`runtime.sigsend + 275
    frame #21: 0x0000000004041e90 signal.test`runtime.sighandler + 288
    frame #22: 0x0000000004042f6c signal.test`runtime.sigtrampgo + 572
    frame #23: 0x000000000405f263 signal.test`runtime.sigtramp + 51
    frame #24: 0x0000000004044233 signal.test`runtime.sigsend + 275
    frame #25: 0x0000000004041e90 signal.test`runtime.sighandler + 288
    frame #26: 0x0000000004042f6c signal.test`runtime.sigtrampgo + 572
    frame #27: 0x000000000405f263 signal.test`runtime.sigtramp + 51
    frame #28: 0x0000000004041e90 signal.test`runtime.sighandler + 288
    frame #29: 0x0000000004042f6c signal.test`runtime.sigtrampgo + 572
    frame #30: 0x0000000004042f6c signal.test`runtime.sigtrampgo + 572
    frame #31: 0x000000000405f263 signal.test`runtime.sigtramp + 51
    frame #32: 0x000000000405f263 signal.test`runtime.sigtramp + 51

It looks like when sigsend calls notewakeup it is triggering a signal. Thus we never get out of sigsend and so sig.delivering really is always non-zero. Continuing to investigate.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jun 12, 2019

This C program triggers a SIGSEGV in a signal handler while SIGSEGV is blocked. On normal Unix systems that causes the kernel to immediately kill the program: since the SIGSEGV cannot be delivered, nothing else can happen. On macOS, however, it causes the program to hang while burning CPU. So I think that explains the symptom: in some cases sigsend is triggering a SIGSEGV or some other synchronous signal, causing the program to hang until it is killed by cmd/go.

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void die(const char *s) {
  perror(s);
  exit(EXIT_FAILURE);
}

char *nil;

static void handler(int sig, siginfo_t *info __attribute__((unused)), void *ctxt __attribute__((unused))) {
  printf("SP %#lx\n", (unsigned long)(__builtin_frame_address(0)));
  *nil = 0; /* trigger SIGSEGV */
}

static char *stack[SIGSTKSZ];

int main() {
  stack_t ss;
  struct sigaction sa;

  memset(&ss, 0, sizeof ss);
  ss.ss_sp = &stack[0];
  ss.ss_size = sizeof stack;
  if (sigaltstack(&ss, NULL) != 0)
    die("sigaltstack");

  memset(&sa, 0, sizeof sa);
  sa.sa_sigaction = handler;
  if (sigfillset(&sa.sa_mask) != 0)
    die("sigfillset");
  sa.sa_flags = SA_ONSTACK | SA_RESTART | SA_SIGINFO;
  if (sigaction(SIGWINCH, &sa, NULL) != 0)
    die("sigaction");
  if (sigaction(SIGSEGV, &sa, NULL) != 0)
    die("sigaction");

  raise(SIGWINCH);

  exit(EXIT_SUCCESS);
}
@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jun 12, 2019

Actually I now suspect the problem is simple enough. The TestStress test sends signals as fast as possible. Those signals are sent through sigsend. sigsend calls notewakeup, which calls semawakeup, which calls pthread_mutex_lock. In other words, we are calling pthread_mutex_lock from a signal handler. But we are calling the C function, not making a system call. The C function pthread_mutex_lock is not async-signal-safe. If the signal arrives while signal_recv is also locking or unlocking the same mutex, anything could happen.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jun 12, 2019

So after looking through a bunch of docs and complaints, it appears that macOS primitives for communications between threads are limited.

The sem_post function is async-signal-safe, but there is no way to do a timed wait for a semaphore: macOS does not support sem_timedwait, although it is in IEEE Std 1003.1-2001.

I think the best bet is the API defined in <mach/semaphore.h>, which is more or less what we used to use before converting to libc. That API appears to be available from C, but is sparsely documented. This would mean calling functions semaphore_create, semaphore_signal, semaphore_wait, semaphore_timedwait, semaphore_destroy, and mach_task_self. This assumes that semaphore_signal is async-signal-safe; it is not documented as such, but according to the Intertubes sem_post calls it, so it must be.

@randall77 @eliasnaur Do you see any difficulties with using those functions on macOS and iOS?

@eliasnaur

This comment has been minimized.

Copy link
Contributor

commented Jun 13, 2019

Thanks for working on this and for taking iOS limitations into account.

https://developer.apple.com/documentation/kernel/1538205-semaphore_create?language=objc only lists macOS, which means that while semaphore_create and friends might exist and work on iOS, Go programs could end up being rejected by the App Store checks. See #31628 for a recent example with ptrace.

Is the dispatch variant of the semaphore API sufficient? It is listed as supported on all Apple platforms.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jun 13, 2019

Well, that sucks.

What we need is a way to do a timed wait for some condition to occur: wait for up to N nanoseconds for the condition. Right now we we use a condition variable, and use pthread_cond_timedwait_relative_np for the timed wait, but that doesn't work because we have no way to signal a condition variable in a signal handler.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jun 13, 2019

My next thought is to add a new kind of synchronization construct to the runtime: a signal safe note. On all platforms other than Darwin, we would implement that as a normal note. On Darwin we would call sem_open (it's a real pain that Darwin is also missing sem_init) and call sem_post from the signal handler. This could work because we don't need to do a timed wait on the signal safe note.

@eliasnaur

This comment has been minimized.

Copy link
Contributor

commented Jun 13, 2019

Sorry to push the point if it was completely off: did you see my question about the "dispatch" semaphore variants, such as https://developer.apple.com/documentation/dispatch/1452919-dispatch_semaphore_signal?language=objc?

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jun 13, 2019

I saw the link but I didn't understand it, and intended to reply to it with my comment about a timed wait. But looking deeper I see that there is a timed wait there. Thanks.

What I'm unsure about now is that that page says these are available from Objective C and Swift. But I do see <dispatch/dispatch.h> which looks like C code. I will give it a try.

@gopherbot

This comment has been minimized.

Copy link

commented Jun 13, 2019

Change https://golang.org/cl/182258 mentions this issue: runtime: use dispatch semaphores on Darwin

@gopherbot

This comment has been minimized.

Copy link

commented Jun 19, 2019

Change https://golang.org/cl/182880 mentions this issue: Revert "runtime: use dispatch semaphores on Darwin"

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jun 21, 2019

Reopening because rollback was committed.

gopherbot pushed a commit that referenced this issue Jun 21, 2019
Revert "runtime: use dispatch semaphores on Darwin"
This reverts https://golang.org/cl/182258.

The new code caused unpredictable crashes that are not understood.  The old code was occasionally flaky but still better than this approach.

Fixes #32655
Updates #31264

Change-Id: I2e9d27d6052e84bf75106d8b844549ba4f571695
Reviewed-on: https://go-review.googlesource.com/c/go/+/182880
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Katie Hockman <katie@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@katiehockman

This comment has been minimized.

Copy link
Contributor

commented Jun 21, 2019

As an update, the darwin-amd64-10_11 build is now seeing flaky failures again: https://build.golang.org/log/3728af36f8bbb9c6d0341afa90b8afbc1138cf3a

@MichaelTJones

This comment has been minimized.

Copy link
Contributor

commented Jun 24, 2019

Update on the test failures here at home. As it happens, the "lingering death" that causes the tests to timeout with various issues actually happens after or better said, in the wake of, running signal_test. The run by hand, -v output is:

lum:signal mtj$ go test -v
=== RUN   TestSignal
--- PASS: TestSignal (0.00s)
    signal_test.go:44: sighup...
    signal_test.go:53: sigwinch...
    signal_test.go:60: sighup...
    signal_test.go:63: sighup...
=== RUN   TestStress
--- PASS: TestStress (3.01s)
=== RUN   TestReset
signal: killed

The PASS in TestStress does indeed pass in all cases. Changing the number of CPUs in TestStress to one makes the test pass 3 out of 4 times. Changing various waits and delays does nothing. After TestStress in cases where it triggers the issue, the latent issue puts the next test into the unkillable by ^C mode where it does nothing but vegetate and consume processor time.

I read and understood Ian's deeper causal commentary, but wanted to explore ways to provoke the bug less aggressively as an alternative to preventing the test on darwin. I surrender on that, the test is now skipped in my tree.

@MichaelTJones

This comment has been minimized.

Copy link
Contributor

commented Jun 24, 2019

FYI, in os_test.go, in PipeThreads, changing the "but can you do it with just half the threads?" test to a more generous 2/3 the threads, works fine on many core...

defer debug.SetMaxThreads(debug.SetMaxThreads((2*threads) / 3))

Line 2278

@gertcuykens

This comment was marked as resolved.

Copy link

commented Jun 26, 2019

panic: test timed out after 9m0s

goroutine 50 [running]:
testing.(*M).startAlarm.func1()
        /Users/gert/Desktop/go/src/testing/testing.go:1382 +0xdf
created by time.goFunc
        /Users/gert/Desktop/go/src/time/sleep.go:169 +0x44

goroutine 1 [chan receive]:
testing.(*T).Run(0xc00014a000, 0x41680df, 0x9, 0x4171488, 0x4077901)
        /Users/gert/Desktop/go/src/testing/testing.go:961 +0x377
testing.runTests.func1(0xc0000e2000)
        /Users/gert/Desktop/go/src/testing/testing.go:1207 +0x78
testing.tRunner(0xc0000e2000, 0xc000054dc0)
        /Users/gert/Desktop/go/src/testing/testing.go:909 +0xc9
testing.runTests(0xc0000c2020, 0x426fc60, 0xb, 0xb, 0x0)
        /Users/gert/Desktop/go/src/testing/testing.go:1205 +0x2a7
testing.(*M).Run(0xc0000de000, 0x0)
        /Users/gert/Desktop/go/src/testing/testing.go:1122 +0x176
main.main()
        _testmain.go:66 +0x135

goroutine 18 [syscall]:
os/signal.signal_recv(0x41936c0)
        /Users/gert/Desktop/go/src/runtime/sigqueue.go:139 +0x9f
os/signal.loop()
        /Users/gert/Desktop/go/src/os/signal/signal_unix.go:23 +0x22
created by os/signal.init.0
        /Users/gert/Desktop/go/src/os/signal/signal_unix.go:29 +0x41

goroutine 35 [runnable]:
runtime.Gosched(...)
        /Users/gert/Desktop/go/src/runtime/proc.go:269
os/signal.signalWaitUntilIdle()
        /Users/gert/Desktop/go/src/runtime/sigqueue.go:172 +0x30
os/signal.Stop(0xc000070360)
        /Users/gert/Desktop/go/src/os/signal/signal.go:196 +0x18c
runtime.Goexit()
        /Users/gert/Desktop/go/src/runtime/panic.go:547 +0xec
testing.(*common).FailNow(0xc00014a000)
        /Users/gert/Desktop/go/src/testing/testing.go:653 +0x39
testing.(*common).Fatalf(0xc00014a000, 0x416b1c4, 0x16, 0xc000144ce0, 0x1, 0x1)
        /Users/gert/Desktop/go/src/testing/testing.go:716 +0x90
os/signal.waitSig(0xc00014a000, 0xc000070300, 0x41936c0, 0x418f928)
        /Users/gert/Desktop/go/src/os/signal/signal_test.go:32 +0x247
os/signal.testCancel(0xc00014a000, 0x4b7e59ccf9200)
        /Users/gert/Desktop/go/src/os/signal/signal_test.go:133 +0x20f
os/signal.TestReset(0xc00014a000)
        /Users/gert/Desktop/go/src/os/signal/signal_test.go:187 +0x30
testing.tRunner(0xc00014a000, 0x4171488)
        /Users/gert/Desktop/go/src/testing/testing.go:909 +0xc9
created by testing.(*T).Run
        /Users/gert/Desktop/go/src/testing/testing.go:960 +0x350

goroutine 8 [runnable]:
runtime.Gosched(...)
        /Users/gert/Desktop/go/src/runtime/proc.go:269
os/signal.signalWaitUntilIdle()
        /Users/gert/Desktop/go/src/runtime/sigqueue.go:172 +0x30
os/signal.Stop(0xc00009c120)
        /Users/gert/Desktop/go/src/os/signal/signal.go:196 +0x18c
os/signal.TestStress.func1(0xc000088120, 0xc000088180)
        /Users/gert/Desktop/go/src/os/signal/signal_test.go:92 +0x17d
created by os/signal.TestStress
        /Users/gert/Desktop/go/src/os/signal/signal_test.go:79 +0xf8
FAIL    os/signal       540.019s
@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jun 26, 2019

@gertcuykens Thanks. I think we understand exactly what is happening, and we don't need further examples. What we need is some idea of how to fix it. iOS doesn't seem to have the facility we need.

@eliasnaur

This comment has been minimized.

Copy link
Contributor

commented Jun 26, 2019

I'm not enrolled in the Apple Developer program to be sure, but a good test of whether symbols are available for iOS is to check whether they're available in the Objective C headers. I created a single page iOS project in Xcode and modified main.m:

#import <UIKit/UIKit.h>
#import <mach/semaphore.h>
#import <mach/task.h>
#import <mach/mach.h>


/*#include <sys/types.h>
#include <sys/ptrace.h>
*/


#import "AppDelegate.h"

int main(int argc, char * argv[]) {
    semaphore_t sem;
    task_t me = mach_task_self();
    semaphore_create(me, &sem, 0, 0);
    semaphore_signal(sem);
    semaphore_wait(sem);
    mach_timespec_t ts;
    semaphore_timedwait(sem, ts);
    semaphore_destroy(me, sem);
    
    
    
     //    ptrace(0, 0, 0, 0);
    @autoreleasepool {
        return UIApplicationMain(argc, argv, nil, NSStringFromClass([AppDelegate class]));
    }
}

Which compiled without errors.

As you can tell, I also tried to use a symbol we know is not available: ptrace. As expected, the ptrace symbol is not defined in ptrace.h for iOS and the build fails.

I think we can go ahead and use your suggested C functions, Ian. In the meantime, perhaps someone with Apple Developer access can make absolutely sure the functions pass the App Store tests. @steeve, @sanderpick?

@gopherbot

This comment has been minimized.

Copy link

commented Jun 28, 2019

Change https://golang.org/cl/184169 mentions this issue: runtime: use a pipe to wake up signal_recv on Darwin

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jun 28, 2019

I went with a different approach that should always work: use a special purpose note, based on a pipe, for signal handling. @eliasnaur It would be great if you could check that CL 184169 works on arm and arm64. Thanks. I verified that it seems to work on amd64, using gomote.

@gopherbot gopherbot closed this in c485e8b Jun 30, 2019

@gopherbot

This comment has been minimized.

Copy link

commented Jun 30, 2019

Change https://golang.org/cl/184377 mentions this issue: runtime: use correct register in darwin/386 pipe_trampoline

gopherbot pushed a commit that referenced this issue Jul 1, 2019
runtime: use correct register in darwin/386 pipe_trampoline
Updates #31264

Change-Id: I745744dd3fdaa432d70e8dc9336547017bac89ee
Reviewed-on: https://go-review.googlesource.com/c/go/+/184377
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Elias Naur <mail@eliasnaur.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
9 participants
You can’t perform that action at this time.