Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

misc/cgo: TestSignalForwardingExternal sometimes fails with wrong signal SIGINT #53907

jonathan-albrecht-ibm opened this issue Jul 15, 2022 · 11 comments


Copy link

@jonathan-albrecht-ibm jonathan-albrecht-ibm commented Jul 15, 2022

What version of Go are you using (go version)?

$ go version
go version devel go1.19-2aa473cc54 Fri Jul 15 14:56:54 2022 +0000 linux/s390x

Does this issue reproduce with the latest release?


What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOVERSION="devel go1.19-2aa473cc54 Fri Jul 15 14:56:54 2022 +0000"
GOGCCFLAGS="-fPIC -m64 -march=z196 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build182921307=/tmp/go-build -gno-record-gcc-switches"

What did you do?

cd misc/cgo/testcarchive
go test -failfast -count=5000 -timeout=0 -run TestSignalForwardingExternal

What did you expect to see?

Test should always pass

What did you see instead?

--- FAIL: TestSignalForwardingExternal (1.85s)
    carchive_test.go:681: got signal: quit (core dumped); expected signal segmentation fault
exit status 1
FAIL    misc/cgo/testcarchive   5177.741s

This test failed twice while I was testing new s390x builder hardware:

This failure could also happen on other architectures but for some reason its more likely to happen on the new machine I'm testing. As far as I'm able to see, this failure has not happened on the existing s390x builders.

I'm able to reproduce the same failure on linux amd64 by removing the code related to waiting for the testp program to send the OK trigger which makes the window where a go thread could handle the SIGSEV signal larger:

ok, err := r.ReadString('\n')

I ran the test under strace and I think this could happen since this CL in go1.15 which makes SIGSEGV always crash the go program. In c-archive programs GOTRACEBACK=crash is always effectively set and that causes the go signal handler to send SIGINT to its own process:


and that's how the tests sees the testp process has exited with SIGINT instead of SIGSEGV.

I think the fix could be for the test to send SIGUSR1 to testp instead of SIGSEGV. I've tried that out on s390x and amd64 and haven't seen any failures. If that makes sense, I can submit a CL.

@jonathan-albrecht-ibm jonathan-albrecht-ibm added the NeedsFix label Jul 15, 2022
Copy link
Contributor Author

@jonathan-albrecht-ibm jonathan-albrecht-ibm commented Jul 21, 2022

@ianlancetaylor I think you might be the most familiar with this test. Could you have a look when you get a chance?

Copy link

@ianlancetaylor ianlancetaylor commented Jul 21, 2022

The point of the test is to check that SIGSEGV sent from some other process is handled correctly. It has to be SIGSEGV, or some other synchronous signal, because those signals are handled specially in c-archive/c-shared mode. The test was added for #14899, q.v. Sending SIGUSR1 would not be a useful test here.

The point of waiting for the OK string is to make sure that the Go runtime is fully initialized, by the call to Noop in the C code. That initialization should ensure that the Go SIGSEGV signal handler is installed. If we don't wait for the string, then it is possible that the signal handle has not been installed. So that doesn't seem like a good check of what is happening.

When I run this on linux-amd64 the signal is delivered to a C thread, so sigtrampgo sees that g == nil and winds up calling badsignal. That seems correct as at this point in the program we don't expect any Go threads to be running.

You say that GOTRACEBACK=crash is effectively set here, but I don't quite see why that is. Can you explain?

I do agree that if the signal is somehow delivered to a Go thread, we may not take the expected action.

One thing we should certainly do is log the test program's standard error, so that we can see what it thinks is happening.

Copy link
Contributor Author

@jonathan-albrecht-ibm jonathan-albrecht-ibm commented Jul 21, 2022

Thanks for the explanation, that helps a lot.

Regarding the "GOTRACEBACK=crash is effectively set here" comment, I was referring to:


Lines 435 to 437 in 244c8b0

if islibrary || isarchive {
t |= tracebackCrash

which will always set the crash flag for c-archive programs if I understand correctly.

I have the strace output for a failed test run and it includes both a failing test program execution and a successful one. Because TestSignalForwardingExternal runs the test program in a loop and the check done in the expectSignal function calls t.Error* instead of t.Fatal* the test program gets run a second time. Looking at the strace output again, I think it does show that the a Go thread is getting the SIGSEGV signal in the SIGINT case. Here is the full strace output: strace-failed.txt

SIGINT strace output first time in the loop:

pid 977139 - TestSignalForwardingExternal thread
pid 977322 - testp C thread
pid 977324 - some testp Go thread
I think a Go thread (pid 977324) is getting the SIGSEGV instead of the C thread (pid 977322).
This snippet starts at line 154635:

[pid 977322] write(2, "OK\n", 3 <unfinished ...>
[pid 977324] nanosleep({tv_sec=0, tv_nsec=20},  <unfinished ...>
[pid 977322] <... write resumed>)       = 3
[pid 977139] <... epoll_wait resumed>[{events=EPOLLIN, data={u32=1023, u64=4395431198072}}], 128, -1) = 1
[pid 977322] select(0, NULL, NULL, NULL, {tv_sec=60, tv_usec=0} <unfinished ...>
[pid 977139] clock_gettime(CLOCK_MONOTONIC, {tv_sec=442529, tv_nsec=884131926}) = 0
[pid 977139] futex(0x290e58, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 977136] <... futex resumed>)       = 0
[pid 977139] read(3,  <unfinished ...>
[pid 977136] sched_yield( <unfinished ...>
[pid 977139] <... read resumed>"OK\n", 4096) = 3
[pid 977139] kill(977322, SIGSEGV <unfinished ...>
[pid 977324] <... nanosleep resumed>NULL) = 0
[pid 977139] <... kill resumed>)        = 0
[pid 977324] --- SIGSEGV {si_signo=SIGSEGV, si_code=SI_USER, si_pid=977135, si_uid=1001} ---
[pid 977322] <... select resumed>)      = ? ERESTARTNOHAND (To be restarted if no handler)

... <goroutine dump output from pid 977324 and other stuff>...

[pid 977324] kill(977322, SIGQUIT <unfinished ...>
[pid 977135] <... clock_gettime resumed>{tv_sec=442529, tv_nsec=894865017}) = 0
[pid 977324] <... kill resumed>)        = 0
[pid 977322] <... select resumed>)      = ? ERESTARTNOHAND (To be restarted if no handler)
[pid 977324] nanosleep({tv_sec=5, tv_nsec=4995000},  <unfinished ...>
[pid 977322] --- SIGQUIT {si_signo=SIGQUIT, si_code=SI_USER, si_pid=977322, si_uid=1001} ---
[pid 977135] epoll_wait(4, [], 128, 0)  = 0
[pid 977327] <... futex resumed>)       = ?
[pid 977326] <... futex resumed>)       = ?
[pid 977325] <... futex resumed>)       = ?
[pid 977324] <... nanosleep resumed> <unfinished ...>) = ?
[pid 977323] <... futex resumed>)       = ?
[pid 977135] epoll_wait(4,  <unfinished ...>
[pid 977327] +++ killed by SIGQUIT (core dumped) +++
[pid 977326] +++ killed by SIGQUIT (core dumped) +++
[pid 977325] +++ killed by SIGQUIT (core dumped) +++
[pid 977324] +++ killed by SIGQUIT (core dumped) +++
[pid 977323] +++ killed by SIGQUIT (core dumped) +++
[pid 977322] +++ killed by SIGQUIT (core dumped) +++

SIGSEGV strace output second time in the loop:

pid 977135 - TestSignalForwardingExternal thread
pid 977334 - testp C thread
This snippet starts at line 156599:

[pid 977334] write(2, "OK\n", 3 <unfinished ...>
[pid 977135] <... epoll_wait resumed>[{events=EPOLLIN, data={u32=1023, u64=4395431198072}}], 128, -1) = 1
[pid 977334] <... write resumed>)       = 3
[pid 977135] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 977334] select(0, NULL, NULL, NULL, {tv_sec=60, tv_usec=0} <unfinished ...>
[pid 977135] <... clock_gettime resumed>{tv_sec=442530, tv_nsec=272942121}) = 0
[pid 977135] futex(0x290e58, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 977135] read(3,  <unfinished ...>
[pid 977338] <... nanosleep resumed>NULL) = 0
[pid 977136] <... futex resumed>)       = 0
[pid 977338] futex(0xc00003ed48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 977136] clock_gettime(CLOCK_MONOTONIC,  <unfinished ...>
[pid 977135] <... read resumed>"OK\n", 4096) = 3
[pid 977336] <... nanosleep resumed>NULL) = 0
[pid 977136] <... clock_gettime resumed>{tv_sec=442530, tv_nsec=272980794}) = 0
[pid 977135] kill(977334, SIGSEGV <unfinished ...>
[pid 977336] nanosleep({tv_sec=0, tv_nsec=20},  <unfinished ...>
[pid 977136] nanosleep({tv_sec=0, tv_nsec=20},  <unfinished ...>
[pid 977135] <... kill resumed>)        = 0
[pid 977334] <... select resumed>)      = ? ERESTARTNOHAND (To be restarted if no handler)
[pid 977135] waitid(P_PID, 977334,  <unfinished ...>
[pid 977334] --- SIGSEGV {si_signo=SIGSEGV, si_code=SI_USER, si_pid=977135, si_uid=1001} ---
[pid 977334] rt_sigprocmask(SIG_SETMASK, NULL, ~[KILL STOP], 8) = 0
[pid 977334] rt_sigprocmask(SIG_SETMASK, ~[], NULL, 8) = 0
[pid 977334] sigaltstack(NULL, {ss_sp=NULL, ss_flags=SS_DISABLE, ss_size=0}) = 0
[pid 977334] sigaltstack({ss_sp=0xc000040000, ss_flags=0, ss_size=32768}, NULL) = 0
[pid 977336] <... nanosleep resumed>NULL) = 0
[pid 977334] gettid( <unfinished ...>
[pid 977336] nanosleep({tv_sec=0, tv_nsec=20},  <unfinished ...>
[pid 977334] <... gettid resumed>)      = 977334
[pid 977334] rt_sigprocmask(SIG_UNBLOCK, [SEGV], NULL, 8) = 0
[pid 977334] rt_sigaction(SIGSEGV, {sa_handler=SIG_DFL, sa_mask=~[], sa_flags=SA_ONSTACK|SA_RESTART|SA_SIGINFO}, NULL, 8) = 0
[pid 977334] getpid()                   = 977334
[pid 977334] gettid()                   = 977334
[pid 977334] tgkill(977334, 977334, SIGSEGV) = 0
[pid 977136] <... nanosleep resumed>NULL) = 0
[pid 977334] --- SIGSEGV {si_signo=SIGSEGV, si_code=SI_TKILL, si_pid=977334, si_uid=1001} ---

... <other stuff I don't understand> ...

[pid 977339] +++ killed by SIGSEGV (core dumped) +++
[pid 977338] +++ killed by SIGSEGV (core dumped) +++
[pid 977337] +++ killed by SIGSEGV (core dumped) +++
[pid 977336] +++ killed by SIGSEGV (core dumped) +++
[pid 977335] +++ killed by SIGSEGV (core dumped) +++
[pid 977334] +++ killed by SIGSEGV (core dumped) +++

One difference between the two test program executions is where the <... select resumed>) = ? ERESTARTNOHAND (To be restarted if no handler) occurs but I don't know if that's significant.

Copy link

@ianlancetaylor ianlancetaylor commented Jul 21, 2022

Thanks for the pointer about GOTRACEBACK=crash.

It does seem possible to the signal to be delivered to a Go thread. I think that will fix the test problem. Thanks.

Copy link

@gopherbot gopherbot commented Jul 21, 2022

Change mentions this issue: misc/cgo/testcarchive: permit SIGQUIT for TestSignalForwardingExternal

Copy link
Contributor Author

@jonathan-albrecht-ibm jonathan-albrecht-ibm commented Jul 22, 2022

Thanks for creating the CL @ianlancetaylor! I'm running go test -failfast -count=10000 -timeout=0 ./ -run TestSignalForwarding[EG].* on it now. So far so good.

Since this issue shows up on the new s390x builder VMs I'm preparing, could it also be backported?

Copy link

@ianlancetaylor ianlancetaylor commented Jul 22, 2022

@jonathan-albrecht-ibm Which releases would you want to see the patch in?

Copy link
Contributor Author

@jonathan-albrecht-ibm jonathan-albrecht-ibm commented Jul 22, 2022

@ianlancetaylor I'd like to get this patch into the last two releases so that when I add the new s390x dashboard builder VMs there wouldn't be any chance of those builders failing on this test when testing security/point releases. I'm assuming that's needed but let me know if not or if there is another way of handling it.

I know 1.19 is about to be released so really I'd like to get it into which ever release branches are going to have further releases. Would that be release-branch.go1.19 and release-branch.go1.18 right now?

Copy link
Contributor Author

@jonathan-albrecht-ibm jonathan-albrecht-ibm commented Jul 25, 2022

I ran the TestSignalForwardingExternal and TestSignalForwardingGo tests from the CL on my s390x VM that was prone to failing the original TestSignalForwardingExternal test 10,000 times and got no failures:

$ go test -failfast -count=10000 -timeout=0 ./ -run TestSignalForwarding[EG].*
ok      misc/cgo/testcarchive   32440.474s

Copy link

@ianlancetaylor ianlancetaylor commented Jul 25, 2022

@gopherbot Please open backport to 1.18.

This is a test-only fix that avoids occasional test failures on s390x and possibly other systems.

Copy link

@gopherbot gopherbot commented Jul 25, 2022

Backport issue(s) opened: #54056 (for 1.18), #54239 (for 1.19).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
None yet

No branches or pull requests

4 participants