New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: TestSignalIgnoreSIGTRAP flaky on OpenBSD-amd64 #17496

Open
bradfitz opened this Issue Oct 18, 2016 · 10 comments

Comments

Projects
None yet
9 participants
@bradfitz
Member

bradfitz commented Oct 18, 2016

Trybot flake from https://go-review.googlesource.com/31173

https://storage.googleapis.com/go-build-log/a2e3e68c/openbsd-amd64-gce58_b0eb773d.log

ok      reflect 0.266s
ok      regexp  0.362s
ok      regexp/syntax   0.307s
--- FAIL: TestSignalIgnoreSIGTRAP (0.05s)
    crash_test.go:106: testprognet SignalIgnoreSIGTRAP exit status: exit status 2
    crash_unix_test.go:176: want OK
        , got SIGTRAP: trace trap
        PC=0x45a694 m=0

        goroutine 1 [syscall]:
        syscall.Syscall(0x25, 0x6554, 0x5, 0x0, 0x0, 0x6554, 0x0)
            /tmp/workdir/go/src/syscall/asm_unix_amd64.s:19 +0x5 fp=0xc420047e08 sp=0xc420047e00
        syscall.Kill(0x6554, 0x5, 0x1, 0x1)
            /tmp/workdir/go/src/syscall/zsyscall_openbsd_amd64.go:666 +0x4b fp=0xc420047e58 sp=0xc420047e08
        main.SignalIgnoreSIGTRAP()
            /tmp/workdir/go/src/runtime/testdata/testprognet/signal.go:24 +0x8d fp=0xc420047ea0 sp=0xc420047e58
        main.main()
            /tmp/workdir/go/src/runtime/testdata/testprognet/main.go:34 +0x1f1 fp=0xc420047f48 sp=0xc420047ea0
        runtime.main()
            /tmp/workdir/go/src/runtime/proc.go:185 +0x20a fp=0xc420047fa0 sp=0xc420047f48
        runtime.goexit()
            /tmp/workdir/go/src/runtime/asm_amd64.s:2160 +0x1 fp=0xc420047fa8 sp=0xc420047fa0

        goroutine 17 [syscall, locked to thread]:
        runtime.goexit()
            /tmp/workdir/go/src/runtime/asm_amd64.s:2160 +0x1

        goroutine 5 [runnable]:
        os/signal.loop()
            /tmp/workdir/go/src/os/signal/signal_unix.go:20
        created by os/signal.init.1
            /tmp/workdir/go/src/os/signal/signal_unix.go:28 +0x41

        rax    0x0
        rbx    0x0
        rcx    0x45a694
        rdx    0x0
        rdi    0x6554
        rsi    0x5
        rbp    0xc420047e48
        rsp    0xc420047e00
        r8     0x0
        r9     0x0
        r10    0x0
        r11    0x246
        r12    0xc42006c4e1
        r13    0x1
        r14    0xf3
        r15    0x3
        rip    0x45a694
        rflags 0x246
        cs     0x2b
        fs     0x2c8feeaa0
        gs     0x29

FAIL
FAIL    runtime 34.550s
2016/10/17 23:20:53 Failed: exit status 1
@rsc

This comment has been minimized.

Contributor

rsc commented Oct 27, 2016

Plenty of failures, all OpenBSD:

$ grep -lR ,.got.SIGTRAP:.trace.trap rev 
rev/2016-01-12T02:00:48-352e287/openbsd-386-gce56
rev/2016-01-20T04:43:43-5395846/openbsd-386-gce56
rev/2016-02-09T22:14:02-fd458ba/openbsd-amd64-gce58
rev/2016-04-12T23:14:16-982274c/openbsd-386-gce58
rev/2016-05-01T02:36:46-9813951/openbsd-386-gce58
rev/2016-05-15T06:24:02-b4bf066/openbsd-386-gce58
rev/2016-05-31T13:02:09-4223294/openbsd-386-gce58
rev/2016-06-01T10:28:01-ad074e2/openbsd-386-gce58
rev/2016-08-16T00:24:16-e0d8064/openbsd-386-gce58
rev/2016-09-06T17:21:22-b926bf8/openbsd-386-gce58
rev/2016-09-16T03:45:18-9658de3/openbsd-386-gce58
rev/2016-09-21T01:25:29-35d22af/openbsd-386-gce58
rev/2016-10-20T01:06:09-ef8e85e/openbsd-386-gce58

The program in question does:

func SignalIgnoreSIGTRAP() {
    signal.Ignore(syscall.SIGTRAP)
    syscall.Kill(syscall.Getpid(), syscall.SIGTRAP)
    println("OK")
}

signal.Ignore sets the 1<<SIGTRAP bit in runtime's sig.ignored, which can be queried by signal_ignored:

// Must only be called from a single goroutine at a time.
//go:linkname signal_ignore os/signal.signal_ignore
func signal_ignore(s uint32) {
    if s >= uint32(len(sig.wanted)*32) {
        return
    }
    sig.wanted[s/32] &^= 1 << (s & 31)
    sig.ignored[s/32] |= 1 << (s & 31)
    sigignore(s)
}

// Checked by signal handlers.
func signal_ignored(s uint32) bool {
    return sig.ignored[s/32]&(1<<(s&31)) != 0
}

Then syscall.Kill sends the SIGTRAP, which ends up in the signal handler, which does:

if c.sigcode() == _SI_USER && signal_ignored(sig) {
    return
}

Clearly that condition is false and should be true.

  • Is c.sigcode() == _SI_USER false?
    • Maybe the kernel has a bug and sends the wrong siginfo sometimes?
    • Maybe we misinterpret the siginfo sometimes?
  • Is signal_ignored(sig) false?
    • Maybe multiple calls to signal_ignore step on each other? No, os/signal is holding a lock when calling it.
    • Maybe memory corruption due to the kernel?
    • Maybe signal.Ignore happens on one thread/CPU and the signal handler is invoked on a separate thread/CPU,
      and maybe there's no synchronization between the signaling CPU and the handling CPU, so that the handling CPU
      has not yet seen the update to sig.ignored. How? The same thread that set the sig.ignored bit triggered the signal. How could the existence of the signal propagate to the other CPU without carrying the sig.ignored memory update with it?

I built testprognet on an openbsd-amd64-gce58 gomote and am running 'stress -p 50 testprognet SignalIgnoreSIGTRAP' in hopes of getting even 1 failure. Nothing so far.

@gopherbot

This comment has been minimized.

gopherbot commented Oct 27, 2016

CL https://golang.org/cl/32183 mentions this issue.

@rsc

This comment has been minimized.

Contributor

rsc commented Oct 27, 2016

CL 32183 added a print of sigcode in the SIGTRAP crash. That will at least let us see which half of the if statement is wrong. I can't reproduce this myself but it seems to be happening around once a month, so maybe we can just wait and see.

At some point we should think about using atomic loads/stores to synchronize signal_ignore with signal_ignored, but I'd prefer to gather more information before we stomp around where the bug might be. I talked to @aclements and he agrees that there doesn't seem to be any plausible way to connect the dots in the missed memory update theory.

gopherbot pushed a commit that referenced this issue Oct 27, 2016

runtime: print sigcode on signal crash
For #17496.

Change-Id: I671a59581c54d17bc272767eeb7b2742b54eca38
Reviewed-on: https://go-review.googlesource.com/32183
Run-TryBot: Russ Cox <rsc@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@mdempsky

This comment has been minimized.

Member

mdempsky commented Oct 27, 2016

Aside, there seems to be a theoretical race: we're setting sig.ignored in thread A, then sending a process-directed SIGTRAP signal, which can be handled in thread B on a different CPU, where the sig.ignored write isn't visible yet.

Reviewing OpenBSD's kill system call implementation, it looks like the signal should always be handled during this test by A (see "If the current thread can process the signal immediately (it's unblocked) then have it take it." in ptsignal). So I don't think that's the issue here.

Experimentally, I can see A!=B occasionally happens on Linux though:

$ strace -f ./testprognet SignalIgnoreSIGTRAP 2>&1 | grep SIGTRAP
execve("./testprognet", ["./testprognet", "SignalIgnoreSIGTRAP"], [/* 44 vars */]) = 0
rt_sigaction(SIGTRAP, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGTRAP, {0x45ab30, ~[], SA_RESTORER|SA_STACK|SA_RESTART|SA_SIGINFO, 0x45abf0}, NULL, 8) = 0
[pid  4106] kill(4106, SIGTRAP <unfinished ...>
[pid  4108] --- SIGTRAP {si_signo=SIGTRAP, si_code=SI_USER, si_pid=4106, si_uid=123644} ---

(I.e., SIGTRAP is being sent by thread 4106, but being handled by thread 4108.)

There also seems to be a theoretical race that when SIGTRAP is handled on a separate thread, there's no synchronization to ensure the main thread doesn't just immediately print "OK" and exit, before the SIGTRAP signal handler runs. But that would only cause sporadic success, not sporadic failure.

@aclements

This comment has been minimized.

Member

aclements commented Oct 27, 2016

Aside, there seems to be a theoretical race: we're setting sig.ignored in thread A, then sending a process-directed SIGTRAP signal, which can be handled in thread B on a different CPU, where the sig.ignored write isn't visible yet.

At least on x86, I don't believe this is possible. Thread A on CPU A writes sig.ignored and then must write something to inform thread B on CPU B that it wants to send a signal. If CPU B observes the second write, then it must also observe the first write. Reads can be reordered with writes, but writes cannot be reordered with other writes, and reads cannot be reordered with other reads.

@mdempsky

This comment has been minimized.

Member

mdempsky commented Oct 27, 2016

Good point, I agree, x86's memory model doesn't seem to allow that particular race.

@rsc rsc modified the milestones: Go1.9, Go1.8 Nov 11, 2016

@josharian

This comment has been minimized.

Contributor

josharian commented Mar 10, 2017

FWIW, there's also been an openbsd/386 failure: https://build.golang.org/log/0056d2a00dc647d543f6f56f14d1056c5828581c

@aclements

This comment has been minimized.

Member

aclements commented Jun 7, 2017

Recent failures with the sigcode output:

2016-12-01T20:08:56-b42d4a8/openbsd-386-gce58 PC=0x8097237 m=0 sigcode=418098389
2016-12-01T23:11:15-e3a1d0c/openbsd-arm PC=0x6e0f4 m=0 sigcode=282571616
2017-03-03T21:02:17-fbf4dd9/openbsd-386-60 PC=0x80971a7 m=0 sigcode=415316774
2017-03-06T15:05:42-06a6b3a/openbsd-386-60 PC=0x8097537 m=0 sigcode=33
2017-03-10T01:02:14-678f35b/openbsd-386-60 PC=0x8097b77 m=0 sigcode=1713534066
2017-03-22T17:33:27-cfb3c8d/openbsd-386-60 PC=0x8098a47 m=0 sigcode=4096
2017-03-31T20:05:33-8ab7130/openbsd-386-60 PC=0x8099a17 m=0 sigcode=8

The sigcodes are all over the place and look like they're just corrupted. The PCs seem reasonable, though, so it's not like the whole signal context is bad (the PC isn't exactly where I see syscall.Kill when I build the binary, but it's within a few KB and maybe the cross-compile is doing something that throws it off).

Re-assigning to @ianlancetaylor now that we have more debugging info.

@ianlancetaylor

This comment has been minimized.

Contributor

ianlancetaylor commented Jun 28, 2017

I ran the test over 10,000 times on an OpenBSD 386 gomote, but failed to reproduce it. I checked things like the placement of the siginfo struct (on the signal stack) and the arguments. I read through all the code I could think of checking, and didn't see anything at all. I confirmed that the PC value in a real program is correct.

I have no suggestions for how to proceed.

@ianlancetaylor ianlancetaylor removed their assignment Jun 28, 2017

@ianlancetaylor ianlancetaylor modified the milestones: Go1.10, Go1.9 Jun 28, 2017

@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017

@gopherbot gopherbot modified the milestones: Go1.11, Unplanned May 23, 2018

@4a6f656c

This comment has been minimized.

Contributor

4a6f656c commented Dec 15, 2018

As another data point, this also flakes on openbsd/arm:

https://build.golang.org/log/387c70cdbbfdf65f3672e3434ad0f843ffcabe6b

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment