runtime, net: SIGSEGV crashes along netpoll codepath on platforms using kqueue #14127

Closed
mikioh opened this issue Jan 28, 2016 · 9 comments

@mikioh
Contributor

commented Jan 28, 2016

This might be a long-standing issue on all kqueue-based platforms since Go 1.2.

Program received signal SIGSEGV, Segmentation fault.
[Switching to LWP 1]
runtime.netpollunblock (pd=0x0, mode=114, ioready=true, ~r3=0xc820148480)

and

(gdb) p/x *ev
$5 = {ident = 0x0, filter = 0x0, flags = 0x8000, fflags = 0x0, pad_cgo_0 = {0x0, 0x0, 0x0, 0x0}, data = 0x0, udata = 0x0}

We should not assume that kevent returns the stored user-defined data (udata) under all conditions.
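
To illustrate the point, here is a minimal sketch of a defensive check, not the runtime's actual code. The types `keventT` and `pollDesc` and the function `readyDescs` are hypothetical stand-ins invented for this example; only the field names mirror the gdb dump above. It shows events being skipped when `udata` is nil instead of being dereferenced. Note that such a check would only avoid the crash in user space; it says nothing about why the kernel returned an empty event.

```go
package main

import "fmt"

// pollDesc is a hypothetical placeholder for the runtime's poll descriptor.
type pollDesc struct {
	fd int
}

// keventT is a simplified, hypothetical stand-in for the kevent struct;
// only the fields relevant to the crash are modeled.
type keventT struct {
	ident  uint64
	filter int16
	flags  uint16
	udata  *pollDesc
}

// readyDescs collects the poll descriptors attached to the given events.
// Events whose udata is nil are counted and skipped rather than dereferenced.
func readyDescs(events []keventT) (ready []*pollDesc, skipped int) {
	for i := range events {
		ev := &events[i]
		if ev.udata == nil {
			skipped++
			continue
		}
		ready = append(ready, ev.udata)
	}
	return ready, skipped
}

func main() {
	events := []keventT{
		{ident: 7, filter: -1, udata: &pollDesc{fd: 7}},
		{flags: 0x8000}, // shaped like the zeroed event in the gdb dump
	}
	ready, skipped := readyDescs(events)
	fmt.Println(len(ready), skipped)
}
```

With the two sample events, one descriptor is returned and one event is skipped.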

@mikioh mikioh added this to the Go1.6 milestone Jan 28, 2016

@ianlancetaylor

Contributor

commented Jan 28, 2016

Details? Where are you seeing this?

@mikioh

Contributor Author

commented Jan 28, 2016

On netbsd7-amd64. I just resurrected the VM to celebrate the 1.6rc1 release (actually e970966) and got the following during the first all.bash run:

ok      mime/quotedprintable    0.739s
ok      net     8.549s
fatal error: unexpected signal during runtime execution
[signal 0xb code=0x1 addr=0x0 pc=0x429e98]

runtime stack:
runtime.throw(0xa0b880, 0x2a)
        /go/src/runtime/panic.go:530 +0x90
runtime.sigpanic()
        /go/src/runtime/sigpanic_unix.go:12 +0x5a
runtime.netpollunblock(0x0, 0x100000072, 0xc8203f4000)
        /go/src/runtime/netpoll.go:352 +0xe8
runtime.netpollready(0x7f7ff67fd428, 0x0, 0x7f7f00000072)
        /go/src/runtime/netpoll.go:285 +0x112
runtime.netpoll(0xc820019500, 0x0)
        /go/src/runtime/netpoll_kqueue.go:94 +0x1cc
runtime.findrunnable(0xc820018000, 0x0)
        /go/src/runtime/proc.go:1826 +0x189
runtime.schedule()
        /go/src/runtime/proc.go:2060 +0x24f
runtime.park_m(0xc8203f4480)
        /go/src/runtime/proc.go:2125 +0x18b
runtime.mcall(0x7f7ff7b01080)
        /go/src/runtime/asm_amd64.s:233 +0x5b
(snip)

then run gdb with "-test.v -test.run=TestRoundTripGzip -test.count=10000".

@mikioh

Contributor Author

commented Jan 28, 2016

Nice, "http.test -test.v -test.run=TestRoundTripGzip" can crash the kernel.

/netbsd: uvm_fault(0xfffffe8079bc8e80, 0x0, 1) -> e
netbsd-amd64 /netbsd: fatal page fault in supervisor mode
netbsd-amd64 /netbsd: trap type 6 code 0 rip ffffffff805bf9d5 cs 8 rflags 10246 cr2 18 ilevel 0 rsp fffffe80443f5c88
netbsd-amd64 /netbsd: curlwp 0xfffffe805bc46a80 pid 23983.2 lowest kstack 0xfffffe80443f32c0
netbsd-amd64 /netbsd: panic: trap
netbsd-amd64 /netbsd: cpu0: Begin traceback...
netbsd-amd64 /netbsd: vpanic() at netbsd:vpanic+0x13c
netbsd-amd64 /netbsd: snprintf() at netbsd:snprintf
netbsd-amd64 /netbsd: startlwp() at netbsd:startlwp
netbsd-amd64 /netbsd: alltraps() at netbsd:alltraps+0x96
netbsd-amd64 /netbsd: sys___kevent50() at netbsd:sys___kevent50+0x33
netbsd-amd64 /netbsd: syscall() at netbsd:syscall+0x9a
netbsd-amd64 /netbsd: --- syscall (number 435) ---
netbsd-amd64 /netbsd: 45f643:
netbsd-amd64 /netbsd: cpu0: End traceback...
netbsd-amd64 /netbsd: 
netbsd-amd64 /netbsd: dumping to dev 0,1 (offset=0, size=0): not possible
netbsd-amd64 /netbsd: rebooting...

The disassembly of /netbsd shows:

ffffffff805bf959:       00 
ffffffff805bf95a:       48 8b 43 08             mov    0x8(%rbx),%rax
ffffffff805bf95e:       49 89 45 18             mov    %rax,0x18(%r13)
ffffffff805bf962:       48 8b 43 08             mov    0x8(%rbx),%rax
ffffffff805bf966:       4c 89 28                mov    %r13,(%rax)
ffffffff805bf969:       49 8d 45 10             lea    0x10(%r13),%rax
ffffffff805bf96d:       48 89 43 08             mov    %rax,0x8(%rbx)
ffffffff805bf971:       83 43 70 01             addl   $0x1,0x70(%rbx)
ffffffff805bf975:       41 83 4d 50 02          orl    $0x2,0x50(%r13)
ffffffff805bf97a:       e9 72 ff ff ff          jmpq   ffffffff805bf8f1 <kevent1+0x4b2>
ffffffff805bf97f:       48 8b 7c 24 50          mov    0x50(%rsp),%rdi
ffffffff805bf984:       e8 07 a8 07 00          callq  ffffffff8063a190 <mutex_spin_exit>
ffffffff805bf989:       48 8b 7c 24 10          mov    0x10(%rsp),%rdi
ffffffff805bf98e:       e8 6d a7 07 00          callq  ffffffff8063a100 <mutex_enter>
ffffffff805bf993:       ba 01 00 00 00          mov    $0x1,%edx
ffffffff805bf998:       48 8b 74 24 18          mov    0x18(%rsp),%rsi
ffffffff805bf99d:       4c 89 ef                mov    %r13,%rdi
ffffffff805bf9a0:       e8 42 e9 ff ff          callq  ffffffff805be2e7 <knote_detach>
ffffffff805bf9a5:       48 8b 7c 24 50          mov    0x50(%rsp),%rdi
ffffffff805bf9aa:       e8 91 a7 07 00          callq  ffffffff8063a140 <mutex_spin_enter>
ffffffff805bf9af:       e9 3d ff ff ff          jmpq   ffffffff805bf8f1 <kevent1+0x4b2>
ffffffff805bf9b4:       48 89 14 24             mov    %rdx,(%rsp)
ffffffff805bf9b8:       48 8b 7c 24 50          mov    0x50(%rsp),%rdi
ffffffff805bf9bd:       e8 ce a7 07 00          callq  ffffffff8063a190 <mutex_spin_exit>
ffffffff805bf9c2:       bf 01 00 00 00          mov    $0x1,%edi
ffffffff805bf9c7:       e8 a7 9c 00 00          callq  ffffffff805c9673 <_kernel_lock>
ffffffff805bf9cc:       49 8b 45 68             mov    0x68(%r13),%rax
ffffffff805bf9d0:       31 f6                   xor    %esi,%esi
ffffffff805bf9d2:       4c 89 ef                mov    %r13,%rdi
ffffffff805bf9d5:       ff 50 18                callq  *0x18(%rax) // rip
ffffffff805bf9d8:       41 89 c4                mov    %eax,%r12d
ffffffff805bf9db:       31 f6                   xor    %esi,%esi
ffffffff805bf9dd:       bf 01 00 00 00          mov    $0x1,%edi
ffffffff805bf9e2:       e8 0b 9e 00 00          callq  ffffffff805c97f2 <_kernel_unlock>
ffffffff805bf9e7:       48 8b 7c 24 50          mov    0x50(%rsp),%rdi
ffffffff805bf9ec:       e8 4f a7 07 00          callq  ffffffff8063a140 <mutex_spin_enter>
ffffffff805bf9f1:       41 8b 75 50             mov    0x50(%r13),%esi
ffffffff805bf9f5:       40 f6 c6 02             test   $0x2,%sil
ffffffff805bf9f9:       48 8b 14 24             mov    (%rsp),%rdx
ffffffff805bf9fd:       0f 85 75 fd ff ff       jne    ffffffff805bf778 <kevent1+0x339>
ffffffff805bfa03:       45 85 e4                test   %r12d,%r12d
ffffffff805bfa06:       0f 85 8f fe ff ff       jne    ffffffff805bf89b <kevent1+0x45c>
ffffffff805bfa0c:       83 e6 fe                and    $0xfffffffe,%esi
ffffffff805bfa0f:       41 89 75 50             mov    %esi,0x50(%r13)
ffffffff805bfa13:       e9 60 fd ff ff          jmpq   ffffffff805bf778 <kevent1+0x339>
ffffffff805bfa18:       49 8b 45 18             mov    0x18(%r13),%rax
ffffffff805bfa1c:       48 89 43 08             mov    %rax,0x8(%rbx)
ffffffff805bfa20:       e9 47 fe ff ff          jmpq   ffffffff805bf86c <kevent1+0x42d>
ffffffff805bfa25:       44 8b 44 24 38          mov    0x38(%rsp),%r8d
ffffffff805bfa2a:       48 8b 4c 24 28          mov    0x28(%rsp),%rcx
ffffffff805bfa2f:       48 8b 54 24 48          mov    0x48(%rsp),%rdx
ffffffff805bfa34:       48 8d b4 24 90 00 00    lea    0x90(%rsp),%rsi
ffffffff805bfa3b:       00

so the crash happened around

static int
kqueue_scan(...)
{
            (snip)
            if (nkev == kevcnt) {
                /* do copyouts in kevcnt chunks */
                mutex_spin_exit(&kq->kq_lock);
                error = (*keops->keo_put_events)
                    (keops->keo_private,
                    kevbuf, ulistp, nevents, nkev);
                mutex_spin_enter(&kq->kq_lock);
                nevents += nkev;
                nkev = 0;
                kevp = kevbuf;
            }

in http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/kern/kern_event.c?rev=1.80.2.1&content-type=text/x-cvsweb-markup&only_with_tag=netbsd-7-0-RELEASE.

@mikioh

Contributor Author

commented Jan 28, 2016

Looks like there's a possibility of a kernel fault.

@mikioh mikioh modified the milestones: Go1.6Maybe, Go1.6 Jan 28, 2016

@mikioh mikioh added the OS-NetBSD label Jan 28, 2016

@mikioh

Contributor Author

commented Jan 28, 2016

Okay, both darwin and dragonfly survived the torture test "http.test -test.v -test.run=TestRoundTripGzip -test.count=10000". It might be a NetBSD kernel issue.

@rsc

Contributor

commented Jan 29, 2016

Thanks for investigating. Postponing to Go 1.7 unless more evidence appears that Go is at fault.

@rsc rsc modified the milestones: Go1.7, Go1.6Maybe Jan 29, 2016

@bsiegert

Contributor

commented Jan 30, 2016

I filed the kernel panic on the NetBSD side as http://gnats.netbsd.org/50730.

@bsiegert

Contributor

commented Jan 31, 2016

Some more details and a partial fix in the other bug:

  • use a marker knote from the stack instead of allocating and freeing on
    each scan.
  • add more KASSERTS
  • introduce a KN_BUSY bit that indicates that the knote is currently being
    scanned, so that knote_detach does not end up deleting it when the file
    descriptor gets closed and we don't end up using/trashing free memory from
    the scan.
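
The busy-bit idea in the last item can be sketched as a toy model in Go. This is not NetBSD code: the types `note` and `queue` and their methods are hypothetical, and deferring the free to the scanner is an assumption of this sketch, since the summary above does not say how the real patch completes a detach that arrives during a scan.

```go
package main

import (
	"fmt"
	"sync"
)

// note is a hypothetical toy stand-in for a kernel knote.
type note struct {
	busy          bool // set while a scan is using the note (the KN_BUSY idea)
	detachPending bool // a detach arrived while the note was busy
	freed         bool // models the note's memory having been released
}

// queue owns the lock that protects its notes.
type queue struct {
	mu sync.Mutex
}

// detach frees the note unless a scan currently holds it, in which case
// the free is left for the scanner to perform.
func (q *queue) detach(n *note) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if n.busy {
		n.detachPending = true
		return
	}
	n.freed = true
}

// scan marks the note busy, drops the lock around the callback, then
// clears the busy flag and completes any detach that was deferred.
func (q *queue) scan(n *note, callback func()) {
	q.mu.Lock()
	n.busy = true
	q.mu.Unlock()

	callback() // the note must stay valid here even if its fd is closed

	q.mu.Lock()
	n.busy = false
	if n.detachPending {
		n.detachPending = false
		n.freed = true
	}
	q.mu.Unlock()
}

func main() {
	q := &queue{}
	n := &note{}
	freedDuringScan := false
	q.scan(n, func() {
		q.detach(n) // simulate a close racing with the scan
		freedDuringScan = n.freed
	})
	fmt.Println(freedDuringScan, n.freed)
}
```

In this model the note is never freed while the scan's callback is running; the release happens only after the scanner clears the busy flag.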

There is a deadlock that can happen when one thread is exiting, one is closing a socket and the third is trying to lock that socket.

@mikioh mikioh changed the title from "runtime: SIGSEGV crashes along netpoll codepath on platforms using kqueue" to "runtime, net: SIGSEGV crashes along netpoll codepath on platforms using kqueue" Feb 28, 2016

@rsc

Contributor

commented May 18, 2016

It's been a full cycle and still no evidence this is anything but a NetBSD kernel bug. Reopen with more details if you think otherwise.

@rsc rsc closed this May 18, 2016

@golang golang locked and limited conversation to collaborators May 18, 2017
