New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: fatal error: unknown caller pc #20846

Open
karalabe opened this Issue Jun 29, 2017 · 13 comments

Comments

Projects
None yet
8 participants
@karalabe
Contributor

karalabe commented Jun 29, 2017

We've received a GC crash report from a Windows 10 user, running our Ethereum client built with Go 1.8.3 (amd64). We've never seen this error ever, so I doubt we can make a repro out of it, but here's the crash logs if you can make heads or tails of it (you can find the full crash dump at ethereum/go-ethereum#14725):

runtime: unexpected return pc for net.(*netFD).writeTo called from 0x80002000720013
fatal error: unknown caller pc

runtime stack:
runtime.throw(0xea8762, 0x11)
        C:/go/src/runtime/panic.go:596 +0x9c
runtime.gentraceback(0xffffffffffffffff, 0xc044903698, 0x0, 0xc0440544e0, 0x0, 0x0, 0x7fffffff, 0xf405e8, 0x45c2f978, 0x0, ...)
        C:/go/src/runtime/traceback.go:317 +0x13f8
runtime.copystack(0xc0440544e0, 0x2000, 0x0)
        C:/go/src/runtime/stack.go:932 +0x377
runtime.shrinkstack(0xc0440544e0)
        C:/go/src/runtime/stack.go:1186 +0x18c
runtime.scanstack(0xc0440544e0, 0xc042015228)
        C:/go/src/runtime/mgcmark.go:762 +0x386
runtime.scang(0xc0440544e0, 0xc042015228)
        C:/go/src/runtime/proc.go:866 +0x210
runtime.markroot.func1()
        C:/go/src/runtime/mgcmark.go:264 +0x7b
runtime.systemstack(0x45c2fda0)
        C:/go/src/runtime/asm_amd64.s:343 +0xb5
runtime.markroot(0xc042015228, 0xc00000006b)
        C:/go/src/runtime/mgcmark.go:269 +0x1e9
runtime.gcDrain(0xc042015228, 0x6)
        C:/go/src/runtime/mgcmark.go:1051 +0xcc
runtime.gcBgMarkWorker.func2()
        C:/go/src/runtime/mgc.go:1530 +0x101
runtime.systemstack(0xc0429f9700)
        C:/go/src/runtime/asm_amd64.s:327 +0x7e
runtime.mstart()
        C:/go/src/runtime/proc.go:1132

goroutine 19 [GC worker (idle)]:
runtime.systemstack_switch()
        C:/go/src/runtime/asm_amd64.s:281 fp=0xc042439f50 sp=0xc042439f48
runtime.gcBgMarkWorker(0xc042014000)
        C:/go/src/runtime/mgc.go:1537 +0x218 fp=0xc042439fd8 sp=0xc042439f50
runtime.goexit()
        C:/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc042439fe0 sp=0xc042439fd8
created by runtime.gcBgMarkStartWorkers
        C:/go/src/runtime/mgc.go:1412 +0x9f
@bradfitz

This comment has been minimized.

Member

bradfitz commented Jun 29, 2017

@bradfitz bradfitz added this to the Go1.9Maybe milestone Jun 29, 2017

@ianlancetaylor

This comment has been minimized.

Contributor

ianlancetaylor commented Jun 29, 2017

The only slightly odd thing I see in the full backtrace is

goroutine 84 [runnable, locked to thread]:
syscall.Syscall6(0x7ffcc6d84ab0, 0x4, 0x54, 0xc044539884,                 0x1, 0xc044539780, 0x0, 0x0, 0x1, 0x0, ...)
        C:/go/src/runtime/syscall_windows.go:174 +0x6b
syscall.(*Proc).Call(0xc0440062a0, 0xc042dd2180, 0x4, 0x4,                 0x20, 0xd36820, 0x1483201, 0xc042dd2180)
        C:/go/src/syscall/dll_windows.go:153 +0x4dc
syscall.(*LazyProc).Call(0xc0421d1e60, 0xc042dd2180, 0x4,                 0x4, 0x1, 0x0, 0x14832e0, 0xc0423560b0)
        C:/go/src/syscall/dll_windows.go:303 +0x66
github.com/ethereum/go-ethereum/vendor/github.com/peterh/l                iner.(*State).readNext(0xc0421daa80, 0xc044539ae0, 0x2, 0x                20, 0xc044539a60)
        C:/gopath/src/github.com/ethereum/go-ethereum/vend                or/github.com/peterh/liner/input_windows.go:178 +0x120

Why is a goroutine running in syscall6 marked as runnable?

We should probably force GOTRACEBACK=system for this crash, so that we can see the runtime functions.

In fact, since this problem seems somewhat repeatable, can we see a stack trace with GOTRACEBACK=system set in the environment?

@aclements

This comment has been minimized.

Member

aclements commented Jun 29, 2017

I asked for the GOTRACEBACK=system traceback on the original issue.

For reference, here's the trace of the goroutine with the bad return PC:

goroutine 140524 [runnable (scan)]:
net.runtime_pollWait(0x4c442b8, 0x77, 0x0)
        C:/go/src/runtime/netpoll.go:164 +0x60
net.(*pollDesc).wait(0xc0421cefa0, 0x77, 0x0, 0x0)
        C:/go/src/net/fd_poll_runtime.go:75 +0x3f
net.(*ioSrv).ExecIO(0xc0420fa698, 0xc0421cef00, 0xe91b9f, 0x9, 0xf40140, 0x4, 0x4, 0x765f)
        C:/go/src/net/fd_windows.go:196 +0xfe
net.(*netFD).writeTo(0xc0421cee00, 0xc04697a1a0, 0x8b, 0xc3, 0x1481420, 0xc042bd6b40, 0x0, 0x0, 0x0)
        C:/go/src/net/fd_windows.go:561 +0x17f
created by github.com/ethereum/go-ethereum/p2p/discover.(*Table).bondall
        C:/gopath/src/github.com/ethereum/go-ethereum/p2p/discover/table.go:432 +0xac

It, of course, stops when it hits the bad return PC from writeTo (since it's just printing the traceback, it moves on instead of panicking).

@alexbrainman

This comment has been minimized.

Member

alexbrainman commented Jun 30, 2017

here's the trace of the goroutine with the bad return PC:

FYI. To muddy the waters even more

548 func (fd *netFD) writeTo(buf []byte, sa syscall.Sockaddr) (int, error) {
549         if len(buf) == 0 {
550                 return 0, nil
551         }
552         if err := fd.writeLock(); err != nil {
553                 return 0, err
554         }
555         defer fd.writeUnlock()
556         o := &fd.wop
557         o.InitBuf(buf)
558         o.sa = sa
559         n, err := wsrv.ExecIO(o, "WSASendto", func(o *operation) error {
560                 return syscall.WSASendto(o.fd.sysfd, &o.buf, 1, &o.qty, 0, o.sa, &o.o, nil)
561         })
562         if _, ok := err.(syscall.Errno); ok {
563                 err = os.NewSyscallError("wsasendto", err)
564         }
565         return n, err
566 }

Normal windows syscalls use memory passed to it as parameters during the call only. But syscall.WSASendto here just makes record of "&o.buf" and "&o.o" and returns immediately. Then Windows actually uses memory later to send its contents to the network. Windows signals when it finishes via GetQueuedCompletionStatusEx. So the "o.buf" and "o.o" should not be moved until after line 561. So, hopefully, they are allocated on the heap.

Alex

@aclements

This comment has been minimized.

Member

aclements commented Jun 30, 2017

@alexbrainman, good call. However, while we're clearly skating on thin ice, it appears to be okay right now:

net/fd_windows.go:548: leaking param: fd
net/fd_windows.go:548:  from fd (passed to call[argument escapes]) at net/fd_windows.go:552
@aclements

This comment has been minimized.

Member

aclements commented Jun 30, 2017

Got an update with GOTRACEBACK=system on the original thread. The full traceback is in ethereum/go-ethereum#14725 (comment).

This time it's a different, though clearly related failure. It looks like netFD.writeTo is actually attempting to return and crashing. Though, oddly, while the crashing PC is a RET, the faulting PC looks to be valid. If something clobbered the saved return PC, I would have expected for the faulting address and PC to be the same.

unexpected fault address 0xffffffffffffffff
fatal error: fault
[signal 0xc0000005 code=0x0 addr=0xffffffffffffffff pc=0x5cb4f6]

goroutine 7117 [running]:
runtime.throw(0xe8c58f, 0x5)
        C:/go/src/runtime/panic.go:596 +0x9c fp=0xc042c83860 sp=0xc042c83840
runtime.sigpanic()
        C:/go/src/runtime/signal_windows.go:164 +0xff fp=0xc042c83890 sp=0xc042c83860
net.(*netFD).writeTo(0xc042221dc0, 0xc042468340, 0x8b, 0xc3, 0x1481420, 0xc04908e4c0, 0x8b, 0x0, 0x0)
        C:/go/src/net/fd_windows.go:565 +0x1e6 fp=0xc042c83898 sp=0xc042c83890
created by github.com/ethereum/go-ethereum/p2p/discover.(*Table).bondall
        C:/gopath/src/github.com/ethereum/go-ethereum/p2p/discover/table.go:432 +0xac

Leading up to fd_windows.go:565 +0x1e6:

        0x01bf 00447 (.../net/fd_windows.go:565)     MOVQ    CX, "".~r2+176(FP)
        0x01c7 00455 (.../net/fd_windows.go:565)     MOVQ    AX, "".~r3+184(FP)
        0x01cf 00463 (.../net/fd_windows.go:565)     MOVQ    DX, "".~r3+192(FP)
        0x01d7 00471 (.../net/fd_windows.go:565)     PCDATA  $0, $9
        0x01d7 00471 (.../net/fd_windows.go:565)     XCHGL   AX, AX
        0x01d8 00472 (.../net/fd_windows.go:565)     CALL    runtime.deferreturn(SB)
        0x01dd 00477 (.../net/fd_windows.go:565)     MOVQ    112(SP), BP
        0x01e2 00482 (.../net/fd_windows.go:565)     ADDQ    $120, SP
        0x01e6 00486 (.../net/fd_windows.go:565)     RET
@ianlancetaylor

This comment has been minimized.

Contributor

ianlancetaylor commented Aug 3, 2017

Too late for 1.9.

@ianlancetaylor ianlancetaylor modified the milestones: Go1.10, Go1.9Maybe Aug 3, 2017

@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017

@ianlancetaylor ianlancetaylor modified the milestones: Go1.11, Go1.10 Nov 22, 2017

@aclements

This comment has been minimized.

Member

aclements commented Nov 28, 2017

I'm not making much progress here, but a few observations:

  1. In the original crash we're shrinking the stack with the net.(*netFD).writeTo call. We crash when walking the new stack to adjust pointers, so the return IP is already corrupted by this point. However, we don't know if it was corrupted by the stack copy, or already corrupted on the original stack (we've already installed the new stack at this point, so the panic traceback is also walking the new stack). I imagine net.(*netFD).writeTo isn't on the stack for very long, which would suggest that the stack copy itself did the corruption, but the second crash suggests that it's not the stack copy.

  2. In the original crash (the one that happened while shrinking net.(*netFD).writeTo's stack), the return PC is 0x80002000720013. That looks an awful lot like an array of small uint16s: [0x13, 0x72, 0x20, 0x80]. Probably not UTF-16 text given the 0x80. Unfortunately, in the second crash it's completely unclear what the corrupted value is, or how we can even crash in that manner on the RET (unless Windows is doing something really fancy like detecting that we just returned to a bad PC and using the branch trace buffer to back us up to a useful PC.. but then failing to tell us the bad PC).

  3. @alexbrainman, I was looking at the docs for WSASendTo and noticed they say "Use NULL for lpNumberOfBytesSent if the lpOverlapped parameter is not NULL to avoid potentially erroneous results." I'm not sure what the "erroneous results" could be, but it looks like we're not following this advice here:

return syscall.WSASendto(o.fd.sysfd, &o.buf, 1, &o.qty, 0, o.sa, &o.o, nil)
@alexbrainman

This comment has been minimized.

Member

alexbrainman commented Nov 28, 2017

@alexbrainman, I was looking at the docs for WSASendTo and noticed they say "Use NULL for lpNumberOfBytesSent if the lpOverlapped parameter is not NULL to avoid potentially erroneous results." I'm not sure what the "erroneous results" could be, but it looks like we're not following this advice here:

return syscall.WSASendto(o.fd.sysfd, &o.buf, 1, &o.qty, 0, o.sa, &o.o, nil)

I don't see how it is possible to use WSASendTo with lpNumberOfBytesSent set to NULL. From https://msdn.microsoft.com/en-us/library/windows/desktop/ms741693(v=vs.85).aspx - "... Overlapped Socket I/O
If an overlapped operation completes immediately, WSASendTo returns a value of zero and the lpNumberOfBytesSent parameter is updated with the number of bytes sent. If the overlapped operation is successfully initiated and will complete later, WSASendTo returns SOCKET_ERROR and indicates error code WSA_IO_PENDING. In this case, lpNumberOfBytesSent is not updated. When the overlapped operation completes the amount of data transferred is indicated either through the cbTransferred parameter in the completion routine (if specified), or through the lpcbTransfer parameter in WSAGetOverlappedResult. ..." Also see "Example Code" at the end of the doco - it uses lpNumberOfBytesSent just as we do.

Alex

@gopherbot

This comment has been minimized.

gopherbot commented Dec 28, 2017

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

@os2tux

This comment has been minimized.

os2tux commented Apr 30, 2018

Not sure if this right place to put this and my apologies if not. I'm seeing a similar error on my machine anytime I attempt to execute "go run". Specifically, the error I'm seeing is runtime: unexpected return pc for runtime.goexit called from 0x0. If this is something that would be helpful I'm happy to provide whatever details you'd like. Thanks!

@ianlancetaylor

This comment has been minimized.

Contributor

ianlancetaylor commented Apr 30, 2018

@jcshelby That error can have many different root causes. If you're willing I would encourage you to open a new issue about it, ideally with instructions for how to reproduce the problem. Thanks.

@os2tux

This comment has been minimized.

os2tux commented May 3, 2018

@gopherbot gopherbot removed this from the Go1.11 milestone May 23, 2018

@gopherbot gopherbot added this to the Unplanned milestone May 23, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment