Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: fatal error: unknown caller pc #20846

Closed
karalabe opened this issue Jun 29, 2017 · 24 comments
Closed

runtime: fatal error: unknown caller pc #20846

karalabe opened this issue Jun 29, 2017 · 24 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@karalabe
Copy link
Contributor

We've received a GC crash report from a Windows 10 user, running our Ethereum client built with Go 1.8.3 (amd64). We've never seen this error ever, so I doubt we can make a repro out of it, but here's the crash logs if you can make heads or tails of it (you can find the full crash dump at ethereum/go-ethereum#14725):

runtime: unexpected return pc for net.(*netFD).writeTo called from 0x80002000720013
fatal error: unknown caller pc

runtime stack:
runtime.throw(0xea8762, 0x11)
        C:/go/src/runtime/panic.go:596 +0x9c
runtime.gentraceback(0xffffffffffffffff, 0xc044903698, 0x0, 0xc0440544e0, 0x0, 0x0, 0x7fffffff, 0xf405e8, 0x45c2f978, 0x0, ...)
        C:/go/src/runtime/traceback.go:317 +0x13f8
runtime.copystack(0xc0440544e0, 0x2000, 0x0)
        C:/go/src/runtime/stack.go:932 +0x377
runtime.shrinkstack(0xc0440544e0)
        C:/go/src/runtime/stack.go:1186 +0x18c
runtime.scanstack(0xc0440544e0, 0xc042015228)
        C:/go/src/runtime/mgcmark.go:762 +0x386
runtime.scang(0xc0440544e0, 0xc042015228)
        C:/go/src/runtime/proc.go:866 +0x210
runtime.markroot.func1()
        C:/go/src/runtime/mgcmark.go:264 +0x7b
runtime.systemstack(0x45c2fda0)
        C:/go/src/runtime/asm_amd64.s:343 +0xb5
runtime.markroot(0xc042015228, 0xc00000006b)
        C:/go/src/runtime/mgcmark.go:269 +0x1e9
runtime.gcDrain(0xc042015228, 0x6)
        C:/go/src/runtime/mgcmark.go:1051 +0xcc
runtime.gcBgMarkWorker.func2()
        C:/go/src/runtime/mgc.go:1530 +0x101
runtime.systemstack(0xc0429f9700)
        C:/go/src/runtime/asm_amd64.s:327 +0x7e
runtime.mstart()
        C:/go/src/runtime/proc.go:1132

goroutine 19 [GC worker (idle)]:
runtime.systemstack_switch()
        C:/go/src/runtime/asm_amd64.s:281 fp=0xc042439f50 sp=0xc042439f48
runtime.gcBgMarkWorker(0xc042014000)
        C:/go/src/runtime/mgc.go:1537 +0x218 fp=0xc042439fd8 sp=0xc042439f50
runtime.goexit()
        C:/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc042439fe0 sp=0xc042439fd8
created by runtime.gcBgMarkStartWorkers
        C:/go/src/runtime/mgc.go:1412 +0x9f
@bradfitz
Copy link
Contributor

@aclements?

@bradfitz bradfitz added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jun 29, 2017
@bradfitz bradfitz added this to the Go1.9Maybe milestone Jun 29, 2017
@ianlancetaylor
Copy link
Contributor

The only slightly odd thing I see in the full backtrace is

goroutine 84 [runnable, locked to thread]:
syscall.Syscall6(0x7ffcc6d84ab0, 0x4, 0x54, 0xc044539884,                 0x1, 0xc044539780, 0x0, 0x0, 0x1, 0x0, ...)
        C:/go/src/runtime/syscall_windows.go:174 +0x6b
syscall.(*Proc).Call(0xc0440062a0, 0xc042dd2180, 0x4, 0x4,                 0x20, 0xd36820, 0x1483201, 0xc042dd2180)
        C:/go/src/syscall/dll_windows.go:153 +0x4dc
syscall.(*LazyProc).Call(0xc0421d1e60, 0xc042dd2180, 0x4,                 0x4, 0x1, 0x0, 0x14832e0, 0xc0423560b0)
        C:/go/src/syscall/dll_windows.go:303 +0x66
github.com/ethereum/go-ethereum/vendor/github.com/peterh/l                iner.(*State).readNext(0xc0421daa80, 0xc044539ae0, 0x2, 0x                20, 0xc044539a60)
        C:/gopath/src/github.com/ethereum/go-ethereum/vend                or/github.com/peterh/liner/input_windows.go:178 +0x120

Why is a goroutine running in syscall6 marked as runnable?

We should probably force GOTRACEBACK=system for this crash, so that we can see the runtime functions.

In fact, since this problem seems somewhat repeatable, can we see a stack trace with GOTRACEBACK=system set in the environment?

@aclements
Copy link
Member

I asked for the GOTRACEBACK=system traceback on the original issue.

For reference, here's the trace of the goroutine with the bad return PC:

goroutine 140524 [runnable (scan)]:
net.runtime_pollWait(0x4c442b8, 0x77, 0x0)
        C:/go/src/runtime/netpoll.go:164 +0x60
net.(*pollDesc).wait(0xc0421cefa0, 0x77, 0x0, 0x0)
        C:/go/src/net/fd_poll_runtime.go:75 +0x3f
net.(*ioSrv).ExecIO(0xc0420fa698, 0xc0421cef00, 0xe91b9f, 0x9, 0xf40140, 0x4, 0x4, 0x765f)
        C:/go/src/net/fd_windows.go:196 +0xfe
net.(*netFD).writeTo(0xc0421cee00, 0xc04697a1a0, 0x8b, 0xc3, 0x1481420, 0xc042bd6b40, 0x0, 0x0, 0x0)
        C:/go/src/net/fd_windows.go:561 +0x17f
created by github.com/ethereum/go-ethereum/p2p/discover.(*Table).bondall
        C:/gopath/src/github.com/ethereum/go-ethereum/p2p/discover/table.go:432 +0xac

It, of course, stops when it hits the bad return PC from writeTo (since it's just printing the traceback, it moves on instead of panicking).

@alexbrainman
Copy link
Member

here's the trace of the goroutine with the bad return PC:

FYI. To muddy the waters even more

548 func (fd *netFD) writeTo(buf []byte, sa syscall.Sockaddr) (int, error) {
549         if len(buf) == 0 {
550                 return 0, nil
551         }
552         if err := fd.writeLock(); err != nil {
553                 return 0, err
554         }
555         defer fd.writeUnlock()
556         o := &fd.wop
557         o.InitBuf(buf)
558         o.sa = sa
559         n, err := wsrv.ExecIO(o, "WSASendto", func(o *operation) error {
560                 return syscall.WSASendto(o.fd.sysfd, &o.buf, 1, &o.qty, 0, o.sa, &o.o, nil)
561         })
562         if _, ok := err.(syscall.Errno); ok {
563                 err = os.NewSyscallError("wsasendto", err)
564         }
565         return n, err
566 }

Normal windows syscalls use memory passed to it as parameters during the call only. But syscall.WSASendto here just makes record of "&o.buf" and "&o.o" and returns immediately. Then Windows actually uses memory later to send its contents to the network. Windows signals when it finishes via GetQueuedCompletionStatusEx. So the "o.buf" and "o.o" should not be moved until after line 561. So, hopefully, they are allocated on the heap.

Alex

@aclements
Copy link
Member

@alexbrainman, good call. However, while we're clearly skating on thin ice, it appears to be okay right now:

net/fd_windows.go:548: leaking param: fd
net/fd_windows.go:548:  from fd (passed to call[argument escapes]) at net/fd_windows.go:552

@aclements
Copy link
Member

Got an update with GOTRACEBACK=system on the original thread. The full traceback is in ethereum/go-ethereum#14725 (comment).

This time it's a different, though clearly related failure. It looks like netFD.writeTo is actually attempting to return and crashing. Though, oddly, while the crashing PC is a RET, the faulting PC looks to be valid. If something clobbered the saved return PC, I would have expected for the faulting address and PC to be the same.

unexpected fault address 0xffffffffffffffff
fatal error: fault
[signal 0xc0000005 code=0x0 addr=0xffffffffffffffff pc=0x5cb4f6]

goroutine 7117 [running]:
runtime.throw(0xe8c58f, 0x5)
        C:/go/src/runtime/panic.go:596 +0x9c fp=0xc042c83860 sp=0xc042c83840
runtime.sigpanic()
        C:/go/src/runtime/signal_windows.go:164 +0xff fp=0xc042c83890 sp=0xc042c83860
net.(*netFD).writeTo(0xc042221dc0, 0xc042468340, 0x8b, 0xc3, 0x1481420, 0xc04908e4c0, 0x8b, 0x0, 0x0)
        C:/go/src/net/fd_windows.go:565 +0x1e6 fp=0xc042c83898 sp=0xc042c83890
created by github.com/ethereum/go-ethereum/p2p/discover.(*Table).bondall
        C:/gopath/src/github.com/ethereum/go-ethereum/p2p/discover/table.go:432 +0xac

Leading up to fd_windows.go:565 +0x1e6:

        0x01bf 00447 (.../net/fd_windows.go:565)     MOVQ    CX, "".~r2+176(FP)
        0x01c7 00455 (.../net/fd_windows.go:565)     MOVQ    AX, "".~r3+184(FP)
        0x01cf 00463 (.../net/fd_windows.go:565)     MOVQ    DX, "".~r3+192(FP)
        0x01d7 00471 (.../net/fd_windows.go:565)     PCDATA  $0, $9
        0x01d7 00471 (.../net/fd_windows.go:565)     XCHGL   AX, AX
        0x01d8 00472 (.../net/fd_windows.go:565)     CALL    runtime.deferreturn(SB)
        0x01dd 00477 (.../net/fd_windows.go:565)     MOVQ    112(SP), BP
        0x01e2 00482 (.../net/fd_windows.go:565)     ADDQ    $120, SP
        0x01e6 00486 (.../net/fd_windows.go:565)     RET

@ianlancetaylor
Copy link
Contributor

Too late for 1.9.

@ianlancetaylor ianlancetaylor modified the milestones: Go1.10, Go1.9Maybe Aug 3, 2017
@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017
@ianlancetaylor ianlancetaylor modified the milestones: Go1.11, Go1.10 Nov 22, 2017
@aclements aclements added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Nov 28, 2017
@aclements
Copy link
Member

I'm not making much progress here, but a few observations:

  1. In the original crash we're shrinking the stack with the net.(*netFD).writeTo call. We crash when walking the new stack to adjust pointers, so the return IP is already corrupted by this point. However, we don't know if it was corrupted by the stack copy, or already corrupted on the original stack (we've already installed the new stack at this point, so the panic traceback is also walking the new stack). I imagine net.(*netFD).writeTo isn't on the stack for very long, which would suggest that the stack copy itself did the corruption, but the second crash suggests that it's not the stack copy.

  2. In the original crash (the one that happened while shrinking net.(*netFD).writeTo's stack), the return PC is 0x80002000720013. That looks an awful lot like an array of small uint16s: [0x13, 0x72, 0x20, 0x80]. Probably not UTF-16 text given the 0x80. Unfortunately, in the second crash it's completely unclear what the corrupted value is, or how we can even crash in that manner on the RET (unless Windows is doing something really fancy like detecting that we just returned to a bad PC and using the branch trace buffer to back us up to a useful PC.. but then failing to tell us the bad PC).

  3. @alexbrainman, I was looking at the docs for WSASendTo and noticed they say "Use NULL for lpNumberOfBytesSent if the lpOverlapped parameter is not NULL to avoid potentially erroneous results." I'm not sure what the "erroneous results" could be, but it looks like we're not following this advice here:

return syscall.WSASendto(o.fd.sysfd, &o.buf, 1, &o.qty, 0, o.sa, &o.o, nil)

@alexbrainman
Copy link
Member

@alexbrainman, I was looking at the docs for WSASendTo and noticed they say "Use NULL for lpNumberOfBytesSent if the lpOverlapped parameter is not NULL to avoid potentially erroneous results." I'm not sure what the "erroneous results" could be, but it looks like we're not following this advice here:

return syscall.WSASendto(o.fd.sysfd, &o.buf, 1, &o.qty, 0, o.sa, &o.o, nil)

I don't see how it is possible to use WSASendTo with lpNumberOfBytesSent set to NULL. From https://msdn.microsoft.com/en-us/library/windows/desktop/ms741693(v=vs.85).aspx - "... Overlapped Socket I/O
If an overlapped operation completes immediately, WSASendTo returns a value of zero and the lpNumberOfBytesSent parameter is updated with the number of bytes sent. If the overlapped operation is successfully initiated and will complete later, WSASendTo returns SOCKET_ERROR and indicates error code WSA_IO_PENDING. In this case, lpNumberOfBytesSent is not updated. When the overlapped operation completes the amount of data transferred is indicated either through the cbTransferred parameter in the completion routine (if specified), or through the lpcbTransfer parameter in WSAGetOverlappedResult. ..." Also see "Example Code" at the end of the doco - it uses lpNumberOfBytesSent just as we do.

Alex

@gopherbot
Copy link
Contributor

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

@bradfitz bradfitz removed the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Dec 29, 2017
@bradfitz bradfitz reopened this Dec 29, 2017
@ianlancetaylor ianlancetaylor modified the milestones: Go1.10, Go1.11 Jan 3, 2018
@os2tux
Copy link

os2tux commented Apr 30, 2018

Not sure if this right place to put this and my apologies if not. I'm seeing a similar error on my machine anytime I attempt to execute "go run". Specifically, the error I'm seeing is runtime: unexpected return pc for runtime.goexit called from 0x0. If this is something that would be helpful I'm happy to provide whatever details you'd like. Thanks!

@ianlancetaylor
Copy link
Contributor

@JCShelby That error can have many different root causes. If you're willing I would encourage you to open a new issue about it, ideally with instructions for how to reproduce the problem. Thanks.

@os2tux
Copy link

os2tux commented May 3, 2018 via email

@gopherbot gopherbot removed this from the Go1.11 milestone May 23, 2018
@gopherbot gopherbot added this to the Unplanned milestone May 23, 2018
@agnivade
Copy link
Contributor

Circling back on this. The original issue was closed by OP. @karalabe - have you seen any more of these crashes with newer Go versions like 1.12 ?

@karalabe
Copy link
Contributor Author

Haven't seen this since then, so I guess something must have fixed it either on our side or Go side.

@bradfitz
Copy link
Contributor

I'm going to close this then. Feel free to re-ping/re-open it if the problem comes back.

@klauspost
Copy link
Contributor

klauspost commented Oct 12, 2019

Just got a crash from "fuzzit.dev" where I am running continuous fuzz testing of my compression packages. Go 1.12.10 was used for building.

There is no assembly or "unsafe" involved so there shouldn't be any reasonable way for memory corruption. This fuzzing is also strictly running in a single goroutine, so races also seems unlikely.

That said I have no idea about the hardware stability of the servers running the tests.

Also, a lot of new code has just been added here, so there is a chance of something bad, though I don't know how I would be able to trigger this error

(crash removed, see next comment)

I will see if it reproduces before pinging anyone :)

@klauspost
Copy link
Contributor

klauspost commented Oct 24, 2019

I now have a second crash 10 days later (see above). The tested binary is the same.

Both crashes are here: https://gist.github.com/klauspost/d4ec7bd6ecefa1bec56dd8ca4ac8ec39

It is completely different functions that were pre-empted (flate.(*fastGen).matchlenLong) vs. flate.(*decompressor).Read - completely different code. Both crashes were in mgcmark.go:711. I will upgrade the fuzz test to Go 1.13.

@klauspost
Copy link
Contributor

klauspost commented Oct 25, 2019

@bradfitz Just had a Fuzz crash on Go 1.13.3 - https://gist.github.com/klauspost/d4ec7bd6ecefa1bec56dd8ca4ac8ec39#file-orgs_klauspost_targets_compress-flate_jobs_k0lulnasftbj78a8iqxf_log-txt

So 3 crashes in total in the same fuzz test, but otherwise seemingly unrelated.

@agnivade
Copy link
Contributor

This looks slightly different. The original issue had problems originating from the syscall in net.(*netFD).writeTo. This seems to bubble up from runtime.gcBgMarkWorker and runtime.gopark.

Perhaps you want to open a new issue ?

@bradfitz
Copy link
Contributor

@klauspost, can you file a new bug with that?

@agnivade
Copy link
Contributor

Gah .. crossed responses.

@randall77
Copy link
Contributor

This looks like the stack has been trashed somehow.
Not only the return address for gopark. gopark's arguments also look trashed. The gcBgMarkWorker failure looks similar, hard to tell for sure if its args are trashed as it has only one arg.

Not sure what might cause this. Could be misuse of unsafe, could be runtime bug (use after free of stack memory?).

It would be really nice to have something more repeatable.

@klauspost
Copy link
Contributor

Added #35158

@randall77 This does not use unsafe and the only assembler would be in the stdlib. Imports here: https://godoc.org/github.com/klauspost/compress/flate?imports

"sync" is only used for a sync.Once and there are no goroutines, so no races should happen either.

Fuzz test imports: https://godoc.org/github.com/klauspost/compress-fuzz/flate?imports

@golang golang locked and limited conversation to collaborators Oct 24, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests