runtime: fatal error: unknown caller pc #20846

karalabe · 2017-06-29T08:05:27Z

We've received a GC crash report from a Windows 10 user, running our Ethereum client built with Go 1.8.3 (amd64). We've never seen this error ever, so I doubt we can make a repro out of it, but here's the crash logs if you can make heads or tails of it (you can find the full crash dump at ethereum/go-ethereum#14725):

runtime: unexpected return pc for net.(*netFD).writeTo called from 0x80002000720013
fatal error: unknown caller pc

runtime stack:
runtime.throw(0xea8762, 0x11)
        C:/go/src/runtime/panic.go:596 +0x9c
runtime.gentraceback(0xffffffffffffffff, 0xc044903698, 0x0, 0xc0440544e0, 0x0, 0x0, 0x7fffffff, 0xf405e8, 0x45c2f978, 0x0, ...)
        C:/go/src/runtime/traceback.go:317 +0x13f8
runtime.copystack(0xc0440544e0, 0x2000, 0x0)
        C:/go/src/runtime/stack.go:932 +0x377
runtime.shrinkstack(0xc0440544e0)
        C:/go/src/runtime/stack.go:1186 +0x18c
runtime.scanstack(0xc0440544e0, 0xc042015228)
        C:/go/src/runtime/mgcmark.go:762 +0x386
runtime.scang(0xc0440544e0, 0xc042015228)
        C:/go/src/runtime/proc.go:866 +0x210
runtime.markroot.func1()
        C:/go/src/runtime/mgcmark.go:264 +0x7b
runtime.systemstack(0x45c2fda0)
        C:/go/src/runtime/asm_amd64.s:343 +0xb5
runtime.markroot(0xc042015228, 0xc00000006b)
        C:/go/src/runtime/mgcmark.go:269 +0x1e9
runtime.gcDrain(0xc042015228, 0x6)
        C:/go/src/runtime/mgcmark.go:1051 +0xcc
runtime.gcBgMarkWorker.func2()
        C:/go/src/runtime/mgc.go:1530 +0x101
runtime.systemstack(0xc0429f9700)
        C:/go/src/runtime/asm_amd64.s:327 +0x7e
runtime.mstart()
        C:/go/src/runtime/proc.go:1132

goroutine 19 [GC worker (idle)]:
runtime.systemstack_switch()
        C:/go/src/runtime/asm_amd64.s:281 fp=0xc042439f50 sp=0xc042439f48
runtime.gcBgMarkWorker(0xc042014000)
        C:/go/src/runtime/mgc.go:1537 +0x218 fp=0xc042439fd8 sp=0xc042439f50
runtime.goexit()
        C:/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc042439fe0 sp=0xc042439fd8
created by runtime.gcBgMarkStartWorkers
        C:/go/src/runtime/mgc.go:1412 +0x9f

The text was updated successfully, but these errors were encountered:

bradfitz · 2017-06-29T15:17:02Z

@aclements?

ianlancetaylor · 2017-06-29T15:37:42Z

The only slightly odd thing I see in the full backtrace is

goroutine 84 [runnable, locked to thread]:
syscall.Syscall6(0x7ffcc6d84ab0, 0x4, 0x54, 0xc044539884,                 0x1, 0xc044539780, 0x0, 0x0, 0x1, 0x0, ...)
        C:/go/src/runtime/syscall_windows.go:174 +0x6b
syscall.(*Proc).Call(0xc0440062a0, 0xc042dd2180, 0x4, 0x4,                 0x20, 0xd36820, 0x1483201, 0xc042dd2180)
        C:/go/src/syscall/dll_windows.go:153 +0x4dc
syscall.(*LazyProc).Call(0xc0421d1e60, 0xc042dd2180, 0x4,                 0x4, 0x1, 0x0, 0x14832e0, 0xc0423560b0)
        C:/go/src/syscall/dll_windows.go:303 +0x66
github.com/ethereum/go-ethereum/vendor/github.com/peterh/l                iner.(*State).readNext(0xc0421daa80, 0xc044539ae0, 0x2, 0x                20, 0xc044539a60)
        C:/gopath/src/github.com/ethereum/go-ethereum/vend                or/github.com/peterh/liner/input_windows.go:178 +0x120

Why is a goroutine running in syscall6 marked as runnable?

We should probably force GOTRACEBACK=system for this crash, so that we can see the runtime functions.

In fact, since this problem seems somewhat repeatable, can we see a stack trace with GOTRACEBACK=system set in the environment?

aclements · 2017-06-29T16:25:26Z

I asked for the GOTRACEBACK=system traceback on the original issue.

For reference, here's the trace of the goroutine with the bad return PC:

goroutine 140524 [runnable (scan)]:
net.runtime_pollWait(0x4c442b8, 0x77, 0x0)
        C:/go/src/runtime/netpoll.go:164 +0x60
net.(*pollDesc).wait(0xc0421cefa0, 0x77, 0x0, 0x0)
        C:/go/src/net/fd_poll_runtime.go:75 +0x3f
net.(*ioSrv).ExecIO(0xc0420fa698, 0xc0421cef00, 0xe91b9f, 0x9, 0xf40140, 0x4, 0x4, 0x765f)
        C:/go/src/net/fd_windows.go:196 +0xfe
net.(*netFD).writeTo(0xc0421cee00, 0xc04697a1a0, 0x8b, 0xc3, 0x1481420, 0xc042bd6b40, 0x0, 0x0, 0x0)
        C:/go/src/net/fd_windows.go:561 +0x17f
created by github.com/ethereum/go-ethereum/p2p/discover.(*Table).bondall
        C:/gopath/src/github.com/ethereum/go-ethereum/p2p/discover/table.go:432 +0xac

It, of course, stops when it hits the bad return PC from writeTo (since it's just printing the traceback, it moves on instead of panicking).

alexbrainman · 2017-06-30T01:37:40Z

here's the trace of the goroutine with the bad return PC:

FYI. To muddy the waters even more

548 func (fd *netFD) writeTo(buf []byte, sa syscall.Sockaddr) (int, error) {
549         if len(buf) == 0 {
550                 return 0, nil
551         }
552         if err := fd.writeLock(); err != nil {
553                 return 0, err
554         }
555         defer fd.writeUnlock()
556         o := &fd.wop
557         o.InitBuf(buf)
558         o.sa = sa
559         n, err := wsrv.ExecIO(o, "WSASendto", func(o *operation) error {
560                 return syscall.WSASendto(o.fd.sysfd, &o.buf, 1, &o.qty, 0, o.sa, &o.o, nil)
561         })
562         if _, ok := err.(syscall.Errno); ok {
563                 err = os.NewSyscallError("wsasendto", err)
564         }
565         return n, err
566 }

Normal windows syscalls use memory passed to it as parameters during the call only. But syscall.WSASendto here just makes record of "&o.buf" and "&o.o" and returns immediately. Then Windows actually uses memory later to send its contents to the network. Windows signals when it finishes via GetQueuedCompletionStatusEx. So the "o.buf" and "o.o" should not be moved until after line 561. So, hopefully, they are allocated on the heap.

Alex

aclements · 2017-06-30T14:18:21Z

@alexbrainman, good call. However, while we're clearly skating on thin ice, it appears to be okay right now:

net/fd_windows.go:548: leaking param: fd
net/fd_windows.go:548:  from fd (passed to call[argument escapes]) at net/fd_windows.go:552

aclements · 2017-06-30T14:27:21Z

Got an update with GOTRACEBACK=system on the original thread. The full traceback is in ethereum/go-ethereum#14725 (comment).

This time it's a different, though clearly related failure. It looks like netFD.writeTo is actually attempting to return and crashing. Though, oddly, while the crashing PC is a RET, the faulting PC looks to be valid. If something clobbered the saved return PC, I would have expected for the faulting address and PC to be the same.

unexpected fault address 0xffffffffffffffff
fatal error: fault
[signal 0xc0000005 code=0x0 addr=0xffffffffffffffff pc=0x5cb4f6]

goroutine 7117 [running]:
runtime.throw(0xe8c58f, 0x5)
        C:/go/src/runtime/panic.go:596 +0x9c fp=0xc042c83860 sp=0xc042c83840
runtime.sigpanic()
        C:/go/src/runtime/signal_windows.go:164 +0xff fp=0xc042c83890 sp=0xc042c83860
net.(*netFD).writeTo(0xc042221dc0, 0xc042468340, 0x8b, 0xc3, 0x1481420, 0xc04908e4c0, 0x8b, 0x0, 0x0)
        C:/go/src/net/fd_windows.go:565 +0x1e6 fp=0xc042c83898 sp=0xc042c83890
created by github.com/ethereum/go-ethereum/p2p/discover.(*Table).bondall
        C:/gopath/src/github.com/ethereum/go-ethereum/p2p/discover/table.go:432 +0xac

Leading up to fd_windows.go:565 +0x1e6:

        0x01bf 00447 (.../net/fd_windows.go:565)     MOVQ    CX, "".~r2+176(FP)
        0x01c7 00455 (.../net/fd_windows.go:565)     MOVQ    AX, "".~r3+184(FP)
        0x01cf 00463 (.../net/fd_windows.go:565)     MOVQ    DX, "".~r3+192(FP)
        0x01d7 00471 (.../net/fd_windows.go:565)     PCDATA  $0, $9
        0x01d7 00471 (.../net/fd_windows.go:565)     XCHGL   AX, AX
        0x01d8 00472 (.../net/fd_windows.go:565)     CALL    runtime.deferreturn(SB)
        0x01dd 00477 (.../net/fd_windows.go:565)     MOVQ    112(SP), BP
        0x01e2 00482 (.../net/fd_windows.go:565)     ADDQ    $120, SP
        0x01e6 00486 (.../net/fd_windows.go:565)     RET

ianlancetaylor · 2017-08-03T18:31:31Z

Too late for 1.9.

aclements · 2017-11-28T21:06:36Z

I'm not making much progress here, but a few observations:

In the original crash we're shrinking the stack with the net.(*netFD).writeTo call. We crash when walking the new stack to adjust pointers, so the return IP is already corrupted by this point. However, we don't know if it was corrupted by the stack copy, or already corrupted on the original stack (we've already installed the new stack at this point, so the panic traceback is also walking the new stack). I imagine net.(*netFD).writeTo isn't on the stack for very long, which would suggest that the stack copy itself did the corruption, but the second crash suggests that it's not the stack copy.
In the original crash (the one that happened while shrinking net.(*netFD).writeTo's stack), the return PC is 0x80002000720013. That looks an awful lot like an array of small uint16s: [0x13, 0x72, 0x20, 0x80]. Probably not UTF-16 text given the 0x80. Unfortunately, in the second crash it's completely unclear what the corrupted value is, or how we can even crash in that manner on the RET (unless Windows is doing something really fancy like detecting that we just returned to a bad PC and using the branch trace buffer to back us up to a useful PC.. but then failing to tell us the bad PC).
@alexbrainman, I was looking at the docs for WSASendTo and noticed they say "Use NULL for lpNumberOfBytesSent if the lpOverlapped parameter is not NULL to avoid potentially erroneous results." I'm not sure what the "erroneous results" could be, but it looks like we're not following this advice here:

return syscall.WSASendto(o.fd.sysfd, &o.buf, 1, &o.qty, 0, o.sa, &o.o, nil)

alexbrainman · 2017-11-28T21:45:26Z

@alexbrainman, I was looking at the docs for WSASendTo and noticed they say "Use NULL for lpNumberOfBytesSent if the lpOverlapped parameter is not NULL to avoid potentially erroneous results." I'm not sure what the "erroneous results" could be, but it looks like we're not following this advice here:

return syscall.WSASendto(o.fd.sysfd, &o.buf, 1, &o.qty, 0, o.sa, &o.o, nil)

I don't see how it is possible to use WSASendTo with lpNumberOfBytesSent set to NULL. From https://msdn.microsoft.com/en-us/library/windows/desktop/ms741693(v=vs.85).aspx - "... Overlapped Socket I/O
If an overlapped operation completes immediately, WSASendTo returns a value of zero and the lpNumberOfBytesSent parameter is updated with the number of bytes sent. If the overlapped operation is successfully initiated and will complete later, WSASendTo returns SOCKET_ERROR and indicates error code WSA_IO_PENDING. In this case, lpNumberOfBytesSent is not updated. When the overlapped operation completes the amount of data transferred is indicated either through the cbTransferred parameter in the completion routine (if specified), or through the lpcbTransfer parameter in WSAGetOverlappedResult. ..." Also see "Example Code" at the end of the doco - it uses lpNumberOfBytesSent just as we do.

Alex

gopherbot · 2017-12-28T16:56:49Z

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

os2tux · 2018-04-30T16:56:58Z

Not sure if this right place to put this and my apologies if not. I'm seeing a similar error on my machine anytime I attempt to execute "go run". Specifically, the error I'm seeing is runtime: unexpected return pc for runtime.goexit called from 0x0. If this is something that would be helpful I'm happy to provide whatever details you'd like. Thanks!

ianlancetaylor · 2018-04-30T18:24:46Z

@JCShelby That error can have many different root causes. If you're willing I would encourage you to open a new issue about it, ideally with instructions for how to reproduce the problem. Thanks.

os2tux · 2018-05-03T01:14:29Z

Thanks for the feedback and apologies for any inconvenience. I actually ended up resolving this by reinstalling my environment, thinking it might have been a corrupted installation. That seems to have fixed things but if I see this happen again I’ll be sure to open an issue as directed. Thanks again for the help! Respectfully, Joseph Shelby Sr. Infrastructure Engineer, IT Platform Architecture zoom: https://asurion.zoom.us/my/josephshelby | slack: @JCShelby voice: 615.445.1240 | mobile: 615.210.6139 mail: Asurion I 648 Grassmere Park, Nashville, TN 37211 www.asurion.com<http://www.asurion.com/> I Twitter<https://twitter.com/Asurion> I Facebook<https://www.facebook.com/Asurion> From: Ian Lance Taylor [mailto:notifications@github.com] Sent: Monday, April 30, 2018 13:26 To: golang/go <go@noreply.github.com> Cc: Shelby, Joseph <Joseph.Shelby@asurion.com>; Mention <mention@noreply.github.com> Subject: Re: [golang/go] runtime: fatal error: unknown caller pc (#20846) @JCShelby<https://github.com/jcshelby> That error can have many different root causes. If you're willing I would encourage you to open a new issue about it, ideally with instructions for how to reproduce the problem. Thanks. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub<#20846 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AR1zoMRzloOcGD7Er-lq-AiTCuYse3jLks5tt1crgaJpZM4OI9NU>.

…

________________________________ This message (including any attachments) contains confidential and/or privileged information. It is intended for a specific individual and purpose and is protected by law. If you are not the intended recipient, please notify the sender immediately and delete this message. Any disclosure, copying, or distribution of this message, or the taking of any action based on it, is strictly prohibited.

agnivade · 2019-03-15T05:14:23Z

Circling back on this. The original issue was closed by OP. @karalabe - have you seen any more of these crashes with newer Go versions like 1.12 ?

karalabe · 2019-03-15T09:29:38Z

Haven't seen this since then, so I guess something must have fixed it either on our side or Go side.

bradfitz · 2019-03-15T14:00:09Z

I'm going to close this then. Feel free to re-ping/re-open it if the problem comes back.

klauspost · 2019-10-12T14:31:30Z

Just got a crash from "fuzzit.dev" where I am running continuous fuzz testing of my compression packages. Go 1.12.10 was used for building.

There is no assembly or "unsafe" involved so there shouldn't be any reasonable way for memory corruption. This fuzzing is also strictly running in a single goroutine, so races also seems unlikely.

That said I have no idea about the hardware stability of the servers running the tests.

Also, a lot of new code has just been added here, so there is a chance of something bad, though I don't know how I would be able to trigger this error

(crash removed, see next comment)

I will see if it reproduces before pinging anyone :)

klauspost · 2019-10-24T10:33:56Z

I now have a second crash 10 days later (see above). The tested binary is the same.

Both crashes are here: https://gist.github.com/klauspost/d4ec7bd6ecefa1bec56dd8ca4ac8ec39

It is completely different functions that were pre-empted (flate.(*fastGen).matchlenLong) vs. flate.(*decompressor).Read - completely different code. Both crashes were in mgcmark.go:711. I will upgrade the fuzz test to Go 1.13.

klauspost · 2019-10-25T16:30:08Z

@bradfitz Just had a Fuzz crash on Go 1.13.3 - https://gist.github.com/klauspost/d4ec7bd6ecefa1bec56dd8ca4ac8ec39#file-orgs_klauspost_targets_compress-flate_jobs_k0lulnasftbj78a8iqxf_log-txt

So 3 crashes in total in the same fuzz test, but otherwise seemingly unrelated.

agnivade · 2019-10-25T16:39:19Z

This looks slightly different. The original issue had problems originating from the syscall in net.(*netFD).writeTo. This seems to bubble up from runtime.gcBgMarkWorker and runtime.gopark.

Perhaps you want to open a new issue ?

bradfitz · 2019-10-25T16:39:26Z

@klauspost, can you file a new bug with that?

agnivade · 2019-10-25T16:40:34Z

Gah .. crossed responses.

randall77 · 2019-10-25T16:57:46Z

This looks like the stack has been trashed somehow.
Not only the return address for gopark. gopark's arguments also look trashed. The gcBgMarkWorker failure looks similar, hard to tell for sure if its args are trashed as it has only one arg.

Not sure what might cause this. Could be misuse of unsafe, could be runtime bug (use after free of stack memory?).

It would be really nice to have something more repeatable.

klauspost · 2019-10-25T17:07:36Z

Added #35158

@randall77 This does not use unsafe and the only assembler would be in the stdlib. Imports here: https://godoc.org/github.com/klauspost/compress/flate?imports

"sync" is only used for a sync.Once and there are no goroutines, so no races should happen either.

Fuzz test imports: https://godoc.org/github.com/klauspost/compress-fuzz/flate?imports

bradfitz added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jun 29, 2017

bradfitz added this to the Go1.9Maybe milestone Jun 29, 2017

ianlancetaylor modified the milestones: Go1.10, Go1.9Maybe Aug 3, 2017

rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017

ianlancetaylor modified the milestones: Go1.11, Go1.10 Nov 22, 2017

aclements added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Nov 28, 2017

mvdan mentioned this issue Dec 14, 2017

plugin: net/http.Get fatals with "runtime: unexpected return pc for runtime.goexit called" #23133

Closed

gopherbot closed this as completed Dec 28, 2017

bradfitz removed the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Dec 29, 2017

bradfitz reopened this Dec 29, 2017

ianlancetaylor modified the milestones: Go1.10, Go1.11 Jan 3, 2018

eliasnaur mentioned this issue Mar 6, 2018

runtime: "runtime: unexpected return pc for testing.tRunner called from 0x0" #24267

Open

gopherbot removed this from the Go1.11 milestone May 23, 2018

gopherbot added this to the Unplanned milestone May 23, 2018

bradfitz closed this as completed Mar 15, 2019

klauspost mentioned this issue Nov 26, 2019

runtime: fatal error: unknown caller pc (libfuzz) #35158

Open

golang locked and limited conversation to collaborators Oct 24, 2020

gopherbot added the FrozenDueToAge label Oct 24, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

runtime: fatal error: unknown caller pc #20846

runtime: fatal error: unknown caller pc #20846

karalabe commented Jun 29, 2017

bradfitz commented Jun 29, 2017

ianlancetaylor commented Jun 29, 2017

aclements commented Jun 29, 2017

alexbrainman commented Jun 30, 2017

aclements commented Jun 30, 2017

aclements commented Jun 30, 2017

ianlancetaylor commented Aug 3, 2017

aclements commented Nov 28, 2017

alexbrainman commented Nov 28, 2017

gopherbot commented Dec 28, 2017

os2tux commented Apr 30, 2018

ianlancetaylor commented Apr 30, 2018

os2tux commented May 3, 2018 via email

agnivade commented Mar 15, 2019

karalabe commented Mar 15, 2019

bradfitz commented Mar 15, 2019

klauspost commented Oct 12, 2019 •

edited

Loading

klauspost commented Oct 24, 2019 •

edited

Loading

klauspost commented Oct 25, 2019 •

edited

Loading

agnivade commented Oct 25, 2019

bradfitz commented Oct 25, 2019

agnivade commented Oct 25, 2019

randall77 commented Oct 25, 2019

klauspost commented Oct 25, 2019

runtime: fatal error: unknown caller pc #20846

runtime: fatal error: unknown caller pc #20846

Comments

karalabe commented Jun 29, 2017

bradfitz commented Jun 29, 2017

ianlancetaylor commented Jun 29, 2017

aclements commented Jun 29, 2017

alexbrainman commented Jun 30, 2017

aclements commented Jun 30, 2017

aclements commented Jun 30, 2017

ianlancetaylor commented Aug 3, 2017

aclements commented Nov 28, 2017

alexbrainman commented Nov 28, 2017

gopherbot commented Dec 28, 2017

os2tux commented Apr 30, 2018

ianlancetaylor commented Apr 30, 2018

os2tux commented May 3, 2018 via email

agnivade commented Mar 15, 2019

karalabe commented Mar 15, 2019

bradfitz commented Mar 15, 2019

klauspost commented Oct 12, 2019 • edited Loading

klauspost commented Oct 24, 2019 • edited Loading

klauspost commented Oct 25, 2019 • edited Loading

agnivade commented Oct 25, 2019

bradfitz commented Oct 25, 2019

agnivade commented Oct 25, 2019

randall77 commented Oct 25, 2019

klauspost commented Oct 25, 2019

klauspost commented Oct 12, 2019 •

edited

Loading

klauspost commented Oct 24, 2019 •

edited

Loading

klauspost commented Oct 25, 2019 •

edited

Loading