Join GitHub today
GitHub is home to over 36 million developers working together to host and review code, manage projects, and build software together.Sign up
runtime: infinite recursion on windows triggered by morestack #21382
Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (
So I'm stepping through the assembly and there's more fishy stuff.
However, at that point rax is 0x17, so trying to de-reference [rax-75h] throws an exception:
That doesn't make sense to me unless this is a trick to just trigger an exception.
Here's a what gets executed, according to windbg, when single-stepping from int 3 to calling morestack again:
I don't get how executing:
ends up going to
I'd like to understand how we wound up in morestack without any remaining system stack space in the first place. Once we hit the INT $3, it would be nice to fail more gracefully, but things are toast anyway.
If there any way MSHTML could be calling back into Go code while deep in the stack?
If not, and I'm grasping at straws here, but my guess is that the C "syscall" code (which runs on the system stack) is running out of stack space, which invokes a Windows exception handler registered by the runtime, which also attempts to run on the system stack and fails when it sees there's no stack left. @alexbrainman, I know very little about how Windows exception handlers work; does this seem like a plausible explanation?
(Notably, on UNIX platforms, the signal handler runs on yet another stack that's only for signal handling, so even if we run out of space on the system stack, we have a little more backup room in which to fail gracefully.)
Actually, this is sort of interesting, though I'm not sure what to make of it:
@kjk, can you put a breakpoint in
To summarize my guesses at this point:
It's not caused by running out of stack space.
It seems to be 64-bit only so I assume it's some of the arch-specific runtime assembly routines.
mshtml per se doesn't call Go but there are plenty of C->Go->C transitions because of how Windows message processing works.
Each window has a callback (called wndproc) responsible for handling message for that window. In Windows every control (a button, listview, browser view etc.) is a window.
To add custom handling of messages we need to provide our own wndproc callback, which must be called via C->Go trampoline. When that callback is not interested in the message, we need to call the original wndproc for that message, which is Go->C transition.
So every GUI windows program has a high rate of C->Go and Go->C transitions, especially those using https://github.com/lxn/walk/ library, as it hoooks wndproc for all windows it creates.
This also makes debugging with breakpoints impossible. I've spent several hours setting breakpoints at various points and stepping through the code but the same code works correctly the first 1000 times and then fails.
To summarize my beliefs:
The most promising approach would be to instrument
Why do you say that? It never makes sense to call morestack unconditionally, and, looking at the disassembly of
This isn't quite right. There is no separate "scheduler stack", there's just the user stack and the system stack (and the signal stack on UNIX). If
What makes you say there's plenty of stack when it calls
Can you point me to where your code is doing this? Normally this would go through the cgo callback paths, but since your application isn't using cgo, I'm curious how this is being done.
Given the C->Go callbacks, this is all precisely the behavior I would expect if C code were using up the system stack and then calling back into Go code.
(From my earlier post:)
Oops, I'd missed the
Like I said, those are guesses, you're more likely to be right than me.
I'm just parroting back terminology used by the code e.g.
I've tried the repro with ridiculously large (16 MB) stack and got the same thing.
In the debugger, I printed the callstack and it was relatively short from main().
Either way, this particular issue is due to
It's also consistent with being confused about which stack the code is on.
If the code is confused about which stack it is on, then we might be on a thread with plenty of stack but "needs to grow stack" check is done on the wrong stack, wrongly detects need to expand stack, calls
This is all done in the lnx/walk library:
Windows GUI code is roughly this (https://github.com/lxn/walk/blob/2d327b4a1aba7cda2a365bc566fd60ea6bd4c8bf/form.go#L365):
It's unavoidable to get Go -> C -> Go -> C in Windows GUI programs. Using cgo is not necessary for that.
Windows exception handler calls runtime.sigtramp. The runtime.sigtramp will run on scheduler stack. Also see _StackSystem is used to make sure we always have enough room to run exception handler.
If you are interested to see simple Windows GUI app, you can download d8b239ff60a62c3f50f7eb5994221b50ba055cf2 commit (initial commit) of https://github.com/alexbrainman/gowingui
The problem here is that after the runtime exceeds the g0 stack bounds (which happens because of #20975), it calls
Following up to #20975 (comment):
I assume that because it's true. :)
It's nice to have badmorestack fail in a "something went wrong, please debug me" way.
It would probably work fine to avoid trying to grow the stack until we know we're past the
I attempted to fix this with https://golang.org/cl/120857 but that didn't work. Unfortunately, I really have no way to debug this on Windows (I spent most of this afternoon trying to bring up a Windows VM to the point where I could debug this and didn't even get close). If someone else can apply that CL, set a breakpoint on
I got it to the point of being able to run
I know that all.bat eventually runs
If you were on linux, how would you run just that one test under gdb?
For what it's worth, here's how I did setup on Windows: https://www.notion.so/Building-Go-from-source-on-Windows-bf2363a561864fa58c08a3c6d2305f97
BTW: I tried:
Which would work in normal circumstances but I get a lengthy error message:
Why does it even talk about
@kjk, for this specific problem, run
Those are notes in progress.
I got to a point I can get run the test under gdb with (in powershell):
So far this is with
To run tests I do:
First anomaly: the initial signal is sigsev i.e. writing to unaccessible memory. That is unexpected as I would expect stack check to happen first. It seems like stack check doesn't work:
My assembly is rusty but: the line that crashes is
It should never get there because the stack check should detect lack of stack and jump to 0x45b7b4 to call
If my reading is correct then putting anything on stack is likely to cause another crash (i.e. sigsev).
Interestingly, adding nosplit makes things worse.
The code that is in tree only prints "fatal: morestack on g0" twice.
Here's the difference in the debugger when runtime.badmorestackg0 is hit:
Code in tree:
Adding more nosplit:
This seems consistent with my theory that the root of the problem is that we allow sigsev due to accessing outside of stack.
With more nosplit more is on the stack so we get into loop: sigsev -> call exception handler -> call badmorestack -> sigsev because stack is broken -> call exception handler etec.
referenced this issue
Jun 29, 2018
@kjk are you sure you are debugging correct test? The test needs to have TEST_G0_STACK_OVERFLOW environment variable set:
Unfortunetly, my gdb does not work properly when I do stepi command at critical moment.
@kjk perhaps you will have better luck with your gdb version. Or maybe you can use different debugger here.
@aclements let me know what else I can try here.
@alexbrainman I'm pretty sure
You can see the exact commands I ran: https://www.notion.so/Building-Go-from-source-on-Windows-bf2363a561864fa58c08a3c6d2305f97
I did set TEST_G0_STACK_OVERFLOW=1
And more notes from inside gdb: https://gist.github.com/kjk/962a3ac824ef5492921dad4ad2619080
I retested with https://go-review.googlesource.com/c/go/+/120195 i.e. windows stack check fix.
The root cause is still that stack zeroing causes SIGSEV aka. ACCESS VIOLATION. I now understand the code better, so here's what I found.
First, annotated disassembly of stackOverflow function:
At the time of the crash, %rcx is g:
i.e. stack is ~2MB, which checks out.
At the time of the crash:
We crash trying to write to address
Here's the weird thing: 0xd73ff0 seems to be a legit address within stack bounds:
Furthermore, I can access this memory under gdb:
If this address was really in-accessible, I would get something like:
So here's my wild theory: somehow memory mapping when extending stack is racy. So when the code is just running, it hits Access Violation because the memory is not yet accessible. But when we land in the debugger, the memory mapping is finalized so we can access the memory.
Unfortunately gdb on Windows doesn't support
I was able to use WinDBG so I narrowed down the issue.
The exception that happens is actually stack overflow. All other info I got from gdb is correct.
I also narrowed it down to delayed stack expansion.
I changed the repro to use 1048 bytes of stack (instead of 16) to speed up the repro. This changes the stack clearing code a bit.
I managed to figure out a conditional breakpoint that triggers right before crash:
When breakpoint happens:
Notice that at this point rsp is invalid. Rsp is d73dc0. The stack base in g is d7200 which is legit but stack base in TEB is d74000. Memory mapping shows that area in d71000-d74000 has PAGE_GUARD bit set.
Stepping through next 2 instructions is still fine:
Next instructions will cause stack overflow by trying to access 00d73dc8:
At this point the stack is properly mapped:
Continuing will eventually crash with AV somewhere in runtime but without symbols can't tell where.
Another theory: there is one-off error in the code that maps system stack (which I cannot find) because this happens reliably when accessing what looks like the lowest page of the stack.
When we have the following mapping:
I assume the part with PAGE_GUARD is there by default. At some point we detect we need to commit more stack and we call mmap() to extend stack.
It looks like we fail to do that for the lowest page.
Another theory: this might be caused by mismatch of g.stack.lo, which is 0x00d72000 and TEB.StackLimit which is 0d71000 which seems to be 4096 bytes which looks like 1 page.
Another observation: memoryBasicInformation.baseAddress and memoryBasicInformation.allocationBase is shifted by 0x1000 from TEB StackBase and StackLimit (i.e. StackLimit is 0xa1000 and allocationBase is 0xa0000).
g0.stack.lo is allocationBase + 0x2000 which explains 0x1000 difference (0x2000 - 0x1000) difference between g0.stack.lo and TEB.StackBase.
But not sure if that's relevant. I patched minit() with:
to make them match and that didn't change anything.
Changing slack value from 8*1024 to 16*1024 (
However, things then get recursive i.e. code in exception handler will call
And here's a fix:
Instead of trying to remove implicit calls to
With this change I get the proper clean exit with callstacks printed:
Thanks for the great debugging @kjk! You've definitely found the root of the problem: the initial INT3 traps fine and we detect that there's a problem, but since the stack bounds aren't quite right we wind up walking off the edge of the actual stack and the subsequent failures are a different exception, which we don't handle so carefully.
It seems like we should perhaps use the TIB instead of VirtualQuery to get the stack bounds (I'm not sure why I didn't come across that when originally figuring out how to get the stack bounds), and do something to make room for handling the exception (like the slack you added in
To answer some of your other questions, which you may have already found the answers to:
Unlike goroutine stacks, this stack is allocated by Windows itself when we create the thread.
There's nothing in the Go runtime that commits more stack or in any way extends a system stack. The OS commits more stack memory as we touch it, but that's transparent to Go.
That's not a bad idea, though it needs to be done a little more carefully. :)
I've been trying to figure out what stack the vectored exception handler runs on when it's handling a stack overflow exception without much luck. The closest I've come is https://stackoverflow.com/questions/1897301/vectored-exception-handling-during-stackoverflowexception, but that could mean the OS reserves some small dedicated stack for this purpose, or that it lets the stack grow into the
Sigh. Apparently the StackLimit field in the TIB gives the limit of the committed stack, not the reserved stack, so that's not useful. There's a later field with the "Address of memory allocated for stack" but that returns the same base address as
Based on https://docs.microsoft.com/en-us/windows/desktop/Memory/creating-guard-pages, this is bit a different from the guard pages I'm used to. Apparently Windows will let you use that memory, but only after the process has handled a