runtime: fatal error: mcall called on m->g0 stack on Windows #67108
Comments
Below is my go.mod file (with a few redactions)
This seems to be related to #56774. But in that issue, the poster said updating to 1.21.3 fixed it. However, we're using 1.21.4 and still seeing the issue.
Slightly different panic:
A new type of panic:
Just to be clear, you're cross-compiling inside docker, but running the program on real Windows machines? Do you see it only on a specific type of Windows machine? Thanks.
Correct. We cross-compile in docker, but we're running on native Windows.
Yes. We only see it on our AMD servers; example specs of the machines are listed above in the original post. None of our Intel-based servers are seeing these panics. Let me know if you have any other questions. :)
Thanks. Have you tried running the program under the race detector? Is the program using cgo? Is there a way we can reproduce the issue ourselves?
cc @golang/windows
No, but that's a good idea. I'll give that a shot today.
No, we disable it with
Unfortunately, it's a proprietary program. But given it's panicking in init() and in standard libraries, I'm going to see if I can create a small "hello world" test program and just run it continuously on the machines to see if I can reproduce it.
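Something along these lines is what I have in mind; this is just an illustrative sketch (the init-time work is arbitrary), not the actual test program:

```go
// main.go: a trivial program whose only job is to exercise the runtime's
// start-up path (package init, TLS setup, etc.) and then exit cleanly.
// The reported crashes happen before main even runs, so the body is boring.
package main

import (
	"fmt"
	"os"
)

// Package-level initialization, so this work runs during the init phase
// where the panics are being reported.
var hostname = func() string {
	h, err := os.Hostname()
	if err != nil {
		return "unknown"
	}
	return h
}()

func main() {
	fmt.Println("hello from", hostname)
}
```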
I'm wondering if there is some small incorrect assumption about AMD hardware in the Go runtime code? I see this other issue, which isn't the same stack trace, but is also limited to AMD hardware: #62440
Some more callstacks from our servers:
Another callstack:
It could be. It could also be a kernel issue. A reproducer would be very helpful. Thanks.
I'm attempting to get a minimal repro. I created something and I'm running it on all our workers periodically to try to get a failure. I'll report back here if I can get it to trigger. |
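A driver roughly like the sketch below (the path and file names are placeholders, not the real harness) runs the test binary in a loop and keeps the stderr of any run that crashes:

```go
// runloop.go: repeatedly invokes the test binary and saves stderr whenever the
// child exits non-zero, so rare runtime crashes are captured for later triage.
package main

import (
	"bytes"
	"fmt"
	"os"
	"os/exec"
	"time"
)

func main() {
	for i := 0; ; i++ {
		var stderr bytes.Buffer
		cmd := exec.Command(`C:\repro\hello.exe`) // placeholder path
		cmd.Stderr = &stderr

		if err := cmd.Run(); err != nil {
			// A runtime fatal error (e.g. "mcall called on m->g0 stack")
			// is written to stderr and the process exits non-zero.
			name := fmt.Sprintf("crash-%d.txt", time.Now().UnixNano())
			_ = os.WriteFile(name, stderr.Bytes(), 0o644)
			fmt.Printf("iteration %d failed: %v (stderr saved to %s)\n", i, err, name)
		}

		time.Sleep(time.Second)
	}
}
```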
Oh, note that the restic binary is built with go1.22.5. Perhaps try that version as well as the latest release (go1.23.1)? You can install any version of Go and then just switch with
@prattmic Yes. I installed go1.23.1.windows-amd64.msi and ran
I can try rebuilding restic, but since I'm seeing this with a simple
Agreed. If this reproduces with
Here's another:
The common theme behind these crashes is that they are upset about values ultimately loaded from the G, which is ultimately stored in TLS. I could imagine this being an OS or VMM bug corrupting the TLS, but there are some other possibilities we can investigate:
cc @qmuntal since I mention your CLs and you probably know more about Windows TLS anyway. :)
Before CL 431775, loading the TLS was only one instruction. That CL added a second instruction to resolve a newly added indirection:

```asm
MOVQ TLS, BX
// is transformed into
MOVQ runtime.tls_g(SB), BX
MOVQ 0(BX)(GS*1), BX
```

It is probably unsafe to add a preemption point in between those instructions. If that's the case, both instructions should be marked as unsafe somehow. I'm not sure if that's already done, @prattmic could you check? I don't fully understand how async preemption is implemented 😅
Great question! Async preemption won't ever preempt assembly: https://cs.opensource.google/go/go/+/master:src/runtime/preempt.go;l=408;drc=af0c40311e2ee33ecd24971257606f42a49cf593;bpv=1;bpt=1
Good point, @qmuntal. https://cs.opensource.google/go/go/+/master:src/cmd/internal/obj/x86/asm6.go;l=2249-2262 marks two-instruction TLS access nonpreemptible, and
Before the register ABI, we loaded G from TLS in almost all function prologues. With the register ABI, we usually load G from the register (R14), so TLS accesses are mostly from assembly functions. But there are still a small number of ABI0 Go functions, which will load G from TLS. Maybe they are all in the runtime package and therefore not async preemptible anyway? And at call sites in normal Go functions that call ABI0 functions, we generate a code sequence to reload G from TLS to R14. These could be preemptible, if the marking didn't work as expected.
Yeah, for a call from a normal Go function to an ABI0 function, we generate a load from TLS (https://cs.opensource.google/go/go/+/master:src/cmd/compile/internal/amd64/ssa.go;l=1080). After the expansion in CL 431775 and the unsafe point marking, the code looks like
The "second", actually third after expansion, instruction, |
Can't seem to trigger the error with go1.23.1 with
Also have not been able to trigger the error with go1.19.13:
Thanks all, sounds like we found the bug (or at least a bug!). @qmuntal would you like to send a CL? I remain confused why this fails so readily on these machines but very rarely otherwise. My best guess would be that the load from GS is particularly slow for some reason, making signals more likely to land there, but that is a bit of a stretch. Inside the bhyve VM, perhaps the load from GS is causing a VM exit, which would be very slow? But that doesn't explain the bare metal cases.
Change https://go.dev/cl/612535 mentions this issue: |
Could you try CL 612535 and see if it makes a difference? Thanks.
@distancesprinter The easiest way to test a change like @cherrymui's is probably to use
Except this gets weird since your current version of Go crashes, so I think you'll need a slightly different approach. (Thanks for all the help testing despite not being a Go developer!)
@prattmic @cherrymui So far, so good.
Installed Git for Windows, then retried
Rebooted and tried in PowerShell and Command Prompt:
@distancesprinter Great, thanks! I suppose for good measure you should double-check that tip is affected without the patch.
@prattmic @cherrymui I have some new information to report. Based on the discussion about the TLS changes introduced in Go 1.20, and because I didn't see the crash after rebooting and running

Turns out everything worked fine last night, but when I checked this morning, several crashes (output below). I will try again this evening to do a

However, since I see the problem this morning, running restic with Go 1.19.13, I am concerned, as my understanding was that the bug we're dealing with here wasn't introduced until Go 1.20. Also, it remains suspicious that I'm only seeing this on my AMD boxes, and there doesn't seem to be a good explanation for why that would be the case.

@prattmic Is it possible to/can you provide instructions to build restic with Cherry's patch? That would give me a chance to monitor a lot more executions with a scheduled backup job than just running

Here's the output (the .cmd file simply invokes the restic binary with all the flags necessary to specify the proper repository location, options, and secret), and this is running
Or see the more complicated final command in #67108 (comment) if this simple one doesn't work for some reason (just replace |
@prattmic Sad news. After reboot,
I've been sitting here rebooting and trying to generate a bunch of these. Sometimes things seem really stable: I watched 10,000 iterations of gotip version without an error. Then I'll reboot and get an immediate failure. Once the error is thrown, things seem pretty stable for a while. Pardon the imprecise language; I can't make any sense of what I'm seeing.

morestack on g0:
unexpected signal during runtime execution:
Several more today. Output below.

As I've stated, I am of course running bhyve and would be happy to reach out to that community if there is evidence that this is a VM issue, but we haven't seen any other issues running Windows 10 or Windows Server 2019 on bhyve on FreeBSD for as long as we've done it. Do any of the Go engineers have access to Zen2/Zen3 hardware to see if you can reproduce on metal? I don't have that hardware, but part of me wants to build a PC just to try to reproduce this for myself and rule out bhyve. What are the chances this is an AMD firmware or hardware issue?

Also, I haven't had any trouble with FreeBSD. It's extremely stable; I only reboot these boxes when there is a kernel update.

Several panics; the software eventually ran successfully after repeatedly invoking it:
This bug could be at multiple different layers:
Honestly, with the minimal information we have, I have a hard time estimating which is most likely. That said, Windows is a popular Go port, and Zen2/Zen3 are popular processors, so IMO something about this bug must be rare, otherwise we'd have a lot more than 2 user reports of crashes. I don't know if any of us have a Zen2/Zen3, but GCE has both (N2D instances), so we can try to reproduce in a Windows VM there when we get a chance.

[1] @RichieSams's case would need to be a different bug in this case. Unless bare metal Windows is actually running applications/parts of itself under virtualization; I feel like I read about this a few years ago, but now can't find references. Even if so, this would be a completely different VMM.
I have encountered a similar error over the past year or so, as follows:
My application is distributed across at least hundreds of Windows, macOS, and Linux machines, with potentially at least a few hundred runs daily. After analyzing the few errors that occurred, I found that they all happened on the Windows + AMD CPU platform. I feel helpless because, other than retrying, I have no idea how to resolve this low-probability but occasionally occurring issue.
#56774 (comment) reported a fix with 1.21.3. However, for me, this error is still happening as of 1.22.2. |
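As a stopgap, the "retry" approach mentioned above can be automated with a small wrapper; this is only a sketch under assumptions (the program path is a placeholder, and it keys off the fatal error string from this issue's title):

```go
// retrywrap.go: re-runs the real program when the child dies with the
// "mcall called on m->g0 stack" fatal error, up to a small bound.
package main

import (
	"bytes"
	"fmt"
	"os"
	"os/exec"
	"strings"
)

const maxAttempts = 3

func main() {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		var stderr bytes.Buffer
		cmd := exec.Command(`C:\app\worker.exe`, os.Args[1:]...) // placeholder path
		cmd.Stdout = os.Stdout
		cmd.Stderr = &stderr

		err := cmd.Run()
		os.Stderr.Write(stderr.Bytes()) // pass the child's stderr through
		if err == nil {
			return // success, nothing to retry
		}
		// Only retry the known start-up crash; fail immediately on anything else.
		if !strings.Contains(stderr.String(), "mcall called on m->g0 stack") {
			fmt.Fprintln(os.Stderr, "not the known runtime crash:", err)
			os.Exit(1)
		}
		fmt.Fprintf(os.Stderr, "attempt %d hit the runtime crash, retrying...\n", attempt)
	}
	os.Exit(1)
}
```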
I can try updating to latest and see if we still reproduce |
For what it is worth, I am still experiencing this in go 1.23.1 (even for "go version"). I was not able to reproduce in GCP. The processor model is slightly different in GCP (it seems like a special SKU for cloud providers), so it's technically not the same hardware. I have details related to my tests saved; they are not immediately available to me right now.

I did notice that the frequency of the issues for me was reduced significantly when I switched my VM from 4 vCPUs to 1 vCPU. At first, I thought that the issue had gone away. But, alas, it is still present with 1 vCPU, although less frequent.
@prattmic I attempted to test in GCP. However, this is of course using GCP's special SKU:

Whereas my hardware is:

Same OS, except that GCP is Datacenter and mine is Standard.
I was thinking of hiring a consultant to try to track down whether this is a bug in bhyve, but I am reluctant since others seem to be reporting this as occurring on metal. @choyri Can you confirm this is happening on physical machines? What are your processor SKUs and Windows versions? Go versions? My other thought was to see if I could reproduce in Server 2022. I would be happy to work with a qualified engineer who wants to debug remotely. I would need to supervise, but could set up a meeting to screen share. Thanks.
Go version
go1.21.4

Output of go env in your module/workspace:
NOTE: We compile inside docker. The output below is from running go env within the docker container that we use. We cross-compile for linux, windows, and darwin. The workers having a problem in this case are running windows/amd64.

What did you do?
We have a CLI program that runs on many thousands of VMs / bare-metal workers as part of our larger CI setup. Recently, we've been seeing a non-trivial number of panics on a specific set of workers. These workers are bare-metal AMD Threadripper machines running Windows 10. Below is some info from DXDiag. (I can get additional information if needed).
We're seeing two types of panics:
fatal error: runtime: mcall called on m->g0 stack
I have included multiple call stacks of these panics below. NOTE: these callstacks all happen on different machines. But they only happen on the same "series" of machine, which is documented above.
All the panics seem to happen during the init() phase of the runtime start-up. The panics are not very reproducible; I'm only really seeing them due to our scale.

What did you see happen?
Below are the callstacks from a number of panics across multiple machines.
Then here are some callstacks from init functions that are seemingly not following the right init order:
What did you expect to see?
The program finishes the init() phase without panics.

@gabyhelp's overview of this issue: #67108 (comment)