runtime: fatal error: mcall called on m->g0 stack on Windows #67108
Below is my go.mod file (with a few redactions)
This seems to be related to #56774. But in that issue, the poster said updating to 1.21.3 fixed it. However, we're using 1.21.4 and still seeing the issue.
Slightly different panic:
A new type of panic:
Just to be clear: you're cross-compiling inside docker, but running the program on real Windows machines? Do you see it only on a specific type of Windows machine? Thanks.
Correct. We cross-compile in docker, but we run natively on Windows.
Yes. We only see it on our AMD servers. Example specs of the machines are listed above in the original post. None of our Intel-based servers are seeing these panics. Let me know if you have any other questions. :)
Thanks. Have you tried running the program under the race detector? Is the program using cgo? Is there a way we can reproduce the issue ourselves?
cc @golang/windows
No, but that's a good idea. I'll give that a shot today
No, we disable it with
Unfortunately, it's a proprietary program. But given it's panicking in init() and in standard libraries, I'm going to see if I can create a small "hello world" test program and just run it continuously on the machines to see if I can reproduce it.
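A stress program along those lines might look like the following. This is a sketch, not the actual test program from the thread; the names are illustrative, and the stack-growth recursion is just one cheap way to exercise the morestack/GC paths that these crash reports implicate:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// grow recurses with a sizeable stack frame to force goroutine
// stack growth (morestack), one of the code paths in the crashes.
func grow(depth int) {
	if depth == 0 {
		return
	}
	var buf [256]byte
	_ = buf
	grow(depth - 1)
}

// stress spawns n goroutines that each grow their stack, then forces
// a GC cycle, returning how many goroutines completed.
func stress(n int) int {
	var wg sync.WaitGroup
	done := make(chan struct{}, n)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			grow(64)
			done <- struct{}{}
		}()
	}
	wg.Wait()
	runtime.GC() // exercise stop-the-world / preemption paths
	return len(done)
}

func main() {
	fmt.Println("completed:", stress(100))
}
```

Run in a loop (e.g. from a scheduled task on the affected machines), any crash output would land on stderr.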
I'm wondering if there is some small incorrect assumption about AMD hardware in the Go runtime code? I see this other issue, which isn't the same stack trace, but is also limited to AMD hardware: #62440
Some more callstacks from our servers:
Another callstack:
It could be. It could also be a kernel issue. A reproducer would be very helpful. Thanks.
I'm attempting to get a minimal repro. I created something and I'm running it on all our workers periodically to try to get a failure. I'll report back here if I can get it to trigger. |
Indeed, the mismatch is suspicious, but there could be different issues. e.g., it looks like you have only gotten "mcall called on m->g0 stack" and "morestack on g0", while @RichieSams has gotten a slightly wider variety of other corruption-looking crashes.
In summary, we have: AMD Ryzen Threadripper 3990X -> Zen 2 -> bare metal crashes |
@prattmic The frequency of my specific crash might be related to the fact that I was trying to invoke it (it tended to happen reliably after a fresh boot) and probably also related to the fact that all I was doing in my tests was printing the version string for restic over and over again. I got the following today, which occurred on a host with 10 day uptime:
I'm not sure where to go next since I'm not a Go developer. I was thinking of trying to compile on Windows instead of a Linux docker container, just as an experiment, but I'm not sure whether the restic folks have a good recipe for me, given cross-compiling in docker seems to be the preferred methodology for obvious reasons. I might also try to spin up a Server 2022 Evaluation VM to see if I encounter the same behavior. I think Server 2022 supports DTrace, which could be helpful to the right person.
@prattmic Correct. They're bare metal servers running Windows 10. Below is the dxdiag report of one of the machines:
Google says
Yesterday, I also got a crash on a user's desktop. (I'll ask them for their CPU model number and post it here when they respond). Stack trace:
User's desktop CPU:
Someone on https://smartos.topicbox.com/groups/smartos-discuss/Tf8f450fd133908fb mentioned that they would get similar crashes just running
Go has quite strong cross-compilation support, so I would be surprised if the build host mattered, especially as restic isn't using cgo. That said, their build process is slightly non-standard. Here's the final build configuration:
The only non-standard thing I notice here is
If you install Go, you can build restic with:
(FWIW, note there is a v0.17.1 now) This should install a restic binary to
To get the exact same build as restic's config:
I'd be curious if you can reproduce with this, as well as with the extra flags dropped. (If it is easier to cross-compile from another OS, then
Oh, note that the restic binary is built with go1.22.5. Perhaps try that version as well as the latest release (go1.23.1)? You can install any version of Go and then just switch with
@prattmic Yes. I installed go1.23.1.windows-amd64.msi and ran
I can try rebuilding restic, but since I'm seeing this with a simple
Agreed. If this reproduces with
Here's another:
The common theme behind these crashes is they are upset about values ultimately loaded from the G, which is ultimately stored in TLS. I could imagine this being an OS or VMM bug corrupting the TLS, but there are some other possibilities we can investigate:
cc @qmuntal since I mention your CLs and you probably know more about Windows TLS anyway. :) |
Before CL 431775, loading the TLS was only one instruction. That CL added a second instruction to resolve a newly added indirection:

```
MOVQ TLS, BX
// is transformed into
MOVQ runtime.tls_g(SB), BX
MOVQ 0(BX)(GS*1), BX
```

It is probably unsafe to add a preemption point in between those instructions. If that's the case, both instructions should be marked as unsafe somehow. I'm not sure if that's already done. @prattmic, could you check? I don't fully understand how async preemption is implemented 😅
Great question! Async preemption won't ever preempt assembly: https://cs.opensource.google/go/go/+/master:src/runtime/preempt.go;l=408;drc=af0c40311e2ee33ecd24971257606f42a49cf593;bpv=1;bpt=1
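For background, Go 1.14 introduced signal-based (asynchronous) preemption precisely so that code with no cooperative preemption points can still be stopped, e.g. for GC's stop-the-world. A minimal illustration of the mechanism under discussion (a sketch; the spinning loop contains no function calls, so only async preemption can stop it):

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// demo starts a goroutine spinning in a call-free loop, then forces a
// GC cycle. GC requires a stop-the-world pause, so this returns only
// if the runtime can preempt the loop asynchronously (Go 1.14+);
// before async preemption, such a loop could block STW indefinitely.
func demo() bool {
	var stop atomic.Bool
	go func() {
		for !stop.Load() {
			// no call sites: nothing for cooperative preemption to hook
		}
	}()
	runtime.GC() // completes only if the spinning goroutine is preempted
	stop.Store(true)
	return true
}

func main() {
	fmt.Println("GC completed:", demo())
}
```

The bug discussed here is about exactly where those preemption signals are allowed to land: certain multi-instruction sequences (like the two-instruction TLS load above) must be marked as unsafe points.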
Good point, @qmuntal. https://cs.opensource.google/go/go/+/master:src/cmd/internal/obj/x86/asm6.go;l=2249-2262 marks two-instruction TLS accesses as non-preemptible, and
Before the register ABI, we loaded G from TLS in almost all function prologues. With the register ABI, we usually load G from the register (R14), so TLS accesses are mostly from assembly functions. But there is still a small number of ABI0 Go functions, which will load G from TLS. Maybe they are all in the runtime package and therefore not async preemptible anyway? And at call sites in normal Go functions to ABI0 functions, we generate a code sequence to reload G from TLS into R14. These could be preemptible, if the marking didn't work as expected.
Yeah, for a call from a normal Go function to an ABI0 function, we generate a load from TLS https://cs.opensource.google/go/go/+/master:src/cmd/compile/internal/amd64/ssa.go;l=1080 . After the expansion in CL 431775, and the unsafe point marking, the code looks like
The "second", actually third after expansion, instruction, |
Can't seem to trigger the error with go1.23.1 with
Also have not been able to trigger the error with go1.19.13:
Thanks all, sounds like we found the bug (or at least a bug!). @qmuntal, would you like to send a CL? I remain confused about why this fails so readily on these machines but very rarely otherwise. My best guess would be that the load from GS is particularly slow for some reason, making signals more likely to land there, but that is a bit of a stretch. Inside the bhyve VM, perhaps the load from GS is causing a VM exit, which would be very slow? But that doesn't explain the bare-metal cases.
Change https://go.dev/cl/612535 mentions this issue: |
Could you check whether CL 612535 makes a difference? Thanks.
@distancesprinter The easiest way to test a change like @cherrymui's is probably to use
Except this gets weird since your current version of Go crashes. So I think you'll need
I think
(Thanks for all the help testing despite not being a Go developer!)
@prattmic @cherrymui So far, so good.
Installed Git for Windows, then retried
Rebooted and tried in PowerShell and Command Prompt:
@distancesprinter Great, thanks! I suppose for good measure you should double-check that tip is affected without the patch. |
@prattmic @cherrymui I have some new information to report. Based on the discussion about the TLS changes introduced in Go 1.20, and because I didn't see the crash after rebooting and running
Turns out everything worked fine last night, but when I checked this morning, there were several crashes (output below). I will try again this evening to do a
However, since I see the problem this morning, running restic with Go 1.19.13, I am concerned, as my understanding was that the bug we're dealing with here wasn't introduced until Go 1.20. Also, it remains suspicious that I'm only seeing this on my AMD boxes, and there doesn't seem to be a good explanation for why that would be the case. @prattmic Is it possible to/can you provide instructions to build restic with Cherry's patch? That would give me a chance to monitor a lot more executions with a scheduled backup job than just running
Here's the output (the .cmd file simply invokes the restic binary with all the flags necessary to specify the proper repository location, options, and secret), and this is running
Or see the more complicated final command in #67108 (comment) if this simple one doesn't work for some reason (just replace |
@prattmic Sad news. After reboot,
I've been sitting here rebooting and trying to generate a bunch of these. Sometimes things seem really stable: I watched 10,000 iterations of gotip version without an error. Then I'll reboot and get an immediate failure. Once the error is thrown, things seem pretty stable for a while. Pardon the imprecise language; I can't make any sense of what I'm seeing. morestack on g0:
unexpected signal during runtime execution:
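A small wrapper can automate that run-until-crash loop and record the first failing iteration. This is a sketch with a hypothetical helper name (`hammer`); on the Windows machines in question it would translate to a PowerShell or batch equivalent:

```shell
#!/bin/sh
# hammer: run a command repeatedly, stopping at the first failure.
# Usage: hammer <iterations> <command> [args...]
hammer() {
  n=$1; shift
  i=0
  while [ "$i" -lt "$n" ]; do
    if ! "$@" >/dev/null 2>&1; then
      echo "crashed at iteration $i"
      return 1
    fi
    i=$((i+1))
  done
  echo "ok after $n iterations"
}
```

e.g. `hammer 10000 ./restic version` reproduces the "10,000 iterations" experiment described above, and capturing stderr instead of discarding it would preserve the crash stacks.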
Several more today. Output below. As I've stated, I am of course running bhyve and would be happy to reach out to that community if there is evidence that this is a VM issue, but we haven't seen any other issues running Windows 10 or Windows Server 2019 on bhyve on FreeBSD for as long as we've done it. Do any of the Go engineers have access to Zen2/Zen3 hardware to see if you can reproduce on metal? I don't have that hardware, but part of me wants to build a PC just to try to reproduce this myself, to rule out bhyve. What are the chances this is an AMD firmware or hardware issue? Also, I haven't had any trouble with FreeBSD. Extremely stable. I only reboot these boxes when there is a kernel update. Several panics; the software eventually ran successfully after repeatedly invoking it:
This bug could be at multiple different layers:
Honestly, with the minimal information we have, I have a hard time estimating which is most likely. That said, Windows is a popular Go port, and Zen2/Zen3 are popular processors, so IMO something about this bug must be rare; otherwise we'd have a lot more than 2 user reports of crashes. I don't know if any of us have a Zen2/Zen3, but GCE has both (N2D instances), so we can try to reproduce in a Windows VM there when we get a chance. [1] @RichieSams's case would need to be a different bug in this case. Unless bare-metal Windows is actually running applications/parts of itself under virtualization. I feel like I read about this a few years ago, but now can't find references. Even if so, this would be a completely different VMM.
Go version

go1.21.4

Output of go env in your module/workspace:

NOTE: We compile inside docker. The output below is from running go env within the docker container that we use. We cross-compile for linux, windows, and darwin. The workers having a problem in this case are running windows/amd64.

What did you do?

We have a CLI program that runs on many thousands of VMs / bare-metal workers as part of our larger CI setup. Recently, we've been seeing a non-trivial number of panics on a specific set of workers. These workers are bare-metal AMD Threadripper machines running Windows 10. Below is some info from DxDiag. (I can get additional information if needed.)

We're seeing two types of panics:

fatal error: runtime: mcall called on m->g0 stack

I have included multiple call stacks of these panics below. NOTE: these call stacks all happen on different machines, but they only happen on the same "series" of machine, which is documented above.

All the panics seem to happen during the init() phase of the runtime start-up. The panics are not very reproducible; I'm only really seeing them due to our scale.

What did you see happen?

Below are the call stacks from a number of panics across multiple machines.

Then here are some call stacks from inits that are seemingly not following the right init order:

What did you expect to see?

The program finishes the init() phase without panics.