runtime: rare stack corruption on Go 1.20 and later #64781
Comments
cc @golang/runtime |
Hi @klauspost, others will likely have more useful comments, but two initial questions that spring to mind — are you able to gather exact kernel versions for different machines where this is observed, and is there an opportunity to try some triage debug settings (for example, along the lines of those outlined in #58212 (comment))? |
Hi Go team, thanks for looking into this. I've been able to reproduce it (rarely); here is the kernel in use right now:
|
Customer 1: CGO disabled for the build, FWIW. Updated the disassembly - it was accidentally compiled for Windows; it now shows Linux. |
Kernel: |
Two places where mistakes tend to creep in: One thing I did encounter is that races aren't instrumented in assembly, so if you have a data race it may cause some pain. Annotate your assembly (or anything that might call assembly in the stdlib) with race-detector annotations. The other is that assembly code might write outside of its bounds -- in which case, write exhaustive tests across a range of data sizes / offsets, with memory-protected byte slices. |
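A minimal sketch of that second kind of test, using a guard region with a known byte pattern instead of actual mprotect'ed pages (the package name and encodeAsm are hypothetical stand-ins for the real assembly routine):

```go
package compress_test

import (
	"math/rand"
	"testing"
)

func TestNoWritePastDst(t *testing.T) {
	const guard = 64

	src := make([]byte, 4096)
	rand.Read(src)

	// The backing buffer is larger than the destination; the tail acts as a guard.
	backing := make([]byte, 8192+guard)
	for i := range backing {
		backing[i] = 0xA5
	}
	// Capacity excludes the guard, so any write past cap(dst) lands in it.
	dst := backing[: len(backing)-guard : len(backing)-guard]

	encodeAsm(dst, src) // hypothetical assembly routine under test

	for i := len(backing) - guard; i < len(backing); i++ {
		if backing[i] != 0xA5 {
			t.Fatalf("guard byte at offset %d was modified", i)
		}
	}
}
```

Repeating this across many sizes and offsets (ideally as a fuzz target) is what tends to catch the rare off-by-one write.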
@egonelbre As I wrote, I am pretty sure this isn't a buffer write issue. I have fuzz tests in place that already check for out-of-buffer writes - allocating a bigger buffer, writing random data to it, then truncating the capacity of the buffer given to compression as the destination, and after compression checking whether the data after the buffer has been touched. Since it (to me at least) appears to be a stack issue, I don't think there are races going on, since stacks are of course never shared. I do run all tests through the race detector as part of CI. Still, instrumenting sounds like a good idea anyway - do you have any docs/examples? Google doesn't really turn up anything useful. |
Here's a quick demo https://github.com/egonelbre/compress/commit/c7df670870dc509127572281d79075a7faac8f13. Instrumenting around buffers and pools is also a good idea. |
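Roughly, that kind of instrumentation boils down to telling the race detector which ranges the assembly touches, since -race cannot see into assembly. A hedged sketch (not the demo's exact code; the helper name is made up), placed in a file that only builds under the race tag:

```go
//go:build race

package compress

import (
	"runtime"
	"unsafe"
)

// raceAnnotate reports the assembly routine's reads of src and writes to dst
// to the race detector, which cannot instrument the assembly itself.
// Call it immediately before invoking the assembly function.
func raceAnnotate(dst, src []byte) {
	if len(src) > 0 {
		runtime.RaceReadRange(unsafe.Pointer(&src[0]), len(src))
	}
	if len(dst) > 0 {
		runtime.RaceWriteRange(unsafe.Pointer(&dst[0]), len(dst))
	}
}
```

A non-race counterpart file would declare an empty raceAnnotate so normal builds are unaffected.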
I added |
@dctrwatson, that's great, thanks for trying that! What do you think about trying those one at a time, maybe in the reverse order they are listed in #58212 (comment)? |
Switching to next flag |
So looking at the generated assembly, one thing that sticks out is that
Since I don't know anything about the GC preemption it is hard for me to see if this is a problem with my code or the runtime. |
What generated assembly, in which function? Thanks. |
@cherrymui The avo generated assembly - example - meaning the assembly being called and presumably pre-empted. |
Oh, that is fine. The Go assembler itself will insert prologue and epilogue to adjust SP and BP, as long as the frame size (589848 in |
Then unfortunately that isn't where the problem lies. |
I'll switch to |
Well that was fast... got one already with
|
Interesting finding! I see that
But I see here that we are not changing the status to _Gcopystack: https://go.googlesource.com/go/+/adec22b9f7c9d9cfb95ff6af1c63ec489d6e9bb8/src/runtime/stack.go#1214 Could that be the problem? If it is not in _Gcopystack, couldn't the concurrent GC try to scan the stack while we are doing the copy? cc @cherrymui |
@mauri870 Thanks for catching that. I agree that it seems inconsistent with the comment, but I don't see anything obviously wrong. The G (and its stack) is either owned by the scan bit, or it is the running G itself, so other goroutines shouldn't be doing anything with its stack. Also, #64781 (comment) seems to suggest that stack shrinking isn't the cause. That said, if you'd like, you can try whether this patch makes any difference.
|
From what I understood, that setting - paired with the comment that we should be changing the status to _Gcopystack - raised some eyebrows. Plus there is this comment in the other usage of copystack:
|
Okay, I misread the comment #64781 (comment) -- it is gccheckmark, not gcshrinkstackoff... Then please try whether that patch makes any difference. Thanks. |
@dctrwatson If you are able to, can you compile your binary with a custom Go toolchain with cherrymui's patch and then run the program without any GODEBUG env set to see if that fixes the issue? Thanks |
@dctrwatson As far as I can tell the workflow would be:
AFAICT it shouldn't be necessary to recompile the go compiler, but someone can correct me if I am wrong. |
We are also seeing this corruption issue in production, building with Go 1.20.11 and using klauspost/compress v1.17.3, on Debian 10 running kernel 5.10.0-25-amd64 on dual-CPU Intel Xeon Gold 6348/6230R. |
@klauspost This ended up back in the compiler/runtime triage queue, do you have any updates from your end on trying it out with debug settings enabled? |
@mknyszek We also see the issue being worked around with the GODEBUG setting. For a short while it looked like "Intel SDE" could reproduce the issue, but it turned out that it was probably just emulation quirks, so I still don't have a reliable reproducer. It does match quite well with the code using quite a bit of stack - probably more than most other code out there. |
Hi all, I want to add one more case to the issue, with some additional details in case it helps with the debugging. Here are the last log lines from 3 different occurrences of the fatal error (without GODEBUG):
Pod 1:
Pod 2:
Pod 3:
I hope this helps. |
@karsov , thanks! Are you able to share the stack tracebacks as well? You can strip out the symbols, paths, and arguments, though if you could keep the runtime functions that's probably helpful. I mostly want to see the goroutines and the sp/fp printed on each frame. |
That's the strange part (at least for me). |
Oh, hmm. I guess I would expect there to be no stack trace for the "fatal: bad g in signal handler" ones. I would have expected a traceback from pod 2, though. The fact that two of them are printing "fatal: bad g in signal handler" is really strange. We don't manipulate workbufs or scan stacks from signal handlers. This suggests we somehow 1) crashed for one reason and then 2) somehow set g to nil, which would immediately cause a signal and print that message. I don't see how we get from 1 to 2. |
@karsov We also encountered the same error. Can you tell me how to set the GODEBUG=gcshrinkstackoff=1 environment variable in a k8s deployment? Thanks. |
If someone has the possibility, please test the latest release without the GODEBUG switch. v1.17.7 contains an OOB write fix. It isn't on the stack, however, so I am unsure whether it resolves this issue. We have been unable to confirm either way internally. |
I have a service running on 3 hosts. The panic occurred roughly every day. And I set
@klauspost I just happen to be running a version of thanos (which I believe is suffering from this issue) that uses v1.17.7 but just had the crash happen:
Kernel Version: 6.1.91-99.172.amzn2023.x86_64 |
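For anyone trying to confirm which compress version (and Go toolchain) a deployed binary was actually built with, `go version -m <binary>` reports it; a small Go sketch using runtime/debug does the same from inside the process:

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	info, ok := debug.ReadBuildInfo()
	if !ok {
		fmt.Println("no build info embedded in this binary")
		return
	}
	fmt.Println("built with:", info.GoVersion)
	for _, dep := range info.Deps {
		if dep.Path == "github.com/klauspost/compress" {
			fmt.Println("klauspost/compress:", dep.Version)
		}
	}
}
```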
@RodrigoMenezes-Vantage Thanks! The release above is compiled with Go 1.21.10
FWIW, we have not observed this issue at MinIO for quite a while. If it isn't the fix above (which would also make sense, since that fix was to heap rather than stack memory), it could be that Go 1.22 fixed it (we switched pretty soon after its release), but that is of course just speculation. With thousands of deploys it seems to be something machine-specific, especially since it seems to be the same machines reporting it - and being able to reproduce semi-reliably. This is of course not a great situation. Is there anything I can try to move this towards a resolution? |
@klauspost: Can you check the CPU versions (including steppings!) of the machines where it crashes and the machines where it doesn't, and see if the crashing machines are always running certain CPUs? The reason I ask is that while CPU bugs are rare, they are not unheard of, and they can have completely unpredictable consequences, up to and including allowing privilege escalation. Furthermore, these bugs can require very specific conditions to trigger, so it is quite possible that generic CPU tests did not trigger the bug. This issue reminds me of https://gallium.inria.fr/blog/intel-skylake-bug/, which turned out to be an erratum in Intel Skylake and Kaby Lake parts. |
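A quick way to collect that per machine is `grep -E 'model name|stepping|microcode' /proc/cpuinfo`; the same thing in Go (Linux only), for dropping into an existing diagnostics path, might look roughly like this:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// Prints each distinct "model name", "stepping" and "microcode" line from
// /proc/cpuinfo once, which is enough to compare crashing vs. healthy hosts.
func main() {
	f, err := os.Open("/proc/cpuinfo")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	seen := make(map[string]bool)
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		line := sc.Text()
		for _, key := range []string{"model name", "stepping", "microcode"} {
			if strings.HasPrefix(line, key) && !seen[line] {
				seen[line] = true
				fmt.Println(line)
			}
		}
	}
}
```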
@klauspost: If your customers are using hyperthreading, can you ask them to disable hyperthreading and see if the crashes go away? Edit: another thing to check would be if there is a microcode update available, and if so, if that solves the problem. |
I have added the CPU info from 3 crashing systems. Most (if not all) are virtualized, so CPUID can be unreliable. Also, this is their current (or close to current) microcode level; these crashes happened almost a year ago, so microcode levels may not reflect the level at the time of the crash. We do not have any access to these systems.
Case 1:
Case 2:
Case 3 - these are the 2 CPU variants in the cluster - I cannot tell on which the crash happened. But both have similar microcode version:
All 3 are different customers. Another crash appears to have happened on an AMD EPYC system, but I was unable to confirm this. Note that all of these are from externally running systems, so I have very limited access to information. In all cases only one of several identical servers would see the crashes happen - but at a regular frequency. All of the crashes started at the Go 1.19 -> 1.20 compiler upgrade and were either solved by a similar build with Go 1.19 or
Were these servers running on identical hosts? |
@DemiMarie If you mean the VM hosts, I strongly expect so, but have no solid proof. The hardware itself should also be fairly uniform. We checked for faulty/flaky memory as one of the first debugging steps. |
Needless to say in all cases |
Getting an rr trace might help, but that comes with significant caveats:
|
Go version
Go 1.20.x and later
What operating system and processor architecture are you using (go env)?
What did you do?
At MinIO we have been experiencing runtime crashes since the release of Go 1.20.
The issues only appear to occur for a very small number of our customers, and supplying them with a Go 1.19.x compiled binary always solves the issue.
The issue appears slightly different each time, but all of the crashes indicate some sort of corruption. Since none of them had any "smoking gun", I held off on submitting an issue.
Here are some samples of customer crashes:
Same hardware:
Second customer:
Third customer:
It does, however, seem like the "Thanos" project has supplied what looks to be a smoking gun (the issue contains additional traces).
So the crash occurs in a goroutine that compresses a block.
The only really interesting thing that goes on is that it calls assembly to compress the block. Looking at the assembly caller, it is pretty straightforward. This is the disassembly for the caller:
This is the signature of the assembly function being called:
... and the function definition:
The reason I included the stack size check is that the function uses quite a bit of stack, so there is a real chance that it actually gets called.
The stack is used for a dynamic lookup table. I am fairly sure there are no writes outside the stack, and I also am pretty confident there are no writes outside the provided slices (this would likely also give different errors).
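For context, the table lives in the assembly function's own frame. A minimal avo sketch (hypothetical, not the real generator) of how such a frame allocation is declared - avo computes the total frame size, and the Go assembler then emits the usual stack-growth prologue for it:

```go
//go:build ignore

package main

import . "github.com/mmcloughlin/avo/build"

func main() {
	TEXT("encodeBlockSketch", 0, "func(src []byte) uint64")
	Doc("encodeBlockSketch is an illustrative stand-in, not the real routine.")

	// Reserve frame space for a dynamic lookup table. This becomes part of
	// the declared frame size, so the assembler-inserted prologue performs
	// the morestack check for the whole frame (the function is not NOSPLIT).
	table := AllocLocal(64 * 1024)

	ptr := Load(Param("src").Base(), GP64())
	MOVQ(ptr, table) // touch the table so the sketch assembles

	Store(ptr, ReturnIndex(0))
	RET()
	Generate()
}
```

Running the generator with go run produces the .s file with the frame size in the TEXT directive, which is where a number like the 589848 quoted in the comments above comes from.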
I do not use the BP register - so it is not clobbered - and only SSE2 registers are used - so no VZEROUPPER weirdness. The stack is managed by avo, so it is less likely that there is a bug there.
So my questions are:
A) Am I doing something obviously wrong?
B) What would be a typical reason for this error to show up?
C) This seems related to the GC, so is there a window where the goroutine could be preempted in an unsafe state?
D) Are there any Go 1.20 changes that seem likely to be triggering this?
Keep in mind that this doesn't appear to happen on too many machines. Talos reported that it seems to happen more often if a lot of memory is allocated.
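To illustrate the kind of load that matches that observation - emphatically not a known reproducer - a stress harness could look roughly like this, assuming the s2 package's public Encode API and heavy heap churn to keep the GC (and therefore preemption and stack movement) busy:

```go
package main

import (
	"math/rand"
	"runtime"
	"sync"

	"github.com/klauspost/compress/s2"
)

func main() {
	src := make([]byte, 1<<20)
	rand.Read(src)

	var wg sync.WaitGroup
	for i := 0; i < runtime.GOMAXPROCS(0)*4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			dst := make([]byte, s2.MaxEncodedLen(len(src)))
			for {
				_ = s2.Encode(dst, src)
				// Churn the heap so the GC runs often, maximizing the chance
				// of preempting goroutines inside the assembly-heavy path.
				_ = make([]byte, rand.Intn(1<<16)+1)
			}
		}()
	}
	wg.Wait()
}
```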
I will of course assist with any information that may be needed - but I feel at this point I need some pointers from people with a deeper understanding of the runtime to get much further.
Also note we have tested CPU+RAM on some of these customer systems extensively, since that seemed like a possibility at first. Also note that the crashes could be completely unrelated - but the coincidence seems too big.
What did you expect to see?
No crash
What did you see instead?
Rare, random runtime crashes
Edit: The assembly shown is now Linux-compiled.