runtime: ReadMemStats fatal error: mappedReady and other memstats are not equal #64401
Comments
cc @golang/runtime |
I added this crash because The number being off by exactly 64 KiB in both cases is very suspicious. I'll poke around and see if I can find anything. If this seems like it's just the cause of a benign race or something I'll just send a CL to remove the crash by default. |
Interesting observation! I looked at a few more crashes from the last week. They are all off by 64 KiB:
One other observation: We have another set of programs that use this same shared monitoring code and are still built with Go 1.20. So far I have not found a program built with Go 1.20 among these crashes. This is absolutely no guarantee, but there is a chance this is new to Go 1.21 somehow. |
Thanks, that's helpful! |
A few notes:
|
The execution trace buffers are exactly 64 KiB in size, but observing the skew would require a preemption in the trace flush path, but the flush path always happens on the system stack, so that's not possible. Alternatively a trace event could be emitted by the thread that stopped the world between the reading of I think that's technically not possible today, but that actually is now possible on tip. Bare threads that don't synchronize with a stop-the-world can freely emit events now... hm... That might be a good enough reason to put these checks behind a debug mode, but the idea that there's a miscalculation somewhere still bothers me. Also, this point is entirely moot if you don't have any kind of automated execution trace collection in production. (Do you?) |
We actually have automated execution tracing widely enabled in prod, and it's possible that this is what started this problem (I'll try to see if we have enough historic data to prove this). Also: The go1.20 services mentioned by @evanj don't have execution tracing enabled, so this might be a promising direction to investigate further, even if it's not clear yet how this is possible. |
Digging a little deeper on the tracer theory, I don't think it can possibly be a preemption at a bad time. This is the entire window in which it could get preempted to cause skew, and that window definitely has zero preemption points (those method calls are inlined and intrinsified, and on top of that all this is always called on the system stack). However, it's definitely possible for sysmon to generate an event while the world is stopped, for example via I'll send a CL to put these checks behind a double-check mode. |
It seems like some of the historic data for this has moved outside its retention window 😞. When the issue was first noticed internally, we actually suspected execution tracing, but ruled it out after finding a crash that occurred 1 day prior to enabling execution tracing by default. But given your comments above, and the fact that we don't see this issue for our go1.20 services that don't have execution tracing enabled, I suspect that we incorrectly ruled out execution tracing. Maybe the service had manually opted into this feature, or timezones, or *waves hand*. So yeah, 👍 to putting these checks behind a double-check mode (is this the same as a debug mode?). Thank you so much for the quick help with this! PS: If it's possible to back-port this to go1.21 that'd be awesome. Let me know if I can help with it. |
@gopherbot Please open backport issues for Go 1.20 and Go 1.21. I think it should just be backported. This can cause crashes in correct programs without a workaround, and the fix is very safe. |
Backport issue(s) opened: #64409 (for 1.20), #64410 (for 1.21). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases. |
Change https://go.dev/cl/545277 mentions this issue: |
@felixge Makes sense. Even if there is some other super-rare issue that we're not seeing, this is at least one real one and we should fix it. :) |
Change https://go.dev/cl/545556 mentions this issue: |
Change https://go.dev/cl/545557 mentions this issue: |
…hind a double-check mode

ReadMemStats has a few assertions it makes about the consistency of the stats it's about to produce. Specifically, how those stats line up with runtime-internal stats. These checks are generally useful, but crashing just because some stats are wrong is a heavy price to pay.

For a long time this wasn't a problem, but very recently it became a real problem. It turns out that there's real benign skew that can happen wherein sysmon (which doesn't synchronize with a STW) generates a trace event when tracing is enabled, and may mutate some stats while ReadMemStats is running its checks.

Fix this by synchronizing with both sysmon and the tracer. This is a bit heavy-handed, but better that than false positives.

Also, put the checks behind a debug mode. We want to reduce the risk of backporting this change, and again, it's not great to crash just because user-facing stats are off. Still, enable this debug mode during the runtime tests so we don't lose quite as much coverage from disabling these checks by default.

For #64401.
Fixes #64409.

Change-Id: I9adb3e5c7161d207648d07373a11da8a5f0fda9a
Reviewed-on: https://go-review.googlesource.com/c/go/+/545277
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Felix Geisendörfer <felix.geisendoerfer@datadoghq.com>
(cherry picked from commit b2efd1d)
Reviewed-on: https://go-review.googlesource.com/c/go/+/545556
Auto-Submit: Matthew Dempsky <mdempsky@google.com>
TryBot-Bypass: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
…hind a double-check mode

ReadMemStats has a few assertions it makes about the consistency of the stats it's about to produce. Specifically, how those stats line up with runtime-internal stats. These checks are generally useful, but crashing just because some stats are wrong is a heavy price to pay.

For a long time this wasn't a problem, but very recently it became a real problem. It turns out that there's real benign skew that can happen wherein sysmon (which doesn't synchronize with a STW) generates a trace event when tracing is enabled, and may mutate some stats while ReadMemStats is running its checks.

Fix this by synchronizing with both sysmon and the tracer. This is a bit heavy-handed, but better that than false positives.

Also, put the checks behind a debug mode. We want to reduce the risk of backporting this change, and again, it's not great to crash just because user-facing stats are off. Still, enable this debug mode during the runtime tests so we don't lose quite as much coverage from disabling these checks by default.

For #64401.
Fixes #64410.

Change-Id: I9adb3e5c7161d207648d07373a11da8a5f0fda9a
Reviewed-on: https://go-review.googlesource.com/c/go/+/545277
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Felix Geisendörfer <felix.geisendoerfer@datadoghq.com>
(cherry picked from commit b2efd1d)
Reviewed-on: https://go-review.googlesource.com/c/go/+/545557
Auto-Submit: Matthew Dempsky <mdempsky@google.com>
TryBot-Bypass: Matthew Dempsky <mdempsky@google.com>
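To make the commit message's reasoning concrete, here is a toy model of the skew, not the runtime's actual code. The type `toyStats`, its fields, the two-step update, and the `doubleCheck` parameter are all hypothetical names invented for illustration; the only details taken from the thread are the 64 KiB trace buffer size and the idea of gating the consistency check behind a debug mode. The sketch simulates the interleaving deterministically instead of relying on a real race.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// traceBufSize matches the 64 KiB skew reported in the thread.
const traceBufSize = 64 << 10

// toyStats is a hypothetical stand-in for a total and the components
// that are expected to sum to it. It is NOT the runtime's data layout.
type toyStats struct {
	mappedReady atomic.Int64 // total
	traceBufs   atomic.Int64 // one component of the total
	other       atomic.Int64 // every other component, lumped together
}

// An unsynchronized writer (sysmon in the thread) updates the two
// counters in separate steps. The order here is an assumption.
func (s *toyStats) beginTraceBufAlloc()  { s.traceBufs.Add(traceBufSize) }
func (s *toyStats) finishTraceBufAlloc() { s.mappedReady.Add(traceBufSize) }

// skew reports total minus the sum of the components; zero means consistent.
func (s *toyStats) skew() int64 {
	return s.mappedReady.Load() - (s.traceBufs.Load() + s.other.Load())
}

// checkConsistency only inspects the counters when doubleCheck is set,
// mirroring the idea of putting the assertion behind a debug mode.
func checkConsistency(s *toyStats, doubleCheck bool) error {
	if !doubleCheck {
		return nil
	}
	if d := s.skew(); d != 0 {
		return fmt.Errorf("mappedReady and other stats differ by %d bytes", d)
	}
	return nil
}

func main() {
	s := &toyStats{}
	s.other.Store(1 << 20)
	s.mappedReady.Store(1 << 20)

	// A reader that looks between the writer's two steps sees benign skew.
	s.beginTraceBufAlloc()
	fmt.Println("skew mid-update:", s.skew())
	fmt.Println("default mode:", checkConsistency(s, false))
	fmt.Println("double-check mode:", checkConsistency(s, true))

	// Once the writer finishes, the counters agree again.
	s.finishTraceBufAlloc()
	fmt.Println("after update completes:", checkConsistency(s, true))
}
```

The point of the model is that the mismatch is transient and exactly one buffer in size, so a fatal check in the default mode turns a harmless observation into a crash.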
Go version
go version go1.21.4 linux/amd64 (also arm64)
Reproducibility
What operating system and processor architecture are you using (go env)?
What did you do?
We have a set of processes that periodically call expvar.Do, which calls runtime.ReadMemStats to collect Go memory statistics. We are seeing occasional crashes with the message "mappedReady and other memstats are not equal" across multiple separate programs, on both amd64 and arm64. This comes from the following line of code in the Go runtime: https://github.com/golang/go/blob/master/src/runtime/mstats.go#L487.
These programs share some common metrics/monitoring code, so I suspect there is something that all these processes are doing which triggers this problem. We have been unable to figure out what it may be. It seems to happen only after a process has been running for a few hours. Some of these processes use Cgo libraries, but some should be pure Go. Any suggestions for how to help track this down would be appreciated.
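For readers unfamiliar with this code path, a minimal sketch of the polling pattern might look like the following. The function name `collectMemstats`, the 10 ms interval, and the loop count are assumptions for illustration and not the reporters' actual monitoring code. It relies on the standard library's `expvar` package publishing a `memstats` variable whose value is computed via `runtime.ReadMemStats` when it is rendered.

```go
package main

import (
	"expvar"
	"fmt"
	"time"
)

// collectMemstats walks every published expvar variable once.
// Rendering the built-in "memstats" variable with String() is what
// ends up calling runtime.ReadMemStats inside the expvar package.
func collectMemstats() (found bool) {
	expvar.Do(func(kv expvar.KeyValue) {
		if kv.Key == "memstats" {
			_ = kv.Value.String() // JSON-encodes a fresh runtime.MemStats
			found = true
		}
	})
	return found
}

func main() {
	// Hypothetical polling interval; real monitoring code would
	// typically poll far less often and run for hours.
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()

	for i := 0; i < 5; i++ {
		<-ticker.C
		fmt.Println("memstats collected:", collectMemstats())
	}
}
```

Note that expvar.Do on its own only iterates over the variables; the call into the runtime happens when the memstats value is actually read.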
What did you expect to see?
No crashes.
What did you see instead?
Example crash from amd64
Example crash from arm64