Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: ReadMemStats fatal error: mappedReady and other memstats are not equal #64401

Closed
1 task done
evanj opened this issue Nov 27, 2023 · 15 comments
Closed
1 task done
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@evanj
Copy link
Contributor

evanj commented Nov 27, 2023

Go version

go version go1.21.4 linux/amd64 (also arm64)

Reproducibility

  • Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

(NOTE: This is our build configuration; we also build for arm64 and see this crash there too)

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/gocache'
GOENV='/home/user/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/go'
GOPRIVATE=''
GOPROXY=''
GOROOT='/root/.gimme/versions/go1.21.4.linux.amd64'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/root/.gimme/versions/go1.21.4.linux.amd64/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.21.4'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD=''
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build2851638515=/tmp/go-build -gno-record-gcc-switches'

What did you do?

We have a set of processes that periodically calls expvar.Do, which calls runtime.ReadMemStats to collect Go memory statistics. We are seeing occasional crashes with the message "mappedReady and other memstats are not equal" across multiple separate programs, on both amd64 and arm64. This comes from the following line of code in the Go runtime: https://github.com/golang/go/blob/master/src/runtime/mstats.go#L487 .

These programs share some common metrics/monitoring type of code, so I suspect there is something that all these processes are doing which triggers this problem. We have been unable to figure out what it may be. It seems to happen only after a process has been running for a few hours. Some of these processes use Cgo code libraries, but some should be only Go code. Any suggestions for how to help track this down would be appreciated.

What did you expect to see?

No crashes.

What did you see instead?

Example crash from amd64

runtime: mappedReady=109281560
runtime: totalMapped=127983896
runtime: released=18767872
runtime: totalMapped-released=109216024
fatal error: mappedReady and other memstats are not equal

runtime stack:
runtime.throw({0x14be1ed?, 0x7fae80a11f88?})
	/root/.gimme/versions/go1.21.4.linux.amd64/src/runtime/panic.go:1077 +0x5c fp=0x7faea7ff9b70 sp=0x7faea7ff9b40 pc=0x4844dc
runtime.readmemstats_m(0xc0017a3d88)
	/root/.gimme/versions/go1.21.4.linux.amd64/src/runtime/mstats.go:493 +0x585 fp=0x7faea7ffa718 sp=0x7faea7ff9b70 pc=0x47e7e5
runtime.ReadMemStats.func1()
	/root/.gimme/versions/go1.21.4.linux.amd64/src/runtime/mstats.go:364 +0x17 fp=0x7faea7ffa730 sp=0x7faea7ffa718 pc=0x47e237
traceback: unexpected SPWRITE function runtime.systemstack
runtime.systemstack()
	/root/.gimme/versions/go1.21.4.linux.amd64/src/runtime/asm_amd64.s:509 +0x4a fp=0x7faea7ffa740 sp=0x7faea7ffa730 pc=0x4b8d6a

goroutine 14991247 [running]:
runtime.systemstack_switch()
	/root/.gimme/versions/go1.21.4.linux.amd64/src/runtime/asm_amd64.s:474 +0x8 fp=0xc0017a3d50 sp=0xc0017a3d40 pc=0x4b8d08
runtime.ReadMemStats(0xc0017a3d88)
	/root/.gimme/versions/go1.21.4.linux.amd64/src/runtime/mstats.go:363 +0x49 fp=0xc0017a3d78 sp=0xc0017a3d50 pc=0x47e1e9
expvar.memstats()
	/root/.gimme/versions/go1.21.4.linux.amd64/src/expvar/expvar.go:365 +0x36 fp=0xc0017a5428 sp=0xc0017a3d78 pc=0x967d96
expvar.Func.String(0xc000078508?)
	/root/.gimme/versions/go1.21.4.linux.amd64/src/expvar/expvar.go:268 +0x16 fp=0xc0017a5450 sp=0xc0017a5428 pc=0x967516
fmt.(*pp).handleMethods(0xc003fee000, 0x14915a5?)
	/root/.gimme/versions/go1.21.4.linux.amd64/src/fmt/print.go:673 +0x2b2 fp=0xc0017a56a8 sp=0xc0017a5450 pc=0x5404f2
fmt.(*pp).printArg(0xc003fee000, {0x12b5b00?, 0x1556250}, 0x73)
	/root/.gimme/versions/go1.21.4.linux.amd64/src/fmt/print.go:756 +0x630 fp=0xc0017a5748 sp=0xc0017a56a8 pc=0x540f50
fmt.(*pp).doPrintf(0xc003fee000, {0x148e303, 0x6}, {0xc0017a5928?, 0x2, 0x2})
	/root/.gimme/versions/go1.21.4.linux.amd64/src/fmt/print.go:1077 +0x39e fp=0xc0017a5840 sp=0xc0017a5748 pc=0x543a7e
fmt.Fprintf({0x165c6e0, 0xc000ea9500}, {0x148e303, 0x6}, {0xc003f3d928, 0x2, 0x2})

... our code calling expvar.Do omitted ...

Example crash from arm64

runtime: mappedReady=323105112
runtime: totalMapped=373518680
runtime: released=50479104
runtime: totalMapped-released=323039576
fatal error: mappedReady and other memstats are not equal

runtime stack:
runtime.throw({0x1a24762?, 0x40c34?})
  /root/.gimme/versions/go1.21.3.linux.arm64/src/runtime/panic.go:1077 +0x40 fp=0xffffcdb4ebe0 sp=0xffffcdb4ebb0 pc=0x4cd10
runtime.readmemstats_m(0x400f193d38)
  /root/.gimme/versions/go1.21.3.linux.arm64/src/runtime/mstats.go:493 +0x530 fp=0xffffcdb4f790 sp=0xffffcdb4ebe0 pc=0x46f60
runtime.ReadMemStats.func1()
  /root/.gimme/versions/go1.21.3.linux.arm64/src/runtime/mstats.go:364 +0x20 fp=0xffffcdb4f7b0 sp=0xffffcdb4f790 pc=0x46a10
traceback: unexpected SPWRITE function runtime.systemstack
runtime.systemstack()
  /root/.gimme/versions/go1.21.3.linux.arm64/src/runtime/asm_arm64.s:243 +0x6c fp=0xffffcdb4f7c0 sp=0xffffcdb4f7b0 pc=0x81fec

goroutine 42133645 [running]:
runtime.systemstack_switch()
  /root/.gimme/versions/go1.21.3.linux.arm64/src/runtime/asm_arm64.s:200 +0x8 fp=0x400f193cf0 sp=0x400f193ce0 pc=0x81f68
runtime.ReadMemStats(0x400f193d38)
  /root/.gimme/versions/go1.21.3.linux.arm64/src/runtime/mstats.go:363 +0x4c fp=0x400f193d20 sp=0x400f193cf0 pc=0x469cc
expvar.memstats()
  /root/.gimme/versions/go1.21.3.linux.arm64/src/expvar/expvar.go:365 +0x4c fp=0x400f1953d0 sp=0x400f193d20 pc=0x665adc
expvar.Func.String(0x4008f9bce0?)
  /root/.gimme/versions/go1.21.3.linux.arm64/src/expvar/expvar.go:268 +0x24 fp=0x400f195400 sp=0x400f1953d0 pc=0x665294
fmt.(*pp).handleMethods(0x400710a000, 0x19ddd30?)
  /root/.gimme/versions/go1.21.3.linux.arm64/src/fmt/print.go:673 +0x21c fp=0x400f195660 sp=0x400f195400 pc=0x10ce8c
fmt.(*pp).printArg(0x400710a000, {0x16c0600?, 0x1ae4fc8}, 0x73)
  /root/.gimme/versions/go1.21.3.linux.arm64/src/fmt/print.go:756 +0x5fc fp=0x400f195700 sp=0x400f195660 pc=0x10d8bc
fmt.(*pp).doPrintf(0x400710a000, {0x19d9db8, 0x6}, {0x400f195908?, 0x2, 0x2})
  /root/.gimme/versions/go1.21.3.linux.arm64/src/fmt/print.go:1077 +0x2dc fp=0x400f195810 sp=0x400f195700 pc=0x1100bc
fmt.Fprintf({0x1ccc2c0, 0x400503c0e0}, {0x19d9db8, 0x6}, {0x4008afd908, 0x2, 0x2})
  /root/.gimme/versions/go1.21.3.linux.arm64/src/fmt/print.go:224 +0x54 fp=0x400f195870 sp=0x400f195810 pc=0x10a404

... our code calling expvar.Do omitted ...
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Nov 27, 2023
@mauri870 mauri870 added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 27, 2023
@mauri870
Copy link
Member

cc @golang/runtime

@mknyszek mknyszek self-assigned this Nov 27, 2023
@mknyszek mknyszek added this to the Backlog milestone Nov 27, 2023
@mknyszek
Copy link
Contributor

I added this crash because mappedReady is the source-of-truth for a whole bunch of things in the runtime (like the stats ingested to maintain the memory limit) and it's important to keep it aligned with outward facing stats. But in hindsight a crash is maybe too strong of a signal. It's also possible this is just due to a race. mappedReady should line up with everything else when the world is stopped, but I wonder if there's a case where it can get updated even with the world stopped.

The number being off by exactly 64 KiB in both cases is very suspicious. I'll poke around and see if I can find anything. If this seems like it's just the cause of a benign race or something I'll just send a CL to remove the crash by default.

@evanj
Copy link
Contributor Author

evanj commented Nov 27, 2023

Interesting observation! I looked at a few more crashes from the last week. They are all off by 64 KiB:

program_a go1.21.4.linux.amd64 mappedReady=109281560 totalMapped-released=109216024 diff=65536
program_b go1.21.3.linux.arm64 mappedReady=323105112 totalMapped-released=323039576 diff=65536
program_b go1.21.3.linux.amd64 mappedReady=88793368 totalMapped-released=88727832 diff=65536
program_c go1.21.4.linux.amd64 mappedReady=417425784 totalMapped-released=417360248 diff=65536
program_d go1.21.4.linux.amd64 mappedReady=373972328 totalMapped-released=373906792 diff=65536
program_e go1.21.1.linux.arm64 mappedReady=569349944 totalMapped-released=569284408 diff=65536

One other observation: We have another set of programs that use this same shared monitoring code that are still using Go 1.20. I have not yet found a program built with Go 1.20 in these crashes yet. This is absolutely no guarantee, but there is a chance this is new to Go 1.21 somehow.

@mknyszek
Copy link
Contributor

One other observation: We have another set of programs that use this same shared monitoring code that are still using Go 1.20. I have not yet found a program built with Go 1.20 in these crashes yet. This is absolutely no guarantee, but there is a chance this is new to Go 1.21 somehow.

Thanks, that's helpful!

@mknyszek
Copy link
Contributor

mknyszek commented Nov 27, 2023

A few notes:

  • Given that all of the checks made before the failing check passed, we can rule out a stray heap allocation in the runtime and a preemption point between the update to mappedReady and the rest of the stats.
  • We can rule out GC work bufs because they're allocated in 32 KiB increments. I would expect to see failures with that increment. (Not that I think there's really an opportunity to allocate those in the relevant locations AFAICT.) I suppose it could be two work bufs, but they're not usually allocated in pairs.
  • persistentalloc blocks are 256 KiB, so probably not that. Notably the threshold for allocating directly from the OS is 64 KiB, but I don't see anything that calls at exactly that threshold.
  • It can't be a stack growth because (1) there is no opportunity for a stack growth in ReadMemStats or for the STW to stop in the middle of a stack growth and (2) if that happened, then the only way mappedReady would be visibly different is if the allocated memory came from released memory, but the released check passed.

@mknyszek
Copy link
Contributor

The execution trace buffers are exactly 64 KiB in size, but observing the skew would require a preemption in the trace flush path, but the flush path always happens on the system stack, so that's not possible. Alternatively a trace event could be emitted by the thread that stopped the world between the reading of totalMapped and mappedReady.

I think that's technically not possible today, but that actually is now possible on tip. Bare threads that don't synchronize with a stop-the-world can freely emit events now... hm... That might be a good enough reason to put these checks behind a debug mode, but the idea that there's a miscalculation somewhere still bothers me.

Also, this point is entirely moot if you don't have any kind of automated execution trace collection in production. (Do you?)

@felixge
Copy link
Contributor

felixge commented Nov 27, 2023

Also, this point is entirely moot if you don't have any kind of automated execution trace collection in production. (Do you?)

We actually have automated execution tracing widely enabled in prod, and it's possible that this is what started this problem (I'll try to see if we have enough historic data to proof this).

Also: The go1.20 services mentioned by @evanj don't have execution tracing enabled, so this might be a promising direction to investigate further, even if it's not clear yet how this is possible.

@mknyszek
Copy link
Contributor

Digging a little deeper on the tracer theory, I don't think it can possibly be a preemption at a bad time. This is the entire window in which it could get preempted to cause skew, and that window definitely has zero preemption points (those method calls are inlined and intrsinified, and on top of that all this is always called on the system stack).

However, it's definitely possible for sysmon to generate an event while the world is stopped, for example via injectglist. In the old tracer the event will get added to the global trace buffer. Given all the evidence, I'm fairly certain this is the root cause and the skew is transient.

I'll send a CL to put these checks behind a double-check mode.

@felixge
Copy link
Contributor

felixge commented Nov 27, 2023

It seems like some of the historic data for this has moved outside its retention window 😞.

When the issue was first noticed internally, we actually suspected execution tracing, but ruled it out due to finding a crash that occurred 1 day prior to enabling execution tracing by default.

But given your comments above, and the fact that we don't see this issue for our go1.20 services that don't have execution tracing enabled, I suspect that we incorrectly ruled out execution tracing. Maybe the service had manually opted into this feature, or timezones, or *waves hand*.

So yeah, 👍 to putting these checks behind a double-check mode (is this the same as a debug mode?). Thank you so much for the quick help with this!

PS: If it's possible to back-port this to go1.21 that'd be awesome. Let me know I can help with it.

@mknyszek
Copy link
Contributor

@gopherbot Please open backport issues for Go 1.20 and Go 1.21.

I think it should just be backported. This can cause crashes in correct programs without a workaround, and the fix is very safe.

@gopherbot
Copy link
Contributor

Backport issue(s) opened: #64409 (for 1.20), #64410 (for 1.21).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/545277 mentions this issue: runtime: put ReadMemStats debug assertions behind a double-check mode

@mknyszek
Copy link
Contributor

@felixge Makes sense. Even if there is some other super-rare issue that we're not seeing, this is at least one real one and we should fix it. :)

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/545556 mentions this issue: [release-branch.go1.20] runtime: put ReadMemStats debug assertions behind a double-check mode

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/545557 mentions this issue: [release-branch.go1.21] runtime: put ReadMemStats debug assertions behind a double-check mode

gopherbot pushed a commit that referenced this issue Jan 4, 2024
…hind a double-check mode

ReadMemStats has a few assertions it makes about the consistency of the
stats it's about to produce. Specifically, how those stats line up with
runtime-internal stats. These checks are generally useful, but crashing
just because some stats are wrong is a heavy price to pay.

For a long time this wasn't a problem, but very recently it became a
real problem. It turns out that there's real benign skew that can happen
wherein sysmon (which doesn't synchronize with a STW) generates a trace
event when tracing is enabled, and may mutate some stats while
ReadMemStats is running its checks.

Fix this by synchronizing with both sysmon and the tracer. This is a bit
heavy-handed, but better that than false positives.

Also, put the checks behind a debug mode. We want to reduce the risk of
backporting this change, and again, it's not great to crash just because
user-facing stats are off. Still, enable this debug mode during the
runtime tests so we don't lose quite as much coverage from disabling
these checks by default.

For #64401.
Fixes #64409.

Change-Id: I9adb3e5c7161d207648d07373a11da8a5f0fda9a
Reviewed-on: https://go-review.googlesource.com/c/go/+/545277
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Felix Geisendörfer <felix.geisendoerfer@datadoghq.com>
(cherry picked from commit b2efd1d)
Reviewed-on: https://go-review.googlesource.com/c/go/+/545556
Auto-Submit: Matthew Dempsky <mdempsky@google.com>
TryBot-Bypass: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
gopherbot pushed a commit that referenced this issue Jan 4, 2024
…hind a double-check mode

ReadMemStats has a few assertions it makes about the consistency of the
stats it's about to produce. Specifically, how those stats line up with
runtime-internal stats. These checks are generally useful, but crashing
just because some stats are wrong is a heavy price to pay.

For a long time this wasn't a problem, but very recently it became a
real problem. It turns out that there's real benign skew that can happen
wherein sysmon (which doesn't synchronize with a STW) generates a trace
event when tracing is enabled, and may mutate some stats while
ReadMemStats is running its checks.

Fix this by synchronizing with both sysmon and the tracer. This is a bit
heavy-handed, but better that than false positives.

Also, put the checks behind a debug mode. We want to reduce the risk of
backporting this change, and again, it's not great to crash just because
user-facing stats are off. Still, enable this debug mode during the
runtime tests so we don't lose quite as much coverage from disabling
these checks by default.

For #64401.
Fixes #64410.

Change-Id: I9adb3e5c7161d207648d07373a11da8a5f0fda9a
Reviewed-on: https://go-review.googlesource.com/c/go/+/545277
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Felix Geisendörfer <felix.geisendoerfer@datadoghq.com>
(cherry picked from commit b2efd1d)
Reviewed-on: https://go-review.googlesource.com/c/go/+/545557
Auto-Submit: Matthew Dempsky <mdempsky@google.com>
TryBot-Bypass: Matthew Dempsky <mdempsky@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

5 participants