
runtime: memory corruption on Linux 5.2+ #35777

Closed
aclements opened this issue Nov 22, 2019 · 88 comments


@aclements aclements commented Nov 22, 2019

We've had several reports of memory corruption on Linux 5.3.x (or later) kernels from people running tip since asynchronous preemption was committed. This is a super-bug to track these issues. I suspect they all have one root cause.

Typically these are "runtime error: invalid memory address or nil pointer dereference" or "runtime: unexpected return pc" or "segmentation violation" panics. They can also appear as self-detected data corruption.

If you encounter a crash that could be random memory corruption, are running Linux 5.3.x or later, and are running a recent tip Go (after commit 62e53b7), please file a new issue and add a comment here. If you can reproduce it, please try setting "GODEBUG=asyncpreemptoff=1" in your environment and seeing if you can still reproduce it.

Duplicate issues (I'll edit this comment to keep this up-to-date):

runtime: corrupt binary export data seen after signal preemption CL (#35326): Corruption in file version header observed by vet. Medium reproducible. Strong leads.

cmd/compile: panic during early copyelim crash (#35658): Invalid memory address in cmd/compile/internal/ssa.copyelim. Not reproducible. Nothing obvious in stack trace. Haven't dug into assembly.

runtime: SIGSEGV in mapassign_fast64 during cmd/vet (#35689): Invalid memory address in runtime.mapassign_fast64 in vet. Stack trace includes random pointers. Some assembly decoding work.

runtime: unexpected return pc for runtime.(*mheap).alloc (#35328): Unexpected return pc. Stack trace includes random pointers. Not reproducible.

cmd/dist: I/O error: read src/xxx.go: is a directory (#35776): Random misbehavior. Not reproducible.

runtime: "fatal error: mSpanList.insertBack" in mallocgc (#35771): Bad mspan next pointer (random and unaligned). Not reproducible.

cmd/compile: invalid memory address or nil pointer dereference in gc.convlit1 (#35621): Invalid memory address in cmd/compile/internal/gc.convlit1. Evidence of memory corruption, though no obvious random pointers. Not reproducible.

cmd/go: unexpected signal during runtime execution (#35783): Corruption in file version header observed by vet. Not reproducible.

runtime: unexpected return pc for runtime.systemstack_switch (#35592): Unexpected return pc. Stack trace includes random pointers. Not reproducible.

cmd/compile: random compile error running tests (#35760): Compiler data corruption. Not reproducible.

@aclements aclements added this to the Go1.14 milestone Nov 22, 2019
@aclements aclements self-assigned this Nov 22, 2019
@aclements aclements pinned this issue Nov 22, 2019

@mvdan mvdan commented Nov 22, 2019

@aclements for your records, #35328 and #35776 might be related as well. Those two were on the same Linux 5.3.x machine of mine.


@aclements aclements commented Nov 22, 2019

Thanks @mvdan. I've folded those into the list above.


@zikaeroh zikaeroh commented Nov 22, 2019

#35621 from me. One time, no repro.


@myitcv myitcv commented Nov 22, 2019

@aclements just saw #35783 for the record.

If you think we have enough "evidence" please say and I'll stop creating issues for now 😄


@bradfitz bradfitz commented Nov 22, 2019

Have we roughly bisected which Linux versions are affected? Looking at the kernel changes in that region might yield a clue about where the bug is and whose it is.

5.3 = bad.
5.2 = ?


@zikaeroh zikaeroh commented Nov 22, 2019

In #35326 (comment), I used Arch's 4.19 LTS and could not reproduce the bexport corruption. However, the kernel configuration differs between 4.19 and 5.3, so that may be unscientific. (I'm letting my machine rebuild 5.3 without PREEMPT set to see if that's the problem, but I have doubts. EDIT: It was not PREEMPT, so setting up a builder with a newer kernel would likely be good regardless.)

What set of kernels do the current Linux builders use? That might provide a lower bound, as I've never seen the issue there.

(I'd bring up #9505 to advocate for an Arch builder, but that issue is more about everything but the kernel version. I feel like there should be some builder which is at the latest Linux kernel, whatever that may be.)


@bradfitz bradfitz commented Nov 22, 2019

The existing Go Linux builders use Container Optimized OS with a Linux kernel 4.19.72+.


@aclements aclements commented Nov 22, 2019

Thanks @myitcv, I think we have enough reports. If you do happen to find another one that's reproducible, that would be very helpful, though.


@dr2chase dr2chase commented Nov 25, 2019

To recap experiments last Friday (and I rechecked the test for the more mystifying of these Sunday afternoon), Cherry and I tried the following:

Doubled the size of the sigaltstack, just in case. Also sanity-checked the bounds within gdb; they were okay.

Modified the definition of fpstate to conform to what is defined in the linux header files.
Modified sigcontext to use the new Xstate:

fpstate *Xstate // *fpstate1

Wrote a method to allow us to store the ymm registers that were supplied (as registers) to the signal handler, then:

  1. tried an experiment in the assembly language handler to trash the YMM registers (not the data structures) before return. We never saw any sign of the trash but this seemed to raise the rate of the failures (running "go vet all"). The trashing string stored was "This_is_a_test. "

  2. tried printing the saved and current ymm registers in sigtrampgo.
    The saved ones looked like memmove artifacts (source code while running vet all), and the current ones were always zero. The memmove artifacts stayed unchanged, a lot, between signals.
    I rechecked the code that did this earlier today, just in case we got it wrong.

  3. made a copy of the saved xmm and ymm registers on sigtrampgo entry, then checked the copy against the saved registers, to see if our code ever somehow modified them. That never fired.

I spent some time Saturday looking for "interesting" comments in the Linux git log, I have some to review. What I am wondering is if there was some attempt to optimize saving of the ymm registers and that got fouled up. One thing I wonder a little about was what they are doing for power management with AVX use, I saw some mention of that.
(I.e., what triggers AVX use, can they "save power" if they don't touch the registers, if they believe AVX is not being used? Suppose they rely on some hardware bit that isn't set under exactly the expected conditions?)

type Xstate struct {
   Fpstate Fpstate
   Hdr Header
   Ymmh Ymmh_state
}

type Fpstate struct {
   Cwd uint16
   Swd uint16
   Twd uint16
   Fop uint16
   Rip uint64
   Rdp uint64
   Mxcsr uint32
   Mxcsr_mask uint32
   St_space [32]uint32
   Xmm_space [64]uint32
   Reserved2 [12]uint32
   Reserved3 [12]uint32
}

type Header struct {
   Xfeatures uint64
   Reserved1 [2]uint64
   Reserved2 [5]uint64
}

type Ymmh_state struct {
   Space [64]uint32
}
TEXT runtime·getymm(SB),NOSPLIT,$0
    MOVQ    0(FP), AX
    VMOVDQU Y0,0(AX)
    VMOVDQU Y1,(1*32)(AX)
    VMOVDQU Y2,(2*32)(AX)
    VMOVDQU Y3,(3*32)(AX)
    VMOVDQU Y4,(4*32)(AX)
    VMOVDQU Y5,(5*32)(AX)
    VMOVDQU Y6,(6*32)(AX)
    VMOVDQU Y7,(7*32)(AX)
    VMOVDQU Y8,(8*32)(AX)
    VMOVDQU Y9,(9*32)(AX)
    VMOVDQU Y10,(10*32)(AX)
    VMOVDQU Y11,(11*32)(AX)
    VMOVDQU Y12,(12*32)(AX)
    VMOVDQU Y13,(13*32)(AX)
    VMOVDQU Y14,(14*32)(AX)
    VMOVDQU Y15,(15*32)(AX)
    RET

@aclements aclements commented Nov 25, 2019

An update from over in #35326: I've bisected the issue to kernel commit torvalds/linux@d9c9ce3, which happened between v5.1 and v5.2. It also requires the kernel to be built with GCC 9 (GCC 8 does not reproduce the issue).


@dr2chase dr2chase commented Nov 26, 2019

Not sure where Austin's reporting this or if he had time today, but:

  • he has a C program demonstrating the bug in Linux 5.3 (built with gcc 9) for purposes of filing a bug soonish;
  • there is a workaround on the Go implementation side (be sure the signal stack is mapped);
  • I managed to create a failing Linux 5.3 where the entire kernel is compiled with gcc 8, except for arch/x86/kernel/fpu/signal.c.

@zikaeroh zikaeroh commented Nov 26, 2019

All of the progress updates have been going on #35326. (Most recently, #35326 (comment).)


@dvyukov dvyukov commented Nov 26, 2019

There is this commit that claims to fix something in the culprit commit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b81ff1013eb8eef2934ca7e8cf53d553c1029e84
I don't know if it will help or not, but @aclements, if you have a test setup ready, it may be worth cherry-picking and trying.


@cherrymui cherrymui commented Nov 26, 2019

I think that commit is already included in the 5.2 and 5.3 kernels, which still have the problem.


@aclements aclements commented Nov 26, 2019

Thanks @dvyukov. I just re-confirmed that I can still reproduce it in the same way on 5.3, which includes that commit. I'll double check that I can still reproduce right at that commit, just in case it was somehow re-introduced later.


@aclements aclements commented Nov 26, 2019

Reproduced at torvalds/linux@b81ff10, as well as v5.4, which was just released.

I've filed the upstream kernel bug here: https://bugzilla.kernel.org/show_bug.cgi?id=205663


@aclements aclements commented Dec 5, 2019

I've filed a tracking bug to remove the workaround for Go 1.15: #35979.


@aclements aclements commented Dec 5, 2019

To close with a summary:

Linux 5.2 introduced a bug when compiled with GCC 9 that could cause vector register corruption on amd64 on return from a signal handler where the top page of the signal stack had not yet been paged in. This can affect any program in any language (assuming it uses at least SSE registers), including versions of Go before 1.14, and generally results in arbitrary memory corruption. It became significantly more likely in Go 1.14 because the addition of signal-based non-cooperative preemption significantly increased the number of asynchronous signals received by Go processes. It's also somewhat more likely in Go than other languages because Go regularly creates new OS threads with alternate signal stacks that are likely not to be paged in.

The kernel bug was fixed in Linux 5.3.15 and 5.4.2, and the fix will appear in 5.5 and all future releases. 5.4 is a long-term support release, and 5.4.2 was released with the fix just 10 days after 5.4.

For Go 1.14, we introduced a workaround that mlocks the top page of each signal stack on the affected kernel versions to ensure it is paged in and remains paged in.

Thanks everyone for helping track this down!

I'll keep this issue pinned until next week for anyone running a tip from before the workaround.


@lmb lmb commented Dec 5, 2019

Do you have an idea how much memory is going to be mlocked? Distros have different values for RLIMIT_MEMLOCK, some of them are pretty low.


@mdempsky mdempsky commented Dec 5, 2019

Looks like the workaround CL only applies to linux/amd64. Shouldn't it apply to linux/386 too?


@bradfitz bradfitz commented Dec 5, 2019

Looks like the workaround CL only applies to linux/amd64. Shouldn't it apply to linux/386 too?

Elsewhere @aclements said he's been unable to reproduce the problem on 386.


@dr2chase dr2chase commented Dec 5, 2019

@lmb How low is "pretty low"? The expected number of pages locked is O(threads) (not goroutines), since it is one page per thread's signal stack. Unless you have a lot of goroutines tied to threads, it ought to be about GOMAXPROCS pages, plus a few for bad luck.

And this is also tied to just a few versions of Linux that we hope nobody will be using a year from now.


@gopherbot gopherbot commented Dec 5, 2019

Change https://golang.org/cl/210098 mentions this issue: runtime: give useful failure message on mlock failure


@mdempsky mdempsky commented Dec 5, 2019

Elsewhere @aclements said he's been unable to reproduce the problem on 386.

Hm, where was that? I looked through this issue and #35326, and didn't notice any comments to that effect.

@aclements did mention that it also affects XMM registers, which are available on 386. The Linux kernel fix looks generic to all of x86, not amd64-specific.

I'm willing to believe it doesn't affect 386 executables, but then I'm curious why not.


@zikaeroh zikaeroh commented Dec 5, 2019

@mdempsky In the comments on CL 209899.


@mdempsky mdempsky commented Dec 5, 2019

@zikaeroh Hiding in plain sight. Thanks.


@aclements aclements commented Dec 5, 2019

@mdempsky: https://go-review.googlesource.com/c/go/+/209899/3/src/runtime/os_linux_386.go#7 (it's a little buried)

It may just be harder to reproduce. But we do use the X registers in memmove on 386, so I would still have expected to see it.


@mdempsky mdempsky commented Dec 6, 2019

@aclements Thanks. Do you mind elaborating how you tested 386? Like the C reproducer exhibits the issue when built with -m64 but not with -m32, when all else is the same (e.g., exact same kernel)?


@lmb lmb commented Dec 6, 2019

@dr2chase I did an unrepresentative survey amongst colleagues. Debian (and Ubuntu) allows 64 MiB of locked memory by default; Arch Linux only has 64 KiB.


@bradfitz bradfitz commented Dec 6, 2019

@lmb, thanks for the info. At least Arch users will now get a failure message when mlock fails, telling them to update their kernel to a fixed version (at which point the mlock of stack tops won't be used).


@mvdan mvdan commented Dec 6, 2019

Speaking of Arch, 5.4.2 just landed on their mirrors.

gopherbot pushed a commit that referenced this issue Dec 6, 2019
Currently, we're ignoring failures to mlock signal stacks in the
workaround for #35777. This means if your mlock limit is low, you'll
instead get random memory corruption, which seems like the wrong
trade-off.

This CL checks for mlock failures and panics with useful guidance.

Updates #35777.

Change-Id: I15f02d3a1fceade79f6ca717500ca5b86d5bd570
Reviewed-on: https://go-review.googlesource.com/c/go/+/210098
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>

@aclements aclements commented Dec 6, 2019

Do you mind elaborating how you tested 386?

I ran the go vet stress test with a toolchain built with GOHOSTARCH=386 GOARCH=386.

However, I just ran my C reproducer, changed it to use XMM instead of YMM, compiled it with gcc -m32 -msse4.2 -pthread testxmm.c, and it failed. So I guess 386 has this problem, too. :(

@mdempsky mdempsky reopened this Dec 6, 2019

@mdempsky mdempsky commented Dec 6, 2019

Reopening for 386 fix.


@gopherbot gopherbot commented Dec 6, 2019

Change https://golang.org/cl/210299 mentions this issue: runtime: suggest 5.3.15 kernel upgrade for mlock failure


@mpx mpx commented Dec 6, 2019

FYI, max locked memory is 64 KiB on Fedora by default (all.bash currently fails). It looks like a 5.3.15 update is in the pipeline, so this failure should be temporary.


@eliasnaur eliasnaur commented Dec 8, 2019

I'm also on Fedora, getting

$ GODEBUG=asyncpreemptoff=1 ./make.bash 
Building Go cmd/dist using /home/elias/dev/go-1.7. (go1.7.1 linux/amd64)
Building Go toolchain1 using /home/elias/dev/go-1.7.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
Building Go toolchain3 using go_bootstrap and Go toolchain2.
Building packages and commands for linux/amd64.
runtime: mlock of signal stack failed: 12
runtime: increase the mlock limit (ulimit -l) or
runtime: update your kernel to 5.4.2 or later
fatal error: mlock failed

even with GODEBUG=asyncpreemptoff=1. How can I proceed while waiting for kernel 5.3.15? ulimit -l 4096 doesn't seem to make a difference (ulimit -l still reports 64).


@bradfitz bradfitz commented Dec 8, 2019

@eliasnaur, modify/add the memlock value in /etc/security/limits.conf?


@siebenmann siebenmann commented Dec 8, 2019

The cleanest official Fedora way is to create a new 95-memlock.conf file in /etc/security/limits.d/ that has the contents:

*  -   memlock   131072

Then, unfortunately, you need to log out and back in again (or ssh to your own machine) to get the new limits applied to your login session. Replace 131072 with another number if you want a different limit than 128 MiB; I aimed high because my Fedora machines are single-user machines with only me on them.


@gopherbot gopherbot commented Dec 9, 2019

Change https://golang.org/cl/210345 mentions this issue: runtime: mlock top of signal stack on both amd64 and 386

@gopherbot gopherbot closed this in 1c8d1f4 Dec 9, 2019
gopherbot pushed a commit that referenced this issue Dec 9, 2019
Some Linux distributions will continue to provide 5.3.x kernels for a
while rather than 5.4.x.

Updates #35777

Change-Id: I493ef8338d94475f4fb1402ffb9040152832b0fd
Reviewed-on: https://go-review.googlesource.com/c/go/+/210299
Reviewed-by: Austin Clements <austin@google.com>

@mpx mpx commented Dec 10, 2019

The 5.3.15 kernel update has been released for Fedora 30/31. all.bash builds again.


@ajstarks ajstarks commented Dec 10, 2019

I can confirm that the C reproducer program runs correctly on Fedora with the 5.3.15 kernel:

$ gcc -pthread test.c
$ time ./a.out

real	1m0.009s
user	0m0.001s
sys	0m0.004s
$ uname -r
5.3.15-300.fc31.x86_64