
syscall: memory corruption on OpenBSD/amd64 and NetBSD/amd64,386 when forking #34988

Open

jrick opened this issue Oct 18, 2019 · 42 comments
Labels: NeedsInvestigation, OS-OpenBSD

Comments

@jrick (Contributor) commented Oct 18, 2019

What version of Go are you using (go version)?

$ go version
go version go1.13.2 openbsd/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/jrick/.cache/go-build"
GOENV="/home/jrick/.config/go/env"
GOEXE=""
GOFLAGS="-tags=netgo -ldflags=-extldflags=-static"
GOHOSTARCH="amd64"
GOHOSTOS="openbsd"
GONOPROXY=""
GONOSUMDB=""
GOOS="openbsd"
GOPATH="/home/jrick/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/jrick/src/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/jrick/src/go/pkg/tool/openbsd_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0"

What did you do?

I observed these issues in one of my applications, and assumed it was a race or invalid unsafe.Pointer usage or some other fault of the application code. When the 1.13.2 release dropped yesterday I built it from source and observed a similar issue running the regression tests. The failed regression test does not look related to the memory corruption, but I can reproduce the problem by repeatedly running the test in a loop:

$ cd test # from go repo root
$ while :; do go run run.go -- fixedbugs/issue27829.go || break; done >go.panic 2>&1

It can take several minutes to observe the issue but here are some of the captured panics and fatal runtime errors:

https://gist.githubusercontent.com/jrick/f8b21ecbfbe516e1282b757d1bfe4165/raw/6cf0efb9ba47ba869f98817ce945971f2dff47d6/gistfile1.txt

https://gist.githubusercontent.com/jrick/9a54c085b918aa32910f4ece84e5aa21/raw/91ec29275c2eb1be49f62ad8a01a5317ad168c94/gistfile1.txt

https://gist.githubusercontent.com/jrick/8faf088593331c104cc0da0adb3f24da/raw/7c92e7e7d60d426b2156fd1bdff42e0717b708f1/gistfile1.txt

https://gist.githubusercontent.com/jrick/4645316444c12cd815fb71874f6bdfc4/raw/bffac2a448b07242a538b77a2823c9db34b6ef6f/gistfile1.txt

https://gist.githubusercontent.com/jrick/3843b180670811069319e4122d32507a/raw/0d1f897aa25d91307b04ae951f1b260f33246b61/gistfile1.txt

https://gist.githubusercontent.com/jrick/99b7171c5a49b4b069edf06884ad8e17/raw/740c7b9e8fa64d9ad149fd2669df94e89c466927/gistfile1.txt

Additionally, I observed go run hanging (no runtime failure due to deadlock) and it had to be killed with SIGABRT to get a trace: https://gist.githubusercontent.com/jrick/d4ae1e4355a7ac42f1910b7bb10a1297/raw/54e408c51a01444abda76dc32ac55c2dd217822b/gistfile1.txt

It may not matter which regression test is run as the errors also occur in run.go.

@jrick (Contributor, Author) commented Oct 18, 2019

I missed that 1.13.3 was also released yesterday. Currently updating to that and will report whether this is still an issue.

@randall77 (Contributor) commented Oct 18, 2019

This looks like cmd/go crashing while building the test, not the test itself.
The errors look heap related. @mknyszek

@mknyszek (Contributor) commented Oct 18, 2019

@jrick maybe you meant this in your original post, but I just want to be clear. Does this reproduce with Go 1.12.X or older versions of Go?

Since we have a reasonable reproducer, the next step to me would be to just bisect what went into Go 1.13, if we know it isn't reproducing in Go 1.12. I genuinely have no idea what this could be. I thought at first that it could be scavenging related but that's highly unlikely for a number of reasons. I won't rule it out yet, though.
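
A rough sketch of how such a bisect might be driven (not from this thread; it assumes a Go checkout and loops the flaky reproducer at each step, since a single clean run proves nothing):

$ cd go && git bisect start go1.13 go1.12
$ git bisect run sh -c '
    cd src && ./make.bash || exit 125    # 125 = skip commits that fail to build
    cd ../test
    i=0
    while [ $i -lt 50 ]; do
        ../bin/go run run.go -- fixedbugs/issue27829.go || exit 1   # any crash = bad
        i=$((i+1))
    done'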

@jrick (Contributor, Author) commented Oct 18, 2019

I haven't tested 1.12.x but will follow up testing that next. Currently hammering this test with 1.13.3 and so far it has not failed, but my application built with 1.13.3 still fails with SIGBUS (could be unrelated).

@jrick (Contributor, Author) commented Oct 18, 2019

@mknyszek it still hasn't failed on 1.13.3 (running close to an hour now) but quickly failed on 1.12.12.

https://gist.githubusercontent.com/jrick/bb5a493e6ebd88e1e846f1c5c09c9e9a/raw/e82b0136b0826581f6e591915d3a634112f323a1/gistfile1.txt

@dmitshur added the NeedsInvestigation and OS-OpenBSD labels Oct 21, 2019
@jrick (Contributor, Author) commented Dec 6, 2019

This remains a problem in 1.13.5, so it's not addressed by the recent fixes to the go tool.

https://gist.githubusercontent.com/jrick/a2499b2ae10b4c63359174e26c0fd936/raw/b233f14a518ca828c4416d803f81b1e8ca34d073/gistfile1.txt

@jrick (Contributor, Author) commented May 20, 2020

This may be fork/exec related. This program exhibits similar crashes on OpenBSD 6.7 and Go 1.14.3.

package main

import (
        "os/exec"
)

func main() {
        sem := make(chan struct{}, 100)
        for {
                sem <- struct{}{}
                go func() {
                        err := exec.Command("/usr/bin/true").Run()
                        if err != nil {
                                panic(err)
                        }
                        <-sem
                }()
        }
}

crash trace: https://gist.github.com/jrick/8d6ef72796a772668b891310a18dd805

Synchronizing the os/exec call with an additional mutex appears to remove the crash.
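
A minimal sketch of that serialization (an illustration assuming a package-level mutex, not the original application code):

package main

import (
	"os/exec"
	"sync"
)

var execMu sync.Mutex

// run serializes fork/exec across all goroutines, which appears to
// avoid the crash.
func run() error {
	execMu.Lock()
	defer execMu.Unlock()
	return exec.Command("/usr/bin/true").Run()
}

func main() {
	sem := make(chan struct{}, 100)
	for {
		sem <- struct{}{}
		go func() {
			if err := run(); err != nil {
				panic(err)
			}
			<-sem
		}()
	}
}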

@ianlancetaylor changed the title from "Memory corruption on OpenBSD/amd64" to "syscall: memory corruption on OpenBSD/amd64 when forking" May 20, 2020
@ianlancetaylor (Contributor) commented May 20, 2020

Thanks for the stack trace. That looks very much like a forked child process is changing the memory seen by the parent process. Which should of course be impossible. Specifically it seems that sched.lock.key is being set to zero while the lock is held during goschedImpl.

@jrick (Contributor, Author) commented May 22, 2020

I'm seeing another strange thing in addition to that crash. Sometimes the program runs forever, spinning CPU, but appears to be deadlocked, because none of the pids of those true processes ever change. Here's the trace after sending SIGQUIT: https://gist.github.com/jrick/74aaa63624961145b7bc7b9518da75e1

@jrick (Contributor, Author) commented Sep 14, 2020

@bcmills (Member) commented May 14, 2021

https://build.golang.org/log/3f45171bc52a0a86435abb9f795c0e8a45c4a0b0 looks similar:

haserrors/haserrors.go:3:18: undeclared name: undefined
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x4d578d58 pc=0x804a257]

runtime stack:
runtime/internal/atomic.Xadd64(0x83de928, 0x20360)
	/tmp/workdir/go/src/runtime/internal/atomic/atomic_386.s:145 +0x27

…

goroutine 4371 [runnable]:
syscall.syscall(0x80b4b40, 0x12, 0x0, 0x0)
	/tmp/workdir/go/src/runtime/sys_openbsd3.go:22 +0x20
syscall.Close(0x12)
	/tmp/workdir/go/src/syscall/zsyscall_openbsd_386.go:513 +0x39
syscall.forkExec({0xa630a408, 0x16}, {0xa6158a80, 0xe, 0xe}, 0x6bc0cbc0)
	/tmp/workdir/go/src/syscall/exec_unix.go:227 +0x4cc
syscall.StartProcess(...)
	/tmp/workdir/go/src/syscall/exec_unix.go:264
os.startProcess({0xa630a408, 0x16}, {0xa6158a80, 0xe, 0xe}, 0x6bc0cc84)
	/tmp/workdir/go/src/os/exec_posix.go:55 +0x256
os.StartProcess({0xa630a408, 0x16}, {0xa6158a80, 0xe, 0xe}, 0x6bc0cc84)
	/tmp/workdir/go/src/os/exec.go:106 +0x57
os/exec.(*Cmd).Start(0xa3537b80)
	/tmp/workdir/go/src/os/exec/exec.go:422 +0x588
os/exec.(*Cmd).Run(0xa3537b80)
	/tmp/workdir/go/src/os/exec/exec.go:338 +0x1b
golang.org/x/tools/go/internal/cgo.Run(0xa26a8b40, {0xa2764090, 0x17}, {0xa60778e0, 0x20}, 0x0)
	/tmp/workdir/gopath/src/golang.org/x/tools/go/internal/cgo/cgo.go:172 +0xc74
golang.org/x/tools/go/internal/cgo.ProcessFiles(0xa26a8b40, 0x837eecc0, 0x0, 0x0)
	/tmp/workdir/gopath/src/golang.org/x/tools/go/internal/cgo/cgo.go:85 +0x1a1
golang.org/x/tools/go/loader.(*Config).parsePackageFiles(0x9d428420, 0xa26a8b40, 0x67)
	/tmp/workdir/gopath/src/golang.org/x/tools/go/loader/loader.go:758 +0x232
golang.org/x/tools/go/loader.(*importer).load(0x837ed770, 0xa26a8b40)
	/tmp/workdir/gopath/src/golang.org/x/tools/go/loader/loader.go:976 +0x68
golang.org/x/tools/go/loader.(*importer).startLoad.func1(0x837ed770, 0xa26a8b40, 0xa2afe0e0)
	/tmp/workdir/gopath/src/golang.org/x/tools/go/loader/loader.go:962 +0x23
created by golang.org/x/tools/go/loader.(*importer).startLoad
	/tmp/workdir/gopath/src/golang.org/x/tools/go/loader/loader.go:961 +0x174

@bcmills (Member) commented Dec 1, 2021

https://storage.googleapis.com/go-build-log/abee19ae/openbsd-amd64-68_0f13ec3d.log (a TryBot) looks like it could plausibly be from a fork syscall.

@jrick (Contributor, Author) commented Dec 1, 2021

I'm not sure when this changed but since returning to this issue I haven't been able to reproduce with my minimal test case again on the same hardware with OpenBSD 7.0-current and Go 1.17.3.

I suspect it's due to some OpenBSD fix if the 6.8 builders are still hitting this.

(Also, 6.8 is no longer a supported OpenBSD version; I don't think it makes much sense to continue testing with it.)

@jrick (Contributor, Author) commented Dec 1, 2021

spoke too soon:

package main

import "os/exec"

func main() {
	loop := func() {
		for {
			err := exec.Command("/usr/bin/true").Run()
			if err != nil {
				panic(err)
			}
		}
	}
	for i := 0; i < 100; i++ {
		go loop()
	}
	select {}
}

https://gist.githubusercontent.com/jrick/a071767cde2d2d71b210135cf8282b04/raw/6fcd814e5a93a6a1d204c2d00b0a1f4195664d61/gistfile1.txt

@jrick (Contributor, Author) commented Dec 2, 2021

It took far longer than with 1.17.3, but a very similar crash (in scanstack) still occurs with

$ gotip version
go version devel go1.18-931d80ec1 Tue Nov 30 18:09:02 2021 +0000 openbsd/amd64

https://gist.githubusercontent.com/jrick/a13403d1a934f2cc5fedf7c2e2d50546/raw/7389fee0a5a35f40122a847206b6dbd7304b0fa0/gistfile1.txt

@prattmic (Member) commented Dec 2, 2021

I can also reproduce crashes on netbsd-386 and netbsd-amd64 with #34988 (comment) on AMD, of the form:

buildlet-netbsd-386-9-0-n2d-rncb33943# ./loop 
fatal error: runtime·unlock: lock count   
<hang>

as well as #49453

@prattmic (Member) commented Dec 3, 2021

Some observations I've made (from netbsd-amd64):

The crashes still seem to occur with GOMAXPROCS=1; however, Go still has some background threads in this case. Disabling sysmon and GC makes this program truly single-threaded:

diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index a238ea77f3..ee18169920 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -170,10 +170,10 @@ func main() {
                // For runtime_syscall_doAllThreadsSyscall, we
                // register sysmon is not ready for the world to be
                // stopped.
-               atomic.Store(&sched.sysmonStarting, 1)
-               systemstack(func() {
-                       newm(sysmon, nil, -1)
-               })
+               //atomic.Store(&sched.sysmonStarting, 1)
+               //systemstack(func() {
+               //      newm(sysmon, nil, -1)
+               //})
        }
 
        // Lock the main goroutine onto this, the main OS thread,
@@ -211,7 +211,7 @@ func main() {
                }
        }()
 
-       gcenable()
+       //gcenable()
 
        main_init_done = make(chan bool)
        if iscgo {

Once the program is truly single-threaded, the crashes disappear. Setting GOMAXPROCS=2 with this patch brings the crashes back.

Here is a slightly simplified reproducer:

package main

import (
        "os/exec"
        "runtime"
)

func main() {
        go func() {
                for {
                        err := exec.Command("/usr/bin/true").Run()
                        if err != nil {
                                panic(err)
                        }
                }
        }()

        for {
                runtime.Gosched()
        }
}

This version has only a single forker, but crashes about as quickly. The Gosched is required: neither an empty loop nor a loop checking a package-global atomic is sufficient to crash. (N.B. the original reproducer above was also occasionally doing an effective Gosched to context switch between the 100 forker goroutines.)

(cc @aclements @mknyszek)

@prattmic (Member) commented Dec 3, 2021

More observations:

  • Crashes still occur if the child exits almost immediately after fork, here.
  • Crashes do not occur if RawSyscall(FORK) is replaced with simply returning an error.

I've simplified that repro even further:

loop.go:

package main

import (
        //"runtime"
        "syscall"
)

func fork() int32

func main() {
        go func() {
                for {
                        pid := fork()
                        syscall.Syscall6(syscall.SYS_WAIT4, uintptr(pid), 0, 0, 0, 0, 0)
                        //syscall.RawSyscall6(syscall.SYS_WAIT4, uintptr(pid), 0, 0, 0, 0, 0)
                }
        }()

        for {
                syscall.Syscall(syscall.SYS_GETPID, 0, 0, 0)
                //runtime.Gosched()
        }
}

fork_netbsd_amd64.s:

#include "textflag.h"

#define SYS_EXIT        1
#define SYS_FORK        2

// func fork() int32
TEXT ·fork(SB),NOSPLIT,$0-4
        MOVQ    $SYS_FORK, AX
        SYSCALL

        CMPQ    AX, $0
        JNE     parent

        // Child.
        MOVQ    $0, DI
        MOVQ    $SYS_EXIT, AX
        SYSCALL
        HLT

parent:
        MOVL    AX, ret+0(FP)
        RET

The key parts here:

  • We are now making a direct fork system call without any of the extra runtime behavior inside os/exec. Barring some (AFAICT undocumented) requirement on how to use fork(), there is simply no way this assembly function should be able to cause corruption in the parent. So I think this has to be an OS bug.
  • We don't need runtime.Gosched() anymore; however, switching the GETPID back to Gosched does trigger crashes faster.
  • Note that syscall.Syscall goes through runtime.entersyscall / runtime.exitsyscall, so there is some level of runtime interaction still, though I've verified we don't go through the slow path into the full scheduler like Gosched does.
  • Both goroutines must use syscall.Syscall. Switching either side to syscall.RawSyscall (which avoids runtime interaction) makes the crashes go away.
  • My best guess at the interesting pattern here is that both threads are fiddling around with TLS variables (the g).

The crashes I get with this look like (source):

entersyscall inconsistent 0xc00003a778 [0xc00003a000,0xc00003a800]                                                                           
fatal error: entersyscall 

This is complaining that the check 0xc00003a000 < 0xc00003a778 < 0xc00003a800 failed, even though the values it prints plainly satisfy it.
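
For context, the failing check lives in runtime.reentersyscall and looks approximately like this (quoted from memory of the Go runtime of this era; exact shape varies by version):

_g_.syscallsp = sp
_g_.syscallpc = pc
casgstatus(_g_, _Grunning, _Gsyscall)
if _g_.syscallsp < _g_.stack.lo || _g_.stack.hi < _g_.syscallsp {
	systemstack(func() {
		print("entersyscall inconsistent ", hex(_g_.syscallsp),
			" [", hex(_g_.stack.lo), ",", hex(_g_.stack.hi), "]\n")
		throw("entersyscall")
	})
}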

The one case I've caught in GDB looks like (stopped just inside the failing branch):

   0x0000000000435c77 <+55>:      callq  0x435be0 <runtime.save>
   0x0000000000435c7c <+60>:      mov    0x60(%rsp),%rcx
   0x0000000000435c81 <+65>:      mov    0x20(%rsp),%rax
   0x0000000000435c86 <+70>:      mov    %rcx,0x70(%rax)
   0x0000000000435c8a <+74>:      mov    0x58(%rsp),%rdx
   0x0000000000435c8f <+79>:      mov    %rdx,0x78(%rax)
   0x0000000000435c93 <+83>:      mov    $0x2,%ebx
   0x0000000000435c98 <+88>:      mov    $0x3,%ecx
   0x0000000000435c9d <+93>:      nopl   (%rax)   
   0x0000000000435ca0 <+96>:      callq  0x4306e0 <runtime.casgstatus>                                                                       
   0x0000000000435ca5 <+101>:     mov    0x20(%rsp),%rcx
   0x0000000000435caa <+106>:     mov    (%rcx),%rdx
   0x0000000000435cad <+109>:     mov    %rdx,0x10(%rsp)
   0x0000000000435cb2 <+114>:     mov    0x8(%rcx),%rsi
   0x0000000000435cb6 <+118>:     mov    %rsi,0x18(%rsp)
   0x0000000000435cbb <+123>:     mov    0x70(%rcx),%rdi
   0x0000000000435cbf <+127>:     nop
   0x0000000000435cc0 <+128>:     cmp    %rdx,%rdi
   0x0000000000435cc3 <+131>:     jb     0x435cca <runtime.reentersyscall+138>
   0x0000000000435cc5 <+133>:     cmp    %rsi,%rdi
   0x0000000000435cc8 <+136>:     jbe    0x435d2a <runtime.reentersyscall+234>
=> 0x0000000000435cca <+138>:     mov    $0x6,%eax   
(gdb) i r
rax            0x2                 2
rbx            0x2                 2
rcx            0xc000002820        824633731104
rdx            0xc00003a000        824633958400
rsi            0xc00003a800        824633960448
rdi            0x0                 0
rbp            0xc00003a748        0xc00003a748
rsp            0xc00003a700        0xc00003a700
r8             0x1                 1
r9             0x0                 0
r10            0x0                 0
r11            0x212               530
r12            0xc000028a00        824633887232
r13            0x0                 0
r14            0xc000002820        824633731104
r15            0x7f7fd135111a      140186947490074
rip            0x435cca            0x435cca <runtime.reentersyscall+138>
eflags         0x287               [ CF PF SF IF ]
cs             0x47                71
ss             0x3f                63
ds             0x23                35
es             0x23                35
fs             0x0                 0
gs             0x0                 0
fs_base        <unavailable>
gs_base        <unavailable>

From the assembly, _g_.stack.lo and _g_.stack.hi should be rdx and rsi, which look OK. _g_.syscallsp should be rdi, which is 0. This value was recently loaded from rcx + 0x70, which looks fine:

(gdb) x/xg $rcx + 0x70
0xc000002890:     0x000000c00003a778

Of course, I can't really tell if that memory location read as zero, or if the register was cleared after the load somehow.

@prattmic changed the title from "syscall: memory corruption on OpenBSD/amd64 when forking" to "syscall: memory corruption on OpenBSD/amd64 and NetBSD/amd64,386 when forking" Dec 3, 2021
@prattmic (Member) commented Dec 3, 2021

I've filed a bug with NetBSD to get some more help looking at this. Still awaiting the problem ID for my bug report.

@jrick (Contributor, Author) commented Dec 3, 2021

Hopefully we're chasing the same bug as on OpenBSD/amd64. I've been running the single-forker reproducer from #34988 (comment) for several hours now and it has not crashed yet (though my own reproducers were taking just as long).

I haven't yet tested OpenBSD/386 at all.

@riastradh commented Dec 4, 2021

@coypoop and I have been trying to investigate this too in NetBSD/amd64. We've been using the exec.Command(...).Run reproducer along with a similarly narrowed one that just does fork and wait4 in the parent:

package main

import "syscall"

//go:nosplit
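// (nosplit presumably keeps the child from hitting a stack-split check,
// and thus the runtime, between fork and exit.)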
func test() {
	pid, _, err1 := syscall.RawSyscall(syscall.SYS_FORK, 0, 0, 0)
	if err1 != 0 { // parent error
		panic(err1)
	}
	if pid == 0 { // child
		syscall.RawSyscall(syscall.SYS_EXIT, 0, 0, 0)
	}

	_, err := syscall.Wait4(int(pid), nil, 0, nil)
	if err != nil {
		panic(err)
	}
}

func loop() {
	for {
		test()
	}
}

func main() {
	for i := 0; i < 100; i++ {
		go loop()
	}
	select {}
}

With both reproducers, the crashes are very quick, under a minute, short enough we haven't bothered to measure.

We tried writing a C program that does the same thing -- create 100 threads, fork and wait4, verify some invariants on the stack and in thread-local storage with __thread variables, which go through fs -- to no avail.

We tried setting GOGC=off, in case it was some activity in the garbage collector that somehow triggered the bug, but the same crashes persisted.

We tried removing the nontemporal stores in memclr and memmove, just using REP STOS and REP MOVS, in case Intel and AMD have any different semantics/bugs about nontemporal stores and store fences (which are used in Go only in memclr and memmove as far as I can tell), but the same crashes persisted. (We made sure to use STOSQ/MOVSQ when aligned and STOSB/MOVSB only when unaligned, and to do forward/backward copies as appropriate for overlapping regions in memmove; the tests passed with our changes.)

Prompted by https://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/1a923442efb8ded428689c89162631d739cf0548 we hypothesized that perhaps the fs segment is not always being set correctly on return to the Go userland program on AMD CPUs after syscalls and/or context switches. We tried changing the way the fsbase and fs segment selector are set in the NetBSD kernel (avoiding null segment selectors), but the crashes persisted. We tried instrumenting Go to check for fs corruption -- verifying it against gs (by querying the OS for _lwp_self() in entersyscall and entersyscallblock, storing it in g, and verifying it in exitsyscall) and against sp (by pushing leaq %fs:0 before every syscall instruction, and then popping and comparing after syscall) -- but none of these diagnostics bore any fruit; the same crashes seemed to persist.

We dug into entersyscall and found that _g_.syscallsp transitions from a sensible value (lying within the lo and hi bounds) to null across the call to casgstatus in entersyscall, and then back to a sensible value, which explains why the error message is nonsense. However, _g_ itself does not change, and we couldn't figure out why _g_.syscallsp (or whatever register it was cached in) temporarily became null in the check _g_.syscallsp < _g_.stack.lo || _g_.stack.hi < _g_.syscallsp even though it was previously sensible and then somehow became sensible again to print the error message.

@jrick (Contributor, Author) commented Dec 4, 2021

@riastradh I can confirm this test panics frequently in the child process on OpenBSD. It leaves an increasing number of zombie processes around as well.

Edit: misread the ps output; the lingering processes are sleeping, not zombies.

@bokunodev's comment has been minimized.

@prattmic (Member) commented Dec 6, 2021

FWIW, the NetBSD issue I filed is at http://gnats.netbsd.org/56535.

@bokunodev the issue you are describing is not related to this bug. You perhaps accidentally replied to this bug, or need to file a new issue.

@dr2chase (Contributor) commented Dec 14, 2021

After a week of noodling with timing and random delay, I managed to arrive at this Go reproducer, where the failing code is simpler (it does not involve g corruption). This is not clean; it's less than 90 minutes old. The observed failure is that the forking parent goroutine does a store to p.b, another parent goroutine does an atomic store to p.C, and the child (in the failing case) may or may not have exited (haven't explored that yet). The forking parent then observes the old value in the location that it stored, spins for a millisecond, and observes the new value in that location.

package main

import (
	"fmt"
	"math/rand"
	_ "runtime"
	"sync/atomic"
	"syscall"
	"time"
)

func fork() int32

type thing struct {
	a uint64
	b uint64
	C uint32
}

var p *thing

const count = 2000000 // total number of fork-wait tests
const nDelays = 200   // number of delays to try
const dots = 10       // number of dots per delay (after delay itself)
const scale = 1000    // delay scale.  1=nS, 1000=µS, 1000000=mS, delay up to 200 of these.

var delays [nDelays]uint64

var childDelay = 1 // accessed from fork.s.  Constant loop-running delay in child before exit.

func main() {
	var done uint64

	// As yet, have not been able to reproduce failure in our own data structures, something is missing, but keep trying.
	p = &thing{a: 1, b: 1}
	println("Addresses of p.a and p.b are", &p.a, &p.b)

	for i := range delays {
		if i&1 == 1 {
			delays[i] = uint64(i)
		}
	}
	rand.Seed(time.Now().UnixNano())
	rand.Shuffle(len(delays), func(i, j int) { delays[i], delays[j] = delays[j], delays[i] })
	print("Scale=", scale, ": ")
	go func() {

		for i := 0; i < nDelays; i++ {
			delay := delays[i]
			for j := 0; j < dots; j++ {
				if j == 0 {
					print(delay)
				} else {
					print(".")
				}
				for k := 0; k < count/dots/nDelays; k++ {
					p.a = 101
					p.b = 102
					pid := fork()
					p.b = 2
					DelayN(scale * delay)
					pb := p.b
					if pb != 2 {
						DelayN(1000000)
						print(" pb,p.b == ",pb,", ",p.b, " ")
						panic("AHA!")
					}
					var wstatus syscall.WaitStatus
					wpid, err := syscall.Wait4(int(pid), &wstatus, 0, nil)
					if err != nil {
						fmt.Printf("Wait error pid=%v, wpid=%v, err=%v, wstatus=%v\n", pid, wpid, err, wstatus)
						panic("Darn it!")
					}
				}
			}
		}
		atomic.StoreUint64(&done, 1)
	}()

	for atomic.LoadUint64(&done) == 0 {
		// runtime.Gosched()
		DelayN(scale * 20)
		//sink = atomic.LoadUint32(&p.C)
		sink++
		atomic.StoreUint32(&p.C, sink)
		//atomic.CompareAndSwapUint32(&p.C, p.C, p.C+1)
	}
}

var neverTrue bool
var neverAccessed int
var sink uint32

//go:noinline
//go:nosplit
func DelayN(spin uint64) {
	sum := 0
	if !neverTrue {
		sum += neverAccessed
	}
	for i := uint64(0); i < spin; i++ {
		if sum&4096 != 0 {
			sum = (sum + sum) ^ sum
		}
		sum++
	}
	if neverTrue {
		neverAccessed = sum
	}
}

@dr2chase (Contributor) commented Dec 14, 2021

And fork.s, a minor edit that perhaps matters:

#include "textflag.h"

#define SYS_EXIT        1
#define SYS_FORK        2

// func fork() int32
TEXT ·fork(SB),NOSPLIT,$0-4

        MOVQ    $SYS_FORK, AX
        SYSCALL

        CMPQ    AX, $0
        JNE     parent

        MOVQ    main·childDelay(SB), AX
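        // Child: spin for childDelay iterations before exiting.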
loop:
        SUBQ    $1, AX
        TESTQ   AX, AX
        JGE     loop

        MOVQ    $0, DI
        MOVQ    $SYS_EXIT, AX
        SYSCALL
        HLT

parent:
        MOVL    AX, ret+0(FP)
        RET

@prattmic (Member) commented Dec 14, 2021

I managed to reproduce this on NetBSD with a C program!

forkstress.c:

#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

void __attribute((noinline)) spin(uint64_t loops) {
  for (volatile uint64_t i = 0; i < loops; i++) {
  }
}

struct thing {
  uint64_t b;
  uint32_t c;  // Making this (plus 'sink' below) uint64_t may make repro take longer?
};

volatile struct thing* page;

volatile uint32_t sink;

int ready;

void* thread(void* arg) {
  __atomic_store_n(&ready, 1, __ATOMIC_SEQ_CST);

  while (1) {
    // Spin not strictly required, but it speeds up repro in my case.
    spin(40*1000);

    // Atomic not required, this works too:
    // page->c = sink;
    __atomic_store_n(&page->c, sink, __ATOMIC_SEQ_CST);
    sink++;
  }
}

int main(void) {
  page = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
  if (page == MAP_FAILED) {
    perror("mmap");
    return 1;
  }

  pthread_t thread_id;
  int ret = pthread_create(&thread_id, NULL, &thread, NULL);
  if (ret != 0) {
    perror("pthread_create");
    return 1;
  }

  // Wait for child thread to start.
  //
  // This is not required to repro, but eliminates racing fork+thread create as
  // a possibility.
  while (!__atomic_load_n(&ready, __ATOMIC_SEQ_CST)) {
  }

  int64_t i = 0;
  while (1) {
    i++;
    if (i % 10000 == 0) {
      printf("Loop %d...\n", i);
    }

    page->b = 102;
    // Does not work with libc fork(). libc fork() is significantly slower,
    // which may be the problem.
    uint64_t pid = syscall(SYS_fork);
    if (pid == 0) {
      /* Child */
      _exit(0);
    }

    /* Parent */
    /* spin(40*1000); may speed up repro. */
    page->b = 2;
    uint64_t pb = page->b;
    if (pb != 2) {
      printf("Corruption! pb, page->b = %lu, %lu\n", pb, page->b);
      _exit(1);
    }

    int status;
    ret = waitpid(pid, &status, 0);
    if (ret < 0) {
      perror("waitpid");
      return 1;
    }
    if (WEXITSTATUS(status) != 0) {
      printf("Bad child status %#x\n", status);
      return 1;
    }
  }
}

Build and run:

$ cc -pthread forkstress.c
$ ./a.out & ./a.out & ./a.out  & ./a.out
Loop 10000...
Corruption! pb, page->b = 102, 2
Loop 10000...
Loop 10000...
Loop 10000...
Corruption! pb, page->b = 102, 2
Loop 20000...
Loop 20000...
Corruption! pb, page->b = 102, 2

Notes:

  • Anecdotally, running several instances at the same time seems to speed up the repro more than linearly (system under more load?). That said, a single instance will eventually fail.
  • Sometimes I get several failures immediately, sometimes it takes >60s to get a failure.
  • The most immediately interesting part here is that I call the fork syscall directly rather than using fork() from libc. I think this is simply because libc fork() is significantly slower to return than syscall(SYS_fork), and we seem to have a small race window.
    • I've looked through _malloc_prefork/postfork and (I believe) all of the registered atfork callbacks (here, here, and here), and none of them seem important, as neither thread is interacting with pthread or malloc.

The summarized behavior we see is:

  1. page->b = 102
  2. fork()
  3. page->b = 2
  4. Read page->b, observe 102 instead of 2.
  5. When logging the corruption, we load page->b again, which typically observes 2 again.

All while another thread is spinning writing to page->c (unrelated word in the same page).

@jrick (Contributor, Author) commented Dec 15, 2021

Running the netbsd C reproducer on openbsd with the kernel patch below to allow the syscalls from the program's text segment, I'm not able to hit the corruption.

diff 7cf9856068a1d5a2615a65d75cb474b94382bcd1 /usr/src
blob - 76f34e0b0302d93e7c646f179256ce1081ab26a0
file + sys/sys/syscall_mi.h
--- sys/sys/syscall_mi.h
+++ sys/sys/syscall_mi.h
@@ -84,9 +84,10 @@ mi_syscall(struct proc *p, register_t code, const stru
 		return (EPERM);
 
 	/* PC must be in un-writeable permitted text (sigtramp, libc, ld.so) */
-	if (!uvm_map_inentry(p, &p->p_pcinentry, PROC_PC(p),
-	    "[%s]%d/%d pc=%lx inside %lx-%lx: bogus syscall\n",
-	    uvm_map_inentry_pc, p->p_vmspace->vm_map.wserial))
+	if (code != SYS_exit && code != SYS_fork &&
+	    !uvm_map_inentry(p, &p->p_pcinentry, PROC_PC(p),
+	    "[%s]%d/%d pc=%lx inside %lx-%lx: bogus syscall %d\n",
+	     uvm_map_inentry_pc, p->p_vmspace->vm_map.wserial))
 		return (EPERM);
 
 	if (lock)

(Maybe now that I've commented, the corruption will happen again?)

@prattmic (Member) commented Dec 15, 2021

@jrick your comment reminds me that I don't need to use assembly to make a direct syscall in C; I can just use syscall(2). I've updated my repro to use syscall(SYS_fork) instead, which works just as well on NetBSD.

I also haven't had luck reproducing with this program on openbsd-386. FWIW, it runs much slower than netbsd-amd64. The latter gets to 10k loops in a few seconds; the former takes minutes to do the same. I'm not sure whether this is a difference in OpenBSD's fork, or a 386 difference.

Edit: seems to be mostly 386. openbsd-amd64 is faster (though not quite as fast as netbsd).

@gopherbot commented Dec 15, 2021

Change https://golang.org/cl/372355 mentions this issue: doc/go1.18: document OpenBSD & NetBSD memory corruption

@gopherbot commented Dec 17, 2021

Change https://golang.org/cl/370874 mentions this issue: runtime: debugging print for netbsd crash

@riastradh commented Dec 17, 2021

We found a similar multiprocessor COW bug in NetBSD (same as https://reviews.freebsd.org/D14347), but I'm pretty sure it's not the same bug as this one, and although the fix for the similar bug empirically seems to suppress the symptoms of this one, I am pretty sure it's not a correct fix.

The crux of the similar bug is:

struct { long x; char pad[512]; int y; } *f = mmap(...);

/* thread A */
for (;;)
        f->y = 1;  /* trigger cow */

/* thread B */
for (long x = 0;; x++) {
        lock();
        f->x = x; /* write to new page after cow tlb flush */
        unlock();
}

/* thread C */
for (;;) {
        lock();
        long x0 = f->x; /* read from old page before tlb flush */
        long x1 = f->x; /* read from new page after tlb flush */
        unlock();
        assert(x0 == x1); /* fails if thread B wrote after the tlb flush triggered by C */
}

/* main thread */
for (;;) {
        pid_t pid = fork();
        if (pid == 0) _exit(0);
        waitpid(pid, NULL, 0);
}

The bug is that the new page may be exposed writably to thread B while thread C still has TLB entries for the old page, since multiprocessor TLB invalidation isn't simultaneous on all CPUs at once and TLB updates aren't ordered with respect to atomic operations.

The fix is either to expose the new page nonwritably to all CPUs first, and then expose it writably to all CPUs (as FreeBSD does now); or (simpler but maybe slower) to prohibit all access to the old page on all CPUs first when doing copy-on-write, and then expose the new page writably to all CPUs -- either way, B can begin to write to it only after C has witnessed the new page. We haven't been able to reproduce either problem with the following NetBSD kernel patch:

diff --git a/sys/uvm/uvm_fault.c b/sys/uvm/uvm_fault.c
index a35ea288a96b..7e4b6b2132b5 100644
--- a/sys/uvm/uvm_fault.c
+++ b/sys/uvm/uvm_fault.c
@@ -641,6 +641,7 @@ uvmfault_promote(struct uvm_faultinfo *ufi,
 
 	/* copy page [pg now dirty] */
 	if (opg) {
+		pmap_page_protect(opg, VM_PROT_NONE);
 		uvm_pagecopy(opg, pg);
 	}
 	KASSERT(uvm_pagegetdirty(pg) == UVM_PAGE_STATUS_DIRTY);

However, I'm pretty sure the golang issue can't be the same issue and I suspect that this change only papers over the symptoms. The difference is that with the golang bug, the same thread writes to page->b and then reads back two different values, whereas in the other bug, one thread writes to the location (B) and another thread reads it twice (C). So races between tlb flush on the writer, and tlb flush on the reader, don't make sense for the golang bug because the write and read are issued sequentially on the same CPU. Unless the golang thread migrated to another CPU between the two reads, I don't think the intermediate PTE state effected by this NetBSD patch should be necessary.

We've tried verifying which CPU it's on across the two reads using _lwp_ctl but haven't observed any migration in the event of the crash.

We've also tried measuring the time spent in executing the lines

    page->b = 2;
    uint64_t pb = page->b;

by surrounding them in lfence; rdtsc, subtracting, and reporting the difference when pb != 2. We also added char pad[1234]; between x and y to rule out any side effects of cache line sharing.

New code: forkstress.c.txt

On a couple different CPUs (one advertised as 3.4 GHz, one as 3.6 GHz, to be taken with a grain of salt owing to dynamic frequency scaling, of course), with the results scaled by the TSC frequency, we got about 10ns, 130ns, 40ns, and 50ns. These latencies are enough to cover a couple dozen or so CPU instructions, a few L1 or L2 cache fetches, and maybe a load (e.g., a page table entry) from main memory (but in 10ns -- which we observed at -O0! -- even that seems unlikely). Certainly even 130ns is nowhere near enough time for the OS's page fault handler to run and do the copy-on-write; it looks like the COW fault must happen in the other thread.

So exactly what is happening when executing

400cbb:       41 89 d3                mov    %edx,%r11d
400cbe:       41 89 c1                mov    %eax,%r9d
400cc1:       4c 8b 15 88 06 20 00    mov    0x200688(%rip),%r10        # 601350 <page>
400cc8:       49 c7 02 02 00 00 00    movq   $0x2,(%r10)
400ccf:       49 8b 32                mov    (%r10),%rsi

that leads to %rsi holding 102 instead of 2 is unclear, but there are very tight limits to what the CPU could possibly be doing.

I'm struggling to come up with hypotheses that don't involve some microarchitectural bug like a buffered store getting reordered around invlpg. (To rule that hypothesis out, we tried inserting mfence; lfence around pmap_tlb_invalidate, which is the function that issues tlbflush or invlpg instructions on each CPU in the TLB IPI handler -- not that it should be necessary architecturally because invlpg is supposed to be a serializing instruction anyway. That hypothesis is also ruled out by the latency: there is no way the TLB IPI handler could run in 10ns.)

@riastradh commented Dec 17, 2021

We also tried pinning the threads to logical CPUs so they never run as multiple hyperthreads on the same core, or so that they always run as multiple hyperthreads on the same core, but the only difference this might have made was how long it took the problem to trigger.

(Of course, when pinned to a single logical CPU and time-shared there was no problem!)

@cheponis commented Dec 18, 2021

I ran 4 copies of forkstress in the background yesterday and overnight, and this morning it's still chugging along.

NetBSD arm64 9.99.92 NetBSD 9.99.92 aarch64 evbarm Raspberry Pi 4 / 8GB w/ 0.5T SSD

A few lines of output from this morning are below. If we normalize by dividing those numbers by 10,000, the biggest difference is 4399 - 4326 = 73. I'm not sure why there is this 'spread' in execution among the 4 processes. As well, this was executing while I was doing mmap() into a 2**32 array of short ints (cycling through) at the same time.

It would appear there is no problem with aarch64 NetBSD 9.99.92 (-current as of a few days ago). -Mike

Loop 43690000...
Loop 43260000...
Loop 43970000...
Loop 43790000...
Loop 43700000...
Loop 43270000...
Loop 43800000...
Loop 43980000...
Loop 43710000...
Loop 43810000...
Loop 43280000...
Loop 43990000...
Loop 43290000...
Loop 43820000...
Loop 43720000...

P.S. 73 / 4399 = 1.7%. It's unclear whether this 'non-uniformity' of allocating processor time among the RPi's 4 processors is a bug or not.

@prattmic (Member) commented Dec 20, 2021

> We've tried verifying which CPU it's on across the two reads using _lwp_ctl but haven't observed any migration in the event of the crash.

FWIW, I did this as well when the repro was still in Go, and I never saw a CPU migration there either.

bob-beck pushed a commit to openbsd/src that referenced this issue Feb 1, 2022
can't be written to while any thread can see the original version
of the page via a not-yet-flushed stale TLB entry: pmaps can indicate
they do this correctly by defining __HAVE_PMAP_MPSAFE_ENTER_COW;
uvm will force the initial CoW fault to be read-only otherwise.

Set that on amd64 and fix the problem case in pmap_enter() by putting
a read-only mapping in place, shooting the TLB entry, then fixing
it to the final read-write entry so this thread can continue without
re-faulting.

reported by jsing@ from golang/go#34988
assisted by discussion in https://reviews.freebsd.org/D14347
tweaks from jsing@ and kettenis@

ok jsing@ mpi@ kettenis@
@bcmills (Member) commented Jun 28, 2022

Two more occurrences on the openbsd-amd64-68 builder reported in #53176.

@bcmills (Member) commented Jun 28, 2022

greplogs -l -e 'fatal error: (?:.*\n\s*)*syscall\.forkExec' --since=2022-02-01 --omit=plan9
2022-05-25T16:16:01-4e4db1e/openbsd-amd64-68
2022-05-10T15:00:03-020fc47/solaris-amd64-oraclerel
2022-04-22T01:41:25-5d7ca8a-be1d738/openbsd-amd64-68
2022-04-15T16:25:52-082cfab/netbsd-386-9_0
2022-03-03T02:39:01-eeb9f09/aix-ppc64

...I wonder if the failures on aix and solaris are related in some way? I would be curious to see whether the C reproducer in #34988 (comment) also reproduces a failure on those platforms.

(CC @golang/runtime @golang/solaris @golang/aix @golang/netbsd @golang/openbsd)

@pmur (Contributor) commented Jun 28, 2022

I tried the linked C program on AIX with a couple of modifications. I had to modify it to use fork(); I am not sure AIX can fork without going through libc, so the test might not really indicate anything in this case. It did not reproduce anything after running for five or so minutes.

@bcmills (Member) commented Jul 1, 2022

@bcmills (Member) commented Jul 1, 2022

Curiously, I'm not aware of any of these failures on the openbsd-amd64-70 TryBots. Is it possible that this was fixed in OpenBSD 7.0? (If so, maybe that could give us a clue about the other platforms as well. 🤔)

@riastradh commented Jul 1, 2022

OpenBSD applied the following change:

openbsd/src@43687ba

Based on a cursory skim (without having thought much about the details of the change), I suspect that this change is designed to fix another copy-on-write bug just like https://reviews.freebsd.org/D14347 and like I described in #34988 (comment).

However, I have no theory for how the FreeBSD fix or the NetBSD fix could affect the problem we detected here, because that copy-on-write bug -- and the fix in FreeBSD and NetBSD -- makes sense only if the TLB IPI handler runs between the store and load of the memory location at issue, whereas our measurements on NetBSD indicate that that's physically implausible for the issue in this thread.

In NetBSD, we have not yet committed a fix for that COW bug because our draft fixes (like the one in #34988 (comment)) seem to have the side effect of suppressing the issue in this thread. I'm still hoping to hear from AMD with an idea about what could be going wrong in the 10ns window we observed but they haven't gotten back to me yet. However, if we don't hear anything before NetBSD 10.0 is ready we'll probably just apply the unrelated-COW fix.
