Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: goroutine in C code hangs with async preemption under macOS Big Sur #45558

Closed
erikgrinaker opened this issue Apr 14, 2021 · 14 comments
Closed

Comments

@erikgrinaker
Copy link

@erikgrinaker erikgrinaker commented Apr 14, 2021

What version of Go are you using (go version)?

$ go version
go version go1.16.3 darwin/amd64

Does this issue reproduce with the latest release?

Yes, also with master at 49e933f.

What operating system and processor architecture are you using (go env)?

macOS 11.2.3 (Big Sur) amd64

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/erik/Library/Caches/go-build"
GOENV="/Users/erik/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/erik/Projects/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/erik/Projects/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/Cellar/go/1.16.3/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.16.3/libexec/pkg/tool/darwin_amd64"
GOVCS=""
GOVERSION="go1.16.3"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/g7/wr6l983n6zg_0wt3kl3jszlr0000gp/T/go-build1334461169=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

We haven't been able to come up with a reduced test case that reproduces this yet, but we're working on it.

When running CockroachDB on macOS Big Sur with async preemption enabled and under some load, we occasionally see goroutines get stuck executing Cgo functions (specifically calloc). When this happens, the process pegs the CPU at 100% (one core), and the Cgo call never returns. It appears to be somewhat correlated with resource contention. We see this with Go 1.16.3, 1.15.11, and 1.14.15.

This does not happen with GODEBUG=asyncpreemptoff=1, nor does it happen with macOS Catalina (or Linux), nor if we disable Cgo calloc and use the Go memory allocator instead.

It can be reproduced practically every time by running a five-node cluster and generating some load as follows (see also macOS build instructions):

$ go get github.com/cockroachdb/cockroach
$ cd $GOPATH/src/github.com/cockroachdb/cockroach
$ make buildshort bin/roachprod
$ ./bin/roachprod create local -n 5
$ ./bin/roachprod start local
$ ./bin/roachprod sql local:1
> ALTER DATABASE defaultdb CONFIGURE ZONE USING num_replicas = 5, range_min_bytes = 1e6, range_max_bytes=10e6;
> CREATE TABLE data AS SELECT id, REPEAT('x', 1024) AS value FROM generate_series(1, 1e6) AS id;

(to tear down the cluster, run ./bin/roachprod destroy local)

Within a few minutes, one of the processes should have a goroutine stuck on calloc. This may or may not affect the running query, depending on which goroutine blocks. Blocked goroutines can be found with e.g.:

$ for URL in $(./bin/roachprod adminurl local); do echo $URL; curl -sSf "${URL}debug/pprof/goroutine?debug=2" | grep -B10 -A20 '_Cfunc_calloc('; done
http://127.0.0.1:26258/
[...]
goroutine 250 [syscall, 67 minutes]:
github.com/cockroachdb/pebble/internal/manual._Cfunc_calloc(0x79, 0x1, 0x0)
	_cgo_gotypes.go:42 +0x49
github.com/cockroachdb/pebble/internal/manual.New(0x79, 0x73, 0x1056f1, 0x0)
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/manual/manual.go:40 +0x3d
github.com/cockroachdb/pebble/internal/cache.newValue(0x59, 0x2)
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/value_normal.go:38 +0x38
github.com/cockroachdb/pebble/internal/cache.(*Cache).Alloc(...)
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:696
github.com/cockroachdb/pebble/sstable.(*Reader).readBlock(0xc001e28a00, 0x1056f1, 0x54, 0x0, 0x0, 0x3fd9384950000000, 0x6, 0x7)
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/sstable/reader.go:1910 +0x165
github.com/cockroachdb/pebble/sstable.(*Reader).readMetaindex(0xc001e28a00, 0x1056f1, 0x54, 0x0, 0x0)
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/sstable/reader.go:2012 +0x7d

The node URL that's output has a pprof endpoint at /debug/pprof. A CPU profile captured during a hung goroutine, while the process is pegged at 100% CPU, shows much of the time spent in the runtime:

flamegraph

The relevant Cgo code in the Pebble storage engine is fairly simple, consisting of calloc and free calls:

https://github.com/cockroachdb/pebble/blob/3d4c32f510a80f21e787caabf360edffe1431677/internal/manual/manual.go

What did you expect to see?

Cgo calls returning as normal.

What did you see instead?

Cgo calls never returning, blocking the goroutine forever.

@mknyszek mknyszek added this to the Backlog milestone Apr 14, 2021
@mknyszek mknyszek changed the title runtime: Cgo goroutine hangs with async preemption under macOS Big Sur runtime: goroutine in cgo hangs with async preemption under macOS Big Sur Apr 14, 2021
@mknyszek mknyszek changed the title runtime: goroutine in cgo hangs with async preemption under macOS Big Sur runtime: goroutine in C code hangs with async preemption under macOS Big Sur Apr 14, 2021
@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Apr 14, 2021

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Apr 14, 2021

CC @cherrymui

@erikgrinaker
Copy link
Author

@erikgrinaker erikgrinaker commented Apr 15, 2021

This appears to have been caused by github.com/benesch/cgosymbolizer, removing that import solves the issue. We'll come up with a test case for this, although it may well be a bug in that project.

@erikgrinaker
Copy link
Author

@erikgrinaker erikgrinaker commented Apr 15, 2021

Here's a simple reproduction that continuously allocates and frees 1 GB of memory in two goroutines, and outputs the goroutine stacks every 30 seconds. Both goroutines will eventually block, the first one within about 20 seconds.

package main

import (
	// #include <stdlib.h>
	"C"
	"fmt"
	"math"
	"runtime"
	"time"
	"unsafe"

	_ "github.com/benesch/cgosymbolizer"
)

const (
	concurrency = 2
	size        = 32768
	count       = int(1e9) / size // 1 GB per goroutine
)

func main() {
	for w := 0; w < concurrency; w++ {
		go func(w int) {
			for i := 0; ; i++ {
				fmt.Printf("goroutine:%d cycles:%d\n", w, i)

				allocs := make([][]byte, 0, count)
				for i := 0; i < count; i++ {
					allocs = append(allocs, alloc(size))
				}
				for _, a := range allocs {
					free(a)
				}
			}
		}(w)
	}

	buf := make([]byte, 65536)
	for {
		time.Sleep(30 * time.Second)
		n := runtime.Stack(buf, true)
		fmt.Printf("%s\n", buf[:n])
	}
}

func alloc(n int) []byte {
	ptr := C.calloc(C.size_t(n), 1)
	if ptr == nil {
		panic("out of memory")
	}
	return (*[math.MaxInt32]byte)(ptr)[:n:n]
}

func free(b []byte) {
	C.free(unsafe.Pointer(&b[0]))
}
@erikgrinaker
Copy link
Author

@erikgrinaker erikgrinaker commented Apr 15, 2021

Here's an even smaller test case. Interestingly, this will have hung goroutines even with GODEBUG=asyncpreemptoff=1. Although the failure mode here may be different, as the stack traces don't show timing information for the syscalls.

package main

import (
	// #include <stdlib.h>
	"C"
	"fmt"
	"os"
	"runtime"
	"syscall"
	"time"

	_ "github.com/benesch/cgosymbolizer"
)

const concurrency = 2

func main() {
	envName := C.CString("GODEBUG")
	for w := 0; w < concurrency; w++ {
		go func(w int) {
			for i := 0; ; i++ {
				if i%1e6 == 0 {
					fmt.Printf("goroutine:%d calls:%d\n", w, i)
					runtime.Gosched()
				}
				C.getenv(envName)
			}
		}(w)
	}
	go func() {
		p, _ := os.FindProcess(os.Getpid())
		for {
			p.Signal(syscall.SIGUSR1)
			time.Sleep(time.Millisecond)
		}
	}()
	buf := make([]byte, 65536)
	for {
		time.Sleep(30 * time.Second)
		n := runtime.Stack(buf, true)
		fmt.Printf("%s\n", buf[:n])
	}
}
@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Apr 15, 2021

Now I'm curious whether github.com/ianlancetaylor/cgosymbolizer works any better.

@erikgrinaker
Copy link
Author

@erikgrinaker erikgrinaker commented Apr 15, 2021

Now I'm curious whether github.com/ianlancetaylor/cgosymbolizer works any better.

Gave that a try too, same behavior.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Apr 15, 2021

In the original issue report, the deadlock is happening because the program is trying to fetch a stack trace while calling the C function calloc. That can happen if a signal is received during the call to calloc. That will cause a deadlock if the cgosymbolizer code itsef calls calloc or malloc. The github.com/benesch/cgosymbolizer code uses macOS libunwind, and at least some versions of macOS libunwind call malloc. So it's not surprising that a deadlock can occur.

The github.com/ianlancetaylor/cgosymbolizer package, on the other hand, does not call malloc. So that must be failing in a different way.

In any case it's not clear to me that this is a bug in the Go toolchain, or that there is anything that the Go project can change to fix this.

You might be able to fix the problem by changing the file cgosymbolizer_darwin.c, the function cgo_traceback, to simply return if arg->sig_ctx != 0.

@benesch
Copy link
Contributor

@benesch benesch commented Apr 16, 2021

Irrespective of this bug, @erikgrinaker, Cockroach should definitely switch to using Ian’s cgosymbolizer if it supports Mach-O now.

@benesch
Copy link
Contributor

@benesch benesch commented Apr 16, 2021

Oh, I see, you’re removing it entirely via cockroachdb/cockroach#63737. That seems reasonable.

@erikgrinaker
Copy link
Author

@erikgrinaker erikgrinaker commented Apr 16, 2021

Thank you for the explanation @ianlancetaylor, that makes a lot of sense. We did see libunwind come up when looking at the hung threads in a debugger, which fits well with what you're saying.

You might be able to fix the problem by changing the file cgosymbolizer_darwin.c, the function cgo_traceback, to simply return if arg->sig_ctx != 0.

This does indeed solve the issue, as one would expect.

In any case it's not clear to me that this is a bug in the Go toolchain, or that there is anything that the Go project can change to fix this.

I agree, nothing here indicates a problem in Go, so I'll close the issue. Thanks for having a look!

Cockroach should definitely switch to using Ian’s cgosymbolizer if it supports Mach-O now.

We don't really have a big need for this now that we no longer use RocksDB, but otherwise that might have been worth exploring. Thanks for the suggestion!

@benesch
Copy link
Contributor

@benesch benesch commented Apr 16, 2021

Something is still unexplained here, though. If the issue was that benesch/cgosymbolizer can sometimes call malloc, then why didn’t switching to ianlancetaylor/cgosymbolizer, which supposedly does not call malloc, fix the issue?

@erikgrinaker
Copy link
Author

@erikgrinaker erikgrinaker commented Apr 18, 2021

Something is still unexplained here, though. If the issue was that benesch/cgosymbolizer can sometimes call malloc, then why didn’t switching to ianlancetaylor/cgosymbolizer, which supposedly does not call malloc, fix the issue?

That's true. I had a look, and it's blocked on _Unwind_Backtrace (from unwind.h):

https://github.com/ianlancetaylor/cgosymbolizer/blob/7cb7081f6b86962f7f0b974699a93a5ab1c6a1c1/traceback.c#L82

 0  0x00007fff2a5b18b0 in ???
    at ?:-1
 1  0x00007fff2a5b393c in ???
    at ?:-1
 2  0x00007fff2a5b5eba in ???
    at ?:-1
 3  0x00000000040ba861 in C.cgoTraceback
    at /Users/erik/go/pkg/mod/github.com/ianlancetaylor/cgosymbolizer@v0.0.0-20210303021718-7cb7081f6b86/traceback.c:82
 4  0x00000000040ab7b9 in C.x_cgo_callers
    at /usr/local/Cellar/go/1.16.3/libexec/src/runtime/cgo/gcc_traceback.c:22

On macOS, unwind.h is an Apple/LLVM-provided implementation that was introduced with Big Sur according to this comment in unwind.h:

#if defined(__APPLE__) && __has_include_next(<unwind.h>)
/* Darwin (from 11.x on) provide an unwind.h. If that's available,
 * use it. libunwind wraps some of its definitions in #ifdef _GNU_SOURCE,
 * so define that around the include.*/
 *

I think that ends up being the same implementation as libunwind.h (also provided via LLVM), which we think calls malloc, thus causing a deadlock. Either that, or I assume the unwind implementation is not async-signal-safe in some other way. Here, too, if (arg->sigContext != 0) return; fixes the issue.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Apr 18, 2021

It would be extremely unfortunate if the LLVM implementation of _Unwind_Backtrace calls malloc. But I don't suppose that there is much that we can do about it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
5 participants