runtime: goroutine in C code hangs with async preemption under macOS Big Sur #45558

@erikgrinaker

Description

What version of Go are you using (go version)?

$ go version
go version go1.16.3 darwin/amd64

Does this issue reproduce with the latest release?

Yes, also with master at 49e933f.

What operating system and processor architecture are you using (go env)?

macOS 11.2.3 (Big Sur) amd64

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/erik/Library/Caches/go-build"
GOENV="/Users/erik/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/erik/Projects/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/erik/Projects/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/Cellar/go/1.16.3/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.16.3/libexec/pkg/tool/darwin_amd64"
GOVCS=""
GOVERSION="go1.16.3"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/g7/wr6l983n6zg_0wt3kl3jszlr0000gp/T/go-build1334461169=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

We haven't been able to come up with a reduced test case that reproduces this yet, but we're working on it.

When running CockroachDB on macOS Big Sur with async preemption enabled and under some load, we occasionally see goroutines get stuck in Cgo calls (specifically calloc). When this happens, the process pegs one CPU core at 100% and the Cgo call never returns. It appears to be somewhat correlated with resource contention. We see this with Go 1.16.3, 1.15.11, and 1.14.15.

This does not happen with GODEBUG=asyncpreemptoff=1, nor does it happen with macOS Catalina (or Linux), nor if we disable Cgo calloc and use the Go memory allocator instead.

It can be reproduced practically every time by running a five-node cluster and generating some load as follows (see also macOS build instructions):

$ go get github.com/cockroachdb/cockroach
$ cd $GOPATH/src/github.com/cockroachdb/cockroach
$ make buildshort bin/roachprod
$ ./bin/roachprod create local -n 5
$ ./bin/roachprod start local
$ ./bin/roachprod sql local:1
> ALTER DATABASE defaultdb CONFIGURE ZONE USING num_replicas = 5, range_min_bytes = 1e6, range_max_bytes=10e6;
> CREATE TABLE data AS SELECT id, REPEAT('x', 1024) AS value FROM generate_series(1, 1e6) AS id;

(to tear down the cluster, run ./bin/roachprod destroy local)

Within a few minutes, one of the processes should have a goroutine stuck on calloc. This may or may not affect the running query, depending on which goroutine blocks. Blocked goroutines can be found with e.g.:

$ for URL in $(./bin/roachprod adminurl local); do echo $URL; curl -sSf "${URL}debug/pprof/goroutine?debug=2" | grep -B10 -A20 '_Cfunc_calloc('; done
http://127.0.0.1:26258/
[...]
goroutine 250 [syscall, 67 minutes]:
github.com/cockroachdb/pebble/internal/manual._Cfunc_calloc(0x79, 0x1, 0x0)
	_cgo_gotypes.go:42 +0x49
github.com/cockroachdb/pebble/internal/manual.New(0x79, 0x73, 0x1056f1, 0x0)
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/manual/manual.go:40 +0x3d
github.com/cockroachdb/pebble/internal/cache.newValue(0x59, 0x2)
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/value_normal.go:38 +0x38
github.com/cockroachdb/pebble/internal/cache.(*Cache).Alloc(...)
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:696
github.com/cockroachdb/pebble/sstable.(*Reader).readBlock(0xc001e28a00, 0x1056f1, 0x54, 0x0, 0x0, 0x3fd9384950000000, 0x6, 0x7)
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/sstable/reader.go:1910 +0x165
github.com/cockroachdb/pebble/sstable.(*Reader).readMetaindex(0xc001e28a00, 0x1056f1, 0x54, 0x0, 0x0)
	/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/sstable/reader.go:2012 +0x7d

The node URLs in the output expose a pprof endpoint at /debug/pprof. A CPU profile captured while a goroutine is hung and the process is pegged at 100% CPU shows most of the time spent in the runtime:

[flamegraph: CPU profile with samples concentrated in Go runtime functions]

The relevant Cgo code in the Pebble storage engine is fairly simple, consisting of calloc and free calls:

https://github.com/cockroachdb/pebble/blob/3d4c32f510a80f21e787caabf360edffe1431677/internal/manual/manual.go

What did you expect to see?

Cgo calls returning as normal.

What did you see instead?

Cgo calls never returning, blocking the goroutine forever.

Labels: FrozenDueToAge, NeedsInvestigation, OS-Darwin