What version of Go are you using (go version)?
$ go version
go version go1.16.3 darwin/amd64
Does this issue reproduce with the latest release?
Yes, also with master at 49e933f.
What operating system and processor architecture are you using (go env)?
macOS 11.2.3 (Big Sur) amd64

go env output:
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/erik/Library/Caches/go-build"
GOENV="/Users/erik/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/erik/Projects/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/erik/Projects/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/Cellar/go/1.16.3/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.16.3/libexec/pkg/tool/darwin_amd64"
GOVCS=""
GOVERSION="go1.16.3"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/g7/wr6l983n6zg_0wt3kl3jszlr0000gp/T/go-build1334461169=/tmp/go-build -gno-record-gcc-switches -fno-common"
What did you do?
We haven't been able to come up with a reduced test case that reproduces this yet, but we're working on it.
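The kind of reduced case we have been experimenting with boils down to many goroutines hammering calloc/free through Cgo for a while, roughly like the hypothetical sketch below (so far nothing this simple has reproduced the hang for us):

package main

// #include <stdlib.h>
import "C"

import (
	"runtime"
	"sync"
	"time"
)

// Hypothetical reduced case: spin up many goroutines that repeatedly call
// C.calloc/C.free, mimicking the allocation pattern of the stuck goroutines.
// Run with and without GODEBUG=asyncpreemptoff=1 to compare.
func main() {
	var wg sync.WaitGroup
	deadline := time.Now().Add(10 * time.Minute)
	for i := 0; i < 8*runtime.GOMAXPROCS(0); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for time.Now().Before(deadline) {
				p := C.calloc(128, 1) // returns unsafe.Pointer
				C.free(p)
			}
		}()
	}
	wg.Wait()
}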
When running CockroachDB on macOS Big Sur with async preemption enabled and under some load, we occasionally see goroutines get stuck executing Cgo functions (specifically calloc). When this happens, the process pegs the CPU at 100% (one core), and the Cgo call never returns. It appears to be somewhat correlated with resource contention. We see this with Go 1.16.3, 1.15.11, and 1.14.15.
This does not happen with GODEBUG=asyncpreemptoff=1, nor does it happen with macOS Catalina (or Linux), nor if we disable Cgo calloc and use the Go memory allocator instead.
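For reference, the workaround is simply to disable asynchronous preemption via the environment when launching the binary, e.g. (generic invocation shown; substitute however the nodes are actually started):
$ GODEBUG=asyncpreemptoff=1 ./cockroach start ...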
It can be reproduced practically every time by running a five-node cluster and generating some load as follows (see also macOS build instructions):
$ go get github.com/cockroachdb/cockroach
$ cd $GOPATH/src/github.com/cockroachdb/cockroach
$ make buildshort bin/roachprod
$ ./bin/roachprod create local -n 5
$ ./bin/roachprod start local
$ ./bin/roachprod sql local:1
> ALTER DATABASE defaultdb CONFIGURE ZONE USING num_replicas = 5, range_min_bytes = 1e6, range_max_bytes=10e6;
> CREATE TABLE data AS SELECT id, REPEAT('x', 1024) AS value FROM generate_series(1, 1e6) AS id;
(to tear down the cluster, run ./bin/roachprod destroy local)
Within a few minutes, one of the processes should have a goroutine stuck on calloc. This may or may not affect the running query, depending on which goroutine blocks. Blocked goroutines can be found with e.g.:
$ for URL in $(./bin/roachprod adminurl local); do echo $URL; curl -sSf "${URL}debug/pprof/goroutine?debug=2" | grep -B10 -A20 '_Cfunc_calloc('; done
http://127.0.0.1:26258/
[...]
goroutine 250 [syscall, 67 minutes]:
github.com/cockroachdb/pebble/internal/manual._Cfunc_calloc(0x79, 0x1, 0x0)
_cgo_gotypes.go:42 +0x49
github.com/cockroachdb/pebble/internal/manual.New(0x79, 0x73, 0x1056f1, 0x0)
/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/manual/manual.go:40 +0x3d
github.com/cockroachdb/pebble/internal/cache.newValue(0x59, 0x2)
/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/value_normal.go:38 +0x38
github.com/cockroachdb/pebble/internal/cache.(*Cache).Alloc(...)
/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/internal/cache/clockpro.go:696
github.com/cockroachdb/pebble/sstable.(*Reader).readBlock(0xc001e28a00, 0x1056f1, 0x54, 0x0, 0x0, 0x3fd9384950000000, 0x6, 0x7)
/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/sstable/reader.go:1910 +0x165
github.com/cockroachdb/pebble/sstable.(*Reader).readMetaindex(0xc001e28a00, 0x1056f1, 0x54, 0x0, 0x0)
/Users/erik/Projects/go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/pebble/sstable/reader.go:2012 +0x7d
The node URL that's output has a pprof endpoint at /debug/pprof. A CPU profile captured while a goroutine is hung, with the process pegged at 100% CPU, shows much of the time spent in the runtime.
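Such a profile can be grabbed straight from that endpoint, e.g. (port taken from the adminurl output above; adjust per node):
$ go tool pprof "http://127.0.0.1:26258/debug/pprof/profile?seconds=30"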
The relevant Cgo code in the Pebble storage engine is fairly simple, consisting of calloc and free calls:
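Roughly, as a paraphrased sketch of pebble/internal/manual/manual.go (the exact vendored code may differ in details such as bounds and error handling):

package manual

// #include <stdlib.h>
import "C"

import "unsafe"

// New returns an n-byte slice backed by memory from the C heap via calloc.
func New(n int) []byte {
	if n == 0 {
		return nil
	}
	ptr := C.calloc(C.size_t(n), 1)
	if ptr == nil {
		panic("out of memory")
	}
	// Reinterpret the C allocation as a Go slice without copying.
	// (The 1<<30 array cap is just a sketch-sized upper bound.)
	return (*[1 << 30]byte)(ptr)[:n:n]
}

// Free hands memory obtained from New back to the C allocator.
func Free(b []byte) {
	if cap(b) != 0 {
		C.free(unsafe.Pointer(&b[:cap(b)][0]))
	}
}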
What did you expect to see?
Cgo calls returning as normal.
What did you see instead?
Cgo calls never returning, blocking the goroutine forever.