
runtime/pprof: StopCPUProfile occasionally stuck with 100% CPU and process hang #52912

Open
breezewish opened this issue May 15, 2022 · 3 comments
Labels
NeedsInvestigation

Comments

breezewish commented May 15, 2022

What version of Go are you using (go version)?

go1.18.0

Does this issue reproduce with the latest release?

Not sure.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/Users/breezewish/Library/Caches/go-build"
GOENV="/Users/breezewish/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/breezewish/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/breezewish/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/opt/homebrew/Cellar/go/1.18.1/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/opt/homebrew/Cellar/go/1.18.1/libexec/pkg/tool/darwin_arm64"
GOVCS=""
GOVERSION="go1.18.1"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/6m/phy2_frd0vd1hb7g1py32qz40000gn/T/go-build2219610054=/tmp/go-build -gno-record-gcc-switches -fno-common"
GOROOT/bin/go version: go version go1.18.1 darwin/arm64
GOROOT/bin/go tool compile -V: compile version go1.18.1
uname -v: Darwin Kernel Version 21.4.0: Fri Mar 18 00:46:32 PDT 2022; root:xnu-8020.101.4~15/RELEASE_ARM64_T6000
ProductName:	macOS
ProductVersion:	12.3.1
BuildVersion:	21E258
lldb --version: lldb-1316.0.9.41
Apple Swift version 5.6 (swiftlang-5.6.0.323.62 clang-1316.0.20.8)

What did you do?

In tidb-server there is a feature that repeatedly runs the pprof CPU profiler in 1-second windows (StartCPUProfile -> wait 1 sec -> StopCPUProfile -> StartCPUProfile -> wait 1 sec -> ...).

Recently, on my macOS M1 machine with this feature enabled, I observed that the tidb-server process hung at 100% CPU (1 core) and could not process any requests.

According to the CPU profiling data captured by Instruments, it looks like StopCPUProfile was looping forever at:

for !atomic.Cas(&prof.signalLock, 0, 1) {

[screenshot: Instruments stack trace ending in the spin loop above]
I have no idea how this issue can be reliably reproduced. I hope the stack captured by Instruments helps.

What did you expect to see?

StopCPUProfile should not cause process hang.

What did you see instead?

Process was hanging.

@seankhliao seankhliao changed the title pprof.StopCPUProfile occasionally stuck with 100% CPU and process hang runtime/pprof: StopCPUProfile occasionally stuck with 100% CPU and process hang May 16, 2022
@seankhliao seankhliao added the NeedsInvestigation label May 16, 2022
@seankhliao (Contributor) commented May 16, 2022

try killing it with SIGABRT and see what it's doing?

@breezewish (Author) commented May 19, 2022

@seankhliao It has not reproduced in recent days; I will try SIGABRT when it happens again. Meanwhile, do you have other ideas or suggested measures I could take to capture the necessary information the next time it reproduces?

@rhysh (Contributor) commented May 23, 2022

Could this be a memory-order / weakness problem with M1 / ARM64? The defense against this deadlock is setting m.profilehz = 0 (in https://github.com/golang/go/blob/go1.18/src/runtime/proc.go#L4569 / https://github.com/golang/go/blob/go1.18/src/runtime/signal_unix.go#L328), which gets checked at the top of the SIGPROF handler (at https://github.com/golang/go/blob/go1.18/src/runtime/proc.go#L4450).

The store and load of m.profilehz are non-atomic, which seems like it could be a problem on ARM64. But both take place on the same thread (the store on the curg and the load on the gsignal of a particular M), which seems like it could mitigate that problem. @ianlancetaylor, do you know if there's something to that, or would brokenness there show up as even bigger problems? Thanks.
