
runtime/pprof: livelock in cpuProfile.add #25785

Closed
philhofer opened this issue Jun 7, 2018 · 5 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Comments

philhofer commented Jun 7, 2018

Reproduced on linux/arm, go1.10.2, while investigating #24260

The relevant code snippets are as follows:

```go
func (p *cpuProfile) add(gp *g, stk []uintptr) {
	// Simple cas-lock to coordinate with setcpuprofilerate.
	for !atomic.Cas(&prof.signalLock, 0, 1) {
		// <-- this loops forever
		osyield()
	}
	// ...
}
```

and

```go
func setcpuprofilerate(hz int32) {
	// ...

	// Stop profiler on this thread so that it is safe to lock prof.
	// if a profiling signal came in while we had prof locked,
	// it would deadlock.
	setThreadCPUProfiler(0)

	for !atomic.Cas(&prof.signalLock, 0, 1) {
		osyield()
	}

	// ...
}
```

The trouble is that setThreadCPUProfiler(0) doesn't actually do what the comment above it says. On Linux, SIGPROF is delivered by arming a timer with setitimer(2), and setThreadCPUProfiler disarms the timer by calling setitimer(ITIMER_PROF, &it, NULL) with a zeroed struct itimerval. But setitimer(2) delivers signals to the process, not to a particular thread: disarming the timer from thread 0 doesn't mean that thread 1 hasn't already queued a SIGPROF that may yet be delivered to thread 0. See #14434.

(Also see the BUGS section of the man page, which points out that signal generation and signal delivery are distinct events. We might conclude that it's possible to receive at least one SIGPROF delivery after the timer has been disarmed; I'm currently spelunking in the Linux source tree to see whether that's actually true.)

A confounding issue is that atomic.Cas on arm (up through go1.10.2, but not at tip) calls a kernel helper that itself causes the kernel to call sched_yield, so each iteration of the loop in question actually yields two time slices rather than one. I don't know whether this is necessarily related to the issue, but it may exacerbate it, since it means each loop iteration could sleep longer than the SIGPROF timer interval.

I'm trying to reproduce this at tip. I'm also going to capture the output of perf record so that there's a little more evidence that this is actually what's happening. Frustratingly, the fastest repro I have still takes 20+ minutes to hit, so debugging this issue is slow going.

@gopherbot

Change https://golang.org/cl/117057 mentions this issue: runtime: fix pprof livelock on arm

@bcmills bcmills added the NeedsFix The path to resolution is known, but the work has not been done. label Jun 7, 2018
@bcmills bcmills added this to the Go1.11 milestone Jun 7, 2018
bcmills commented Jun 7, 2018

Does this need a backport to 1.10 (and 1.9)?

@philhofer
Contributor Author

CL 117057 might need a backport in order to fix #24260.

The fix to this issue would be just a backport, as I have been unable to reproduce this issue at tip. This suggests the issue has something to do with the kernel helper calls that are no longer present at tip.

@bcmills bcmills modified the milestones: Go1.11, Go1.10.4 Jun 7, 2018
@philhofer
Contributor Author

The issue goes away if I cherry-pick the following commits (with light fixups) onto go1.10, plus the fix for #24260:

14f929a
0a4b962
f4f41b9e18962b255b48e425ec8a2a75a3575d85

So whatever the issue is, it goes away if we stop using kernel helpers.

gopherbot pushed a commit that referenced this issue Jun 8, 2018
On 32-bit architectures without native 64-bit atomic instructions,
64-bit atomics are emulated using spinlocks. However,
the sigprof handling code expects to be able to perform
64-bit atomic operations in signal handlers. Spinning on an
acquired spinlock in a signal handler leads to a livelock.
This is issue #20146.

The original fix for #20146 did not include arm in
the list of architectures that need to work around portability
issues in the sigprof handler code. The unit test designed to
catch this issue does not fail on arm builds because arm uses
striped spinlocks, and thus the livelock takes many minutes
to reproduce. This is issue #24260. (This patch doesn't completely
fix #24260 on go1.10.2 due to issue #25785, which is probably
related to the arm cas kernel helpers. Those have been removed
at tip.)

With this patch applied, I was able to run the reproducer for
issue #24260 for more than 90 minutes without reproducing the
livelock. Without this patch, the livelock took as little as
8 minutes to reproduce.

Fixes #20146
Updates #24260

Change-Id: I64bf53a14d53c4932367d919ac55e17c99d87484
Reviewed-on: https://go-review.googlesource.com/117057
Run-TryBot: Philip Hofer <phofer@umich.edu>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
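For readers unfamiliar with the mechanism the commit message describes, here is a minimal sketch (hypothetical and much simplified from the runtime's actual implementation) of emulating a 64-bit atomic add with a spinlock, the way 32-bit architectures without native 64-bit atomics must, and why entering it from a signal handler livelocks:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

var (
	spin uint32 // spinlock guarding val; 0 = free, 1 = held
	val  uint64 // the emulated "64-bit atomic" variable
)

// add64 emulates an atomic 64-bit add using a 32-bit spinlock.
func add64(delta uint64) uint64 {
	for !atomic.CompareAndSwapUint32(&spin, 0, 1) {
		// spin until the lock is free
	}
	val += delta
	n := val
	atomic.StoreUint32(&spin, 0)
	return n
}

func main() {
	fmt.Println(add64(1), add64(2)) // prints: 1 3
	// If SIGPROF interrupted add64 between the CAS and the store,
	// and the handler then called add64 itself, the handler would
	// spin forever: the interrupted code cannot release the lock
	// while the handler is running on top of it.
}
```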
@andybons andybons modified the milestones: Go1.10.4, Go1.10.5 Aug 24, 2018
@dmitshur dmitshur modified the milestones: Go1.10.5, Go1.10.6 Nov 1, 2018
@dmitshur dmitshur modified the milestones: Go1.10.6, Go1.10.7 Dec 13, 2018
@FiloSottile
Contributor

> The fix to this issue would be just a backport, as I have been unable to reproduce this issue at tip. This suggests the issue has something to do with the kernel helper calls that are no longer present at tip.

Upgrading to 1.11 is a valid workaround, so this is not eligible for backport.

@golang golang locked and limited conversation to collaborators Dec 14, 2019