New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime/pprof: livelock in cpuProfile.add #25785

Closed
philhofer opened this Issue Jun 7, 2018 · 5 comments

Comments

Projects
None yet
6 participants
@philhofer
Contributor

philhofer commented Jun 7, 2018

Reproduced on linux/arm, go1.10.2, while investigating #24260

The relevant code snippets are as follows:

func (p *cpuProfile) add(gp *g, stk []uintptr) {
	// Simple cas-lock to coordinate with setcpuprofilerate.
	for !atomic.Cas(&prof.signalLock, 0, 1) {
		[[ this loops forever ]]
		osyield()
	}
...

and

func setcpuprofilerate(hz int32) {

	...

	// Stop profiler on this thread so that it is safe to lock prof.
	// if a profiling signal came in while we had prof locked,
	// it would deadlock.
	setThreadCPUProfiler(0)

	for !atomic.Cas(&prof.signalLock, 0, 1) {
		osyield()
	}

...

The trouble is that setThreadCPUProfiler(0) doesn't actually do what the comment above it says it is supposed to do. On Linux, SIGPROF is delivered by arming a timer with setitimer(2), and setThreadCPUProfiler disarms the timer by calling setitimer(ITIMER_PROF, &it, NULL) with an empty struct itimerval. setitimer(2) delivers signals to the process, not the thread. Disarming a timer on thread 0 doesn't mean that thread 1 hasn't already queued a SIGPROF that may or may not be delivered to thread 0. See #14434 (Also see the BUGS section of the man page where it points out that signal generation and signal delivery are distinct events. We might conclude that it's possible to receive at least one SIGPROF delivery after the timer has been disarmed; I'm currently spelunking in the linux source tree to see if that's actually true.)

A confounding issue is that atomic.Cas on arm (up through go1.10.2, but not at tip) calls a kernel helper that leads the kernel to call sched_yield, so the loop in question actually yields two time slices and not one. I don't know if this is necessarily related to the issue, but it may exacerbate it, since it means that the loop could actually sleep longer in each iteration than than the SIGPROF timer interval.

I'm trying to reproduce this at tip. I'm also going to capture the output of perf record so that there's a little more evidence that this is actually what's happening. Frustratingly, the fastest repro I have still takes 20+ minutes to hit, so debugging this issue is slow going.

@gopherbot

This comment has been minimized.

gopherbot commented Jun 7, 2018

Change https://golang.org/cl/117057 mentions this issue: runtime: fix pprof livelock on arm

@bcmills bcmills added the NeedsFix label Jun 7, 2018

@bcmills bcmills added this to the Go1.11 milestone Jun 7, 2018

@bcmills

This comment has been minimized.

Member

bcmills commented Jun 7, 2018

Does this need a backport to 1.10 (and 1.9)?

@philhofer

This comment has been minimized.

Contributor

philhofer commented Jun 7, 2018

CL 117057 might need a backport in order to fix #24260.

The fix to this issue would be just a backport, as I have been unable to reproduce this issue at tip. This suggests the issue has something to do with the kernel helper calls that are no longer present at tip.

@bcmills bcmills modified the milestones: Go1.11, Go1.10.4 Jun 7, 2018

@philhofer

This comment has been minimized.

Contributor

philhofer commented Jun 7, 2018

The issue goes away if I cherry-pick the following commits (with light fixups) onto go1.10, plus the fix for #24260 :

14f929a
0a4b962
f4f41b9e18962b255b48e425ec8a2a75a3575d85

So whatever the issue is, it goes away if we stop using kernel helpers.

gopherbot pushed a commit that referenced this issue Jun 8, 2018

runtime: fix pprof livelock on arm
On 32-bit architectures without native 64-bit atomic instructions,
64-bit atomics are emulated using spinlocks. However,
the sigprof handling code expects to be able to perform
64-bit atomic operations in signal handlers. Spinning on an
acquired spinlock in a signal handler leads to a livelock.
This is issue #20146.

The original fix for #20146 did not include arm in
the list of architectures that need to work around portability
issues in the sigprof handler code. The unit test designed to
catch this issue does not fail on arm builds because arm uses
striped spinlocks, and thus the livelock takes many minutes
to reproduce. This is issue #24260. (This patch doesn't completely
fix #24260 on go1.10.2 due to issue #25785, which is probably
related to the arm cas kernel helpers. Those have been removed
at tip.)

With this patch applied, I was able to run the reproducer for
issue #24260 for more than 90 minutes without reproducing the
livelock. Without this patch, the livelock took as little as
8 minutes to reproduce.

Fixes #20146
Updates #24260

Change-Id: I64bf53a14d53c4932367d919ac55e17c99d87484
Reviewed-on: https://go-review.googlesource.com/117057
Run-TryBot: Philip Hofer <phofer@umich.edu>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>

@andybons andybons modified the milestones: Go1.10.4, Go1.10.5 Aug 24, 2018

@dmitshur dmitshur modified the milestones: Go1.10.5, Go1.10.6 Nov 1, 2018

@dmitshur dmitshur modified the milestones: Go1.10.6, Go1.10.7 Dec 13, 2018

@FiloSottile

This comment has been minimized.

Member

FiloSottile commented Dec 14, 2018

The fix to this issue would be just a backport, as I have been unable to reproduce this issue at tip. This suggests the issue has something to do with the kernel helper calls that are no longer present at tip.

Upgrading to 1.11 is a valid workaround, so this is not eligible for backport.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment