Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: CPU profiling provokes a runtime hang under load #24260

Closed
rakitzis opened this issue Mar 5, 2018 · 17 comments

Comments

Projects
None yet
6 participants
@rakitzis
Copy link

commented Mar 5, 2018

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

go1.10 (NB: cross-compiling for armv7)

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

Cross compiling for 32 bit arm: GOOS=linux GOARCH=arm go test -c

The host OS is a mac:

; go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/byron/Library/Caches/go-build"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/byron/go"
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/qv/yq0hdcc10vj2y4gpy_ypjj2m0000gn/T/go-build940051274=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

CPU profiling under load seems to hang the process indefinitely. It becomes unresponsive to SIGINT and SIGQUIT, for example.

We discovered this in production code which doesn't do anything as goofy as the minimal example I am submitting, but I did manage to pare it down to a unit test of a few lines:

https://play.golang.org/p/_B_zC9QeQQr

package main

import (
	"bytes"
	"runtime"
	"runtime/pprof"
	"testing"
	"time"
)

func TestProfileUnderLoad(t *testing.T) {
	go func() {
		for {
			runtime.Gosched()
		}
	}()
	if err := pprof.StartCPUProfile(&bytes.Buffer{}); err != nil {
		t.Fatal(err)
	}
	time.Sleep(30 * time.Second)
	pprof.StopCPUProfile()
}

What did you expect to see?

This should not hang.

What did you see instead?

This code example hangs after 2-3 tries on a Marvel Armada SoC (a uniprocessor) with Linux 4.9.37

@davecheney

This comment has been minimized.

Copy link
Contributor

commented Mar 5, 2018

@rakitzis

This comment has been minimized.

Copy link
Author

commented Mar 6, 2018

This works ok when the CPU is idle. The loop is there as a placeholder for CPU load.

In practice we use ARM SoC's as storage processors in front of SATA disks. So they spend most of their time shuttling bytes from the disk to the network (in both directions).

Please let me know if you can suggest a better proxy for CPU load. I have not tried, e.g., an io.Copy of /dev/zero to /dev/null but that may produce the same result.

Some more details: this does not occur in go-1.8. It occurs reliably from go-1.9 onwards, and manifests in our distributed system as the storage processors first becoming nonresponsive to RPC calls and then to SIGQUIT. It's holding up our upgrade to go-1.9 in production.

@davecheney

This comment has been minimized.

Copy link
Contributor

commented Mar 6, 2018

@rakitzis

This comment has been minimized.

Copy link
Author

commented Mar 6, 2018

Yes, i/o load, which does end up keeping this underpowered CPU quite busy.

@davecheney

This comment has been minimized.

Copy link
Contributor

commented Mar 6, 2018

@rakitzis

This comment has been minimized.

Copy link
Author

commented Mar 6, 2018

In kernel space, not userspace hopefully

Indeed.

If it is useful I can try to get a repro with io.Copy. The gosched loop was simply the first thing I tried that reproduced the same symptoms with a minimum of code.

@rakitzis

This comment has been minimized.

Copy link
Author

commented Mar 6, 2018

I have a repro which takes much longer to strike (maybe 30 attempts), using this "load" to approximate I/O plus processing overhead. It is not a proper model (our arms do not md5 every byte) but we do incur some protocol overhead:

        dst, err := os.Create("/dev/null")
        if err != nil {
                t.Fatal(err)
        }
        src, err := os.Open("/dev/zero")
        if err != nil {
                t.Fatal(err)
        }
        r, w := io.Pipe()
        go func() {
                buf := make([]byte, 32768)
                for {
                        n, err := src.Read(buf)
                        if err != nil {
                                w.CloseWithError(err)
                                return
                        }
                        md5.Sum(buf[:n])
                        w.Write(buf[:n])
                }
        }()
        go io.Copy(dst, r)
@davecheney

This comment has been minimized.

Copy link
Contributor

commented Mar 6, 2018

@rakitzis

This comment has been minimized.

Copy link
Author

commented Mar 6, 2018

The test code above (with the contrived Gosched example) is complete -- the code exits when time.Sleep runs for 30 sec. Otherwise it takes "forever". Not sure how long forever is -- but minutes to hours at least (my patience runs out first).

@rakitzis

This comment has been minimized.

Copy link
Author

commented Mar 6, 2018

And note that the process becomes unresponsive to SIGQUIT and SIGINT.

@davecheney

This comment has been minimized.

Copy link
Contributor

commented Mar 6, 2018

#24260 (comment) is not complete, can you please show the full program so that others can try to reproduce the problem. The full program is important because I want to know how you're avoiding the main goroutine exiting as both the goroutines in #24260 (comment) will not prevent the main goroutine from exiting.

@davecheney

This comment has been minimized.

Copy link
Contributor

commented Mar 6, 2018

If your example is structured as a test, be aware that go test will fork off your test program and may capture some signals destined for the child process. Try running go test -c && ./package.test

@rakitzis

This comment has been minimized.

Copy link
Author

commented Mar 6, 2018

Here is the full source:

https://play.golang.org/p/lhf4TbbFMqS

Here is how it is compiled:

GOARCH=arm GOOS=linux go test -c

Here is how it is run on the target architecture:

while time ./repro.test -test.v; do date; done

The last time I ran this the loop made it through about 250 iterations before hanging.

@rakitzis

This comment has been minimized.

Copy link
Author

commented Mar 6, 2018

Let me know if it helps to recast the repro as a standalone binary with a func main (i.e., no dependency on package testing)

@ianlancetaylor ianlancetaylor added this to the Go1.11 milestone Mar 6, 2018

@ianlancetaylor ianlancetaylor changed the title CPU profiling provokes a runtime hang under load runtime: CPU profiling provokes a runtime hang under load Mar 7, 2018

@kgbvax

This comment has been minimized.

Copy link

commented Apr 23, 2018

I can confirm this with a piece of code I'm running: Reading text at ~1gbit/sec from a handful of sockets and updating some memory. Linux/Samsung Exynos5422

@philhofer

This comment has been minimized.

Copy link
Contributor

commented Jun 6, 2018

It looks like this is an oversight in the code that is meant to prevent systems that emulate 64-bit atomics from deadlocking when the signal handler code grabs one of the spinlocks while it's already held. Basically, this is issue #20146 but for arm instead of mips.

The solution appears to be straightforward:

diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index 9908951544..8e5186956e 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -3699,7 +3699,7 @@ func sigprof(pc, sp, lr uintptr, gp *g, mp *m) {
 	// As a workaround, create a counter of SIGPROFs while in critical section
 	// to store the count, and pass it to sigprof.add() later when SIGPROF is
 	// received from somewhere else (with _LostSIGPROFDuringAtomic64 as pc).
-	if GOARCH == "mips" || GOARCH == "mipsle" {
+	if GOARCH == "mips" || GOARCH == "mipsle" || GOARCH == "arm" {
 		if f := findfunc(pc); f.valid() {
 			if hasprefix(funcname(f), "runtime/internal/atomic") {
 				lostAtomic64Count++
@@ -3839,7 +3839,7 @@ func sigprof(pc, sp, lr uintptr, gp *g, mp *m) {
 	}
 
 	if prof.hz != 0 {
-		if (GOARCH == "mips" || GOARCH == "mipsle") && lostAtomic64Count > 0 {
+		if (GOARCH == "mips" || GOARCH == "mipsle" || GOARCH == "arm") && lostAtomic64Count > 0 {
 			cpuprof.addLostAtomic64(lostAtomic64Count)
 			lostAtomic64Count = 0
 		}

I'm going to go confirm that this does indeed fix the issue, and then send a CL.

@gopherbot

This comment has been minimized.

Copy link

commented Jun 7, 2018

Change https://golang.org/cl/117057 mentions this issue: runtime: fix pprof livelock on arm

@gopherbot gopherbot closed this in afae876 Jun 8, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.