Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: TestCgoPprof and TestCgoPprofPIE fail sometimes on linux/amd64 #18856

Closed
codesenberg opened this issue Jan 30, 2017 · 13 comments

Comments

Projects
None yet
5 participants
@codesenberg
Copy link
Contributor

commented Jan 30, 2017

What version of Go are you using (go version)?

go version devel +4cffe2b Sun Jan 29 23:31:20 2017 +0000 linux/amd64
bootstrapped with:
go version go1.8rc3 linux/amd64

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/sanbox/go"
GORACE=""
GOROOT="/home/sanbox/Documents/go/go"
GOTOOLDIR="/home/sanbox/Documents/go/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build693607066=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
PKG_CONFIG="pkg-config"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"

What did you do?

Tried to build Go from source and ran all.bash.

What did you expect to see?

Tests to pass.

What did you see instead?

So far I have seen three subtly different ways in which tests failed: first, second and third one.

Here are binaries and corresponding profiles.

I ran all.bash several times and it failed consistently.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jan 30, 2017

Copying data here for ease of reading:

First:

--- FAIL: TestCgoPprofPIE (6.25s)
    crash_cgo_test.go:305: 30ms of 50ms total (60.00%)
        Showing top 1 nodes out of 8 (cum >= 30ms)
              flat  flat%   sum%        cum   cum%
              30ms 60.00% 60.00%       30ms 60.00%  runtime._ExternalCode
    crash_cgo_test.go:311: missing cpuHog in pprof output
FAIL
FAIL    runtime 20.083s

Second:

--- FAIL: TestCgoPprofPIE (0.12s)
 crash_cgo_test.go:305: 50ms of 80ms total (62.50%)
  Showing top 1 nodes out of 12 (cum >= 50ms)
        flat  flat%   sum%        cum   cum%
        50ms 62.50% 62.50%       50ms 62.50%  runtime._ExternalCode
 crash_cgo_test.go:311: missing cpuHog in pprof output
--- FAIL: TestCgoPprof (0.10s)
 crash_cgo_test.go:305: 30ms of 50ms total (60.00%)
  Showing top 1 nodes out of 8 (cum >= 30ms)
        flat  flat%   sum%        cum   cum%
        30ms 60.00% 60.00%       30ms 60.00%  runtime._ExternalCode
 crash_cgo_test.go:311: missing cpuHog in pprof output
FAIL
FAIL runtime 30.785s

Third:

##### GOMAXPROCS=2 runtime -cpu=1,2,4
--- FAIL: TestCgoPprofPIE (0.23s)
    crash_cgo_test.go:305: 170ms of 190ms total (89.47%)
        Showing top 1 nodes out of 8 (cum >= 170ms)
              flat  flat%   sum%        cum   cum%
             170ms 89.47% 89.47%      170ms 89.47%  runtime._ExternalCode
    crash_cgo_test.go:311: missing cpuHog in pprof output
--- FAIL: TestCgoPprof (0.12s)
    crash_cgo_test.go:305: 60ms of 80ms total (75.00%)
        Showing top 1 nodes out of 8 (cum >= 60ms)
              flat  flat%   sum%        cum   cum%
              60ms 75.00% 75.00%       60ms 75.00%  runtime._ExternalCode
    crash_cgo_test.go:311: missing cpuHog in pprof output
--- FAIL: TestCgoPprofPIE (0.10s)
    crash_cgo_test.go:305: 30ms of 50ms total (60.00%)
        Showing top 1 nodes out of 8 (cum >= 30ms)
              flat  flat%   sum%        cum   cum%
              30ms 60.00% 60.00%       30ms 60.00%  runtime._ExternalCode
    crash_cgo_test.go:311: missing cpuHog in pprof output
--- FAIL: TestCgoPprof (0.14s)
    crash_cgo_test.go:305: 70ms of 90ms total (77.78%)
        Showing top 1 nodes out of 8 (cum >= 70ms)
              flat  flat%   sum%        cum   cum%
              70ms 77.78% 77.78%       70ms 77.78%  runtime._ExternalCode
    crash_cgo_test.go:311: missing cpuHog in pprof output
FAIL
FAIL    runtime 30.467s
@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jan 30, 2017

What version of GCC are you using? What does gcc --version print?

What do you see if you do this:

cd runtime/testdata/testprogcgo
go build
go tool pprof -top `./testprogcgo CgoPprof`
@codesenberg

This comment has been minimized.

Copy link
Contributor Author

commented Jan 30, 2017

@ianlancetaylor, GCC version is gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609.
I've ran commands you provided several times and seen these outputs:

30ms of 30ms total (  100%)
      flat  flat%   sum%        cum   cum%
      20ms 66.67% 66.67%       20ms 66.67%  cpuHog
      10ms 33.33%   100%       10ms 33.33%  runtime._ExternalCode
         0     0%   100%       20ms 66.67%  main.CgoPprof
         0     0%   100%       20ms 66.67%  main._Cfunc_cpuHog
         0     0%   100%       20ms 66.67%  main.main
         0     0%   100%       10ms 33.33%  runtime._System
         0     0%   100%       20ms 66.67%  runtime.cgocall
         0     0%   100%       20ms 66.67%  runtime.main
20ms of 20ms total (  100%)
      flat  flat%   sum%        cum   cum%
      20ms   100%   100%       20ms   100%  cpuHog
         0     0%   100%       20ms   100%  main.CgoPprof
         0     0%   100%       20ms   100%  main._Cfunc_cpuHog
         0     0%   100%       20ms   100%  main.main
         0     0%   100%       20ms   100%  runtime.cgocall
         0     0%   100%       20ms   100%  runtime.main

This is the top from profile I've attached to issue's description:

60ms of 60ms total (  100%)
      flat  flat%   sum%        cum   cum%
      40ms 66.67% 66.67%       40ms 66.67%  runtime._ExternalCode
      20ms 33.33%   100%       20ms 33.33%  cpuHog
         0     0%   100%       20ms 33.33%  main.CgoPprof
         0     0%   100%       20ms 33.33%  main._Cfunc_cpuHog
         0     0%   100%       20ms 33.33%  main.main
         0     0%   100%       40ms 66.67%  runtime._System
         0     0%   100%       20ms 33.33%  runtime.cgocall
         0     0%   100%       20ms 33.33%  runtime.main

Also, if I run it like go tool pprof -top $(GOMAXPROCS=2 ./testprogcgo CgoPprof)(note GOMAXPROCS), then I see this from time to time:

60ms of 60ms total (  100%)
      flat  flat%   sum%        cum   cum%
      40ms 66.67% 66.67%       40ms 66.67%  runtime._ExternalCode
      20ms 33.33%   100%       20ms 33.33%  cpuHog
         0     0%   100%       20ms 33.33%  main.CgoPprof
         0     0%   100%       20ms 33.33%  main._Cfunc_cpuHog
         0     0%   100%       20ms 33.33%  main.main
         0     0%   100%       40ms 66.67%  runtime._System
         0     0%   100%       20ms 33.33%  runtime.cgocall
         0     0%   100%       20ms 33.33%  runtime.main
@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jan 30, 2017

Interesting. I never see any instances of runtime._ExternalCode when I try it myself. From looking at the code in sigprof in runtime/proc.go, I think it must be that for some reason no traceback is available and pc > firstmoduledata.etext is true. A high value for pc suggests that the program is running in libc, but if that is so the function registered by runtime.SetCgoTraceback ought to be providing a simple traceback. I'm not sure what is happening.

@ianlancetaylor ianlancetaylor added this to the Go1.8Maybe milestone Jan 30, 2017

@codesenberg

This comment has been minimized.

Copy link
Contributor Author

commented Jan 31, 2017

Actually, with go1.7.5 tests pass just fine. I've collected some data using this script:

go build && rm prof.txt &&
 for i in {1..1000}; do
   go tool pprof -nodecount 1 -top $(GOMAXPROCS=<N> ./testprogcgo CgoPprof) >> prof.txt;
 done
go version -> 3e55059 3e55059 3e55059 go1.7.5 go1.7.5 go1.7.5
GOMAXPROCS 1 2 4 1 2 4
cpuHog 918 459 437 990 902 922
runtime._ExternalCode 77 541 563 10 98 78
runtime.usleep 5 0 0 0 0 0

So, with 3e55059 and higher GOMAXPROCS runtime._ExternalCode appears as top entry more frequently. Which is also true for go1.7.5, but to a lesser extent. @ianlancetaylor, looks like regression to you?

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jan 31, 2017

Yes, it looks like a regression, but I still can't recreate it and I still have no idea why it is happening for you.

Can you find out if there are any calls to runtime.sigprofNonGoPC when you run the test? I don't think there should be any, and I don't know why there would be any, but that function is new in 1.8 and calls to it could cause this result.

@codesenberg

This comment has been minimized.

Copy link
Contributor Author

commented Jan 31, 2017

Can you find out if there are any calls to runtime.sigprofNonGoPC when you run the test?

Maybe, I did something wrong, but after applying this patch, running make.bash again and rebuilding testprogcgo no new messages appeared in console during runs.

diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index f41672d..9f1859d 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -3379,7 +3379,9 @@ func sigprofNonGo() {
 //go:nosplit
 //go:nowritebarrierrec
 func sigprofNonGoPC(pc uintptr) {
+	println("runtime.sigprofNonGoPC called")
 	if prof.hz != 0 {
+		println("runtime.sigprofNonGoPC \"hz != 0\" branch")
 		pc := []uintptr{
 			pc,
 			funcPC(_ExternalCode) + sys.PCQuantum,
@@ -3390,6 +3392,7 @@ func sigprofNonGoPC(pc uintptr) {
 			osyield()
 		}
 		if prof.hz != 0 {
+			println("runtime.sigprofNonGoPC \"cpuprof.addNonGo(pc)\" branch")
 			cpuprof.addNonGo(pc)
 		}
 		atomic.Store(&prof.lock, 0)
@codesenberg

This comment has been minimized.

Copy link
Contributor Author

commented Feb 1, 2017

@ianlancetaylor, I dug a bit and tried another patch. Here it is:

diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index f41672d..317fae7 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -3174,14 +3174,23 @@ func _System()       { _System() }
 func _ExternalCode() { _ExternalCode() }
 func _GC()           { _GC() }
 
+var (
+	Sigprof, SigprofHz0, SigprofPCMoreFMD                     bool
+	SigprofNonGo                                              bool
+	SigprofNonGoPC, SigprofNonGoPCHz0, SigprofNonGoPCAddNonGo bool
+)
+
 // Called if we receive a SIGPROF signal.
 // Called by the signal handler, may run during STW.
 //go:nowritebarrierrec
 func sigprof(pc, sp, lr uintptr, gp *g, mp *m) {
+	Sigprof = true
 	if prof.hz == 0 {
 		return
 	}
 
+	SigprofHz0 = true
+
 	// Profiling runs concurrently with GC, so it must not allocate.
 	// Set a trap in case the code does allocate.
 	// Note that on windows, one thread takes profiles of all the
@@ -3317,6 +3326,7 @@ func sigprof(pc, sp, lr uintptr, gp *g, mp *m) {
 			n = 2
 			// "ExternalCode" is better than "etext".
 			if pc > firstmoduledata.etext {
+				SigprofPCMoreFMD = true
 				pc = funcPC(_ExternalCode) + sys.PCQuantum
 			}
 			stk[0] = pc
@@ -3354,6 +3364,7 @@ var sigprofCallersUse uint32
 //go:nosplit
 //go:nowritebarrierrec
 func sigprofNonGo() {
+	SigprofNonGo = true
 	if prof.hz != 0 {
 		n := 0
 		for n < len(sigprofCallers) && sigprofCallers[n] != 0 {
@@ -3379,7 +3390,9 @@ func sigprofNonGo() {
 //go:nosplit
 //go:nowritebarrierrec
 func sigprofNonGoPC(pc uintptr) {
+	SigprofNonGoPC = true
 	if prof.hz != 0 {
+		SigprofNonGoPCHz0 = true
 		pc := []uintptr{
 			pc,
 			funcPC(_ExternalCode) + sys.PCQuantum,
@@ -3390,6 +3403,7 @@ func sigprofNonGoPC(pc uintptr) {
 			osyield()
 		}
 		if prof.hz != 0 {
+			SigprofNonGoPCAddNonGo = true
 			cpuprof.addNonGo(pc)
 		}
 		atomic.Store(&prof.lock, 0)
diff --git a/src/runtime/testdata/testprogcgo/pprof.go b/src/runtime/testdata/testprogcgo/pprof.go
index 4460b93..64486ae 100644
--- a/src/runtime/testdata/testprogcgo/pprof.go
+++ b/src/runtime/testdata/testprogcgo/pprof.go
@@ -94,4 +94,11 @@ func CgoPprof() {
 	}
 
 	fmt.Println(name)
+	fmt.Fprintln(os.Stderr, "Sigprof                ", runtime.Sigprof)
+	fmt.Fprintln(os.Stderr, "SigprofHz0             ", runtime.SigprofHz0)
+	fmt.Fprintln(os.Stderr, "SigprofPCMoreFMD       ", runtime.SigprofPCMoreFMD)
+	fmt.Fprintln(os.Stderr, "SigprofNonGo           ", runtime.SigprofNonGo)
+	fmt.Fprintln(os.Stderr, "SigprofNonGoPC         ", runtime.SigprofNonGoPC)
+	fmt.Fprintln(os.Stderr, "SigprofNonGoPCHz0      ", runtime.SigprofNonGoPCHz0)
+	fmt.Fprintln(os.Stderr, "SigprofNonGoPCAddNonGo ", runtime.SigprofNonGoPCAddNonGo)
 }

I've observed following things. Given the following output:

Sigprof                 true     <- always true
SigprofHz0              true     <- always true
SigprofPCMoreFMD        true     <- this is always true, when I see _runtime.ExternalCode in profile
SigprofNonGo            false    <- this and next four are always false
SigprofNonGoPC          false
SigprofNonGoPCHz0       false
SigprofNonGoPCAddNonGo  false

Hope this helps.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Feb 1, 2017

Thanks. I do not understand what is different about your system that causes that code path to happen so frequently.

@codesenberg

This comment has been minimized.

Copy link
Contributor Author

commented Feb 1, 2017

@ianlancetaylor, I just switched paravirtualization interface to KVM on VirtualBox VM inside of which this tests failed and miraculously runtime._ExternalCode vanished from profiles. Now all tests pass.

@rsc rsc modified the milestones: Go1.9Maybe, Go1.8Maybe Feb 7, 2017

@aclements

This comment has been minimized.

Copy link
Member

commented Jul 18, 2017

@codesenberg, @ianlancetaylor, is there anything more to do on this issue? It sounds like maybe it was a KVM problem?

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jul 18, 2017

I never understood this issue. I'm fine with closing it.

@codesenberg

This comment has been minimized.

Copy link
Contributor Author

commented Jul 19, 2017

Yeah, looks like an issue with VirtualBox to me. I don't think there is anything to be done, so, closing it.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
You can’t perform that action at this time.