runtime/pprof: Linux CPU profiles inaccurate beyond 250% CPU use #35057

Open · rhysh opened this issue Oct 21, 2019 · 13 comments

@rhysh (Contributor) commented Oct 21, 2019

What version of Go are you using (go version)?

$ go1.13 version
go version go1.13.3 darwin/amd64

Does this issue reproduce with the latest release?

Yes, this is present in Go 1.13 and in current tip with Linux 4.14. It seems to exist even back in Go 1.4, and with Linux 3.2 and 2.6.32.

Am I holding it wrong?

Around Go 1.6, a lot of the tests in runtime/pprof were called out as being flaky. It looks like around that same time, the builders got an overhaul. Maybe they moved to machines with more CPU cores than before, and the increase in flakiness was due to some SIGPROF deliveries being skipped?

The tests in runtime/pprof both now and around Go 1.6 seem to compare parts of the profile to itself, but not to the CPU usage reported by the operating system. If this is a real bug, those tests would not have discovered it.

What operating system and processor architecture are you using (go env)?

I'm compiling on darwin/amd64 and running on linux/amd64.

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/rhys/Library/Caches/go-build"
GOENV="/Users/rhys/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GONOPROXY="*"
GONOSUMDB="*"
GOOS="darwin"
GOPATH="/Users/rhys/go"
GOPRIVATE="*"
GOPROXY="direct"
GOROOT="/usr/local/go"
GOSUMDB="off"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/49/zmds5zsn75z1283vtzxyfr5hj7yjq4/T/go-build343144681=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I looked at a CPU profile for a Go program running on a Linux machine where top reported 20 cores of CPU usage (2000%), but the profile showed about 240% / 2.4 cores of usage.

I ran a test with the -test.cpuprofile flag on a Linux machine and compared the results of the time shell built-in with go tool pprof's view of the usage. I varied the rate that the program asked the kernel to deliver SIGPROF and found that the two measurements agreed on the number of CPU cycles spent as long as there were fewer than 250 SIGPROF deliveries per second.

I ran the test under perf stat -e 'signal:*' and found that its count of signal:signal_generate events lined up with the number of SIGPROF deliveries I'd expect, that its count of signal:signal_deliver events lined up with the number of samples in the CPU profile, and that the two matched well only when the "generate" rate was less than 250 samples per second.


Here, the test uses 96 vCPUs of a machine with 96 hyperthreads for 10 seconds, using the Go runtime's default profile rate of 100 Hz. The Linux kernel generates slightly less than 96,000 signals (which are probably all SIGPROF). The time built-in reports slightly less than 16 minutes (960 seconds) of "user" CPU. That's good.

The resulting profile shows 10.20 seconds of wall-clock time and 1.61 minutes (about 96.6 seconds) of CPU time, or about 9660 samples at 100 Hz. That's close to the number of signals that the kernel reports it delivered to the program, but that doesn't match the number generated by the kernel or the actual CPU time spent.

$ time sudo perf stat -e 'signal:*' ./test-n -test.cpuprofile=./prof-n -test.cpu=96
PASS

 Performance counter stats for './test-n -test.cpuprofile=./prof-n -test.cpu=96':

            95,690      signal:signal_generate                                      
             9,688      signal:signal_deliver                                       

      10.211210687 seconds time elapsed


real	0m10.316s
user	15m57.840s
sys	0m0.045s
$ go tool pprof -text /tmp/test-n /tmp/prof-n 
File: test-n
Type: cpu
Time: Oct 21, 2019 at 2:16pm (PDT)
Duration: 10.20s, Total samples = 1.61mins (949.83%)
Showing nodes accounting for 1.61mins, 100% of 1.61mins total
Dropped 13 nodes (cum <= 0.01mins)
      flat  flat%   sum%        cum   cum%
  1.61mins   100%   100%   1.61mins   100%  command-line-arguments.cpuHog1
         0     0%   100%   1.61mins   100%  command-line-arguments.TestCPUProfile.func1
         0     0%   100%   1.61mins   100%  command-line-arguments.cpuHogger

Calling runtime.SetCPUProfileRate with "2 Hz" right before the testing package's CPU profile starts lets me dial the aggregate SIGPROF rate down below 250 per second process-wide (2 Hz × 96 busy threads ≈ 192 signals per second). (The warning message seems harmless in this case.) With that, the kernel's counts of "signal:signal_generate" and "signal:signal_deliver" match each other, and go tool pprof's measurement of "15.94mins" comes very close to what the time built-in sees at "user 15m57.048s".

$ time sudo env PROFILE_HZ=2 perf stat -e 'signal:*' ./test-n -test.cpuprofile=./prof-n -test.cpu=96
runtime: cannot set cpu profile rate until previous profile has finished.
PASS

 Performance counter stats for './test-n -test.cpuprofile=./prof-n -test.cpu=96':

             1,913      signal:signal_generate                                      
             1,913      signal:signal_deliver                                       

      10.120272912 seconds time elapsed


real	0m10.221s
user	15m57.048s
sys	0m0.378s
$ go tool pprof -text /tmp/test-n /tmp/prof-n 
File: test-n
Type: cpu
Time: Oct 21, 2019 at 2:17pm (PDT)
Duration: 10.11s, Total samples = 15.94mins (9464.52%)
Showing nodes accounting for 15.93mins, 99.90% of 15.94mins total
Dropped 1 node (cum <= 0.08mins)
      flat  flat%   sum%        cum   cum%
 15.93mins 99.90% 99.90%  15.93mins 99.90%  command-line-arguments.cpuHog1
         0     0% 99.90%  15.93mins 99.90%  command-line-arguments.TestCPUProfile.func1
         0     0% 99.90%  15.93mins 99.90%  command-line-arguments.cpuHogger

I confirmed that the kernel was configured with high-resolution timers as recommended in #13841.

$ uname -a
Linux ip-172-31-18-196.us-west-2.compute.internal 4.14.123-111.109.amzn2.x86_64 #1 SMP Mon Jun 10 19:37:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
$ cat /boot/config-$(uname -r) | grep CONFIG_HIGH_RES_TIMERS
CONFIG_HIGH_RES_TIMERS=y
$ cat /boot/config-$(uname -r) | grep ^CONFIG_HZ
CONFIG_HZ_250=y
CONFIG_HZ=250

I've seen this effect both on virtual machines and on physical hardware. (Most of my follow-up testing has taken place on virtual machines.)

What did you expect to see?

I expected the number of seconds of CPU time reported by go tool pprof to align with the number of seconds of CPU time observed by the kernel.

When I run go tool pprof, I expect the time reported in the "Duration" line (like "Duration: 5.11s, Total samples = 8.50s (166.40%)") to match what I'd see from looking at a tool like top at the same time.

What did you see instead?

The Linux kernel seems to drop SIGPROF events when they come more than 250 times per second. I don't know if it drops them fairly—the profiles might be skewed.


Open questions

Is there a simple setting that my coworkers and I are missing? I've reproduced this with vanilla machine images for Ubuntu and Amazon Linux 2.

Is the right move for runtime.SetCPUProfileRate to limit its input to 250 / GOMAXPROCS?

Does the number "250" come from Linux's CONFIG_HZ_250=y / CONFIG_HZ=250, and is it right for that configuration to end up compiled in to Go?

Thanks!


Here's the test program:

package repro

import (
	"os"
	"runtime"
	"strconv"
	"sync"
	"sync/atomic"
	"testing"
	"time"
)

var sink int64

func init() {
	hz, _ := strconv.Atoi(os.Getenv("PROFILE_HZ"))
	if hz > 0 {
		runtime.SetCPUProfileRate(hz)
	}
}

func TestCPUProfile(t *testing.T) {
	workers := runtime.GOMAXPROCS(0)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()

			var v int
			cpuHogger(cpuHog1, &v, 10*time.Second)
			atomic.StoreInt64(&sink, int64(v))
		}()
	}

	wg.Wait()
}

// cpuHogger and cpuHog1 from src/runtime/pprof/pprof_test.go

func cpuHogger(f func(x int) int, y *int, dur time.Duration) {
	// We only need to get one 100 Hz clock tick, so we've got
	// a large safety buffer.
	// But do at least 500 iterations (which should take about 100ms),
	// otherwise TestCPUProfileMultithreaded can fail if only one
	// thread is scheduled during the testing period.
	t0 := time.Now()
	accum := *y
	for i := 0; i < 500 || time.Since(t0) < dur; i++ {
		accum = f(accum)
	}
	*y = accum
}

// The actual CPU hogging function.
// Must not call other functions nor access heap/globals in the loop,
// otherwise under race detector the samples will be in the race runtime.
func cpuHog1(x int) int {
	foo := x
	for i := 0; i < 1e5; i++ {
		if foo > 0 {
			foo *= foo
		} else {
			foo *= foo + 1
		}
	}
	return foo
}
@hyangah (Contributor) commented Oct 22, 2019

@rsc @aclements @dvyukov
Many have reported various limitations of pprof-based CPU profiling (#13841, #14434, ...).
In addition to that, collecting CPU profiles with runtime/pprof requires extra CPU resources from the process (and there is no special scheduling priority for profile collection, afaik), so personally I avoid trusting the CPU profiles from overloaded processes.

If you are on Linux, you can try Linux perf. The pprof tool understands perf input as well.

@heschik (Contributor) commented Oct 23, 2019

Can you get some more information from the signal_generate tracepoint? Specifically, the thread ID to see if #14434 is in play, and the result to see how often it's TRACE_SIGNAL_ALREADY_PENDING. My (ill-informed) guess is that the delivery is hugely biased toward one thread, and a lot of the signals are getting dropped as a result of the overload.

@rhysh (Contributor, Author) commented Oct 23, 2019

I was able to get a lot of info from perf trace --no-syscalls -e 'signal:*'. It looks like each thread is getting roughly its fair share of both signal_generate and signal_deliver events, but that often the signal_generate events have res=2 (which I take to mean TRACE_SIGNAL_ALREADY_PENDING).

I don't see significant bias towards particular threads.

I see a few dozen signal_generate events being generated in a very short time window (tens of microseconds). This repeats every 4ms (at 250 Hz). The first in the burst has res=0, and subsequent ones have res=2 until a signal_deliver event shows up. The signal_generate event following that will once again have res=0. It looks like the generation of SIGPROF events is racing against their delivery.


The test ran for 10 seconds on 96 vCPUs, with 10108 samples in the resulting CPU profile.

$ sudo perf trace --no-syscalls -e 'signal:*' ./test-n -test.cpuprofile=./prof-n -test.cpu=96 &> ./trace-100-96
$ go tool pprof -raw ./test-n ./prof-n | sed -n -e '/^Samples/,/^Locations/ p' | awk '{if ($1 ~ /^[0-9]+$/) {s += $1}} END {print s}'
10108
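For what it's worth, the same count can be computed with the github.com/google/pprof/profile package instead of the sed/awk pipeline. This is a minimal sketch, assuming ./prof-n is the profile written by the run above and that value index 0 is the samples/count series of a CPU profile.

package main

import (
	"fmt"
	"os"

	"github.com/google/pprof/profile"
)

func main() {
	f, err := os.Open("./prof-n") // profile written by the run above
	if err != nil {
		panic(err)
	}
	defer f.Close()

	p, err := profile.Parse(f) // handles the gzipped pprof format
	if err != nil {
		panic(err)
	}

	// Sum the sample counts, mirroring the sed/awk pipeline above.
	var total int64
	for _, s := range p.Sample {
		total += s.Value[0] // index 0 is "samples/count" in a CPU profile
	}
	fmt.Println("total samples:", total)
}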

Looking at a small sample of threads, each got about 100 signal_generate events with res=0 and about 850 with res=2.

$ cat ./trace-100-96 | grep generate | awk '{$1="" ; print $0}' | sort | uniq -c | tail
  95  signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37086 grp=1 res=0)
 887  signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37086 grp=1 res=2)
 125  signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37087 grp=1 res=0)
 812  signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37087 grp=1 res=2)
 100  signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37088 grp=1 res=0)
 850  signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37088 grp=1 res=2)
 117  signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37089 grp=1 res=0)
 851  signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37089 grp=1 res=2)
 106  signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37090 grp=1 res=0)
 888  signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37090 grp=1 res=2)

The 95782 signal_generate events were split across 97 threads, with between 846 and 1122 events for each thread (only modest skew).

$ cat ./trace-100-96 | grep generate | awk '{print $6}' | sort | uniq -c | awk '{print $1}' | dist
N 97  sum 95782  mean 987.443  gmean 986.417  std dev 45.2544  variance 2047.96

     min 846
   1%ile 850.293
   5%ile 912.8
  25%ile 962.667
  median 984
  75%ile 1008
  95%ile 1075.6
  99%ile 1117.09
     max 1122

⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣠⠖⠒⠒⠦⠤⠤⠤⠖⠒⢦⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.012
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⡤⠚⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠢⣀⣠⠤⠤⢄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠖⠒⠒⠒⠒⠒⠒⠉⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠙⠲⠤⠤⠤⠤⠖⠒⠲⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
                900                 1000                 1100

The signal_generate events with res=0 are also split across 97 threads with between 68 and 136 events per thread. There are 10108 of them, an exact match for the count of samples in the generated CPU profile.

$ cat ./trace-100-96 | grep generate | grep 'res=0' | awk '{print $6}' | sort | uniq -c | awk '{print $1}' | dist
N 97  sum 10108  mean 104.206  gmean 103.117  std dev 14.9276  variance 222.832

     min 68
   1%ile 69.84
   5%ile 80.4
  25%ile 93
  median 103
  75%ile 117
  95%ile 128.8
  99%ile 134.467
     max 136

⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⣀⣀⣀⠤⠤⠤⠖⠒⠒⠒⠒⠲⠤⢄⣀⣀⣀⣀⣀⣀⣀⣀⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.026
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣠⠤⠔⠒⠋⠉⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠉⠓⠒⠒⠦⠤⣄⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠤⠴⠒⠒⠒⠒⠒⠚⠉⠉⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠉⠒⠒⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
                                100

Here's the distribution of the inter-arrival times of signal_generate events for a particular thread:

$ cat ./trace-100-96 | grep generate | grep '37090' | awk '{print $1}' | awk '{print $1-t ; t=$1}' | dist
N 994  sum 10092  mean 10.1529  gmean 7.99014  std dev 8.56788  variance 73.4086

     min 3.838
   1%ile 3.89428
   5%ile 3.93505
  25%ile 4.02
  median 7.993
  75%ile 12.057
  95%ile 24.095
  99%ile 38.8371
     max 136.01

⠀⠀⠀⠀⠀⠀⢀⠖⠲⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.185
⠀⠀⠀⠀⠀⠀⡸⠀⠀⠣⠀⢠⠞⠙⣆⠀⠀⢀⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠇⠀⠀⠘⠤⠎⠀⠀⠈⠦⠴⠉⠀⠙⠤⠴⠊⠉⠓⠤⠤⠤⠖⠦⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠁
  0                                                               50

And the inter-arrival times of signal_generate events with res=0:

$ cat ./trace-100-96 | grep generate | grep res=0 | grep '37090' | awk '{print $1}' | awk '{print $1-t ; t=$1}' | dist
N 106  sum 10080  mean 95.0942  gmean 53.0591  std dev 99.8375  variance 9967.52

     min 3.943
   1%ile 3.98505
   5%ile 7.9314
  25%ile 24.0063
  median 61.9045
  75%ile 151.981
  95%ile 307.972
  99%ile 477.82
     max 500.062

⠀⠀⠀⠀⠀⠀⠀⠀⠀⣠⠔⠒⠒⠲⢤⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.008
⠀⠀⠀⠀⠀⠀⠀⣠⠞⠁⠀⠀⠀⠀⠀⠀⠉⠒⠤⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠚⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠉⠙⠊⠉⠉⠉⠉⠉⠉⠙⠒⠒⠒⠤⠤⠤⠤⠤⠤⠤⠴⠒⠒⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠁
        0                                                     500

The signal_generate events come in bursts. The median time between them is 2µs. The bursts are 4ms apart.

$ cat ./trace-100-96 | sort -k 1 -n | grep generate | awk '{print $1}' | awk '{print $1-t ; t=$1}' | dist
N 95782  sum 10128  mean 0.10574  std dev 0.627422  variance 0.393658

     min 0
   1%ile 0.001
   5%ile 0.001
  25%ile 0.002
  median 0.002
  75%ile 0.003
  95%ile 0.005
  99%ile 3.93
     max 4.006

⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠂⠀⠀⠀⠀⠀⠀⡖ 2.578
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠠⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠴⠥⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
     -0

The bursts of signal_generate events from the kernel seem to be racing against the application's signal handler. After a signal_deliver event, the next signal_generate will have res=0 but all others will have res=2.

$ cat ./trace-100-96 | sort -k 1 -n | head -n 10000 | tail -n 30
  1016.072 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37086 grp=1 res=2)
  1016.074 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37061 grp=1 res=2)
  1016.076 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37017 grp=1 res=2)
  1016.078 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37055 grp=1 res=2)
  1016.080 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37019 grp=1 res=2)
  1016.082 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37068 grp=1 res=2)
  1016.084 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37008 grp=1 res=2)
  1016.086 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37036 grp=1 res=2)
  1016.088 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37046 grp=1 res=2)
  1016.090 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37034 grp=1 res=2)
  1016.093 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37029 grp=1 res=2)
  1016.094 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37059 grp=1 res=2)
  1016.095 signal:signal_deliver:sig=27 errno=0 code=128 sa_handler=45b970 sa_flags=1c000004)
  1016.097 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37056 grp=1 res=0)
  1016.098 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37014 grp=1 res=2)
  1016.099 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37083 grp=1 res=2)
  1016.101 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37013 grp=1 res=2)
  1016.102 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37002 grp=1 res=2)
  1016.103 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37032 grp=1 res=2)
  1016.104 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=36996 grp=1 res=2)
  1016.106 signal:signal_deliver:sig=27 errno=0 code=128 sa_handler=45b970 sa_flags=1c000004)
  1016.107 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37079 grp=1 res=0)
  1016.109 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=36999 grp=1 res=2)
  1016.111 signal:signal_deliver:sig=27 errno=0 code=128 sa_handler=45b970 sa_flags=1c000004)
  1019.995 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37013 grp=1 res=0)
  1020.002 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37014 grp=1 res=2)
  1020.006 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37005 grp=1 res=2)
  1020.009 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37051 grp=1 res=2)
  1020.011 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37059 grp=1 res=2)
  1020.012 signal:signal_generate:sig=27 errno=0 code=128 comm=test-n pid=37008 grp=1 res=2)

After a signal_generate event with res=0, the median delay before the next signal_deliver event is 10µs:

$ cat ./trace-100-96 | sort -k 1 -n | grep signal | grep -v 'res=2' | awk '{if (/deliver/) {print $1-t} else {t=$1}}' | dist
N 10108  sum 218.332  mean 0.0215999  gmean 0.0114421  std dev 0.0351222  variance 0.00123357

     min 0.001
   1%ile 0.002
   5%ile 0.002
  25%ile 0.005
  median 0.01
  75%ile 0.026
  95%ile 0.079
  99%ile 0.122
     max 1

⠀⠀⠀⠀⠀⠀⠀⡔⢢⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 73.804
⠀⠀⠀⠀⠀⠀⢰⠁⠀⠓⢄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠏⠀⠀⠀⠀⠈⠉⠓⠒⠒⠒⠲⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
      0                                      0.1
@heschik (Contributor) commented Oct 23, 2019

It may be time for me to let more knowledgeable people handle this, but I'm confused by the raw trace text. It looks like the process is only allowed to have one outstanding SIGPROF at a time across all threads? Can that be right?

@rhysh (Contributor, Author) commented Oct 23, 2019

It looks like the process is only allowed to have one outstanding SIGPROF at a time across all threads?

That's what it looks like to me too. Maybe there's a way to have that limit apply to each thread rather than the process as a whole?

CC @ianlancetaylor for expertise on signals.

@hyangah (Contributor) commented Oct 24, 2019

From http://man7.org/linux/man-pages/man7/signal.7.html I read

Standard signals do not queue. If multiple instances of a standard
signal are generated while that signal is blocked, then only one
instance of the signal is marked as pending (and the signal will be
delivered just once when it is unblocked). In the case where a
standard signal is already pending, the siginfo_t structure (see
sigaction(2)) associated with that signal is not overwritten on
arrival of subsequent instances of the same signal. Thus, the
process will receive the information associated with the first
instance of the signal.

I don't know how SIGPROF-based profiling is supposed to work reliably on multi-core, multi-threaded systems. If this is true and I'm not misreading it, it sounds to me like we need to adjust our sampling frequency based on the available cores and the number of threads.
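As a back-of-the-envelope check against the perf stat numbers earlier in the thread: 96 busy threads at the default 100 Hz generate about 96,000 SIGPROFs in 10 seconds (perf counted 95,690 signal_generate events), yet only 9,688 were delivered. A tiny calculation, with those counts taken as given:

package main

import "fmt"

func main() {
	const (
		threads   = 96    // busy threads in the test above
		rateHz    = 100   // Go's default CPU profile rate
		seconds   = 10    // test duration
		generated = 95690 // signal:signal_generate from perf stat
		delivered = 9688  // signal:signal_deliver from perf stat
	)
	fmt.Println("expected to generate:", threads*rateHz*seconds) // 96000
	fmt.Printf("delivered: %.1f%% of generated\n",
		100*float64(delivered)/float64(generated)) // about 10%
}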

@hyangah (Contributor) commented Oct 24, 2019

cc @aalexand and @rauls5382 for advice on how SIGPROF handling works in C++ and other languages.

@rhysh (Contributor, Author) commented Oct 24, 2019

In #14434, @aclements pointed out that applications can use the timer_create syscall to track time spent in individual threads with the CLOCK_THREAD_CPUTIME_ID clockid. With the Linux-specific SIGEV_THREAD_ID event, it looks like they can request that the signal be delivered to a particular thread. If each thread/M uses that to request individual cost accounting and individual signal delivery, does that result in a separate 250 Hz limit for each thread?

Put another way, does each thread have its own bitmask / size-one-queue of pending signals?

Here are some options I see that build on each other, in increasing risk/complexity:

  1. Limit the per-process delivery rate in setitimer to no more than a constant divided by GOMAXPROCS. We might always guess "250 Hz", so would allow the usual 100 Hz rate for GOMAXPROCS of 1 or 2, but would limit it to 50 Hz when GOMAXPROCS=5. (This option is sketched just below.)
  2. Instead of using setitimer to request process-wide SIGPROF delivery, use timer_create with CLOCK_PROCESS_CPUTIME_ID to request that same process-wide delivery. Then, use timer_getoverrun to count the number of missed signals and attribute them to a placeholder function in the final profile.
  3. Enhance the use of timer_create to request per-thread accounting and delivery with CLOCK_THREAD_CPUTIME_ID. Return to the default interval of 100 Hz. (This assumes each Linux thread has an independent bitmask of pending signals.)

Is that the right path forward?
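For concreteness, here is a minimal sketch of option 1's arithmetic. The clampProfileHz helper and the hard-coded 250 Hz tick are assumptions for illustration, not actual runtime code.

package main

import (
	"fmt"
	"runtime"
)

// linuxTickHz assumes the common CONFIG_HZ=250 kernel configuration.
const linuxTickHz = 250

// clampProfileHz caps the requested per-thread SIGPROF rate so that the
// process-wide rate stays under the kernel tick rate (option 1 above).
func clampProfileHz(requested int) int {
	limit := linuxTickHz / runtime.GOMAXPROCS(0)
	if limit < 1 {
		limit = 1
	}
	if requested > limit {
		return limit
	}
	return requested
}

func main() {
	// 100 stays 100 with GOMAXPROCS=2, drops to 50 with GOMAXPROCS=5,
	// and to 2 with GOMAXPROCS=96.
	fmt.Println(clampProfileHz(100))
}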

@aalexand (Contributor) commented Oct 25, 2019

This is a known and unfortunate shortcoming of SIGPROF. See https://elinux.org/Kernel_Timer_Systems. Also http://hpctoolkit.org/man/hpcrun.html#section_15:

"On Linux systems, the kernel will not deliver itimer interrupts faster than the unit of a jiffy, which defaults to 4 milliseconds; see the itimer man page."

It's a bit strange that this extends to per-core timers, but apparently it's the case.

There is a Google-internal bug (b/69814231) that tracks some investigations around this. I think the options are:

  1. Use getrusage() to compute the total CPU time consumed by the process during the profiling session, and use that to pro-rate the sample weights. This gets the overall CPU usage right, but it won't change the number of samples, so the accuracy may still suffer. Easy to implement though; see the sketch after this list.

  2. Use the CLOCK_THREAD_CPUTIME_ID mode. That's what we do (not by default though) in Stackdriver Profiler's Java agent (link).

  3. It should also be possible, I think, to use the Linux perf subsystem as the source of the timer events, but I'm not 100% sure how accurate that would be or whether there are permission implications inside Docker containers.
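A minimal sketch of the measurement half of option 1, assuming an illustrative processCPUTime helper; the pro-rating of sample weights is only indicated in a comment, not implemented.

package main

import (
	"fmt"
	"syscall"
	"time"
)

// processCPUTime returns the user+system CPU time consumed so far by this process.
func processCPUTime() time.Duration {
	var ru syscall.Rusage
	if err := syscall.Getrusage(syscall.RUSAGE_SELF, &ru); err != nil {
		return 0
	}
	toDur := func(tv syscall.Timeval) time.Duration {
		return time.Duration(tv.Sec)*time.Second + time.Duration(tv.Usec)*time.Microsecond
	}
	return toDur(ru.Utime) + toDur(ru.Stime)
}

func main() {
	before := processCPUTime()
	// ... collect a CPU profile here ...
	after := processCPUTime()

	actual := after - before
	// A profiler could scale each sample's weight by the actual CPU time
	// divided by the profile's total sampled time, so the totals agree.
	fmt.Println("CPU time during the profiling session:", actual)
}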

@rhysh (Contributor, Author) commented Oct 25, 2019

Thanks for the context, @aalexand.

I tried out the three options I proposed yesterday:

  • Scaling down the requested profile interval works well, at a cost of lower sampling frequency.
  • Using timer_create for process-wide CPU accounting probably won't work. The signals it delivers seem to get picked up by arbitrary threads—including the sleeping sysmon and template threads—rather than only those spending CPU cycles. In addition, timer_getoverrun underreports the missed samples. When I run the test program with GOMAXPROCS=96 (on a machine with 96 vCPUs) and request a 10ms CLOCK_PROCESS_CPUTIME_ID interval, the profile includes around 4,400 samples in a ten-second window (rather than 96,000) and timer_getoverrun reports around 20 missed events.
  • Using timer_create and CLOCK_THREAD_CPUTIME_ID on each thread (with SIGEV_THREAD_ID to request that the thread itself be notified) gives good results. Each thread is able to get 100 signals per second, even when there are 96 active threads. To make this work, each M would need to call timer_create itself, but I'd expect that any thread within the process would be able to arm/disarm all of the timers via timer_settime in runtime.setProcessCPUProfiler. (A rough cgo sketch of this per-thread setup follows below.)
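Here is a rough cgo sketch of that per-thread timer setup, showing only the shape of the syscalls. The glibc-style access to sigev_notify_thread_id (via the fallback macro), the -lrt link flag, and the 10ms interval are assumptions for illustration; this is not the runtime's implementation.

package main

/*
#cgo LDFLAGS: -lrt
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

// glibc has historically not exposed this field name directly.
#ifndef sigev_notify_thread_id
#define sigev_notify_thread_id _sigev_un._tid
#endif

// Create a CLOCK_THREAD_CPUTIME_ID timer that sends SIGPROF to the calling
// thread only, firing after every interval_ns of this thread's CPU time.
static int arm_thread_cpuprof_timer(long interval_ns, timer_t *out) {
	struct sigevent sev;
	memset(&sev, 0, sizeof sev);
	sev.sigev_notify = SIGEV_THREAD_ID;
	sev.sigev_signo = SIGPROF;
	sev.sigev_notify_thread_id = syscall(SYS_gettid);
	if (timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, out) != 0)
		return -1;
	struct itimerspec its;
	its.it_interval.tv_sec = 0;
	its.it_interval.tv_nsec = interval_ns;
	its.it_value = its.it_interval;
	return timer_settime(*out, 0, &its, NULL);
}
*/
import "C"

import (
	"fmt"
	"runtime"
)

func main() {
	// The timer targets the OS thread that creates it, so pin the goroutine.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	var t C.timer_t
	if C.arm_thread_cpuprof_timer(10000000, &t) != 0 { // 10ms, i.e. 100 Hz per thread
		panic("failed to arm per-thread SIGPROF timer")
	}
	fmt.Println("armed a 100 Hz CPU-time SIGPROF timer for this thread")
	// A real profiler would repeat this on every M and install a SIGPROF
	// handler; the Go runtime's existing handler ignores SIGPROF while no
	// profile is active.
}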
@aalexand (Contributor) commented Oct 25, 2019

CLOCK_THREAD_CPUTIME_ID is probably the most promising. The only caveat is handling threads that the runtime is not aware of. On Linux this could be solved by iterating over all of the process's threads via the /proc file system (this would miss short-lived threads created after profiling has started, though, so it might not always work; see the sketch below). Not sure about non-Linux systems.
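For the Linux side of that, a minimal sketch of the /proc enumeration, using an illustrative threadIDs helper; as noted above, it would still miss threads created after the scan.

package main

import (
	"fmt"
	"os"
	"strconv"
)

// threadIDs lists the TIDs of every thread currently in this process by
// reading /proc/self/task.
func threadIDs() ([]int, error) {
	entries, err := os.ReadDir("/proc/self/task")
	if err != nil {
		return nil, err
	}
	tids := make([]int, 0, len(entries))
	for _, e := range entries {
		tid, err := strconv.Atoi(e.Name())
		if err != nil {
			continue // not a TID entry
		}
		tids = append(tids, tid)
	}
	return tids, nil
}

func main() {
	tids, err := threadIDs()
	if err != nil {
		panic(err)
	}
	fmt.Println("threads in this process:", tids)
}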

@gopherbot commented Oct 30, 2019

Change https://golang.org/cl/204279 mentions this issue: runtime: limit SIGPROF rate to 250 Hz on Linux

@rhysh (Contributor, Author) commented Oct 31, 2019

Work that comes in bursts—causing the process to spend more than 10ms of CPU time in one kernel tick—is systematically under-sampled. I've updated the reproducer to do work in a single goroutine, and then to do the same amount of work across GOMAXPROCS goroutines. In the resulting profile with go1.13.3 on a 96-vCPU machine, samples from when the process used only one goroutine make up 90.5% of the profile. The work done in parallel appears in only 9.3% of the samples. I'd expect both to be close to 50% / 1:1 rather than that 9.7:1 skew ratio.

It seems like as long as profiling on Linux uses setitimer, the sample rate has to balance low resolution against that bias.

The workaround for low-resolution profiles is to collect profiles over a longer time. I don't know of a practical workaround for user programs. (List threads, get the clock for each, change calls to runtime/pprof.StartCPUProfile to create and enable each thread's timer and then dial down the setitimer rate to 1 Hz, then fix up any disagreement in nanoseconds-per-sample in the resulting profile, then deactivate the per-thread timers.)

Is the project willing to trade away resolution to get lower bias? If so, how much?


Updated reproducer showing 9.7:1 skew:
package repro

import (
	"flag"
	"os"
	"runtime"
	"strconv"
	"sync"
	"sync/atomic"
	"testing"
	"time"
)

var (
	sink      int64
	taskCount = flag.Int("task-count", 100, "")
)

func init() {
	hz, _ := strconv.Atoi(os.Getenv("PROFILE_HZ"))
	if hz > 0 {
		runtime.SetCPUProfileRate(hz)
	}
}

func TestCPUProfile(t *testing.T) {
	parallelism := runtime.GOMAXPROCS(0)
	tasks := make(chan func())

	// force creation of GOMAXPROCS threads
	var gate sync.WaitGroup
	gate.Add(1)

	var wg sync.WaitGroup
	for i := 0; i < parallelism; i++ {
		gate.Add(1)
		wg.Add(1)
		go func() {
			defer wg.Done()

			runtime.LockOSThread()
			gate.Done()
			gate.Wait()
			runtime.UnlockOSThread()

			for task := range tasks {
				task()
			}
		}()
	}

	gate.Done()

	var taskWg sync.WaitGroup
	enqueueTask := func(fn func(x int) int) {
		taskWg.Add(1)
		tasks <- func() {
			defer taskWg.Done()
			var v int
			cpuHogger(fn, &v, 0)
			atomic.StoreInt64(&sink, int64(v))
		}
	}

	// do work sequentially
	for i := 0; i < *taskCount; i++ {
		enqueueTask(cpuHog1)
		taskWg.Wait()
	}

	// do work in parallel
	for i := 0; i < *taskCount; i++ {
		enqueueTask(cpuHog2)
	}
	taskWg.Wait()

	close(tasks)
	wg.Wait()
}

// cpuHogger, cpuHog1, and cpuHog2 from src/runtime/pprof/pprof_test.go

func cpuHogger(f func(x int) int, y *int, dur time.Duration) {
	// We only need to get one 100 Hz clock tick, so we've got
	// a large safety buffer.
	// But do at least 500 iterations (which should take about 100ms),
	// otherwise TestCPUProfileMultithreaded can fail if only one
	// thread is scheduled during the testing period.
	t0 := time.Now()
	accum := *y
	for i := 0; i < 500 || time.Since(t0) < dur; i++ {
		accum = f(accum)
	}
	*y = accum
}

// The actual CPU hogging function.
// Must not call other functions nor access heap/globals in the loop,
// otherwise under race detector the samples will be in the race runtime.
func cpuHog1(x int) int {
	foo := x
	for i := 0; i < 1e5; i++ {
		if foo > 0 {
			foo *= foo
		} else {
			foo *= foo + 1
		}
	}
	return foo
}

func cpuHog2(x int) int {
	foo := x
	for i := 0; i < 1e5; i++ {
		if foo > 0 {
			foo *= foo
		} else {
			foo *= foo + 2
		}
	}
	return foo
}
$ time ./test-n -test.count=1 -test.v -test.cpuprofile=./prof-n -test.cpu=96 -task-count=1000
=== RUN   TestCPUProfile
--- PASS: TestCPUProfile (52.84s)
PASS

real	0m52.973s
user	1m50.164s
sys	0m0.096s
$ go tool pprof -text /tmp/test-n /tmp/prof-n
File: test-n
Type: cpu
Time: Oct 31, 2019 at 11:06am (PDT)
Duration: 52.97s, Total samples = 57.66s (108.86%)
Showing nodes accounting for 57.54s, 99.79% of 57.66s total
Dropped 18 nodes (cum <= 0.29s)
      flat  flat%   sum%        cum   cum%
    52.19s 90.51% 90.51%     52.19s 90.51%  command-line-arguments.cpuHog1
     5.35s  9.28% 99.79%      5.35s  9.28%  command-line-arguments.cpuHog2
         0     0% 99.79%     57.54s 99.79%  command-line-arguments.TestCPUProfile.func1
         0     0% 99.79%     57.54s 99.79%  command-line-arguments.TestCPUProfile.func2.1
         0     0% 99.79%     57.54s 99.79%  command-line-arguments.cpuHogger
$ time env PROFILE_HZ=2 ./test-n -test.count=1 -test.v -test.cpuprofile=./prof-n -test.cpu=96 -task-count=1000
=== RUN   TestCPUProfile
runtime: cannot set cpu profile rate until previous profile has finished.
--- PASS: TestCPUProfile (52.78s)
PASS

real	0m52.927s
user	1m49.887s
sys	0m0.104s
$ go tool pprof -text /tmp/test-n /tmp/prof-n
File: test-n
Type: cpu
Time: Oct 31, 2019 at 11:04am (PDT)
Duration: 52.92s, Total samples = 109.50s (206.92%)
Showing nodes accounting for 109.50s, 100% of 109.50s total
      flat  flat%   sum%        cum   cum%
    57.50s 52.51% 52.51%     57.50s 52.51%  command-line-arguments.cpuHog2
       52s 47.49%   100%        52s 47.49%  command-line-arguments.cpuHog1
         0     0%   100%    109.50s   100%  command-line-arguments.TestCPUProfile.func1
         0     0%   100%    109.50s   100%  command-line-arguments.TestCPUProfile.func2.1
         0     0%   100%    109.50s   100%  command-line-arguments.cpuHogger