runtime/pprof: Linux CPU profiles inaccurate beyond 250% CPU use #35057
Comments
@rsc @aclements @dvyukov If you are on Linux, you can try Linux perf. The pprof tool understands perf input as well. |
Can you get some more information from the signal_generate tracepoint? Specifically, the thread ID to see if #14434 is in play, and the result to see how often it's TRACE_SIGNAL_ALREADY_PENDING. My (ill-informed) guess is that the delivery is hugely biased toward one thread, and a lot of the signals are getting dropped as a result of the overload. |
I was able to get a lot of info from the signal_generate tracepoint. I don't see significant bias towards particular threads. I see a few dozen signal_generate events being generated in a very short time window (tens of microseconds). This repeats every 4ms (at 250 Hz). The first in the burst has res=0, and subsequent ones have res=2 until a signal_deliver event shows up. The signal_generate event following that will once again have res=0. It looks like the generation of SIGPROF events is racing against their delivery. The test ran for 10 seconds on 96 vCPUs, with 10108 samples in the resulting CPU profile.
Looking at a small sample of threads, each got about 100 signal_generate events with res=0 and about 850 with res=2.
The 95782 signal_generate events were split across 97 threads, with between 846 and 1122 events for each thread (only modest skew).
The signal_generate events with res=0 are also split across 97 threads with between 68 and 136 events per thread. There are 10108 of them, an exact match for the count of samples in the generated CPU profile.
Here's the distribution of the inter-arrival times of signal_generate events for a particular thread:
And the inter-arrival times of signal_generate events with res=0:
The signal_generate events come in bursts. The median time between them is 2µs. The bursts are 4ms apart.
The bursts of signal_generate events from the kernel seem to be racing against the application's signal handler. After a signal_deliver event, the next signal_generate will have res=0 but all others will have res=2.
After a signal_generate event with res=0, the median delay before the next signal_deliver event is 10µs:
|
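For anyone reproducing this analysis, here is one way to count the samples in a finished CPU profile so the total can be compared against the kernel's signal_generate count, as a sketch assuming the github.com/google/pprof/profile package and a placeholder file name (cpu.pb.gz):

```go
// Sketch: sum the "samples/count" values in a finished CPU profile so the
// total can be compared against the kernel's signal counts. The package and
// file name here are assumptions, not part of the original report.
package main

import (
	"fmt"
	"os"

	"github.com/google/pprof/profile"
)

func main() {
	f, err := os.Open("cpu.pb.gz")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	p, err := profile.Parse(f)
	if err != nil {
		panic(err)
	}

	// Find the "samples/count" value index; CPU profiles also carry a
	// "cpu/nanoseconds" value.
	idx := 0
	for i, st := range p.SampleType {
		if st.Type == "samples" && st.Unit == "count" {
			idx = i
		}
	}

	var total int64
	for _, s := range p.Sample {
		total += s.Value[idx]
	}
	fmt.Println("samples in profile:", total)
}
```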
It may be time for me to let more knowledgeable people handle this, but I'm confused by the raw trace text. It looks like the process is only allowed to have one outstanding SIGPROF at a time across all threads? Can that be right? |
That's what it looks like to me too. Maybe there's a way to have that limit apply to each thread rather than the process as a whole? CC @ianlancetaylor for expertise on signals. |
From http://man7.org/linux/man-pages/man7/signal.7.html I read
I don't know how SIGPROF-based profiling is supposed to work reliably on multi-core, multi-threaded systems. If this is true and I'm not misreading it, it sounds to me like we need to adjust our sampling frequency based on the available cores and the number of threads. |
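Purely as an illustration of the kind of adjustment that comment describes (this is not something the runtime does today; the 250 Hz figure and the division by GOMAXPROCS are assumptions drawn from the discussion in this issue), a sketch:

```go
package profclamp // hypothetical package, for illustration only

import "runtime"

// clampProfileHz caps a requested process-wide SIGPROF rate so that, spread
// across GOMAXPROCS busy threads, it stays under an assumed kernel tick rate
// of 250 Hz. A real change would need the running kernel's CONFIG_HZ value
// rather than a hard-coded constant.
func clampProfileHz(requestedHz int) int {
	maxHz := 250 / runtime.GOMAXPROCS(0)
	if maxHz < 1 {
		maxHz = 1
	}
	if requestedHz > maxHz {
		return maxHz
	}
	return requestedHz
}
```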
cc @aalexand and @rauls5382 for advice on how SIGPROF handling works in C++ and other languages. |
In #14434, @aclements pointed out an approach that applications can use themselves. Put another way, does each thread have its own bitmask / size-one-queue of pending signals? Here are some options I see that build on each other, in increasing risk/complexity:
Is that the right path forward? |
This is a known and unfortunate shortcoming of SIGPROF. See https://elinux.org/Kernel_Timer_Systems. Also http://hpctoolkit.org/man/hpcrun.html#section_15: "On Linux systems, the kernel will not deliver itimer interrupts faster than the unit of a jiffy, which defaults to 4 milliseconds; see the itimer man page." It's a bit strange that this extends to per-core timers, but apparently it's the case. There is Google-internal b/69814231 that tracks some investigations around that. I think the options are:
|
Thanks for the context, @aalexand. I tried out the three options I proposed yesterday:
|
|
Change https://golang.org/cl/204279 mentions this issue: |
Work that comes in bursts—causing the process to spend more than 10ms of CPU time in one kernel tick—is systematically under-sampled. I've updated the reproducer to do work in a single goroutine, and then to do the same amount of work across GOMAXPROCS goroutines. In the resulting profile with go1.13.3 on a 96-vCPU machine, samples from when the process used only one goroutine make up 90.5% of the profile. The work done in parallel appears in only 9.3% of the samples. I'd expect both to be close to 50% / 1:1 rather than that 9.7:1 skew ratio. It seems like as long as profiling on Linux uses setitimer, the sample rate has to balance low resolution against that bias. The workaround for low-resolution profiles is to collect profiles over a longer time. I don't know of a practical workaround for user programs. (List threads, get the clock for each, change calls to runtime/pprof.StartCPUProfile to create and enable each thread's timer and then dial down the setitimer rate to 1 Hz, then fix up any disagreement in nanoseconds-per-sample in the resulting profile, then deactivate the per-thread timers.) Is the project willing to trade away resolution to get lower bias? If so, how much? updated reproducer showing 9.7:1 skew
|
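A minimal sketch of the serial-then-parallel shape described above, just to make the comparison concrete (this is not the reproducer attached to that comment; the work sizes and file name are illustrative):

```go
// The same total amount of CPU-bound work is done once on a single goroutine
// and once spread across GOMAXPROCS goroutines, while a CPU profile is being
// collected. With the bias described in this issue, the serial phase is
// over-represented in the resulting profile.
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
	"sync"
)

// spin burns CPU; the result is returned so the loop isn't optimized away.
func spin(n int) (x int) {
	for i := 0; i < n; i++ {
		x += i * i
	}
	return x
}

func main() {
	f, err := os.Create("cpu.pb.gz")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		panic(err)
	}
	defer pprof.StopCPUProfile()

	const totalWork = 2_000_000_000

	// Phase 1: all of the work on one goroutine.
	spin(totalWork)

	// Phase 2: the same total work split across GOMAXPROCS goroutines.
	procs := runtime.GOMAXPROCS(0)
	var wg sync.WaitGroup
	for i := 0; i < procs; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			spin(totalWork / procs)
		}()
	}
	wg.Wait()
}
```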
I have a real-world example of this type of work: the garbage collection done by the runtime. One of the apps I work with runs on a machine with a large number of hyperthreads, but typically does only a small amount of application-specific work. A CPU profile showed the GC accounting for roughly 4% of the process's CPU time, while an execution trace of the same app showed GC work closer to 20%. The 4% vs 20% figures don't account for edge effects (the trace covers 3 GC runs and 4 long periods of application work), or for cases where the Go runtime believed it had scheduled a goroutine to a thread but the kernel had in fact suspended the thread. But a 5x difference in the two reports is significant, it aligns with the behavior I've described in this issue, and it's in a workload that affects any Go program that 1) uses the GC, 2) has many hyperthreads available for use, 3) is provisioned to use less than 100% CPU. |
It looks like there's a way to give each thread its own profiling timer instead of sharing the single process-wide setitimer. There's more:
If we use per-thread timers for the threads the Go runtime creates, the result would be profiles of work done on those threads that are more accurate (not cutting off at 250% CPU usage) and precise (not under-sampling GC work on large, mostly-idle machines by a factor of 5), while keeping the current behavior for non-Go-runtime-created threads. Improving profiles for work done on threads that the Go runtime did not create (or that are otherwise not known to the runtime) could come later, if there's interest. I'm working on code to make this happen, but it's resulting in a lot of new branching between Linux and the rest of the Unix support; there's a lot that was shared in src/runtime/signal_unix.go that's now different. If you're reading this and can think of other caveats or blockers that might keep that contribution out of the tree, or reasons that branching wouldn't be allowed, I'd be interested to hear and discuss them. Thanks! |
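For readers unfamiliar with the per-thread clocks involved, the sketch below (assuming the golang.org/x/sys/unix package) reads CLOCK_THREAD_CPUTIME_ID, the per-thread CPU clock that such per-thread timers would be driven by; it does not create a timer, it only reads the clock:

```go
// Rough illustration: CLOCK_THREAD_CPUTIME_ID measures CPU time consumed by
// the calling OS thread only, unlike the process-wide accounting behind
// setitimer(ITIMER_PROF). Assumes golang.org/x/sys/unix.
package main

import (
	"fmt"
	"runtime"

	"golang.org/x/sys/unix"
)

func main() {
	// Pin this goroutine to its OS thread so the reading stays meaningful.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	busy := 0
	for i := 0; i < 50_000_000; i++ {
		busy += i
	}

	var ts unix.Timespec
	if err := unix.ClockGettime(unix.CLOCK_THREAD_CPUTIME_ID, &ts); err != nil {
		panic(err)
	}
	fmt.Printf("this thread has used %d.%09ds of CPU (busy=%d)\n", ts.Sec, ts.Nsec, busy)
}
```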
What version of Go are you using (`go version`)?

Does this issue reproduce with the latest release?

Yes, this is present in Go 1.13 and in current tip with Linux 4.14. It seems to exist even back in Go 1.4, and with Linux 3.2 and 2.6.32.
Am I holding it wrong?
Around Go 1.6, a lot of the tests in runtime/pprof were called out as being flaky. It looks like around that same time, the builders got an overhaul. Maybe they moved to machines with more CPU cores than before, and the increase in flakiness was due to some SIGPROF deliveries being skipped?
The tests in runtime/pprof, both now and around Go 1.6, seem to compare parts of the profile to each other, but not to the CPU usage reported by the operating system. If this is a real bug, those tests would not have discovered it.
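A sketch of the kind of cross-check described here, comparing OS-reported CPU time (via getrusage) against what a profile implies; the names are illustrative, not code from the runtime/pprof tests:

```go
// Compare the CPU time the OS says the process consumed with the CPU time
// implied by a profile's sample count. The workload placeholder and names
// are hypothetical.
package main

import (
	"fmt"
	"syscall"
	"time"
)

func cpuTimeFromRusage() time.Duration {
	var ru syscall.Rusage
	if err := syscall.Getrusage(syscall.RUSAGE_SELF, &ru); err != nil {
		panic(err)
	}
	return time.Duration(ru.Utime.Nano()) + time.Duration(ru.Stime.Nano())
}

func main() {
	before := cpuTimeFromRusage()
	// ... run the profiled workload here ...
	after := cpuTimeFromRusage()

	osCPU := after - before
	// With 100 Hz sampling, each profile sample represents 10ms of CPU time,
	// so samples*10ms should roughly equal osCPU if no signals were lost.
	fmt.Println("OS-reported CPU time:", osCPU)
}
```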
What operating system and processor architecture are you using (`go env`)?

I'm compiling on darwin/amd64 and running on linux/amd64.

`go env` Output

What did you do?
I looked at a CPU profile for a Go program running on a Linux machine where `top` reported 20 cores of CPU usage (2000%), but the profile showed about 240% / 2.4 cores of usage.

I ran a test with the `-test.cpuprofile` flag on a Linux machine and compared the results of the `time` shell built-in with `go tool pprof`'s view of the usage. I varied the rate that the program asked the kernel to deliver SIGPROF and found that the two measurements agreed on the number of CPU cycles spent as long as there were fewer than 250 SIGPROF deliveries per second.

I ran the test under `perf stat -e 'signal:*'` and found that its count of `signal:signal_generate` events lined up with the number of SIGPROF deliveries I'd expect, that its count of `signal:signal_deliver` events lined up with the number of samples in the CPU profile, and that the two matched well only when the "generate" rate was less than 250 samples per second.

Here, the test uses 96 vCPUs of a machine with 96 hyperthreads for 10 seconds, using the Go runtime's default profile rate of 100 Hz. The Linux kernel generates slightly less than 96,000 signals (which are probably all SIGPROF). The `time` built-in reports slightly less than 16 minutes (960 seconds) of "user" CPU. That's good.

The resulting profile shows 10.20 seconds of wall-clock time and 1.61 minutes (about 96.6 seconds) of CPU time, or about 9660 samples at 100 Hz. That's close to the number of signals that the kernel reports it delivered to the program, but that doesn't match the number generated by the kernel or the actual CPU time spent.
Calling `runtime.SetCPUProfileRate` with "2 Hz" right before the testing package's CPU profile starts lets me dial the profile rate down to less than 250 Hz process-wide. (The warning message seems harmless in this case.) This leads to the kernel's measurements of "signal:signal_generate" and "signal:signal_deliver" matching each other, and for `go tool pprof`'s measurement of "15.94mins" to come very close to what the `time` built-in sees at "user 15m57.048s".

I confirmed that the kernel was configured with high-resolution timers as recommended in #13841.
I've seen this effect both on virtual machines and on physical hardware. (Most of my follow-up testing has taken place on virtual machines.)
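The rate-lowering workaround described above, expressed as a sketch (the 2 Hz figure is the one used in this report; the init placement and package name are illustrative):

```go
package foo_test // hypothetical test package

import "runtime"

func init() {
	// Runs before the testing package starts the CPU profile for
	// -test.cpuprofile. Per the report above, the lower rate stays in
	// effect and the warning printed when the profile starts is harmless.
	runtime.SetCPUProfileRate(2)
}
```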
What did you expect to see?

I expected the number of seconds of CPU time reported by `go tool pprof` to align with the number of seconds of CPU time observed by the kernel.

When I run `go tool pprof`, I expect the time reported in the "Duration" line (like "Duration: 5.11s, Total samples = 8.50s (166.40%)") to match what I'd see from looking at a tool like `top` at the same time.

What did you see instead?
The Linux kernel seems to drop SIGPROF events when they come more than 250 times per second. I don't know if it drops them fairly—the profiles might be skewed.
Open questions
Is there a simple setting that my coworkers and I are missing? I've reproduced this with vanilla machine images for Ubuntu and Amazon Linux 2.
Is the right move for `runtime.SetCPUProfileRate` to limit its input to `250 / GOMAXPROCS`?

Does the number "250" come from Linux's `CONFIG_HZ_250=y` / `CONFIG_HZ=250`, and is it right for that configuration to end up compiled in to Go?

Thanks!
Here's the test program: