runtime/pprof: Linux CPU profiles inaccurate beyond 250% CPU use #35057
@rsc @aclements @dvyukov If you are on Linux, you can try Linux perf. The pprof tool understands perf input as well.
Can you get some more information from the signal_generate tracepoint? Specifically, the thread ID to see if #14434 is in play, and the result to see how often it's TRACE_SIGNAL_ALREADY_PENDING. My (ill-informed) guess is that the delivery is hugely biased toward one thread, and a lot of the signals are getting dropped as a result of the overload.
I was able to get a lot of info from the signal_generate tracepoint. I don't see significant bias towards particular threads. I see a few dozen signal_generate events being generated in a very short time window (tens of microseconds). This repeats every 4ms (at 250 Hz). The first in the burst has res=0, and subsequent ones have res=2 until a signal_deliver event shows up. The signal_generate event following that will once again have res=0. It looks like the generation of SIGPROF events is racing against their delivery. The test ran for 10 seconds on 96 vCPUs, with 10108 samples in the resulting CPU profile.
Looking at a small sample of threads, each got about 100 signal_generate events with res=0 and about 850 with res=2.
The 95782 signal_generate events were split across 97 threads, with between 846 and 1122 events for each thread (only modest skew).
The signal_generate events with res=0 are also split across 97 threads with between 68 and 136 events per thread. There are 10108 of them, an exact match for the count of samples in the generated CPU profile.
Here's what the distributions of the inter-arrival times of signal_generate events for a particular thread show (both overall, and for just the events with res=0): the events come in bursts. The median time between them is 2µs. The bursts are 4ms apart.
The bursts of signal_generate events from the kernel seem to be racing against the application's signal handler. After a signal_deliver event, the next signal_generate will have res=0 but all others will have res=2.
After a signal_generate event with res=0, the median delay before the next signal_deliver event is 10µs.
It may be time for me to let more knowledgeable people handle this, but I'm confused by the raw trace text. It looks like the process is only allowed to have one outstanding SIGPROF at a time across all threads? Can that be right?
That's what it looks like to me too. Maybe there's a way to have that limit apply to each thread rather than the process as a whole? CC @ianlancetaylor for expertise on signals.
From http://man7.org/linux/man-pages/man7/signal.7.html I read: "Standard signals do not queue. If multiple instances of a standard signal are generated while that signal is blocked, then only one instance of the signal is marked as pending (and the signal will be delivered just once when it is unblocked)."
I don't know how SIGPROF-based profiling is supposed to work reliably with multi-core, multi-threaded systems. If this is true and I don't misread this, it sounds to me like we need to adjust our sampling frequency based on the available cores and the number of threads.
cc @aalexand and @rauls5382 for advice on how sigprof handling works in c++ and other languages.
In #14434, @aclements pointed out that applications can use timer_create to deliver profiling signals to specific threads. Would that sidestep the process-wide limit; put another way, does each thread have its own bitmask / size-one-queue of pending signals? Here are some options I see that build on each other, in increasing risk/complexity:
Is that the right path forward?
This is a known and unfortunate shortcoming of SIGPROF. See https://elinux.org/Kernel_Timer_Systems. Also http://hpctoolkit.org/man/hpcrun.html#section_15: "On Linux systems, the kernel will not deliver itimer interrupts faster than the unit of a jiffy, which defaults to 4 milliseconds; see the itimer man page." It's a bit strange that this extends to per-core timers, but apparently it's the case. There is Google-internal b/69814231 that tracks some investigations around that, I think the options are:
Thanks for the context, @aalexand. I tried out the three options I proposed yesterday:
Change https://golang.org/cl/204279 mentions this issue:
Work that comes in bursts, causing the process to spend more than 10ms of CPU time in one kernel tick, is systematically under-sampled. I've updated the reproducer to do work in a single goroutine, and then to do the same amount of work across GOMAXPROCS goroutines. In the resulting profile with go1.13.3 on a 96-vCPU machine, samples from when the process used only one goroutine make up 90.5% of the profile. The work done in parallel appears in only 9.3% of the samples. I'd expect the two to be close to 1:1 rather than that 9.7:1 skew. It seems like as long as profiling on Linux uses setitimer, the sample rate has to balance low resolution against that bias. The workaround for low-resolution profiles is to collect profiles over a longer time. I don't know of a practical workaround for user programs. (List threads, get the clock for each, change calls to runtime/pprof.StartCPUProfile to create and enable each thread's timer and then dial down the setitimer rate to 1 Hz, then fix up any disagreement in nanoseconds-per-sample in the resulting profile, then deactivate the per-thread timers.) Is the project willing to trade away resolution to get lower bias? If so, how much?

Updated reproducer showing 9.7:1 skew:
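(The listing itself was in a collapsed section. As a stand-in, here's a minimal sketch of its shape, using the cpuHog1/cpuHog2 names that a later comment refers to; the iteration count and file name are placeholders, not the original code.)

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"sync"
)

// cpuHog1 does the whole batch of work serially on one goroutine.
//go:noinline
func cpuHog1(n int64) int64 {
	var acc int64
	for i := int64(0); i < n; i++ {
		acc += i ^ (i >> 1)
	}
	return acc
}

// cpuHog2 does an equal-sized slice of the work; main runs GOMAXPROCS
// copies of it in parallel so the parallel phase's total matches cpuHog1's.
//go:noinline
func cpuHog2(n int64) int64 {
	var acc int64
	for i := int64(0); i < n; i++ {
		acc += i ^ (i >> 1)
	}
	return acc
}

func main() {
	f, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	const work = 20_000_000_000 // tune so each phase runs for several seconds

	cpuHog1(work) // serial phase

	procs := runtime.GOMAXPROCS(0)
	var wg sync.WaitGroup
	for i := 0; i < procs; i++ { // parallel phase: same total amount of work
		wg.Add(1)
		go func() {
			defer wg.Done()
			cpuHog2(work / int64(procs))
		}()
	}
	wg.Wait()
	// With setitimer-based profiling the two phases burn the same CPU time,
	// yet cpuHog1 dominates the resulting profile.
}
```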
I have a real-world example of this type of work: the garbage collection done by the runtime. One of the apps I work with runs on a machine with a large number of hyperthreads, but typically does only a small amount of application-specific work. A CPU profile shows the garbage collector using about 4% of the process's CPU time, while an execution trace of the same period shows it closer to 20%. The 4% vs 20% figures don't account for edge effects (the trace covers 3 GC runs and 4 long periods of application work), or for cases where the Go runtime believed it had scheduled a goroutine to a thread but the kernel had in fact suspended the thread. But a 5x difference in the two reports is significant, it aligns with the behavior I've described in this issue, and it's in a workload that affects any Go program that 1) uses the GC, 2) has many hyperthreads available for use, 3) is provisioned to use less than 100% CPU.
It looks like timer_create can provide a separate CPU-time timer for each thread, with the resulting SIGPROF delivered directly to the thread whose timer expired. There's more:
If we use timer_create to give each thread that the Go runtime manages its own CPU-time timer, the result would be profiles of work done on threads created by the Go runtime that are more accurate (not cutting off at 250% CPU usage) and precise (not under-sampling GC work on large, mostly-idle machines by a factor of 5), while keeping the current behavior for non-Go-runtime-created threads. If there's interest in improving profiles for work done on threads that the Go runtime did not create (or that otherwise don't have an M), setitimer can stay in place to cover those. I'm working on code to make this happen, but it's resulting in a lot of new branching between Linux and the rest of the Unix support; there's a lot that was shared in src/runtime/signal_unix.go that's now different. If you're reading this and can think of other caveats or blockers that might keep that contribution out of the tree, or reasons that branching wouldn't be allowed, I'd be interested to hear and discuss them. Thanks!
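(For readers who haven't used these APIs, here's a minimal standalone demonstration of a per-thread CPU-time timer, separate from the CL. The sigevent layout, the clock and SIGEV_THREAD_ID constants, and the raw syscall numbers are for linux/amd64 and are assumptions worth checking against the kernel headers; this is a sketch, not the runtime's implementation.)

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"runtime"
	"syscall"
	"time"
	"unsafe"
)

// sigevent mirrors Linux's 64-byte struct sigevent; only the fields needed
// for SIGEV_THREAD_ID are named. Layout assumed from <asm-generic/siginfo.h>.
type sigevent struct {
	value  uintptr // sigev_value
	signo  int32   // sigev_signo
	notify int32   // sigev_notify
	tid    int32   // sigev_notify_thread_id (first member of the union)
	_      [11]int32
}

type itimerspec struct {
	interval syscall.Timespec
	value    syscall.Timespec
}

const (
	clockThreadCPUTimeID = 3   // CLOCK_THREAD_CPUTIME_ID
	sigevThreadID        = 4   // SIGEV_THREAD_ID
	sysTimerCreate       = 222 // linux/amd64 syscall numbers
	sysTimerSettime      = 223
	sysTimerDelete       = 226
)

func main() {
	runtime.LockOSThread() // keep this goroutine on one OS thread

	sigs := make(chan os.Signal, 256)
	signal.Notify(sigs, syscall.SIGPROF)

	// Ask for SIGPROF on this thread, driven by this thread's CPU clock.
	sev := sigevent{
		signo:  int32(syscall.SIGPROF),
		notify: sigevThreadID,
		tid:    int32(syscall.Gettid()),
	}
	var timerid int32
	if _, _, errno := syscall.Syscall(sysTimerCreate, clockThreadCPUTimeID,
		uintptr(unsafe.Pointer(&sev)), uintptr(unsafe.Pointer(&timerid))); errno != 0 {
		panic(errno)
	}
	period := syscall.NsecToTimespec(10e6) // 10ms: the profiler's default 100 Hz
	spec := itimerspec{interval: period, value: period}
	if _, _, errno := syscall.Syscall6(sysTimerSettime, uintptr(timerid), 0,
		uintptr(unsafe.Pointer(&spec)), 0, 0, 0); errno != 0 {
		panic(errno)
	}

	for deadline := time.Now().Add(time.Second); time.Now().Before(deadline); {
		// burn CPU on the locked thread
	}
	syscall.Syscall(sysTimerDelete, uintptr(timerid), 0, 0)

	fmt.Printf("received %d SIGPROFs; ~100 expected for 1s of CPU at 100 Hz\n", len(sigs))
}
```

Contrast with setitimer(ITIMER_PROF, ...), where the expiration signal targets the process as a whole and shares a single pending-signal slot, which is what overflows in this issue.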
Change https://golang.org/cl/324129 mentions this issue:
I did the work to split Linux profiling out from the generic Unix profiling code, and added the syscalls to all the architectures that Go+Linux supports, so CL 324129 PS3 now (1) compiles and (2) works well enough to run my reproducer test from the top of this issue. Setting GODEBUG=pproftimercreate=1 selects the new timer_create-based profiler.

Here are some results I observed on a 64-core arm64 / AArch64 machine (with 64 real cores and no hyperthreading / SMT). Recall that "cpuHog1" would be better named "work done serially on one thread", that "cpuHog2" is "work done across all available threads in parallel", and that those functions each do the same amount of total work. The CL needs some refinement (what to do when a new M starts during profiling, etc), but I think this shows that the skew in the resulting profiles is gone. CC @felixge

First with PS3 but GODEBUG=pproftimercreate=0 to get old setitimer profiling:
Then with the same executable with the new timer_create profiler via GODEBUG=pproftimercreate=1:
@rhysh this is awesome 👏. As mentioned on Twitter, I'd be happy to help with this. So on my end I put together a small standalone C project called proftest that makes it easy to compare signal delivery bias as well as max signal delivery frequencies between setitimer and timer_create.

@rhysh I know your patch still says "do not review", but if you'd like I'd be happy to do some testing and reviewing of it on my end. Additionally I could try to integrate your test case from above into the patch as well. Unless there is a reason you haven't done so yourself yet?
Thanks @felixge — the results from proftest are very encouraging! I think it may be easiest to collaborate on review, on testing, and on problem-solving. My impression of Gerrit is that it doesn't do much to enable work from multiple people on the same patch—do you have ideas on how to collaborate on the code itself without getting bogged down in mechanics? I have a bit of test code in CL 204279 to compare the magnitude of CPU time reported by the app (pprof) vs the OS (getrusage). I should update the timer_create CL to include that test, and a test of the relative magnitude of serial/parallel execution like the one at the top of this issue. No good reason I haven't done it yet, will do. CL 324129 isn't ready for detailed code review, but I'd appreciate your review of it for "ideas" and "structure", and to get problem-solving advice on some parts of it:
And maybe also:
We could just collaborate in a GitHub PR in your fork until we're ready to submit upstream. This way you could even give me access to push commits into the branch (of course we'd discuss things before).
SGTM. This is also something I could try to do on my end if it makes your life easier.
I'm not a runtime internals expert, but I'll try to add some thoughts.
LGTM, but not my area of expertise.
My understanding is that non-Go functions execute on their own stacks, and when a signal arrives the various SetCgoTraceback callbacks get invoked to do the unwinding and symbolization of those stacks. cgocall.go also has some great comments outlining the details.
Not sure about the right way, but you hinted at one possible approach already.
Tricky. My naive intuition would be to not attempt to disable the timers pro-actively, and instead check the process-wide profiling state every time a thread receives a signal, and disable timers inside of the signal handler as needed?
I think so. But ideally setitimer() would only be used to discover threads and then configure new timer_create() instances for them to make sure they are sampled with the same accuracy as Go threads.
Yeah, adding some jitter to the timers is probably a good idea to avoid signals being generated while a signal handler is active which would cause them to be dropped like we're seeing with setitimer. That being said, I can try to do some testing on this in proftest to investigate this further. I'm actually quite surprised that I haven't seen timer_create running into similar race conditions in my testing so far, but I'll have to simulate a "slow" signal handler to get a better picture. From my analysis the current signal handler code usually takes ~1usec to run, so there is a chance that the "natural" jitter of the timers not being created at precisely the same moment is already "good enough" to steer clear of this issue. Next steps for me:
Let me know what else I can do to help. Potential ideas:
Thanks for all the work you've already put into this. I'm looking forward to collaborating!
As promised, I updated proftest so it can simulate signal handlers that spend non-trivial amounts of time in a busy loop before returning. Below is an example of a signal handler that spends an average of 221 usec inside of a busy loop without dropping signals. Since generated signals don't queue, I suspect each thread's pending-signal slot is independent, so a slow handler on one thread doesn't cause another thread's timer signal to be dropped. Maybe my analysis is wrong, but if it isn't, I think we don't need to worry about introducing artificial jitter when setting up the per-thread timers.
Edit: My initial results showed the signal handler time 10x longer than it really is. So as a sanity check, I've also instrumented it via bpftrace now and corrected the data above. The conclusion is still the same.
Thanks @felixge, this is all very helpful. Your diagram of the signal handling flow control in PS4 clarifies how cgo programs are (or may be) different. You're right to call out "Write test cases for cgo code that spawns its own threads" and "Write CPU profiler test cases for cgo in case the runtime doesn't have good coverage already." Getting cgo "right" (good results, easy to understand and explain the behavior) is a big remaining hurdle. I've done some work (below) to show the current behavior; I'd especially appreciate any help you can offer on contributing test cases upstream, which doesn't seem to have any yet for cgo profiling. Thank you!

I put together a rough demo (the code and results are in the expandable "Details" sections at the bottom) to compare how the profiler measures busy loops in Go and in cgo. The test matrix I have so far compares 1/ CPU spent in Go code vs C code, 2/ in threads created by the Go runtime vs in C by pthreads, 3/ when profiling starts before vs during the work, 4/ when there's 1 thread of work vs "many" (4 threads, on my 4-core/8-hyperthread test machine), 5/ in a stable release (go1.16.3 was most convenient for me) vs with PS4, and 6/ with the new code paths in PS4 active vs inactive (via GODEBUG=pproftimercreate).

Usually I see 1 thread of work measured as "100%". The exception to that is when the work happens in C code on a thread that the Go runtime created, and the profiler started after the call into C. I'm not especially worried about this.

When there are 4 threads of work and the profiler is using setitimer (either go1.16.3, or PS4 with GODEBUG=pproftimercreate=0), the profile under-reports the work.

When there are 4 threads of work and the profiler is using timer_create (PS4 with GODEBUG=pproftimercreate=1), the profile reflects the work done on threads the Go runtime manages.

Looking at the (assembly) code involved in signal delivery, I haven't created a case yet where a thread's work would be double-counted. Getting signals from two sources means the thread has called timer_create, which means it has an M. But hitting any code path aside from the normal one seems unlikely.

Code

./cgo.go

./cgo.c

Results
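(The cgo.go and cgo.c listings and the results were in collapsed sections. As a stand-in, this is a hypothetical single-file condensation of that kind of demo: a C busy loop run once on a Go-created thread and once on a pthread the runtime never created. Names, the iteration count, and the LP64 assumption for C.long are illustrative only.)

```go
package main

/*
#cgo LDFLAGS: -pthread
#include <pthread.h>

// cBusy burns CPU in C on whatever thread calls it.
void cBusy(long n) {
	volatile long acc = 0;
	for (long i = 0; i < n; i++) {
		acc += i;
	}
}

static void *busyThread(void *arg) {
	cBusy((long)arg);
	return NULL;
}

// spawnBusyPthread runs the same busy loop on a thread the Go runtime
// did not create, then waits for it to finish.
void spawnBusyPthread(long n) {
	pthread_t tid;
	pthread_create(&tid, NULL, busyThread, (void *)n);
	pthread_join(tid, NULL);
}
*/
import "C"

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	const n = 10_000_000_000 // tune for a few seconds of work per case
	C.cBusy(C.long(n))            // C work on a thread the Go runtime created
	C.spawnBusyPthread(C.long(n)) // C work on a pthread-created thread
}
```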
For background: a complete implementation of runtime.SetCgoTraceback should let profiles report what's running on threads that are executing C code. I'm not sure quite what it means for a thread to have an M but not a G, since in general threads get their M by looking at the g in thread-local storage; if the signal handler can't find a g, it can't find an M either.
Thanks @ianlancetaylor. What I mean for a thread to "have" an M but not be able to find its G is 1/ an M has been assigned to the thread, but 2/ the mechanism the signal handler uses to find the g (and through it the M) has been broken. On linux-ppc64le, the pointer to the g struct is stored in R30. If a Go program on that platform uses C or assembly code that clobbers R30, or that breaks the TLSBSS mechanism for recovering it, it looks like when SIGPROF arrives on that thread we'd get a call to sigprofNonGo even though the thread has an M. Is that common? If a program does that, does it break in many ways or would only the profiler be affected?
I guess I'm not sure what you mean by suggesting that a C/assembly function might break the TLSBSS mechanism. Why do we care about that case? Such a program would presumably break any C signal handler that uses thread-local variables. That said, the case seems unlikely enough that I wouldn't worry about it.

As a separate matter, with the current runtime code, when a thread created by C calls into Go, an M is assigned to that thread. That M will have its own g, which the signal handler can find for as long as the M remains assigned.
I'll try to create some this week and submit them separately. Will drop the review links here.
Yeah. I should have clarified that my diagram was for linux/amd64.
Fantastic, I'll take a closer look at this as well. Sorry for the late follow-up.
Edit: On second thought, we might not need an external symbolizer at all; see my next comment.

I'm starting to work on a patch to add cgo cpu profiling test cases now. It seems that in order to make statistical assertions on the amount of time spent in different cgo threads, we'll need a way to symbolize the C functions in the resulting profiles. I'm thinking that vendoring @ianlancetaylor's cgosymbolizer for the test suite might be the easiest option. @ianlancetaylor does this seem reasonable? If taking on this dependency for the test suite is not an option we'll need something else; AFAIK nothing comparable ships with the tree. So for now I'll try @ianlancetaylor's symbolizer and see if I can turn @rhysh's test cases from above into something that fits into the current test suite. @rhysh what do you think?
On second thought, we might be able to get away without a symbolizer.
@rhysh I created a CL for the test cases we discussed. PTAL https://go-review.googlesource.com/c/go/+/334769
Change https://golang.org/cl/342052 mentions this issue:
Change https://golang.org/cl/342053 mentions this issue:
Change https://golang.org/cl/342054 mentions this issue:
Change https://golang.org/cl/342055 mentions this issue:
@rhysh thanks for all the detailed analysis on this. I think there is quite a compelling argument, and that we probably want this. I plan to review your CLs this week (sorry for the delay! Just got back from vacation.). A few notes/thoughts I have:
For what it's worth, I took a look at the kernel implementation for these timers, and I am not too concerned. For expiration and delivery, handling is nearly identical to the setitimer path. Creation of a timer is a bit more involved, but it is primarily just an allocation and installation on the proper timer list. Of course, we should still measure.
Thanks for the reviews, @prattmic, and for checking on the implementation in the kernel. I wrote a benchmark for creating, setting, and deleting a timer and found that those steps take about 1–2µs total on my linux/amd64 test machine with 8 hyperthreads / 4 physical cores and go1.17.1 and Linux in the v5.4 series. That should cover the cost of enabling the timers, and (for the few moments that each timer exists in the benchmark code) some of the cost of managing them. To measure the performance cost of the timers while they're active, I ran a benchmark from a core package (BenchmarkCodeEncoder from encoding/json, because of its use of b.RunParallel) while collecting (and discarding) a CPU profile. I compared the results with my most recent changes (CL 342055 PS 5, toggling the old and new profilers via GODEBUG=pproftimercreate). The results claim that the new timer_create profiler leads to slightly faster benchmark performance when GOMAXPROCS=1 (weird!), and slightly slower performance when GOMAXPROCS=8 (maybe due to more signals being successfully delivered?), and equivalent performance with GOMAXPROCS of 2 or 4. It also claims more memory allocated when GOMAXPROCS is 4 or 8, even though there are 0 allocs per operation. It looks like that's describing the memory used for the CPU profile, since alloc/op for that benchmark is about 1 byte when CPU profiling is disabled.
Here are the results and code for the create/settime/delete benchmark.
Benchmark code
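(The collapsed listing isn't reproduced above. A sketch of what such a benchmark could look like, reusing the linux/amd64 syscall numbers and sigevent layout assumed in the earlier demo; illustrative only, not the code that produced the numbers in this comment.)

```go
package timer_test

import (
	"syscall"
	"testing"
	"unsafe"
)

// Linux's 64-byte struct sigevent; layout assumed for linux/amd64.
type sigevent struct {
	value  uintptr
	signo  int32
	notify int32
	tid    int32 // sigev_notify_thread_id
	_      [11]int32
}

type itimerspec struct {
	interval syscall.Timespec
	value    syscall.Timespec
}

const (
	clockThreadCPUTimeID = 3   // CLOCK_THREAD_CPUTIME_ID
	sigevThreadID        = 4   // SIGEV_THREAD_ID
	sysTimerCreate       = 222 // linux/amd64 syscall numbers
	sysTimerSettime      = 223
	sysTimerDelete       = 226
)

func BenchmarkTimerCreateSettimeDelete(b *testing.B) {
	sev := sigevent{
		signo:  int32(syscall.SIGPROF),
		notify: sigevThreadID,
		tid:    int32(syscall.Gettid()), // target thread for the (never-fired) signal
	}
	// A far-off one-shot expiration, so no SIGPROF fires before the delete.
	spec := itimerspec{value: syscall.NsecToTimespec(1e18)}

	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		var timerid int32
		if _, _, errno := syscall.Syscall(sysTimerCreate, clockThreadCPUTimeID,
			uintptr(unsafe.Pointer(&sev)), uintptr(unsafe.Pointer(&timerid))); errno != 0 {
			b.Fatal(errno)
		}
		if _, _, errno := syscall.Syscall6(sysTimerSettime, uintptr(timerid), 0,
			uintptr(unsafe.Pointer(&spec)), 0, 0, 0); errno != 0 {
			b.Fatal(errno)
		}
		if _, _, errno := syscall.Syscall(sysTimerDelete, uintptr(timerid), 0, 0); errno != 0 {
			b.Fatal(errno)
		}
	}
}
```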
Benchmark results
Thanks @rhysh for the detailed benchmarks. They all look good to me. Just one question: for CodeEncoder-4 and CodeEncoder-8, I'm assuming that before your CL those benchmarks were hitting the 250% limit in profiles, and after they show accurate usage? If so, that would explain the additional memory usage, as they are (correctly) receiving more profiling signals.
That's right, before my CL the -cpu=4,8 tests hit the 250% limit. Before:
After:
Change https://golang.org/cl/351790 mentions this issue:
Updates #35057
Change-Id: Id702b502fa4e4005ba1e450a945bc4420a8a8b8c
Reviewed-on: https://go-review.googlesource.com/c/go/+/342052
Run-TryBot: Rhys Hiltner <rhys@justin.tv>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Trust: Than McIntosh <thanm@google.com>

Updates #35057
Change-Id: I56ea8f4750022847f0866c85e237a2cea40e0ff7
Reviewed-on: https://go-review.googlesource.com/c/go/+/342053
Run-TryBot: Rhys Hiltner <rhys@justin.tv>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Trust: Michael Knyszek <mknyszek@google.com>
Using setitimer on Linux to request SIGPROF signal deliveries in proportion to the process's on-CPU time results in under-reporting when the program uses several goroutines in parallel. Linux calculates the process's total CPU spend on a regular basis (often every 4ms); if the process has spent enough CPU time since the last calculation to warrant more than one SIGPROF (usually 10ms for the default sample rate of 100 Hz), the kernel is often able to deliver only one of them. With these common settings, that results in Go CPU profiles being attenuated for programs that use more than 2.5 goroutines in parallel.

To avoid in effect overflowing the kernel's process-wide CPU counter, and relying on Linux's typical behavior of having the active thread handle the resulting process-targeted signal, use timer_create to request a timer for each OS thread that the Go runtime manages. Have each timer track the CPU time of a single thread, with the resulting SIGPROF going directly to that thread.

To continue tracking CPU time spent on threads that don't interact with the Go runtime (such as those created and used in cgo), keep using setitimer in addition to the new mechanism. When a SIGPROF signal arrives, check whether it's due to setitimer or timer_create and filter as appropriate: If the thread is known to Go (has an M) and has a timer_create timer, ignore SIGPROF signals from setitimer. If the thread is not known to Go (does not have an M), ignore SIGPROF signals that are not from setitimer.

Counteract the new bias that per-thread profiling adds against short-lived threads (or those that are only active on occasion for a short time, such as garbage collection workers on mostly-idle systems) by configuring the timers' initial trigger to be from a uniform random distribution between "immediate trigger" and the full requested sample period.

Updates #35057
Change-Id: Iab753c4e5101bdc09ef9132eec84a75478e05579
Reviewed-on: https://go-review.googlesource.com/c/go/+/324129
Run-TryBot: Rhys Hiltner <rhys@justin.tv>
TryBot-Result: Go Bot <gobot@golang.org>
Trust: David Chase <drchase@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
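(As a reading aid, here's the commit's filtering rule distilled into a sketch; it is not the actual code in runtime/signal_unix.go. The si_code values, SI_KERNEL for a process-wide itimer expiry and SI_TIMER for a POSIX timer, are how the kernel distinguishes the two sources per sigaction(2).)

```go
package profsketch

// si_code values from <asm-generic/siginfo.h> (assumed; verify against headers).
const (
	siKernel int32 = 0x80 // SI_KERNEL: e.g. a process-wide setitimer expiry
	siTimer  int32 = -2   // SI_TIMER: a POSIX timer created with timer_create
)

// takeSample reports whether a SIGPROF arriving on some thread should be
// counted, following the rule described in the commit message above.
func takeSample(haveM, haveThreadTimer bool, siCode int32) bool {
	if haveM && haveThreadTimer {
		// This thread's own timer is responsible for its samples; drop the
		// process-wide setitimer signal to avoid double-counting.
		return siCode == siTimer
	}
	// Threads without their own timer (including those unknown to Go) are
	// covered by the process-wide setitimer only.
	return siCode == siKernel
}
```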
The sigprofNonGo and sigprofNonGoPC functions are only used on unix-like platforms. In preparation for unix-specific changes to sigprofNonGo, move it (plus its close relative) to a unix-specific file.

Updates #35057
Change-Id: I9c814127c58612ea9a9fbd28a992b04ace5c604d
Reviewed-on: https://go-review.googlesource.com/c/go/+/351790
Run-TryBot: Rhys Hiltner <rhys@justin.tv>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Trust: David Chase <drchase@google.com>

Updates #35057
Change-Id: I61d772a2cbfb27540fb70c14676c68593076ca94
Reviewed-on: https://go-review.googlesource.com/c/go/+/342054
Run-TryBot: Rhys Hiltner <rhys@justin.tv>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Trust: Michael Knyszek <mknyszek@google.com>
@rhysh the new TestCPUProfileMultithreadMagnitude seems to be immediately flaky on linux-amd64-longtest builders (where the 10% max difference is enforced): 2021-09-27T18:58:36-5b90958/linux-amd64-longtest. I haven't yet been able to reproduce these failures.
Actually, the new timers aren't in use on amd64 until http://golang.org/cl/342054, so it makes sense that the test was failing (i.e., it was added a bit too early). Since the CL is in, we don't expect any more failures. While we're at it, I believe this issue is now complete?
Yes, with CL 324129 (to fix on Linux for most architectures) and CL 342054 (to enable the fix on the remaining architectures), this is now complete. Thank you (and @felixge, and everyone both up-thread and elsewhere) for your help to get it done! You're right @prattmic, I put the test too early in the CL stack. Is it worth a follow-up to have the go118UseTimerCreateProfiler flag gate the test too? I'm happy to send one, either now or once the Go team is back from break. But the harm for bisects is already done, and if we have to turn the feature off it'll be obvious enough that the test needs to come out too.
I don't think we need to bother. It will be pretty obvious in tests if the flag is disabled.
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

Yes, this is present in Go 1.13 and in current tip with Linux 4.14. It seems to exist even back in Go 1.4, and with Linux 3.2 and 2.6.32.

Am I holding it wrong?

Around Go 1.6, a lot of the tests in runtime/pprof were called out as being flaky. It looks like around that same time, the builders got an overhaul. Maybe they moved to machines with more CPU cores than before, and the increase in flakiness was due to some SIGPROF deliveries being skipped?

The tests in runtime/pprof both now and around Go 1.6 seem to compare parts of the profile to itself, but not to the CPU usage reported by the operating system. If this is a real bug, those tests would not have discovered it.

What operating system and processor architecture are you using (go env)?

I'm compiling on darwin/amd64 and running on linux/amd64.

go env Output

What did you do?

I looked at a CPU profile for a Go program running on a Linux machine where top reported 20 cores of CPU usage (2000%), but the profile showed about 240% / 2.4 cores of usage.

I ran a test with the -test.cpuprofile flag on a Linux machine and compared the results of the time shell built-in with go tool pprof's view of the usage. I varied the rate that the program asked the kernel to deliver SIGPROF and found that the two measurements agreed on the number of CPU cycles spent as long as there were fewer than 250 SIGPROF deliveries per second.

I ran the test under perf stat -e 'signal:*' and found that its count of signal:signal_generate events lined up with the number of SIGPROF deliveries I'd expect, that its count of signal:signal_deliver events lined up with the number of samples in the CPU profile, and that the two matched well only when the "generate" rate was less than 250 samples per second.

Here, the test uses 96 vCPUs of a machine with 96 hyperthreads for 10 seconds, using the Go runtime's default profile rate of 100 Hz. The Linux kernel generates slightly less than 96,000 signals (which are probably all SIGPROF). The time built-in reports slightly less than 16 minutes (960 seconds) of "user" CPU. That's good.

The resulting profile shows 10.20 seconds of wall-clock time and 1.61 minutes (about 96.6 seconds) of CPU time, or about 9660 samples at 100 Hz. That's close to the number of signals that the kernel reports it delivered to the program, but that doesn't match the number generated by the kernel or the actual CPU time spent.

Calling runtime.SetCPUProfileRate with "2 Hz" right before the testing package's CPU profile starts lets me dial the profile rate down to less than 250 Hz process-wide. (The warning message seems harmless in this case.) This leads to the kernel's measurements of signal:signal_generate and signal:signal_deliver matching each other, and to go tool pprof's measurement of "15.94mins" coming very close to what the time built-in sees at "user 15m57.048s".

I confirmed that the kernel was configured with high-resolution timers as recommended in #13841.

I've seen this effect both on virtual machines and on physical hardware. (Most of my follow-up testing has taken place on virtual machines.)

What did you expect to see?

I expected the number of seconds of CPU time reported by go tool pprof to align with the number of seconds of CPU time observed by the kernel.

When I run go tool pprof, I expect the time reported in the "Duration" line (like "Duration: 5.11s, Total samples = 8.50s (166.40%)") to match what I'd see from looking at a tool like top at the same time.

What did you see instead?

The Linux kernel seems to drop SIGPROF events when they come more than 250 times per second. I don't know if it drops them fairly; the profiles might be skewed.

Open questions

Is there a simple setting that my coworkers and I are missing? I've reproduced this with vanilla machine images for Ubuntu and Amazon Linux 2.

Is the right move for runtime.SetCPUProfileRate to limit its input to 250 / GOMAXPROCS?

Does the number "250" come from Linux's CONFIG_HZ_250=y / CONFIG_HZ=250, and is it right for that configuration to end up compiled in to Go?

Thanks!

Here's the test program:
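(The attached listing was in a collapsed block and isn't reproduced above. As a stand-in, here's a minimal test along the lines described, burning CPU on every available core for 10 seconds so its -test.cpuprofile output can be compared against the time built-in; the names are hypothetical.)

```go
package repro

import (
	"runtime"
	"sync"
	"testing"
	"time"
)

// TestSigprofDelivery burns CPU on GOMAXPROCS goroutines for 10 seconds.
// Run it with -test.cpuprofile=cpu.pprof under the time built-in, then
// compare "user" CPU time against the total samples go tool pprof reports.
func TestSigprofDelivery(t *testing.T) {
	var wg sync.WaitGroup
	for i := 0; i < runtime.GOMAXPROCS(0); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			var acc int64
			for deadline := time.Now().Add(10 * time.Second); time.Now().Before(deadline); {
				acc++ // spin
			}
			_ = acc
		}()
	}
	wg.Wait()
}
```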