runtime/pprof: details of Linux SIGPROF delivery may cause very skewed profiles #14434
I initially spotted this in gperftools, as it affects all users of SIGPROF. The problem is that SIGPROF is delivered to the process, which translates to "any thread that isn't blocking SIGPROF". Luckily for us, in practice that becomes "a thread that is running now". But if there are several running threads, something within the kernel makes it choose one thread more often than another.
The test program at https://gist.github.com/alk/568c0465f4f208196d8b makes it very easy to reproduce. This program spawns two goroutines that do nothing but burn CPU.
When profiled with perf:
$ perf record ./goprof-test; perf report
I see the correct 50/50 division of profiling ticks between the two goroutines, since on a multicore machine the Go runtime runs the two goroutines on two OS threads, which the kernel runs in parallel on two different cores.
When profiling with runtime/pprof:
$ CPUPROFILE=goprof-test-prof ./goprof-test ; pprof --web ./goprof-test ./goprof-test-prof
I see as much skew as 80/20.
This is exactly the same behavior that I've seen with gperftools (and google3's profiler).
For most programs it apparently doesn't matter. But for programs that have distinct pools of threads doing very different work, this may cause real problems. In particular, I've seen this (with gperftools) cause very skewed profiles for Couchbase's memcached binary, which has a small pool of network worker threads and another pool of IO worker threads.
In gperftools I've implemented a workaround that creates per-thread timers that "tick" on the corresponding thread's CPU time. But I don't think it's scalable enough to be made the default (another problem, though arguably specific to gperftools, is that all threads have to call ProfilerRegisterThread again). You can see my implementation at: https://github.com/gperftools/gperftools/blob/master/src/profile-handler.cc (the parts under HAVE_LINUX_SIGEV_THREAD_ID)
I've seen this behavior on FreeBSD VMs too, but don't know about other OSes.
Maybe there is a better way to avoid this skew, or maybe we should just ask the kernel folks to change SIGPROF signal delivery to avoid it. In any case, this is a bug worth tracking.
This is somewhat related to, but distinct from, #13841
When I run this program on OS X I get exactly 50/50, which is nice. When I run it on Linux I do get much more skewed results, as you say.
The runtime is actually written as though setitimer were per-thread. I am not sure why it works as well as it does given that setitimer appears to be actually per-process. In any event if there is a new per-thread timer system call to use on Linux, it seems like that would be easy to slide in. Probably not for Go 1.8.
@aclements, you had figured out some other reason pprof profiles might be very skewed, right? I thought you filed an issue but I can't find it.
I don't recall anything outside of what's already in #13841.
According to the man pages, timer_create with CLOCK_THREAD_CPUTIME_ID has been around since Linux 2.6.12. It's even kind of sort of an optional part of POSIX.
Since this issue is pretty old now, I decided to retest this against newer versions of Go. I tested Go 1.13 to 1.16 on my MBP running Docker for Mac. What I found is that:
The detailed results can be seen here: https://gist.github.com/felixge/9858a2412b61853343263a19968b54c0
Take these results with a grain of salt given the environment they were produced in, but I think it's fair to say that this is still an issue. If I had to guess why 1.14+ is doing better, I'd say it's async preemption.
I don't know if my skills will be sufficient, but I might be able to make some time for the next release cycle to take a closer look to see if I'm able to help fix this issue.