runtime/pprof: CPU profiles incorrect for kernels with broken setitimer support #13841

Open
rsc opened this issue Jan 6, 2016 · 5 comments

@rsc
Contributor

commented Jan 6, 2016

As of Go 1.6, pprof's CPU profiles are known to be incorrect on a few systems due to what are arguably kernel bugs. This issue documents those systems.

The text below distinguishes a profile being incomplete (missing profile samples for code that was running) from being incorrect (containing samples for code that wasn't running).

  • DragonflyBSD (all known versions): Delivers a profiling signal only when a thread runs continuously for an entire clock tick (20ms). In workloads with high context-switch or garbage collection rates, this may cause profiles to be incomplete.
  • Linux (without CONFIG_HIGH_RES_TIMERS=y): Delivers a profiling signal only when a thread runs continuously for an entire clock tick (often 10ms). In workloads with high context-switch or garbage collection rates, this may cause profiles to be incomplete. Most Linux kernels in use today do enable high-resolution timers and therefore do not suffer from this problem.
  • NetBSD (all known versions): Delivers signals to the wrong thread. On such systems, profiles are commonly very incorrect.
  • OpenBSD (all known versions): Delivers a profiling signal only when a thread runs continuously for an entire clock tick (20ms). In workloads with high context-switch or garbage collection rates, this may cause profiles to be incomplete.
  • OS X (fixed in OS X 10.11 El Capitan): Delivers signals to the wrong thread. On such systems, profiles are commonly very incorrect. See rsc.io/pprof_mac_fix for a workaround on systems older than 10.11.
  • Solaris (fixed in Solaris 8): Delivers a profiling signal only when a thread runs continuously for an entire clock tick (10ms). In workloads with high context-switch or garbage collection rates, this may cause profiles to be incomplete. Solaris 8 fixes the problem on systems with APIC hardware (most x86 systems). On systems that continue to exhibit the problem, adding set hires_tick = 1 to /etc/system can mitigate this problem somewhat by reducing the clock tick to 1ms.

Please comment on this issue only if the text above is incomplete or incorrect; we will keep this top-level comment up to date.

@bradfitz

This comment has been minimized.

@rsc

Contributor Author

commented Jan 6, 2016

Added Dragonfly, thanks.

@aclements

Member

commented Jan 8, 2016

FreeBSD has the same problem as OpenBSD and DragonflyBSD, but by default runs at 1000Hz, so the problem is less noticeable.

@mdempsky

Member

commented Jan 22, 2016

I don't believe the description of OpenBSD is entirely accurate. In particular, OpenBSD's "hard clock" interrupt runs at 100Hz (i.e., 10ms clock tick intervals) on all CPUs supported by Go. Also, I don't see any evidence that OpenBSD's kernel cares about whether the process was running for the full time slice; only that it was running when the interrupt fired. The process may only have actually run for the last 1ms of the clock slice, but it will still have its timer credited for the full 10ms.

Lastly, I'm not sure it's relevant to the runtime/pprof tests, but for completeness: OpenBSD sends SIGPROF to the process, not the thread. It's just that it favors sending to the running thread when possible. If the interrupted thread has blocked SIGPROF and the process has other threads that are not blocking SIGPROF, the kernel will send SIGPROF to one of those instead.

gopherbot pushed a commit that referenced this issue Feb 2, 2016

runtime/pprof: mark dragonfly and solaris as bad at pprof
Updates #13841

Change-Id: I121bce054e2756c820c76444e51357f474b7f3d6
Reviewed-on: https://go-review.googlesource.com/19161
Reviewed-by: Russ Cox <rsc@golang.org>

@tdfbsd

commented Jul 8, 2016

From Matt Dillon on Dragonfly: That one really isn't a bug. DragonFly will profile at any point (not just on a full tick), but we use a low resolution profiling timer so the collected statistics will not be very good for short tests.
