Report expected vs. actual samples count to highlight low CPU usage/availability to application #126

Open
ceeaspb opened this Issue May 10, 2016 · 6 comments

ceeaspb commented May 10, 2016

A low number of samples likely reflects low CPU usage, a short profiling duration, or similar.

But by reporting CPU cost as a percentage, we can lose sight of a low sample count.

Other sampling profilers present this warning (i.e. it is typical, and it covers off a profiling risk).

How many samples counts as low? How long is a piece of string? TBC.

Member

nitsanw commented May 10, 2016

There are two problems here:

  1. Missing samples: The timer fires an interrupt but the JVM is not actually on CPU, so it is not interrupted. I'm not sure where these interrupts go. We currently have no expectation of the number of samples, but we could start by logging the intended sampling rate and comparing against that.
  2. Failed samples: The JVM is running, a thread is interrupted, and for some reason sampling fails. I feel this is covered by #97.

ceeaspb commented May 10, 2016

@nitsanw

  1. It looks like SIGPROF will only ever fire when a thread is on CPU. Its timer counts down as user and system CPU time elapses. So if the process has no threads on CPU, the timer will not expire and no signal is sent.
    So I don't think we can "miss" a SIGPROF interrupt, so to speak (would be good to check). They may just never occur if the whole JVM is off CPU forever.

Other timers:
SIGALRM - counts down real time
SIGVTALRM - counts down user CPU time but not system CPU time. Which makes me think of something else, but I think my ticket quota for today is exceeded.

  2. Agree, this ticket isn't about failed samples.

This probably needs more thought. The risk is that you are led to optimise for CPU when you should be looking at off-CPU time (in which case you need a different tool).
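
A minimal sketch of the timer behaviour described above, assuming a Unix system and Python's `signal` module run from the main thread. The 10 ms interval and the helper names (`count_ticks`, `sleep_off_cpu`, `burn_cpu`) are made up for illustration and are not taken from the profiler. It arms `ITIMER_PROF` once around a sleeping workload and once around a busy loop, and counts the SIGPROF deliveries in each case.

```python
import signal
import time

INTERVAL_S = 0.01  # hypothetical 10 ms sampling interval

ticks = 0


def on_sigprof(signum, frame):
    # Count each SIGPROF delivery; a real profiler would capture a stack here.
    global ticks
    ticks += 1


# Install the handler once and leave it in place for the whole run.
signal.signal(signal.SIGPROF, on_sigprof)


def count_ticks(workload):
    """Run workload with ITIMER_PROF armed; return the number of SIGPROFs seen."""
    global ticks
    ticks = 0
    signal.setitimer(signal.ITIMER_PROF, INTERVAL_S, INTERVAL_S)
    try:
        workload()
    finally:
        signal.setitimer(signal.ITIMER_PROF, 0)  # disarm the timer
    return ticks


def sleep_off_cpu():
    # Wall-clock time passes, but almost no CPU time is consumed.
    time.sleep(0.3)


def burn_cpu():
    # Spin until roughly 0.3 s of process CPU time (user + system) is used.
    start = time.process_time()
    while time.process_time() - start < 0.3:
        pass


idle_ticks = count_ticks(sleep_off_cpu)
busy_ticks = count_ticks(burn_cpu)
print(f"off CPU: {idle_ticks} signals, on CPU: {busy_ticks} signals")
```

If the reading of the man page is right, the sleeping run should see close to zero signals while the busy run sees a count in the region of CPU time divided by the interval. Exact numbers depend on the kernel's timer resolution.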

Member

nitsanw commented May 10, 2016

From what you say, I take it that the timer will expire but the SIGPROF will get swallowed. This matches my observations looking at HP logs, but I would need to experiment more to validate.
Discovering the number of actual samples vs. wanted samples is easy: log the timer thread's counter of signals emitted and compare it to the number of log entries.

Member

nitsanw commented May 10, 2016

This implies the profiler stop sequence should be: terminate the timer, drain the logger, then log the number of signals the timer emitted.
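
A rough sketch of that stop ordering, written in Python purely for illustration. `SamplingTimer`, `SampleLogger` and `stop_profiler` are hypothetical stand-ins, not the profiler's actual classes. The point is only the order of the steps: once the timer is stopped and the logger drained, both counts are final and can be compared.

```python
import queue


class SampleLogger:
    """Hypothetical logger: samples are queued, then written by a drain step."""

    def __init__(self):
        self._pending = queue.Queue()
        self.written = []

    def submit(self, sample):
        self._pending.put(sample)

    def drain(self):
        # Flush everything still queued so the logged-sample count is final.
        while True:
            try:
                self.written.append(self._pending.get_nowait())
            except queue.Empty:
                return


class SamplingTimer:
    """Hypothetical timer that counts every signal it emits."""

    def __init__(self):
        self.emitted = 0
        self.running = True

    def emit(self, logger, sample=None):
        if not self.running:
            return
        self.emitted += 1
        if sample is not None:  # None models a signal that produced no sample
            logger.submit(sample)

    def terminate(self):
        self.running = False


def stop_profiler(timer, logger):
    timer.terminate()  # 1. no further signals are emitted
    logger.drain()     # 2. every queued sample reaches the log
    summary = {
        "signals_emitted": timer.emitted,
        "samples_logged": len(logger.written),
    }
    logger.written.append(summary)  # 3. the summary is the final log entry
    return summary


timer, logger = SamplingTimer(), SampleLogger()
for i in range(5):
    # Every other signal yields a sample; the rest model lost or failed samples.
    timer.emit(logger, sample=f"stack-{i}" if i % 2 == 0 else None)
summary = stop_profiler(timer, logger)
print(summary)
```

Draining before counting matters: if the summary were written first, samples still in the queue would be reported as missing.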

nitsanw changed the title from "low samples warning" to "Report expected vs. actual samples count to highlights low CPU usage/availability to application" on May 11, 2016

Member

nitsanw commented May 11, 2016

Rephrasing title to reflect discussion

nitsanw changed the title from "Report expected vs. actual samples count to highlights low CPU usage/availability to application" to "Report expected vs. actual samples count to highlight low CPU usage/availability to application" on Jun 22, 2016

Member

nitsanw commented Jun 22, 2016

@ceeaspb reading the code more closely, it looks like I misunderstood how the notification timer hangs together. In particular, we rely on:
http://man7.org/linux/man-pages/man2/setitimer.2.html

ITIMER_VIRTUAL decrements only when the process is executing, and
                  delivers SIGVTALRM upon expiration.
ITIMER_PROF    decrements both when the process executes and when the
                  system is executing on behalf of the process.  Coupled
                  with ITIMER_VIRTUAL, this timer is usually used to
                  profile the time spent by the application in user and
                  kernel space.  SIGPROF is delivered upon expiration.

So if no CPU is used (user or system), no signal will be delivered. This does not mean we cannot follow through on this feature. An estimate of the intended sample rate should be sufficient as an indicator.
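
One way such an estimate could look, as a sketch only. It assumes the intended count is simply wall-clock duration divided by the sampling interval, i.e. what a process keeping one CPU busy for the whole run would produce. The function name `sample_report` and the `warn_below` threshold are invented for illustration, since the thread leaves "how many samples is low" open.

```python
def sample_report(actual_samples, wall_seconds, interval_seconds, warn_below=0.1):
    """Compare actual samples with the count a fully on-CPU run would yield.

    warn_below is a hypothetical threshold; the right value is still TBC.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    expected = wall_seconds / interval_seconds
    ratio = actual_samples / expected if expected > 0 else 0.0
    return {
        "expected": round(expected),
        "actual": actual_samples,
        "ratio": ratio,            # rough share of wall time spent on CPU
        "low": ratio < warn_below,
    }


# 60 s of profiling at a 10 ms interval, but only 120 samples were logged.
report = sample_report(actual_samples=120, wall_seconds=60.0, interval_seconds=0.01)
print(report)
```

Because ITIMER_PROF counts CPU time across the whole process, a multi-threaded JVM running on several cores can exceed a ratio of 1, so the figure is an indicator of low CPU usage rather than a precise utilisation measure.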
