Question: expose MBean operation to reset Monitor's status? (especially for BasicCounter?) #210

Closed
freesoft opened this issue Jan 10, 2014 · 7 comments

Comments

@freesoft

Hello,

I've been using servo-core for a project, and we sometimes need to reset certain BasicCounter instances. This has to be done from outside the program itself, for example through an MBean operation.

Say an external monitoring system pulls JMX attributes and uses them for system alerting, and it wants to reset those BasicCounter values to zero once the alerts have cleared, or for other reasons.

Is this possible in the current version of the servo-core library?

@dmuino
Contributor

dmuino commented Jan 10, 2014

Servo does not support MBean operations, and BasicCounter does not support resetting its value. One way to get that functionality for a counter would be to put the annotation on an AtomicLong/AtomicInteger field and then expose some mechanism that resets it.

@Monitor(name="TotalErrors", type=COUNTER)
private AtomicInteger totalErrors = new AtomicInteger(0);

public void resetErrors() {
    totalErrors.set(0);
}
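
As a rough sketch of what "some mechanism" could look like when the reset has to be triggered from outside the JVM, the value can be exposed through a plain JMX standard MBean, independent of Servo. Everything here is an assumption for illustration: the names `ErrorStatsMBean`, `ErrorStats`, and the object name `example:type=ErrorStats` are hypothetical, and only the JDK's `javax.management` API is used.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.MBeanServer;
import javax.management.ObjectName;

final class ResettableCounterExample {

    // Standard MBean contract: getters become attributes, other methods become operations.
    // The interface must be public and named <ImplementationClass>MBean.
    public interface ErrorStatsMBean {
        long getTotalErrors();
        void resetErrors();
    }

    // Hypothetical bean wrapping the counter; not part of Servo.
    public static final class ErrorStats implements ErrorStatsMBean {
        private final AtomicLong totalErrors = new AtomicLong();

        void recordError() { totalErrors.incrementAndGet(); }

        @Override public long getTotalErrors() { return totalErrors.get(); }

        @Override public void resetErrors() { totalErrors.set(0); }
    }

    // Registers the bean, then reads and resets it through the MBean server,
    // which is the same path a remote JMX client would use.
    static long[] runDemo() {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        try {
            ObjectName name = new ObjectName("example:type=ErrorStats"); // hypothetical name
            ErrorStats stats = new ErrorStats();
            server.registerMBean(stats, name);
            try {
                stats.recordError();
                stats.recordError();
                long before = (Long) server.getAttribute(name, "TotalErrors");
                server.invoke(name, "resetErrors", new Object[0], new String[0]);
                long after = (Long) server.getAttribute(name, "TotalErrors");
                return new long[] {before, after};
            } finally {
                server.unregisterMBean(name);
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        long[] values = runDemo();
        System.out.println("before=" + values[0] + " after=" + values[1]);
    }
}
```

With this shape, a tool such as JConsole would list `resetErrors` as an operation on the bean. Whether the same field can also carry the Servo annotation is not covered by this sketch.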

@brharrington
Contributor

It should also be pointed out that it isn't generally a good idea to reset a BasicCounter. If you have multiple observers polling the value then they each need to have independent state to get accurate values. The CounterToRateTransform can be wrapped around a particular observer to keep the state and convert the cumulative total to a rate per second which is typically more useful. If you reset the value, then you are corrupting the state for other observers.
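
To make the point about per-observer state concrete, here is a plain-Java illustration of the idea. It is not Servo's CounterToRateTransform, only a hypothetical `RateTracker` that remembers the previous sample, with two observers polling the same cumulative counter at different intervals. The traffic numbers are made up.

```java
import java.util.concurrent.atomic.AtomicLong;

final class RateObserverExample {

    // Keeps the previous sample so a cumulative total can be turned into a rate.
    static final class RateTracker {
        private long lastValue;
        private long lastTimeMillis;

        RateTracker(long initialValue, long initialTimeMillis) {
            this.lastValue = initialValue;
            this.lastTimeMillis = initialTimeMillis;
        }

        // Returns events per second since this observer's previous poll.
        double poll(long value, long nowMillis) {
            double seconds = (nowMillis - lastTimeMillis) / 1000.0;
            double rate = (value - lastValue) / seconds;
            lastValue = value;
            lastTimeMillis = nowMillis;
            return rate;
        }
    }

    // Simulates a steady 5 events/s. Observer A polls at 10s and 20s, observer B at 60s.
    // Returns {A's rate at 10s, A's rate at 20s, B's rate at 60s}.
    static double[] simulate(boolean resetAfterFirstPoll) {
        AtomicLong counter = new AtomicLong();
        RateTracker a = new RateTracker(0, 0);
        RateTracker b = new RateTracker(0, 0);

        counter.addAndGet(50);                      // 0s..10s
        double a1 = a.poll(counter.get(), 10_000);
        if (resetAfterFirstPoll) {
            counter.set(0);                         // external reset at 10s
        }
        counter.addAndGet(50);                      // 10s..20s
        double a2 = a.poll(counter.get(), 20_000);
        counter.addAndGet(200);                     // 20s..60s
        double b1 = b.poll(counter.get(), 60_000);
        return new double[] {a1, a2, b1};
    }

    public static void main(String[] args) {
        for (boolean reset : new boolean[] {false, true}) {
            double[] r = simulate(reset);
            System.out.println("reset=" + reset + " A@10s=" + r[0] + " A@20s=" + r[1] + " B@60s=" + r[2]);
        }
    }
}
```

Without the reset, every poll reports the true 5 events/s. With the reset, observer A's second poll and observer B's only poll both under-report, even though the traffic never changed.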

@freesoft
Author

@dmuino Thank you for the suggestion, but I don't think that's something an external system can do from outside the running JVM. Btw, the value in your example needs to be an AtomicLong.

@brharrington That really depends on how the LiveOps/DevOps/GNOC people (or whoever is in charge of live operations) want to use it, and I still think resetting values through an MBean operation is handy and useful.

@brharrington
Contributor

Can you elaborate on the overall workflow you want?

Internally we typically have a setup that supports multiple observers:

  • Send data to internal time series database.
    • Small subset with higher resolution, every 10s
    • Most data every 1m to main stack
  • Send some data to CloudWatch to support auto-scaling
  • Optionally log values to local file for debugging purposes
  • Local observer on the instance that checks conditions and can trigger an alert
  • Viewing data via JMX

So we typically have 5+ different observers configured to receive all or part of the data, and each of them needs a consistent view. Resetting a BasicCounter means that at least some of them will get a bad rate value for the polling interval in which the reset occurred. Servo does support ResettableMonitor types. These would be configured with a primary poller that is responsible for resetting the value, and the observer that receives the data would typically tee it to all the others that need samples at that interval.
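
The primary-poller arrangement can be sketched in plain Java as follows. This is only an illustration of the pattern under stated assumptions, not Servo's ResettableMonitor API: `IntervalCounter` and `PrimaryPoller` are hypothetical names, and downstream consumers are modeled as `LongConsumer` callbacks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongConsumer;

final class PrimaryPollerExample {

    // Hypothetical counter that is reset on every read by its single owner.
    static final class IntervalCounter {
        private final AtomicLong count = new AtomicLong();

        void increment() { count.incrementAndGet(); }

        // Atomically reads the count for the finished interval and starts the next one.
        long getAndReset() { return count.getAndSet(0); }
    }

    // The only component allowed to reset the counter; it tees each sample downstream.
    static final class PrimaryPoller {
        private final IntervalCounter counter;
        private final List<LongConsumer> downstream;

        PrimaryPoller(IntervalCounter counter, List<LongConsumer> downstream) {
            this.counter = counter;
            this.downstream = downstream;
        }

        void pollOnce() {
            long sample = counter.getAndReset();
            for (LongConsumer consumer : downstream) {
                consumer.accept(sample);   // every consumer sees the same value
            }
        }
    }

    // Runs two intervals (3 events, then 1 event) and returns what each consumer saw.
    static List<List<Long>> runDemo() {
        List<Long> seenByAlerts = new ArrayList<>();
        List<Long> seenByLogger = new ArrayList<>();
        List<LongConsumer> consumers = Arrays.asList(
                v -> seenByAlerts.add(v),
                v -> seenByLogger.add(v));

        IntervalCounter counter = new IntervalCounter();
        PrimaryPoller poller = new PrimaryPoller(counter, consumers);

        counter.increment();
        counter.increment();
        counter.increment();
        poller.pollOnce();
        counter.increment();
        poller.pollOnce();

        return Arrays.asList(seenByAlerts, seenByLogger);
    }

    public static void main(String[] args) {
        System.out.println(runDemo());
    }
}
```

Because only the poller calls `getAndReset()`, no consumer can invalidate another consumer's view of the interval.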

@freesoft
Author

Let's assume:

  • You already have your own monitoring/alerting system written in C/C++ or Python. It was designed to cover different operating systems and languages, and it has its own logging protocol rather than JMX.
  • You have a small agent running on each server that collects metrics and sends them to the centralized monitoring server. So yes, it has an observer, but it is neither Servo nor Java. This is a common case for companies and developers who have worked on different projects with different platforms and languages for several years.
  • Now say you have a new Java application that uses the Servo library. You need to add JMX attribute gathering to your agent so it can send metrics to the monitoring server, which is a non-Java system.
  • The monitoring server triggers alerts based on different measures: some are rates, others are specific numbers or amounts. Say one alert condition is based on the current success/fail ratio (like "alert when FAIL CNT / SUCCESS CNT > 0.1"). Once the fail rate is over 10%, it keeps alerting every 5 minutes, or at whatever interval is configured, until someone fixes the issue and SUCCESS CNT has increased enough to bring the fail rate below 10%. => The monitoring system keeps alerting until the system has accumulated enough successes, EVEN AFTER THE PROBLEM HAS BEEN SOLVED.
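
A small worked example of the lag described in the last bullet, with made-up numbers: lifetime totals of 1000 successes and 500 failures at the moment the problem is fixed, followed by 150 failure-free successes per minute. The class and method names are hypothetical and only illustrate the arithmetic.

```java
final class CumulativeRatioExample {

    // Alert rule from the bullet above: fire while failCount / successCount > threshold.
    static boolean alerting(long fail, long success, double threshold) {
        if (success == 0) {
            return fail > 0;
        }
        return (double) fail / success > threshold;
    }

    // Minutes of failure-free traffic needed before the rule stops firing
    // when it is evaluated against lifetime totals.
    static int minutesUntilClear(long fail, long success, long successPerMinute, double threshold) {
        if (successPerMinute <= 0) {
            throw new IllegalArgumentException("successPerMinute must be positive");
        }
        int minutes = 0;
        while (alerting(fail, success, threshold)) {
            success += successPerMinute;
            minutes++;
        }
        return minutes;
    }

    public static void main(String[] args) {
        // Lifetime totals at the moment the problem is fixed (made-up numbers).
        int lag = minutesUntilClear(500, 1000, 150, 0.1);
        System.out.println("cumulative rule clears after " + lag + " minutes");

        // The same rule applied to only the first minute after the fix: 0 failures, 150 successes.
        System.out.println("per-minute rule still alerting: " + alerting(0, 150, 0.1));
    }
}
```

With these numbers the lifetime ratio needs 27 minutes of clean traffic to fall below 0.1, while the same rule evaluated over the most recent minute clears immediately.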

The solutions that don't involve resetting JMX attributes through an MBean operation would be:

  1. Restart the service to stop the alert.
  2. Change every alert measure to use a rate per time frame instead of a counter.

but those sound odd to me.

I understand your concern about inconsistency when a counter is reset, but the feature can still be useful depending on the system or situation.

@freesoft
Author

Can I get an update on whether you are considering this feature? Or would you accept a code change if I submit one? If you think it's unnecessary, I'll find a workaround for my case rather than wait for a response.
Thank you!

@brharrington
Contributor

In response to your first three bullets:

  • Servo can send to other systems that are not JVM-based. For example, there is an observer implementation that forwards to Graphite, which is written in Python. You could also write one that sends to your local agent on the machine.
  • For Servo, JMX is just a view of the data. Servo data can be captured by plugging in an observer that then communicates with whatever your backend is.
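
As a rough idea of what the forwarding half of such a bridge might involve, the sketch below formats samples in Graphite's plaintext line format (`<metric path> <value> <unix timestamp>`) and writes them to a `Writer`. The `Sample` class, the `myapp` prefix, and the method names are hypothetical stand-ins and not Servo types; wiring this into an actual Servo observer is left out.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;

final class AgentBridgeExample {

    // Hypothetical stand-in for one polled value; this is not a Servo type.
    static final class Sample {
        final String name;
        final double value;
        final long epochSeconds;

        Sample(String name, double value, long epochSeconds) {
            this.name = name;
            this.value = value;
            this.epochSeconds = epochSeconds;
        }
    }

    // Graphite plaintext line: "<metric path> <value> <unix timestamp>\n".
    static String toGraphiteLine(String prefix, Sample s) {
        return prefix + "." + s.name + " " + s.value + " " + s.epochSeconds + "\n";
    }

    // Writes one line per sample. In practice the Writer would wrap a socket
    // connected to the local agent or to Graphite itself.
    static void publish(String prefix, List<Sample> samples, Writer out) {
        try {
            for (Sample s : samples) {
                out.write(toGraphiteLine(prefix, s));
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringWriter out = new StringWriter();
        publish("myapp", Arrays.asList(new Sample("TotalErrors", 3.0, 1389312000L)), out);
        System.out.print(out);
    }
}
```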

On the last bullet, I disagree. The goal of Servo is to provide a way to indicate monitors and collect the data via observers. It should be able to tell you what happened during the last polling interval (provided you have wrapped the observer in CounterToRateTransform for the case of monotonic counters). This is critical because it means that the signal your monitoring system receives will also tell you when the problem actually goes away in terms of what is measured, not just when someone clicks reset and says it is fixed. In our case the monitoring system supports defining alerts, and we can visually depict this information, so we'll see something like:

[Image: failure_example]

We'll then resolve the alert after confirming that the state is back to normal. In short I don't think we will accept this change because:

  1. It doesn't seem necessary. I don't see why you couldn't write an observer that bridges your internal system with the data coming in from servo. Look at graphite as an example of talking to a non-java system.
  2. Resetting the state of a basic counter breaks the model and has undesirable pitfalls. If you really need this, it should follow the gauge contract: you can wrap a gauge around any Number implementation, such as AtomicLong, which would give you full control if you needed it. Note that "counter" is a bit overloaded. We use the RRD notion, where it is a monotonically increasing value used to generate a rate per second.
  3. As described above, I think the current servo approach is better in that it gives the downstream monitoring/collection system an input signal that can tell you when the actual measurement shows the issue was resolved.

3 participants