Question: expose mbean operation to reset Monitor's status? (especially for BasicCounter?) #210

freesoft · 2014-01-10T19:16:28Z

Hello,

I've been using Servo-core for a project, and we sometimes need to reset some BasicCounter instances. This needs to be done from outside the program itself, for example through an mbean operation.

Let's say an external monitoring system pulls JMX values and uses them for alerting, but the monitoring system wants to reset those BasicCounters to zero once the alerts are gone, or for other reasons.

Is this possible in the current version of the servo-core library?

dmuino · 2014-01-10T21:31:08Z

Servo does not support mbean operations. BasicCounter does not support resetting the value. One way to get that functionality for the Counter would be to use annotations around an AtomicLong/AtomicInteger and then expose some mechanism that would reset it.

@Monitor(name="TotalErrors", type=COUNTER)
private AtomicInteger totalErrors = new AtomicInteger(0);

public void resetErrors() {
    totalErrors.set(0);
}
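For readers who want the reset to be reachable from outside the JVM, here is a minimal sketch of one way the "some mechanism" above could be exposed: a standard JMX MBean registered by the application itself. It uses only the JDK's `javax.management` API and is not part of Servo. The class names, the `com.example:type=ErrorStats` object name, and the `demo()` helper are all hypothetical, chosen for illustration.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical names throughout; nothing here is Servo API.
class ResettableErrorsExample {

    // Standard MBean convention: the management interface is public
    // and named <ImplementationClass>MBean.
    public interface ErrorStatsMBean {
        long getTotalErrors(); // exposed as the attribute "TotalErrors"
        void resetErrors();    // exposed as an operation
    }

    public static class ErrorStats implements ErrorStatsMBean {
        private final AtomicLong totalErrors = new AtomicLong();

        public void recordError() { totalErrors.incrementAndGet(); }

        @Override public long getTotalErrors() { return totalErrors.get(); }

        @Override public void resetErrors() { totalErrors.set(0); }
    }

    // Registers the bean, records two errors, resets through the MBean
    // server, and returns the attribute value before and after the reset.
    static long[] demo() {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName name = new ObjectName("com.example:type=ErrorStats");
            ErrorStats stats = new ErrorStats();
            server.registerMBean(stats, name);
            try {
                stats.recordError();
                stats.recordError();
                long before = (Long) server.getAttribute(name, "TotalErrors");
                // This is the call a remote JMX client would make.
                server.invoke(name, "resetErrors", new Object[0], new String[0]);
                long after = (Long) server.getAttribute(name, "TotalErrors");
                return new long[] {before, after};
            } finally {
                server.unregisterMBean(name);
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        long[] result = demo();
        System.out.println("before=" + result[0] + " after=" + result[1]);
    }
}
```

The sketch invokes the operation in-process; a remote client would call the same `invoke` through an `MBeanServerConnection`. Any other consumer that reads the same value will still see the reset.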

brharrington · 2014-01-10T21:52:28Z

It should also be pointed out that it isn't generally a good idea to reset a BasicCounter. If you have multiple observers polling the value then they each need to have independent state to get accurate values. The CounterToRateTransform can be wrapped around a particular observer to keep the state and convert the cumulative total to a rate per second which is typically more useful. If you reset the value, then you are corrupting the state for other observers.
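To illustrate why a reset corrupts observer state, here is a minimal sketch in plain Java. It does not use Servo's transform; `RateTracker` is a hypothetical stand-in for the per-observer state described above, and the sample values are made up.

```java
// Hypothetical names; a simplified stand-in, not Servo's implementation.
class ObserverRateExample {

    // Each observer owns one tracker, so its previous sample is independent
    // of how often any other observer polls.
    static final class RateTracker {
        private long prevValue;
        private long prevTimeMillis;
        private boolean hasPrev;

        // Returns the rate per second since this tracker's previous sample,
        // or NaN when there is no usable previous sample.
        double update(long value, long timeMillis) {
            double rate = Double.NaN;
            if (hasPrev && timeMillis > prevTimeMillis) {
                rate = (value - prevValue) * 1000.0 / (timeMillis - prevTimeMillis);
            }
            prevValue = value;
            prevTimeMillis = timeMillis;
            hasPrev = true;
            return rate;
        }
    }

    // Simulates a counter growing by 5 per second, polled every 10 seconds,
    // that someone resets to zero at t=25s.
    static double[] demo() {
        RateTracker tracker = new RateTracker();
        tracker.update(0, 0);                              // first sample, no rate yet
        double steady = tracker.update(50, 10_000);        // 50 events in 10s
        tracker.update(100, 20_000);
        // Reset at t=25s: by t=30s the counter has only reached 25 again.
        double afterReset = tracker.update(25, 30_000);
        return new double[] {steady, afterReset};
    }

    public static void main(String[] args) {
        double[] rates = demo();
        System.out.println("steady=" + rates[0] + " afterReset=" + rates[1]);
    }
}
```

In this sketch the true rate never changes, yet the interval containing the reset yields a negative rate for the observer, because its remembered previous value no longer matches the counter.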

freesoft · 2014-01-11T08:23:31Z

@dmuino Thank you for the suggestion, but I don't think that is something an external system can do from outside the running JVM. Btw, the value in your example needs to be an AtomicLong.

@brharrington That really depends on how the LiveOps/DevOps/GNOC people (or whoever is in charge of live operations) want to use it. I still think resetting values through an mbean operation is handy and useful.

brharrington · 2014-01-11T15:21:33Z

Can you elaborate on the overall workflow you are wanting?

Internally we typically have a setup that supports multiple observers:

- Send data to internal time series database.
  - Small subset with higher resolution, every 10s
  - Most data every 1m to main stack
- Send some data to CloudWatch to support auto-scaling
- Optionally log values to local file for debugging purposes
- Local observer on the instance that checks conditions and can trigger an alert
- Viewing data via jmx

So we typically have 5+ different observers configured to receive all or parts of the data, and we need each of them to get a consistent view. Resetting a BasicCounter means that at least some of these will get a bad rate value for the polling interval where the reset occurred. Servo does support ResettableMonitor types. These would be configured such that there is a primary poller responsible for resetting the value, and we would typically have the observer that receives the data tee it to all the others that need samples at that interval.
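The primary-poller-plus-tee arrangement can be sketched in plain Java as follows. This is an illustration of the idea only, not Servo's ResettableMonitor API; `ResettableCount`, `pollAndTee`, and the observer lists are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongConsumer;

// Hypothetical names; plain Java, not Servo's ResettableMonitor API.
class PrimaryPollerTeeExample {

    // A count that is read and zeroed in one atomic step.
    static final class ResettableCount {
        private final AtomicLong value = new AtomicLong();

        void increment() { value.incrementAndGet(); }

        long getAndReset() { return value.getAndSet(0); }
    }

    // Only the primary poller resets the value; every downstream observer
    // receives the same sample for the interval.
    static void pollAndTee(ResettableCount count, List<LongConsumer> observers) {
        long sample = count.getAndReset();
        for (LongConsumer observer : observers) {
            observer.accept(sample);
        }
    }

    // Two observers, two polling intervals with 3 and then 2 events.
    static List<List<Long>> demo() {
        List<Long> seenByA = new ArrayList<>();
        List<Long> seenByB = new ArrayList<>();
        List<LongConsumer> observers = new ArrayList<>();
        observers.add(v -> seenByA.add(v));
        observers.add(v -> seenByB.add(v));

        ResettableCount count = new ResettableCount();
        for (int i = 0; i < 3; i++) count.increment();
        pollAndTee(count, observers);
        for (int i = 0; i < 2; i++) count.increment();
        pollAndTee(count, observers);

        List<List<Long>> result = new ArrayList<>();
        result.add(seenByA);
        result.add(seenByB);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the reset happens in exactly one place, both observers in the sketch record the same per-interval samples. An extra, uncoordinated reset would take events away from all of them.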

freesoft · 2014-01-14T06:38:01Z

Let's assume:

- You already have your own monitoring/alerting system written in C/C++ or Python. The system was built to cover different OSes and languages, and it has its own logging protocol other than JMX.
- You have a small agent running on each server to collect metrics and send them to the centralized monitoring server. So yes, it has an observer, but it is neither Servo's nor written in Java. This is a common case for many companies or developers who have worked on different projects with different platforms/languages for several years.
- Now, let's say you have a new Java application using the Servo library. You need to add JMX attribute gathering to your agent so it can send metrics to the monitoring server, which is a non-Java system.
- The monitoring server triggers alerts based on different measures; some are rates, others are specific numbers/amounts. Let's say one alert condition is based on the current success/fail ratio (like "alert when FAIL CNT / SUCCESS CNT > 0.1"). Once the fail rate is over 10%, it will keep alerting every 5 minutes (or any given time frame) until someone fixes the issue and SUCCESS CNT has increased enough to bring the fail rate below 10%. => The monitoring system will keep alerting until the system has accumulated enough successes EVEN AFTER THE PROBLEM HAS BEEN SOLVED.

The solutions without resetting JMX attributes through an mbean operation would be:

- Restart the service to stop the alert.
- Change every alert measure to use a rate per given time frame instead of a counter.

but those sound odd to me.

I understand your concern about inconsistency when a counter is reset, but the feature can still be useful depending on the system or situation.

freesoft · 2014-01-15T23:48:13Z

Can I get an update on whether you are considering this feature? Or would you accept code changes if I contribute them? If you think it's unnecessary, I'll find a workaround for my case rather than wait for a response.
Thank you!

brharrington · 2014-01-16T01:23:29Z

In response to your first three bullets:

- Servo can send to other systems that are not JVM based. For example, there is an observer implementation that forwards to graphite, which is python. You could also write one that sends to your local agent on the machine.
- For servo, JMX is just a view of the data. Servo data can be captured by plugging in an observer that then communicates with whatever your backend is.

On the last bullet, I disagree. The goal of servo is to provide a way to indicate monitors and collect the data via observers. It should be able to tell you what happened during the last polling interval (provided you have wrapped the observer in CounterToRateTransform for the case of monotonic counters). This is critical because it means that the signal you are getting to the monitoring system will also tell you when the problem actually goes away in terms of what is measured, not just when someone clicks reset and says it is fixed. In our case the monitoring system supports defining alerts and we can visually depict this information so we'll see something like:

We'll then resolve the alert after confirming that the state is back to normal. In short I don't think we will accept this change because:

- It doesn't seem necessary. I don't see why you couldn't write an observer that bridges your internal system with the data coming in from servo. Look at graphite as an example of talking to a non-java system.
- Resetting the state of a basic counter breaks the model and has undesirable pitfalls. If you really need this, it should follow the gauge contract: you can wrap a gauge around any Number implementation, like AtomicLong, which would give you full control if you needed it. Note "counter" is a bit overloaded; we use the RRD notion where it is a monotonically increasing value used to generate a rate per second.
- As described above, I think the current servo approach is better in that it gives the downstream monitoring/collection system an input signal that can tell you when the actual measurement shows the issue was resolved.
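The difference between alerting on cumulative totals and alerting on per-interval values can be shown with a small worked example. This is a sketch with made-up sample numbers and hypothetical method names; the 0.1 threshold is the one from the alert condition quoted earlier in the thread.

```java
// Hypothetical names and sample data, for illustration only.
class AlertRatioExample {

    // FAIL CNT / SUCCESS CNT computed from the running totals at sample i.
    static double cumulativeRatio(long[] fail, long[] success, int i) {
        return fail[i] / (double) success[i];
    }

    // The same ratio computed only from what happened between samples i-1 and i.
    // If no successes occurred in the interval, the result is Infinity or NaN.
    static double intervalRatio(long[] fail, long[] success, int i) {
        long failDelta = fail[i] - fail[i - 1];
        long successDelta = success[i] - success[i - 1];
        return failDelta / (double) successDelta;
    }

    public static void main(String[] args) {
        // Running totals sampled at the end of three intervals:
        // healthy, failing (90 failures, 10 successes), then fixed.
        long[] success = {100, 110, 210};
        long[] fail = {0, 90, 90};

        for (int i = 1; i < success.length; i++) {
            System.out.println("interval " + i
                    + " cumulative=" + cumulativeRatio(fail, success, i)
                    + " perInterval=" + intervalRatio(fail, success, i));
        }
    }
}
```

In the last interval the per-interval ratio is 0, so an alert based on it clears as soon as the measurement recovers, with no reset. The cumulative ratio is still 90/210, above the 0.1 threshold, which is the lingering alert described in the question.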

brharrington closed this as completed Jan 16, 2014
