Expose important metrics #175
One approach for libs to expose metrics is to provide an API which freezes a snapshot of current metric state, and callers can then write a simple adapter to copy those to their desired metric backend. Guava does this on their Cache objects: https://code.google.com/p/guava-libraries/wiki/CachesExplained#Statistics
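For illustration, a minimal sketch of what a frozen-snapshot API in that style might look like for Pushy; the class and field names are hypothetical, not part of Pushy's actual API:

```java
// Hypothetical, immutable snapshot of client metrics; callers poll it and copy
// the values into whatever metrics backend they use (much like Guava's Cache#stats()).
public final class ApnsClientMetricsSnapshot {

    private final long notificationsSent;
    private final long notificationsAccepted;
    private final long notificationsRejected;
    private final long connectionFailures;

    public ApnsClientMetricsSnapshot(final long notificationsSent, final long notificationsAccepted,
            final long notificationsRejected, final long connectionFailures) {
        this.notificationsSent = notificationsSent;
        this.notificationsAccepted = notificationsAccepted;
        this.notificationsRejected = notificationsRejected;
        this.connectionFailures = connectionFailures;
    }

    public long getNotificationsSent() { return this.notificationsSent; }
    public long getNotificationsAccepted() { return this.notificationsAccepted; }
    public long getNotificationsRejected() { return this.notificationsRejected; }
    public long getConnectionFailures() { return this.connectionFailures; }
}
```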
Now that we've shipped HTTP/2 support, I think this will be much more straightforward. I've put out a request for advice on best practices as to how we'd manage metrics when there are lots of …
I think it’d be a good idea not to tie Pushy to a specific metrics library. BTW, Hystrix is a good example of a production-quality library which … Also, Marshall Pierce’s idea of specifying a string prefix for each set of … —j.
Here's something demoing the kind of thing I mean: #222
Awesome. Thank you! That answered a lot of the questions I had in my head, but hadn't found the time to articulate. Looking at #222, it seems to me that there are two extremes of approaches we could take to exposing internal metrics: …
#222 seems to go right down the middle: we provide our own metrics interfaces (…
Please don't mistake my curiosity for disagreement; exposing metrics for other people to use is a brave new world for me, so I don't have a whole lot of best practices to draw upon, and I'm eager to learn more about the space. Thanks kindly for all of the time and effort so far!
I have to admit, where #1 is concerned, I have a selfish angle -- we're still using Metrics v2.2.0, and I'm not certain about its ABI compatibility with Metrics v3. TBH though, we should probably just get around to dealing with that tech debt, and this may be a decent excuse to shave that yak ;)

#2 is a good point, too. Dropwizard Metrics does seem to be the most flexible option. So yeah, I would be OK with depending on Metrics from Pushy.
Ha—okay. I don't mean to become too attached to Dropwizard's library. It just happens to be the one I looked at first ;)
Do you think it's actually a good choice, or do you still think a different approach would be better?
Just want to chime in: I'd love to see Pushy take the more general approach. I think we could use Metrics as a reference implementation, but locking into just one library limits the usefulness of Pushy IMHO. Additionally, going the route of the general interface approach opens the door to supporting use cases beyond just metrics. For example, say you want your service to be able to take some action if it detects certain APNS errors like invalid tokens (while this specific example is possible today, it requires that you register a listener on every response future). Just my two cents, happy to discuss further!
@zfarrell As a point of reference, what would you (specifically) use instead of Metrics? Again, this is a new world to me, and I'm interested in learning more about what other people actually do in the real world ;)
We're not using any pure metrics library at the moment. We have a project coming up in which we'll likely be evaluating options, but for now we're effectively just logging the things we want to keep track of (which get aggregated and sent off to Splunk).
The important part here is determining what values should be exposed at all. As an engineer, I am mostly interested in checking that it works, and much less interested in having a chart with millions of notifications sent :-)

So, I am thinking of exposing means to detect problems in a running application, and I would define the bare minimum as exposing the size of internal queues (and ideally how long tasks are waiting there; some getter methods should be fine, so one can wrap those as a metrics Gauge) and exceptions when connecting to APNS (which could be caused by the network being down, DNS being down, or an expired certificate; logging and/or a listener should be enough here). What else is important? Successful or failed results are probably nice to have, as they could be counted in a listener attached to the returned Future.

With regards to a particular approach to implementing this, I would suggest not having a dependency on a particular metrics library in the core module, as Dropwizard Metrics, for example, is renaming packages in version 4. There should be some getters/listeners in core, and separate adapter modules for particular metrics libraries.
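For illustration, a minimal sketch of wrapping such a getter as a Dropwizard Metrics Gauge on the caller's side; the IntSupplier stands in for whatever queue-size accessor Pushy might expose (hypothetical, not an existing Pushy method):

```java
import java.util.function.IntSupplier;

import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;

public class QueueSizeGaugeExample {

    public static void registerQueueSizeGauge(final MetricRegistry registry, final IntSupplier queueSize) {
        // The registry polls getValue() whenever it reports, so the caller does no
        // extra bookkeeping beyond exposing the getter.
        registry.register("pushy.internal-queue-size", new Gauge<Integer>() {
            @Override
            public Integer getValue() {
                return queueSize.getAsInt();
            }
        });
    }
}
```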
'Do you think it's actually a good choice, or do you still think a different approach would be better?' Dropwizard Metrics is certainly the most popular choice in the Java world, integrates decently with many backend systems, and is reliable -- we rely heavily on it in our production systems, recording over 100k metric datapoints per second through it, and have rarely spotted any issues with it. (Graphite, on the other hand....)
'I would define the bare minimum as exposing size (and ideally for how long tasks are waiting there) of internal queues (some getter methods should be fine, so one can wrap those as metrics Gauge)' Regarding this -- I think in the current HTTP/2 code, these are externally managed -- Pushy just returns Futures to the calling code, and the caller gets to store and track them and record metrics around queue lifetime, backlog, etc. (This is totally fine with me, FWIW.)
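For illustration, a sketch of what that caller-side tracking could look like, assuming the caller uses Dropwizard Metrics and that Pushy hands back Netty Futures as described above; the metric names are made up:

```java
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;
import io.netty.util.concurrent.Future;
import io.netty.util.concurrent.GenericFutureListener;

public class CallerSideMetricsExample {

    private final MetricRegistry registry = new MetricRegistry();
    private final Timer sendTimer = this.registry.timer("pushy.send-duration");

    // Wraps a Future returned by Pushy and records its outcome and latency.
    public <T> void track(final Future<T> sendFuture) {
        final Timer.Context timerContext = this.sendTimer.time();

        sendFuture.addListener(new GenericFutureListener<Future<T>>() {
            @Override
            public void operationComplete(final Future<T> future) {
                timerContext.stop();

                if (future.isSuccess()) {
                    registry.counter("pushy.sent").inc();
                } else {
                    registry.counter("pushy.failed").inc();
                }
            }
        });
    }
}
```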
Right—I think it would be helpful to be clear about the purpose of the metrics we're exposing. To address a theme identified by @zfarrell and @panchenko, I was basically thinking what @jmason just said: Pushy is designed so that you can figure out everything you need to know about the health and operation of the whole system from the core API. We return futures to the calling code for exactly that reason.

If there's something related to performance or health that you think you may need to react to programmatically and it's not exposed, we should consider that a shortcoming of the core API and work to address it there, rather than allowing the metrics API to be the only source of that information.

As far as exposing metrics goes, I basically view this as a convenience thing so callers can easily get performance data to the reporting service of their choice. All of the critical information should be derivable from the core API (although we might throw in a couple of convenience things, like the absolute number of reconnection attempts).

@panchenko I agree that the primary purpose shouldn't be showing impressive numbers (even though that can be pretty fun). To me, the purpose is to identify problems that require a response from actual people rather than a response from code (i.e. "it looks like we're having a network outage" or "it looks like we're sending way too many notifications and may have a bug somewhere").

So, tl;dr: my proposal is that metrics—however we decide to expose them—are just a convenience intended to get data to a reporting service as easily as possible, and should not be the only source of critical information needed to react to problems in code. Seem legit?
@jchambers What I'm hearing (please correct me if I'm wrong): …

A couple of thoughts: …
I think that's a good summary. Thanks ;)
I'm open to the idea, but I'm really struggling to imagine how we'd do that in an entirely separable way. It seems like we'd need some notion of a metrics source (i.e. …). Alternatively, we could have a thing that tries to derive all metrics from the public API, but that seems like we'd be jumping through outrageous hoops that would make life needlessly difficult for end users (if it's even possible). Maybe that would mean having something like a …

Seems like a good idea. From the open branches/pull requests so far (and making no judgment as to whether or not we want all of these), we have: …
@zfarrell I just realized that I misunderstood your suggestion. Sorry! You were talking about Dropwizard Metrics, not the idea of lowercase 'm' metrics in general. In that case, yes, I think a separate package is certainly a possibility.
I think, however, that a metrics listener interface or a snapshot metrics object (getCurrentMetrics()) would be great for those of us who don't use Dropwizard Metrics, so please don't expose the metrics in a Dropwizard-only way.
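For illustration, a rough sketch of what such a vendor-agnostic listener interface might look like, using hypothetical method names drawn from the signals discussed later in this thread (not a commitment to Pushy's final API):

```java
// Hypothetical vendor-agnostic listener; implementors forward these callbacks to
// whatever metrics library (or logging, alerting, etc.) they prefer.
public interface ApnsClientMetricsListener {

    void handleConnectionAttemptStarted();
    void handleConnectionAttemptSucceeded();
    void handleConnectionAttemptFailed();

    void handleNotificationSent(long notificationId);
    void handleNotificationAccepted(long notificationId);
    void handleNotificationRejected(long notificationId);
}
```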
Okay; sounds like there's a clear preference to do something vendor-agnostic. We'll roll with that. Thanks, everybody!
What do you folks think of the …

Thanks!
I would suggest using something like CopyOnWriteArrayList. As an advanced optimization, it might be a fork of the necessary parts of CopyOnWriteArrayList which returns a constant for the empty iterator.
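For illustration, a minimal sketch of how listener bookkeeping could use CopyOnWriteArrayList instead of synchronized blocks; it reuses the hypothetical listener interface sketched earlier in this thread:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class MetricsListenerSupport {

    // Additions and removals copy the underlying array, so iteration on the hot
    // path never blocks and needs no synchronized sections.
    private final List<ApnsClientMetricsListener> listeners = new CopyOnWriteArrayList<>();

    public void addListener(final ApnsClientMetricsListener listener) {
        this.listeners.add(listener);
    }

    public void removeListener(final ApnsClientMetricsListener listener) {
        this.listeners.remove(listener);
    }

    void notifyNotificationAccepted(final long notificationId) {
        for (final ApnsClientMetricsListener listener : this.listeners) {
            listener.handleNotificationAccepted(notificationId);
        }
    }
}
```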
We could also conceivably move to just allowing one listener instead of a list. Can you elaborate on the abuse-via-synchronized thing, though?
Thanks @jchambers, this is looking really good. How would you feel about moving the event subscribing/publishing to a separate service? So, for example, rather than …
we'd have (very rough sketch of what this could look like; many ways this could be accomplished):
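The original sketch isn't preserved in this thread; purely as an illustration of the general shape being proposed (all names hypothetical), an event-publishing service might look something like this:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

public class ApnsClientEventBus {

    // Hypothetical marker interface for events such as "notification accepted"
    // or "connection attempt failed".
    public interface ApnsClientEvent {}

    private final List<Consumer<ApnsClientEvent>> subscribers = new CopyOnWriteArrayList<>();

    public void subscribe(final Consumer<ApnsClientEvent> subscriber) {
        this.subscribers.add(subscriber);
    }

    // The client would call this instead of invoking a metrics listener directly;
    // metrics, logging, and error handling could all subscribe independently.
    public void publish(final ApnsClientEvent event) {
        for (final Consumer<ApnsClientEvent> subscriber : this.subscribers) {
            subscriber.accept(event);
        }
    }
}
```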
A couple of advantages to this approach: …
Thanks again for being open to the vendor-agnostic approach. |
That would help with avoiding synchronization, especially if the listener is passed only via the constructor.
I mean that if one is doing time-consuming operations in some listener methods, that could block other threads.
Agreed that it would be better to get rid of the 'synchronized' scopes. I would say limiting to a single listener would be fine as a way to avoid this.

Jon -- one thing that's missing is support for Timer measurements; it'd be nice to have a way to time events like "how long does it take to establish a connection," etc. But that could be left as a potential future follow-up -- this is a great start!

zfarrell: IMO it'd be better to do the API this way, and allow users themselves to implement an event bus (or whatever) in their implementation. The 'metrics_listener' approach has the advantage of not creating any objects or imposing any overhead unless the user wants to actually do something with the calls. (It'd be trivial to implement a no-op ApnsClientMetricsListener.)
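For reference, such a no-op implementation of the hypothetical listener interface sketched earlier would just be a handful of empty methods:

```java
// Does nothing; handy as a default so the client never has to null-check its listener.
public class NoopApnsClientMetricsListener implements ApnsClientMetricsListener {

    @Override public void handleConnectionAttemptStarted() {}
    @Override public void handleConnectionAttemptSucceeded() {}
    @Override public void handleConnectionAttemptFailed() {}

    @Override public void handleNotificationSent(final long notificationId) {}
    @Override public void handleNotificationAccepted(final long notificationId) {}
    @Override public void handleNotificationRejected(final long notificationId) {}
}
```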
My two cents:

```java
// Timer.Context implements Closeable, so the timer stops automatically when the
// try block exits.
try (Timer.Context ctx = timer.time()) {
    doSomething();
}
```

Not super-useful in async code, but very handy in sync one.
@panchenko True! I don't think that's necessarily tied to the …
@zfarrell Also true! I'm not sure if it's worth it, though; I suspect most reasonable implementations (by which I mean those that are published publicly and peer-reviewed) will just be incrementing integers here and there, and I'm not sure how much complexity we want to invest to protect against unreasonable implementations (which are likely to be used only by the author and unlikely to cause widespread harm). My preference for now would be to put a note in the docs that says "don't send email and mine bitcoins," but I'm definitely open to arguments that other approaches would be better.
@jmason Oh, right! I wanted to address that in the docs, but then, uh, didn't ;) My thinking there was that, for connections at least, listeners will always get a "connection attempt started" signal followed by exactly one of an "attempt failed" or an "attempt succeeded" signal. I think that should be enough for listeners to time connection attempts if they want to, right? Similarly, for sending notifications, that's what I was hoping to accomplish by passing notification IDs to the listener methods.

My thought with this approach was to (a) avoid imposing a notion of a "timer context" on implementors and (b) keep things really simple (all we have to do internally is increment a number, and we don't have to maintain a notification-to-timer map). Does this seem crazy?
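For illustration, a sketch of how a listener could derive timings from just those signals, again using the hypothetical interface sketched earlier and Dropwizard Metrics; a single in-flight connection attempt is assumed:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

public class TimingApnsClientMetricsListener implements ApnsClientMetricsListener {

    private final MetricRegistry registry = new MetricRegistry();
    private final Timer connectionTimer = this.registry.timer("pushy.connection-duration");
    private final Timer notificationTimer = this.registry.timer("pushy.notification-duration");

    // One in-flight connection attempt at a time is assumed here.
    private volatile Timer.Context connectionContext;

    // The notification ID lets us match "sent" to "accepted"/"rejected" signals.
    private final ConcurrentMap<Long, Timer.Context> notificationContexts = new ConcurrentHashMap<>();

    @Override
    public void handleConnectionAttemptStarted() {
        this.connectionContext = this.connectionTimer.time();
    }

    @Override
    public void handleConnectionAttemptSucceeded() {
        this.stopConnectionTimer();
    }

    @Override
    public void handleConnectionAttemptFailed() {
        this.stopConnectionTimer();
    }

    private void stopConnectionTimer() {
        final Timer.Context context = this.connectionContext;
        if (context != null) {
            context.stop();
        }
    }

    @Override
    public void handleNotificationSent(final long notificationId) {
        this.notificationContexts.put(notificationId, this.notificationTimer.time());
    }

    @Override
    public void handleNotificationAccepted(final long notificationId) {
        this.stopNotificationTimer(notificationId);
    }

    @Override
    public void handleNotificationRejected(final long notificationId) {
        this.stopNotificationTimer(notificationId);
    }

    private void stopNotificationTimer(final long notificationId) {
        final Timer.Context context = this.notificationContexts.remove(notificationId);
        if (context != null) {
            context.stop();
        }
    }
}
```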
I'm generally with @jmason on the let's-not-use-an-event-bus approach here; I'd also point out that if people are writing their own …

Up until now, I had assumed (not for any great reason!) that nobody would actually subclass …

Thanks for the continued feedback, everybody!
Oh, also:
Could we achieve the same thing by adding the client as an argument to the listener methods? For example:

```java
void handleNotificationAccepted(long notificationId);
```

…would change to:

```java
void handleNotificationAccepted(ApnsClient<? extends ApnsPushNotification> apnsClient, long notificationId);
```
Yep, the notificationId thing makes sense -- that gives us enough info for Timer usage.
I've opened #251, which incorporates (I think!) all of the suggestions to date. What do you folks think?
This kind of thing has come up in a number of different contexts. #114 and #121 both offer partial solutions, but I think the most robust way to go (as suggested on the mailing list) would be to use something like JMX or Metrics. Opening this issue as a placeholder/catch-all for this kind of request and for discussion of possible approaches.