
Expose important metrics #175

Closed
jchambers opened this issue May 24, 2015 · 30 comments

@jchambers
Owner

This kind of thing has come up in a number of different contexts. #114 and #121 both offer partial solutions, but I think the most robust way to go (as suggested on the mailing list) would be to use something like JMX or Metrics. Opening this issue as a placeholder/catch-all for this kind of request and for discussion of possible approaches.

@jmason

jmason commented May 25, 2015

One approach for libs to expose metrics is to provide an API which freezes a snapshot of the current metric state; callers can then write a simple adapter to copy those values to their desired metric backend. Guava does this on their Cache objects: https://code.google.com/p/guava-libraries/wiki/CachesExplained#Statistics
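For illustration, a minimal sketch of what such a snapshot object might look like for Pushy; the ApnsClientStats class, its fields, and the getStats() call mentioned afterward are hypothetical, not part of any released API:

// An immutable snapshot of counters, frozen at the moment the client hands it out
// (the same idea as Guava's CacheStats). All names here are hypothetical.
public final class ApnsClientStats {

    private final long notificationsAccepted;
    private final long notificationsRejected;
    private final long connectionFailures;

    public ApnsClientStats(final long notificationsAccepted,
                           final long notificationsRejected,
                           final long connectionFailures) {
        this.notificationsAccepted = notificationsAccepted;
        this.notificationsRejected = notificationsRejected;
        this.connectionFailures = connectionFailures;
    }

    public long getNotificationsAccepted() { return this.notificationsAccepted; }
    public long getNotificationsRejected() { return this.notificationsRejected; }
    public long getConnectionFailures() { return this.connectionFailures; }
}

Callers would then periodically ask the client for a snapshot (for example, a hypothetical client.getStats()) and copy the values into whatever metrics backend they already use.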

@jchambers jchambers mentioned this issue Jan 12, 2016
@jchambers
Owner Author

Now that we've shipped HTTP/2 support, I think this will be much more straightforward. I've put out a request for advice on best practices (https://groups.google.com/d/msg/metrics-user/gPdBrs2R_94/9td7_JLeFAAJ) as to how we'd manage metrics when there are lots of ApnsClients in play, and plan on tackling this as soon as I get an answer.

@jchambers jchambers added this to the v0.6 milestone Jan 19, 2016
@jmason

jmason commented Jan 19, 2016

I think it'd be a good idea not to tie Pushy to a specific metrics library; instead, expose an interface which callers can implement to receive metrics callbacks. Some people use Dropwizard Metrics, some use Datadog (via statsd), and some use statsd directly.

BTW, Hystrix is a good example of a production-quality library which exposes metrics well; it may provide some ideas: https://github.com/Netflix/Hystrix/wiki/Metrics-and-Monitoring-(up-to-1.4.x). Their approach of having a “metrics-publisher” Maven jar, which can be instantiated and registered against the core Hystrix lib, works well in my experience.

Also, Marshall Pierce’s idea of specifying a string prefix for each set of metrics is a good one; I use that internally in our code which logs metrics (to Graphite via Dropwizard Metrics).

—j.


@jmason

jmason commented Jan 25, 2016

Here's something demoing the kind of thing I mean: #222

@jchambers
Owner Author

Here's something demoing the kind of thing I mean: #222

Awesome. Thank you! That answered a lot of the questions I had in my head, but hadn't found the time to articulate.

Looking at #222, it seems to me that there are two extremes of approaches we could take to exposing internal metrics:

  1. Choose a specific metrics library and use their classes to report metrics. For what it's worth, I also created a quick prototype branch for this approach.
  2. Have a very general listener interface with methods like handleNewReconnectionAttempt and handleNotificationSuccess, avoiding any commitment to specific metrics classes/models so end users can write whatever adapters they need (see the sketch below).
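For illustration, a rough sketch of that second extreme, together with the kind of adapter an end user might write; the listener interface and its method names are illustrative only, and the adapter assumes Dropwizard Metrics 3.x on the classpath:

import com.codahale.metrics.MetricRegistry;

// Hypothetical listener interface: Pushy would call these methods and commit to no
// particular metrics model.
interface ApnsClientMetricsListener {
    void handleNewReconnectionAttempt();
    void handleNotificationSuccess();
    void handleNotificationFailure();
}

// A caller-written adapter that maps the callbacks onto Dropwizard counters.
class DropwizardApnsClientMetricsListener implements ApnsClientMetricsListener {

    private final MetricRegistry registry;

    DropwizardApnsClientMetricsListener(final MetricRegistry registry) {
        this.registry = registry;
    }

    @Override
    public void handleNewReconnectionAttempt() {
        this.registry.counter("reconnectionAttempts").inc();
    }

    @Override
    public void handleNotificationSuccess() {
        this.registry.counter("acceptedNotifications").inc();
    }

    @Override
    public void handleNotificationFailure() {
        this.registry.counter("rejectedNotifications").inc();
    }
}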

#222 seems to go right down the middle: we provide our own metrics interfaces (Counter and Timer), but end users will still need to provide their own implementations of those interfaces to work with specific metrics libraries. I definitely get the point there, but am wondering if it makes sense to move all the way to one end of the spectrum or the other. Specifically:

  1. Although I don't have a practical counter-example, you do point out that, by imposing this model, we might (in a hand-wavy theoretical way) make life difficult for users who want to use some other metrics-reporting technique that we're not familiar with. I think the cost here is small, but is the benefit we gain by imposing a model (as opposed to just exposing events) worth the cost?
  2. Going the other way, I agree that just using something like Dropwizard Metrics imposes a third-party library selection upon users, but given the number of third-party libraries and adapters out there, how high is that cost in practical terms? Is the benefit of having something 100% done and working worth the cost in terms of choice?

Please don't mistake my curiosity for disagreement; exposing metrics for other people to use is a brave new world for me, so I don't have a whole lot of best practices to draw upon, and I'm eager to learn more about the space. Thanks kindly for all of the time and effort so far!

@jmason

jmason commented Jan 26, 2016

I have to admit, where #1 is concerned, I have a selfish angle -- we're still using Metrics v2.2.0, and I'm not certain about its ABI compatibility with Metrics v3. TBH though, we should probably just get around to dealing with that tech debt, and this may be a decent excuse to shave that yak ;)

#2 is a good point, too. Dropwizard Metrics does seem to be the most flexible option.

so yeah, I would be ok to depend on Metrics from Pushy.

@jchambers
Owner Author

Ha—okay. I don't mean to become too attached to Dropwizard's library. It just happens to be the one I looked at first ;)

I would be ok to depend on Metrics from Pushy.

Do you think it's actually a good choice, or do you still think a different approach would be better?

@zfarrell

Just want to chime in: I'd love to see Pushy take the more general approach. I think we could use Metrics as a reference implementation, but locking into just one library limits the usefulness of Pushy, IMHO.

Additionally, going the route of the general interface approach opens the door to supporting use cases beyond just metrics. For example, say you want your service to be able to take some action if it detects certain APNS errors like invalid tokens (while this specific example is possible today, it requires that you register a listener to every response future).

Just my two cents, happy to further discuss!

@jchambers
Owner Author

@zfarrell As a point of reference, what would you (specifically) use instead of Metrics? Again, this is a new world to me, and I'm interested in learning more about what other people actually do in the real world ;)

@zfarrell

We're not using any pure metrics library at the moment. We have a project coming up where we'll likely be evaluating options, but for now we're effectively just logging the things we want to keep track of (which get aggregated and sent off to Splunk).

@panchenko
Contributor

The important part here is determining what values should be exposed at all.

As an engineer I am mostly interested in checking that it works, and a way less interested in having a chart with millions of notifications sent :-)

So, I am thinking of exposing means to detect problems in a running application. I would define the bare minimum as exposing the size of internal queues (and ideally how long tasks have been waiting there; some getter methods should be fine, so one can wrap those as a metrics Gauge, as sketched below) and exceptions when connecting to APNS (which could be caused by the network being down, DNS being down, or an expired certificate; logging and/or a listener should be enough here).
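For example, a caller could wrap such a getter as a Dropwizard Gauge roughly like this; getPendingNotificationCount() is a hypothetical stand-in for whatever queue-size getter the core might expose:

import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;

class QueueDepthGaugeExample {

    // Stand-in for a queue-size getter that does not exist in Pushy today.
    interface HypotheticalApnsClient {
        int getPendingNotificationCount();
    }

    static void registerQueueGauge(final MetricRegistry registry, final HypotheticalApnsClient client) {
        registry.register("apns.pendingNotifications", new Gauge<Integer>() {

            @Override
            public Integer getValue() {
                return client.getPendingNotificationCount();
            }
        });
    }
}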

What else is important?

Successful or failed results are probably nice to have, as they could be counted in a listener attached to the returned Future.
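A rough sketch of that idea, counting results with a listener on the Netty Future that sendNotification() returned in the v0.6-era API; the package and type names below are from that era and may differ in other Pushy versions:

import java.util.concurrent.atomic.AtomicLong;

import com.relayrides.pushy.apns.PushNotificationResponse;
import com.relayrides.pushy.apns.util.SimpleApnsPushNotification;

import io.netty.util.concurrent.Future;
import io.netty.util.concurrent.GenericFutureListener;

class ResultCountingExample {

    private final AtomicLong acceptedNotifications = new AtomicLong();
    private final AtomicLong rejectedOrFailedNotifications = new AtomicLong();

    // Call this with the Future returned by ApnsClient#sendNotification(...).
    void countResult(final Future<PushNotificationResponse<SimpleApnsPushNotification>> sendFuture) {
        sendFuture.addListener(new GenericFutureListener<Future<PushNotificationResponse<SimpleApnsPushNotification>>>() {

            @Override
            public void operationComplete(final Future<PushNotificationResponse<SimpleApnsPushNotification>> future) {
                if (future.isSuccess() && future.getNow().isAccepted()) {
                    acceptedNotifications.incrementAndGet();
                } else {
                    // Covers both notifications rejected by APNs and writes that failed outright.
                    rejectedOrFailedNotifications.incrementAndGet();
                }
            }
        });
    }
}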

With regards to a particular approach to implementing this, I would suggest not having a dependency on a particular metrics library in the core module, since Dropwizard Metrics, for example, is renaming packages in version 4. There should be some getters/listeners in core and separate adapter modules for particular metrics libraries.

@jmason

jmason commented Jan 27, 2016

'Do you think it's actually a good choice, or do you still think a different approach would be better?'

Dropwizard Metrics is certainly the most popular choice in the Java world, integrates decently with many backend systems, and is reliable -- we rely heavily on it in our production systems, recording over 100k metric datapoints per second through it, and have rarely spotted any issues with it. (Graphite, on the other hand....)

@jmason

jmason commented Jan 27, 2016

'I would define the bare minimum as exposing size (and ideally for how long tasks are waiting there) of internal queues (some getter methods should be fine, so one can wrap those as metrics Gauge)'

Regarding this -- I think in the current HTTP/2 code, these are externally managed -- Pushy just returns Futures to the calling code, and the caller gets to store and track them, and record metrics around queue lifetime, backlog, etc. (This is totally fine with me, fwiw)

@jchambers
Owner Author

Right—I think it would be helpful to be clear about the purpose of the metrics we're exposing. To address a theme identified by @zfarrell and @panchenko, I was basically thinking what @jmason just said: Pushy is designed so that you can figure out everything you need to know about the health and operation of the whole system from the core API.

We return Futures so you know exactly when notifications were processed and what happened to them. We no longer have the crazy internal retry queues and constantly-cycling connections that were required under the old APNs protocol. The model now is that there's a single connection that we basically expect to stay open forever (plus or minus reboots on Apple's side), and we now get responses to all notifications, so we don't have to maintain a ring buffer of everything we ever sent in case we discover that something went wrong a long time ago.

If there's something related to performance or health that you think you may need to react to programmatically and it's not exposed, we should consider that a shortcoming of the core API and should work to address it there rather than allowing the metrics API to be the only source of that information.

As far as exposing metrics goes, I basically view this as a convenience thing so callers can easily get performance data to the reporting service of their choice. All of the critical information should be derivable from the core API (although we might throw in a couple of convenience things like the absolute number of reconnection attempts).

@panchenko I agree that the primary purpose shouldn't be showing impressive numbers (even though that can be pretty fun). To me, the purpose is to identify problems that require a response from actual people rather than a response from code (i.e. "it looks like we're having a network outage" or "it looks like we're sending way too many notifications and may have a bug somewhere").

So, tl;dr: my proposal is that metrics—however we decide to expose them—are just a convenience thing intended to get data to a reporting service as easily as possible, and should not be the only source of critical information needed to react to problems in code.

Seem legit?

@zfarrell

@jchambers What I'm hearing (please correct me if I'm wrong):

  1. In theory, any type of "metric" that we could currently want should already be derivable from the core APIs (somehow). If there's something additional that isn't available, you're open to exposing an API for that.
  2. Since the ability already exists to respond to various "metrics" via code, the purpose of this issue is purely to hook into those (already available) APIs and make that data available to a human.

A couple thoughts:

  • Could this goal be accomplished outside of Pushy core (for this example, let's call it "pushy-metrics")? Those who want to use the Metrics library could include the "pushy-metrics" dependency and get that for free. Those who don't wouldn't be forced into using it, and those who have requirements to use some other library could consult "pushy-metrics" as a reference implementation.
  • Should we come up with a list of metrics that we would want to see? That way we can figure out how we would derive the metrics and/or identify any shortcomings of the current api.

@jchambers
Owner Author

What I'm hearing…

I think that's a good summary. Thanks ;)

Could this goal be accomplished outside of Pushy core (for this example, let's call it "pushy-metrics")?

I'm open to the idea, but I'm really struggling to imagine how we'd do that in an entirely separable way. It seems like we'd need some notion of a metrics source (e.g. getBigPileOfMetrics()) or a metrics sink (e.g. addMetricsListener()) as part of ApnsClient, so that would need to get baked into the core somehow.

Alternatively, we could have a thing that tries to derive all metrics from the public API, but that seems like we'd be jumping through outrageous hoops that would make life needlessly difficult for end-users (if it's even possible). Maybe that would mean having something like a MetricifiedApnsClient subclass? Hm.

Should we come up with a list of metrics that we would want to see?

Seems like a good idea. From the open branches/pull requests so far (and making no judgment as to whether or not we want all of these), we have:

  • Number of notifications "in flight"
  • Number of accepted notifications
  • Number of rejected notifications
  • Number of write failures
  • Time-to-response for accepted/rejected notifications
  • Number of TLS handshake failures
  • Number of connection attempts (both successful and failed)
  • Number of automated reconnection attempts

@jchambers
Owner Author

@zfarrell I just realized that I misunderstood your suggestion. Sorry! You were talking about Dropwizard Metrics, not the idea of lowercase 'm' metrics in general. In that case, yes, I think a separate package is certainly a possibility.

@flozano

flozano commented Feb 3, 2016

I think, however, that a metrics listener interface or a snapshot metrics object (getCurrentMetrics()) would be great for those of us who don't use Dropwizard Metrics, so please don't expose the metrics in a Dropwizard-only way.

@jchambers
Owner Author

Okay; sounds like there's a clear preference to do something vendor-agnostic. We'll roll with that. Thanks, everybody!

@jchambers
Owner Author

What do you folks think of the metrics_listener branch? It should be very library-agnostic, but I think provides a means to get at all of the important information. If it seems reasonable to everybody, I'll push ahead with docs and maybe a reference implementation or two.

Thanks!

@panchenko
Contributor

I would suggest using something like CopyOnWriteArrayList and avoiding synchronized, in order to make it harder to abuse this listener.

As an advanced optimization, this might be a fork of the necessary parts of CopyOnWriteArrayList which returns a constant for the empty iterator.
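A minimal sketch of what the CopyOnWriteArrayList variant of the dispatch code might look like; the nested listener interface here just mirrors the one under discussion:

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class MetricsListenerRegistry {

    interface ApnsClientMetricsListener {
        void handleConnectionAttemptStarted();
    }

    private final List<ApnsClientMetricsListener> metricsListeners = new CopyOnWriteArrayList<>();

    void addMetricsListener(final ApnsClientMetricsListener listener) {
        this.metricsListeners.add(listener);
    }

    void notifyConnectionAttemptStarted() {
        // Iterates over an immutable snapshot, so no lock is held while listeners run,
        // and a slow listener cannot block threads that are adding or removing listeners.
        for (final ApnsClientMetricsListener listener : this.metricsListeners) {
            listener.handleConnectionAttemptStarted();
        }
    }
}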

@jchambers
Owner Author

We could also conceivably move to just allowing one listener instead of a list. Can you elaborate on the abuse-via-synchronized thing, though?

@zfarrell

Thanks @jchambers, this is looking really good. How would you feel about moving the event subscribing/publishing to a separate service?

So for example, rather than

synchronized (this.metricsListeners) {
    for (final ApnsClientMetricsListener listener : this.metricsListeners) {
        listener.handleConnectionAttemptStarted();
    }
}

we'd have (very rough sketch of what this could look like; many ways this could be accomplished):

// 1st arg: name of the event (probably would be an Enum)
// 2nd arg: reference to the client publishing the event; maybe just an id instead?
// 3rd arg: context object, capable of passing additional context data
eventBus.publish("event.connection.started", this, context);

A couple advantages to this approach:

  1. We could use a separate thread (or thread pool/Exec service) so that the listeners have no impact on a client's performance.
  2. More lightweight when multiple ApnsClients are in use (only one set of listeners for all clients, rather than a set of listeners for each client)
  3. Makes it easier to add custom events (in custom ApnsClient implementations)

Thanks again for being open to the vendor-agnostic approach.

@panchenko
Contributor

We could also conceivably move to just allowing one listener instead of a list.

That would help with avoiding synchronization, especially if the listener is passed only via the constructor.

Can you elaborate on the abuse-via-synchronized thing, though?

I mean that if one is doing time-consuming operations in some listener methods, that could block other threads.

@jmason

jmason commented Feb 22, 2016

Agreed that it would be better to get rid of the 'synchronized' scopes. I would say limiting to a single listener would be fine as a way to avoid this.

Jon -- one thing that's missing is support for Timer measurements; it'd be nice to have a way to time events like "how long does it take to establish a connection" etc. But that could be left as a potential future follow-up -- this is a great start!

zfarrell: IMO it'd be better to do the API this way, and allow users themselves to implement an event bus (or whatever) in their implementation. The 'metrics_listener' approach has the advantage of not creating any objects or imposing any overhead unless the user wants to actually do something with the calls. (It'd be trivial to implement a no-op ApnsClientMetricsListener.)

@flozano

flozano commented Feb 22, 2016

My two cents:

  • The interface looks fine, more than enough to serve all my purposes.
  • One suggestion: make Context in Timer implement AutoCloseable, and make it call stop() by default - that way you can do:
try (Timer.Context ctx = timer.start()) {
    doSomething();
}

Not super-useful in async code, but very handy in synchronous code (a sketch of such an AutoCloseable Context follows at the end of this comment).

  • I am OK with just one listener per ApnsConnection. The listener could multiplex itself out to N listeners for whoever needs that, so it'd be OK to move that complexity inside the listener itself.
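For reference, a minimal sketch of what an AutoCloseable Context could look like; these Timer/Context interfaces are illustrative rather than the ones proposed in #222, and the default method assumes Java 8:

interface Timer {

    Context start();

    // close() delegates to stop(), so a Context can be used directly in try-with-resources.
    interface Context extends AutoCloseable {

        void stop();

        @Override
        default void close() {
            this.stop();
        }
    }
}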

@jchambers
Owner Author

if one is doing time-consuming operations in some listener methods, that could block other threads.

@panchenko True! I don't think that's necessarily tied to the synchronized thing, though—if implementors want to send emails and mine bitcoins every time a notification gets sent, that's going to be an issue regardless of how we're keeping track of the listeners internally. To that point…

We could use a separate thread (or thread pool/Exec service) so that the listeners have no impact on a client's performance.

@zfarrell Also true! I'm not sure if it's worth it, though; I suspect most reasonable implementations (by which I mean those that are published publicly and peer-reviewed) will just be incrementing integers here and there, and I'm not sure how much complexity we want to invest to protect against unreasonable implementations (which are likely to be used only by the author and unlikely to cause widespread harm). My preference for now would be to put a note in the docs that says "don't send email and mine bitcoins," but I'm definitely open to arguments that other approaches would be better.

one thing that's missing is support for Timer measurements; it'd be nice to have a way to time events like "how long does it take to establish a connection" etc.

@jmason Oh, right! I wanted to address that in the docs, but then, uh, didn't ;)

My thinking there was that, for connections at least, listeners will always get a "connection attempt started" signal followed by exactly one of an "attempt failed" or an "attempt succeeded" signal. I think that should be enough for listeners to time connection attempts if they want to, right?

Similarly, for sending notifications, that's what I was hoping to accomplish by passing long notificationId as an argument to all of those methods. The notification ID should allow listeners to correlate the various "send attempt finished" signals with the initial "send attempt started" signal.

My thought with this approach was to (a) avoid imposing a notion of a "timer context" on implementors and (b) keep things really simple (all we have to do internally is increment a number, and we don't have to maintain a notification-to-timer map). Does this seem crazy?
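For illustration, a listener could time send attempts from those correlated signals roughly like this; the handleNotificationSent/Accepted/Rejected method names mirror the proposal and may not match the final API exactly:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class TimingMetricsListener {

    private final Map<Long, Long> sendStartTimes = new ConcurrentHashMap<>();

    public void handleNotificationSent(final long notificationId) {
        this.sendStartTimes.put(notificationId, System.nanoTime());
    }

    public void handleNotificationAccepted(final long notificationId) {
        this.recordDuration(notificationId);
    }

    public void handleNotificationRejected(final long notificationId) {
        this.recordDuration(notificationId);
    }

    private void recordDuration(final long notificationId) {
        final Long startTime = this.sendStartTimes.remove(notificationId);
        if (startTime != null) {
            final long elapsedNanos = System.nanoTime() - startTime;
            // Hand elapsedNanos to whatever Timer or histogram the caller prefers.
        }
    }
}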

Makes it easier to add custom events (in custom ApnsClient's implementations)

I'm generally with @jmason on the let's-not-use-an-event-bus approach here; I'd also point out that if people are writing their own ApnsClient subclasses, they can probably just bake in whatever metrics stuff they want without worrying about the listener interface ;)

Up until now, I had assumed (not for any great reason!) that nobody would actually subclass ApnsClient except to write mock clients that just return immediately-failed or -succeeded futures. @zfarrell do you have other use cases in mind?

Thanks for the continued feedback, everybody!

@jchambers
Owner Author

Oh, also:

More lightweight when multiple ApnsClient's are in use (only one set of listeners for all clients, rather than a set of listeners for each client)

Could we achieve the same thing by adding the client as an argument to the listener methods? For example:

void handleNotificationAccepted(long notificationId);

…would change to:

void handleNotificationAccepted(ApnsClient<? extends ApnsPushNotification> apnsClient, long notificationId);

@jmason

jmason commented Feb 22, 2016

yep, the notificationId thing makes sense -- that gives us enough info for Timer usage.

@jchambers
Owner Author

I've opened #251, which incorporates (I think!) all of the suggestions to date. What do you folks think?
