Add garbage collection stats #113

oliver006 · 2015-04-30T14:02:33Z

This adds basic garbage collections stats to the metrics that are exported by default by the golang client.

brian-brazil · 2015-04-30T14:09:48Z

prometheus/gc_collector.go

invocations

These two stats would be better as Summary without percentiles, as they're related Counters.

Thx for the feedback, let me have a look at that.

I guess that'd be a single summary called process_gc_pause_seconds, which then generates two metrics:

process_gc_pause_seconds_count (the total GC count)

process_gc_pause_seconds_sum (the total number of seconds)

@brian-brazil Not sure though whether I like that more than having two separate explicit counters. From a user perspective, I always felt that explicit (non-summary) counter metrics have the better/clearer names. I know that we're currently using the summary-counter approach quite often, but that's usually in the context of also collecting quantiles via observations. In this case, we don't even have the individual observations/events, only two existing counters that we are bridging. Wouldn't it be better to keep them as normal counters in that case?

(you'd have to generate synthetic summary observations otherwise based on the increases since the last scrape, so I think the two counters are good as is)

Summary is what we'd use if were instrumenting this directly, having the standard _sum and _count timeseries makes it more obvious what you can do with the metric. If it's technically tricky to do a Summary then two Counters are okay, but they should still follow the Summary naming scheme.

The values are a total count of events/seconds, which is what a Counter is. If they were e.g. the last time a GC ran or the last GC duration it'd be a Gauge.

For comparison, what we do over in Java is https://github.com/prometheus/client_java/blob/master/simpleclient_hotspot/src/main/java/io/prometheus/client/hotspot/GarbageCollectorExports.java#L39

Indeed, even though the values are already provided to us and we don't need to "count" them, they are fundamentally cumulative counters (meaning you can do rates over them, etc.), and the client library should export them typed as such.

See also https://github.com/prometheus/client_golang/blob/master/prometheus/counter.go#L34-L39

Ah, thanks for the pointers, didn't see the Set() routine for Counter.

juliusv · 2015-04-30T15:30:13Z

@oliver006 Thanks, this is great! I think the best place for these new stats would be in the existing Go collector: https://github.com/prometheus/client_golang/blob/master/prometheus/go_collector.go, but I'll let @beorn7 be the authoritative person to judge that.

beorn7 · 2015-04-30T15:56:01Z

@oliver006 Yeah, this should definitely in the Go collector. It has only one metric there, and we always wanted to add GC metrics there.

juliusv · 2015-04-30T16:46:34Z

prometheus/go_collector.go

Ah, the intention was to make it part of the goCollector instead of having a separate gcCollector (not just move it to the same file).

fabxc · 2015-04-30T17:42:46Z

The current build is failing as you are not importing runtime/debug.

oliver006 · 2015-04-30T19:50:49Z

Thanks for all the feedback, changed the Gauges to Counters and moved it all into goCollector as suggested.

juliusv · 2015-04-30T19:57:13Z

👍 from my side, waiting for authoritative 👍 by @beorn7.

Should we keep the metric names process_gc_num_total (as it is named NumGC internally in Go), or should it be named more verbosely, process_gc_invocations_total?

fabxc · 2015-04-30T20:24:33Z

👍 for process_gc_invocations_total - for other metrics we are always going with the verbose route.

oliver006 · 2015-04-30T20:58:48Z

Cool, changed it to process_gc_invocations_total.

brian-brazil · 2015-04-30T21:17:10Z

It should be process_gc_invocations_seconds_count/_sum or similar, otherwise the calculation of the average gc pause will end up as instance:process_gc_invocations_per_process_gc_pause_seconds:ratio_rate5m rather than the more standard instance:process_gc_invocations_seconds:mean5m

juliusv · 2015-04-30T21:46:15Z

Poor @oliver006! Sorry, it seems we need to get in agreement internally first.

@brian-brazil, can I not still call the resulting recording rule the latter (more standard) way, even if the input metrics have somewhat different (more human-friendly) names? I see that this means the mapping between original/divided metric names won't be as clear anymore, but it also means that the original metric names will be less obtuse. Given that the summary-style naming was downvoted by @beorn7 before, it seems this needs to be discussed some more.

oliver006 · 2015-04-30T22:34:33Z

Ha, no worries, I'll wait till you guys agreed on a name for the metric.

brian-brazil · 2015-05-01T09:42:51Z

@juliusv We had a debate on the Summary naming a while back, and decided to leave it as _sum/_count.

can I not still call the resulting recording rule the latter (more standard) way, even if the input metrics have somewhat different (more human-friendly) names?

The point of the standard naming is that you can look at a variable and tell where it comes from, if you feel the need to change the naming mid-stream you're better off changing the instrumentation to export what you wanted in the first place.

If the concern is the name is unclear then we can improve the base of the name, the Summary suffixes will ultimately make it clearer and more consistent across the entire ecosystem. We should be avoiding special cases for local clarity, as it ultimately adds complexity that makes everything more difficult to work with.

beorn7 · 2015-05-01T12:46:42Z

On Fri, May 1, 2015 at 11:42 AM, Brian Brazil notifications@github.com wrote:

@juliusv We had a debate on the Summary naming a while back, and decided to leave it as _sum/_count.

Yes, but:

(1) This naming scheme is, as said, for summaries (and now histograms).

(2) The naming is invisible from the client side. It only happens when
a complex Prometheus metric type (summary or histogram) has to be
broken down into a number of simple time series. This break down has
to happen on the server because of an unfortunate limitation of the
server (it internally treats everything as gauges, essentially). It
also has to happen for the text format. But the text format is a
compromise with limited scope. The canonical representation of the
Prometheus exposition format is the protobuf definition.

Arguably, this case is not a summary in the first place. The one
counter counts GC invocations. The other counter sums up time spent in
pauses for GC. Depending on the GC implementation, the duration of the
GC is not necessarily the time spent in pauses. Future implementations
might not stop the world anymore for GC, at all.

It would be a different discussion if we regularly want to use
summaries without quantiles for anything that sums one thing up and
counts that thing at the same time. If we wanted that, we needed an
API in the client libs that allows to set the sum and count directly
to copy it over from a 3rd party system (as in this case).

Brian, you argue for mirroring the internal naming scheme for
broken-down summaries into the client space. That makes sense from the
perspective of the user dealing with time series and expressions on
the server side, because that user has to deal with the broken-down
summaries for as long as there is no first-class support for complex
metric types on the server. The problem I have with that is that the
naming does not play well with the naming outside of
summaries/histograms:

Organically, you would call a counter counting requests
requests_total. A counter counting the duration spent in the
requests would be called request_duration_seconds_total.

However, a summary or histogram would be called
request_duration_seconds. The autogenerated metrics (by the
break-down) are called request_duration_seconds_sum and
request_duration_seconds_count. That makes sense in the context of a
summary/histogram because it doesn't change the original name and just
adds a suffix. We could not implement an algorithm that would do the
transformation from request_duration_seconds to requests_total
because that algorithm needed an understanding of English
pluralization and of the meaning of the components of the
summary/histogram name (not all are durations...). On the other hand,
we cannot generally transform request_duration_seconds into
request_duration_seconds_total because there are
summaries/histograms with negative observations so that
request_duration_seconds_sum is sometimes a gauge and therefore must
not be called request_duration_seconds_total, which implies a
counter.

In summary, I want to limit the _sum/_count naming scheme to the
automated metric break-down of "true" summaries and histograms.
In general, if I count a thing and sum up times for that thing, doing
that in different metrics named in isolation and appropriately is OK.
In this case, using a quantile less summary causes confusion for the
reasons stated above (GC pause != time spent in GC; the count is a

count of GCs, not of pauses).

Björn Rabenstein, Engineer
http://soundcloud.com/brabenstein

brian-brazil · 2015-05-01T13:03:25Z

Arguably, this case is not a summary in the first place. The one
counter counts GC invocations. The other counter sums up time spent in
pauses for GC. Depending on the GC implementation, the duration of the
GC is not necessarily the time spent in pauses.

That's a cpu time vs. wall time, which is a matter for general metric naming rather than which type of metric.

This is a Summary.

If we wanted that, we needed an
API in the client libs that allows to set the sum and count directly
to copy it over from a 3rd party system (as in this case).

The Go client lacks that, but Java and Python have it (and use it).

In general, if I count a thing and sum up times for that thing, doing
that in different metrics named in isolation and appropriately is OK.

I strongly disagree, that's a Summary and should appear to the user accordingly. Having the user have to worry individually about each special case isn't a scalable approach.

It sounds like the issue here is a deficiency in the go client library, as it doesn't allow direct proxying of Summarys from other monitoring systems. Can we add that?

beorn7 · 2015-05-01T18:51:16Z

The deficiency in the go client library to mirror summaries and histograms is a real issue (filed #114 ), but it's not the issue here. The issue here is if the two counters are a summary at all or not.

As reported earlier, I had the impression that pause time (and number of pauses) is different from number of GC runs and the time spent for the GC. The Go docs are not exactly conclusive here. Therefore, I checked out the source code. Conclusion: At least with the current implementation, "one pause" is identical to "one GC run", and the pause time is identical tot he GC time. I'd still expect that we need to revisit once the garbage collector is changed to not stop the world, but for now, the two values are indeed a summary. BTW: The GCStats struct even contains quantiles we can mirror into the summary, too. Which sells the rest of me for the Summary (as we can have actual quantiles in it).

I'd suggest process_gc_duration_seconds as the name for the summary to avoid all the misunderstandings the naming in the Go library causes.

And of course, we first need to fix client_golang to support the mirroring. I can look into it soon-ish.

juliusv · 2015-05-01T19:44:15Z

Nice to see a conclusion and agreement here!

@oliver006 Ok, looks like this'll unfortunately have to sit here for a short while until the client library supports proxying summaries. The reason why it's so important to get this "right" the first time around is that these metrics will end up being exported by every instance in the world instrumented with this library, and changing them later could cause pain to a lot of users (we try our best to avoid breaking changes, though it's not always possible at this stage yet).

oliver006 · 2015-05-01T19:46:50Z

No worries, glad to see this got resolved. Ping me when the client library got updated and I'll make the changes to the PR.

brian-brazil · 2015-05-02T06:10:12Z

The other thing that just occurred to me is that as these metrics have the process_ prefix they would have to follow the rule for the standard exports: https://docs.google.com/document/d/1Q0MXWdwp1mdXCzNRak6bW5LLVylVRXhdi7_21Sg15xQ/edit

How GC works is going to vary wildly across languages and implementations, so I doubt we'd be able to come up with standard exports for it. Following the naming in the hotspot stats (jvm_) and korfuri's python gc stats (python_), these should be prefixed with go_ or similar.

beorn7 · 2015-05-04T10:23:31Z

@brian-brazil Good point.

@oliver006 PR for statically settable summary imminent.

beorn7 · 2015-05-04T12:55:26Z

PR is in. @oliver006 feel free to ask if you have any questions.

oliver006 · 2015-05-04T14:01:31Z

@beorn7 great, thanks, I'll have a look later today and push up my changes.

oliver006 · 2015-05-04T18:11:48Z

Pushed up my changes to use the ConstSummary - let me know what you think.

juliusv · 2015-05-05T04:47:10Z

prometheus/go_collector.go

As per #113 (comment), this should be process_go_gc_duration_seconds.

Just go_gc_duration_seconds

beorn7 · 2015-05-05T13:55:21Z

prometheus/go_collector.go

But why skip the min? The Go runtime provides it, why not mirror it, too?

As what? The zero-th pecentile?
Something like quantiles[0.0] = stats.PauseQuantiles[0].Seconds() ?

Yes, that would make most sense, IMHO. Objections welcome. :)

beorn7 · 2015-05-05T16:39:23Z

👍 Everybody else happy? Will merge in 3... 2... 1...

brian-brazil · 2015-05-05T16:41:42Z

👍

Add garbage collection stats

oliver006 · 2015-05-06T03:26:07Z

👍 thx everybody for the help & feedback!

juliusv · 2015-05-06T06:19:25Z

@oliver006 Thanks for the patience :)

brian-brazil reviewed Apr 30, 2015
View reviewed changes

juliusv reviewed Apr 30, 2015
View reviewed changes

juliusv reviewed May 5, 2015
View reviewed changes

Add garbage collection stats

ff586ea

beorn7 reviewed May 5, 2015
View reviewed changes

record quantiles as well

21b132f

beorn7 added a commit that referenced this pull request May 5, 2015

Merge pull request #113 from oliver006/add_gc_metrics

86bd077

Add garbage collection stats

beorn7 merged commit 86bd077 into prometheus:master May 5, 2015

Add garbage collection stats #113

Add garbage collection stats #113

Uh oh!

Conversation

oliver006 commented Apr 30, 2015

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

juliusv commented Apr 30, 2015

Uh oh!

beorn7 commented Apr 30, 2015

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

fabxc commented Apr 30, 2015

Uh oh!

oliver006 commented Apr 30, 2015

Uh oh!

juliusv commented Apr 30, 2015

Uh oh!

fabxc commented Apr 30, 2015

Uh oh!

oliver006 commented Apr 30, 2015

Uh oh!

brian-brazil commented Apr 30, 2015

Uh oh!

juliusv commented Apr 30, 2015

Uh oh!

oliver006 commented Apr 30, 2015

Uh oh!

brian-brazil commented May 1, 2015

Uh oh!

beorn7 commented May 1, 2015

count of GCs, not of pauses).

Uh oh!

brian-brazil commented May 1, 2015

Uh oh!

beorn7 commented May 1, 2015

Uh oh!

juliusv commented May 1, 2015

Uh oh!

oliver006 commented May 1, 2015

Uh oh!

brian-brazil commented May 2, 2015

Uh oh!

beorn7 commented May 4, 2015

Uh oh!

beorn7 commented May 4, 2015

Uh oh!

oliver006 commented May 4, 2015

Uh oh!

oliver006 commented May 4, 2015

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment