Skip to content

Conversation

oliver006
Copy link
Contributor

This adds basic garbage collections stats to the metrics that are exported by default by the golang client.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

invocations

These two stats would be better as Summary without percentiles, as they're related Counters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx for the feedback, let me have a look at that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that'd be a single summary called process_gc_pause_seconds, which then generates two metrics:

  • process_gc_pause_seconds_count (the total GC count)
  • process_gc_pause_seconds_sum (the total number of seconds)

@brian-brazil Not sure though whether I like that more than having two separate explicit counters. From a user perspective, I always felt that explicit (non-summary) counter metrics have the better/clearer names. I know that we're currently using the summary-counter approach quite often, but that's usually in the context of also collecting quantiles via observations. In this case, we don't even have the individual observations/events, only two existing counters that we are bridging. Wouldn't it be better to keep them as normal counters in that case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(you'd have to generate synthetic summary observations otherwise based on the increases since the last scrape, so I think the two counters are good as is)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary is what we'd use if were instrumenting this directly, having the standard _sum and _count timeseries makes it more obvious what you can do with the metric. If it's technically tricky to do a Summary then two Counters are okay, but they should still follow the Summary naming scheme.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The values are a total count of events/seconds, which is what a Counter is. If they were e.g. the last time a GC ran or the last GC duration it'd be a Gauge.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, even though the values are already provided to us and we don't need to "count" them, they are fundamentally cumulative counters (meaning you can do rates over them, etc.), and the client library should export them typed as such.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks for the pointers, didn't see the Set() routine for Counter.

@juliusv
Copy link
Member

juliusv commented Apr 30, 2015

@oliver006 Thanks, this is great! I think the best place for these new stats would be in the existing Go collector: https://github.com/prometheus/client_golang/blob/master/prometheus/go_collector.go, but I'll let @beorn7 be the authoritative person to judge that.

@beorn7
Copy link
Member

beorn7 commented Apr 30, 2015

@oliver006 Yeah, this should definitely in the Go collector. It has only one metric there, and we always wanted to add GC metrics there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, the intention was to make it part of the goCollector instead of having a separate gcCollector (not just move it to the same file).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@fabxc
Copy link
Contributor

fabxc commented Apr 30, 2015

The current build is failing as you are not importing runtime/debug.

@oliver006
Copy link
Contributor Author

Thanks for all the feedback, changed the Gauges to Counters and moved it all into goCollector as suggested.

@juliusv
Copy link
Member

juliusv commented Apr 30, 2015

👍 from my side, waiting for authoritative 👍 by @beorn7.

Should we keep the metric names process_gc_num_total (as it is named NumGC internally in Go), or should it be named more verbosely, process_gc_invocations_total?

@fabxc
Copy link
Contributor

fabxc commented Apr 30, 2015

👍 for process_gc_invocations_total - for other metrics we are always going with the verbose route.

@oliver006
Copy link
Contributor Author

Cool, changed it to process_gc_invocations_total.

@brian-brazil
Copy link
Contributor

It should be process_gc_invocations_seconds_count/_sum or similar, otherwise the calculation of the average gc pause will end up as instance:process_gc_invocations_per_process_gc_pause_seconds:ratio_rate5m rather than the more standard instance:process_gc_invocations_seconds:mean5m

@juliusv
Copy link
Member

juliusv commented Apr 30, 2015

Poor @oliver006! Sorry, it seems we need to get in agreement internally first.

@brian-brazil, can I not still call the resulting recording rule the latter (more standard) way, even if the input metrics have somewhat different (more human-friendly) names? I see that this means the mapping between original/divided metric names won't be as clear anymore, but it also means that the original metric names will be less obtuse. Given that the summary-style naming was downvoted by @beorn7 before, it seems this needs to be discussed some more.

@oliver006
Copy link
Contributor Author

Ha, no worries, I'll wait till you guys agreed on a name for the metric.

@brian-brazil
Copy link
Contributor

@juliusv We had a debate on the Summary naming a while back, and decided to leave it as _sum/_count.

can I not still call the resulting recording rule the latter (more standard) way, even if the input metrics have somewhat different (more human-friendly) names?

The point of the standard naming is that you can look at a variable and tell where it comes from, if you feel the need to change the naming mid-stream you're better off changing the instrumentation to export what you wanted in the first place.

If the concern is the name is unclear then we can improve the base of the name, the Summary suffixes will ultimately make it clearer and more consistent across the entire ecosystem. We should be avoiding special cases for local clarity, as it ultimately adds complexity that makes everything more difficult to work with.

@beorn7
Copy link
Member

beorn7 commented May 1, 2015

On Fri, May 1, 2015 at 11:42 AM, Brian Brazil notifications@github.com wrote:

@juliusv We had a debate on the Summary naming a while back, and decided to leave it as _sum/_count.

Yes, but:

(1) This naming scheme is, as said, for summaries (and now histograms).

(2) The naming is invisible from the client side. It only happens when
a complex Prometheus metric type (summary or histogram) has to be
broken down into a number of simple time series. This break down has
to happen on the server because of an unfortunate limitation of the
server (it internally treats everything as gauges, essentially). It
also has to happen for the text format. But the text format is a
compromise with limited scope. The canonical representation of the
Prometheus exposition format is the protobuf definition.

Arguably, this case is not a summary in the first place. The one
counter counts GC invocations. The other counter sums up time spent in
pauses for GC. Depending on the GC implementation, the duration of the
GC is not necessarily the time spent in pauses. Future implementations
might not stop the world anymore for GC, at all.

It would be a different discussion if we regularly want to use
summaries without quantiles for anything that sums one thing up and
counts that thing at the same time. If we wanted that, we needed an
API in the client libs that allows to set the sum and count directly
to copy it over from a 3rd party system (as in this case).

Brian, you argue for mirroring the internal naming scheme for
broken-down summaries into the client space. That makes sense from the
perspective of the user dealing with time series and expressions on
the server side, because that user has to deal with the broken-down
summaries for as long as there is no first-class support for complex
metric types on the server. The problem I have with that is that the
naming does not play well with the naming outside of
summaries/histograms:

Organically, you would call a counter counting requests
requests_total. A counter counting the duration spent in the
requests would be called request_duration_seconds_total.

However, a summary or histogram would be called
request_duration_seconds. The autogenerated metrics (by the
break-down) are called request_duration_seconds_sum and
request_duration_seconds_count. That makes sense in the context of a
summary/histogram because it doesn't change the original name and just
adds a suffix. We could not implement an algorithm that would do the
transformation from request_duration_seconds to requests_total
because that algorithm needed an understanding of English
pluralization and of the meaning of the components of the
summary/histogram name (not all are durations...). On the other hand,
we cannot generally transform request_duration_seconds into
request_duration_seconds_total because there are
summaries/histograms with negative observations so that
request_duration_seconds_sum is sometimes a gauge and therefore must
not be called request_duration_seconds_total, which implies a
counter.

In summary, I want to limit the _sum/_count naming scheme to the
automated metric break-down of "true" summaries and histograms.
In general, if I count a thing and sum up times for that thing, doing
that in different metrics named in isolation and appropriately is OK.
In this case, using a quantile less summary causes confusion for the
reasons stated above (GC pause != time spent in GC; the count is a

count of GCs, not of pauses).

Björn Rabenstein, Engineer
http://soundcloud.com/brabenstein

SoundCloud Ltd. | Rheinsberger Str. 76/77, 10115 Berlin, Germany
Managing Director: Alexander Ljung | Incorporated in England & Wales
with Company No. 6343600 | Local Branch Office | AG Charlottenburg |
HRB 110657B

@brian-brazil
Copy link
Contributor

Arguably, this case is not a summary in the first place. The one
counter counts GC invocations. The other counter sums up time spent in
pauses for GC. Depending on the GC implementation, the duration of the
GC is not necessarily the time spent in pauses.

That's a cpu time vs. wall time, which is a matter for general metric naming rather than which type of metric.

This is a Summary.

If we wanted that, we needed an
API in the client libs that allows to set the sum and count directly
to copy it over from a 3rd party system (as in this case).

The Go client lacks that, but Java and Python have it (and use it).

In general, if I count a thing and sum up times for that thing, doing
that in different metrics named in isolation and appropriately is OK.

I strongly disagree, that's a Summary and should appear to the user accordingly. Having the user have to worry individually about each special case isn't a scalable approach.

It sounds like the issue here is a deficiency in the go client library, as it doesn't allow direct proxying of Summarys from other monitoring systems. Can we add that?

@beorn7
Copy link
Member

beorn7 commented May 1, 2015

The deficiency in the go client library to mirror summaries and histograms is a real issue (filed #114 ), but it's not the issue here. The issue here is if the two counters are a summary at all or not.

As reported earlier, I had the impression that pause time (and number of pauses) is different from number of GC runs and the time spent for the GC. The Go docs are not exactly conclusive here. Therefore, I checked out the source code. Conclusion: At least with the current implementation, "one pause" is identical to "one GC run", and the pause time is identical tot he GC time. I'd still expect that we need to revisit once the garbage collector is changed to not stop the world, but for now, the two values are indeed a summary. BTW: The GCStats struct even contains quantiles we can mirror into the summary, too. Which sells the rest of me for the Summary (as we can have actual quantiles in it).

I'd suggest process_gc_duration_seconds as the name for the summary to avoid all the misunderstandings the naming in the Go library causes.

And of course, we first need to fix client_golang to support the mirroring. I can look into it soon-ish.

@juliusv
Copy link
Member

juliusv commented May 1, 2015

Nice to see a conclusion and agreement here!

@oliver006 Ok, looks like this'll unfortunately have to sit here for a short while until the client library supports proxying summaries. The reason why it's so important to get this "right" the first time around is that these metrics will end up being exported by every instance in the world instrumented with this library, and changing them later could cause pain to a lot of users (we try our best to avoid breaking changes, though it's not always possible at this stage yet).

@oliver006
Copy link
Contributor Author

No worries, glad to see this got resolved. Ping me when the client library got updated and I'll make the changes to the PR.

@brian-brazil
Copy link
Contributor

The other thing that just occurred to me is that as these metrics have the process_ prefix they would have to follow the rule for the standard exports: https://docs.google.com/document/d/1Q0MXWdwp1mdXCzNRak6bW5LLVylVRXhdi7_21Sg15xQ/edit

How GC works is going to vary wildly across languages and implementations, so I doubt we'd be able to come up with standard exports for it. Following the naming in the hotspot stats (jvm_) and korfuri's python gc stats (python_), these should be prefixed with go_ or similar.

@beorn7
Copy link
Member

beorn7 commented May 4, 2015

@brian-brazil Good point.

@oliver006 PR for statically settable summary imminent.

@beorn7
Copy link
Member

beorn7 commented May 4, 2015

PR is in. @oliver006 feel free to ask if you have any questions.

@oliver006
Copy link
Contributor Author

@beorn7 great, thanks, I'll have a look later today and push up my changes.

@oliver006
Copy link
Contributor Author

Pushed up my changes to use the ConstSummary - let me know what you think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per #113 (comment), this should be process_go_gc_duration_seconds.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just go_gc_duration_seconds

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why skip the min? The Go runtime provides it, why not mirror it, too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As what? The zero-th pecentile?
Something like quantiles[0.0] = stats.PauseQuantiles[0].Seconds() ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that would make most sense, IMHO. Objections welcome. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, done.

@beorn7
Copy link
Member

beorn7 commented May 5, 2015

👍 Everybody else happy? Will merge in 3... 2... 1...

@brian-brazil
Copy link
Contributor

👍

beorn7 added a commit that referenced this pull request May 5, 2015
@beorn7 beorn7 merged commit 86bd077 into prometheus:master May 5, 2015
@oliver006
Copy link
Contributor Author

👍 thx everybody for the help & feedback!

@juliusv
Copy link
Member

juliusv commented May 6, 2015

@oliver006 Thanks for the patience :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants