-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Add garbage collection stats #113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
prometheus/gc_collector.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
invocations
These two stats would be better as Summary without percentiles, as they're related Counters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx for the feedback, let me have a look at that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that'd be a single summary called process_gc_pause_seconds
, which then generates two metrics:
process_gc_pause_seconds_count
(the total GC count)process_gc_pause_seconds_sum
(the total number of seconds)
@brian-brazil Not sure though whether I like that more than having two separate explicit counters. From a user perspective, I always felt that explicit (non-summary) counter metrics have the better/clearer names. I know that we're currently using the summary-counter approach quite often, but that's usually in the context of also collecting quantiles via observations. In this case, we don't even have the individual observations/events, only two existing counters that we are bridging. Wouldn't it be better to keep them as normal counters in that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(you'd have to generate synthetic summary observations otherwise based on the increases since the last scrape, so I think the two counters are good as is)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summary is what we'd use if were instrumenting this directly, having the standard _sum and _count timeseries makes it more obvious what you can do with the metric. If it's technically tricky to do a Summary then two Counters are okay, but they should still follow the Summary naming scheme.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The values are a total count of events/seconds, which is what a Counter is. If they were e.g. the last time a GC ran or the last GC duration it'd be a Gauge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For comparison, what we do over in Java is https://github.com/prometheus/client_java/blob/master/simpleclient_hotspot/src/main/java/io/prometheus/client/hotspot/GarbageCollectorExports.java#L39
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, even though the values are already provided to us and we don't need to "count" them, they are fundamentally cumulative counters (meaning you can do rates over them, etc.), and the client library should export them typed as such.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, thanks for the pointers, didn't see the Set()
routine for Counter.
@oliver006 Thanks, this is great! I think the best place for these new stats would be in the existing Go collector: https://github.com/prometheus/client_golang/blob/master/prometheus/go_collector.go, but I'll let @beorn7 be the authoritative person to judge that. |
@oliver006 Yeah, this should definitely in the Go collector. It has only one metric there, and we always wanted to add GC metrics there. |
prometheus/go_collector.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, the intention was to make it part of the goCollector
instead of having a separate gcCollector
(not just move it to the same file).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
The current build is failing as you are not importing |
Thanks for all the feedback, changed the Gauges to Counters and moved it all into goCollector as suggested. |
👍 from my side, waiting for authoritative 👍 by @beorn7. Should we keep the metric names |
👍 for |
Cool, changed it to |
It should be process_gc_invocations_seconds_count/_sum or similar, otherwise the calculation of the average gc pause will end up as instance:process_gc_invocations_per_process_gc_pause_seconds:ratio_rate5m rather than the more standard instance:process_gc_invocations_seconds:mean5m |
Poor @oliver006! Sorry, it seems we need to get in agreement internally first. @brian-brazil, can I not still call the resulting recording rule the latter (more standard) way, even if the input metrics have somewhat different (more human-friendly) names? I see that this means the mapping between original/divided metric names won't be as clear anymore, but it also means that the original metric names will be less obtuse. Given that the summary-style naming was downvoted by @beorn7 before, it seems this needs to be discussed some more. |
Ha, no worries, I'll wait till you guys agreed on a name for the metric. |
@juliusv We had a debate on the Summary naming a while back, and decided to leave it as _sum/_count.
The point of the standard naming is that you can look at a variable and tell where it comes from, if you feel the need to change the naming mid-stream you're better off changing the instrumentation to export what you wanted in the first place. If the concern is the name is unclear then we can improve the base of the name, the Summary suffixes will ultimately make it clearer and more consistent across the entire ecosystem. We should be avoiding special cases for local clarity, as it ultimately adds complexity that makes everything more difficult to work with. |
On Fri, May 1, 2015 at 11:42 AM, Brian Brazil notifications@github.com wrote:
Yes, but: (1) This naming scheme is, as said, for summaries (and now histograms). (2) The naming is invisible from the client side. It only happens when Arguably, this case is not a summary in the first place. The one It would be a different discussion if we regularly want to use Brian, you argue for mirroring the internal naming scheme for Organically, you would call a counter counting requests However, a summary or histogram would be called In summary, I want to limit the _sum/_count naming scheme to the count of GCs, not of pauses).Björn Rabenstein, Engineer SoundCloud Ltd. | Rheinsberger Str. 76/77, 10115 Berlin, Germany |
That's a cpu time vs. wall time, which is a matter for general metric naming rather than which type of metric. This is a Summary.
The Go client lacks that, but Java and Python have it (and use it).
I strongly disagree, that's a Summary and should appear to the user accordingly. Having the user have to worry individually about each special case isn't a scalable approach. It sounds like the issue here is a deficiency in the go client library, as it doesn't allow direct proxying of Summarys from other monitoring systems. Can we add that? |
The deficiency in the go client library to mirror summaries and histograms is a real issue (filed #114 ), but it's not the issue here. The issue here is if the two counters are a summary at all or not. As reported earlier, I had the impression that pause time (and number of pauses) is different from number of GC runs and the time spent for the GC. The Go docs are not exactly conclusive here. Therefore, I checked out the source code. Conclusion: At least with the current implementation, "one pause" is identical to "one GC run", and the pause time is identical tot he GC time. I'd still expect that we need to revisit once the garbage collector is changed to not stop the world, but for now, the two values are indeed a summary. BTW: The GCStats struct even contains quantiles we can mirror into the summary, too. Which sells the rest of me for the Summary (as we can have actual quantiles in it). I'd suggest And of course, we first need to fix client_golang to support the mirroring. I can look into it soon-ish. |
Nice to see a conclusion and agreement here! @oliver006 Ok, looks like this'll unfortunately have to sit here for a short while until the client library supports proxying summaries. The reason why it's so important to get this "right" the first time around is that these metrics will end up being exported by every instance in the world instrumented with this library, and changing them later could cause pain to a lot of users (we try our best to avoid breaking changes, though it's not always possible at this stage yet). |
No worries, glad to see this got resolved. Ping me when the client library got updated and I'll make the changes to the PR. |
The other thing that just occurred to me is that as these metrics have the process_ prefix they would have to follow the rule for the standard exports: https://docs.google.com/document/d/1Q0MXWdwp1mdXCzNRak6bW5LLVylVRXhdi7_21Sg15xQ/edit How GC works is going to vary wildly across languages and implementations, so I doubt we'd be able to come up with standard exports for it. Following the naming in the hotspot stats (jvm_) and korfuri's python gc stats (python_), these should be prefixed with go_ or similar. |
@brian-brazil Good point. @oliver006 PR for statically settable summary imminent. |
PR is in. @oliver006 feel free to ask if you have any questions. |
@beorn7 great, thanks, I'll have a look later today and push up my changes. |
Pushed up my changes to use the ConstSummary - let me know what you think. |
prometheus/go_collector.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per #113 (comment), this should be process_go_gc_duration_seconds
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just go_gc_duration_seconds
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But why skip the min? The Go runtime provides it, why not mirror it, too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As what? The zero-th pecentile?
Something like quantiles[0.0] = stats.PauseQuantiles[0].Seconds()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that would make most sense, IMHO. Objections welcome. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, done.
👍 Everybody else happy? Will merge in 3... 2... 1... |
👍 |
👍 thx everybody for the help & feedback! |
@oliver006 Thanks for the patience :) |
This adds basic garbage collections stats to the metrics that are exported by default by the golang client.