(prometheus) Added the possibility to generate gauge metrics from distribution based metrics #1011

Merged
merged 1 commit into from
May 10, 2021

Contributor

@pnerg pnerg commented Apr 29, 2021

This is hopefully the answer to several discussions (#899, #914) about why some of the measurements use range samplers rather than gauges.

This PR adds the possibility to configure (by name pattern) which distribution-based (range sampler, timer, histogram) metrics should also generate gauge metrics.

E.g. configuring something like this:

kamon.prometheus.gauges.metrics = ["http.server.*"]

This would also render gauge metrics for all http-server metrics.
The existing histogram/summary metrics will still remain; these gauges are add-ons.
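As a rough illustration of the name-pattern idea (not the reporter's actual code), the sketch below checks metric names against a list of patterns. It assumes plain shell-style glob semantics via Python's `fnmatch`, where `*` matches any run of characters including dots; the real filter may treat wildcards differently. `GAUGE_PATTERNS` and `wants_gauges` are hypothetical names.

```python
from fnmatch import fnmatchcase

# Hypothetical stand-in for the configured kamon.prometheus.gauges.metrics list.
GAUGE_PATTERNS = ["http.server.*"]


def wants_gauges(metric_name, patterns=GAUGE_PATTERNS):
    """Return True if the metric name matches any configured pattern.

    Assumption: simple case-sensitive glob matching, where '*' also spans dots.
    """
    return any(fnmatchcase(metric_name, pattern) for pattern in patterns)


print(wants_gauges("http.server.connection.open"))  # matches "http.server.*"
print(wants_gauges("jvm.memory.used"))              # does not match
```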

For each distribution metric that matches the configured filter, three gauges are created:

  • min: the min value seen in the distribution of the snapshot
  • max: the max value seen in the distribution of the snapshot
  • sum: the sum of all values seen in the distribution of the snapshot, which is relevant for understanding how many of something (e.g. connections) have been measured in the interval

E.g. running with akka-http and enabling gauge metrics according to the above configuration, I'd get something like this for the range sampler metric http.server.connection.open:

# HELP http_server_connection_open_min Number of open connections
# TYPE http_server_connection_open_min gauge
http_server_connection_open_min{port="9696",interface="0.0.0.0",component="akka.http.server"} 0.0
# HELP http_server_connection_open_max Number of open connections
# TYPE http_server_connection_open_max gauge
http_server_connection_open_max{port="9696",interface="0.0.0.0",component="akka.http.server"} 4.0
# HELP http_server_connection_open_sum Number of open connections
# TYPE http_server_connection_open_sum gauge
http_server_connection_open_sum{port="9696",interface="0.0.0.0",component="akka.http.server"} 600.0

When I stop running requests, the gauges eventually drop to zero.
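For readers unfamiliar with the text exposition format, here is a minimal sketch of how one of the gauge entries shown above could be rendered. The function names are hypothetical, the dots-to-underscores naming is inferred from the sample output only, and label values are not escaped, so this is not the reporter's implementation.

```python
def prometheus_name(metric_name, suffix):
    """Turn a dotted metric name into an underscore-separated gauge name."""
    return metric_name.replace(".", "_") + "_" + suffix


def render_gauge(metric_name, suffix, description, labels, value):
    """Render one gauge as HELP, TYPE and sample lines (no label escaping)."""
    name = prometheus_name(metric_name, suffix)
    label_text = ",".join(f'{key}="{val}"' for key, val in labels.items())
    return "\n".join([
        f"# HELP {name} {description}",
        f"# TYPE {name} gauge",
        f"{name}{{{label_text}}} {float(value)}",
    ])


labels = {"port": "9696", "interface": "0.0.0.0", "component": "akka.http.server"}
rendered = render_gauge(
    "http.server.connection.open", "max", "Number of open connections", labels, 4
)
print(rendered)
```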

I think this is a non-intrusive way of providing more suitable metric formats for Prometheus for some of the use cases where one just wants an easy way to see what is going on in the app here and now.

@SimunKaracic
Copy link
Contributor

As usual, great work!
@ivantopo will also take a look at this in the next few days, just to make sure everything is ok, and we'll merge after that

@ivantopo
Copy link
Contributor

ivantopo commented May 5, 2021

Hey @pnerg! Thanks again for the dedication and contributions man 😍

Regarding this comment:

sum: the sum of all values seen in the distribution of the snapshot, which is relevant for understanding how many of something (e.g. connections) have been measured in the interval

The sum will not really be "how many of something" in this case, especially not with a range sampler. For example, if you were measuring the number of open connections to a database with a range sampler and during an entire reporting interval there were exactly 10 open connections, the reported sum would be 9000 (summing the value "10" three times every 200ms for the whole minute).
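The arithmetic behind the 9000 figure can be spelled out as below. The constants are taken from the example in the comment (a one-minute reporting interval, a 200 ms sampling tick, three recordings per tick); they are not read from any actual configuration.

```python
REPORTING_INTERVAL_S = 60    # "the whole minute"
SAMPLING_INTERVAL_MS = 200   # one sampling tick every 200 ms
RECORDS_PER_TICK = 3         # the value is summed "three times" per tick
OPEN_CONNECTIONS = 10        # constant for the entire interval

ticks = REPORTING_INTERVAL_S * 1000 // SAMPLING_INTERVAL_MS  # 300 ticks
records = ticks * RECORDS_PER_TICK                           # 900 recorded values
reported_sum = records * OPEN_CONNECTIONS                    # 9000

# Dividing by the number of recorded values recovers the steady value.
average = reported_sum / records                             # 10.0

print(ticks, records, reported_sum, average)
```

So the sum grows with the number of samples taken, not with the number of connections.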

I'm working on a blog post related to range samplers here: https://github.com/kamon-io/kamon.io/blob/c825fb60e0a3646804246496b4c0af38663408a8/_posts/2021-04-29-monitoring-queues-and-resources-with-kamon-range-samplers.md and even though it needs some polishing, I'm sure it will help you understand the logic behind range samplers.

Probably adding count would be a good thing, so that sum/count at least gives the average of "something" tracked by the range sampler, which happens to be very close to what summaries do in Prometheus.

Have you considered turning range samplers into a summary with q=0 for the min, q=1 for the max and maybe even a couple percentiles instead? I'm not saying that should be done, but it is an idea that crossed my mind as I'm reasoning about this PR and how it fits the data and Prometheus practices.
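To make the summary idea concrete, the sketch below renders quantile samples where q=0 is the minimum and q=1 is the maximum. It is only an illustration: it assumes a non-empty list of raw values and a simple nearest-rank quantile, and `nearest_rank` and `render_summary_quantiles` are hypothetical helpers, not part of any library.

```python
import math


def nearest_rank(sorted_values, q):
    """Nearest-rank quantile of a non-empty sorted list; q=0 gives min, q=1 gives max."""
    index = max(0, math.ceil(q * len(sorted_values)) - 1)
    return sorted_values[index]


def render_summary_quantiles(name, values, quantiles=(0, 0.5, 1)):
    """Render one summary-style sample line per requested quantile."""
    ordered = sorted(values)
    return [
        f'{name}{{quantile="{q}"}} {float(nearest_rank(ordered, q))}'
        for q in quantiles
    ]


# Made-up sample values for illustration.
lines = render_summary_quantiles("http_server_connection_open", [3, 0, 4, 1, 2])
print("\n".join(lines))
```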

@ivantopo
Copy link
Contributor

ivantopo commented May 5, 2021

And one more thing: it seems like this PR would report the same metric both as histogram and as a group of gauges. Wouldn't that mean that the _sum timeseries would be repeated because histograms already expose it?

@pnerg
Copy link
Contributor Author

pnerg commented May 6, 2021

There you go, my lack of understanding of statistical distributions and how to interpret them... 😊
The _sum part seems a bit misunderstood by me. But it will not be the same _sum as reported by the histogram as I create the gauge based on the snapshot, not the collected total.
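The difference between the two sums can be sketched as follows, assuming (as the comment suggests) that the histogram's _sum accumulates over the lifetime of the process while the gauge reflects only the latest snapshot. The per-interval sums below are made up for illustration.

```python
from itertools import accumulate

# Made-up per-snapshot sums for three consecutive reporting intervals.
snapshot_sums = [600.0, 450.0, 0.0]

# Histogram-style _sum: a running total that never decreases.
histogram_sum_series = list(accumulate(snapshot_sums))

# Snapshot-based gauge: just the value of the current interval.
gauge_series = list(snapshot_sums)

print(histogram_sum_series)  # [600.0, 1050.0, 1050.0]
print(gauge_series)          # [600.0, 450.0, 0.0]
```

Under that assumption, the gauge falls back to zero once traffic stops, whereas the cumulative series stays flat.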

Great thing that you're drafting a blog post describing the range samplers, us noobs really need that... 👍
To quote: "Here is where most people get lost. Reasoning about why a queue size metric is producing a histogram distribution instead of a single value does not come easy at first" ... yup, that's me.

Yes, I played around with summaries and histograms, and I get that histograms and range samplers are for statistical reporting over time, but I also need something to measure the here and now.
We have situations where traffic goes from very little (almost nothing) to bursts, so percentiles would be misleading; I can already guess they'd show that most of the time there's little utilisation.
This is often the case in test plants, where we want to see how well something performs under load.

Having read your excellent blog, I'm wondering if the PR makes sense if I remove the _sum gauge or perhaps replace it with an _avg (using the snapshot sum and count to create an avg). These three gauges (_min, _max, _avg), as they're based on the snapshot, would then give some, albeit rough, interpretation of the here-and-now status.
The raw data is still there in the histogram should one want to perform more fancy analytics.
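A minimal sketch of the proposed _min/_max/_avg derivation is shown below. `Snapshot` is a hypothetical stand-in for one reporting interval's distribution, not a Kamon type, and returning 0.0 for an empty snapshot is an assumption made here to avoid dividing by zero.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical summary of the values recorded in one reporting interval."""
    minimum: float
    maximum: float
    total: float   # sum of all recorded values
    count: int     # number of recorded values


def snapshot_gauges(snapshot):
    """Derive min, max and avg gauge values from a single snapshot."""
    # Assumption: an empty snapshot reports an average of 0.0.
    avg = snapshot.total / snapshot.count if snapshot.count else 0.0
    return {"min": snapshot.minimum, "max": snapshot.maximum, "avg": avg}


# min, max and sum mirror the sample output earlier in the thread; count is made up.
gauges = snapshot_gauges(Snapshot(minimum=0.0, maximum=4.0, total=600.0, count=300))
print(gauges)
```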

@SimunKaracic
Copy link
Contributor

Having read your excellent blog, I'm wondering if the PR makes sense if I remove the _sum gauge or perhaps replace it with an _avg (using the snapshot sum and count to create an avg).

Probably adding count would be a good thing, so that sum/count at least gives the average of "something" tracked by the range sampler, which happens to be very close to what summaries do in Prometheus.

So, I guess you implement that, and we're ready to merge :D

@pnerg pnerg force-pushed the distributions-as-gauges branch from a177b61 to 7d42c95 May 6, 2021 13:53
@pnerg
Copy link
Contributor Author

pnerg commented May 6, 2021

Updated the PR and replaced _sum with _avg. That makes more sense than publishing the sum and count, as they're difficult to interpret.
The combo of _max, _min and _avg gives a quick indicator of what's going on. Yes, one can easily miss spikes/bursts, as the average will not show them.
I see these gauges as a complement for situations/metrics where the developer deems they make sense.

@SimunKaracic
Copy link
Contributor

Done, merged, good job @pnerg!
I'll publish tomorrow or on Wednesday 🎉

@SimunKaracic SimunKaracic merged commit e668e82 into kamon-io:master May 10, 2021
@pnerg pnerg deleted the distributions-as-gauges branch May 10, 2021 12:55