(prometheus) Added the possibility to generate gauge metrics from distribution based metrics #1011

pnerg · 2021-04-29T11:35:12Z

This is hopefully the answer to several discussions (#899, #914) about why some measurements use a range sampler rather than a gauge.

This PR adds the possibility to configure, by name pattern, which distribution-based metrics (range sampler, timer, histogram) should also be rendered as gauge metrics.

E.g. configuring something like this:

kamon.prometheus.gauges.metrics = ["http.server.*"]

This would also render gauge metrics for all http-server metrics.
The existing histogram/summary metrics remain; these gauges are add-ons.
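
To illustrate how a name-pattern filter like this selects metrics, here is a minimal sketch. It assumes glob-style matching and approximates it with Python's `fnmatch`; the reporter's actual filter code is not shown in this thread, and the function name and the sample metric names other than `http.server.connection.open` are made up for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical stand-in for the list configured under
# kamon.prometheus.gauges.metrics.
GAUGE_PATTERNS = ["http.server.*"]


def also_render_as_gauges(metric_name, patterns=GAUGE_PATTERNS):
    """Return True when the metric name matches any configured pattern."""
    # Assumption: glob semantics where "*" also matches dots.
    return any(fnmatchcase(metric_name, pattern) for pattern in patterns)


# Only the first name comes from the PR description; the others are invented.
names = ["http.server.connection.open", "http.client.requests", "jvm.gc"]
selected = [name for name in names if also_render_as_gauges(name)]
```

Metrics that do not match any pattern are unaffected and keep only their histogram/summary output.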

For each distribution metric that matches the configured filter, three gauges are created:

min- representing the min value seen in the distribution of the snapshot
max- representing the max value seen in the distribution of the snapshot
sum- representing the sum of all values seen in the distribution of the snapshot. Which is relevant to understand how many of something (e.g. connections) there has been measured in the interval

E.g. running with akka-http and enabling gauge metrics according to the above configuration, I get something like this for the range sampler metric http.server.connection.open:

# HELP http_server_connection_open_min Number of open connections
# TYPE http_server_connection_open_min gauge
http_server_connection_open_min{port="9696",interface="0.0.0.0",component="akka.http.server"} 0.0
# HELP http_server_connection_open_max Number of open connections
# TYPE http_server_connection_open_max gauge
http_server_connection_open_max{port="9696",interface="0.0.0.0",component="akka.http.server"} 4.0
# HELP http_server_connection_open_sum Number of open connections
# TYPE http_server_connection_open_sum gauge
http_server_connection_open_sum{port="9696",interface="0.0.0.0",component="akka.http.server"} 600.0
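
As a rough illustration of how output in this shape can be produced, the sketch below renders one gauge in the Prometheus text format. The helper names are hypothetical, the dot-to-underscore conversion is inferred from the sample output only, and label-value escaping is left out for brevity.

```python
def prometheus_name(kamon_name, suffix):
    # Assumption: dots become underscores, as seen in the sample output.
    return kamon_name.replace(".", "_") + "_" + suffix


def render_gauge(kamon_name, suffix, description, tags, value):
    """Render HELP, TYPE and sample lines for a single gauge."""
    name = prometheus_name(kamon_name, suffix)
    # Dicts keep insertion order, so labels appear in the order given.
    labels = ",".join(f'{key}="{val}"' for key, val in tags.items())
    return "\n".join([
        f"# HELP {name} {description}",
        f"# TYPE {name} gauge",
        f"{name}{{{labels}}} {float(value)}",
    ])


tags = {"port": "9696", "interface": "0.0.0.0", "component": "akka.http.server"}
rendered = render_gauge(
    "http.server.connection.open", "max", "Number of open connections", tags, 4
)
```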

When I stop running requests, the gauges eventually drop to zero.

I think this is a non-intrusive way of providing more suitable metric formats for Prometheus for some of the use cases where one just wants an easy way to see what is going on in the app here and now.

SimunKaracic · 2021-05-03T13:12:46Z

As usual, great work!
@ivantopo will also take a look at this in the next few days, just to make sure everything is ok, and we'll merge after that

ivantopo · 2021-05-05T20:55:38Z

Hey @pnerg! Thanks again for the dedication and contributions man 😍

Regarding this comment:

sum- representing the sum of all values seen in the distribution of the snapshot. Which is relevant to understand how many of something (e.g. connections) there has been measured in the interval

The sum will not really be "how many of something" in this case, especially not with a range sampler. For example, if you were measuring the number of open connections to a database with a range sampler and there were exactly 10 open connections during an entire one-minute reporting interval, the reported sum would be 9000 (summing the value "10" three times every 200ms for the whole minute).
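
The arithmetic behind that 9000 can be spelled out as a small worked example. The numbers come from the comment above (three recorded values per sample, one sample every 200ms, a one-minute interval); the variable names are only for illustration.

```python
open_connections = 10          # constant value during the whole interval
sample_interval_ms = 200       # the range sampler samples every 200ms
reporting_interval_ms = 60_000 # one-minute reporting interval
values_per_sample = 3          # the value is recorded three times per sample

samples = reporting_interval_ms // sample_interval_ms  # 300 samples
recorded_count = samples * values_per_sample           # 900 recorded values
reported_sum = recorded_count * open_connections       # 9000

# Dividing the sum by the count recovers the steady value of 10.
average = reported_sum / recorded_count
```

So the sum grows with the number of samples taken, not with the number of connections that existed.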

I'm working on a blog post related to range samplers here: https://github.com/kamon-io/kamon.io/blob/c825fb60e0a3646804246496b4c0af38663408a8/_posts/2021-04-29-monitoring-queues-and-resources-with-kamon-range-samplers.md and even though it needs some polishing, I'm sure it will help you understand the logic behind range samplers.

Probably adding count would be a good thing, so that sum/count at least gives the average of "something" tracked by the range sampler, which happens to be very close to what summaries do in Prometheus.

Have you considered turning range samplers into a summary with q=0 for the min, q=1 for the max and maybe even a couple of percentiles instead? I'm not saying that's what should be done, but it is an idea that crossed my mind as I'm reasoning about this PR and how it fits the data and Prometheus practices.

ivantopo · 2021-05-05T20:59:51Z

And one more thing: it seems like this PR would report the same metric both as histogram and as a group of gauges. Wouldn't that mean that the _sum timeseries would be repeated because histograms already expose it?

pnerg · 2021-05-06T05:51:23Z

There you go, my lack of understanding of statistical distributions and how to interpret them... :blush:
I seem to have misunderstood the _sum part. But it will not be the same _sum as reported by the histogram, as I create the gauge from the snapshot, not the collected total.

Great thing that you're drafting a blog describing the range samplers, us noobs really need that... 👍
quote "Here is where most people get lost. Reasoning about why a queue size metric is producing a histogram distribution instead of a single value does not come easy at first" ... yup that's me

Yes, I played around with summaries and histograms, and I get that histograms and range samplers are for statistical reporting over time, but I also need something to measure the here and now.
We have situations where traffic goes from very little (almost nothing) to bursts, so percentiles would be misleading, or I can already guess they'd show that most of the time there's little utilisation.
This is often the case in test plants, where we want to see how well something performs under load.

Having read your excellent blog I'm wondering if the PR makes sense if I remove the _sum gauge or perhaps replace it with an _avg (using the snapshot sum and count to create an avg). These three gauges (_min, _max, _avg), as they're based on the snapshot, would then give some, albeit rough, interpretation of the here and now status.
The raw data is still there in the histogram should one want to perform more fancy analytics.
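
A minimal sketch of that idea, assuming a snapshot exposes its min, max, sum and count. The `Snapshot` class and its field names are hypothetical stand-ins, not the reporter's real types, and returning 0.0 for an empty snapshot is an assumption chosen to match the gauges dropping to zero when traffic stops.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical view of one reporting interval's distribution."""
    min_value: float
    max_value: float
    total: float  # sum of all recorded values in the interval
    count: int    # number of recorded values in the interval


def snapshot_gauges(snapshot):
    """Derive the _min, _max and _avg gauge values from a snapshot."""
    # Assumption: an empty snapshot yields 0.0 rather than failing.
    avg = snapshot.total / snapshot.count if snapshot.count > 0 else 0.0
    return {"min": snapshot.min_value, "max": snapshot.max_value, "avg": avg}


busy = snapshot_gauges(Snapshot(min_value=0.0, max_value=4.0, total=600.0, count=300))
idle = snapshot_gauges(Snapshot(min_value=0.0, max_value=0.0, total=0.0, count=0))
```

Because each snapshot covers only the latest interval, the three values reset with every report instead of accumulating the way the histogram's totals do.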

SimunKaracic · 2021-05-06T12:40:03Z

Having read your excellent blog I'm wondering if the PR makes sense if I remove the _sum gauge or perhaps replace it with an _avg (using the snapshot sum and count to create an avg).

Probably adding count would be a good thing, so that sum/count at least gives the average of "something" tracked by the range sampler, which happens to be very close to what summaries do in Prometheus.

So, I guess you implement that, and we're ready to merge :D

pnerg · 2021-05-06T13:56:53Z

Updated the PR and replaced _sum with _avg. That makes more sense than publishing the sum and count, as they're difficult to interpret.
The combo of _max, _min and _avg gives a quick indication of what's going on. Yes, one can easily miss spikes/bursts, as the average will not show them.
I see these gauges as a complement for situations/metrics where the developer deems they make sense.

SimunKaracic · 2021-05-10T12:31:26Z

Done, merged, good job @pnerg !
I'll publish tomorrow or on Wednesday 🎉

Commit 7d42c95: added the possibility to generate gauge metrics from distribution based metrics

pnerg force-pushed the distributions-as-gauges branch from a177b61 to 7d42c95 Compare May 6, 2021 13:53

SimunKaracic merged commit e668e82 into kamon-io:master May 10, 2021

pnerg deleted the distributions-as-gauges branch May 10, 2021 12:55

pnerg mentioned this pull request May 11, 2021

Figure out how to turn certain metrics as gauges for the prometheus reporter #687

Closed
