
Counters collection question #1596

Closed
hryamzik opened this Issue Apr 25, 2016 · 17 comments


hryamzik commented Apr 25, 2016

I have some error counters that are only produced when the service crashes, so I cannot export them directly: by then my application is already dead or about to be restarted.

Putting the counters into node_exporter's .prom textfiles won't help, as they'll be zeroed on service start and the non-zero values may never be collected.
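
(For reference, this is roughly how the textfile approach looks; a minimal sketch assuming the Python prometheus_client library and a node_exporter whose textfile collector watches the directory used below. Metric and path names are illustrative.)

```python
# Sketch: write a counter to a node_exporter textfile.
# Assumes prometheus_client is installed and node_exporter runs with
# --collector.textfile.directory=/var/lib/node_exporter (illustrative path).
from prometheus_client import CollectorRegistry, Counter, write_to_textfile

registry = CollectorRegistry()
crash_errors = Counter('myapp_crash_errors_total',
                       'Errors recorded while the service was crashing',
                       registry=registry)

crash_errors.inc(3)  # illustrative value recorded during the crash handler

# Atomically writes the .prom file; node_exporter exposes it on its next scrape.
# The problem described above: after a restart this file is rewritten from zero,
# so a non-zero value may never be scraped.
write_to_textfile('/var/lib/node_exporter/myapp.prom', registry)
```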

I've tried the Pushgateway, and it doesn't work either: I push 1, then 0, and 0 is exported. A +n syntax doesn't seem to be supported.

So how am I supposed to deliver such counters?

I'm currently planning to move them to StatsD, which requires statsd plus statsd_exporter, or telegraf with a statsd input and a prometheus output.
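
(And this is roughly the Pushgateway attempt; a minimal sketch assuming the Python prometheus_client library and a Pushgateway on localhost:9091. Each push replaces the previously pushed value for the job, which is the behavior described above.)

```python
# Sketch: pushing to the Pushgateway replaces the metric value, it does not add to it.
# Assumes prometheus_client is installed and a Pushgateway listening on localhost:9091.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
errors = Gauge('myapp_crash_errors', 'Errors seen during the last crash',
               registry=registry)

errors.set(1)
push_to_gateway('localhost:9091', job='myapp', registry=registry)  # gateway now serves 1

errors.set(0)
push_to_gateway('localhost:9091', job='myapp', registry=registry)  # gateway now serves 0
```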


brian-brazil commented Apr 25, 2016

These are effectively event logs, so Prometheus isn't a good choice for them. I'd suggest looking at logging solutions such as the ELK stack, and maybe including a snapshot of /metrics in the logs along with other debug data right before you abort.



hryamzik commented Apr 25, 2016

I do have logs, but this is about counting errors and alerting on them. Analyzing logs is a heavy operation, and doing that just to alert on a crash looks like overkill.


brian-brazil commented Apr 25, 2016

If you're trying to alert on the crashes themselves, I'd suggest seeing if your cluster manager/supervisor provides metrics that may be of use.


hryamzik commented Apr 25, 2016

Already doing so. But I'm also catching panics, and the error labels can differ.

What I'm missing is a statsd-style +1 counter, but I'm not sure whether the Pushgateway really needs it, as it could be done with another tool.


brian-brazil commented Apr 25, 2016

This is not something you can really do with Prometheus; it's a system for monitoring mostly-working systems, not for tracking crashes.


hryamzik commented Apr 25, 2016

That's not only about crashes; counters from any batch script follow the same pattern.


brian-brazil commented Apr 25, 2016

You can use the pushgateway for service-level batch jobs. We don't have a solution for general event logging.
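
(For the batch-job case, the usual pattern is to push the job's own counters and a completion timestamp once per run, and let Prometheus scrape them from the Pushgateway afterwards. A minimal sketch, assuming the Python prometheus_client library and a Pushgateway on localhost:9091; metric names are illustrative.)

```python
# Sketch: a batch job pushing its results to the Pushgateway once per run.
# Assumes prometheus_client is installed and a Pushgateway on localhost:9091.
from prometheus_client import CollectorRegistry, Counter, Gauge, push_to_gateway

registry = CollectorRegistry()
records = Counter('mybatch_records_processed_total',
                  'Records processed by the last run', registry=registry)
last_success = Gauge('mybatch_last_success_timestamp_seconds',
                     'Unix time the job last completed successfully',
                     registry=registry)

records.inc(1234)                  # illustrative amount of work done by this run
last_success.set_to_current_time()

# One push at the end of the run; the values remain available for scraping
# even though the job itself has exited.
push_to_gateway('localhost:9091', job='mybatch', registry=registry)
```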


hryamzik commented Apr 25, 2016

Nope, I can't, because it faces the same issue: 1, 0, 1 pushed to the gateway results in 1, 0, 1. I have to reset the counters on start (to declare them on the first run and to avoid counting them down), so I immediately replace the value from the previous run with a new one (0).

Even more, 1 followed by 1 will not be treated as 2, so if Prometheus misses the zero value the counter will be broken.


beorn7 commented Apr 25, 2016

We have had this use case come up a number of times.

The Pushgateway is not a statsd-style aggregator, even if people often think it is or should be.

Adding up the pushed numbers would break idempotency: if pushes are lost or duplicated, your results are off. It just doesn't fit the Prometheus mindset of scraping at will, with no harm done if a scrape is lost. We thought about how to make it work, but every solution we came up with was brittle.

At the moment, we have no plans to add any kind of push aggregator to the Prometheus ecosystem, simply because we wouldn't know how to do it in a consistent way. That doesn't prevent third parties from creating an external solution that exposes Prometheus metrics.



hryamzik commented Apr 25, 2016

@beorn7 how about adding the + syntax to the Pushgateway protocol for these rare cases? I can make a PR for this.

In fact, the current implementation of the Pushgateway can lose counter increments. Let's say you have a counter at 3. Prometheus collects this value from the gateway, then you restart your pushing service and it reports 1, 2, 3, 4. Prometheus will now collect 4 and won't treat it as 7. The + option could give us a nice alternative to this behavior.


beorn7 commented Apr 26, 2016


What I tried to explain in my last mail is that this would be contrary to the Prometheus semantics and data model.

So it will not be implemented as part of the Pushgateway.


@hryamzik hryamzik closed this Apr 26, 2016


hryamzik commented Apr 26, 2016

@beorn7, @brian-brazil thanks for clarifying!


juliusv commented Apr 26, 2016

@hryamzik FWIW, if you do need some intermediary counter aggregator, there is the StatsD bridge: https://github.com/prometheus/statsd_exporter – it takes the StatsD protocol as input and outputs Prometheus metrics, more or less. We usually only recommend it as a transition solution, though.
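
(As a sketch of what that looks like from the application side, assuming the Python statsd package and a statsd_exporter receiving StatsD on localhost:9125, its default port; the metric name is illustrative.)

```python
# Sketch: statsd-style "+1" increments, aggregated by statsd_exporter into a
# Prometheus counter. Assumes the `statsd` Python package and a statsd_exporter
# listening for StatsD on localhost:9125 and exposing /metrics for Prometheus.
import statsd

client = statsd.StatsClient('localhost', 9125)

client.incr('myapp_crash_errors')     # +1
client.incr('myapp_crash_errors', 2)  # +2; the exporter keeps the running total
```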


hryamzik commented Apr 26, 2016

@juliusv that's a good point, but statsd_exporter doesn't have any persistence. The Pushgateway seems to be the smarter solution here.


juliusv commented Apr 26, 2016

@hryamzik Yeah, in Prometheus you wouldn't need persistence for that usually, since you don't care about a counter's absolute value, but only its rate of increase. And the rate()/increase()/... functions handle counter resets automatically.
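
(For example, with a hypothetical metric name, an alert or graph would usually be built on something like `rate(myapp_crash_errors_total[5m])` or `increase(myapp_crash_errors_total[1h])` rather than on the raw counter value, and a reset from, say, 4 back to 0 is detected and compensated for by those functions.)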


hryamzik commented Apr 26, 2016

@juliusv the issue is that the metric gets lost after a statsd_exporter restart; it doesn't reset to 0 or NaN, it just disappears. That's something Prometheus doesn't really like.

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 24, 2019
