
Consider adding a range variant of absent() #2882

Open
brian-brazil opened this Issue Jun 27, 2017 · 4 comments


brian-brazil commented Jun 27, 2017

Currently absent() implicitly looks back 5 minutes due to the old staleness handling; with the new staleness handling it looks only at the most recent sample, so alerts based on it are likely to be much more sensitive.
The general solution to this is to use a longer FOR clause, but for longer time periods this presumes that Prometheus was up during that whole time, as #422 isn't resolved.

Where I see this as relevant is alerting on batch jobs. To catch both the job not having run recently and the job never having run at all, you'd either need two alerts (with users often forgetting the absent alert) or combine them into one with something like time() - last_push_time > 3600 or absent(count_over_time(last_push_time[1h])). The problem with the latter is that the RHS won't return the labels of the range vector.
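
As a sketch of both options in rule form (the metric name last_push_time and the alert names are just placeholders for illustration):

  groups:
    - name: batch-job-alerts
      rules:
        # Option 1: two separate alerts; the absent() one is easy to forget.
        - alert: BatchJobStale
          expr: time() - last_push_time > 3600
        - alert: BatchJobNeverRan
          expr: absent(last_push_time)
        # Option 2: a single combined alert. The absent(count_over_time(...))
        # half won't carry the labels of the range vector.
        - alert: BatchJobNotRunning
          expr: >
            time() - last_push_time > 3600
            or absent(count_over_time(last_push_time[1h]))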

Thus I propose adding a range variant of absent(). I'm unsure about the name; absent_over_time is my first thought.

Arguments against this are that it adds yet another similarly named function (though no worse than what we already have), that this is doable with two alerts plus a FOR clause once #422 is fixed, and that what I propose is also possible, in a verbose way, using vector() and label_replace().
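
Roughly, the verbose workaround would look something like the following. This is only a sketch: the job label value is a placeholder, and absent_over_time() here is the proposed function, not something that exists today.

  # Proposed: returns 1 if there were no samples at all in the last hour.
  absent_over_time(last_push_time[1h])

  # Verbose emulation with vector() and label_replace(): build a synthetic
  # series carrying the label we care about, then drop it if any samples
  # were seen in the window.
  label_replace(vector(1), "job", "my_batch_job", "", "")
    unless on(job)
  count_over_time(last_push_time{job="my_batch_job"}[1h])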

Thoughts?

niravshah2705 commented Aug 28, 2018

This would help you:
https://niravshah2705-software-engineering.blogspot.com/2018/08/prometheus-monitoring.html

Recording rule:

  - record: stackdriver_pubsub:scraptime
    expr: timestamp(stackdriver_pubsub_topic_pubsub_googleapis_com_topic_send_request_count)

Alert rule:

  time() - max_over_time(stackdriver_pubsub:scraptime[5h]) > 3600
  or sum_over_time(stackdriver_pubsub_topic_pubsub_googleapis_com_topic_send_request_count[30m]) < 5
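
For context, a sketch of how these might be wired into a Prometheus rules file (the group names and the alert name are placeholders):

  groups:
    - name: stackdriver-pubsub-recording
      rules:
        - record: stackdriver_pubsub:scraptime
          expr: timestamp(stackdriver_pubsub_topic_pubsub_googleapis_com_topic_send_request_count)
    - name: stackdriver-pubsub-alerts
      rules:
        # Fires if no scrape timestamp was recorded in the last hour, or if
        # very few send requests were seen in the last 30 minutes.
        - alert: PubsubPushStaleOrLow
          expr: >
            time() - max_over_time(stackdriver_pubsub:scraptime[5h]) > 3600
            or sum_over_time(stackdriver_pubsub_topic_pubsub_googleapis_com_topic_send_request_count[30m]) < 5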

lilred commented Feb 5, 2019

This would be very handy right now. I'm trying to write a pre-deployment check for our alerting rules. The idea is that the metric selectors underlying an alerting expression should be defined. So, for example, if the alert expression is up{job="postgres"} == 0, the pre-deployment check should fail if absent(up{job="postgres"}) returns a result.

Unfortunately some of our metrics are flaky (particularly those from the HAProxy exporter), so it's not uncommon for a metric to be undefined for some time even though the monitored service is operating normally. If there were a way to do absent_over_time(up{job="postgres"}[1d]), that would be ideal for our purposes!
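
To make the intent concrete (absent_over_time() is the proposed function and doesn't exist yet, so the second expression is aspirational):

  # What we can write today: only the most recent sample is considered.
  absent(up{job="postgres"})

  # What we'd like to write: tolerates gaps shorter than a day.
  absent_over_time(up{job="postgres"}[1d])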

Do you know of any workaround that might do instead?

ekarak commented Apr 2, 2019

I once made an experimental implementation of absent_over_time in https://github.com/ekarak/prometheus/tree/feature/absent-over-time; the code is at ekarak@827c027.
However, the approach I took was a bit complex: I used an FFT to estimate the scrape frequency for each time series, then evaluated absent_over_time to 1 whenever a sample was not received near the next expected timestamp.
This allowed me to raise alerts if a critical time series disappeared, and BTW adding state to the exporter wasn't an option at that point.

brian-brazil commented Apr 2, 2019

This wouldn't be looking for gaps; it would be looking for samples being completely missing.
