The problem I have is that some of the nodes I monitor are quite unstable, that is, scraping fails every now and then (or even much more often than that) for anywhere from a few seconds (I have a scrape interval of 10s) up to 1.5 minutes.
What happens then is that any alerts that were firing stop doing so, because for Prometheus the node has basically disappeared and the alert's query no longer returns a result for it.
Once the flapping stops, the alert pops back up.
Since Prometheus wrongly interpreted the disappearance of the metrics used by the alert rule as "the alert condition no longer exists", it of course sends a fresh notification (like an email), even if the repeat interval hasn't been reached yet.
Now if that flapping happens every 10 minutes, things get quite noisy.
(Oh, and I cannot fix the flapping nodes: some because I just monitor them and they're otherwise out of my control, some because I actually fail to do so ^^ see [0].)
Asking the community, I was told about `keep_firing_for` (well, actually I already knew about it when writing my post, but was already not happy with it).
I think there are at least two kinds of flapping (copying and pasting from my Google Groups post):
For example, assume an alert that monitors the utilised disk space on the root fs, and fires whenever that's above 80%.
Type 1 Flapping:
The scraping of the metrics works all the time (i.e. `up` is always 1).
But IO happens that causes the 80% threshold to be exceeded and then fallen below again every now and then.
Type 2 Flapping:
Scrapes fail every now and then.
There is IO, but the utilisation is always above 80%; say it's already at ~90% all the time.
As I understand `keep_firing_for`, it will be (mostly[1]) perfect for type 1:
If my flapping is just that the threshold is exceeded and then no longer, I can set `keep_firing_for` to whatever I consider a reasonable period during which the alert condition is still considered to be ongoing rather than being a new one. Say 10 minutes.
If it then clears for 9 minutes but comes back shortly and then clears again... it will still count as one continuous alert. If it clears for 10 minutes (assuming, of course, that the evaluation interval fits), a new alert condition will be considered a fresh alert.
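For illustration, a rule using `keep_firing_for` this way might look as follows (a sketch with a hypothetical alert name; the expression assumes node_exporter metrics and is not taken from the original post):

```yaml
groups:
  - name: disk
    rules:
      - alert: RootFsAlmostFull   # hypothetical name, for illustration
        # fire when root fs usage exceeds 80%
        expr: |
          (1 - node_filesystem_avail_bytes{mountpoint="/"}
             / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 80
        # bridge short dips below the threshold (type 1 flapping)
        keep_firing_for: 10m
```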
Now what are the general downsides of `keep_firing_for` when using it for type 2 flapping (not for type 1; or at least for that one, there's no real way around these downsides)?
Alerts get artificially extended while they may actually have already cleared. And the longer one needs to set the period, the worse that gets.
One may actually miss true new alerts: in cases where the number of alerts is actually interesting, a 2nd alert could be missed as a new one, because the first (which has actually already cleared) still counts as ongoing due to `keep_firing_for`.
And if one adds a further use case, namely downtimes, the "flapping" cannot really be cured with `keep_firing_for` anymore.
Consider a reboot with firmware updates, which may take quite a while; the same would happen as in my Type 2 flapping scenario.
Scrapes would fail, the alert would clear, scrapes would come back, notifications would be resent.
And I think it's not reasonable to set `keep_firing_for` to longer than perhaps 1h or so.
Sure, one can manually silence the alerts ahead of a downtime, but why put that burden on the admin?
So what could be done better for Type 2 flapping and for downtimes?
Idea:
I think the above issues might be fixed by an alternative `keep_firing_for_with_considering_up` that takes into account whether the metrics (well, actually the targets) used by an alert query are all up or not.
If an alert for which `keep_firing_for_with_considering_up` is set is firing, and then suddenly the alert condition is gone... it would continue to fire as long as any of the queried targets/metrics are not up, and as long as the time span hasn't been exceeded (there should probably be an infinite value).
Only once the metrics come back up would the `keep_firing_for_with_considering_up` extension stop (though the alert may of course still fire, if it was still firing with the up metrics).
In simple words: a possibly infinite version of `keep_firing_for` that takes `up` into account, i.e. whether an alert likely just "vanished" temporarily because scraping failed.
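As a configuration sketch (entirely hypothetical: neither this field name nor an infinite value exists in Prometheus today, and the expression is just an example):

```yaml
- alert: RootFsAlmostFull   # hypothetical name
  expr: |
    node_filesystem_avail_bytes{mountpoint="/"}
      / node_filesystem_size_bytes{mountpoint="/"} < 0.2
  # proposed behaviour: keep firing while any target queried
  # by expr has up == 0, for an unbounded time
  keep_firing_for_with_considering_up: infinite   # hypothetical field and value
```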
I would think that the current `keep_firing_for` is still needed, especially for use cases like type 1 flapping, where my `keep_firing_for_with_considering_up` alone would fail if both types of flapping happen at the same time.
What might be tricky in implementing a `keep_firing_for_with_considering_up` is determining which metrics/targets need to be taken into account.
For a simple alert, like one that checks the free space on the root fs, things are probably clear:
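A sketch of such a simple rule (assuming node_exporter metric names; not the exact rule from the original post):

```yaml
- alert: RootFsAlmostFull   # hypothetical name
  # every metric in this query comes from a single target's node_exporter,
  # so the relevant series is unambiguous, e.g.
  # up{instance="foo.example.com:9100", job="node"}
  expr: |
    (1 - node_filesystem_avail_bytes{mountpoint="/"}
       / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 80
```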
If the alert fires for foo.example.com, then suddenly no longer fires, but the node_* metrics are down for foo.example.com, well then it's probably no longer firing just because of that node being down, so keep it artificially going.
But what if an alert mixes metrics from different instances?
Simply saying "if any of them is down, keep firing via `keep_firing_for_with_considering_up`" may be too simple.
Because there could be alerts like (pseudo code):
fire if: metric-on-node-1 = 1 and metric-on-node-2 = 1
where the alert should stop firing if the metric on node-1 becomes 0 (and that node is up), even if node-2 is currently down, because it's already clear that the overall condition can no longer produce a result.
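In Prometheus terms, such a cross-instance condition might look like this (hypothetical metric and instance names, for illustration only):

```yaml
- alert: BothConditionsHold   # hypothetical name
  # fires only while both sides return a result and equal 1
  expr: |
      my_metric{instance="node-1"} == 1
    and on()
      my_metric{instance="node-2"} == 1
```

Here, if `my_metric` on node-1 drops to 0 while node-2's scrapes fail, the `and` can no longer match anyway, so artificially extending the alert just because node-2 is down would be wrong.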
calestyo changed the title to "an alternative to keep_firing_for that takes the upness of metrics/targets into account" on May 13, 2024.
If the underlying problem is that metrics might not have arrived "on time" for a brief period, a rule evaluation offset could be a solution, at the trade-off of slightly slower detection.
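Assuming a Prometheus version that supports the rule-group `query_offset` field, such an offset might be configured like this (the group name, rule, and offset value are illustrative):

```yaml
groups:
  - name: disk
    # evaluate this group's rules against data from 1 minute ago,
    # tolerating samples that arrive late
    query_offset: 1m
    rules:
      - alert: RootFsAlmostFull   # hypothetical name
        expr: |
          (1 - node_filesystem_avail_bytes{mountpoint="/"}
             / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 80
```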
Thanks,
Chris.
[0] https://groups.google.com/g/prometheus-users/c/Ihh1Dk0puQw/m/cNaVrEjqAQAJ (at the very bottom)
[1] #14084