
an alternative to keep_firing_for that takes the upness of metrics/targets into account #14085

calestyo opened this issue May 13, 2024

Proposal

Hey there.

(Spoiler: IMO the current keep_firing_for is not a proper solution for what I'm dealing with).

This originates from a discussion at: https://groups.google.com/g/prometheus-users/c/Ihh1Dk0puQw

The problem I have is that some of the nodes I monitor are quite unstable; that is, scraping fails every now and then (or even much more often than that) for anything from a few seconds (I have a scrape interval of 10s) up to 1.5 min.

What happens then is that any alerts that were firing stop doing so, as, for Prometheus, the node has basically disappeared and the alert's query no longer returns a result for it.
Once the flapping stops, the alert pops back up.

Since Prometheus wrongly interpreted the disappearance of the metrics used by the alert rule as "the alert condition no longer exists", it of course sends a fresh notification (like an email), even if the repeat interval hasn't been reached yet.

Now if that flapping happens every 10 mins, things get quite noisy.
(Oh, and I cannot fix the flapping nodes: some because I just monitor them and they're otherwise out of my control, some because I actually fail to do so ^^ see [0].)

When I asked the community, I was pointed at keep_firing_for (well, I actually already knew about it when writing my post, but was not happy with it).

I think there are at least two kinds of flapping (copying and pasting from my Google Groups post):

For example, assume an alert that monitors the utilised disk space on the root fs, and fires whenever that's above 80%.

  1. Type 1 Flapping:

    • The scraping of the metrics works all the time (i.e. up is 1 all the time).
    • But there is IO that causes the 80% threshold to be exceeded and then fallen back below every now and then.
  2. Type 2 Flapping:

    • Scrapes fail every now and then.
    • There is IO, but the utilisation is always above 80%, say it's at ~90% all the time.

As I understand keep_firing_for, it will be (mostly[1]) perfect for type 1:

  • If my flapping is just that the threshold is exceeded and then no longer exceeded, I can set keep_firing_for to whatever I consider a reasonable period during which the alert condition is still considered ongoing rather than being a new one. Say 10 mins (see the sketch below).
    If it then clears for 9 mins but comes back shortly and then clears again... it will still count as one continuous alert. If it clears for 10 mins (assuming, of course, the evaluation interval fits), a new alert condition will be considered a fresh alert.
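For illustration (not from the original thread; the alert name and the 10m value are just examples), such a rule could look roughly like this, reusing the disk-space expression from further below:

    groups:
      - name: disk
        rules:
          - alert: RootFilesystemAbove80Percent
            expr: '1 - node_filesystem_avail_bytes{mountpoint="/", fstype!="rootfs"} / node_filesystem_size_bytes  >  0.80'
            for:  4m
            # Dips below the threshold shorter than 10m are bridged and
            # count as one continuous alert; only after the condition has
            # been clear for 10m does the alert resolve.
            keep_firing_for: 10m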

Now, what are the general downsides of keep_firing_for when using it for type 2 flapping (not for type 1 - or at least for type 1 there's no real way around these downsides anyway)?

  • Alerts get artificially extended, while they may actually have already cleared. And the longer one needs to set the period, the worse that gets.
  • One may miss genuinely new alerts: in cases where the number of alert instances is actually interesting, a 2nd alert could be missed as a new one, because the first (which has in fact already cleared) is still kept ongoing by keep_firing_for.

And if one introduces a new use case, namely downtimes, the "flapping" cannot really be cured with keep_firing_for anymore.
Consider a reboot with firmware updates, which may take quite a while: the same would happen as in my Type 2 flapping scenario.
Scrapes would fail, alert would clear, scrapes would come back, notifications would be resent.
And I think it's not reasonable to set keep_firing_for longer than perhaps 1h or so.
Sure, one can manually silence the alerts ahead of a downtime, but why put that burden on the admin?

So what could be done better for Type 2 flapping and for downtimes?

Idea:

I think the above issues might be fixed by an alternative keep_firing_for_with_considering_up that takes into account whether the metrics (well, actually the targets) used by an alert query are all up or not.

If an alert for which keep_firing_for_with_considering_up is set is firing, and then suddenly the alert condition is gone... it would continue to fire as long as any of the queried targets/metrics are not up, and as long as the configured time span hasn't been exceeded (there should probably be an infinite value).

Only once the metrics come back up would the keep_firing_for_with_considering_up behaviour stop (though the alert may of course still fire, if the condition still holds with the metrics now being up).

In simple words: a possibly infinite version of keep_firing_for that takes up into account, i.e. whether an alert likely just "vanished" temporarily because the scraping failed.
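Purely as a sketch of what I mean (this option does not exist; the name is my proposal from above and the 2h value is just an example):

    - alert: RootFilesystemAbove80Percent
      expr: '1 - node_filesystem_avail_bytes{mountpoint="/", fstype!="rootfs"} / node_filesystem_size_bytes  >  0.80'
      for:  4m
      # Hypothetical option: after the condition disappears, keep the
      # alert firing as long as any target contributing to the query has
      # up == 0, for at most 2h (or some "infinite" value).
      keep_firing_for_with_considering_up: 2h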

I would think that the current keep_firing_for is still needed, especially for use cases like type 1 flapping, where my keep_firing_for_with_considering_up would fail if both types of flapping happen at the same time.

What might be tricky in implementing a keep_firing_for_with_considering_up is determining which metrics/targets need to be taken into account.
For a simple alert, like one that checks the free space on the root fs:

      expr: '1 - node_filesystem_avail_bytes{mountpoint="/", fstype!="rootfs"} / node_filesystem_size_bytes  >  0.80'
      for:  4m

things are probably clear:
If the alert fires for foo.example.com and then suddenly no longer fires, but the node_* metrics are down for foo.example.com, then it has probably only stopped firing because that target is down, so keep it artificially going.
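In PromQL terms, the condition that would keep the alert artificially firing is roughly the following (the job/instance labels are assumptions about my setup):

    # The target whose series the alert was based on is currently down:
    up{job="node", instance="foo.example.com:9100"} == 0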

But what if an alert mixes metrics from different instances?
Simply saying "if any of them is down, apply keep_firing_for_with_considering_up" may be too simple.
Because there could be alerts like (pseudo code):

fire if: metric-on-node-1 = 1  and  metric-on-node-2 = 1

where the alert should stop firing if the metric on node-1 becomes 0 (and that node is up), even if node-2 is currently down, because it's already clear that the overall condition can no longer produce a result.
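In real PromQL such a cross-instance alert might look roughly like this (metric and label names invented for illustration):

    # Fires only while the condition holds on both nodes; max() drops
    # the instance label so the two sides can be combined with "and".
    expr: >
      max(some_metric{instance="node-1:9100"}) == 1
      and
      max(some_metric{instance="node-2:9100"}) == 1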

Thanks,
Chris.

[0] https://groups.google.com/g/prometheus-users/c/Ihh1Dk0puQw/m/cNaVrEjqAQAJ
at the very bottom
[1] #14084

gotjosh (Member) commented May 28, 2024

Have you considered #14061?

If the underlying problem is that metrics might not have arrived "on time" for a brief period, a rule evaluation offset could be a solution, at the trade-off of slightly slower detection.
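If I understand that feature correctly, it would be configured per rule group, along these lines (the field name reflects my reading of that work and should be checked against the docs; the 1m delay is just an example):

    groups:
      - name: node-alerts
        # Evaluate the group's rules against data as of 1m ago, so that
        # samples from briefly delayed scrapes have time to arrive.
        query_offset: 1m
        rules:
          - alert: RootFilesystemAbove80Percent
            expr: '...'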
