cmd option "-rules.alert.resend-delay" is ignored #5048
Comments
That sounds like a (relatively predictable) race condition between the 30s rule evaluation interval and the resend interval. When the two are exactly the same, whether an alert is sent on two consecutive rule evaluations depends on the exact moment the resend delay is checked versus when the last resend was recorded. Prometheus might conclude that a resend happened 29.999s earlier (in the previous rule evaluation iteration) and therefore not send the alert again. If you depend on Prometheus sending alerts on every rule evaluation iteration, you'll indeed want to set the resend interval at least a couple of seconds lower than the evaluation interval.
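A minimal Go sketch of the kind of check described above (the function name and structure are illustrative only, not the actual Prometheus source):

```go
package main

import (
	"fmt"
	"time"
)

// needsResend is an illustrative stand-in for the check performed before
// re-sending a firing alert to Alertmanager: only send again once at least
// resendDelay has passed since the last recorded send.
func needsResend(lastSentAt, now time.Time, resendDelay time.Duration) bool {
	return lastSentAt.Add(resendDelay).Before(now)
}

func main() {
	evalInterval := 30 * time.Second
	resendDelay := 30 * time.Second // same as the evaluation interval

	lastSent := time.Now()

	// The next evaluation's send check runs a hair "early" relative to when the
	// previous send was recorded, so only ~29.999s appear to have elapsed:
	nextEval := lastSent.Add(evalInterval - time.Millisecond)
	fmt.Println(needsResend(lastSent, nextEval, resendDelay)) // false: skipped this round

	// The alert then only goes out on the evaluation after that, i.e. every ~60s.
	fmt.Println(needsResend(lastSent, nextEval.Add(evalInterval), resendDelay)) // true
}
```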
I thought the alert resend interval was decoupled from rule evaluation after Prometheus 2.4.0. Why is there a race condition, and can it be considered a bug?
@zuzzas Alerting rules are evaluated periodically, according to the configured rule evaluation interval, and any resulting alerts are sent to the Alertmanager right after evaluation. At that point Prometheus checks whether the previous send for the same alert was too recent (according to the configured resend delay) and, if it does re-send, records the new "last send" time. So alert re-sends only happen right after the alerting rule was evaluated, and are simply dropped if the resend delay hasn't elapsed yet. I think the intention of the resend delay flag is mainly for situations where you have a really short evaluation interval, like 5s (so you can detect broken things fast), but you don't need to re-notify Alertmanager about already-broken things every 5s; every minute or couple of minutes is sufficient.
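To illustrate that intended usage (the specific values here are only an example, not taken from this issue): evaluate rules every few seconds, but let Prometheus re-send still-firing alerts to Alertmanager at most once a minute.

```yaml
# prometheus.yml (excerpt) -- rules are evaluated, and alerts considered for sending, every 5s
global:
  evaluation_interval: 5s

# ...and start Prometheus with a longer resend delay, e.g.:
#   prometheus --rules.alert.resend-delay=1m
```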
@juliusv |
What's the actual problem this is causing for you? We should be resilient to a failed notification.
@brian-brazil Right now I'd like to get your opinion on the following:
I don't see anything that needs clarifying here. There's only a problem if the EndsAt doesn't allow enough slack to handle a failed resend.
Sounds like we should include the resend delay in the EndsAt calculation.
On Sat 26 Jan 2019, 09:51 Stefan Büringer ***@***.*** wrote:
We have a similar problem. We're using Prometheus together with Alertmanager, with resend-delay and evaluation-interval both at the default value of 1m. In our case this results in alerts being sent to Alertmanager every 2 minutes. Because the endsAt of the alerts is set to 3m (3x1m, if I understand the code correctly), if just one alert is dropped on the way from Prometheus to Alertmanager, the alert is resolved in Alertmanager. After another minute the alert goes back to active. I guess this wouldn't happen as frequently if we ran Prometheus in HA or if we had set resend-delay to something like 45s.
It's not a real problem, but in my opinion it's a bit surprising for the default configuration.
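A toy Go timeline of the situation described in that report, under the assumptions stated above (an effective send every 2 minutes because evaluation interval and resend delay are both 1m, and EndsAt set 3 minutes after each send); the structure and numbers are illustrative only:

```go
package main

import (
	"fmt"
	"time"
)

const (
	sendInterval = 2 * time.Minute // observed effective send interval (1m eval + 1m resend delay)
	endsAtOffset = 3 * time.Minute // EndsAt is set roughly 3x the 1m interval ahead of each send
)

func main() {
	start := time.Unix(0, 0).UTC()

	// Sends at t=0m, 2m, 4m, 6m; pretend the t=2m notification is lost in transit.
	dropped := map[time.Duration]bool{2 * time.Minute: true}

	var endsAt time.Time // latest EndsAt that Alertmanager has seen
	for offset := time.Duration(0); offset <= 6*time.Minute; offset += sendInterval {
		sentAt := start.Add(offset)
		if dropped[offset] {
			fmt.Printf("t=%v: notification dropped, Alertmanager still has EndsAt=t+%v\n",
				offset, endsAt.Sub(start))
			continue
		}
		endsAt = sentAt.Add(endsAtOffset)
		fmt.Printf("t=%v: notification received, EndsAt pushed to t+%v\n",
			offset, endsAt.Sub(start))
	}
	// With the t=2m send lost, the last EndsAt Alertmanager saw was t=3m, so it
	// marks the alert resolved at t=3m, and the alert flaps back to firing at t=4m.
}
```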
sbueringer commented Jan 26, 2019
@brian-brazil Sorry, I deleted the comment for a moment :). I had to find out whether that's really the problem in our case, because I found other errors in our Prometheus log. But I have now set the resend-interval to 45 seconds and I get an alert in Alertmanager every minute instead of every 2 minutes. I don't know if it's a good idea, but it might make sense to set the endsAt a little bit higher than 3x the interval (maybe a second or two more), so the alert is only resolved if you really lose 3 messages instead of 2. That might make sense because we're effectively waiting 3x the interval anyway before resolving the alert right now. Example:
This would resolve the alert if you're unlucky, I guess.
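A rough Go sketch of what including the resend delay in the EndsAt calculation could look like; the helper name, structure, and the multiplier are assumptions for illustration, not the actual Prometheus implementation:

```go
package main

import (
	"fmt"
	"time"
)

// validUntil sketches an EndsAt calculation that takes the resend delay into
// account: the alert stays valid for a few multiples of whichever is larger,
// the evaluation interval or the resend delay, so a single dropped
// notification no longer resolves it in Alertmanager.
func validUntil(sentAt time.Time, evalInterval, resendDelay time.Duration) time.Time {
	delta := evalInterval
	if resendDelay > delta {
		delta = resendDelay
	}
	// The multiplier is a free choice; 4 leaves room for a couple of missed
	// re-sends plus some slack before Alertmanager considers the alert resolved.
	return sentAt.Add(4 * delta)
}

func main() {
	now := time.Now()
	// With a 1m evaluation interval and a 1m resend delay, each successful send
	// keeps the alert valid for another 4 minutes.
	fmt.Println(validUntil(now, time.Minute, time.Minute).Sub(now)) // 4m0s
}
```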
zuzzas commented Dec 28, 2018 (edited)
Bug Report
What did you do?
I've set the --rules.alert.resend-delay=30s option.
What did you expect to see?
I expected to see alert POSTs going out at 30-second intervals.
What did you see instead? Under which circumstances?
Alerts were sent at 1-minute intervals instead of 30s. When I set it to 29s, they were sent at a 30s interval. I guess it's related to evaluation_interval, but I haven't been able to confirm that in the source code. Here are the relevant Wireshark captures with --rules.alert.resend-delay set to 30 and 29 seconds respectively:
Environment
System information:
Linux kube-a-1 4.10.0-42-generic #46~16.04.1-Ubuntu SMP Mon Dec 4 15:57:59 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Prometheus version:
Prometheus v2.6.0
Prometheus configuration file:
Everything else is a bit confidential.