
Prometheus endsAt timeout is too short #5277

Open · steve-exley opened this issue Feb 27, 2019 · 5 comments

steve-exley commented Feb 27, 2019

Proposal

Improve the calculation used to generate the endsAt parameter sent to AlertManager?

We have encountered AlertManager closing and re-opening alerts on multiple occasions even though Prometheus itself does not close the alert.

Currently the endsAt parameter sent to AlertManager defaults to a short time frame, which results in AlertManager closing and then re-opening alerts if Prometheus is unable to refresh the alert within that window.

The endsAt parameter is set to 3x the greater of the evaluation_interval or resend-delay values. This results in a default maximum value of 3 minutes in the future. In practice it is often less, due to the delay between evaluation and the message being sent to AlertManager.
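
For illustration only, here is a minimal sketch of the calculation as I understand it (the function and variable names are made up for this issue, not taken from the Prometheus source):

```go
package main

import (
	"fmt"
	"time"
)

// currentEndsAt sketches my understanding of the current behaviour:
// endsAt is set 3x the greater of the evaluation interval and the
// resend delay beyond the evaluation time. Names are illustrative.
func currentEndsAt(evalTime time.Time, evalInterval, resendDelay time.Duration) time.Time {
	delta := evalInterval
	if resendDelay > delta {
		delta = resendDelay
	}
	return evalTime.Add(3 * delta)
}

func main() {
	// With a 15s evaluation interval and the default 1m resend delay,
	// endsAt lands only 3 minutes after the evaluation time.
	fmt.Println(currentEndsAt(time.Now(), 15*time.Second, time.Minute))
}
```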

This is fine under normal operation, but under normal operation Prometheus will also send [resolved] messages to AlertManager, so the endsAt field adds little value.

Other operational processes, such as restarting Prometheus or reloading a large configuration, may start some time after the last message was sent to AlertManager and not complete in time for Prometheus to finish its evaluation cycle and send the next expected message to AlertManager.

Additionally, the 'rules.alert.for-grace-period' flag was specifically introduced to allow alert state to be persisted, so the endsAt field should account for this too.

If this proposal is agreed, it should be a fairly easy improvement.

brian-brazil commented Feb 27, 2019

> This results in a default maximum value of 3 minutes in the future. In practice it is often less, due to the delay between evaluation and the message being sent to AlertManager.

That sounds like plenty of time to me. Can you share the timings you're seeing?

> Additionally, the 'rules.alert.for-grace-period' flag was specifically introduced to allow alert state to be persisted, so the endsAt field should account for this too.

That's a separate system.

steve-exley commented Feb 27, 2019

Thanks for your response. I have the logs from the AlertManager perspective.

Both the Prometheus and AlertManager clocks are in sync, and the resend-delay on Prometheus is set to 1m.

When the Prometheus configuration is reloaded, it skips one message to AlertManager, and this is enough for AlertManager to mark the alert as resolved. Prometheus is actually only 1s too slow in notifying AlertManager that the alert is still firing.

The Alertmanager logs at debug level (ts and aggrGroup removed to reduce noise):

Feb 25 19:55:56 caller=dispatch.go:444 component=dispatcher msg=Flushing alerts=[LinkDown[48042ac][active]]
Feb 25 19:57:29 caller=dispatch.go:201 component=dispatcher msg="Received alert" alert=LinkDown[48042ac][active]
Feb 25 19:58:56 caller=dispatch.go:201 component=dispatcher msg="Received alert" alert=LinkDown[48042ac][active]
Feb 25 19:58:56 caller=dispatch.go:444 component=dispatcher msg=Flushing alerts=[LinkDown[48042ac][active]]
Feb 25 19:58:56 caller=dispatch.go:444 component=dispatcher msg=Flushing alerts=[LinkDown[48042ac][active]]
Feb 25 20:01:56 caller=dispatch.go:444 component=dispatcher msg=Flushing alerts=[LinkDown[48042ac][resolved]]
Feb 25 20:01:56 caller=dispatch.go:444 component=dispatcher msg=Flushing alerts=[LinkDown[48042ac][resolved]]
Feb 25 20:01:57 caller=dispatch.go:201 component=dispatcher msg="Received alert" alert=LinkDown[48042ac][active]
Feb 25 20:03:26 caller=dispatch.go:201 component=dispatcher msg="Received alert" alert=LinkDown[48042ac][active]
Feb 25 20:04:56 caller=dispatch.go:201 component=dispatcher msg="Received alert" alert=LinkDown[48042ac][active]
Feb 25 20:04:56 caller=dispatch.go:444 component=dispatcher msg=Flushing alerts=[LinkDown[48042ac][active]]
Feb 25 20:04:56 caller=dispatch.go:444 component=dispatcher msg=Flushing alerts=[LinkDown[48042ac][active]]

AlertManager incorrectly marks the alert as resolved because the endsAt time has passed. The alert itself remains firing on Prometheus the entire time.

When I query AlertManager using the /api/v1/alerts endpoint, the endsAt value is rarely more than 2 minutes in the future even though the resend-delay is 1m. I assume this may be due to internal delays between the evaluation cycle and re-sending alerts.
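
For reference, a rough sketch of how I check those endsAt values via that endpoint (the address is a placeholder and only the fields I care about are decoded):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// alertsResponse decodes only the parts of the /api/v1/alerts payload
// needed to inspect endsAt.
type alertsResponse struct {
	Status string `json:"status"`
	Data   []struct {
		Labels map[string]string `json:"labels"`
		EndsAt time.Time         `json:"endsAt"`
	} `json:"data"`
}

func main() {
	// Placeholder address; point this at your Alertmanager.
	resp, err := http.Get("http://alertmanager:9093/api/v1/alerts")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var ar alertsResponse
	if err := json.NewDecoder(resp.Body).Decode(&ar); err != nil {
		panic(err)
	}
	for _, a := range ar.Data {
		// Print how far in the future each alert's endsAt currently is.
		fmt.Printf("%s: endsAt in %s\n", a.Labels["alertname"], time.Until(a.EndsAt).Round(time.Second))
	}
}
```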

The reload event is our scenario, but it's easy to foresee others, such as load spikes or connectivity issues in production environments, and it's exactly at times like these that you don't want incorrect information.

Regarding rules.alert.for-grace-period, the reason I mention it is that it is undermined by the endsAt value being set to a shorter time period:

  1. Prometheus is firing an alert with a for value of 5 minutes. It sends firing messages with an endsAt 3 minutes in the future.
  2. Prometheus is restarted for some reason. The grace period is 10 minutes, and alerts with a for value of less than 10 minutes are not automatically restored; they must wait until the 5-minute for period expires again.
  3. Any alerts with a for value greater than 10 minutes are recalculated on the next evaluation and put straight back into the firing state. AlertManager keeps those alerts in the firing state.
  4. After 3 minutes, the endsAt on AlertManager passes and resolved notifications are sent.
  5. After 5 minutes, Prometheus puts the alert back into the firing state and more notifications are sent.

If the endsAt value were greater than the current time plus rules.alert.for-grace-period, then all firing alerts on AlertManager would correctly remain in the firing state. This seems more desirable than the current behavior and would also mitigate the specific issue we are having.
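
To make the proposal concrete, here is a sketch of the suggested calculation (again with illustrative names, and assuming the grace period is available wherever endsAt is computed):

```go
package main

import (
	"fmt"
	"time"
)

// proposedEndsAt sketches the suggested change: keep the existing
// 3x max(interval, resendDelay) window, but never let endsAt fall
// closer than the for-grace-period, so a restart within the grace
// period cannot cause AlertManager to resolve the alert.
// Names are illustrative, not the actual Prometheus source.
func proposedEndsAt(evalTime time.Time, evalInterval, resendDelay, forGracePeriod time.Duration) time.Time {
	delta := evalInterval
	if resendDelay > delta {
		delta = resendDelay
	}
	window := 3 * delta
	if forGracePeriod > window {
		window = forGracePeriod
	}
	return evalTime.Add(window)
}

func main() {
	// With the 10m grace period from the scenario above, this gives an
	// endsAt 10 minutes out instead of 3, covering the restart case.
	fmt.Println(proposedEndsAt(time.Now(), 15*time.Second, time.Minute, 10*time.Minute))
}
```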

brian-brazil commented Feb 27, 2019

What's your evaluation interval, and how long is that group taking to evaluate?

> This seems more desirable than the current behavior and would also mitigate the specific issue we are having.

It's not unexpected that alerts will flap on a restart; the persistence stuff is more for alerts with long for durations.

steve-exley commented Feb 28, 2019

Our global evaluation_interval is 15s; some rule groups use an interval of 30s.

The max prometheus_rule_group_last_duration_seconds for all rule groups is consistently under 1s.

We currently run 7 Prometheus instances, each monitoring thousands of targets, many with hundreds of metrics. The inventory is dynamic and bespoke, so there is no off-the-shelf service discovery. Instead, the file_sd_config files are regenerated frequently and each Prometheus instance may reload its configuration numerous times per day. At any time there will be many alerts firing at the Prometheus level, although these are grouped, silenced and routed differently up the stack.

If there's a possible race condition or edge case, chances are it will show up sooner or later in our environment.

Basically, the file-based configuration and distributed nature of Prometheus are perfect for our use case. Flapping alerts are not so good, because they will bother somebody who is not even aware that Prometheus has just restarted or had a brief glitch.

brian-brazil commented Feb 28, 2019

Looking more closely at your logs, Prometheus skipped two messages, and the settings are only meant to be resilient to one. #5126, which will be in 2.8, should help with that.
