Prometheus endsAt timeout is too short #5277
Comments
That sounds like plenty of time to me. Can you share the timings you're seeing?
That's a separate system.
Thanks for your response. I have the logs from the AlertManager perspective. The clocks on both Prometheus and AlertManager are in sync, and the resend-delay on Prometheus is set to 1m. When the Prometheus configuration is reloaded, it skips one message to AlertManager, and this is enough for AlertManager to mark the alert as resolved. Prometheus is actually only 1s too slow in notifying AlertManager of the continued firing alert. The Alertmanager logs at debug level (ts and aggrGroup removed to reduce noise): […]
AlertManager incorrectly marks the alert as resolved because the endsAt period has passed. The alert itself remains firing on Prometheus the entire time. When I query AlertManager using the […]

The reload event is our scenario, but it's possible to foresee others, such as load spikes or connectivity issues, which occur in production environments, and it's exactly at times like that where you don't need incorrect information.

Regarding the […]

If the […]
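To make the timing concrete, here is a rough Go sketch of the arithmetic, using assumed round numbers (a 1m resend delay and evaluation interval, so endsAt is sent roughly three minutes ahead of each notification); the names and values are illustrative, not taken from the actual configuration or the Prometheus source:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Assumed round numbers for this scenario: a 1m resend delay and a 1m
	// evaluation interval, so each notification carries an endsAt roughly
	// three minutes in the future.
	const (
		resendDelay  = time.Minute
		endsAtWindow = 3 * resendDelay
	)

	for skipped := 0; skipped <= 2; skipped++ {
		// Gap between two notifications that actually reach AlertManager
		// when `skipped` resends are missed during a reload or restart.
		gap := time.Duration(skipped+1) * resendDelay
		fmt.Printf("skipped=%d gap=%v window exhausted=%v\n",
			skipped, gap, gap >= endsAtWindow)
	}
}
```

With one resend skipped the gap (about 2m) stays inside the 3m window, but with two skipped the next notification arrives at or just past the window, and any extra send latency is enough for AlertManager to mark the alert resolved.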
What's your evaluation interval, and how long is that group taking to evaluate?
It's not unexpected that alerts will flap on a restart; the persistence stuff is more for alerts with long `for` durations.
steve-exley referenced this issue on Feb 27, 2019: Firing -> Resolved -> Firing issue with prometheus + alertmanager #952 (Open)
Our global […] The max […]

We currently use 7 Prometheus instances, each monitoring thousands of targets, many with hundreds of metrics. The inventory is dynamic and bespoke, so there is no off-the-shelf service discovery. Instead, the `file_sd_config` files are regenerated frequently, and each Prometheus instance may reload its config numerous times per day. At any time there will be many alerts firing at the Prometheus level, although these are grouped, silenced and routed differently up the stack. If there is a possible race condition or edge case, chances are it will show up sooner or later in our environment.

Basically, the file-based configuration and distributed nature of Prometheus is perfect for our use case. Flapping alerts are not so good, because they will bother somebody who is not even aware that Prometheus has just restarted or had a brief glitch.
Looking more closely at your logs, Prometheus skipped two messages. The settings are only meant to be resilient to one. #5126 should help with that, which will be in 2.8.
steve-exley commented on Feb 27, 2019
Proposal
Improve the calculation used to generate the `endsAt` parameter sent to AlertManager.
We have encountered AlertManager closing and re-opening alerts on multiple occasions even though Prometheus itself does not close the alert.
Currently the `endsAt` parameter sent to AlertManager defaults to a short time frame, which results in AlertManager closing and then re-opening alerts if Prometheus is unable to update the alert within that time frame.

The `endsAt` parameter is set to 3x the greater of the `evaluation_interval` or `resend-delay` values. This results in a default maximum value 3 minutes in the future. In practice it is often less, due to the delay between evaluation and the message being sent to AlertManager.

This is fine under normal running operations. However, under normal running operations Prometheus will also send [resolved] messages to AlertManager, so the `endsAt` field adds little value there.

Under other operational processes, such as restarting Prometheus or reloading a large configuration, the process may start some time after the last message was sent to AlertManager and not complete in time for Prometheus to finish its evaluation cycle and send the next expected message to AlertManager.

Additionally, the `rules.alert.for-grace-period` flag was specifically introduced to allow persisting of alert state, so the `endsAt` field should account for this too.

If this proposal is agreed, it should be a fairly easy improvement.
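For reference, a minimal Go sketch of the calculation described in this proposal; the function and parameter names are illustrative, not the actual Prometheus source:

```go
package main

import (
	"fmt"
	"time"
)

// validUntil sketches the behaviour described above: endsAt is set to three
// times the larger of the evaluation interval and the resend delay, measured
// from the time the alert notification is sent. Illustrative only.
func validUntil(sentAt time.Time, evaluationInterval, resendDelay time.Duration) time.Time {
	delta := resendDelay
	if evaluationInterval > resendDelay {
		delta = evaluationInterval
	}
	return sentAt.Add(3 * delta)
}

func main() {
	now := time.Now()
	// With a 1m evaluation interval and a 1m resend delay (as in the setup
	// described above), endsAt lands only three minutes in the future.
	fmt.Println(validUntil(now, time.Minute, time.Minute).Sub(now)) // 3m0s
}
```

Lengthening this window, or taking `rules.alert.for-grace-period` into account as suggested above, would make a missed resend far less likely to cross it.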