Resolve events in PagerDuty #44
Comments
I also agree that this is the correct semantics. It also matches the behavior of the Nagios/Icinga integration.
👍
I think that in general what needs to be added is just an event queue of 'deleted' or 'resolved' events.
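A minimal sketch of that queue idea, assuming a channel-based hand-off between alert expiry and the notifier (all names here are hypothetical, not Alertmanager internals):

```go
package main

import "fmt"

// ResolvedEvent is a hypothetical record queued when an alert is
// deleted or explicitly resolved.
type ResolvedEvent struct {
	IncidentKey string // dedup key originally sent to PagerDuty
}

func main() {
	// A buffered channel decouples alert expiry from outbound notification.
	resolvedQueue := make(chan ResolvedEvent, 1024)

	// Producer side: alert expiry or explicit resolution pushes an event.
	resolvedQueue <- ResolvedEvent{IncidentKey: "alert-1234"}
	close(resolvedQueue)

	// Consumer side: a notifier drains the queue and sends one
	// "resolve" per incident key.
	for ev := range resolvedQueue {
		fmt.Println("would resolve PagerDuty incident", ev.IncidentKey)
	}
}
```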
👍 To the general issue. When alerts disappear from alertmanager (whether due to not being refreshed or getting explicitly resolved from Prometheus), alertmanager should try to resolve any existing issues in PagerDuty.
If Prometheus is down for 2 minutes, its alerts will disappear from alertmanager. It would be unfortunate if that short outage caused a notified problem to be marked resolved prematurely. Issues shouldn't be resolved on PagerDuty until (a) the alert origin has been heard from again, (b) the origin has completed its first evaluation following any restart and possible recovery, and (c) the alert is neither firing nor pending.
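For illustration, the three conditions could be expressed as a single predicate. This is only a hedged sketch with hypothetical types, not Alertmanager code:

```go
package main

import "fmt"

// AlertState and the fields below are hypothetical stand-ins for
// whatever Alertmanager would track per alert and per source.
type AlertState int

const (
	Inactive AlertState = iota
	Pending
	Firing
)

type Alert struct{ State AlertState }

type Source struct {
	HeardFromAgain     bool // (a) origin has reported in since the gap
	FirstEvalCompleted bool // (b) first evaluation after a restart is done
}

// shouldResolve returns true only when all three conditions hold.
func shouldResolve(a Alert, s Source) bool {
	return s.HeardFromAgain &&
		s.FirstEvalCompleted &&
		a.State != Firing && a.State != Pending // (c) alert inactive
}

func main() {
	a := Alert{State: Inactive}
	s := Source{HeardFromAgain: true, FirstEvalCompleted: true}
	fmt.Println("resolve?", shouldResolve(a, s)) // prints: resolve? true
}
```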
But if there is still a problem, Prometheus coming back would fire them again, right? I'm leaning towards resolving them … they've already paged someone anyway, so the risk is limited. On the other hand, we constantly have alerts in PD that the alerting system "forgot" to actively clear for any number of reasons. We deal with them by reaping everything after 24h and never using the PD default of un-acknowledging, but I really dislike that state of affairs.
But with explicit clear notifications Prometheus -> Alertmanager, the timeout could be much longer in Alertmanager.
Yeah, and what about Prometheus servers going away completely? At some point, there still needs to be a timeout for deleting stale alerts in alertmanager, methinks. But as @matthiasr said, that timeout could be much longer than it currently is once it is only used as a fallback and explicit Prometheus->AM "resolved" notifications are used as the primary mechanism for deleting inactive alerts from AM.
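A sketch of that fallback: a long reaper interval sitting behind the explicit resolve path. Again, every name here is hypothetical:

```go
package main

import (
	"fmt"
	"time"
)

// staleAlert is a hypothetical stored alert with its last refresh time.
type staleAlert struct {
	Name          string
	LastRefreshed time.Time
}

// reap drops alerts not refreshed within staleAfter. With explicit
// Prometheus->AM "resolved" notifications as the primary mechanism,
// staleAfter can be much longer than the current refresh-based timeout.
func reap(alerts []staleAlert, staleAfter time.Duration, now time.Time) []staleAlert {
	kept := alerts[:0]
	for _, a := range alerts {
		if now.Sub(a.LastRefreshed) > staleAfter {
			fmt.Println("reaping stale alert:", a.Name)
			continue // deletion would feed the resolved-event queue
		}
		kept = append(kept, a)
	}
	return kept
}

func main() {
	now := time.Now()
	alerts := []staleAlert{
		{Name: "fresh", LastRefreshed: now.Add(-time.Minute)},
		{Name: "orphaned", LastRefreshed: now.Add(-48 * time.Hour)},
	}
	alerts = reap(alerts, 24*time.Hour, now)
	fmt.Println("remaining:", len(alerts)) // remaining: 1
}
```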
I've just submitted #51. It's a rather naive implementation, so it doesn't address the corner cases you discussed here (like Prometheus being down causing an invalid resolution of the issues, etc.), but it's a step in the right direction, imo.
This appears to be implemented. |
After the minimum refresh period, mark events as resolved in PagerDuty.
https://developer.pagerduty.com/documentation/integration/events/resolve
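For context, a hedged sketch of what such a resolve call could look like, assuming the generic v1 events endpoint and placeholder keys:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// resolveIncident posts a "resolve" event for an incident key that was
// previously used to trigger the incident. Keys here are placeholders.
func resolveIncident(serviceKey, incidentKey string) error {
	payload := map[string]string{
		"service_key":  serviceKey,
		"event_type":   "resolve",
		"incident_key": incidentKey, // must match the key sent on trigger
		"description":  "resolved by alertmanager",
	}
	body, err := json.Marshal(payload)
	if err != nil {
		return err
	}
	resp, err := http.Post(
		"https://events.pagerduty.com/generic/2010-04-15/create_event.json",
		"application/json",
		bytes.NewReader(body),
	)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("pagerduty resolve failed: %s", resp.Status)
	}
	return nil
}

func main() {
	if err := resolveIncident("SERVICE_KEY", "alert-1234"); err != nil {
		fmt.Println("error:", err)
	}
}
```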
I feel like a recommended workflow would be to acknowledge the incident in PagerDuty, then fix the problem, and let the system resolve itself naturally.
That way, if I think I fixed it but the incident is not actually resolved within PagerDuty's "Incident Ack Timeout", I will get notified again.