Resolve events in pagerduty #44

Closed
camerondavison opened this issue Apr 9, 2015 · 10 comments

Comments
@camerondavison
Contributor

After the minimum refresh period, mark events as resolved in PagerDuty:
https://developer.pagerduty.com/documentation/integration/events/resolve

I feel like a recommended workflow would be to acknowledge the incident in PagerDuty, then fix the problem, and let the system resolve itself naturally.

This way, if I think I fixed it but it is not actually resolved within PagerDuty's "Incident Ack Timeout", I will get notified again.
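
A minimal sketch of what such a resolve call could look like against the linked Events API (this is not Alertmanager's actual notifier; serviceKey and incidentKey are placeholders):

```go
// Sketch only: a "resolve" event against the PagerDuty Events API v1
// ("generic events") endpoint referenced above. Field names follow that
// documentation; serviceKey and incidentKey are caller-supplied placeholders.
package pagerduty

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type event struct {
	ServiceKey  string `json:"service_key"`
	EventType   string `json:"event_type"`   // "trigger", "acknowledge" or "resolve"
	IncidentKey string `json:"incident_key"` // must match the key used when the incident was triggered
	Description string `json:"description,omitempty"`
}

// ResolveIncident tells PagerDuty that the incident identified by incidentKey is resolved.
func ResolveIncident(serviceKey, incidentKey string) error {
	body, err := json.Marshal(event{
		ServiceKey:  serviceKey,
		EventType:   "resolve",
		IncidentKey: incidentKey,
		Description: "Alert is no longer firing",
	})
	if err != nil {
		return err
	}
	resp, err := http.Post(
		"https://events.pagerduty.com/generic/2010-04-15/create_event.json",
		"application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("pagerduty resolve failed: %s", resp.Status)
	}
	return nil
}
```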

@matthiasr

I also agree that these are the correct semantics. It also matches the behavior of the Nagios/Icinga integration.

@grobie
Member

grobie commented Apr 9, 2015

👍

@camerondavison
Contributor Author

I think that in general what needs to be added is just an event queue of 'deleted' or 'resolved' events.
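
A hypothetical sketch of that queue idea, assuming made-up names rather than Alertmanager's real types: deleted or resolved alerts are enqueued, and a worker drains the queue and sends the resolve notifications.

```go
// Hypothetical sketch of such a queue; Alert, Notifier and their fields are
// illustrative names, not Alertmanager's actual types. Deleted or resolved
// alerts are enqueued, and a worker drains the queue and notifies the target.
package main

import (
	"fmt"
	"time"
)

type Alert struct {
	Fingerprint string    // identifies the alert, used as the incident key
	ResolvedAt  time.Time // when the alert was deleted or resolved
}

type Notifier struct {
	resolved chan Alert
}

func NewNotifier() *Notifier {
	n := &Notifier{resolved: make(chan Alert, 128)}
	go n.loop()
	return n
}

// Resolve enqueues an alert that was deleted or explicitly resolved.
func (n *Notifier) Resolve(a Alert) {
	n.resolved <- a
}

func (n *Notifier) loop() {
	for a := range n.resolved {
		// A real integration would call the PagerDuty "resolve" endpoint here.
		fmt.Printf("resolving incident %s (resolved at %s)\n", a.Fingerprint, a.ResolvedAt.Format(time.RFC3339))
	}
}

func main() {
	n := NewNotifier()
	n.Resolve(Alert{Fingerprint: "abc123", ResolvedAt: time.Now()})
	time.Sleep(100 * time.Millisecond) // demo only: give the worker time to drain
}
```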

@juliusv
Member

juliusv commented Apr 13, 2015

👍 To the general issue. When alerts disappear from alertmanager (whether due to not being refreshed or getting explicitly resolved from Prometheus), alertmanager should try to resolve any existing issues in PagerDuty.

@aecolley

If prometheus is down for 2 minutes, its alerts will disappear from alertmanager. It would be unfortunate if that short outage caused a notified problem to be marked resolved prematurely. Issues shouldn't be resolved on PagerDuty until (a) the alert origin has been heard from again, (b) the origin has completed its first evaluation following any restart and possible recovery, and (c) the alert is neither firing nor pending.
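
Those three conditions could be expressed roughly as the following predicate (all types and fields are hypothetical, not Alertmanager internals):

```go
// Sketch of those three conditions as a predicate; every type and field here
// is hypothetical, not an Alertmanager internal.
package resolve

import "time"

type AlertState int

const (
	StateInactive AlertState = iota // neither firing nor pending
	StatePending
	StateFiring
)

type OriginStatus struct {
	LastSeen            time.Time // when the alert's origin (a Prometheus server) last checked in
	CompletedEvaluation bool      // true once the first rule evaluation after a restart has finished
}

// shouldResolve returns true only if (a) the origin has been heard from again,
// (b) it has completed its first evaluation following any restart, and
// (c) the alert is neither firing nor pending.
func shouldResolve(origin OriginStatus, state AlertState, staleAfter time.Duration) bool {
	heardFrom := time.Since(origin.LastSeen) < staleAfter
	return heardFrom && origin.CompletedEvaluation && state == StateInactive
}
```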

@matthiasr

But if there is still a problem, Prometheus coming back would fire the alerts again, right?

I'm leaning towards resolving them … they've paged someone already anyway, so the risk is limited. On the other hand, we constantly have alerts in PD that the alerting "forgot" to actively clear for any number of reasons. We deal with them by reaping everything after 24h and never using the PD default of un-acknowledging, but I really dislike that state of affairs.

@matthiasr

But with explicit clear notifications from Prometheus -> Alertmanager, the timeout in Alertmanager could be much longer.

@juliusv
Member

juliusv commented Apr 24, 2015

Yeah, and what about Prometheus servers going away completely? At some point, there still needs to be a timeout for deleting stale alerts in alertmanager, methinks. But as @matthiasr said, that timeout could be much longer than it currently is once it is only used as a fallback and explicit Prometheus->AM "resolved" notifications are used as the primary mechanism for deleting inactive alerts from AM.
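
A rough sketch of that two-tier approach, using made-up names: explicit "resolved" notifications are the primary path, and a long stale timeout is only the fallback.

```go
// Rough sketch of the two-tier approach, with made-up names: explicit
// "resolved" notifications are the primary path, and a (much longer) stale
// timeout garbage-collects alerts whose Prometheus server disappeared for good.
package store

import (
	"sync"
	"time"
)

type storedAlert struct {
	lastUpdated time.Time
	resolved    bool
}

type AlertStore struct {
	mu     sync.Mutex
	alerts map[string]*storedAlert
}

func NewAlertStore() *AlertStore {
	return &AlertStore{alerts: make(map[string]*storedAlert)}
}

// refresh records that an alert was (re-)sent by a Prometheus server.
func (s *AlertStore) refresh(fingerprint string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if a, ok := s.alerts[fingerprint]; ok {
		a.lastUpdated = time.Now()
		return
	}
	s.alerts[fingerprint] = &storedAlert{lastUpdated: time.Now()}
}

// markResolved handles an explicit Prometheus -> Alertmanager "resolved" notification.
func (s *AlertStore) markResolved(fingerprint string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if a, ok := s.alerts[fingerprint]; ok {
		a.resolved = true
	}
}

// gc is the fallback: drop alerts that are resolved or have not been refreshed
// for a long time, e.g. because their Prometheus server went away entirely.
func (s *AlertStore) gc(staleTimeout time.Duration) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for fp, a := range s.alerts {
		if a.resolved || time.Since(a.lastUpdated) > staleTimeout {
			delete(s.alerts, fp)
		}
	}
}
```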

@discordianfish
Member

I've just submitted #51. It's a rather naive implementation, so it doesn't address the corner cases you discussed here (like Prometheus being down causing an invalid resolution of the issues, etc.), but it's a step in the right direction IMO.

@brian-brazil
Contributor

This appears to be implemented.

paulfantom pushed a commit to paulfantom/alertmanager that referenced this issue Jul 23, 2021
…penshift-4.8-golang-github-prometheus-alertmanager

Updating golang-github-prometheus-alertmanager builder & base images to be consistent with ART