Resolve events in PagerDuty #44
Comments
I also agree that this is the correct semantics. It also matches the behavior of the Nagios/Icinga integration.
👍
I think that in general what needs to be added is just an event queue of 'deleted' or 'resolved' events.
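A minimal sketch of that queue idea, assuming a channel-based hand-off between alert expiry and the notifier (all names here are hypothetical, not Alertmanager internals):

```go
package main

import "fmt"

// ResolvedEvent is a hypothetical record queued when an alert is
// deleted or explicitly resolved.
type ResolvedEvent struct {
	IncidentKey string // dedup key originally sent to PagerDuty
}

func main() {
	// A buffered channel decouples alert expiry from outbound notification.
	resolvedQueue := make(chan ResolvedEvent, 1024)

	// Producer side: alert expiry or explicit resolution pushes an event.
	resolvedQueue <- ResolvedEvent{IncidentKey: "alert-1234"}
	close(resolvedQueue)

	// Consumer side: a notifier drains the queue and sends one
	// "resolve" per incident key.
	for ev := range resolvedQueue {
		fmt.Println("would resolve PagerDuty incident", ev.IncidentKey)
	}
}
```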
👍 To the general issue. When alerts disappear from alertmanager (whether due to not being refreshed or getting explicitly resolved from Prometheus), alertmanager should try to resolve any existing issues in PagerDuty.
If Prometheus is down for 2 minutes, its alerts will disappear from alertmanager. It would be unfortunate if that short outage caused a notified problem to be marked resolved prematurely. Issues shouldn't be resolved on PagerDuty until (a) the alert origin has been heard from again, (b) the origin has completed its first evaluation following any restart and possible recovery, and (c) the alert is neither firing nor pending.
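For illustration, the three conditions could be expressed as a single predicate. This is only a hedged sketch with hypothetical types, not Alertmanager code:

```go
package main

import "fmt"

// AlertState and the fields below are hypothetical stand-ins for
// whatever Alertmanager would track per alert and per source.
type AlertState int

const (
	Inactive AlertState = iota
	Pending
	Firing
)

type Alert struct{ State AlertState }

type Source struct {
	HeardFromAgain     bool // (a) origin has reported in since the gap
	FirstEvalCompleted bool // (b) first evaluation after a restart is done
}

// shouldResolve returns true only when all three conditions hold.
func shouldResolve(a Alert, s Source) bool {
	return s.HeardFromAgain &&
		s.FirstEvalCompleted &&
		a.State != Firing && a.State != Pending // (c) alert inactive
}

func main() {
	a := Alert{State: Inactive}
	s := Source{HeardFromAgain: true, FirstEvalCompleted: true}
	fmt.Println("resolve?", shouldResolve(a, s)) // prints: resolve? true
}
```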
But if there is still a problem, Prometheus coming back would fire them again, right? I'm leaning towards resolving them … they've already paged someone anyway, so the risk is limited. On the other hand, we constantly have alerts in PD that the alerting system "forgot" to actively clear for any number of reasons. We deal with them by reaping everything after 24h and never using the PD default of un-acknowledging, but I really dislike that state of affairs.
But with explicit clear notifications Prometheus -> Alertmanager, the timeout could be much longer in Alertmanager.
Yeah, and what about Prometheus servers going away completely? At some point, there still needs to be a timeout for deleting stale alerts in alertmanager, methinks. But as @matthiasr said, that timeout could be much longer than it currently is once it is only used as a fallback and explicit Prometheus->AM "resolved" notifications are used as the primary mechanism for deleting inactive alerts from AM.
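A sketch of that fallback: a long reaper interval sitting behind the explicit resolve path. Again, every name here is hypothetical:

```go
package main

import (
	"fmt"
	"time"
)

// staleAlert is a hypothetical stored alert with its last refresh time.
type staleAlert struct {
	Name          string
	LastRefreshed time.Time
}

// reap drops alerts not refreshed within staleAfter. With explicit
// Prometheus->AM "resolved" notifications as the primary mechanism,
// staleAfter can be much longer than the current refresh-based timeout.
func reap(alerts []staleAlert, staleAfter time.Duration, now time.Time) []staleAlert {
	kept := alerts[:0]
	for _, a := range alerts {
		if now.Sub(a.LastRefreshed) > staleAfter {
			fmt.Println("reaping stale alert:", a.Name)
			continue // deletion would feed the resolved-event queue
		}
		kept = append(kept, a)
	}
	return kept
}

func main() {
	now := time.Now()
	alerts := []staleAlert{
		{Name: "fresh", LastRefreshed: now.Add(-time.Minute)},
		{Name: "orphaned", LastRefreshed: now.Add(-48 * time.Hour)},
	}
	alerts = reap(alerts, 24*time.Hour, now)
	fmt.Println("remaining:", len(alerts)) // remaining: 1
}
```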
I've just submitted #51. It's a rather naive implementation, so it doesn't address the corner cases you discussed here (like Prometheus being down causing an invalid resolution of the issues, etc.), but it's a step in the right direction, imo.
This appears to be implemented. |
After the minimum refresh period, mark events as resolved in PagerDuty.
https://developer.pagerduty.com/documentation/integration/events/resolve
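For context, a hedged sketch of what such a resolve call could look like, assuming the generic v1 events endpoint and placeholder keys:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// resolveIncident posts a "resolve" event for an incident key that was
// previously used to trigger the incident. Keys here are placeholders.
func resolveIncident(serviceKey, incidentKey string) error {
	payload := map[string]string{
		"service_key":  serviceKey,
		"event_type":   "resolve",
		"incident_key": incidentKey, // must match the key sent on trigger
		"description":  "resolved by alertmanager",
	}
	body, err := json.Marshal(payload)
	if err != nil {
		return err
	}
	resp, err := http.Post(
		"https://events.pagerduty.com/generic/2010-04-15/create_event.json",
		"application/json",
		bytes.NewReader(body),
	)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("pagerduty resolve failed: %s", resp.Status)
	}
	return nil
}

func main() {
	if err := resolveIncident("SERVICE_KEY", "alert-1234"); err != nil {
		fmt.Println("error:", err)
	}
}
```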
I feel like a recommended workflow would be to acknowledge the incident in PagerDuty, then fix the problem, and let the system resolve itself naturally.
That way, if I think I fixed it but the incident is not actually resolved within PagerDuty's "Incident Ack Timeout", I will get notified again.