Proposal
This is going to be a long issue 😉
How Things Work Today
Before we talk about what I think we should change, let me give a brief overview of how notifications work in Alertmanager.
Alertmanager notifications are driven by the Notification Pipeline that’s defined in notify.go. Every group_interval (more or less), an entire Alert Group of alerts is flushed into the notification pipeline. Alerts are captured at the point of flush, so they cannot change during the execution of the pipeline. This essentially gives the pipeline a point-in-time view of a single group of alerts.
The pipeline completes several steps. At a high level, there are basically three operations:
- Apply any muting behavior (silences, inhibits, time intervals)
- Compare the current state of the Alert Group to the previous state
- Send the notification and record the result
My focus is on the first two.
For muting stages, the alerts in the Alert Group are processed and Alertmanager decides which ones are currently muted. Muted alerts are removed from the Alert Group for the remainder of the pipeline. For example, if my Alert Group consists of A, B, and C and B is silenced, the pipeline continues with just A and C. Since muting happens before comparison to the previous group state, we compare only active alerts.
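To make the filtering step concrete, here is a minimal sketch of what a muting stage does to a group. The `Alert` type, the `Muter` signature, and `filterMuted` are hypothetical names for illustration only; they are not Alertmanager's actual types.

```go
package main

// Alert is a hypothetical, stripped-down stand-in for an Alertmanager alert.
type Alert struct {
	Name string
}

// Muter reports whether an alert is currently muted (silenced, inhibited,
// or outside an active time interval). The signature is an assumption.
type Muter func(a Alert) bool

// filterMuted returns the alerts that none of the muters claim, preserving
// their order. Muted alerts are dropped for the rest of the pipeline.
func filterMuted(group []Alert, muters ...Muter) []Alert {
	active := make([]Alert, 0, len(group))
	for _, a := range group {
		muted := false
		for _, m := range muters {
			if m(a) {
				muted = true
				break
			}
		}
		if !muted {
			active = append(active, a)
		}
	}
	return active
}
```

With a group of A, B, and C and a muter that matches B, `filterMuted` returns A and C, which is all the later stages ever see.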
Comparison is fairly simple, at least conceptually. Alertmanager writes every successful notification into the Notification Log. Each notification is represented by a log entry:
alertmanager/nflog/nflogpb/nflog.proto, lines 19 to 40 in 9aae092:

```protobuf
// Entry holds information about a successful notification
// sent to a receiver.
message Entry {
  // The key identifying the dispatching group.
  bytes group_key = 1;
  // The receiver that was notified.
  Receiver receiver = 2;
  // Hash over the state of the group at notification time.
  // Deprecated in favor of FiringAlerts field, but kept for compatibility.
  bytes group_hash = 3;
  // Whether the notification was about a resolved alert.
  // Deprecated in favor of ResolvedAlerts field, but kept for compatibility.
  bool resolved = 4;
  // Timestamp of the succeeding notification.
  google.protobuf.Timestamp timestamp = 5;
  // FiringAlerts list of hashes of firing alerts at the last notification time.
  repeated uint64 firing_alerts = 6;
  // ResolvedAlerts list of hashes of resolved alerts at the last notification time.
  repeated uint64 resolved_alerts = 7;
  // Data specific to the receiver which sent the notification
  map<string, ReceiverDataValue> receiver_data = 8;
}
```
As you can see, the log entry essentially models the state of alerts in the group. When the DedupStage performs comparison, it reads the latest nflog entry and compares its active and resolved alerts to the current sets of active and resolved alerts. If anything changed, the DedupStage returns a result indicating that the pipeline should continue. If nothing has changed, the DedupStage short-circuits the pipeline.
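The comparison can be sketched roughly as follows. This is a simplified illustration, not the real DedupStage: it assumes "changed" means the current firing or resolved hashes are not a subset of what the last entry recorded, and it ignores repeat_interval, send_resolved, and the all-resolved special case. `Entry`, `isSubset`, and `changed` are names made up for this sketch.

```go
package main

// Entry mirrors only the two nflog fields the comparison relies on.
type Entry struct {
	FiringAlerts   []uint64
	ResolvedAlerts []uint64
}

// isSubset reports whether every hash in current also appears in previous.
func isSubset(current, previous []uint64) bool {
	seen := make(map[uint64]struct{}, len(previous))
	for _, h := range previous {
		seen[h] = struct{}{}
	}
	for _, h := range current {
		if _, ok := seen[h]; !ok {
			return false
		}
	}
	return true
}

// changed reports whether the group contains firing or resolved alerts that
// the last notification did not cover. A nil entry means nothing was sent yet.
func changed(last *Entry, firing, resolved []uint64) bool {
	if last == nil {
		return len(firing) > 0
	}
	return !isSubset(firing, last.FiringAlerts) ||
		!isSubset(resolved, last.ResolvedAlerts)
}
```

Note that both inputs are computed after muting, so a muted alert never shows up in `firing` or `resolved` here.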
The Problem
The interaction of these two mechanisms creates a problem: muted alerts are never considered when deciding if the state of the Alert Group has changed. In fact, when an alert is muted, it essentially disappears from the nflog completely. This is how Alertmanager has worked for a very long time, but I believe this is fundamentally a bug.
It can lead to some weird outcomes: For example, an alert that triggers a notification can become muted after the notification is fired. When that alert resolves, the notification will not be resolved because the alert is filtered from the notification pipeline before reaching the DedupStage. For some integrations, this is acceptable while for others it is not. One of our most upvoted issues, #226, is directly caused by this behavior.
For the following examples, let’s consider a deduplicating integration like PagerDuty. Only the first notification for a group will trigger a “real” notification for humans. Several integrations can be configured to behave this way, and generally I think this is a popular way to manage notifications. Consider the following sequence:
- A begins firing and a notification is sent
- B begins firing
- A & B both resolve and a resolved notification is sent
I think this is pretty intuitive. However, things can get a little more complicated when we introduce silences:
If A is silenced before B begins firing, Alertmanager does not send a new notification for B. For many reading this, that will be intuitive behavior. However, I'd argue this behavior is a bit problematic: the time between A being silenced and B firing could be very long. For example, the person who received the notification for A may not be the same person who is on-call when B fires. Because A is silenced, the Alert Group is completely hidden by the time B begins firing. However, for integrations like PagerDuty, Alertmanager will not send any notifications.
To see truly strange behavior, we just need to change the order of operations. Imagine B resolves before A is unsilenced:
In this case, Alertmanager will send a resolved notification for the group when B resolves even though Alertmanager never actually sent a new notification for B! When A is unsilenced, it will send another notification! This is very unintuitive given the behavior discussed above. If you don't believe me, take a look at the logic in the DedupStage:
alertmanager/notify/notify.go, lines 782 to 794 in 9aae092:

```go
// Notify about all alerts being resolved.
// This is done irrespective of the send_resolved flag to make sure that
// the firing alerts are cleared from the notification log.
if len(firing) == 0 {
	// If the current alert group and last notification contain no firing
	// alert, it means that some alerts have been fired and resolved during the
	// last interval. In this case, there is no need to notify the receiver
	// since it doesn't know about them.
	if len(entry.FiringAlerts) > 0 {
		return ReasonAllAlertsResolved
	}
	return ReasonDoNotNotify
}
```
When we enter this block, entry will contain firing alert B, firing will be empty, and resolved will contain the now resolved B.
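To trace the whole sequence, here is a small replay of the scenario under a simplified version of the dedup rule. It is a sketch under assumptions: alert names stand in for hashes, `ReasonNewAlerts` is a hypothetical name for "the firing set is not covered by the last entry", and repeat_interval and send_resolved are ignored. It also assumes that the flush where only B is active is recorded in the nflog (which is what "entry will contain firing alert B" implies), with a deduplicating receiver such as PagerDuty folding that event into the existing incident.

```go
package main

type Reason string

const (
	ReasonDoNotNotify       Reason = "do not notify"
	ReasonAllAlertsResolved Reason = "all alerts resolved"
	ReasonNewAlerts         Reason = "new alerts" // hypothetical name
)

// Entry is a stand-in for the nflog entry; names replace hashes for clarity.
type Entry struct {
	FiringAlerts   []string
	ResolvedAlerts []string
}

// Flush is the group as the DedupStage sees it, i.e. after muting.
type Flush struct {
	Firing   []string
	Resolved []string
}

func subset(cur, prev []string) bool {
	seen := make(map[string]bool, len(prev))
	for _, p := range prev {
		seen[p] = true
	}
	for _, c := range cur {
		if !seen[c] {
			return false
		}
	}
	return true
}

// decide applies the simplified rule, including the block quoted above.
func decide(entry *Entry, firing []string) Reason {
	if entry == nil {
		if len(firing) > 0 {
			return ReasonNewAlerts
		}
		return ReasonDoNotNotify
	}
	if !subset(firing, entry.FiringAlerts) {
		return ReasonNewAlerts
	}
	if len(firing) == 0 && len(entry.FiringAlerts) > 0 {
		return ReasonAllAlertsResolved
	}
	return ReasonDoNotNotify
}

// replay runs the flushes in order and records a new entry after every
// notification, returning the reason chosen at each flush.
func replay(flushes []Flush) []Reason {
	var entry *Entry
	reasons := make([]Reason, 0, len(flushes))
	for _, f := range flushes {
		r := decide(entry, f.Firing)
		if r != ReasonDoNotNotify {
			entry = &Entry{FiringAlerts: f.Firing, ResolvedAlerts: f.Resolved}
		}
		reasons = append(reasons, r)
	}
	return reasons
}
```

Replaying four flushes (A firing; A silenced and B firing; B resolved while A is still silenced; A unsilenced) yields new alerts, new alerts, all alerts resolved, new alerts. The third step is the surprising one: the group is reported as resolved even though A is still firing behind its silence.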
Proposal
So, what should happen? I’d argue that the only consistent behavior is for Alertmanager to consider muted alerts when sending notifications. The nflog should be changed to include a muted alerts field to go with firing and resolved.
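As a rough sketch of what that could look like on the Go side, the entry gains a third set and the comparison takes it into account. The `MutedAlerts` field name and both helper functions are assumptions for illustration; nothing here exists in the nflog today.

```go
package main

import "sort"

// Entry shows the existing firing and resolved sets plus a hypothetical
// MutedAlerts set, mirroring the proposed new nflog field.
type Entry struct {
	FiringAlerts   []uint64
	ResolvedAlerts []uint64
	MutedAlerts    []uint64 // hypothetical: hashes of muted alerts
}

// sameSet reports whether a and b hold the same hashes, ignoring order.
// It assumes neither slice contains duplicates.
func sameSet(a, b []uint64) bool {
	if len(a) != len(b) {
		return false
	}
	x := append([]uint64(nil), a...)
	y := append([]uint64(nil), b...)
	sort.Slice(x, func(i, j int) bool { return x[i] < x[j] })
	sort.Slice(y, func(i, j int) bool { return y[i] < y[j] })
	for i := range x {
		if x[i] != y[i] {
			return false
		}
	}
	return true
}

// mutedChanged reports whether the set of muted alerts differs from what the
// last notification recorded, so muting becomes a visible state change.
func mutedChanged(last Entry, muted []uint64) bool {
	return !sameSet(last.MutedAlerts, muted)
}
```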
For integrations tracking state, we may be able to communicate a “muted” state or just send a resolved notification:
We'll need to support three ways to handle muted alerts:
- The existing behavior. Alertmanager has behaved this way for years, so we cannot make such a big change by default.
- New behavior that's muted alert aware. For example, the PagerDuty integration could snooze incidents when all alerts are muted or the Slack integration could send a message indicating which alerts are newly silenced.
- Resolved notifications. For example, the PagerDuty integration could resolve an incident when all active alerts are no longer active for any reason.
Modeling
I propose that we introduce a new NotificationSequence type which represents the state of the sequence of notifications about an Alert Group at a point in time. The Notification Pipeline will construct the NotificationSequence during the DedupStage. No new structures would be exposed in the nflog (aside from the new muted field): the NotificationSequence can be constructed from the existing data.
This would semantically model the way I think most of us already think about notifications for a group: as a sequence of related events over time. Alertmanager doesn’t model anything like this, but it is implicit in the existing behavior.
The NotificationSequence is a very simple state machine with two states:
- open: a group with active alerts
- closed: a group with no active alerts
The states of the sequence form intervals where a sequence is opened and then eventually closed. At any point in time, the sequence is either in an open interval or closed.
Each of the existing NotificationReasons will be state transitions for the NotificationSequence state machine:
I’ve also added a few state transitions for muted alerts. Most importantly, muting all alerts in the group should close the NotificationSequence because there are no more active alerts. This allows integrations to send resolved messages for muted groups. We will keep the NotificationReason, so integrations can also use special behavior for “closed by reason of muting” to preserve legacy behavior or introduce new behavior that’s mute aware.
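A minimal sketch of the state machine might look like this. Everything in it is illustrative: the field layout, the `next` function, and the reason names `ReasonNewAlerts` and `ReasonAllAlertsMuted` are assumptions, and the real transitions would be driven by the full set of NotificationReasons rather than by two counters.

```go
package main

// State is one of the two NotificationSequence states.
type State int

const (
	Closed State = iota // no active alerts in the group
	Open                // at least one active alert in the group
)

type Reason string

const (
	ReasonDoNotNotify       Reason = "do not notify"
	ReasonAllAlertsResolved Reason = "all alerts resolved"
	ReasonNewAlerts         Reason = "new alerts"       // hypothetical name
	ReasonAllAlertsMuted    Reason = "all alerts muted" // hypothetical name
)

// NotificationSequence pairs the current state with the reason for the
// transition that produced it, so integrations can tell "closed because
// resolved" apart from "closed because muted".
type NotificationSequence struct {
	State  State
	Reason Reason
}

// next derives the sequence from the previous state and the number of
// active and muted alerts in the current flush.
func next(prev State, active, muted int) NotificationSequence {
	switch {
	case active > 0 && prev == Closed:
		return NotificationSequence{Open, ReasonNewAlerts}
	case active > 0:
		return NotificationSequence{Open, ReasonDoNotNotify}
	case prev == Open && muted > 0:
		return NotificationSequence{Closed, ReasonAllAlertsMuted}
	case prev == Open:
		return NotificationSequence{Closed, ReasonAllAlertsResolved}
	default:
		return NotificationSequence{Closed, ReasonDoNotNotify}
	}
}
```

An integration that wants the legacy behavior could ignore a close whose reason is muting, while a mute-aware one could resolve or snooze on it.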
If you've made it this far, thanks for reading! I'm very interested in feedback on this proposal!