Improve Alertmanager Semantics for Partially Muted Alert Groups #5247

### Proposal

This is going to be a long issue 😉

## How Things Work Today

Before we talk about what I think we should change, let me give a brief overview of how notifications work in Alertmanager.

Alertmanager notifications are driven by the Notification Pipeline that’s defined in `notify.go`. Every `group_interval` (more or less), an entire _Alert Group_ of alerts is flushed into the notification pipeline. Alerts are captured at the point of flush, so they cannot change during the execution of the pipeline. This essentially gives the pipeline a point-in-time view of a single group of alerts.

The pipeline completes several steps. At a high level, there are basically three operations: 

1. Apply any muting behavior (silences, inhibits, time intervals)
2. Compare the current state of the Alert Group to the previous state
3. Send the notification and record the result

My focus is on the first two. 

In the muting stages, Alertmanager processes the alerts in the Alert Group and decides which ones are currently muted. Muted alerts are removed from the Alert Group for the remainder of the pipeline. For example, if my Alert Group consists of `A`, `B`, and `C` and `B` is silenced, the pipeline will continue with just `A` and `C`. Since muting happens before comparison to the previous group state, we compare only active alerts.
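As a rough illustration of that filtering step, here is a minimal Go sketch. The `Alert` type and `dropMuted` function are hypothetical stand-ins, not Alertmanager's real types; the actual muting stages consult silences, inhibit rules, and time intervals rather than a boolean flag.

```go
package main

import "fmt"

// Alert is a simplified, hypothetical stand-in for an Alertmanager alert.
type Alert struct {
	Name  string
	Muted bool // true if silenced, inhibited, or inside a muted time interval
}

// dropMuted returns only the alerts that are not muted, which is the view
// of the group that the rest of the pipeline gets to see.
func dropMuted(group []Alert) []Alert {
	active := make([]Alert, 0, len(group))
	for _, a := range group {
		if !a.Muted {
			active = append(active, a)
		}
	}
	return active
}

func main() {
	group := []Alert{{Name: "A"}, {Name: "B", Muted: true}, {Name: "C"}}
	for _, a := range dropMuted(group) {
		fmt.Println(a.Name) // B never reaches the later stages
	}
}
```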

Comparison is fairly simple, at least conceptually. Alertmanager writes every successful notification into the Notification Log. Each notification is represented by a log entry: 

https://github.com/prometheus/alertmanager/blob/9aae092dab0eb454a53e10b151295739db7334a0/nflog/nflogpb/nflog.proto#L19-L40

As you can see, the log entry essentially models the state of alerts in the group. When the `DedupStage` performs comparison, it reads the latest nflog entry and compares its active and resolved alerts to the current sets of active and resolved alerts. If anything changed, the `DedupStage` returns a result indicating that the pipeline should continue. If nothing has changed, the `DedupStage` short-circuits the pipeline.
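The comparison can be sketched roughly as a subset check. This is a simplified model under my own assumptions: `logEntry`, `set`, `isSubset`, and `changed` are hypothetical names, alerts are identified by name instead of hash, and details such as `repeat_interval` and `send_resolved` are left out.

```go
package main

import "fmt"

// logEntry is a simplified stand-in for the latest nflog entry of a group.
type logEntry struct {
	firing   map[string]bool
	resolved map[string]bool
}

// set builds a string set from its arguments.
func set(names ...string) map[string]bool {
	s := make(map[string]bool, len(names))
	for _, n := range names {
		s[n] = true
	}
	return s
}

// isSubset reports whether every member of sub is also in super.
func isSubset(sub, super map[string]bool) bool {
	for name := range sub {
		if !super[name] {
			return false
		}
	}
	return true
}

// changed reports whether the current sets contain anything the last
// notification did not cover. A nil entry means nothing was sent before.
func changed(last *logEntry, firing, resolved map[string]bool) bool {
	if last == nil {
		return len(firing) > 0
	}
	return !isSubset(firing, last.firing) || !isSubset(resolved, last.resolved)
}

func main() {
	last := &logEntry{firing: set("A", "C"), resolved: set()}
	fmt.Println(changed(last, set("A", "C"), set()))      // same sets as before
	fmt.Println(changed(last, set("A", "C", "D"), set())) // D is new
	fmt.Println(changed(last, set("A"), set("C")))        // C has resolved
}
```

Note that a muted alert is in neither the firing nor the resolved set, so nothing in this comparison can see it.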

## The Problem

The interaction of these two mechanisms creates a problem: muted alerts are never considered when deciding if the state of the Alert Group has changed. In fact, when an alert is muted, it essentially disappears from the nflog completely. This is how Alertmanager has worked for a very long time, but I believe this is fundamentally a bug.

It can lead to some weird outcomes: For example, an alert that triggers a notification can become muted after the notification is fired. When that alert resolves, the notification will not be resolved because the alert is filtered from the notification pipeline before reaching the `DedupStage`. For some integrations, this is acceptable while for others it is not. One of our most upvoted issues, #226, is directly caused by this behavior.

For the following examples, let’s consider a deduplicating integration like PagerDuty. Only the first notification for a group will trigger a “real” notification for humans. Several integrations can be configured to behave this way, and generally I think this is a popular way to manage notifications. Consider the following sequence:

1. A begins firing and a notification is sent
2. B begins firing
3. A & B both resolve and a resolved notification is sent

<img width="696" height="431" alt="Image" src="https://github.com/user-attachments/assets/d7e8fd02-9db1-4cd0-b08b-54a4a8ebd185" />

I think this is pretty intuitive. However, things can get a little more complicated when we introduce silences:

<img width="856" height="435" alt="Image" src="https://github.com/user-attachments/assets/4182fca9-e386-4154-a93b-4cd59b4e28d0" />

If A is silenced before B begins firing, Alertmanager does not send a new notification for B. For many reading this, that will be intuitive behavior. However, I'd argue this behavior is a bit problematic: the time between A being silenced and B firing could be very long. For example, the person who receives the notification for A may not be the same person who is on-call when B fires. Because A was silenced, the Alert Group is completely hidden by the time B begins firing. However, for integrations like PagerDuty, Alertmanager will not send any notifications.

To see truly strange behavior, we just need to change the order of operations. Imagine B resolves before A is unsilenced:

<img width="1289" height="435" alt="Image" src="https://github.com/user-attachments/assets/45b4503b-a832-423e-871b-9472a9dfb129" />

In this case, Alertmanager will send a resolved notification for the group when B resolves even though Alertmanager never actually sent a new notification for B! When A is unsilenced, it will send another notification! This is very unintuitive given the behavior discussed above. If you don't believe me, take a look at the logic in the `DedupStage`:

https://github.com/prometheus/alertmanager/blob/9aae092dab0eb454a53e10b151295739db7334a0/notify/notify.go#L782-L794

When we enter this block, `entry` will contain firing alert B, `firing` will be empty, and `resolved` will contain the now resolved B.
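To make those two flushes concrete, here is a small sketch that starts from the state described above. It is not the real `DedupStage`: `decide` and the `reason` constants are hypothetical names, alerts are plain strings, and only the firing sets are modeled.

```go
package main

import "fmt"

type reason string

// Hypothetical reasons, loosely mirroring the ones in the quoted code.
const (
	reasonDoNotNotify       reason = "do not notify"
	reasonAllAlertsResolved reason = "all alerts resolved"
	reasonFiringChanged     reason = "firing alerts changed"
)

// decide compares the firing alerts recorded in the last log entry with the
// alerts that are firing (and not muted) in the current flush.
func decide(loggedFiring, firing []string) reason {
	if len(firing) == 0 {
		// Same shape as the quoted branch: resolve only if the log
		// still believes something is firing.
		if len(loggedFiring) > 0 {
			return reasonAllAlertsResolved
		}
		return reasonDoNotNotify
	}
	logged := make(map[string]bool, len(loggedFiring))
	for _, a := range loggedFiring {
		logged[a] = true
	}
	for _, a := range firing {
		if !logged[a] {
			return reasonFiringChanged
		}
	}
	return reasonDoNotNotify
}

func main() {
	// B resolves while A is still silenced: the log holds B, nothing is firing.
	fmt.Println(decide([]string{"B"}, nil))
	// The resolved notification cleared the logged firing set; A is unsilenced.
	fmt.Println(decide(nil, []string{"A"}))
}
```

The first call yields the resolved notification and the second yields a fresh notification for A, even though A has been firing the whole time.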

## Proposal

So, what should happen? I’d argue that the only consistent behavior is for Alertmanager to consider muted alerts when sending notifications. The nflog should be changed to include a `muted` alerts field to go with `firing` and `resolved`. 
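As a sketch of what a `muted` field would make possible, the snippet below finds alerts that were firing at the last notification and are muted now. Everything here is an assumption for illustration: the `entry` struct is a simplified stand-in for the nflog entry, `Muted` is the proposed field, and `newlyMuted` is a hypothetical helper.

```go
package main

import "fmt"

// entry is a simplified stand-in for an nflog entry. Firing and Resolved
// mirror the existing hash lists; Muted is the hypothetical new field.
type entry struct {
	Firing   []uint64
	Resolved []uint64
	Muted    []uint64
}

// newlyMuted returns the alerts that were firing in the last entry and are
// muted in the current flush.
func newlyMuted(last entry, mutedNow []uint64) []uint64 {
	wasFiring := make(map[uint64]bool, len(last.Firing))
	for _, h := range last.Firing {
		wasFiring[h] = true
	}
	var out []uint64
	for _, h := range mutedNow {
		if wasFiring[h] {
			out = append(out, h)
		}
	}
	return out
}

func main() {
	// Hashes 1 and 2 stand for alerts A and B at the last notification.
	last := entry{Firing: []uint64{1, 2}}
	// A has since been silenced.
	fmt.Println(newlyMuted(last, []uint64{1}))
}
```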

For integrations tracking state, we may be able to communicate a “muted” state or just send a resolved notification:

<img width="856" height="435" alt="Image" src="https://github.com/user-attachments/assets/9584bb5e-0404-4430-9362-02e4eb761735" />

We'll need to support three ways to handle muted alerts:

1. The existing behavior. Alertmanager has been behaving this way for years, so we cannot make such a big change by default.
2. New behavior that's muted alert aware. For example, the PagerDuty integration could snooze incidents when all alerts are muted or the Slack integration could send a message indicating which alerts are newly silenced.
3. Resolved notifications. For example, the PagerDuty integration could resolve an incident when all active alerts are no longer active for any reason.
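One way to picture the three options is as a per-integration mode. This is only a sketch: `muteHandling`, its constants, and `actionWhenAllMuted` are hypothetical names, and the returned strings are placeholders for whatever each integration would actually do.

```go
package main

import "fmt"

// muteHandling is a hypothetical per-integration setting.
type muteHandling int

const (
	muteLegacy     muteHandling = iota // existing behavior: muted alerts are invisible
	muteAware                          // tell the receiver which alerts are muted
	muteAsResolved                     // treat a fully muted group as resolved
)

// actionWhenAllMuted describes what an integration would do once every
// alert it previously notified about has become muted.
func actionWhenAllMuted(m muteHandling) string {
	switch m {
	case muteAware:
		return "send muted update"
	case muteAsResolved:
		return "send resolved"
	default:
		return "send nothing"
	}
}

func main() {
	for _, m := range []muteHandling{muteLegacy, muteAware, muteAsResolved} {
		fmt.Println(actionWhenAllMuted(m))
	}
}
```

Keeping the legacy mode as the zero value matches the constraint that the default behavior must not change.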

### Modeling

I propose that we introduce a new `NotificationSequence` type which represents the state of the sequence of notifications about an Alert Group at a point in time. The Notification Pipeline will construct the `NotificationSequence` during the `DedupStage`. No new structures would be exposed in the nflog (aside from the new `muted` field): the `NotificationSequence` can be constructed from the existing data.

This would semantically model the way I think most of us already think about notifications for a group: as a sequence of related events over time. Alertmanager doesn’t model anything like this, but it is implicit in the existing behavior.

The `NotificationSequence` is a very simple state machine with two states:
 
1. `open` - representing a group with active alerts
2. `closed` - representing a group with no active alerts

The states of the sequence form _intervals_ where a sequence is opened and then eventually closed. At any point in time, we can be in an open interval or the sequence is closed.

Each of the existing `NotificationReason`s will be state transitions for the `NotificationSequence` state machine:

<img width="217" height="331" alt="Image" src="https://github.com/user-attachments/assets/da6e4570-94aa-4c21-aae6-fccf9e9b71e5" />

I’ve also added a few state transitions for muted alerts. Most importantly, muting all alerts in the group should close the `NotificationSequence` because there are no more active alerts. This allows integrations to send resolved messages for muted groups. We will keep the `NotificationReason`, so integrations can also use special behavior for “closed by reason of muting” to preserve legacy behavior or introduce new behavior that’s mute aware. 
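A minimal sketch of the state machine might look like the following. Only the `NotificationSequence` name and the `open`/`closed` states come from the proposal; the `Observe` method, its parameters, and the returned strings are assumptions made for illustration, and the real transitions would be expressed as `NotificationReason` values.

```go
package main

import "fmt"

// State of a notification sequence. The zero value is Closed.
type State int

const (
	Closed State = iota // no active alerts
	Open                // at least one active (firing, unmuted) alert
)

// NotificationSequence tracks whether a group currently has an open interval.
type NotificationSequence struct {
	State State
}

// Observe updates the sequence from a point-in-time view of the group and
// returns a description of the transition (hypothetical wording).
func (s *NotificationSequence) Observe(active, muted int) string {
	next := Closed
	if active > 0 {
		next = Open
	}
	prev := s.State
	s.State = next
	switch {
	case prev == Closed && next == Open:
		return "opened"
	case prev == Open && next == Closed:
		if muted > 0 {
			return "closed: all remaining alerts muted"
		}
		return "closed: all alerts resolved"
	default:
		return "unchanged"
	}
}

func main() {
	seq := &NotificationSequence{}
	fmt.Println(seq.Observe(1, 0)) // A fires
	fmt.Println(seq.Observe(0, 1)) // A is silenced
	fmt.Println(seq.Observe(1, 1)) // B fires while A is still silenced
	fmt.Println(seq.Observe(0, 0)) // everything resolves
}
```

In this model, silencing A closes the interval, so B firing later opens a new one instead of being folded into a notification nobody is looking at anymore.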

If you've made it this far, thanks for reading! I'm very interested in feedback on this proposal!

The `Entry` message referenced above (`nflog/nflogpb/nflog.proto`, lines 19-40):

	// Entry holds information about a successful notification
	// sent to a receiver.
	message Entry {
		// The key identifying the dispatching group.
		bytes group_key = 1;
		// The receiver that was notified.
		Receiver receiver = 2;
		// Hash over the state of the group at notification time.
		// Deprecated in favor of FiringAlerts field, but kept for compatibility.
		bytes group_hash = 3;
		// Whether the notification was about a resolved alert.
		// Deprecated in favor of ResolvedAlerts field, but kept for compatibility.
		bool resolved = 4;
		// Timestamp of the succeeding notification.
		google.protobuf.Timestamp timestamp = 5;
		// FiringAlerts list of hashes of firing alerts at the last notification time.
		repeated uint64 firing_alerts = 6;
		// ResolvedAlerts list of hashes of resolved alerts at the last notification time.
		repeated uint64 resolved_alerts = 7;
		// Data specific to the receiver which sent the notification
		map<string, ReceiverDataValue> receiver_data = 8;
	}

The `DedupStage` logic referenced above (`notify/notify.go`, lines 782-794):

	// Notify about all alerts being resolved.
	// This is done irrespective of the send_resolved flag to make sure that
	// the firing alerts are cleared from the notification log.
	if len(firing) == 0 {
		// If the current alert group and last notification contain no firing
		// alert, it means that some alerts have been fired and resolved during the
		// last interval. In this case, there is no need to notify the receiver
		// since it doesn't know about them.
		if len(entry.FiringAlerts) > 0 {
			return ReasonAllAlertsResolved
		}
		return ReasonDoNotNotify
	}
