Fix flapping notifications in HA mode #3283

Closed
wants to merge 1 commit into from
Conversation

yuri-tceretian (Contributor)

In some situations in high-availability mode, a cluster of Alertmanagers in a normal state (e.g. notifications are sent immediately, no delays in state propagation, etc.) can send multiple (flapping) notifications for the same alert because of an unfortunate combination of the peer_timeout and group_interval parameters.

How to reproduce the bug:

  • 3 instances of Alertmanager with peer_timeout=60s
  • A route with group_wait=30s, group_interval=70s, and repeat_interval=2d (to rule out repeat notifications), and a receiver that sends a webhook
  • Start the cluster.
  • Create an alert at time X
  • Wait 50 seconds and resolve the alert
  • Wait 110 seconds
  • Check the webhook target server. It should register 4 notifications: 1 - firing (at time X), 2 - resolved (at X+100s), 3 - firing (at X+150s), 4 - resolved (at X+160s)
Diagram (image): timing of ticks and notifications across the three Alertmanager instances.

alertmanager-bug.zip contains everything needed to reproduce the bug. Unpack the archive and run docker compose up. Once the cluster is up, run the script ./send.sh test, which sends an alert to all 3 instances, waits 50s, and then sends the same alert with EndsAt=now.
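The archive is not reproduced here, but roughly the script does something like the following hypothetical Go equivalent. The localhost ports 9093-9095 are an assumption about the docker-compose port mapping; alerts are posted to Alertmanager's POST /api/v2/alerts endpoint.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// postableAlert is the payload accepted by Alertmanager's POST /api/v2/alerts
// endpoint. Leaving endsAt unset means the alert is firing; setting it to the
// current time resolves the alert.
type postableAlert struct {
	Labels   map[string]string `json:"labels"`
	StartsAt *time.Time        `json:"startsAt,omitempty"`
	EndsAt   *time.Time        `json:"endsAt,omitempty"`
}

// The host ports are an assumption about how docker-compose maps the three
// instances; adjust them to match the archive.
var instances = []string{
	"http://localhost:9093",
	"http://localhost:9094",
	"http://localhost:9095",
}

// send posts the same alert to every instance in the cluster.
func send(alert postableAlert) error {
	body, err := json.Marshal([]postableAlert{alert})
	if err != nil {
		return err
	}
	for _, base := range instances {
		resp, err := http.Post(base+"/api/v2/alerts", "application/json", bytes.NewReader(body))
		if err != nil {
			return err
		}
		resp.Body.Close()
		fmt.Printf("%s -> %s\n", base, resp.Status)
	}
	return nil
}

func main() {
	labels := map[string]string{"alertname": "test"}
	start := time.Now()

	// Fire the alert on all 3 instances.
	if err := send(postableAlert{Labels: labels, StartsAt: &start}); err != nil {
		panic(err)
	}

	// Wait 50 seconds, then send the same alert with EndsAt=now to resolve it.
	time.Sleep(50 * time.Second)
	end := time.Now()
	if err := send(postableAlert{Labels: labels, StartsAt: &start, EndsAt: &end}); err != nil {
		panic(err)
	}
}
```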

This happens because an instance that spends more than group_interval in the WaitStage can encounter a state "from the future".
In the diagram above, Alertmanager 3, which processes the tick now=30 with 1 firing alert, compares the current aggregation group during the DedupStage with the state produced by Alertmanager 1 while it was processing the tick now=100, where the alert was already resolved.

This PR proposes a fix for this behavior. It introduces an additional check that is performed when the predicate DedupStage.needsUpdate returns true and a notification log entry exists.

The check compares the current tick time (the timestamp of the aggregation group tick) with the notification log timestamp. If the former is in the past, meaning a Log event happened after the flush, it emits an info-level log message and returns an empty slice of alerts, which stops the current pipeline.
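For illustration, here is a minimal, self-contained sketch of that comparison. It is not the PR's actual code: the logEntry type, its Timestamp field, and the way the tick time is passed in are simplifications of what Alertmanager does with the nflog entry and the pipeline context.

```go
package main

import (
	"fmt"
	"time"
)

// logEntry stands in for the notification log entry; in Alertmanager it is a
// protobuf entry whose timestamp records when the last notification for the
// group was logged. The type and field names here are illustrative.
type logEntry struct {
	Timestamp time.Time
}

// tickIsObsolete sketches the proposed check: if the notification log already
// holds an entry written *after* the tick currently being processed, the tick
// is based on stale state and the pipeline should be stopped.
func tickIsObsolete(entry *logEntry, tick time.Time) bool {
	return entry != nil && tick.Before(entry.Timestamp)
}

func main() {
	// Mirror the scenario from the diagram: Alertmanager 3 processes the tick
	// taken at now=30s, but Alertmanager 1 has already logged a notification
	// for the tick taken at now=100s.
	base := time.Now()
	entry := &logEntry{Timestamp: base.Add(100 * time.Second)}
	tick := base.Add(30 * time.Second)

	if tickIsObsolete(entry, tick) {
		// In the proposed change this is where DedupStage.Exec would emit an
		// info-level log message and return an empty slice of alerts.
		fmt.Println("notification log entry is newer than the current tick; stopping the pipeline")
	}
}
```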

@yuri-tceretian (Contributor Author)

I do not think the failing CI is related to my change. I inspected the logs and did not find any message that my change introduces.

@yuri-tceretian (Contributor Author)

I ran tests locally (go clean -testcache && make test) and they all pass.

Signed-off-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
@yuri-tceretian (Contributor Author)

I think I might have figured out the problem! My version of the main branch was ancient. I synced my fork and rebased the PR onto the head of main. CI is still failing, but now it's another test (seems to be flaky).

@gotjosh (Member) commented Mar 8, 2023

I have restarted the CI and filed #3287.

@@ -640,6 +640,17 @@ func (n *DedupStage) Exec(ctx context.Context, _ log.Logger, alerts ...*types.Al
 	}
 
 	if n.needsUpdate(entry, firingSet, resolvedSet, repeatInterval) {
+		if entry != nil {
Member
Why not move this verification into needsUpdate()?

@yuri-tceretian (Contributor Author) Mar 13, 2023

Initially I wanted to do it there. My idea was to emit a log message only when needsUpdate is positive (to better indicate the race). To achieve that inside needsUpdate, I would have to rewrite the method, which I thought would reduce clarity. I can refactor that method if you think the change would still be easy to understand.

@gotjosh (Member) commented Apr 6, 2023

I don't think this is the correct change we should make. If we're confident this only happens under certain conditions, then I'd say we need to protect against that at a configuration level rather than dropping evaluations.

I don't fully understand, nor am I confident about, what kind of side effects this change produces.

@yuri-tceretian (Contributor Author)

Closing as this does not seem to have any potential to be merged.

@yuri-tceretian yuri-tceretian deleted the fix-obsolete-tick-dedup branch May 24, 2023 17:20