Fix flapping notifications in HA mode #3283

Closed
wants to merge 1 commit into from
Conversation

yuri-tceretian (Contributor)

In some situations in high-availability mode, a cluster of Alertmanagers in a normal state (e.g. notifications are sent immediately, no delays in state propagation, etc.) can send multiple (flapping) notifications for the same alert because of an unfortunate combination of the peer_timeout and group_interval parameters.

How to reproduce the bug:

  • 3 instances of Alertmanager with peer_timeout=60s
  • A route with group_wait=30s, group_interval=70s, and repeat_interval=2d (to rule out repeat notifications), and a receiver that sends a webhook
  • Start the cluster.
  • Create an alert at time X
  • Wait 50 seconds and resolve the alert
  • Wait 110 seconds
  • Check the webhook target server. It should register 4 notifications: 1 - firing (at time X), 2 - resolved (at X+100s), 3 - firing (at X+150s), 4 - resolved (at X+160s)
Diagram (image): timing of ticks and notifications across the three Alertmanager instances.

alertmanager-bug.zip contains everything needed to reproduce the bug. Unpack the archive and run docker compose up. Once the cluster is up, run the script ./send.sh test, which sends an alert to all 3 instances, waits 50s, and then sends the same alert with EndsAt=now.
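The archive is not reproduced here, but roughly the script does something like the following hypothetical Go equivalent. The localhost ports 9093-9095 are an assumption about the docker-compose port mapping; alerts are posted to Alertmanager's POST /api/v2/alerts endpoint.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// postableAlert is the payload accepted by Alertmanager's POST /api/v2/alerts
// endpoint. Leaving endsAt unset means the alert is firing; setting it to the
// current time resolves the alert.
type postableAlert struct {
	Labels   map[string]string `json:"labels"`
	StartsAt *time.Time        `json:"startsAt,omitempty"`
	EndsAt   *time.Time        `json:"endsAt,omitempty"`
}

// The host ports are an assumption about how docker-compose maps the three
// instances; adjust them to match the archive.
var instances = []string{
	"http://localhost:9093",
	"http://localhost:9094",
	"http://localhost:9095",
}

// send posts the same alert to every instance in the cluster.
func send(alert postableAlert) error {
	body, err := json.Marshal([]postableAlert{alert})
	if err != nil {
		return err
	}
	for _, base := range instances {
		resp, err := http.Post(base+"/api/v2/alerts", "application/json", bytes.NewReader(body))
		if err != nil {
			return err
		}
		resp.Body.Close()
		fmt.Printf("%s -> %s\n", base, resp.Status)
	}
	return nil
}

func main() {
	labels := map[string]string{"alertname": "test"}
	start := time.Now()

	// Fire the alert on all 3 instances.
	if err := send(postableAlert{Labels: labels, StartsAt: &start}); err != nil {
		panic(err)
	}

	// Wait 50 seconds, then send the same alert with EndsAt=now to resolve it.
	time.Sleep(50 * time.Second)
	end := time.Now()
	if err := send(postableAlert{Labels: labels, StartsAt: &start, EndsAt: &end}); err != nil {
		panic(err)
	}
}
```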

This happens because an instance that spends more than group_interval in the WaitStage can encounter a state "from the future".
In the diagram above, Alertmanager 3, which processes the tick now=30 with 1 firing alert, compares the current aggregation group during the DedupStage with the state produced by Alertmanager 1 while it was processing the tick now=100, where the alert was already resolved.

This PR proposes a fix for this behavior. It introduces an additional check that is performed when the predicate DedupStage.needsUpdate returns true and a notification log entry exists.

The check compares the current tick time (the timestamp of the aggregation group tick) with the notification log timestamp. If the former is in the past, meaning a Log event happened after the flush, it emits an info-level log message and returns an empty slice of alerts, which stops the current pipeline.
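For illustration, here is a minimal, self-contained sketch of that comparison. It is not the PR's actual code: the logEntry type, its Timestamp field, and the way the tick time is passed in are simplifications of what Alertmanager does with the nflog entry and the pipeline context.

```go
package main

import (
	"fmt"
	"time"
)

// logEntry stands in for the notification log entry; in Alertmanager it is a
// protobuf entry whose timestamp records when the last notification for the
// group was logged. The type and field names here are illustrative.
type logEntry struct {
	Timestamp time.Time
}

// tickIsObsolete sketches the proposed check: if the notification log already
// holds an entry written *after* the tick currently being processed, the tick
// is based on stale state and the pipeline should be stopped.
func tickIsObsolete(entry *logEntry, tick time.Time) bool {
	return entry != nil && tick.Before(entry.Timestamp)
}

func main() {
	// Mirror the scenario from the diagram: Alertmanager 3 processes the tick
	// taken at now=30s, but Alertmanager 1 has already logged a notification
	// for the tick taken at now=100s.
	base := time.Now()
	entry := &logEntry{Timestamp: base.Add(100 * time.Second)}
	tick := base.Add(30 * time.Second)

	if tickIsObsolete(entry, tick) {
		// In the proposed change this is where DedupStage.Exec would emit an
		// info-level log message and return an empty slice of alerts.
		fmt.Println("notification log entry is newer than the current tick; stopping the pipeline")
	}
}
```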

@yuri-tceretian (Contributor Author)

I do not think the failing CI is related to my change. I inspected the logs and did not find any message that my change introduces.

@yuri-tceretian (Contributor Author)

I ran tests locally (go clean -testcache && make test) and they all pass.

Signed-off-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
@yuri-tceretian (Contributor Author)

I think I might have figured out the problem! My version of the main branch was ancient. I synced my fork and rebased the PR onto the head of main. CI is still failing, but now it's another test (seems to be flaky).

@gotjosh (Member) commented Mar 8, 2023

I have restarted the CI and filed #3287.

@@ -640,6 +640,17 @@ func (n *DedupStage) Exec(ctx context.Context, _ log.Logger, alerts ...*types.Al
 	}
 
 	if n.needsUpdate(entry, firingSet, resolvedSet, repeatInterval) {
+		if entry != nil {
Member
Why not move this verification into needsUpdate()?

@yuri-tceretian (Contributor Author) Mar 13, 2023

Initially I wanted to do it there. My idea was to emit a log message only when needsUpdate is positive (to better indicate the race). To achieve that inside needsUpdate, I would have to rewrite the method, which I thought would reduce clarity. I can refactor that method if you think the change would still be easy to understand.

@gotjosh (Member) commented Apr 6, 2023

I don't think this is the correct change we should make. If we're confident this only happens under certain conditions, then I'd say we need to protect against that at a configuration level rather than dropping evaluations.

I don't fully understand, nor am I confident about, what kind of side effects this change produces.

@yuri-tceretian (Contributor Author)

Closing as this does not seem to have any potential to be merged.

@yuri-tceretian yuri-tceretian deleted the fix-obsolete-tick-dedup branch May 24, 2023 17:20