Fix flapping notifications in HA mode #3283
Conversation
I do not think the failing CI is related to my change. I inspected the logs and did not find a message that I introduced. I ran the tests locally.
Signed-off-by: Yuri Tseretyan <yuriy.tseretyan@grafana.com>
I think I might have figured out the problem! My version of the main branch was ancient. I synced my fork and rebased the PR onto the head of main. CI is still failing, but now it's another test (it seems to be flaky).
I have restarted the CI and filed #3287.
```diff
@@ -640,6 +640,17 @@ func (n *DedupStage) Exec(ctx context.Context, _ log.Logger, alerts ...*types.Al
 }

 	if n.needsUpdate(entry, firingSet, resolvedSet, repeatInterval) {
+		if entry != nil {
```
Why not move this verification into needsUpdate()?
Initially I wanted to do it there. I had an idea to emit a log message only when needsUpdate is positive (to better indicate the race). To achieve that inside needsUpdate I would have to rewrite the method, which I thought would reduce clarity. I can refactor that method if you think it helps with understanding the change.
I don't think this is the correct change to make. If we're confident this only happens under certain conditions, then I'd say we need to protect against that at a configuration level rather than dropping evaluations. I don't fully understand, nor am I confident about, what kind of side effects this produces.
Closing as this does not seem to have any potential to be merged.
In some situations in high-availability mode, a cluster of Alertmanagers in a normal state (e.g. notifications are sent immediately, no delays in state propagation, etc.) can send multiple (flapping) notifications for the same alert because of an unfortunate combination of the parameters `peer_timeout` and `group_interval`.

How to reproduce the bug: set up a 3-instance cluster with `peer_timeout=60s`, `group_wait=30s`, `group_interval=70s`, and `repeat_interval=2d` (to exclude the possibility of repeat notifications), and a receiver that sends a webhook.

Diagram
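The reproduction setup above can be sketched as a configuration fragment. This is a minimal sketch, not the attached archive's actual file: the webhook URL is a placeholder, and `peer_timeout` is passed on the command line (as the `--cluster.peer-timeout` flag) rather than in this file.

```yaml
# Per-instance CLI flags (not part of this file):
#   --cluster.peer-timeout=60s  plus --cluster.peer=... for the other two instances
route:
  receiver: webhook
  group_wait: 30s
  group_interval: 70s
  repeat_interval: 2d   # effectively disables repeat notifications for this test

receivers:
  - name: webhook
    webhook_configs:
      - url: http://example.local:8080/webhook   # placeholder receiver endpoint
```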
alertmanager-bug.zip contains everything that is needed to reproduce the bug. To reproduce it, unpack the archive and run `docker compose up`. When the cluster is created, run the script `./send.sh test`, which sends an alert to all 3 instances, waits for 50s, and then sends the same alert but with EndsAt=now.

This happens because an instance that spends more than `group_interval` in the `WaitStage` can encounter a state "from the future". In the diagram above, `Alertmanager 3`, which processes tick `now=30` with 1 firing alert, compares during the DedupStage the current aggregation group with the state produced by `Alertmanager 1` while the latter was processing tick `now=100`, where the alert was resolved.
where alert was resolved.This PR proposes a fix for this behavior. It introduces an additional check if predicate DedupStage.needsUpdate returns true and notification log entry exists.
It compares the current tick time (the timestamp of the aggregation group tick) with the notification log timestamp, and if former is in the past, which means that there was Log event after the flushing happened, emits an info log and returns empty slice of alerts, which means that the current pipeline should be stopped.