Fixes early notifications after config reload (#2492) #3835

zecke · 2024-05-05T09:11:10Z

Add an acceptance test that triggers a config reload and verifies that no early notification is occurring.

One of the challenges is which time to use to check for a previous notification. The nflog captures about the time all notifications were sent. That conflicts with the ag.next timer that get's reset before the ag is being flushed. Delays and retries can make these two point in time be different enough for the integration tests to fail.

I considered the following ways to fix it:

1.) Modify the nflog.proto to capture the flush time in addition
to the successful notification time.
2.) In addition to hasFlushed capture the last flush time and pass
it to the context like we do for Now.
3.) Set hashFlushed and reset the timer after the flush has been
done.

I started with #3 as it seemeded to have the fewest downsides with things like drift.

needsUpdate is based on:
#3074 (comment)

zecke · 2024-05-05T09:36:59Z

I was trying to find an answer of why we set hasFlushed before we actually started flushing (leave alone finished it).

The flag was introduced in (#1301) to mitigate duplicate notifications by avoiding to execute ag.run() more frequently.

With this change the problem is solved inside the DedupStage. A ag.next.Reset(0) will flush the group early but as it happens before the groupInterval should not lead to extra notifications. The only visible impact should be jitter. If this matters we can replace the timer reset with a write into a channel.

grobinson-grafana · 2024-05-07T08:51:17Z

dispatch/dispatch.go

 			ctx = notify.WithGroupKey(ctx, ag.GroupKey())
 			ctx = notify.WithGroupLabels(ctx, ag.labels)
 			ctx = notify.WithReceiverName(ctx, ag.opts.Receiver)
 			ctx = notify.WithRepeatInterval(ctx, ag.opts.RepeatInterval)
 			ctx = notify.WithMuteTimeIntervals(ctx, ag.opts.MuteTimeIntervals)
 			ctx = notify.WithActiveTimeIntervals(ctx, ag.opts.ActiveTimeIntervals)

+			ag.flush(func(alerts ...*types.Alert) bool {


I haven't had time to look at the other changes, but I think there is a subtle bug here. Prior to this change, the timer was reset before the flush, so Group Interval is unaffected by the duration of the notify function. However, now the timer is reset after the flush. This means Group Interval is reset relative to the duration of the notify function instead of relative to the time of the previous flush. Won't this cause Group interval to drift over time?

I argue s/bug/trade-off/. It picks "success" (or timeout) as the point of last group notification. It might even match user experience better. It counts from the time I got paged (over the time it started to try to page me). On top of that we have no test case that tries to capture the drifting behavior.

In terms of drift we also have the small race for ag.next.Reset(0) and configuration reload (that ends up creating a whole new Dispatcher). The drift is larger in case of errors (where might end up skipping a notification anyway).

The alternative to the above is that we start tracking both the success and start time in the nflogpb.

The problem is that it breaks Alertmanager HA which relies on the Group wait and Group interval timers being in lock-step on all replicas for the same aggregation group. If those timers drift, then de-duplication in dedupStage will stop working.

Maybe I am missing but to my understanding the only communication between the replicas in the HA set-up is the nflog? In specific the dedup stage today relies on return entry.Timestamp.Before(n.now().Add(-repeat)) if no alerts changed and config was reloaded.

Today config reload is probably one of the likeliest places the group interval can drift. We re-create the dispatcher and the pipeline can run earlier than the group interval. And that is kind of the bug we see during config reload. The pipeline runs earlier and sees unrelated changes makes a notification, makes a new notification.

What do you think about recording the start of the notification and then using it to set the timer accordingly?

Maybe I am missing but to my understanding the only communication between the replicas in the HA set-up is the nflog? In specific the dedup stage today relies on return entry.Timestamp.Before(n.now().Add(-repeat)) if no alerts changed and config was reloaded.

That's correct!

However, there is also an implicit assumption in Alertmanager that the Group interval for each aggregration group is synchronized across all Alertmanagers in an HA set-up for dedup to work. It looks like this:

You can see here that AM2 and AM3 wait peer_timeout and 2*peer_timeout seconds in the wait stage. This is how Alertmanager failover works in case AM1 or AM2 cannot send a notification. However, this behavior relies on Group wait and Group interval timers firing at the same time. Alertmanagers do not communicate with each other to synchronize these timers, but instead use the time an alert was received from Prometheus as Prometheus broadcasts each alert to all Alertmanagers.

However, if the timers become de-synchronized then the wait stage stops working and each Alertmanager could enter the dedup and retry stages at the same time, causing duplicate notifications.

There are a number of issues with this, but the important issue here is that config reloads can cause the timers to be de-synchronized if the config is reloaded at different times. I think this is where the fix needs to be.

Yes. The above can happen. We could run flush more frequently and the downside is more CPU and the notification would still be off.

What makes this a bit hard is that the dispatcher doesn't have access to the nflogpb.Entry and that the entry is per receiver.

Given config reload, restart of alertmanager instances I can see two approaches.

We extend Stage.Exec to return the time the pipeline should run the next time. The FanoutState/MultiStage can then return the closest non-zero timestamp.

Extend the Stage interface to have a function that returns the next run time. We can either synchronize once (like we do this today) or continously.

I opted for the 2nd option. One of the acceptance tests required updates now that we "restore" the timer. I also changed that the repeat interval is based on dispatch time and not when the notification succeeded.

I plan on adding a few more tests but would appreciate some early comments.

Can you add a test for the following configuration:

route: group_wait: 30s group_interval: 5m repeat_interval: 4h

Where the reload happens 9 mins after the notification is sent. I think in this case the code will do an immediate flush due to Line 455 of dispatch.go when I assume it should restore the timer instead?

The issue is that nflog is not updated each time the aggregation group is flushed, just when a notification is sent.

This is where I'm not sure storing the flush timestamp in the nflog makes sense because the nflog is specific to notification deduplication.

What I'm thinking instead is can we get the existing timers from the old aggregation groups at reload time and passing it to the new groups? Groups which have had their time intervals changed will need to start again from Group wait.

Existing timers: Ideally I would like this to behave the same across config reload and restart of an instance.

Let me create a test and think about it. While we could gossip more it is not clear which instance(s) should gossip. I think there are two cases to handle

Avoid notification when nothing in the group has changed and repeat_interval hasn't passed yet

Synchronize back on the timer (with small-ish) jitter. What do you think we should be doing after having "missed" the right time to flush? The current code does ag.next.Reset(0) to get going quickly.

Updated to check how much time is left to the next group interval. This should also do the right thing when the group interval is changing.

Got the following log from the extra test (decided to still use seconds over running it for minutes)

Add alert (07:12:41)

Group Wait passes (07:12:42)

First notification

GroupInterval fired (07:12:50)

Config Reload (07:12:52)

Sync to the next run ~6s in the future

GroupInterval fires (07:12:58)

Add a second alert to the group (07:12:59)

GroupInterval (07:13:06)

ts=2024-05-26T07:12:42.496Z caller=dispatch.go:570 level=debug component=dispatcher aggrGroup="{}:{alertname=\"test1\"}" msg=flushing alerts=[test1[1a9707e][active]] ts=2024-05-26T07:12:50.497Z caller=dispatch.go:570 level=debug component=dispatcher aggrGroup="{}:{alertname=\"test1\"}" msg=flushing alerts=[test1[1a9707e][active]] ts=2024-05-26T07:12:52.400Z caller=coordinator.go:113 level=info component=configuration msg="Loading configuration file" file=/var/folders/nc/2kkt6df104q27x5z00xf32480000gn/T/am_test4278641254/config.yml ts=2024-05-26T07:12:52.403Z caller=dispatch.go:456 level=error component=dispatcher aggrGroup="{}:{alertname=\"test1\"}" msg=SYNC sleep=6.091589375s ts=2024-05-26T07:12:58.495Z caller=dispatch.go:570 level=debug component=dispatcher aggrGroup="{}:{alertname=\"test1\"}" msg=flushing alerts=[test1[1a9707e][active]] ts=2024-05-26T07:12:59.433Z caller=dispatch.go:166 level=debug component=dispatcher msg="Received alert" alert=test1[6345262][active] ts=2024-05-26T07:13:06.495Z caller=dispatch.go:570 level=debug component=dispatcher aggrGroup="{}:{alertname=\"test1\"}" msg=flushing alerts="[test1[1a9707e][active] test1[6345262][active]]" ts=2024-05-26T07:13:06.497Z caller=notify.go:958 level=debug component=dispatcher receiver=default integration=webhook[0] aggrGroup="{}:{alertname=\"test1\"}" alerts="[test1[1a9707e][active] test1[6345262][active]]" msg="Notify success" attempts=1 duration=1.878875ms

Add an acceptance test that triggers a config reload and verifies that no early notification is occurring. One of the challenges is which time to use to check for a previous notification. The nflog captures about the time all notifications were sent. That conflicts with the ag.next timer that get's reset before the ag is being flushed. Delays and retries can make these two point in time be different enough for the integration tests to fail. I considered the following ways to fix it: 1.) Modify the nflog.proto to capture the flush time in addition to the successful notification time. 2.) In addition to hasFlushed capture the last flush time and pass it to the context like we do for Now. 3.) Set hashFlushed and reset the timer after the flush has been done. I started with prometheus#3 as it seemeded to have the fewest downsides with things like drift. Based on comments this is no prometheus#1. needsUpdate is based on: prometheus#3074 (comment) Signed-off-by: Holger Hans Peter Freyther <holger@freyther.de>

When creating an AggregationGroup consult the Stage about the next expected time to `Exec` and use it to set the initial timer. Update the acceptance test to match the new behavior and add a test to receive the updated group at group_interval timeout. Signed-off-by: Holger Hans Peter Freyther <holger@freyther.de>

Use the dispatch time instead of the last successful notification for the repeat interval. This avoids a drift and also keeps the same time during notifications. Signed-off-by: Holger Hans Peter Freyther <holger@freyther.de>

zecke · 2024-09-11T12:07:17Z

Any chance we can make progress here?

grobinson-grafana reviewed May 7, 2024

View reviewed changes

zecke force-pushed the freyth-cfg-re-notify branch from e1464ba to e4c97e3 Compare May 10, 2024 11:09

zecke force-pushed the freyth-cfg-re-notify branch 2 times, most recently from 91dfd29 to c54e386 Compare May 19, 2024 00:58

zecke added 3 commits May 24, 2024 12:11

notify: Use the DispatchTime and now as base for the repeat

369954b

Use the dispatch time instead of the last successful notification for the repeat interval. This avoids a drift and also keeps the same time during notifications. Signed-off-by: Holger Hans Peter Freyther <holger@freyther.de>

zecke force-pushed the freyth-cfg-re-notify branch from c54e386 to 369954b Compare May 26, 2024 07:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixes early notifications after config reload (#2492) #3835

Fixes early notifications after config reload (#2492) #3835

zecke commented May 5, 2024

zecke commented May 5, 2024

grobinson-grafana May 7, 2024

zecke May 7, 2024 •

edited

Loading

grobinson-grafana May 7, 2024

zecke May 8, 2024

grobinson-grafana May 8, 2024

zecke May 17, 2024

zecke May 19, 2024

grobinson-grafana May 20, 2024

zecke May 25, 2024

zecke May 26, 2024

zecke commented Sep 11, 2024

Fixes early notifications after config reload (#2492) #3835

Are you sure you want to change the base?

Fixes early notifications after config reload (#2492) #3835

Conversation

zecke commented May 5, 2024

zecke commented May 5, 2024

Choose a reason for hiding this comment

zecke May 7, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zecke commented Sep 11, 2024

zecke May 7, 2024 •

edited

Loading