emails are resent on config reload if group contents has changed #2492

Open

asuffield opened this issue Feb 25, 2021 · 1 comment
asuffield commented Feb 25, 2021

What did you do?

Write a route with a group match, of the form:

```yaml
group_by:
- alertname
group_wait: 10s
group_interval: 1h
repeat_interval: 4h
```

Perform these operations, in order:

```shell
amtool alert add alertname=test tag=1
sleep 11
# observe the notification being sent
amtool alert add alertname=test tag=2
killall -HUP alertmanager
```
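For reference, a complete minimal configuration wrapping that route might look like the following (the receiver name, addresses, and SMTP host are placeholders, not taken from the report):

```yaml
route:
  receiver: team-email
  group_by:
  - alertname
  group_wait: 10s
  group_interval: 1h
  repeat_interval: 4h

receivers:
- name: team-email
  email_configs:
  - to: oncall@example.com
    from: alertmanager@example.com
    smarthost: smtp.example.com:25
```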

What did you expect to see?

One hour later, notifications should be resent, when group_interval expires.

What did you see instead? Under which circumstances?

Notifications for tag=1 and tag=2 are resent immediately when SIGHUP is received.

This is notably specific to the email receiver, because the other ones can deduplicate on group_key, which hides the problem.

Environment

  • Alertmanager version: 0.21

So, here's what's going on:

Grouping is implemented in the dispatcher, while deduplication is implemented in the notification pipeline. The notification pipeline is stateful, via nflog. The dispatcher only keeps its state in memory.

Config reloading is implemented by completely stopping the dispatcher and constructing a new one from config. This also clears all aggregation group state.

When the config is reloaded, the new dispatcher recreates all the aggregation groups from the in-memory alert state. This results in dispatching all the events into the notification pipeline again.

If the set of alerts is unchanged since the last time this occurred, the DedupStage in the notification pipeline will suppress them based on the contents of nflog, respecting repeat_interval as expected. However, if the set of alerts in the group has changed at all, DedupStage cannot recognise the notification as a duplicate, and it is sent on to the receiver.

If the receiver is anything other than email, the group key and timestamps will not have changed so the duplicate notifications can be suppressed. The email receiver is unable to do this (because it's email...) so duplicate emails are sent.

Half of this story is explaining why we don't see this happen all the time: the grouping functionality doesn't really work across config reloads, and probably also behaves badly across cluster gossip, but in most configurations the other features are effective at hiding the problem.

I'm thinking that the correct solution here would be for aggregation grouping to be implemented in the notification pipeline based on nflog. However, a crude and immediate solution is likely achievable by having the dispatcher query nflog when creating an aggregation group, and adjusting its timings if this group key was successfully sent in the last group_interval; this would solve the config reload case, but I'm not sure it's the right way to go in the presence of cluster gossip.

Opinions?

zecke (Contributor) commented Sep 14, 2022

What about the case where group_wait is larger than the time between config reloads? Moving aggregation down into the notification pipeline seems feasible, but so does propagating the state from the old dispatcher to the new one.

zecke added a commit to zecke/alertmanager that referenced this issue May 5, 2024
Add an acceptance test that triggers a config reload and verifies
that no early notification is occurring.

One of the challenges is which time to use to check for a previous
notification. The nflog roughly captures the time at which all
notifications were sent. That conflicts with the ag.next timer, which
gets reset before the ag is flushed. Delays and retries can make these
two points in time differ enough for the integration tests to fail.

I considered the following ways to fix it:

  1.) Modify the nflog.proto to capture the flush time in addition
      to the successful notification time.
  2.) In addition to hasFlushed capture the last flush time and pass
      it to the context like we do for Now.
  3.) Set hasFlushed and reset the timer after the flush has been
      done.

I started with prometheus#3 as it seemed to have the fewest downsides
with things like drift.

needsUpdate is based on:
prometheus#3074 (comment)

Signed-off-by: Holger Hans Peter Freyther <holger@freyther.de>
zecke added a commit to zecke/alertmanager that referenced this issue May 10, 2024
Add an acceptance test that triggers a config reload and verifies
that no early notification is occurring.

One of the challenges is which time to use to check for a previous
notification. The nflog roughly captures the time at which all
notifications were sent. That conflicts with the ag.next timer, which
gets reset before the ag is flushed. Delays and retries can make these
two points in time differ enough for the integration tests to fail.

I considered the following ways to fix it:

  1.) Modify the nflog.proto to capture the flush time in addition
      to the successful notification time.
  2.) In addition to hasFlushed capture the last flush time and pass
      it to the context like we do for Now.
  3.) Set hasFlushed and reset the timer after the flush has been
      done.

I started with prometheus#3 as it seemed to have the fewest downsides
with things like drift. Based on comments this is now prometheus#1.

needsUpdate is based on:
prometheus#3074 (comment)

Signed-off-by: Holger Hans Peter Freyther <holger@freyther.de>
zecke added a commit to zecke/alertmanager that referenced this issue May 18, 2024
zecke added a commit to zecke/alertmanager that referenced this issue May 26, 2024