emails are resent on config reload if group contents has changed #2492

Open

asuffield opened this issue Feb 25, 2021 · 1 comment
asuffield commented Feb 25, 2021

What did you do?

Write a route with a group match, of the form:

```yaml
group_by:
- alertname
group_wait: 10s
group_interval: 1h
repeat_interval: 4h
```

Perform these operations, in order:

```shell
amtool alert add alertname=test tag=1
sleep 11
# observe the notification being sent
amtool alert add alertname=test tag=2
killall -HUP alertmanager
```
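For reference, a complete minimal configuration wrapping that route might look like the following (the receiver name, addresses, and SMTP host are placeholders, not taken from the report):

```yaml
route:
  receiver: team-email
  group_by:
  - alertname
  group_wait: 10s
  group_interval: 1h
  repeat_interval: 4h

receivers:
- name: team-email
  email_configs:
  - to: oncall@example.com
    from: alertmanager@example.com
    smarthost: smtp.example.com:25
```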

What did you expect to see?

One hour later, notifications should be resent, when group_interval expires.

What did you see instead? Under which circumstances?

Notifications for tag=1 and tag=2 are resent immediately when SIGHUP is received.

This is notably specific to the email receiver, because the other ones can deduplicate on group_key, which hides the problem.

Environment

  • Alertmanager version: 0.21

So, here's what's going on:

Grouping is implemented in the dispatcher, while deduplication is implemented in the notification pipeline. The notification pipeline is stateful, via nflog. The dispatcher only keeps its state in memory.

Config reloading is implemented by completely stopping the dispatcher and constructing a new one from config. This also clears all aggregation group state.

When the config is reloaded, the new dispatcher recreates all the aggregation groups from the in-memory alert state. This results in dispatching all the events into the notification pipeline again.

If the set of alerts is unchanged since the last time this occurred, the DedupStage in the notification pipeline will suppress them based on the contents of nflog, respecting repeat_interval as expected. However, if the set of alerts in the group has changed at all, DedupStage cannot recognise the notification as a duplicate, and it is sent on to the receiver.

If the receiver is anything other than email, the group key and timestamps will not have changed so the duplicate notifications can be suppressed. The email receiver is unable to do this (because it's email...) so duplicate emails are sent.

Half of this story is explaining why we don't see this happen all the time: the grouping functionality doesn't really work across config reloads, and probably also behaves badly across cluster gossip, but in most configurations the other features are effective at hiding the problem.

I'm thinking that the correct solution here would be for aggregation grouping to be implemented in the notification pipeline based on nflog. However, a crude and immediate solution is likely achievable by having the dispatcher query nflog when creating an aggregation group, and adjusting its timings if this group key was successfully sent in the last group_interval; this would solve the config reload case, but I'm not sure it's the right way to go in the presence of cluster gossip.

Opinions?

zecke (Contributor) commented Sep 14, 2022

What about the case where group_wait is larger than the time between config reloads? Moving aggregation down into the notification pipeline seems feasible, but so does propagating the state from the old dispatcher to the new one.

zecke added a commit to zecke/alertmanager that referenced this issue May 5, 2024
Add an acceptance test that triggers a config reload and verifies
that no early notification is occurring.

One of the challenges is which time to use to check for a previous
notification. The nflog roughly captures the time at which all
notifications were sent. That conflicts with the ag.next timer, which
gets reset before the ag is flushed. Delays and retries can make these
two points in time differ enough for the integration tests to fail.

I considered the following ways to fix it:

  1.) Modify the nflog.proto to capture the flush time in addition
      to the successful notification time.
  2.) In addition to hasFlushed capture the last flush time and pass
      it to the context like we do for Now.
  3.) Set hasFlushed and reset the timer after the flush has been
      done.

I started with prometheus#3 as it seemed to have the fewest downsides
with things like drift.

needsUpdate is based on:
prometheus#3074 (comment)

Signed-off-by: Holger Hans Peter Freyther <holger@freyther.de>
zecke added a commit to zecke/alertmanager that referenced this issue May 10, 2024
Add an acceptance test that triggers a config reload and verifies
that no early notification is occurring.

One of the challenges is which time to use to check for a previous
notification. The nflog roughly captures the time at which all
notifications were sent. That conflicts with the ag.next timer, which
gets reset before the ag is flushed. Delays and retries can make these
two points in time differ enough for the integration tests to fail.

I considered the following ways to fix it:

  1.) Modify the nflog.proto to capture the flush time in addition
      to the successful notification time.
  2.) In addition to hasFlushed capture the last flush time and pass
      it to the context like we do for Now.
  3.) Set hasFlushed and reset the timer after the flush has been
      done.

I started with prometheus#3 as it seemed to have the fewest downsides
with things like drift. Based on comments this is now prometheus#1.

needsUpdate is based on:
prometheus#3074 (comment)

Signed-off-by: Holger Hans Peter Freyther <holger@freyther.de>
zecke added a commit to zecke/alertmanager that referenced this issue May 18, 2024
zecke added a commit to zecke/alertmanager that referenced this issue May 26, 2024