Fix config reloads not respecting group interval #3074

Open · wants to merge 3 commits into base: main
Conversation

@andrejbranch commented Sep 14, 2022

We have been running this modification at Shopify for the last few weeks and it has solved two issues for us.

  1. Config reload causes notifications of new alerts within a group (doesn't respect group_interval)
  2. When all alerts in a group are resolved, resolved notifications have to wait till group interval is complete

Related issues that this addresses:

This change does the following:

  • Changes an alert group's ticker interval to group_wait (there's a TODO here); as a result, notification evaluation happens every 30s (the default group_wait) for each alert group.
  • Moves the group_interval logic to the dedupe stage, utilizing the persistent notification log entries. If the last notification entry for an alert group was a firing event and it was sent within the group_interval window, we suppress the notification.

TODO:
This change further adds to the confusion between group_wait and group_interval. Three separate configs seem more appropriate: group_wait (how long to wait before starting the evaluation background job for an alert group), group_interval (how frequently an alert group is evaluated for dispatch), and group_repeat_interval (how long to wait before sending a notification about new alerts that are added to an existing group of alerts). This would allow finer tuning of alert dispatching.
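As a sketch, the proposal could look like the following route configuration. This is hypothetical: group_wait and group_interval are real Alertmanager options, but group_repeat_interval does not exist, and its semantics here are only those described above.

```yaml
route:
  receiver: team-pager
  group_by: ['alertname', 'cluster']
  group_wait: 30s            # delay before the group's evaluation job starts
  group_interval: 1m         # how often the group is evaluated for dispatch
  group_repeat_interval: 5m  # hypothetical: delay before notifying about
                             # new alerts added to an already-notified group
```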

@andrejbranch force-pushed the group-interval-fix branch 2 times, most recently from dc9abf9 to c2fc518 (September 14, 2022 17:11)
notify/notify.go Outdated
// If we haven't notified about the alert group before, notify right away
// unless we only have resolved alerts.
if entry == nil {
return len(firing) > 0
}

groupIntervalMuted := len(entry.FiringAlerts) > 0 && entry.Timestamp.After(time.Now().Add(-groupInterval))

if !entry.IsFiringSubset(firing) && !groupIntervalMuted {
Contributor

Any chance we want to use keyNow from the context here?

@simonpasquier (Member)

> When all alerts in a group are resolved, resolved notifications have to wait till group interval is complete

This is somewhat a feature, because sending resolved notifications immediately might exacerbate flapping notifications.

@andrejbranch (Author)

> When all alerts in a group are resolved, resolved notifications have to wait till group interval is complete
>
> This is somewhat a feature, because sending resolved notifications immediately might exacerbate flapping notifications.

Agreed, but would it be better to make that the default behavior while allowing it to be configured?

Signed-off-by: Andre Branchizio <andre.branchizio@shopify.com>
@yuri-tceretian (Contributor) commented Apr 14, 2023

In #3283 I try to solve another problem in a similar way. There are some differences, though:

  • My PR drops the positive result of needsUpdate if the tick expires, no matter whether the alerts are resolved or firing.
  • I do not use group_interval and a magic constant 0.99. Why did you choose 0.99 and not 1.0 or 0.95?

@andrejbranch (Author) commented Apr 14, 2023

I have modified this quite a bit in Shopify's Alertmanager fork. I do not use the 0.99 constant anymore, and I also had to solve alert flapping with a hard-coded check for whether the entry is recent. The flappy behavior was amplified by how we run Alertmanager, because we change https://github.com/prometheus/alertmanager/blob/main/dispatch/dispatch.go#L452 to ag.next.Reset(ag.opts.GroupWait) so that we can send resolved alerts before the group interval is up.

For what it's worth, here is needsUpdate in our forked version:

func (n *DedupStage) needsUpdate(entry *nflogpb.Entry, firing, resolved map[uint64]struct{}, repeat, groupInterval time.Duration) bool {
	// If we haven't notified about the alert group before, notify right away
	// unless we only have resolved alerts.
	if entry == nil {
		return len(firing) > 0
	}

	if recentEntry := entry.Timestamp.After(time.Now().Add(-30 * time.Second)); recentEntry {
		return false
	}

	groupIntervalMuted := len(entry.FiringAlerts) > 0 && entry.Timestamp.After(time.Now().Add(-groupInterval))

	if !entry.IsFiringSubset(firing) && !groupIntervalMuted {
		return true
	}

	// Notify about all alerts being resolved.
	// This is done irrespective of the send_resolved flag to make sure that
	// the firing alerts are cleared from the notification log.
	if len(firing) == 0 {
		// If the current alert group and last notification contain no firing
		// alert, it means that some alerts have been fired and resolved during the
		// last interval. In this case, there is no need to notify the receiver
		// since it doesn't know about them.
		return len(entry.FiringAlerts) > 0
	}

	if n.rs.SendResolved() && !entry.IsResolvedSubset(resolved) && !groupIntervalMuted {
		return true
	}

	// Nothing changed, only notify if the repeat interval has passed.
	return entry.Timestamp.Before(n.now().Add(-repeat))
}

@yuri-tceretian (Contributor)

> For what it's worth, here is needsUpdate in our forked version

Interesting, in your code you use time.Now(), which is the wall clock:

groupIntervalMuted := len(entry.FiringAlerts) > 0 && entry.Timestamp.After(time.Now().Add(-groupInterval))

but in this PR you use Now(ctx), which is the time when the flush happened. For your use case/PR this probably does not make a meaningful difference, because time.Now() and Now(ctx) are very close, but in the HA case (which my PR fixes) the difference between those two timestamps can be multiples of peer_timeout.

zecke added a commit to zecke/alertmanager that referenced this pull request May 5, 2024

Add an acceptance test that triggers a config reload and verifies that no early notification occurs.

One of the challenges is which time to use to check for a previous notification. The nflog captures the time all notifications were sent. That conflicts with the ag.next timer that gets reset before the ag is flushed. Delays and retries can make these two points in time differ enough for the integration tests to fail.

I considered the following ways to fix it:

  1.) Modify the nflog.proto to capture the flush time in addition
      to the successful notification time.
  2.) In addition to hasFlushed, capture the last flush time and pass
      it to the context like we do for Now.
  3.) Set hasFlushed and reset the timer after the flush has been
      done.

I started with #3 as it seemed to have the fewest downsides with things like drift.

needsUpdate is based on:
prometheus#3074 (comment)

Signed-off-by: Holger Hans Peter Freyther <holger@freyther.de>
@grobinson-grafana (Contributor)

Hi! 👋 I found this issue from another PR.

> Config reload causes notifications of new alerts within a group (doesn't respect group_interval)

Config reload will re-create all existing aggregation groups. This causes an immediate flush because of this code here:

// Immediately trigger a flush if the wait duration for this
// alert is already over.
ag.mtx.Lock()
defer ag.mtx.Unlock()
if !ag.hasFlushed && alert.StartsAt.Add(ag.opts.GroupWait).Before(time.Now()) {
	ag.next.Reset(0)
}

I think we should wait group_wait seconds. I'm not sure if we need to persist and wait the remainder of the previous group_interval?

> When all alerts in a group are resolved, resolved notifications have to wait till group interval is complete.

Like Simon said, I think this is intended behavior. If you have a group of size 1 (i.e. grouping is disabled) that is flapping between firing and resolved, won't this change make Alertmanager send a notification each time it flaps, rather than smooth it out?

zecke updated the commit referencing this pull request on May 10, May 18, and May 26, 2024.