
Alerting: Fix alert flapping in the internal alertmanager #38648

Merged · 4 commits merged into main on Sep 2, 2021
Conversation

@gotjosh (Contributor) commented Aug 27, 2021

Fixes a bug that caused alerts evaluated at low intervals (sub 1 minute) to flap in the Alertmanager, mostly due to a combination of `EndsAt` and the resend delay.

The Alertmanager uses `EndsAt` as a heuristic to know when it should resolve a firing alert, in the case that it hasn't heard back from the alert generation system.

Because Grafana sent the alert with an `EndsAt` equal to the `For` of the alert itself, and we had a hard-coded 1-minute resend delay (only applicable to firing alerts), a firing alert would resolve in the Alertmanager before we re-notified that it is still firing.

This commit increases `EndsAt` to 3x the `For` duration in the case where `For` > interval (e.g. `For` 30s evaluated every 10s), but also lowers the resend delay - 1 minute seemed too high when we support any `For`.

Fixes #38419 and fixes #37461

@gotjosh gotjosh requested a review from a team as a code owner August 27, 2021 15:35
@@ -33,7 +33,7 @@ func NewManager(logger log.Logger, metrics *metrics.Metrics, ruleStore store.Rul
 	manager := &Manager{
 		cache:       newCache(logger, metrics),
 		quit:        make(chan struct{}),
-		ResendDelay: 1 * time.Minute, // TODO: make this configurable
+		ResendDelay: 10 * time.Second, // TODO: make this configurable
@gotjosh (Contributor, Author) commented:

I have added a task to https://github.com/grafana/alerting-squad/issues/59 to make sure we make this configurable via the admin options so that users can change this at runtime.

@gerobinson (Contributor) commented:

This change doesn't mean that notifications will be sent once per elapsed 10 seconds instead of once per minute right? This is just the interval at which we resend the alert to Alertmanager which will then itself choose whether to re-send the notification or not?

A contributor replied:

@gerobinson Correct, it is like a keep-alive msg from the state manager to alert manager (since in AM, if endsAt is passed, the alert instance is resolved).

func (a *State) setEndsAt(alertRule *ngModels.AlertRule, result eval.Result) {
	if int64(alertRule.For.Seconds()) > alertRule.IntervalSeconds {
		// For is set and longer than IntervalSeconds
-		a.EndsAt = result.EvaluatedAt.Add(alertRule.For)
+		a.EndsAt = result.EvaluatedAt.Add(alertRule.For * 3)
@gotjosh (Contributor, Author) commented:

I think 3x the evaluation interval seems like a good enough trade-off between flapping and quickly resolving the alert once it has cleared, but please do let me know if you have any other opinions.

@kylebrandt (Contributor) commented:

I'm wondering if this should be >=, since if For = Interval it only gets 2x below.

@kylebrandt (Contributor) left a review comment:

One comment on when For == eval interval, but LGTM.

I was able to repro the firing/resolve flapping with the reported "bad run/for timings" of 10s/1m and 10s/30s on main (as long as I set the 1s wait timings on the policy), and it stops with this branch.

@gotjosh (Contributor, Author) commented Sep 2, 2021

@kylebrandt I have opted for changing how this works. I believe that taking into account the resend delay or the interval of the alert is what really matters when determining when it should be considered resolved in the Alertmanager.

Therefore, I have simplified the calculation. In the end, the outcome should be the same and it should fix any flapping we still have.

I have one last question regarding this area: I don't see how we would always send a resolved alert (an alert going from alerting -> normal with an `EndsAt` of now) irrespective of whether we've notified recently for this alert, because `NeedsSending` does not take this into account. There might be some interplay with the dates that ensures it always matches the last conditional - I need to investigate.

func (a *State) NeedsSending(resendDelay time.Duration) bool {
	if a.State != eval.Alerting && a.State != eval.Normal {
		return false
	}

	if a.State == eval.Normal && !a.Resolved {
		return false
	}
	// if LastSentAt is before or equal to LastEvaluationTime + resendDelay, send again
	return a.LastSentAt.Add(resendDelay).Before(a.LastEvaluationTime) ||
		a.LastSentAt.Add(resendDelay).Equal(a.LastEvaluationTime)
}

@kylebrandt (Contributor) commented:

Retried the two "bad timings" I referenced in my previous comment on the updated branch; things are still behaving well with the updates 🎉

@gotjosh (Contributor, Author) commented Sep 2, 2021

To answer my own last comment: this works as I expected - no further concerns from my side. The dates match, and in the worst case we'll get a delay of `resendDelay` on sending the resolve notification, which is not that bad.

@gotjosh gotjosh merged commit dd502f2 into main Sep 2, 2021
@gotjosh gotjosh deleted the alert-flap branch September 2, 2021 15:23
@gotjosh gotjosh added the old backport v8.1.x Mark PR for automatic backport to v8.1.x label Sep 2, 2021
grafanabot pushed a commit that referenced this pull request Sep 2, 2021
* Alerting: Fix alert flapping in the alertmanager

Fixes a bug that caused alerts evaluated at low intervals (sub 1 minute) to flap in the Alertmanager, mostly due to a combination of `EndsAt` and the resend delay.

The Alertmanager uses `EndsAt` as a heuristic to know when it should resolve a firing alert, in the case that it hasn't heard back from the alert generation system.

Because Grafana sent the alert with an `EndsAt` equal to the `For` of the alert itself, and we had a hard-coded 1-minute resend delay (only applicable to firing alerts), a firing alert would resolve in the Alertmanager before we re-notified that it is still firing.

This commit increases `EndsAt` by 3x the resend delay or the alert interval (whichever is higher). The resend delay has been decreased to 30 seconds.

(cherry picked from commit dd502f2)
gotjosh added a commit that referenced this pull request Sep 2, 2021
…38829)

* Alerting: Fix alert flapping in the alertmanager


(cherry picked from commit dd502f2)

Co-authored-by: gotjosh <josue@grafana.com>
stevepostill pushed a commit to Reveal-International/grafana that referenced this pull request Nov 3, 2021
…-github to revdev

* commit '0d046c46b480ffc5fd602142a2b0e5dfc765b620': (63 commits)
  "Release: Updated versions in package to 8.1.3" (grafana#38965)
  Change grabpl version to 2.3.4 (grafana#38967)
  PieChart: Display "No data" when there is no data (grafana#38808) (grafana#38960)
  track signature files + add warn log (grafana#38938) (grafana#38958)
  Docs: Clarify delta value (grafana#38824) (grafana#38916)
  Postgres/MySQL/MSSQL: Fix region annotations not displayed correctly (grafana#38936) (grafana#38953)
  uPlot: Fix default value for plot legend visibility (grafana#36660) (grafana#38930)
  Prometheus: Fix validate selector in metrics browser (grafana#38921) (grafana#38926)
  Dashboard: Forces panel re-render when exiting panel edit (grafana#38913) (grafana#38919)
  [v8.1.x] Dashboard: Fix UIDs are not preserved when importing/creating dashboards thru importing .json file (grafana#38892)
  OAuth: add docs for disableAutoLogin param (grafana#38752) (grafana#38896)
  LibraryPanels: Prevents duplicate repeated panels from being created (grafana#38804) (grafana#38863)
  Adding missing information, more than just the manual backport. (grafana#38837)
  Elasticsearch: Prevent pipeline aggregations to show up in terms order by options (grafana#38448) (grafana#38830)
  Build: Upgrade grabpl to 2.4.2 (grafana#38820) (grafana#38828)
  Alerting: Fix alert flapping in the internal alertmanager (grafana#38648) (grafana#38829)
  [v8.1.x] Chore: Update to alpine:3.14.2 (grafana#38821)
  Update Dockerfile (grafana#38785) (grafana#38815)
  Deprecate browser access mode for the Graphite data source. (grafana#38783) (grafana#38809)
  Live: prepend orgId when publishing from HTTP (grafana#38775) (grafana#38793)
  ...
Labels
add to changelog area/backend old backport v8.1.x Mark PR for automatic backport to v8.1.x
Projects
None yet
4 participants