
Alerting: Fix alert flapping in the internal alertmanager #38648

Merged · 4 commits merged into main on Sep 2, 2021
Conversation

@gotjosh (Contributor) commented Aug 27, 2021

Fixes a bug that caused alerts evaluated at low intervals (sub 1 minute) to flap in the Alertmanager, mostly due to a combination of `EndsAt` and the resend delay.

The Alertmanager uses `EndsAt` as a heuristic to know when it should resolve a firing alert, in the case that it hasn't heard back from the alert generation system.

Because Grafana sent the alert with an `EndsAt` equal to the `For` of the alert itself, and we had a hard-coded 1-minute resend delay (only applicable to firing alerts), a firing alert would resolve in the Alertmanager before we re-notified that it is still firing.

This commit increases `EndsAt` to 3x the `For` duration in the case where `For` > interval (e.g. `For` 30s evaluated every 10s), but also lowers the resend delay - 1 minute seemed too high when we support any `For`.

Fixes #38419 and fixes #37461

@gotjosh gotjosh requested a review from a team as a code owner August 27, 2021 15:35
@@ -33,7 +33,7 @@ func NewManager(logger log.Logger, metrics *metrics.Metrics, ruleStore store.Rul
 	manager := &Manager{
 		cache:       newCache(logger, metrics),
 		quit:        make(chan struct{}),
-		ResendDelay: 1 * time.Minute, // TODO: make this configurable
+		ResendDelay: 10 * time.Second, // TODO: make this configurable
@gotjosh (Contributor, Author) commented:

I have added a task to https://github.com/grafana/alerting-squad/issues/59 to make sure we make this configurable via the admin options so that users can change this at runtime.

@gerobinson (Contributor) commented:

This change doesn't mean that notifications will be sent once per elapsed 10 seconds instead of once per minute right? This is just the interval at which we resend the alert to Alertmanager which will then itself choose whether to re-send the notification or not?

A contributor replied:

@gerobinson Correct, it is like a keep-alive msg from the state manager to alert manager (since in AM, if endsAt is passed, the alert instance is resolved).

func (a *State) setEndsAt(alertRule *ngModels.AlertRule, result eval.Result) {
	if int64(alertRule.For.Seconds()) > alertRule.IntervalSeconds {
		// For is set and longer than IntervalSeconds
-		a.EndsAt = result.EvaluatedAt.Add(alertRule.For)
+		a.EndsAt = result.EvaluatedAt.Add(alertRule.For * 3)
@gotjosh (Contributor, Author) commented:

I think 3x the evaluation interval seems like a good enough trade-off between flapping and quickly resolving the alert once it has cleared, but please do let me know if you have any other opinions.

@kylebrandt (Contributor) commented:

I'm wondering if this should be >=, since if For = Interval it only gets 2x below.

@kylebrandt (Contributor) left a review comment:

One comment on when For == eval interval, but LGTM.

I was able to repro the firing/resolve flapping with the reported "bad run/for timings" of 10s/1m and 10s/30s on main (as long as I set the 1s wait timings on the policy), and it stops with this branch.

@gotjosh (Contributor, Author) commented Sep 2, 2021

@kylebrandt I have opted for changing how this works. I believe that taking into account the resend delay or the interval of the alert is what really matters when determining when it should be considered resolved in the Alertmanager.

Therefore, I have simplified the calculation. In the end, the outcome should be the same and it should fix any flapping we still have.

I have one last question regarding this area: I don't see how we would always send a resolved alert (an alert going from alerting -> normal with an `EndsAt` of now) irrespective of whether we've notified recently for this alert, because `NeedsSending` does not take this into account. There might be some interplay with the dates that ensures it always matches the last conditional - I need to investigate.

func (a *State) NeedsSending(resendDelay time.Duration) bool {
	if a.State != eval.Alerting && a.State != eval.Normal {
		return false
	}

	if a.State == eval.Normal && !a.Resolved {
		return false
	}
	// if LastSentAt is before or equal to LastEvaluationTime + resendDelay, send again
	return a.LastSentAt.Add(resendDelay).Before(a.LastEvaluationTime) ||
		a.LastSentAt.Add(resendDelay).Equal(a.LastEvaluationTime)
}

@kylebrandt (Contributor) commented:

Retried the two "bad timings" I referenced in my previous comment on the updated branch; things are still behaving well with the updates 🎉

@gotjosh (Contributor, Author) commented Sep 2, 2021

To answer my own last comment: this works as I expected - no further concerns from my side. The dates match, and in the worst case we'll get a delay of `resendDelay` on sending the resolve notification, which is not that bad.

@gotjosh gotjosh merged commit dd502f2 into main Sep 2, 2021
@gotjosh gotjosh deleted the alert-flap branch September 2, 2021 15:23
@gotjosh gotjosh added the old backport v8.1.x Mark PR for automatic backport to v8.1.x label Sep 2, 2021
grafanabot pushed a commit that referenced this pull request Sep 2, 2021
* Alerting: Fix alert flapping in the alertmanager

Fixes a bug that caused alerts evaluated at low intervals (sub 1 minute) to flap in the Alertmanager, mostly due to a combination of `EndsAt` and the resend delay.

The Alertmanager uses `EndsAt` as a heuristic to know when it should resolve a firing alert, in the case that it hasn't heard back from the alert generation system.

Because Grafana sent the alert with an `EndsAt` equal to the `For` of the alert itself, and we had a hard-coded 1-minute resend delay (only applicable to firing alerts), a firing alert would resolve in the Alertmanager before we re-notified that it is still firing.

This commit increases `EndsAt` by 3x the resend delay or the alert interval (whichever is higher). The resend delay has been decreased to 30 seconds.

(cherry picked from commit dd502f2)
gotjosh added a commit that referenced this pull request Sep 2, 2021
…38829)

* Alerting: Fix alert flapping in the alertmanager


(cherry picked from commit dd502f2)

Co-authored-by: gotjosh <josue@grafana.com>
stevepostill pushed a commit to Reveal-International/grafana that referenced this pull request Nov 3, 2021
…-github to revdev

* commit '0d046c46b480ffc5fd602142a2b0e5dfc765b620': (63 commits)
  "Release: Updated versions in package to 8.1.3" (grafana#38965)
  Change grabpl version to 2.3.4 (grafana#38967)
  PieChart: Display "No data" when there is no data (grafana#38808) (grafana#38960)
  track signature files + add warn log (grafana#38938) (grafana#38958)
  Docs: Clarify delta value (grafana#38824) (grafana#38916)
  Postgres/MySQL/MSSQL: Fix region annotations not displayed correctly (grafana#38936) (grafana#38953)
  uPlot: Fix default value for plot legend visibility (grafana#36660) (grafana#38930)
  Prometheus: Fix validate selector in metrics browser (grafana#38921) (grafana#38926)
  Dashboard: Forces panel re-render when exiting panel edit (grafana#38913) (grafana#38919)
  [v8.1.x] Dashboard: Fix UIDs are not preserved when importing/creating dashboards thru importing .json file (grafana#38892)
  OAuth: add docs for disableAutoLogin param (grafana#38752) (grafana#38896)
  LibraryPanels: Prevents duplicate repeated panels from being created (grafana#38804) (grafana#38863)
  Adding missing information, more than just the manual backport. (grafana#38837)
  Elasticsearch: Prevent pipeline aggregations to show up in terms order by options (grafana#38448) (grafana#38830)
  Build: Upgrade grabpl to 2.4.2 (grafana#38820) (grafana#38828)
  Alerting: Fix alert flapping in the internal alertmanager (grafana#38648) (grafana#38829)
  [v8.1.x] Chore: Update to alpine:3.14.2 (grafana#38821)
  Update Dockerfile (grafana#38785) (grafana#38815)
  Deprecate browser access mode for the Graphite data source. (grafana#38783) (grafana#38809)
  Live: prepend orgId when publishing from HTTP (grafana#38775) (grafana#38793)
  ...
Labels
add to changelog area/backend old backport v8.1.x Mark PR for automatic backport to v8.1.x
Projects
None yet
4 participants