Alerting: Update state manager to change all current states in the case when Error\NoData is executed as Ok\Nomal #68142

yuri-tceretian · 2023-05-09T19:37:47Z

What is this feature?
This PR changes how the alerting state manager handles executions of NoData\Error as OK\Alerting. In those cases, the state manager updates all current states according to the execution settings.
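
As a rough illustration of the idea only (not the code in this PR), the sketch below shows what "update all current states" could look like. The `State` type, the `applyToAll` helper, and the label-fingerprint keys are all hypothetical names invented for this example; the real logic lives in pkg/services/ngalert/state/manager.go.

```go
package main

import "fmt"

// State is a simplified stand-in for an alert instance state.
// It is an assumption for this sketch, not a type from the ngalert package.
type State string

const (
	Normal   State = "Normal"
	Alerting State = "Alerting"
)

// applyToAll returns a copy of current in which every known alert instance
// (keyed by a hypothetical label fingerprint) is set to mapped.
// It models "NoData\Error executed as Alerting\OK": rather than creating one
// new label-less state, every existing state of the rule is updated.
func applyToAll(current map[string]State, mapped State) map[string]State {
	next := make(map[string]State, len(current))
	for fingerprint := range current {
		next[fingerprint] = mapped
	}
	return next
}

func main() {
	current := map[string]State{
		`{instance="a"}`: Normal,
		`{instance="b"}`: Alerting,
	}
	// Assume the evaluation returned NoData and the rule executes NoData as Alerting.
	next := applyToAll(current, Alerting)
	for fingerprint, s := range next {
		fmt.Println(fingerprint, s)
	}
}
```

The sketch deliberately ignores details such as the For interval, annotations, and what happens when the rule has no states yet.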

Old Behavior

firefox_ogGiPhA9F7.1.mp4

New Behavior

firefox_tcWWtRVV7z-output.mp4

The feature is behind the feature flag alertingNoDataErrorExecution.

Why do we need this feature?
To fix execution of NoData\Error when it is set to Alerting\OK

Who is this feature for?
Alerting users who would like to treat exceptional states (NoData\Error) by maintaining the current state.

Which issue(s) does this PR fix?:

Fixes #66790

Special notes for your reviewer:

~~Please review PR by commit. I copied test cases introduced in #73019 and fixed to pass when the feature flag is enabled. See commit 0ec9a02 to understand the difference.~~
After Matt's suggestion, I decided to amend the existing tests instead of copying the whole suite. So, I updated the tests to run each test case twice: once with the flag enabled and once with it disabled. This shows that normal transitions are not affected, nor is the execution of NoData\Error as NoData\Error. The test cases where the behavior differs are overridden by a new set of assertions.
Tests that have overridden assertions are marked by [*]
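
One possible shape for such overridden assertions, shown only as a sketch: each case carries a baseline expectation and an optional override that applies when the flag is enabled. The `testCase` struct, its fields, and `expectedFor` are hypothetical names for this example and are not taken from the actual test suite.

```go
package main

import "fmt"

// testCase is a hypothetical table entry: expected holds the baseline
// assertion, expectedWithFlag an optional override used only when the
// feature flag is enabled (empty means "same as baseline").
type testCase struct {
	desc             string
	expected         string
	expectedWithFlag string
}

// expectedFor picks the assertion that applies to a given run.
func expectedFor(tc testCase, flagEnabled bool) string {
	if flagEnabled && tc.expectedWithFlag != "" {
		return tc.expectedWithFlag
	}
	return tc.expected
}

func main() {
	cases := []testCase{
		{desc: "normal transition", expected: "baseline"},
		{desc: "[*] exceptional result", expected: "baseline", expectedWithFlag: "override"},
	}
	// Every case runs twice: once with the flag disabled, once enabled.
	for _, flag := range []bool{false, true} {
		for _, tc := range cases {
			fmt.Printf("flag=%v %q expects %s\n", flag, tc.desc, expectedFor(tc, flag))
		}
	}
}
```

Cases without an override assert the same result in both runs, which is what demonstrates that unaffected transitions behave identically with and without the flag.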

Example

Please check that:

It works as expected from a user's perspective.
If this is a pre-GA feature, it is behind a feature toggle.
The docs are updated, and if this is a notable improvement, it's added to our What's New doc.

grafanabot · 2023-06-10T02:01:36Z

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 2 weeks if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

JacobsonMT · 2023-08-09T17:16:34Z

Instead of copying the tests, what do you think about something like this:

func TestProcessEvalResults_StateTransitions(t *testing.T) {
	t.Run("Without applyNoDataAndErrorToAllStates", func(t *testing.T) {
		stateTransitions(t, false)
	})
	t.Run("With applyNoDataAndErrorToAllStates", func(t *testing.T) {
		stateTransitions(t, true)
	})
}

func stateTransitions(t *testing.T, applyNoDataAndErrorToAllStates bool) {
...
}

Then making the differences explicit based on the value of applyNoDataAndErrorToAllStates.

I think this will make the PR much simpler to read and the tests should be less at risk of drift between feature flag true and false.

yuri-tceretian · 2023-08-09T20:04:09Z

pkg/services/ngalert/state/manager.go

@@ -288,7 +325,7 @@ func (st *Manager) setNextState(ctx context.Context, alertRule *ngModels.AlertRu
 	// Usually, it happens in the case of classic conditions when the evalResult does not have labels.
 	//
 	// This is temporary change to make sure that the labels are not persistent in the state after it was in Error state
-	// TODO yuri. Remove it in https://github.com/grafana/grafana/pull/68142
+	// TODO yuri. Remove it when correct Error result with labels is provided


Can't remove it in this PR because it breaks execution of Error as Error.

JacobsonMT

LGTM, great job 🚀

JacobsonMT · 2023-08-09T22:46:20Z

pkg/services/ngalert/state/manager.go

-// Set the current state based on evaluation results
-func (st *Manager) setNextState(ctx context.Context, alertRule *ngModels.AlertRule, result eval.Result, extraLabels data.Labels, logger log.Logger) StateTransition {
-	currentState := st.cache.getOrCreate(ctx, logger, alertRule, result, extraLabels, st.externalURL)
+func (st *Manager) setNextStateForRule(ctx context.Context, alertRule *ngModels.AlertRule, results eval.Results, extraLabels data.Labels, logger log.Logger) []StateTransition {


NIT: Add a method doc for setNextStateForRule & setNextStateForAll

grobinson-grafana

I see this change is behind a feature flag, which is great! However, I'm requesting changes not because I want actual changes to be done to the code, but instead I want to ask a question before we add this to main.

tl;dr the question:

I understand the sentiment of this change is to make all states for a rule consistent when either No Data or an error occurs. What concerns me is that for some of our larger customers this can create massive alert storms. If their datasource is down, it could be 10,000s of alerts firing at once, perhaps 100,000s. If their grouping is generous, a number of contact points will also fail (such as email) as the email will be too large. I'm not sure about others such as Slack or Pagerduty.

I'm not sure this is the right feature to add moving forward, but I can support having it behind a feature flag for experimentation.

@yuri-tceretian and @JacobsonMT, could you share your opinions on what I've said above?

yuri-tceretian · 2023-08-11T14:25:47Z

My thoughts about what George asked:
For large customers that have many rules with many dimensions and do not want to be hammered by all dimensions firing in the case of Error\NoData, there is the default option to execute them as Error\NoData, which was introduced exactly for this use case.

The bigger problem, in my opinion, is the side effect of the current behavior: after two evaluations that result in Error\NoData, all existing alert instances are considered stale and get resolved, which can also cause an avalanche of notifications. Therefore, in my opinion, the execution of Error\NoData should maintain the current states (the same way the legacy KeepLastState did), or at least we should let the user pick this option. This is out of the scope of this PR, though.

Also, there is an alternative: use the pending state. If this PR is merged, the user can set the For interval and execution of NoData as Alerting, and the current dimensions will be armed in the case of NoData or Error. Also, it will help maintain the Pending state of an instance that was armed during a regular evaluation.

The purpose of this PR is two-fold:

1. To solve inconsistencies in Error and NoData handling between "classic condition" and "multi-dimensional" rules.
In legacy alerting, and in unified alerting before labels were introduced for NoData (a change that broke the legacy behavior, although execution of Error for a single dimension still works as expected), executing exceptional results (NoData\Error) as Alerting\OK affected the existing state instead of creating a new one. That was reflected in the documentation as well as in the tests (see example). In the tests, however, the input parameters were set up incorrectly while the assertions reflected the desired result. In reality, alerting just creates a new state, because exceptional results, which do not carry the labels returned during a normal evaluation, are handled like normal results.
2. To diverge from the default executions. Executing Error\NoData as they are creates a separate state, the same as executing them as Alerting or OK, with a few differences: in the former case, alerts have special names, labels, and annotations, whereas in the latter case, the alert name is the rule's name and the alert carries only the rule's labels (no templated annotations, no dimension labels). Therefore, the execution of results as Alerting\OK does not make any sense, because it is a worse, less informative alternative to the default execution.

I do not agree that this should be an experimental feature. I added the feature flag so we could evaluate that the fix works properly because the state management is the most unclear part of alerting, and test coverage was not great. However, after adding more tests in #73019 I am much more confident that it works as it should.

grobinson-grafana · 2023-08-14T21:55:40Z

I tested a number of scenarios and they seem to work as expected!

grafanabot added the area/backend label May 9, 2023

yuri-tceretian self-assigned this May 9, 2023

yuri-tceretian added the area/alerting Grafana Alerting label May 9, 2023

armandgrillet requested review from JacobsonMT and grobinson-grafana May 10, 2023 07:30

grafanabot added and then removed the stale label Jun 10, 2023

yuri-tceretian force-pushed the yuri-tceretian/fix-exec-state-mapping branch from f42e28a to abc530a Compare July 10, 2023 20:44

yuri-tceretian mentioned this pull request Jul 24, 2023

Alerting: Fix state manager to not keep datasource_uid and ref_id labels in state after Error #72216

Merged

grafana-delivery-bot bot mentioned this pull request Jul 26, 2023

[v10.0.x] Alerting: Fix state manager to not keep datasource_uid and ref_id labels in state after Error #72393

Merged

This was referenced Aug 3, 2023

Alerting: Refactor of state manager tests #72849

Merged

Alerting: Fix NoData and Error test cases for state manager #72964

Closed

yuri-tceretian force-pushed the yuri-tceretian/fix-exec-state-mapping branch from abc530a to 4fa1499 Compare August 9, 2023 16:25

yuri-tceretian added 5 commits August 9, 2023 12:26

update state manager to map all states in the case when Error\NoData executes as Ok\Alerting

5a27771

add setting

016d6f3

copy tests

68b9ad2

update tests

0ec9a02

Feature flag

a1bb009

grafana-pr-automation bot added the area/frontend label Aug 9, 2023

yuri-tceretian force-pushed the yuri-tceretian/fix-exec-state-mapping branch from 4fa1499 to a1bb009 Compare August 9, 2023 16:26

yuri-tceretian added this to the 10.2.x milestone Aug 9, 2023

yuri-tceretian marked this pull request as ready for review August 9, 2023 16:32

yuri-tceretian requested review from grafanabot and a team as code owners August 9, 2023 16:32

yuri-tceretian requested review from PoorlyDefinedBehaviour and IbrahimCSAE and removed request for a team August 9, 2023 16:32

yuri-tceretian requested a review from rwwiv August 9, 2023 16:32

yuri-tceretian added the add to changelog and no-backport labels Aug 9, 2023

yuri-tceretian changed the title ~~Alerting: update state manager to change all current states in the case when Error\NoData is executed as Ok\Nomal~~ Alerting: Update state manager to change all current states in the case when Error\NoData is executed as Ok\Nomal Aug 9, 2023

yuri-tceretian added 2 commits August 9, 2023 15:33

amend tests instead of replace

dedbdb8

add [*] to test cases that have overridden assertions

e05f9dc

yuri-tceretian commented Aug 9, 2023

JacobsonMT approved these changes Aug 9, 2023

github-actions bot added the levitate breaking change label Aug 10, 2023

grobinson-grafana requested changes Aug 10, 2023

grobinson-grafana approved these changes Aug 14, 2023

Merge branch 'up/main' into yuri-tceretian/fix-exec-state-mapping

974eb28

github-actions bot removed the levitate breaking change label Aug 15, 2023

yuri-tceretian merged commit 0717ec1 into main Aug 15, 2023
18 checks passed

yuri-tceretian deleted the yuri-tceretian/fix-exec-state-mapping branch August 15, 2023 14:27

aishyandapalli pushed a commit to aishyandapalli/grafana that referenced this pull request Aug 16, 2023

Alerting: Update state manager to change all current states in the case when Error\NoData is executed as Ok\Nomal (grafana#68142)

c833092

chauchausoup pushed a commit to chauchausoup/grafana that referenced this pull request Sep 15, 2023

Alerting: Update state manager to change all current states in the case when Error\NoData is executed as Ok\Nomal (grafana#68142)

5617e87

zerok modified the milestones: 10.2.x, 10.2.0 Oct 23, 2023

dnhn mentioned this pull request Oct 24, 2023

grafana 10.2.0 Homebrew/homebrew-core#152264

Closed

BrewTestBot mentioned this pull request Oct 25, 2023

grafana 10.2.0 Homebrew/homebrew-core#152321

Closed

yuri-tceretian mentioned this pull request Oct 26, 2023

Alerting: Enable feature flag alertingNoDataErrorExecution by default #77242

Merged

This was referenced Oct 30, 2023

[v9.5.x] Alerting: Fix state manager to not keep datasource_uid and ref_id labels in state after Error #77391

Merged

[v9.4.x] Alerting: Fix state manager to not keep datasource_uid and ref_id labels in state after Error #77392

Merged
