Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alerting: Update state manager to change all current states in the case when Error\NoData is executed as Ok\Nomal #68142

Merged
merged 8 commits into from Aug 15, 2023

Conversation

yuri-tceretian
Copy link
Contributor

@yuri-tceretian yuri-tceretian commented May 9, 2023

What is this feature?
This PR changes how executions of NoData\Error as OK\Alerting are handled by alerting the state manager. In those cases, the state manager updates all current states according to the execution settings.

Old Behavior
firefox_ogGiPhA9F7.1.mp4
New Behavior
firefox_tcWWtRVV7z-output.mp4

The feature is behind the flag alertingNoDataErrorExecution

Why do we need this feature?
To fix execution of NoData\Error when it is set to Alerting\OK

Who is this feature for?
Alerting users who would like to treat exceptional states (NoData\Error) by maintaining the current state.

Which issue(s) does this PR fix?:

Fixes #66790

Special notes for your reviewer:

Please review PR by commit. I copied test cases introduced in #73019 and fixed to pass when the feature flag is enabled. See commit 0ec9a02 to understand the difference.
After Matt's suggestion, I decided to amend to the existing tests instead of copying the whole suite. So, I updated the tests to run each test case twice - when flag is enabled and disabled. This shows that normal transitions are not affected, as well as Execution of NoData\Error as NoData\Error. The test cases where the difference exists, are overridden by a new set of assertions.
Tests that have overridden assertions are marked by [*]

Example

image

Please check that:

  • It works as expected from a user's perspective.
  • If this is a pre-GA feature, it is behind a feature toggle.
  • The docs are updated, and if this is a notable improvement, it's added to our What's New doc.

@grafanabot
Copy link
Contributor

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 2 weeks if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@yuri-tceretian yuri-tceretian force-pushed the yuri-tceretian/fix-exec-state-mapping branch from 4fa1499 to a1bb009 Compare August 9, 2023 16:26
@yuri-tceretian yuri-tceretian added this to the 10.2.x milestone Aug 9, 2023
@yuri-tceretian yuri-tceretian marked this pull request as ready for review August 9, 2023 16:32
@yuri-tceretian yuri-tceretian requested review from grafanabot and a team as code owners August 9, 2023 16:32
@yuri-tceretian yuri-tceretian requested review from PoorlyDefinedBehaviour and IbrahimCSAE and removed request for a team August 9, 2023 16:32
@yuri-tceretian yuri-tceretian changed the title Alerting: update state manager to change all current states in the case when Error\NoData is executed as Ok\Nomal Alerting: Update state manager to change all current states in the case when Error\NoData is executed as Ok\Nomal Aug 9, 2023
@JacobsonMT
Copy link
Member

Instead of copying the tests, what do you think about something like this:

func TestProcessEvalResults_StateTransitions(t *testing.T) {
	t.Run("Without applyNoDataAndErrorToAllStates", func(t *testing.T) {
		stateTransitions(t, false)
	})
	t.Run("With applyNoDataAndErrorToAllStates", func(t *testing.T) {
		stateTransitions(t, true)
	})
}

func stateTransitions(t *testing.T, applyNoDataAndErrorToAllStates bool) {
...
}

Then making the differences explicit based on the value of applyNoDataAndErrorToAllStates.

I think this will make the PR much simpler to read and the tests should be less at risk of drift between feature flag true and false.

@@ -288,7 +325,7 @@ func (st *Manager) setNextState(ctx context.Context, alertRule *ngModels.AlertRu
// Usually, it happens in the case of classic conditions when the evalResult does not have labels.
//
// This is temporary change to make sure that the labels are not persistent in the state after it was in Error state
// TODO yuri. Remove it in https://github.com/grafana/grafana/pull/68142
// TODO yuri. Remove it when correct Error result with labels is provided
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't remove it in this PR because it breaks execution of Error as Error.

Copy link
Member

@JacobsonMT JacobsonMT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, great job 🚀

image

// Set the current state based on evaluation results
func (st *Manager) setNextState(ctx context.Context, alertRule *ngModels.AlertRule, result eval.Result, extraLabels data.Labels, logger log.Logger) StateTransition {
currentState := st.cache.getOrCreate(ctx, logger, alertRule, result, extraLabels, st.externalURL)
func (st *Manager) setNextStateForRule(ctx context.Context, alertRule *ngModels.AlertRule, results eval.Results, extraLabels data.Labels, logger log.Logger) []StateTransition {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Add a method doc for setNextStateForRule & setNextStateForAll

@github-actions github-actions bot added the levitate breaking change A label indicating a breaking change and assigned by Levitate. label Aug 10, 2023
Copy link
Contributor

@grobinson-grafana grobinson-grafana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this change is behind a feature flag, which is great! However, I'm requesting changes not because I want actual changes to be done to the code, but instead I want to ask a question before we add this to main.

tl;dr the question:

I understand the sentiment of this change is to make all states for a rule consistent when either No Data or an error occurs. What concerns me is that for some of our larger customers this can create massive alert storms. If their datasource is down, it could be 10,000s of alerts firing at once, perhaps 100,000s. If their grouping is generous, a number of contact points will also fail (such as email) as the email will be too large. I'm not sure about others such as Slack or Pagerduty.

I'm not sure this is the right feature to add moving forward, but I can support having it behind a feature flag for experimentation.

@yuri-tceretian and @JacobsonMT, could you share your opinions on what I've said above?

@yuri-tceretian
Copy link
Contributor Author

My thoughts about what George asked:
Large customers that have many rules with many dimensions and do not want to be hammered with all dimensions start firing in the case of Error\NoData, there is the default option to execute them as Error\NoData, which was introduced exactly for this use-case.

The bigger problem, in my opinion, is the side effect of the current behavior - after 2 evaluations resulted as Error\NoData it will cause all existing alert instances to be considered stale and get resolved, which also can cause an avalanche of notifications. Therefore, in my opinion, the execution of Error\NoData should maintain the current states (the same way the legacy KeepLastState did), or at least we should let the user pick this option. This is out of the scope of this PR, though.

Also, there is an alternative: use the pending state. If this PR is merged, the user can set the For interval and execution of NoData as Alerting, and the current dimensions will be armed in the case of NoData or Error. Also, it will help maintain the Pending state of an instance that was armed during a regular evaluation.

The purpose of this PR is two-fold:

  1. To solve inconsistencies in Error and NoData handling behavior between "classic condition" and "multi-dimensional" rules.
    In legacy alerting, and even in the unified alerting before the introduction of labels for NoData, which broke the way it worked in legacy (execution of Error for single dimension still works as expected, though), the behavior for the execution of exceptional results (NoData\Error) as Alerting\OK was that they affected the existing state instead of creating a new one. That was reflected in the documentation as well as the tests (see example). In the tests, however, the input parameters were done the wrong way but the assertion reflected the desired result. In reality, alerting just creates a new state because the exceptional results, which do not have labels returned during a normal evaluation, are handled as normal results.
  2. to diverge from the default executions. The execution of Error\NoData as they are creates a separate state, the same as the execution as Alerting or OK with a few differences: in the former case alerts have special names and special labels and annotations, whereas in the latter case, the alert name is the rule's name and only rule's labels (no templated annotations, no dimension's labels). Therefore, the execution of results as Alerting\OK does not make any sense because they are the worse, less informative alternative of the default execution.

I do not agree that this should be an experimental feature. I added the feature flag so we could evaluate that the fix works properly because the state management is the most unclear part of alerting, and test coverage was not great. However, after adding more tests in #73019 I am much more confident that it works as it should.

@grobinson-grafana
Copy link
Contributor

I tested a number of scenarios and seem to work as expected!

@github-actions github-actions bot removed the levitate breaking change A label indicating a breaking change and assigned by Levitate. label Aug 15, 2023
@yuri-tceretian yuri-tceretian merged commit 0717ec1 into main Aug 15, 2023
18 checks passed
@yuri-tceretian yuri-tceretian deleted the yuri-tceretian/fix-exec-state-mapping branch August 15, 2023 14:27
bergquist added a commit that referenced this pull request Aug 15, 2023
* main: (233 commits)
  PublicDashboards: Query order bug fixed (#73293)
  PostgreSQL: bump lib/pq to latest version (#72416)
  InfluxDB: Tests for #73247 (#73250)
  Docs: Add plugin dev documentation for logs to trace (#73225)
  Alerting: Update state manager to change all current states in the case when Error\NoData is executed as Ok\Nomal (#68142)
  Docs: correct SAML docs (#73281)
  CloudWatch: Add missing AppFlow metrics (#73149)
  docs: What’s New & Upgrade Guide 10.1 (#70636)
  Dashboard: Fix repeated row panel placement with larger number of rows (#72011)
  Geomap: Fix crosshair glitch (#72909)
  Logs: Fix scrolling with `exploreScrollableLogsContainer` feature (#73272)
  CodeEditor: Correctly fires onChange handler (#73030)
  InfluxDB: make influxql options the default if nothing defined (#73247)
  Cloudwatch: Upgrade aws-sdk and display external ids for temporary credentials (#72821)
  Cloudwatch: reorg files in components (#73176)
  Elasticsearch: Enable running of queries trough data source backend (#73222)
  Chore: fix some more types (#72726)
  Loki: Migrate HTTP settings to new components (#72831)
  Tracing: Split name column in search results (#72449)
  Plugins: Remove unnecessary error result from env vars interface (#73224)
  ...
aishyandapalli pushed a commit to aishyandapalli/grafana that referenced this pull request Aug 16, 2023
chauchausoup pushed a commit to chauchausoup/grafana that referenced this pull request Sep 15, 2023
@zerok zerok modified the milestones: 10.2.x, 10.2.0 Oct 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Incorrect mapping of NoData\Error to multidimensional rules
5 participants