Alerting: Update state manager to change all current states in the case when Error\NoData is executed as Ok\Nomal #68142
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 2 weeks if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
Instead of copying the tests, what do you think about something like this:

func TestProcessEvalResults_StateTransitions(t *testing.T) {
	t.Run("Without applyNoDataAndErrorToAllStates", func(t *testing.T) {
		stateTransitions(t, false)
	})
	t.Run("With applyNoDataAndErrorToAllStates", func(t *testing.T) {
		stateTransitions(t, true)
	})
}

func stateTransitions(t *testing.T, applyNoDataAndErrorToAllStates bool) {
	...
}

Then making the differences explicit based on the value of applyNoDataAndErrorToAllStates. I think this will make the PR much simpler to read, and the tests should be less at risk of drift between feature flag true and false.
@@ -288,7 +325,7 @@ func (st *Manager) setNextState(ctx context.Context, alertRule *ngModels.AlertRu
 // Usually, it happens in the case of classic conditions when the evalResult does not have labels.
 //
 // This is temporary change to make sure that the labels are not persistent in the state after it was in Error state
-// TODO yuri. Remove it in https://github.com/grafana/grafana/pull/68142
+// TODO yuri. Remove it when correct Error result with labels is provided
Can't remove it in this PR because it breaks execution of Error as Error.
// Set the current state based on evaluation results
func (st *Manager) setNextState(ctx context.Context, alertRule *ngModels.AlertRule, result eval.Result, extraLabels data.Labels, logger log.Logger) StateTransition {
currentState := st.cache.getOrCreate(ctx, logger, alertRule, result, extraLabels, st.externalURL)
func (st *Manager) setNextStateForRule(ctx context.Context, alertRule *ngModels.AlertRule, results eval.Results, extraLabels data.Labels, logger log.Logger) []StateTransition {
NIT: Add a method doc for setNextStateForRule and setNextStateForAll.
I see this change is behind a feature flag, which is great! However, I'm requesting changes not because I want actual changes to be done to the code, but because I want to ask a question before we add this to main.
tl;dr the question:
I understand the intent of this change is to make all states for a rule consistent when either No Data or an error occurs. What concerns me is that for some of our larger customers this can create massive alert storms. If their datasource is down, it could be 10,000s of alerts firing at once, perhaps 100,000s. If their grouping is generous, a number of contact points (such as email) will also fail, as the email will be too large. I'm not sure about others such as Slack or PagerDuty.
I'm not sure this is the right feature to add moving forward, but I can support having it behind a feature flag for experimentation.
@yuri-tceretian and @JacobsonMT, could you share your opinions on what I've said above?
My thoughts about what George asked:
The bigger problem, in my opinion, is a side effect of the current behavior: after two evaluations that result in Error\NoData, all existing alert instances are considered stale and get resolved, which can also cause an avalanche of notifications. Therefore, in my opinion, the execution of Error\NoData should maintain the current states (the same way the legacy KeepLastState did), or at least we should let the user pick this option. This is out of the scope of this PR, though. Also, there is an alternative: use the
The purpose of this PR is two-fold:
I do not agree that this should be an experimental feature. I added the feature flag so we could verify that the fix works properly, because state management is the most unclear part of alerting and test coverage was not great. However, after adding more tests in #73019, I am much more confident that it works as it should.
I tested a number of scenarios and they seem to work as expected!
* main: (233 commits)
  PublicDashboards: Query order bug fixed (#73293)
  PostgreSQL: bump lib/pq to latest version (#72416)
  InfluxDB: Tests for #73247 (#73250)
  Docs: Add plugin dev documentation for logs to trace (#73225)
  Alerting: Update state manager to change all current states in the case when Error\NoData is executed as Ok\Nomal (#68142)
  Docs: correct SAML docs (#73281)
  CloudWatch: Add missing AppFlow metrics (#73149)
  docs: What’s New & Upgrade Guide 10.1 (#70636)
  Dashboard: Fix repeated row panel placement with larger number of rows (#72011)
  Geomap: Fix crosshair glitch (#72909)
  Logs: Fix scrolling with `exploreScrollableLogsContainer` feature (#73272)
  CodeEditor: Correctly fires onChange handler (#73030)
  InfluxDB: make influxql options the default if nothing defined (#73247)
  Cloudwatch: Upgrade aws-sdk and display external ids for temporary credentials (#72821)
  Cloudwatch: reorg files in components (#73176)
  Elasticsearch: Enable running of queries trough data source backend (#73222)
  Chore: fix some more types (#72726)
  Loki: Migrate HTTP settings to new components (#72831)
  Tracing: Split name column in search results (#72449)
  Plugins: Remove unnecessary error result from env vars interface (#73224)
  ...
What is this feature?
This PR changes how the alerting state manager handles executions of NoData\Error as OK\Alerting. In those cases, the state manager updates all current states according to the execution settings.
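The idea can be illustrated with a minimal sketch. It assumes a simplified cache that maps an instance key to a state; `State`, `applyToAll`, and the instance keys are hypothetical names for illustration and are not the state manager's actual types or methods.

```go
package main

import "fmt"

// State is a simplified stand-in for an alert instance state (hypothetical).
type State string

const (
	Normal   State = "Normal"
	Alerting State = "Alerting"
)

// applyToAll sets every known instance of a rule to target and
// returns the number of instances whose state actually changed.
// This mirrors the described behavior only conceptually.
func applyToAll(current map[string]State, target State) int {
	changed := 0
	for key, s := range current {
		if s != target {
			current[key] = target
			changed++
		}
	}
	return changed
}

func main() {
	// Two existing instances of one rule (keys are made up).
	states := map[string]State{
		"instance=a": Alerting,
		"instance=b": Normal,
	}
	// NoData configured to execute as OK: every instance becomes Normal.
	n := applyToAll(states, Normal)
	fmt.Println(n, states["instance=a"], states["instance=b"])
}
```

In this sketch, a NoData or Error result does not create a separate state; it rewrites the states that already exist for the rule.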
Old Behavior
firefox_ogGiPhA9F7.1.mp4
New Behavior
firefox_tcWWtRVV7z-output.mp4
The feature is behind the flag alertingNoDataErrorExecution.
Why do we need this feature?
To fix execution of NoData\Error when it is set to Alerting\OK
Who is this feature for?
Alerting users who would like to treat exceptional states (NoData\Error) by maintaining the current state.
Which issue(s) does this PR fix?:
Fixes #66790
Special notes for your reviewer:
Please review the PR by commit. I copied the test cases introduced in #73019 and fixed them to pass when the feature flag is enabled. See commit 0ec9a02 to understand the difference. After Matt's suggestion, I decided to amend the existing tests instead of copying the whole suite. So, I updated the tests to run each test case twice: once with the flag enabled and once with it disabled. This shows that normal transitions are not affected, nor is the execution of NoData\Error as NoData\Error. The test cases where a difference exists are overridden by a new set of assertions.
Tests that have overridden assertions are marked by [*].
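A rough sketch of how one table can serve both flag values with per-case overrides is shown here. The `testCase` type, the `want` helper, and the sample expectations are assumptions made up for illustration; they are not the PR's real test code or its actual assertions.

```go
package main

import "fmt"

// testCase is a hypothetical table entry.
type testCase struct {
	name           string
	expected       string // expected state with the flag disabled
	expectedWithFF string // override with the flag enabled; empty means "same as expected"
}

// want picks the assertion that applies for the given flag value.
func want(tc testCase, flagEnabled bool) string {
	if flagEnabled && tc.expectedWithFF != "" {
		return tc.expectedWithFF
	}
	return tc.expected
}

func main() {
	// Illustrative values only; cases with an override carry the [*] marker.
	cases := []testCase{
		{name: "normal -> alerting", expected: "Alerting"},
		{name: "[*] alerting -> NoData executed as OK", expected: "Alerting", expectedWithFF: "Normal"},
	}
	// Every case runs twice: flag disabled, then flag enabled.
	for _, flag := range []bool{false, true} {
		for _, tc := range cases {
			fmt.Printf("flag=%v %q want=%s\n", flag, tc.name, want(tc, flag))
		}
	}
}
```

Keeping the override next to the default expectation makes the flag-dependent cases easy to spot and keeps the shared cases from drifting apart.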
Example
Please check that: