Alerting update eval engine to return errors and no data as separate models #59973

yuri-tceretian · 2022-12-07T16:44:26Z

Background
The alerting engine uses the Grafana expression service to evaluate queries and transform the response. The engine converts the expression service's result into a list of results, each containing the original value, the labels, and the state (Alerting, Normal, etc.).

Execution of an alert rule query generally finishes with one of 3 types of results:

Normal execution.
Error. There are two sources of errors:
- the expression service, whose errors split further into query errors (network, data source, etc.) and transformation errors (when the format of the data produced by one expression node cannot be accepted by the following node).
- results transformation, when a normal result (neither error nor no-data) returned by the expression service cannot be converted to evaluation results. For example, when the data frame format does not match the expected one (1 number field with 1 row).
NoData. This can happen in two ways:
- One of the data sources queried by the expression service did not return any data. Generally, this means that all downstream transformations will result in no-data (one exception is Classic Condition with mapping of no-data).
- The expression service returned a heterogeneous result that contains results with numeric values as well as results with null values. A null value is treated as NoData, but only for the dimension (alert instance) identified by that set of labels; in other words, it is not a complete no-data.

All those results are returned to the state manager as a list of eval.Result, where every element is treated as a separate state in the state manager and is identified by its set of labels. In the case of an error, the list contains only one element with state Error. In the case of "global" no-data, it contains a single element with state NoData. It is important to note that in both cases the set of labels is empty.
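To make the shape of that list concrete, here is a minimal Go sketch. The `Result` and `State` types are simplified stand-ins for `eval.Result`, not the actual definitions in the codebase, and the helper name `errorResults` is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// State is a simplified stand-in for the evaluation state.
type State int

const (
	Normal State = iota
	Alerting
	NoData
	Error
)

// Result is a simplified stand-in for eval.Result (assumed shape).
type Result struct {
	Labels map[string]string
	State  State
	Err    error
}

// errorResults mirrors the behavior described above: an evaluation error
// becomes a list with exactly one element, state Error, and no labels.
func errorResults(err error) []Result {
	return []Result{{Labels: map[string]string{}, State: Error, Err: err}}
}

func main() {
	rs := errorResults(errors.New("query failed"))
	fmt.Println(len(rs), len(rs[0].Labels), rs[0].State == Error)
}
```

The empty label set is the detail that matters for the rest of this description: nothing in the element ties it to the alert instances that already exist for the rule.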

As mentioned above, the state manager processes every result from the list individually. In the case of Error or NoData states, it checks the alert rule specification and determines what needs to be done with that "abnormal" result. The rule specification provides 3 mapping options:

map to OK.
map to Alerting.
create a special alert DatasourceNoData or DatasourceError.
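The three options can be sketched as a small mapping function. This is an illustration only: the `Mapping` type, its constants, and `resolve` are hypothetical names, not the identifiers used in the rule model.

```go
package main

import "fmt"

// State is a simplified stand-in for an alert instance state.
type State string

const (
	Normal   State = "Normal"
	Alerting State = "Alerting"
	NoData   State = "NoData"
	Error    State = "Error"
)

// Mapping is a hypothetical name for the per-rule option that says
// what to do with an abnormal (NoData or Error) result.
type Mapping string

const (
	MapToOK       Mapping = "OK"
	MapToAlerting Mapping = "Alerting"
	KeepAbnormal  Mapping = "Keep" // create DatasourceNoData / DatasourceError
)

// resolve returns the state an abnormal result should be turned into.
// Normal and Alerting results pass through unchanged.
func resolve(s State, m Mapping) State {
	if s != NoData && s != Error {
		return s
	}
	switch m {
	case MapToOK:
		return Normal
	case MapToAlerting:
		return Alerting
	default:
		return s
	}
}

func main() {
	fmt.Println(resolve(NoData, MapToOK))
	fmt.Println(resolve(Error, MapToAlerting))
	fmt.Println(resolve(Error, KeepAbnormal))
}
```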

Reference to the documentation

grafana/docs/sources/alerting/alerting-rules/create-grafana-managed-rule.md

Lines 72 to 86 in 0e4108f

    
    ### No data and error handling

    Configure alerting behavior in the absence of data using information in the following tables.

    | No Data Option | Description                                                                                                                               |
    | -------------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
    | No Data        | Create a new alert `DatasourceNoData` with the name and UID of the alert rule, and UID of the datasource that returned no data as labels. |
    | Alerting       | Set alert rule state to `Alerting`.                                                                                                       |
    | Ok             | Set alert rule state to `Normal`.                                                                                                         |

    | Error or timeout option | Description                                                                                                                                       |
    | ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
    | Alerting                | Set alert rule state to `Alerting`. From Grafana 8.5, the alert rule waits for the entire duration for which the condition is true before firing. |
    | OK                      | Set alert rule state to `Normal`                                                                                                                  |
    | Error                   | Create a new alert `DatasourceError` with the name and UID of the alert rule, and UID of the datasource that returned no data as labels.          |

According to the documentation, if the abnormal state is mapped to either OK or Alerting, the rule should switch its current state to OK or Alerting (or Pending, depending on the For setting). However, that is not true in the general case. The problem is that the state manager still treats the abnormal result as an individual dimension (aka state, aka instance). As mentioned above, the abnormal result usually has no labels or, because of its abnormality, a set of labels that differs from those of the current states. This causes the state manager to create a new state instead of switching the existing instances to the desired state.

Therefore, the outcome of mapping abnormal results to the Normal and Alerting states is not what the user expects or what the documentation declares.
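The effect can be reproduced with a toy model of a state cache keyed by label set. The sketch assumes a plain map and a string key built from sorted labels; the real state manager is more involved, and `key` and `applyPerResult` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// key builds a deterministic cache key from a label set.
func key(labels map[string]string) string {
	pairs := make([]string, 0, len(labels))
	for k, v := range labels {
		pairs = append(pairs, k+"="+v)
	}
	sort.Strings(pairs)
	return strings.Join(pairs, ",")
}

// applyPerResult models the per-result handling described above:
// each result only touches the state that matches its own labels.
func applyPerResult(states map[string]string, labels map[string]string, state string) {
	states[key(labels)] = state
}

func main() {
	states := map[string]string{
		key(map[string]string{"host": "a"}): "Normal",
		key(map[string]string{"host": "b"}): "Normal",
	}

	// An error result arrives with empty labels; the rule maps Error to Alerting.
	applyPerResult(states, map[string]string{}, "Alerting")

	// A third state appears, and the existing instances are left unchanged.
	fmt.Println(len(states), states["host=a"], states["host=b"])
}
```

In this model the two existing instances stay Normal and a third, label-less state is created, which is the mismatch with the documented behavior.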

What is this feature?
This PR does two things:

Updates the alerting evaluator package so that it no longer returns abnormal statuses as normal results in the list of results. This distinguishes abnormal results from normal ones, so the state manager does not need to figure that out by itself.
Fixes the state manager to properly handle mapping of abnormal states: it pulls all existing alert instances for the rule and switches them to the desired state.
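A rough sketch of the intended behavior follows. The names `EvaluationResult`, `NoDataResult`, and `DatasourceToRefID` come from the diff in this PR; the `Results` field, the string-keyed state map, and the `process` function are simplifying assumptions made for the example. The sketch also treats any non-nil `Error` or `NoData` as a rule-wide abnormal result and ignores the For/Pending logic.

```go
package main

import (
	"errors"
	"fmt"
)

// NoDataResult carries which RefIDs returned no data, per datasource UID.
type NoDataResult struct {
	DatasourceToRefID map[string][]string
}

// EvaluationResult separates abnormal outcomes from normal results.
// Results is a hypothetical field: label key -> state.
type EvaluationResult struct {
	Error   error
	NoData  *NoDataResult
	Results map[string]string
}

// process applies one evaluation to the existing states of a rule.
// mapTo is the state the rule maps abnormal results to (e.g. "Alerting").
func process(states map[string]string, res EvaluationResult, mapTo string) {
	if res.Error != nil || res.NoData != nil {
		// Abnormal result: switch every existing instance of the rule.
		for k := range states {
			states[k] = mapTo
		}
		return
	}
	// Normal result: update each dimension individually.
	for k, s := range res.Results {
		states[k] = s
	}
}

func main() {
	states := map[string]string{"host=a": "Normal", "host=b": "Normal"}
	process(states, EvaluationResult{Error: errors.New("query failed")}, "Alerting")
	fmt.Println(len(states), states["host=a"], states["host=b"])
}
```

Unlike the per-result handling, no extra state is created here: the two existing instances both move to the mapped state.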

Why do we need this feature?
This fixes the bug of mapping Error|NoData results to OK|Alerting states.

grafanabot · 2023-01-08T02:02:24Z

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 2 weeks if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

grobinson-grafana · 2023-01-12T10:34:05Z

pkg/services/ngalert/eval/eval.go

+type EvaluationResult struct {
+	Error error
+	// NoData contains the DatasourceUID for RefIDs that returned no data.
+	NoData *NoDataResult


Should we instead have the following? I'm not sure we get much from having NoDataResult?

Suggested change

-	NoData *NoDataResult
+	NoData map[string][]string
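For context on what such a map would hold, here is a sketch of grouping RefIDs by datasource UID. It assumes the executor reports no-data as a RefID to datasource UID map, which is not confirmed by the diff shown here, and `groupByDatasource` is a hypothetical helper rather than the function used in the PR.

```go
package main

import (
	"fmt"
	"sort"
)

// groupByDatasource inverts a RefID -> datasource UID map into
// datasource UID -> sorted list of RefIDs.
func groupByDatasource(refIDToDatasourceUID map[string]string) map[string][]string {
	out := make(map[string][]string)
	for refID, uid := range refIDToDatasourceUID {
		out[uid] = append(out[uid], refID)
	}
	// Map iteration order is random, so sort for a stable result.
	for uid := range out {
		sort.Strings(out[uid])
	}
	return out
}

func main() {
	noData := map[string]string{"A": "ds1", "B": "ds1", "C": "ds2"}
	fmt.Println(groupByDatasource(noData))
}
```

With a bare map, a nil value can stand for "no no-data", which is the same information the pointer to a wrapper struct conveys.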

grobinson-grafana · 2023-01-12T14:32:08Z

pkg/services/ngalert/eval/eval.go

-			})
-		}
-		return evalResults
+		result.NoData = &NoDataResult{DatasourceToRefID: datasourceUIDsToRefIDs(execResults.NoData)}


I think this has the same issue where a Classic Condition that checks for HasNoValue will be No Data instead of Firing?

grafanabot · 2023-02-03T17:55:23Z

This pull request was removed from the 9.4.0 milestone because 9.4.0 is currently being released.

grafanabot · 2023-03-06T02:06:18Z

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 2 weeks if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

grafanabot · 2023-04-04T07:43:57Z

This pull request was removed from the 9.5.0 milestone because 9.5.0 is currently being released.

grafanabot · 2023-06-02T02:08:21Z

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 2 weeks if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

grafanabot · 2023-07-03T02:10:31Z

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 2 weeks if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

github-actions · 2023-08-03T01:51:12Z

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 2 weeks if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

github-actions · 2023-09-04T01:47:37Z

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 2 weeks if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

github-actions · 2023-09-18T01:47:54Z

This pull request has been automatically closed because it has not had activity in the last 2 weeks. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

yuri-tceretian added 3 commits December 7, 2022 10:52

update eval engine to return errors and no data as separate models

bfa5adf

update usages

3d62919

update manager to handle new result

cf0f2ff

grafanabot added the area/backend label Dec 7, 2022

yuri-tceretian added the area/alerting Grafana Alerting label Dec 7, 2022

yuri-tceretian self-assigned this Dec 7, 2022

yuri-tceretian added this to the 9.4.0 milestone Dec 7, 2022

yuri-tceretian added the type/bug label Dec 8, 2022

grafanabot added the stale Issue with no recent activity label Jan 8, 2023

grobinson-grafana reviewed Jan 12, 2023

grafanabot removed the stale Issue with no recent activity label Jan 13, 2023

yuri-tceretian mentioned this pull request Jan 25, 2023

Alerting: rule evaluation needs better retry semanatics #49621

Open

grafanabot removed this from the 9.4.0 milestone Feb 3, 2023

yuri-tceretian added this to the 9.5.0 milestone Feb 3, 2023

grafanabot added the stale Issue with no recent activity label Mar 6, 2023

yuri-tceretian removed the stale Issue with no recent activity label Mar 6, 2023

grafanabot removed this from the 9.5.0 milestone Apr 4, 2023

yuri-tceretian removed the type/bug label May 2, 2023

grafanabot added stale Issue with no recent activity and removed stale Issue with no recent activity labels Jun 2, 2023

grafanabot added the stale Issue with no recent activity label Jul 3, 2023

yuri-tceretian removed the stale Issue with no recent activity label Jul 3, 2023

github-actions bot added the stale Issue with no recent activity label Aug 3, 2023

yuri-tceretian removed the stale Issue with no recent activity label Aug 4, 2023

github-actions bot added the stale Issue with no recent activity label Sep 4, 2023

github-actions bot closed this Sep 18, 2023
