Alerting update eval engine to return errors and no data as separate models #59973

Closed

@yuri-tceretian yuri-tceretian commented Dec 7, 2022

Background
The alerting engine uses the Grafana expression service to evaluate queries and transform the response. The engine then converts the expression service's result into a list of results, each containing the original value, the labels, and the state (Alerting, Normal, etc.).

Execution of an alert rule query generally finishes with one of three types of results:

  • Normal execution.
  • Error. There are two sources of errors:
    • The expression service. These errors split further into query errors (network, data source, etc.) and transformation errors (when the format of the data produced by one expression node cannot be accepted by the next node).
    • Results transformation. This happens when a normal result (neither error nor no-data) returned by the expression service cannot be converted to evaluation results, for example when the data frame format does not match the expected one (1 number field with 1 row).
  • NoData. This can happen in two ways:
    • One of the datasources queried by the expression service did not return any data. Generally, this means that all downstream transformations will result in no-data (one exception is Classic Condition with mapping of no-data).
    • The expression service returned a heterogeneous result that contains results with a numeric value as well as results with a null value. The null value is treated as NoData, but it applies only to the dimension (alert instance) identified by that set of labels; in other words, it is not a complete no-data.

All of these results are returned to the state manager as a list of eval.Result, where every element is treated as a separate state and is identified by its set of labels. In the case of an error, the list contains only one element with state Error. In the case of "global" no-data, it contains a single element with state NoData. It is important to note that in both cases the set of labels is empty.
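To make the shapes concrete, here is a minimal sketch of the three kinds of lists described above. The `State` and `Result` types and the helper names are simplified, hypothetical stand-ins, not the real `eval` package definitions.

```go
package main

import "fmt"

// State and Result are hypothetical, simplified stand-ins for the
// evaluation types; they are NOT the real eval package definitions.
type State string

const (
	Normal   State = "Normal"
	Alerting State = "Alerting"
	NoData   State = "NoData"
	Error    State = "Error"
)

type Result struct {
	Labels map[string]string // identifies the dimension (alert instance)
	State  State
}

// errorResults models a failed evaluation: one element with no labels.
func errorResults() []Result {
	return []Result{{Labels: map[string]string{}, State: Error}}
}

// globalNoDataResults models "global" no-data: one element with no labels.
func globalNoDataResults() []Result {
	return []Result{{Labels: map[string]string{}, State: NoData}}
}

// partialNoDataResults models a heterogeneous response in which only one
// dimension has a null value; that dimension keeps its labels.
func partialNoDataResults() []Result {
	return []Result{
		{Labels: map[string]string{"host": "a"}, State: Alerting},
		{Labels: map[string]string{"host": "b"}, State: NoData},
	}
}

func main() {
	all := append(errorResults(), globalNoDataResults()...)
	all = append(all, partialNoDataResults()...)
	for _, r := range all {
		fmt.Printf("labels=%v state=%s\n", r.Labels, r.State)
	}
}
```

The point of the sketch is only that the Error and "global" NoData elements carry an empty label set, while a per-dimension NoData keeps the labels of its dimension.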

As mentioned above, the state manager processes every result from the list individually. In the case of Error or NoData states, it checks the alert rule specification and determines what to do with that "abnormal" result. The rule specification provides three mapping options:

  • map to OK.
  • map to Alerting.
  • create a special alert DatasourceNoData or DatasourceError.
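The three options can be sketched as a small mapping function. The `Mapping` type, its constants, and `resolve` are hypothetical names introduced for illustration only, and the sketch ignores the Pending state that the For setting can introduce.

```go
package main

import "fmt"

// State is a hypothetical, simplified stand-in for the evaluation state.
type State string

const (
	Normal   State = "Normal"
	Alerting State = "Alerting"
	NoData   State = "NoData"
	Error    State = "Error"
)

// Mapping is a hypothetical name for the rule's handling option.
type Mapping string

const (
	MapToOK       Mapping = "OK"
	MapToAlerting Mapping = "Alerting"
	// KeepAbnormal stands for the third option, which creates the special
	// DatasourceNoData or DatasourceError alert.
	KeepAbnormal Mapping = "KeepAbnormal"
)

// resolve returns the state an abnormal result should be turned into.
// Results that are neither NoData nor Error are returned unchanged.
func resolve(s State, m Mapping) State {
	if s != NoData && s != Error {
		return s
	}
	switch m {
	case MapToOK:
		return Normal
	case MapToAlerting:
		return Alerting
	default:
		return s
	}
}

func main() {
	fmt.Println(resolve(Error, MapToOK))        // Normal
	fmt.Println(resolve(NoData, MapToAlerting)) // Alerting
	fmt.Println(resolve(NoData, KeepAbnormal))  // NoData
}
```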

Reference to the documentation

### No data and error handling
Configure alerting behavior in the absence of data using information in the following tables.

| No Data Option | Description |
| -------------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
| No Data | Create a new alert `DatasourceNoData` with the name and UID of the alert rule, and UID of the datasource that returned no data as labels. |
| Alerting | Set alert rule state to `Alerting`. |
| Ok | Set alert rule state to `Normal`. |

| Error or timeout option | Description |
| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
| Alerting | Set alert rule state to `Alerting`. From Grafana 8.5, the alert rule waits for the entire duration for which the condition is true before firing. |
| OK | Set alert rule state to `Normal` |
| Error | Create a new alert `DatasourceError` with the name and UID of the alert rule, and UID of the datasource that returned no data as labels. |

According to the documentation, if the abnormal state is mapped to either OK or Alerting, the current state should switch to OK or Alerting (or Pending, depending on the For setting). However, that is not true in the general case. The problem is that the state manager still treats the abnormal result as an individual dimension (aka state, aka instance). As mentioned above, the abnormal result usually has no labels, or, because of its abnormality, its set of labels can differ from those of the current states. This causes the state manager to create a new state instead of switching the existing instances to the desired state.

Therefore, the outcome of mapping abnormal results to the Normal and Alerting states is neither what the user expects nor what the documentation declares.
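The problem can be reproduced with a toy state cache keyed by label set. This is only a sketch under simplifying assumptions: `key` and `applyPerResult` are hypothetical helpers, and the real state manager is considerably more involved.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// State is a hypothetical, simplified stand-in for the evaluation state.
type State string

// key builds a stable identifier from a label set (hypothetical helper).
func key(labels map[string]string) string {
	parts := make([]string, 0, len(labels))
	for k, v := range labels {
		parts = append(parts, k+"="+v)
	}
	sort.Strings(parts)
	return strings.Join(parts, ",")
}

// applyPerResult mimics the behavior described above: every result is
// stored under its own label set, including abnormal results.
func applyPerResult(cache map[string]State, labels map[string]string, s State) {
	cache[key(labels)] = s
}

func main() {
	cache := map[string]State{
		key(map[string]string{"host": "a"}): "Normal",
		key(map[string]string{"host": "b"}): "Normal",
	}

	// The evaluation fails and the rule maps Error to Alerting. The
	// abnormal result has no labels, so it lands under the empty key.
	applyPerResult(cache, map[string]string{}, "Alerting")

	fmt.Println(len(cache))      // 3: a new state was created
	fmt.Println(cache["host=a"]) // Normal: existing instances did not switch
}
```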

What is this feature?
This PR does two things:

  1. Updates the alerting evaluator package so that it no longer returns abnormal statuses as normal results in the list of results. This distinguishes abnormal results from normal ones, so the state manager does not need to figure that out by itself.
  2. Fixes the state manager to properly handle mapping of abnormal states: it pulls all existing alert instances for the rule and switches them to the desired state.
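A rough sketch of the idea behind both changes, not the actual implementation: the `Evaluation` struct, its fields, and `apply` are hypothetical simplifications. The abnormal outcome travels separately from the per-instance results, and the state manager applies the mapped state to every existing instance of the rule.

```go
package main

import (
	"errors"
	"fmt"
)

// State is a hypothetical, simplified stand-in for the evaluation state.
type State string

// Evaluation is a hypothetical, simplified model in which abnormal
// outcomes are kept apart from the normal per-instance results.
type Evaluation struct {
	Err     error            // set when the evaluation failed
	NoData  bool             // set when the evaluation returned no data at all
	Results map[string]State // normal results, keyed by label set
}

// apply updates the cached states of one rule. mapped is the state the
// rule maps Error or NoData to (for example Normal or Alerting).
func apply(cache map[string]State, ev Evaluation, mapped State) {
	if ev.Err != nil || ev.NoData {
		// Abnormal outcome: switch every existing instance instead of
		// creating a new, unlabeled one.
		for k := range cache {
			cache[k] = mapped
		}
		return
	}
	for k, s := range ev.Results {
		cache[k] = s
	}
}

func main() {
	cache := map[string]State{"host=a": "Normal", "host=b": "Normal"}
	apply(cache, Evaluation{Err: errors.New("query failed")}, "Alerting")
	fmt.Println(len(cache), cache["host=a"], cache["host=b"]) // 2 Alerting Alerting
}
```

The sketch deliberately leaves out the third mapping option (the special DatasourceNoData or DatasourceError alert) and the Pending state.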

Why do we need this feature?
This fixes the bug of mapping Error|NoData results to OK|Alerting states.

@yuri-tceretian yuri-tceretian added the area/alerting Grafana Alerting label Dec 7, 2022
@yuri-tceretian yuri-tceretian self-assigned this Dec 7, 2022
@yuri-tceretian yuri-tceretian added this to the 9.4.0 milestone Dec 7, 2022
@grafanabot

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 2 weeks if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@grafanabot grafanabot added the stale Issue with no recent activity label Jan 8, 2023
type EvaluationResult struct {
	Error error
	// NoData contains the DatasourceUID for RefIDs that returned no data.
	NoData *NoDataResult

Should we instead have the following? I'm not sure we get much from having NoDataResult?

Suggested change
- NoData *NoDataResult
+ NoData map[string][]string

})
}
return evalResults
result.NoData = &NoDataResult{DatasourceToRefID: datasourceUIDsToRefIDs(execResults.NoData)}

I think this has the same issue where a Classic Condition that checks for HasNoValue will be No Data instead of Firing?

[Screenshots: 2023-01-12 at 14:31:50 and 14:32:00]

@grafanabot grafanabot removed the stale Issue with no recent activity label Jan 13, 2023
@grafanabot grafanabot removed this from the 9.4.0 milestone Feb 3, 2023
@grafanabot

This pull request was removed from the 9.4.0 milestone because 9.4.0 is currently being released.

@yuri-tceretian yuri-tceretian added this to the 9.5.0 milestone Feb 3, 2023
@grafanabot grafanabot added the stale Issue with no recent activity label Mar 6, 2023
@yuri-tceretian yuri-tceretian removed the stale Issue with no recent activity label Mar 6, 2023
@grafanabot grafanabot removed this from the 9.5.0 milestone Apr 4, 2023
@grafanabot

This pull request was removed from the 9.5.0 milestone because 9.5.0 is currently being released.

@grafanabot grafanabot added stale Issue with no recent activity and removed stale Issue with no recent activity labels Jun 2, 2023
@grafanabot grafanabot added the stale Issue with no recent activity label Jul 3, 2023
@yuri-tceretian yuri-tceretian removed the stale Issue with no recent activity label Jul 3, 2023
@github-actions github-actions bot added the stale Issue with no recent activity label Aug 3, 2023
@yuri-tceretian yuri-tceretian removed the stale Issue with no recent activity label Aug 4, 2023
@github-actions github-actions bot added the stale Issue with no recent activity label Sep 4, 2023
@github-actions

This pull request has been automatically closed because it has not had activity in the last 2 weeks. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot closed this Sep 18, 2023
Labels
area/alerting Grafana Alerting area/backend stale Issue with no recent activity