Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alerting: rule evaluation needs better retry semanatics #49621

Open
1 task done
yuri-tceretian opened this issue May 25, 2022 · 6 comments
Open
1 task done

Alerting: rule evaluation needs better retry semanatics #49621

yuri-tceretian opened this issue May 25, 2022 · 6 comments
Assignees
Labels
area/alerting Grafana Alerting effort/large internal for issues made by grafanistas postmortem prio/medium Important over the long term, but may not be staffed and/or may need multiple releases to complete. type/bug

Comments

@yuri-tceretian
Copy link
Contributor

yuri-tceretian commented May 25, 2022

What happened:
Sometimes there can be intermittent errors when a rule is evaluated (network, service-related). Although Grafana can notify users about errors via dedicated channels, it does not retry evaluations. It can be useful, especially for rules with a long evaluation interval.

Also, we need to make sure that retries do not affect rule evaluation. In other words, the ticks from the scheduler should be processed as soon as possible. As a dumb solution, we can retry for N times unless the total evaluation duration exceeds half of the interval.

Tasks

Grafana: 10.2.3

@yuri-tceretian
Copy link
Contributor Author

During the meeting, we decided that we need to distinguish between retriable errors and non-retriable ones.

@armandgrillet armandgrillet added area/alerting Grafana Alerting and removed area/alerting/unified labels Jun 22, 2022
@alexweav alexweav added prio/medium Important over the long term, but may not be staffed and/or may need multiple releases to complete. effort/large labels Jul 29, 2022
@eraac
Copy link
Contributor

eraac commented Jan 24, 2023

We're looking for this kind of feature, there are any hints to help anyone who would starting implementing this? I'll like to check and try something

@yuri-tceretian
Copy link
Contributor Author

yuri-tceretian commented Jan 25, 2023

@eraac

  1. We need to distinguish between retriable and non-retriable errors. Errors that come from the data source may be retriable whereas other errors that come from the evaluator - are almost certainly not.

    grafana/pkg/expr/nodes.go

    Lines 240 to 243 in 5e8866e

    resp, err := s.dataService.QueryData(ctx, req)
    if err != nil {
    return mathexp.Results{}, err
    }

    So, we basically need to distinguish errors produced in the code above. I think the code that determines whether an error is retriable or not is the most critical part of this feature. It does not have to be a comprehensive list of all possible cases and we can amend it later.

Then in
https://github.com/grafana/grafana/blob/ca6478b68ddf9baf1f306f29e34a66852efd9407/pkg/services/ngalert/schedule/schedule.go#L364-L367
we should analyze the error. The problem here is that we're put errors in results instead of err. It will change if PR #59973 is merged. But for now, we have to scan results and figure out that all errors are retriable. I think the logic that decides what is retriable and what is not, could live somewhere nearby, at least for now.

  1. We need to figure out how to know when to stop reties. Rules are evaluated asynchronously to their scheduling: if the evaluation of tick X of the rule takes longer than its evaluation interval, the scheduler can decide to skip evaluations. Therefore, I think there should be some smart logic that exits the retry loop of evaluation of tick X when the next tick Y is about to be scheduled. Basically, we can measure the evaluation duration, and if it is getting closer to the rule's evaluation interval we exit the loop.

@amurray2306
Copy link

Has there been any movement on this issue?

@yuri-tceretian
Copy link
Contributor Author

This has not been prioritized so far but we are moving in this direction slowly: there have been some improvements in expr package recently, which may help alerting solve the main problem of distinguishing repeatable and non-repeatable errors.

@amurray2306
Copy link

amurray2306 commented Aug 29, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/alerting Grafana Alerting effort/large internal for issues made by grafanistas postmortem prio/medium Important over the long term, but may not be staffed and/or may need multiple releases to complete. type/bug
Projects
Status: Backlog
Development

No branches or pull requests

7 participants