Alerting: rule evaluation needs better retry semantics #49621

yuri-tceretian · 2022-05-25T14:38:21Z

What happened:
Sometimes a rule evaluation fails with an intermittent error (network, service-related). Although Grafana can notify users about errors via dedicated channels, it does not retry the evaluation. Retrying would be useful, especially for rules with a long evaluation interval.

Also, we need to make sure that retries do not delay subsequent rule evaluations. In other words, the ticks from the scheduler should still be processed as soon as possible. As a naive solution, we can retry up to N times unless the total evaluation duration exceeds half of the interval.

Tasks

https://github.com/grafana/alerting-squad/issues/303

Grafana: 10.2.3

yuri-tceretian · 2022-05-31T19:23:11Z

During the meeting, we decided that we need to distinguish between retriable errors and non-retriable ones.

eraac · 2023-01-24T13:28:58Z

We're looking for this kind of feature. Are there any hints for someone who would like to start implementing it? I'd like to take a look and try something.

yuri-tceretian · 2023-01-25T21:22:26Z

@eraac

We need to distinguish between retriable and non-retriable errors. Errors that come from the data source may be retriable, whereas other errors that come from the evaluator are almost certainly not.

grafana/pkg/expr/nodes.go

Lines 240 to 243 in 5e8866e

    resp, err := s.dataService.QueryData(ctx, req)
    if err != nil {
        return mathexp.Results{}, err
    }

So, we basically need to distinguish between the errors produced by the code above. I think the code that determines whether an error is retriable or not is the most critical part of this feature. It does not have to be a comprehensive list of all possible cases, and we can amend it later.

Then in
https://github.com/grafana/grafana/blob/ca6478b68ddf9baf1f306f29e34a66852efd9407/pkg/services/ngalert/schedule/schedule.go#L364-L367
we should analyze the error. The problem here is that we put errors in the results instead of returning them in err. This will change if PR #59973 is merged. But for now, we have to scan the results and figure out whether all errors are retriable. I think the logic that decides what is retriable and what is not could live somewhere nearby, at least for now.

We need to figure out when to stop retrying. Rules are evaluated asynchronously to their scheduling: if the evaluation of tick X of a rule takes longer than its evaluation interval, the scheduler can decide to skip evaluations. Therefore, I think there should be some smart logic that exits the retry loop for tick X when the next tick Y is about to be scheduled. Basically, we can measure the evaluation duration, and if it gets close to the rule's evaluation interval, we exit the loop.

amurray2306 · 2023-08-29T12:16:26Z

Has there been any movement on this issue?

yuri-tceretian · 2023-08-29T13:56:01Z

This has not been prioritized so far, but we are slowly moving in this direction: there have been some recent improvements in the expr package that may help alerting solve the main problem of distinguishing retriable and non-retriable errors.

amurray2306 · 2023-08-29T14:08:42Z

Awesome, thanks for the update.

yuri-tceretian added the area/alerting/unified label May 25, 2022

armandgrillet added the type/bug label May 31, 2022

yuri-tceretian assigned santihernandezc May 31, 2022

armandgrillet added the area/alerting label and removed the area/alerting/unified label Jun 22, 2022

alexweav added the prio/medium and effort/large labels Jul 29, 2022

alexweav unassigned santihernandezc Jul 29, 2022

amurray2306 added the postmortem label Aug 29, 2023

armandgrillet added the internal label Nov 6, 2023

gotjosh self-assigned this Dec 1, 2023

gotjosh mentioned this issue Dec 6, 2023

Alerting: Attempt to retry retryable errors #79037 (Merged)

grafana-delivery-bot bot mentioned this issue Dec 6, 2023

[v10.2.x] Alerting: Attempt to retry retryable errors #79152 (Closed)

gotjosh mentioned this issue Dec 6, 2023

Alerting: Attempt to retry retryable errors #79161 (Merged)

grafana-delivery-bot bot mentioned this issue Dec 6, 2023

[v10.2.x] Alerting: Attempt to retry retryable errors #79175 (Merged)

gotjosh changed the title ~~Alerting: rule evaluation should retry in the case of error~~ Alerting: rule evaluation needs better retry semantics Dec 7, 2023
