Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alerting: Fix database unavailable removes rules from scheduler #49874

Merged
merged 10 commits into from
Jun 7, 2022

Conversation

grobinson-grafana
Copy link
Contributor

@grobinson-grafana grobinson-grafana commented May 31, 2022

What this PR does / why we need it:

This pull request fixes a bug in Grafana where intermittent failure of database, network between Grafana and the database, or error in querying the database would cause all alert rules to be unscheduled in Grafana. This pull request fixes this so alert rules are not changed unless the query is successful.

Which issue(s) this PR fixes:

Fixes #49855

Special notes for your reviewer:

Release notice breaking change

This change fixes a bug in Grafana where intermittent failure of database, network between Grafana and the database, or error in querying the database would cause all alert rules to be unscheduled in Grafana. Following this change scheduled alert rules are not updated unless the query is successful.

The get_alert_rules_duration_seconds metric has been renamed to schedule_query_alert_rules_duration_seconds.

Copy link
Contributor

@gotjosh gotjosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but please see my comments I'm happy to take another look.

pkg/services/ngalert/schedule/fetcher.go Show resolved Hide resolved
// If the database is unavailable or the query returns an error then return
// the alert rules from the most recent tick
if err := sch.ruleStore.GetAlertRulesForScheduling(ctx, &q); err != nil {
sch.log.Error("failed to get most recent alert rules", "error", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you feel about including the number of alerts we're about to return from the cache?

pkg/services/ngalert/metrics/ngalert.go Outdated Show resolved Hide resolved
@gotjosh
Copy link
Contributor

gotjosh commented May 31, 2022

We also need a changelog entry.

@grobinson-grafana grobinson-grafana changed the title Fix database unavailable removes rules from scheduler Alerting: Fix database unavailable removes rules from scheduler May 31, 2022
Copy link
Contributor

@yuri-tceretian yuri-tceretian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scheduler has two methods that provide the ability to update\delete rule (methods DeleteAlertRule and UpdateAlertRule). I think we should update the cache when DeleteAlertRule is called

@grafanabot
Copy link
Contributor

@grafanabot
Copy link
Contributor

@grafanabot
Copy link
Contributor

@grafanabot
Copy link
Contributor

@daniellee daniellee modified the milestones: 9.0.0, 9.0.0-beta3 Jun 3, 2022
@grafanabot
Copy link
Contributor

@grafanabot
Copy link
Contributor

@grafanabot
Copy link
Contributor

Copy link
Contributor

@gotjosh gotjosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

pkg/services/ngalert/CHANGELOG.md Outdated Show resolved Hide resolved
@grafanabot
Copy link
Contributor

@grobinson-grafana grobinson-grafana merged commit c83f843 into main Jun 7, 2022
@grobinson-grafana grobinson-grafana deleted the grobinson/issue-49855 branch June 7, 2022 15:20
grafanabot pushed a commit that referenced this pull request Jun 7, 2022
grobinson-grafana added a commit that referenced this pull request Jun 7, 2022
…) (#50345)

(cherry picked from commit c83f843)

Co-authored-by: George Robinson <george.robinson@grafana.com>
@dsotirakis dsotirakis added the breaking change Relevant for changelog generation label Jun 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Alerting: Database not available will cause rule evaluation to be stopped
6 participants