Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix ruler alert state restoration #2648

Merged
merged 4 commits into from
Aug 19, 2022
Merged

Conversation

dimitarvdimitrov
Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov commented Aug 4, 2022

What this PR does

Addresses a bug in which alert states are not restored.

After starting the prometheus ruler tries to restore the "firing" state for all alerts that are active. The way it does this is by:

  1. evaluating all groups twice (code):
  • once so most evaluate to active or non-active
  • once more to make sure that it has scraped enough metrics for alerts to be able to evaluate to active/non-active (in case of Mimir "to make sure that Mimir has ingested enough metrics for alerts to be able to ...")
  1. compares the rule groups' "for" period to its for grace period (code)
  • if the "for" duration is less than the grace period, the ruler just lets the alert evaluate as many times as needed by the "for" period
  • if the "for" duration is more than the grace period, the ruler queries the ALERTS_FOR_STATE metric to calculate if the alert should already be firing or how much longer it should be pending

When the ruler queries the ALERTS_FOR_STATE metric it does not pass any hints to the querier implementation (code). The bug is introduced as Mimir's querier implementation assumes that no hints == series lookup. And a series lookup returns no samples. This assumption was introduced in a cortex PR from 2018 as an optimization for versions of prometheus that use the remote-read API and pass no hints when they want just the series (PR).

Because the querier code returns no samples for the series, the ruler cannot correctly restore the state.

Which issue(s) this PR fixes or relates to

Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@@ -86,7 +86,10 @@ func (q *distributorQuerier) Select(_ bool, sp *storage.SelectHints, matchers ..
spanlog, ctx := spanlogger.NewWithLogger(q.ctx, q.logger, "distributorQuerier.Select")
defer spanlog.Finish()

minT, maxT := sp.Start, sp.End
minT, maxT := q.mint, q.maxt
if sp != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If querier.Select replaces nil sp with non-nil one, can sp be still nil here? (Just curious why is this change required)

Copy link
Contributor Author

@dimitarvdimitrov dimitarvdimitrov Aug 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needed purely for consistency. So that distributorQuerier is an independent implementation of Querier that works on its own. I may be talking nonsense in case the two are super coupled. I have not dived into the code

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They look independant, but they are always used together right now. I don't mind the change, I was just curious if it is needed. Feel free to keep it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also prefer to keep it, as Dimitar suggests.

CHANGELOG.md Outdated Show resolved Hide resolved
@pstibrany
Copy link
Member

Great find!

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
@pracucci pracucci self-requested a review August 19, 2022 15:07
Copy link
Collaborator

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked all calls to Select() both in Mimir and Prometheus, and LGTM. I just left a nit about a comment, thanks!

@@ -86,7 +86,10 @@ func (q *distributorQuerier) Select(_ bool, sp *storage.SelectHints, matchers ..
spanlog, ctx := spanlogger.NewWithLogger(q.ctx, q.logger, "distributorQuerier.Select")
defer spanlog.Finish()

minT, maxT := sp.Start, sp.End
minT, maxT := q.mint, q.maxt
if sp != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also prefer to keep it, as Dimitar suggests.

pkg/querier/querier.go Outdated Show resolved Hide resolved
Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
@pracucci pracucci enabled auto-merge (squash) August 19, 2022 15:28
@pracucci pracucci merged commit 778bfe2 into main Aug 19, 2022
@pracucci pracucci deleted the dimitar/ruler-restore-state-fix branch August 19, 2022 15:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants