Fix ruler alert state restoration #2648
@@ -86,7 +86,10 @@ func (q *distributorQuerier) Select(_ bool, sp *storage.SelectHints, matchers ..
 spanlog, ctx := spanlogger.NewWithLogger(q.ctx, q.logger, "distributorQuerier.Select")
 defer spanlog.Finish()

-minT, maxT := sp.Start, sp.End
+minT, maxT := q.mint, q.maxt
+if sp != nil {
If querier.Select replaces a nil sp with a non-nil one, can sp still be nil here? (Just curious why this change is required.)
Needed purely for consistency, so that distributorQuerier is an independent implementation of Querier that works on its own. I may be talking nonsense if the two are tightly coupled; I have not dived into the code.
They look independent, but they are always used together right now. I don't mind the change; I was just curious whether it is needed. Feel free to keep it.
I also prefer to keep it, as Dimitar suggests.
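The nil guard discussed in this thread can be illustrated with a minimal, self-contained sketch. The SelectHints and distributorQuerier types below are hypothetical stand-ins that model only the fields visible in the diff; they are not the real Prometheus or Mimir types, and timeRange is an illustrative helper name, not a function from the PR.

```go
package main

import "fmt"

// SelectHints is a hypothetical stand-in for storage.SelectHints;
// only the Start and End fields are modelled here.
type SelectHints struct {
	Start, End int64
}

// distributorQuerier is a hypothetical, stripped-down querier that only
// remembers the time range it was created with.
type distributorQuerier struct {
	mint, maxt int64
}

// timeRange starts from the querier's own range and lets the hints
// override it only when hints were actually passed, so a nil sp is
// never dereferenced.
func (q *distributorQuerier) timeRange(sp *SelectHints) (minT, maxT int64) {
	minT, maxT = q.mint, q.maxt
	if sp != nil {
		minT, maxT = sp.Start, sp.End
	}
	return minT, maxT
}

func main() {
	q := &distributorQuerier{mint: 1000, maxt: 2000}

	lo, hi := q.timeRange(nil)
	fmt.Println("no hints:", lo, hi)

	lo, hi = q.timeRange(&SelectHints{Start: 1200, End: 1800})
	fmt.Println("with hints:", lo, hi)
}
```

With this shape the querier behaves sensibly on its own, whether or not a caller has already replaced nil hints upstream.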
Great find!
Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
f783eee to 7076bb4
I checked all calls to Select() both in Mimir and Prometheus, and LGTM. I just left a nit about a comment, thanks!
What this PR does
Addresses a bug in which alert states are not restored.
After starting, the Prometheus ruler tries to restore the "firing" state for all alerts that are active. The way it does this is by:
querying the ALERTS_FOR_STATE metric to calculate whether the alert should already be firing or how much longer it should be pending
When the ruler queries the ALERTS_FOR_STATE metric, it does not pass any hints to the querier implementation (code). The bug arises because Mimir's querier implementation assumes that no hints means a series lookup, and a series lookup returns no samples. This assumption was introduced in a Cortex PR from 2018 as an optimization for versions of Prometheus that use the remote-read API and pass no hints when they want just the series (PR).
Because the querier code returns no samples for the series, the ruler cannot correctly restore the state.
Which issue(s) this PR fixes or relates to
Fixes #
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]