-
Notifications
You must be signed in to change notification settings - Fork 526
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ruler: Add timeout for remote rule evaluation queries. #2090
Conversation
During a rule group evaluation, if the remote query never completes for any reason, then the rule group will become stuck and never evaluate again.
e83ed94
to
d4e5a3e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you. I was wondering if we should set timeout
parameter when running the query, but setting context deadline works too thanks to using httpgrpc
. Furthermore query-frontend doesn't even seem to be using timeout
parameter (which looks like bug, but for separate PR).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -211,6 +222,9 @@ func (q *RemoteQuerier) query(ctx context.Context, query string, ts time.Time, l | |||
} | |||
} | |||
|
|||
ctx, cancel := context.WithTimeout(ctx, q.timeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move this whole block :225-:236
into a new function and then call it from query()
and Read()
, otherwise this portion is duplicate and this would also make it easier to reason about where the timeout starts to apply.
Potentially the initialization of the middlewares could also be in that new function, but I'm not sure if this wouldn't dilute it's scope too much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was actually how I had it to start with, but it felt over complicated. I'll discuss with @ortuman if we want to do any refactoring here in a follow up PR.
@@ -1400,6 +1400,10 @@ query_frontend: | |||
# CLI flag: -ruler.query-frontend.address | |||
[address: <string> | default = ""] | |||
|
|||
# The timeout for a rule query being evaluated by the query-frontend. | |||
# CLI flag: -ruler.query-frontend.timeout | |||
[timeout: <duration> | default = 2m] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced about having yet another timeout config option. Why not reusing -querier.timeout
, documenting that needs to be set for ruler too when remote rule evaluation is used? We already use it in query-frontend too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created an issue: #2129
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-querier.timeout
is poorly named (perhaps "-query.timeout" would be better) but it would make sense to reuse it, since ruler already uses it when it's not configured to use query-frontend.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it's poorly named, but unfortunately changing it would be a breaking change (so to rename we would have to go through a deprecation process keeping backward compatibility for 2 releases). For the sake of config reduction, I think just reusing it is enough.
During a rule group evaluation, if the remote query never completes for any
reason, then the rule group will become stuck and never evaluate again.