-
Notifications
You must be signed in to change notification settings - Fork 8.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature: Allow configuration of a rule evaluation delay #14061
Changes from all commits
7e9a108
2151a68
64ac4ad
3781b1b
ef54d80
6402e02
c8233f8
59dbf93
6cf8094
22c0812
b747eeb
11887ac
d8eea8c
2982c1f
8fb75cc
17cce29
975e040
ec0cefb
c869b67
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -47,6 +47,7 @@ type Group struct { | |
name string | ||
file string | ||
interval time.Duration | ||
queryOffset *time.Duration | ||
limit int | ||
rules []Rule | ||
seriesInPreviousEval []map[string]labels.Labels // One per Rule. | ||
|
@@ -90,6 +91,7 @@ type GroupOptions struct { | |
Rules []Rule | ||
ShouldRestore bool | ||
Opts *ManagerOptions | ||
QueryOffset *time.Duration | ||
done chan struct{} | ||
EvalIterationFunc GroupEvalIterationFunc | ||
} | ||
|
@@ -126,6 +128,7 @@ func NewGroup(o GroupOptions) *Group { | |
name: o.Name, | ||
file: o.File, | ||
interval: o.Interval, | ||
queryOffset: o.QueryOffset, | ||
limit: o.Limit, | ||
rules: o.Rules, | ||
shouldRestore: o.ShouldRestore, | ||
|
@@ -443,6 +446,8 @@ func (g *Group) Eval(ctx context.Context, ts time.Time) { | |
wg sync.WaitGroup | ||
) | ||
|
||
ruleQueryOffset := g.QueryOffset() | ||
|
||
for i, rule := range g.rules { | ||
select { | ||
case <-g.done: | ||
|
@@ -473,7 +478,7 @@ func (g *Group) Eval(ctx context.Context, ts time.Time) { | |
|
||
g.metrics.EvalTotal.WithLabelValues(GroupKey(g.File(), g.Name())).Inc() | ||
|
||
vector, err := rule.Eval(ctx, ts, g.opts.QueryFunc, g.opts.ExternalURL, g.Limit()) | ||
vector, err := rule.Eval(ctx, ruleQueryOffset, ts, g.opts.QueryFunc, g.opts.ExternalURL, g.Limit()) | ||
if err != nil { | ||
rule.SetHealth(HealthBad) | ||
rule.SetLastError(err) | ||
|
@@ -562,7 +567,7 @@ func (g *Group) Eval(ctx context.Context, ts time.Time) { | |
for metric, lset := range g.seriesInPreviousEval[i] { | ||
if _, ok := seriesReturned[metric]; !ok { | ||
// Series no longer exposed, mark it stale. | ||
_, err = app.Append(0, lset, timestamp.FromTime(ts), math.Float64frombits(value.StaleNaN)) | ||
_, err = app.Append(0, lset, timestamp.FromTime(ts.Add(-ruleQueryOffset)), math.Float64frombits(value.StaleNaN)) | ||
unwrappedErr := errors.Unwrap(err) | ||
if unwrappedErr == nil { | ||
unwrappedErr = err | ||
|
@@ -601,14 +606,27 @@ func (g *Group) Eval(ctx context.Context, ts time.Time) { | |
g.cleanupStaleSeries(ctx, ts) | ||
} | ||
|
||
func (g *Group) QueryOffset() time.Duration { | ||
if g.queryOffset != nil { | ||
return *g.queryOffset | ||
} | ||
|
||
if g.opts.DefaultRuleQueryOffset != nil { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Shouldn't we add some pre checks to ensure that we'll be able to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure I follow, can you please elaborate? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Adding some validation to make sure that the "append in the past" will be possible (with regard to minValidTime and OOO config as well) otherwise they would be rejected. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I don't think I get this. Why does this matter? Assuming we have a
Your initial rule evaluation would be delayed by 5m, but afterwards, it'll continue to be evaluated every 1m, but for a timestamp 5m in the past.
I'm not entirely sure this is needed; you'd have 3 signals to indicate that something went wrong:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yes, I'm aware that everything will be shifted. I was just considering that perhaps we might want to "guarantee" that when delay is enabled and Prometheus is shut down, the window of data that won't be evaluated is
Yes, I just thought that "failing early" would be better, but I think some of the parameters that control the "append in the past" may change dynamically, which could make spotting that more difficult. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Your feedback is very valid. Would you mind opening up an issue with your first point? If it becomes a wider problem (I can't think of a world where this makes sense) to use offsets so long that this matters. I think we can fix this by looking at the last evaluation as we restart, but we've never had this feedback before, which makes me think that this is not as relevant as we think it is and is not a straightforward piece of work. |
||
return g.opts.DefaultRuleQueryOffset() | ||
} | ||
|
||
return time.Duration(0) | ||
} | ||
|
||
func (g *Group) cleanupStaleSeries(ctx context.Context, ts time.Time) { | ||
if len(g.staleSeries) == 0 { | ||
return | ||
} | ||
app := g.opts.Appendable.Appender(ctx) | ||
queryOffset := g.QueryOffset() | ||
for _, s := range g.staleSeries { | ||
// Rule that produced series no longer configured, mark it stale. | ||
_, err := app.Append(0, s, timestamp.FromTime(ts), math.Float64frombits(value.StaleNaN)) | ||
_, err := app.Append(0, s, timestamp.FromTime(ts.Add(-queryOffset)), math.Float64frombits(value.StaleNaN)) | ||
unwrappedErr := errors.Unwrap(err) | ||
if unwrappedErr == nil { | ||
unwrappedErr = err | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be a pointer? For the
Interval
above, we just use the zero value to indicate that it's unset (and theomitempty
ensures that it's not serialized then).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this can potentially override the configuration value, we need to differentiate between a zero value (which means override whatever yo have in the global config) and no value (which means use the global configuration option). What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, right. So for example if you have a 5m offset in the global config, but you want a 0m offset in a specific rule group. Yeah, that makes sense, thanks for the explanation :) 👍