Compactor: emit histogram of block max time delta to now #3240
Conversation
The code change looks correct to me. I just left a question to better understand it.
now := time.Now().Unix()
for _, gr := range jobs {
	for _, meta := range gr.Metas() {
		c.metrics.blocksMaxTimeDelta.Observe(float64(now - meta.MaxTime))
	}
}
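For context on what `Observe()` does with these deltas, here is a stdlib-only Go sketch of the cumulative bucketing a Prometheus histogram applies. The bucket bounds below are hypothetical for illustration, not the ones the compactor actually registers:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical cumulative bucket upper bounds, in seconds (1h .. 1w).
// The real metric's buckets are defined where the histogram is registered.
var buckets = []float64{3600, 4 * 3600, 12 * 3600, 24 * 3600, 48 * 3600, 7 * 24 * 3600}

// observe returns the index of the first bucket whose upper bound is >= delta,
// or len(buckets) for the implicit +Inf bucket. A Prometheus histogram
// increments that bucket and every larger one, which is why queries use the
// cumulative _bucket series.
func observe(delta float64) int {
	for i, le := range buckets {
		if delta <= le {
			return i
		}
	}
	return len(buckets)
}

func main() {
	now := time.Now().Unix()
	// A block whose max time is 30h in the past lands in the 48h bucket.
	maxTime := now - int64(30*time.Hour/time.Second)
	fmt.Println(observe(float64(now - maxTime))) // prints 4 (the 48h bucket)
}
```

With these example bounds, a compactor keeping up would see most deltas land in the first few buckets, while a lagging one shifts observations toward the 48h and +Inf buckets.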
How do you plan to use this new metric? Could you show a sample query to alert on?
I'm asking because `BucketCompactor.Compact()` is called for each tenant. If one tenant is lagging behind while the other tenants are fine, you will only track a timestamp for an old block every once in a while, and from the alert query it may be difficult to tell whether that issue is resolved or not (it may not be resolved, but also not tracked until that tenant's blocks are compacted again).
I'm not saying this approach is wrong, but I think reviewing this without having the context of how this metric will be queried is a bit difficult.
The query used would be pretty simple: making sure that the p75 of deltas stays under 48h (I'm not sure what the actual percentile or delta value should be, I need to test this first). For the alert, maybe something like:
histogram_quantile(0.75, sum by (le) (rate(cortex_compactor_block_max_time_delta_seconds_bucket{namespace="$namespace"}[$__rate_interval]))) / 3600 > 48
You are correct, this wouldn't help identify when an individual tenant is lagging but most others are not. My thinking was that this would be useful for detecting when compactors are under-scaled for a cell. For example, I suspect the recent performance issues in prod-10 would have been caught by something like this.
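As a side note on reading that query: `histogram_quantile` linearly interpolates within the bucket where the target rank falls, so the reported p75 is only as precise as the bucket layout around the 48h threshold. A self-contained Go sketch of that interpolation (the example buckets and counts below are made up, not taken from the actual metric):

```go
package main

import "fmt"

// histogramQuantile mimics PromQL's histogram_quantile: given cumulative
// bucket counts and their upper bounds (les), find the bucket containing the
// target rank and linearly interpolate within it.
func histogramQuantile(q float64, les, cumCounts []float64) float64 {
	total := cumCounts[len(cumCounts)-1]
	rank := q * total
	var lowerBound, countBelow float64
	for i, c := range cumCounts {
		if c >= rank {
			bucketCount := c - countBelow
			return lowerBound + (les[i]-lowerBound)*(rank-countBelow)/bucketCount
		}
		lowerBound, countBelow = les[i], c
	}
	return les[len(les)-1] // rank fell beyond the last finite bucket
}

func main() {
	// Hypothetical delta buckets with upper bounds 1h, 24h, 48h (in seconds)
	// and cumulative counts 10, 90, 100.
	les := []float64{3600, 86400, 172800}
	counts := []float64{10, 90, 100}
	fmt.Printf("p75 ≈ %.0fs\n", histogramQuantile(0.75, les, counts))
}
```

With these made-up counts the interpolated p75 lands at 70875s (about 19.7h), comfortably under a 48h threshold; the coarser the buckets near 48h, the more the interpolated value can drift from the true percentile.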
Right. We've never been able to come up with a very good metric for this (yet). I think this new metric can be useful in some scenarios, maybe not so much in others, but until we have a better way to actually measure how much the compactor is lagging behind for real, I'm 👍 to add it.
LGTM, but please add a CHANGELOG entry.
Keep in mind that to alert on it you will have to look at the
What this PR does
Emit a histogram of the difference between `now` and the max time of a block being compacted, as a number of seconds. This can be used to alert if compactors are not able to keep up with the number of blocks being generated. In this case, the delta will begin to fall into higher and higher buckets (> 24h).

Signed-off-by: Nick Pillitteri nick.pillitteri@grafana.com
Which issue(s) this PR fixes or relates to
N/A
Checklist
- `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`