
Compactor: emit histogram of block max time delta to now #3240

Merged
2 commits merged into main from 56quarters/compactor-lag on Oct 19, 2022

Conversation

@56quarters (Contributor) commented on Oct 17, 2022:

What this PR does

Emit a histogram of the difference, in seconds, between now and the max time of each block being compacted. This can be used to alert if compactors are not able to keep up with the number of blocks being generated: when they fall behind, the delta will begin to fall into higher and higher buckets (> 24h).

Signed-off-by: Nick Pillitteri nick.pillitteri@grafana.com

Which issue(s) this PR fixes or relates to

N/A

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
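
As an illustration of the mechanism described above, here is a minimal, self-contained sketch of how such a histogram could be defined and fed using prometheus/client_golang. The metric name matches the one queried later in this thread; the bucket layout, the blockMeta type, and the sample values are assumptions of the sketch, not the PR's actual code.

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// blockMeta stands in for the compactor's block metadata; for simplicity,
// MaxTime here is a Unix timestamp in seconds (an assumption of this sketch).
type blockMeta struct {
	MaxTime int64
}

func main() {
	reg := prometheus.NewRegistry()

	// Histogram of "now - block max time", in seconds. The bucket layout
	// (1d to 4.5d in 12h steps) is an illustrative guess, not the PR's buckets.
	blocksMaxTimeDelta := promauto.With(reg).NewHistogram(prometheus.HistogramOpts{
		Name:    "cortex_compactor_block_max_time_delta_seconds",
		Help:    "Difference between now and the max time of a block being compacted, in seconds.",
		Buckets: prometheus.LinearBuckets(86400, 43200, 8),
	})

	// When planning compaction jobs, observe one delta per block.
	now := time.Now().Unix()
	for _, meta := range []blockMeta{{MaxTime: now - 2*3600}, {MaxTime: now - 30*3600}} {
		blocksMaxTimeDelta.Observe(float64(now - meta.MaxTime))
	}
}
```

With buckets like these, a compactor that keeps up should put nearly all observations into the lowest bucket, while a lagging one pushes them into progressively higher buckets, which is what the queries discussed below look for.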

@pracucci (Collaborator) left a comment:

The code change looks correct to me. I just left a question to better understand it.

now := time.Now().Unix()
for _, gr := range jobs {
    for _, meta := range gr.Metas() {
        c.metrics.blocksMaxTimeDelta.Observe(float64(now - meta.MaxTime))
@pracucci (Collaborator) commented on these lines:

How do you plan to use this new metric? Could you show a sample query to alert on?

I'm asking because BucketCompactor.Compact() is called once per tenant. If a tenant is lagging behind but the other tenants are fine, you will only track a timestamp for an old block every once in a while, and from the alert query it may be difficult to tell whether the issue has been resolved (it may not be resolved, yet not even tracked again until that tenant's blocks are compacted next).

I'm not saying this approach is wrong, but I think reviewing this without the context of how the metric will be queried is a bit difficult.

@56quarters (Contributor, Author) replied:

The query would be pretty simple: making sure that, e.g., the p75 of the deltas is less than 48h (I'm not sure what the actual percentile or delta threshold should be, I need to test this first). For the alert, maybe something like:

histogram_quantile(0.75, sum by (le) (rate(cortex_compactor_block_max_time_delta_seconds_bucket{namespace="$namespace"}[$__rate_interval]))) / 3600 > 48

You are correct, this wouldn't help identify when an individual tenant is lagging but most others are not. My thinking was that this would be useful for detecting when compactors are under-scaled for a cell. For example, I suspect the recent performance issues in prod-10 would have been caught by something like this.

@pracucci (Collaborator) replied:

Right. We've never been able to come up with a very good metric for this (yet). I think this new metric can be useful in some scenarios, maybe not so much in others, but until we have a better way to actually measure how far the compactor is lagging behind for real, I'm 👍 to add it.

@pracucci (Collaborator) left a comment:

LGTM, but please add a CHANGELOG entry.

@56quarters 56quarters marked this pull request as ready for review October 18, 2022 23:19
@56quarters 56quarters requested a review from a team as a code owner October 18, 2022 23:19
@56quarters (Contributor, Author) commented on Oct 18, 2022:

After running this in dev for a while, the results are about what you'd expect: every compacted block has a delta of < 24h, so all observations land in the lowest (24h) bucket and histogram_quantile interpolates the p75 to 0.75 * 24h = 18h.
[Screenshot: Grafana Explore, cortex-dev-01-dev-us-central-0, 2022-10-18]

@pracucci (Collaborator) replied:

Keep in mind that to alert on it you will have to look at the max_over_time() of the quantile over a reasonably large time window.
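
As a hedged sketch of what this suggests (not something from this PR), the alert could wrap the quantile in a subquery so that a short-lived dip back to low values doesn't immediately resolve it; the 1h rate window, the 6h outer window, the 10m resolution, and the 48h threshold are placeholders to tune:

```promql
max_over_time(
  (
    histogram_quantile(
      0.75,
      sum by (le) (rate(cortex_compactor_block_max_time_delta_seconds_bucket[1h]))
    )
  )[6h:10m]
) / 3600 > 48
```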

@pracucci pracucci merged commit 61ca998 into main Oct 19, 2022
@pracucci pracucci deleted the 56quarters/compactor-lag branch October 19, 2022 06:39