
Compactor: emit histogram of block max time delta to now #3240

Merged
2 commits merged into main from 56quarters/compactor-lag on Oct 19, 2022

Conversation

@56quarters (Contributor) commented on Oct 17, 2022:

What this PR does

Emit a histogram of the difference, in seconds, between now and the max time of each block being compacted. This can be used to alert if compactors are not able to keep up with the number of blocks being generated: when they fall behind, the delta will begin to fall into higher and higher buckets (> 24h).

Signed-off-by: Nick Pillitteri nick.pillitteri@grafana.com

Which issue(s) this PR fixes or relates to

N/A

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
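
As an illustration of the mechanism described above, here is a minimal, self-contained sketch of how such a histogram could be defined and fed using prometheus/client_golang. The metric name matches the one queried later in this thread; the bucket layout, the blockMeta type, and the sample values are assumptions of the sketch, not the PR's actual code.

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// blockMeta stands in for the compactor's block metadata; for simplicity,
// MaxTime here is a Unix timestamp in seconds (an assumption of this sketch).
type blockMeta struct {
	MaxTime int64
}

func main() {
	reg := prometheus.NewRegistry()

	// Histogram of "now - block max time", in seconds. The bucket layout
	// (1d to 4.5d in 12h steps) is an illustrative guess, not the PR's buckets.
	blocksMaxTimeDelta := promauto.With(reg).NewHistogram(prometheus.HistogramOpts{
		Name:    "cortex_compactor_block_max_time_delta_seconds",
		Help:    "Difference between now and the max time of a block being compacted, in seconds.",
		Buckets: prometheus.LinearBuckets(86400, 43200, 8),
	})

	// When planning compaction jobs, observe one delta per block.
	now := time.Now().Unix()
	for _, meta := range []blockMeta{{MaxTime: now - 2*3600}, {MaxTime: now - 30*3600}} {
		blocksMaxTimeDelta.Observe(float64(now - meta.MaxTime))
	}
}
```

With buckets like these, a compactor that keeps up should put nearly all observations into the lowest bucket, while a lagging one pushes them into progressively higher buckets, which is what the queries discussed below look for.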

@pracucci (Collaborator) left a comment:

The code change looks correct to me. I just left a question to better understand it.

now := time.Now().Unix()
for _, gr := range jobs {
    for _, meta := range gr.Metas() {
        c.metrics.blocksMaxTimeDelta.Observe(float64(now - meta.MaxTime))
@pracucci (Collaborator) commented on these lines:

How do you plan to use this new metric? Could you show a sample query to alert on?

I'm asking because BucketCompactor.Compact() is called once per tenant. If a tenant is lagging behind but the other tenants are fine, you will only track a timestamp for an old block every once in a while, and from the alert query it may be difficult to tell whether the issue has been resolved (it may not be resolved, yet not even tracked again until that tenant's blocks are compacted next).

I'm not saying this approach is wrong, but I think reviewing this without the context of how the metric will be queried is a bit difficult.

@56quarters (Contributor, Author) replied:

The query would be pretty simple: making sure that, e.g., the p75 of the deltas is less than 48h (I'm not sure what the actual percentile or delta threshold should be, I need to test this first). For the alert, maybe something like:

histogram_quantile(0.75, sum by (le) (rate(cortex_compactor_block_max_time_delta_seconds_bucket{namespace="$namespace"}[$__rate_interval]))) / 3600 > 48

You are correct, this wouldn't help identify when an individual tenant is lagging but most others are not. My thinking was that this would be useful for detecting when compactors are under-scaled for a cell. For example, I suspect the recent performance issues in prod-10 would have been caught by something like this.

@pracucci (Collaborator) replied:

Right. We've never been able to come up with a very good metric for this (yet). I think this new metric can be useful in some scenarios, maybe not so much in others, but until we have a better way to actually measure how far the compactor is lagging behind for real, I'm 👍 to add it.

@pracucci (Collaborator) left a comment:

LGTM, but please add a CHANGELOG entry.

@56quarters 56quarters marked this pull request as ready for review October 18, 2022 23:19
@56quarters 56quarters requested a review from a team as a code owner October 18, 2022 23:19
@56quarters (Contributor, Author) commented on Oct 18, 2022:

After running this in dev for a while, the results are about what you'd expect: every compacted block has a delta of < 24h, so all observations land in the lowest (24h) bucket and histogram_quantile interpolates the p75 to 0.75 * 24h = 18h.
[Screenshot: Grafana Explore, cortex-dev-01-dev-us-central-0, 2022-10-18]

@pracucci (Collaborator) replied:

Keep in mind that to alert on it you will have to look at the max_over_time() of the quantile over a reasonably large time window.
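
As a hedged sketch of what this suggests (not something from this PR), the alert could wrap the quantile in a subquery so that a short-lived dip back to low values doesn't immediately resolve it; the 1h rate window, the 6h outer window, the 10m resolution, and the 48h threshold are placeholders to tune:

```promql
max_over_time(
  (
    histogram_quantile(
      0.75,
      sum by (le) (rate(cortex_compactor_block_max_time_delta_seconds_bucket[1h]))
    )
  )[6h:10m]
) / 3600 > 48
```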

@pracucci pracucci merged commit 61ca998 into main Oct 19, 2022
@pracucci pracucci deleted the 56quarters/compactor-lag branch October 19, 2022 06:39