Make CortexIngesterReachingSeriesLimit warning less sensitive #362

Merged
merged 1 commit into from
Jul 28, 2021

@beorn7 beorn7 commented Jul 26, 2021

What this PR does:

As it turns out, during normal shuffle-sharding operation, the 70%
mark is often exceeded, but not by much. Therefore, this change sets
the new warning mark at 75%. It also increases the `for` duration to
15m, as the expected reaction time for warning alerts is usually on
the order of hours, so we may as well wait a bit longer to see if the
problem is transient.

Which issue(s) this PR fixes:

n/a

Checklist

  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@beorn7 beorn7 requested a review from pracucci July 26, 2021 17:12
@beorn7 beorn7 requested a review from a team as a code owner July 26, 2021 17:12
Collaborator

@pracucci pracucci left a comment

Thanks Beorn for working on this. I think what we want from this warning alert is to find ingesters that are constantly above the threshold even after stale series are flushed (which occurs every 2h, when the TSDB head is compacted). When the TSDB head is compacted, we flush series with a timestamp between [-3h, -1h], so in the worst case it takes up to 3h to flush stale series. So I would rather keep the 70% threshold, but with a `for` duration of 3h.

What do you think?

As it turns out, during normal shuffle-sharding operation, the 70%
mark is often exceeded, but not by much. Rather than increasing the
threshold to 75%, this commit increases the `for` duration to 3h,
following the thought that we want this alert to fire if ingesters are
constantly above the threshold even after stale series are flushed
(which occurs every 2h, when the TSDB head is compacted). We flush
series with a timestamp between [-3h, -1h] after the last compaction,
so the worst case scenario is that it takes 3h to flush a stale
series.

Signed-off-by: beorn7 <beorn@grafana.com>
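
To illustrate the reasoning in the commit message, here is a minimal Python sketch of how a `for` duration filters out transient threshold crossings. It is not the mixin's actual rule: the function name `alert_fires`, the one-minute evaluation interval, and the sample utilisation values are all assumptions made up for this example. It only models the general behaviour that an alert fires once its condition has held continuously for the `for` duration and resets as soon as the condition stops holding.

```python
from datetime import timedelta


def alert_fires(samples, threshold, for_duration, eval_interval):
    """Return True if the condition holds continuously for at least for_duration.

    samples: hypothetical series-limit utilisation ratios (0.0 to 1.0),
             one per rule evaluation, oldest first.
    """
    active_since = None  # index of the evaluation where the condition became true
    for i, ratio in enumerate(samples):
        if ratio > threshold:
            if active_since is None:
                active_since = i
            # Fire once the condition has been true for the whole `for` duration.
            if (i - active_since) * eval_interval >= for_duration:
                return True
        else:
            # Any evaluation below the threshold resets the pending state.
            active_since = None
    return False


# Hypothetical data, one sample per minute (assumed evaluation interval).
one_minute = timedelta(minutes=1)
# Ingester sits at 72% for 2.5h, then drops once stale series are flushed.
transient = [0.72] * 150 + [0.60] * 30
# Ingester stays at 72% for 4h, i.e. longer than the 3h worst case.
persistent = [0.72] * 240

print(alert_fires(transient, 0.70, timedelta(hours=3), one_minute))
print(alert_fires(persistent, 0.70, timedelta(hours=3), one_minute))
```

With a 3h `for` duration, the first series never keeps the alert pending long enough to fire, while the second does, which is the distinction the 70% threshold with `for: 3h` is meant to capture.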
CHANGELOG.md Outdated
@@ -29,6 +29,7 @@
* [ENHANCEMENT] cortex-mixin: Added `alert_excluded_routes` config to exclude specific routes from alerts. #338
* [ENHANCEMENT] Added `CortexMemcachedRequestErrors` alert. #346
* [ENHANCEMENT] Ruler dashboard: added "Per route p99 latency" panel in the "Configuration API" row. #353
* [ENHANCEMENT] Tweaked threshould and `for` duration for `CortexIngesterReachingSeriesLimit` warning alert. #362
Collaborator

[nit]

Suggested change
- * [ENHANCEMENT] Tweaked threshould and `for` duration for `CortexIngesterReachingSeriesLimit` warning alert. #362
+ * [ENHANCEMENT] Tweaked `for` duration for `CortexIngesterReachingSeriesLimit` warning alert. #362

@beorn7
Author

beorn7 commented Jul 27, 2021

Thanks for your thoughts. Let's make it so!

I have updated the CHANGELOG.md and the commit description accordingly.

@beorn7
Author

beorn7 commented Jul 27, 2021

Note: I cannot merge this because I'm not authorized. (I guess @pracucci you are. ;-)

@pracucci pracucci merged commit 9ba9ac9 into main Jul 28, 2021
@pracucci pracucci deleted the beorn7/alerting branch July 28, 2021 07:24
simonswine pushed a commit to grafana/mimir that referenced this pull request Oct 18, 2021
…rting

Make CortexIngesterReachingSeriesLimit warning less sensitive