
Update mixin for TempoIngesterFlushes thresholds #1354

Merged
zalegrala merged 5 commits into grafana:main from the unhealthyIngesterFlush branch on Apr 18, 2022

Conversation

zalegrala (Contributor) commented Mar 31, 2022

What this PR does:

  • replace the existing critical alert for TempoIngesterFlushes with a warning, and lengthen the critical threshold duration
  • replace the metric used for TempoIngesterFlushes to only measure retries (rough before/after sketch below)
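
Concretely, the metric swap changes the counter used in the alert expression. A rough before/after sketch of the relevant line (the %s placeholders stand for the mixin's aggregation labels and configurable threshold, as in the existing rules):

    # before: counts every failed flush
    sum by (%s) (increase(tempo_ingester_failed_flushes_total{}[1h])) > %s

    # after: counts only failed flush retries
    sum by (%s) (increase(tempo_ingester_flush_failed_retries_total{}[1h])) > %s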

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@zalegrala zalegrala marked this pull request as draft March 31, 2022 19:42
@zalegrala zalegrala force-pushed the unhealthyIngesterFlush branch 2 times, most recently from 668a33d to cc59827, on April 1, 2022 20:19
@zalegrala zalegrala changed the title Update TempoIngesterFlushes thresholds and include warning Update mixin for TempoIngesterFlushes thresholds Apr 1, 2022
@zalegrala zalegrala marked this pull request as ready for review April 4, 2022 13:00
severity: 'warning',
},
annotations: {
message: 'Greater than %s flushes have failed in the past hour.' % $._config.alerts.flushes_per_hour_failed,
Contributor

I think it would be good to tweak the message so the warning and critical messages are different. I don't have a recommendation though... any ideas?

Contributor Author

Good idea, I'll wordsmith around a little.

Contributor Author

Updated a little. How's that read to you now?
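
For reference only, and not necessarily the wording that landed: one way to distinguish the two annotations would be to mention retries in the critical message, for example:

    // warning annotation, as in the diff above
    message: 'Greater than %s flushes have failed in the past hour.' % $._config.alerts.flushes_per_hour_failed,

    // critical annotation, hypothetical wording; the same threshold key is reused here purely for illustration
    message: 'Greater than %s flush retries have failed in the past hour.' % $._config.alerts.flushes_per_hour_failed,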

expr: |||
sum by (%s) (increase(tempo_ingester_failed_flushes_total{}[1h])) > %s and
sum by (%s) (increase(tempo_ingester_failed_flushes_total{}[5m])) > 0
sum by (%s) (increase(tempo_ingester_flush_failed_retries_total{}[1h])) > %s and
Member

what do we think about keying the "unhealthy" warning alert on the old metric and the "failing" critical alert on the retries metric?

then for consistency keep the for: 5m on both?
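
Roughly how that could be laid out in the mixin (a sketch only; the group_by_cluster field and threshold key are placeholder names, and the alert names just follow the "unhealthy"/"failing" wording above):

    // warning: keyed on the original failed-flushes counter
    {
      alert: 'TempoIngesterFlushesUnhealthy',
      expr: |||
        sum by (%s) (increase(tempo_ingester_failed_flushes_total{}[1h])) > %s
      ||| % [$._config.group_by_cluster, $._config.alerts.flushes_per_hour_failed],
      'for': '5m',
      labels: { severity: 'warning' },
    },
    // critical: keyed on the failed-retries counter
    {
      alert: 'TempoIngesterFlushesFailing',
      expr: |||
        sum by (%s) (increase(tempo_ingester_flush_failed_retries_total{}[1h])) > %s
      ||| % [$._config.group_by_cluster, $._config.alerts.flushes_per_hour_failed],
      'for': '5m',
      labels: { severity: 'critical' },
    },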

Contributor Author

That could be reasonable. The goal being to always know when a flush failed at a warning level, but only get paged when a retry fails for the 5m duration?

Member

Yeah, that's what i was thinking. 👍

Contributor Author

Okay, I've made the update.

@zalegrala zalegrala merged commit d4c5a7c into grafana:main Apr 18, 2022