
Update mixin for TempoIngesterFlushes thresholds #1354

Merged
zalegrala merged 5 commits into grafana:main from the unhealthyIngesterFlush branch on Apr 18, 2022

Conversation

zalegrala (Contributor) commented Mar 31, 2022

What this PR does:

  • replace the existing critical alert for TempoIngesterFlushes with a warning, and lengthen the critical threshold duration
  • replace the metric used for TempoIngesterFlushes to only measure retries (rough before/after sketch below)
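
Concretely, the metric swap changes the counter used in the alert expression. A rough before/after sketch of the relevant line (the %s placeholders stand for the mixin's aggregation labels and configurable threshold, as in the existing rules):

    # before: counts every failed flush
    sum by (%s) (increase(tempo_ingester_failed_flushes_total{}[1h])) > %s

    # after: counts only failed flush retries
    sum by (%s) (increase(tempo_ingester_flush_failed_retries_total{}[1h])) > %s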

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@zalegrala zalegrala marked this pull request as draft March 31, 2022 19:42
@zalegrala zalegrala force-pushed the unhealthyIngesterFlush branch 2 times, most recently from 668a33d to cc59827, on April 1, 2022 20:19
@zalegrala zalegrala changed the title Update TempoIngesterFlushes thresholds and include warning Update mixin for TempoIngesterFlushes thresholds Apr 1, 2022
@zalegrala zalegrala marked this pull request as ready for review April 4, 2022 13:00
severity: 'warning',
},
annotations: {
message: 'Greater than %s flushes have failed in the past hour.' % $._config.alerts.flushes_per_hour_failed,
Contributor

I think it would be good to tweak the message so the warning and critical messages are different. I don't have a recommendation though... any ideas?

Contributor Author

Good idea, I'll wordsmith around a little.

Contributor Author

Updated a little. How's that read to you now?
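
For reference only, and not necessarily the wording that landed: one way to distinguish the two annotations would be to mention retries in the critical message, for example:

    // warning annotation, as in the diff above
    message: 'Greater than %s flushes have failed in the past hour.' % $._config.alerts.flushes_per_hour_failed,

    // critical annotation, hypothetical wording; the same threshold key is reused here purely for illustration
    message: 'Greater than %s flush retries have failed in the past hour.' % $._config.alerts.flushes_per_hour_failed,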

expr: |||
sum by (%s) (increase(tempo_ingester_failed_flushes_total{}[1h])) > %s and
sum by (%s) (increase(tempo_ingester_failed_flushes_total{}[5m])) > 0
sum by (%s) (increase(tempo_ingester_flush_failed_retries_total{}[1h])) > %s and
Member

what do we think about keying the "unhealthy" warning alert on the old metric and the "failing" critical alert on the retries metric?

then for consistency keep the for: 5m on both?
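
Roughly how that could be laid out in the mixin (a sketch only; the group_by_cluster field and threshold key are placeholder names, and the alert names just follow the "unhealthy"/"failing" wording above):

    // warning: keyed on the original failed-flushes counter
    {
      alert: 'TempoIngesterFlushesUnhealthy',
      expr: |||
        sum by (%s) (increase(tempo_ingester_failed_flushes_total{}[1h])) > %s
      ||| % [$._config.group_by_cluster, $._config.alerts.flushes_per_hour_failed],
      'for': '5m',
      labels: { severity: 'warning' },
    },
    // critical: keyed on the failed-retries counter
    {
      alert: 'TempoIngesterFlushesFailing',
      expr: |||
        sum by (%s) (increase(tempo_ingester_flush_failed_retries_total{}[1h])) > %s
      ||| % [$._config.group_by_cluster, $._config.alerts.flushes_per_hour_failed],
      'for': '5m',
      labels: { severity: 'critical' },
    },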

Contributor Author

That could be reasonable. The goal being to always know when a flush failed at a warning level, but only get paged when a retry fails for the 5m duration?

Member

Yeah, that's what i was thinking. 👍

Contributor Author

Okay, I've made the update.

@zalegrala zalegrala merged commit d4c5a7c into grafana:main Apr 18, 2022