Add CortexRolloutStuck alert #405

Merged
merged 2 commits into from
Oct 14, 2021

pracucci
Collaborator

What this PR does:
This PR adds the CortexRolloutStuck alert, which fires if the rollout of a Cortex StatefulSet or Deployment is stuck. I've run the queries manually and they should work as expected. The alert starts as a warning so we can gain confidence in it, but the end goal is to run it as critical.

Which issue(s) this PR fixes:
N/A

Checklist

  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Marco Pracucci <marco@pracucci.com>
@pracucci pracucci requested a review from a team as a code owner October 13, 2021 08:53
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Contributor

@simonswine simonswine left a comment

LGTM

Comment on lines +421 to +426
(
max without (revision) (
kube_statefulset_status_current_revision
unless
kube_statefulset_status_update_revision
)
Contributor

I'm not sure why StatefulSets need that revision check, but I guess it comes from upstream: https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/6c72589035f4f49674a56cf97a3ec1a02f14671a/alerts/apps_alerts.libsonnet#L128

So it should be ok 🙂

Collaborator Author

Yeah, I learned it from there.
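
For readers wondering what the revision check in the excerpt above does, here is a rough Python sketch of the PromQL semantics involved. It is an illustration, not the code from this PR. By default, `unless` keeps a left-hand series only when no right-hand series has the same label set (the metric name is ignored), so a StatefulSet survives the filter only while its current revision differs from its update revision. `max without (revision)` then drops the `revision` label. The namespace, StatefulSet names and revision values below are made up, and the sketch covers only the excerpted lines, not the rest of the alert expression.

```python
def unless(left, right):
    """Mimic PromQL `unless`: keep left series with no identical label set on the right."""
    right_keys = {frozenset(labels.items()) for labels in right}
    return [labels for labels in left if frozenset(labels.items()) not in right_keys]


def without_revision(series):
    """Mimic the label handling of `max without (revision)`: drop the label, merge duplicates."""
    merged = []
    for labels in series:
        reduced = {k: v for k, v in labels.items() if k != "revision"}
        if reduced not in merged:
            merged.append(reduced)
    return merged


def statefulsets_mid_rollout(current, update):
    """StatefulSets whose current revision has no matching update revision."""
    return without_revision(unless(current, update))


# Hypothetical label sets for kube_statefulset_status_current_revision ...
current = [
    {"namespace": "cortex", "statefulset": "ingester", "revision": "ingester-abc"},
    {"namespace": "cortex", "statefulset": "store-gateway", "revision": "store-gateway-123"},
]
# ... and for kube_statefulset_status_update_revision.
update = [
    {"namespace": "cortex", "statefulset": "ingester", "revision": "ingester-def"},
    {"namespace": "cortex", "statefulset": "store-gateway", "revision": "store-gateway-123"},
]

candidates = statefulsets_mid_rollout(current, update)
print(candidates)  # only the ingester, whose two revisions differ
```

In this toy data the store-gateway is filtered out because both metrics carry the same `revision` label, so only a StatefulSet that is still moving between revisions remains as a candidate.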

@pracucci pracucci merged commit 306c081 into main Oct 14, 2021
@pracucci pracucci deleted the alert-on-stuck-rollout branch October 14, 2021 07:46
simonswine pushed a commit to grafana/mimir that referenced this pull request Oct 18, 2021
…tuck-rollout

Add CortexRolloutStuck alert