Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lower CortexIngesterRestarts severity #321

Merged
merged 2 commits into from
Jun 8, 2021

Conversation

pracucci
Copy link
Collaborator

@pracucci pracucci commented Jun 7, 2021

What this PR does:
Last night I've been paged by CortexIngesterRestarts but it was a false positive caused by a K8S cluster downscaling triggered by the node autoscaler. There was no impact on the Cortex cluster availability, given the ingester pod disruption budget was honored.

It's not the first time this is happening and, in this PR, I'm proposing to lower CortexIngesterRestarts severity from critical to warning. An ingester restart is not an issue per se. The signal may still be useful in case of a cluster outage, but we shouldn't have a critical alert on it.

Which issue(s) this PR fixes:
N/A

Checklist

  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Marco Pracucci <marco@pracucci.com>
@pracucci pracucci requested a review from a team as a code owner June 7, 2021 06:55
Copy link
Member

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, downscale is the cause, restart is a symptom. Typically on downscale I'd expect single restart, not multiple. We have caught issues with this alert before (mostly in Loki, but still). I would keep it as is, and perhaps tune the parameters.

@pracucci
Copy link
Collaborator Author

pracucci commented Jun 7, 2021

In my opinion, downscale is the cause, restart is a symptom. Typically on downscale I'd expect single restart, not multiple. We have caught issues with this alert before (mostly in Loki, but still). I would keep it as is, and perhaps tune the parameters.

An ingester restart is not an issue per se, not even if it happen twice in a row.

@pstibrany
Copy link
Member

An ingester restart is not an issue per se, not even if it happen twice in a row.

I think that depends on actual reason (cause) why it was restarted twice in a row. Restarts due to "query of death" or some wrong pushed data would be bad. Restarts due to autoscaler are fine.

@pstibrany
Copy link
Member

I think that depends on actual reason (cause) why it was restarted twice in a row. Restarts due to "query of death" or some wrong pushed data would be bad. Restarts due to autoscaler are fine.

That said, since other alerts cover our write and read path, maybe lowering priority for this one is fine. 🤔

@pracucci
Copy link
Collaborator Author

pracucci commented Jun 7, 2021

I think that depends on actual reason (cause) why it was restarted twice in a row. Restarts due to "query of death" or some wrong pushed data would be bad. Restarts due to autoscaler are fine.

Agree on this. However, even if that's the case, if it just happens on a single ingester it's still not a critical thing (not critical enough to be waken up during the night because doesn't negatively affect the cluster). If it happens on multiple ingesters at the same time, SLO and other critical alerts (eg. requests failure) will trigger: these latter alerts are more symptom based than "an ingester is restarting too frequently".

Copy link
Member

@owen-d owen-d left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I think this can be moved to warning as other causes should be covered by SLOs, etc.

@pracucci pracucci merged commit e7cbfe4 into main Jun 8, 2021
@pracucci pracucci deleted the lower-ingester-restarts-severity branch June 8, 2021 13:23
simonswine pushed a commit to grafana/mimir that referenced this pull request Oct 18, 2021
…ster-restarts-severity

Lower CortexIngesterRestarts severity
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants