Create alert for OOMKill events inside containers #822

Open
wants to merge 1 commit into
base: master

Conversation

@mac-chaffee mac-chaffee commented Jan 29, 2023

Helps with #759, second attempt of #760. Also may be related to #112 and may supersede #800?

In Kubernetes 1.24, kubelet started exposing a metric that counts OOMKill events for specific containers, container_oom_events_total, which I used for this alert.

This alert will fire if any of these OOMKill events occur in a container. Without it, multi-process containers such as webservers that run multiple "worker" processes can be OOMKilled silently. I have personally seen a pod running Gunicorn throw a 100% error rate due to OOMKills that 1) didn't show up in app-level monitoring, since the workers died before recording stats, and 2) didn't show up in any existing kubernetes-mixin alerts, since PID 1 never died.

IMO this alert might be better than #800 since it's more granular (at the container and process level). The OOMKilled pod status may be incorrect since it just checks whether exit_code == 137, which results from any SIGKILL, not just one sent by the OOM killer.
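
To illustrate the 137 point, here is a minimal sketch, assuming Python on a POSIX system. The helper names `shell_style_exit_code` and `kill_child_with_sigkill` are made up for this example and are not part of this PR. It kills a child process with a hand-sent SIGKILL, with no memory pressure involved, and maps the result to the 128 + signal number convention that shells and container runtimes report.

```python
import signal
import subprocess
import sys


def shell_style_exit_code(returncode: int) -> int:
    """Map a subprocess return code to the 128+N exit-code convention."""
    # On POSIX, Python reports death by signal N as the negative value -N.
    return 128 - returncode if returncode < 0 else returncode


def kill_child_with_sigkill() -> int:
    """Start a sleeping child, SIGKILL it by hand, and return its raw return code."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    # A plain SIGKILL sent by us; the kernel OOM killer plays no part here.
    child.send_signal(signal.SIGKILL)
    return child.wait()


if __name__ == "__main__":
    raw = kill_child_with_sigkill()
    print(raw, shell_style_exit_code(raw))
```

The exit code alone can't tell this manual kill apart from a real OOM kill, which is why a counter of actual OOM events looks like the more reliable signal.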

Open to suggestions!

Member

@paulfantom paulfantom left a comment

LGTM, but I would like others to weigh in before merging this.

@povilasv
Contributor

povilasv commented Feb 3, 2023

This is very similar to #800; see the discussion in #800 (comment).

@paulfantom I guess we need to decide whether we want these types of alerts in mixin and on what severity level :)

@szymonpk

Right now, this will not work due to the cadvisor bug.

4 participants