
Make OOM not be a SIGKILL #40157

Closed
grosser opened this issue Jan 19, 2017 · 58 comments
Labels
kind/support Categorizes issue or PR as a support question. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@grosser

grosser commented Jan 19, 2017

At the moment, apps that go over the memory limit are hard-killed ('OOMKilled'), which is bad (losing state, not running cleanup code, etc.).

Is there a way to get SIGTERM instead (with a grace period, or 100m before reaching the limit)?

@grosser grosser changed the title Make OOM behave not be a SIGKILL Make OOM not be a SIGKILL Jan 19, 2017
@smarterclayton smarterclayton added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Feb 28, 2017
@smarterclayton
Contributor

@kubernetes/sig-node-feature-requests

@smarterclayton smarterclayton added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 28, 2017
@vishh
Member

vishh commented Feb 28, 2017

It is not currently possible to change the OOM behavior. Kubernetes (or the runtime) could send your container a signal whenever the container is close to its memory limit, but this would be on a best-effort basis, since memory spikes might not be handled in time.

@grosser
Author

grosser commented Feb 28, 2017

FYI, using this crutch at the moment: https://github.com/grosser/preoomkiller

Any idea what would need to change to make the OOM behavior configurable?
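
(For context: a pre-OOM watcher of this kind roughly polls the container's cgroup memory accounting and sends SIGTERM once usage crosses a chosen fraction of the limit. Below is a minimal, hypothetical Go sketch of that idea, not the actual preoomkiller code; the cgroup v1 file paths and the 90% threshold are assumptions.)

```go
// Hypothetical pre-OOM watcher sketch (not the real preoomkiller).
// Assumes cgroup v1 memory accounting files are visible in the container.
package main

import (
	"log"
	"os"
	"strconv"
	"strings"
	"syscall"
	"time"
)

// readBytes parses a single numeric value from a cgroup file.
func readBytes(path string) (float64, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseFloat(strings.TrimSpace(string(raw)), 64)
}

func main() {
	const threshold = 0.90 // assumption: act at 90% of the memory limit

	for {
		usage, errU := readBytes("/sys/fs/cgroup/memory/memory.usage_in_bytes")
		limit, errL := readBytes("/sys/fs/cgroup/memory/memory.limit_in_bytes")
		if errU == nil && errL == nil && usage/limit >= threshold {
			log.Printf("memory %.0f/%.0f bytes above threshold, sending SIGTERM", usage, limit)
			// Targets PID 1, i.e. the container's main process when the watcher
			// runs inside the same container; a sidecar would need a shared PID
			// namespace and a different target PID.
			_ = syscall.Kill(1, syscall.SIGTERM)
			return
		}
		time.Sleep(10 * time.Second)
	}
}
```

The application still has to handle SIGTERM itself, and a fast spike can hit the hard limit before the watcher reacts, so this is best-effort, as noted above.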

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 21, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 20, 2018
@yujuhong
Member

/remove-lifecycle stale
/cc @dashpole

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@tonglil
Contributor

tonglil commented Feb 21, 2018

Was this meant to be closed?

It seems like @yujuhong meant to say /remove-lifecycle rotten?

@dashpole
Contributor

/remove-lifecycle rotten

@dashpole dashpole reopened this Feb 21, 2018
@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 21, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 22, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 21, 2018
@Draiken

Draiken commented Jul 17, 2018

When the node is reaching OOM levels I guess I understand some SIGKILLs happening, but when a pod reaches its manually set resource limit it also gets a SIGKILL. As the initial post mentions, this can cause a lot of harm.

As a workaround we're going to try to make the pod unhealthy before it reaches the memory limit, so we get a graceful shutdown.
Kubernetes sending this signal beforehand would solve the issue.

If I want this feature created, how should I go about it? Should I provide a PR with code changes, or ping someone to make a proposal?
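
(The workaround described above can be sketched roughly like this: expose a health endpoint that starts failing once memory usage crosses a threshold, and point a liveness probe at it, so the kubelet restarts the container through the normal SIGTERM-plus-grace-period path rather than the kernel OOM-killing it. A minimal, hypothetical Go sketch; the cgroup v1 paths, the 90% threshold, and port 8080 are assumptions.)

```go
// Hypothetical "go unhealthy before the limit" sketch for a liveness probe to hit.
package main

import (
	"fmt"
	"net/http"
	"os"
	"strconv"
	"strings"
)

// cgroupBytes parses a single numeric value from a cgroup file.
func cgroupBytes(path string) (float64, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseFloat(strings.TrimSpace(string(raw)), 64)
}

func main() {
	const threshold = 0.90 // assumption: report unhealthy at 90% of the memory limit

	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		usage, errU := cgroupBytes("/sys/fs/cgroup/memory/memory.usage_in_bytes")
		limit, errL := cgroupBytes("/sys/fs/cgroup/memory/memory.limit_in_bytes")
		if errU == nil && errL == nil && usage/limit >= threshold {
			// Failing the liveness probe makes the kubelet stop the container
			// gracefully (SIGTERM, then SIGKILL only after the grace period).
			http.Error(w, fmt.Sprintf("memory %.0f/%.0f bytes over threshold", usage, limit), http.StatusServiceUnavailable)
			return
		}
		fmt.Fprintln(w, "ok")
	})
	_ = http.ListenAndServe(":8080", nil)
}
```

Like any pre-OOM signal, this is best-effort: a sharp spike between probe intervals will still end in an OOMKill.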

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@gajus

gajus commented Jan 1, 2019

This remains an active issue.

There appears to be no way to gracefully handle OOMKilled at the moment.

@dashpole Can this be re-opened?

@dashpole
Contributor

dashpole commented Jan 2, 2019

/reopen
/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot reopened this Jan 2, 2019
@andy-v-h

What makes this really interesting is that sometimes you want the kernel to do the OOMKill when the node is under memory pressure, while other times it would be nice if Kubernetes would preemptively SIGTERM instead of an OOMKill, because the container is approaching a set limit.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 29, 2021
@george-angel
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 17, 2021
@s2maki

s2maki commented Mar 4, 2021

Even better would be for OOM to cause the call to brk to fail (i.e., malloc returning NULL), allowing the application to gracefully handle running out of memory the normal way, instead of the process being killed when it goes over.

@lonre

lonre commented Mar 10, 2021

It should provide an opportunity to shut down gracefully.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 8, 2021
@george-angel
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 8, 2021
@ffromani
Contributor

for reasons outlined in #40157 (comment) we can't just change the signal delivered. OTOH we can integrate with some OOM daemons, but this would require a separate discussion and KEP.

@ffromani
Contributor

Kubernetes does not use issues on this repo for support requests. If you have a question on how to use Kubernetes or to debug a specific issue, please visit our forums.

/remove-kind feature
/kind support
/close

@k8s-ci-robot k8s-ci-robot added kind/support Categorizes issue or PR as a support question. and removed kind/feature Categorizes issue or PR as related to a new feature. labels Jun 24, 2021
@k8s-ci-robot
Contributor

@fromanirh: Closing this issue.

In response to this:

Kubernetes does not use issues on this repo for support requests. If you have a question on how to use Kubernetes or to debug a specific issue, please visit our forums.

/remove-kind feature
/kind support
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tsuna
Contributor

tsuna commented Apr 27, 2022

@fromanirh this is not a support request, it's a legit feature request.

@T3rm1

T3rm1 commented Aug 23, 2022

@fromanirh Can you reopen this, please? It's clearly a feature request, not a support case.

@aojea
Member

aojea commented Aug 26, 2022

for reasons outlined in #40157 (comment) we can't just change the signal delivered. OTOH we can integrate with some OOM daemons, but this would require a separate discussion and KEP.

It is a feature request for the kernel, not for Kubernetes; the kernel generates the SIGKILL.

@johnnyshields

@aojea there are non-kernel solutions here, such as triggering graceful shutdowns at a threshold memory usage (e.g. 95%) before the hard limit is reached.

@aojea
Member

aojea commented Aug 26, 2022

@aojea there are non-kernel solutions here, such as triggering graceful shutdowns at a threshold memory usage (e.g. 95%) before the hard limit is reached.

oh, that is not clear from the title and from the comments, sorry

for reasons outlined in #40157 (comment) we can't just change the signal delivered. OTOH we can integrate with some OOM daemons, but this would require a separate discussion and KEP.

So it should be retitled, or a new issue opened with the clear request ... and for sure that will need a KEP.

@johnnyshields

Sure, how about titling it: "Add graceful memory usage-based SIGTERM before hard OOM kill happens"

@mike-chen-samsung

Has another issue been created for this? I can't find one in the Issues list. If not, I can create a new feature request issue.
