
Make CrashLoopBackoff timing tuneable, or add mechanism to exempt some exits #57291

Open
jgiles opened this issue Dec 18, 2017 · 103 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@jgiles

jgiles commented Dec 18, 2017

Is this a BUG REPORT or FEATURE REQUEST?: Feature request

/kind feature

What happened:
As part of a development workflow, I intentionally killed a container in a pod with restartPolicy: Always. The plan was to do this repeatedly, as a quick way to restart the container and clear old state (and, in Minikube, to load image changes).
The container went into a crash-loop backoff, making this anything but a quick option.

What you expected to happen:
I expected there to be some configuration allowing me to disable, or at least tune the timing of, the CrashLoopBackoff.

How to reproduce it (as minimally and precisely as possible):
Create a pod with restartPolicy: Always, and intentionally exit a container repeatedly.

Anything else we need to know?:
I see that the backoff timing parameters are hard-coded constants here:

MaxContainerBackOff = 300 * time.Second

backOffPeriod = time.Second * 10
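
For a rough sense of what these constants mean in practice, here is a sketch (my own illustration, not the kubelet's actual code) of the delay sequence you get if the backoff doubles from backOffPeriod and is capped at MaxContainerBackOff:

#!/bin/bash
# Illustration only: print the approximate CrashLoopBackOff delays,
# assuming a 10s starting period that doubles and caps at 300s.
delay=10
for restart in $(seq 1 8); do
    echo "restart #${restart}: wait ${delay}s"
    delay=$(( delay * 2 > 300 ? 300 : delay * 2 ))
done

which prints 10s, 20s, 40s, 80s, 160s, and then 300s for every subsequent restart.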

One might reasonably expect these to be configurable at least at the kubelet level - say, by settings like these. That would be sufficient for my use case (local development with fast restarts), and presumably useful as an advanced configuration setting for production workloads.

A more aggressive change would allow tuning per-pod.

There are other options for my target workflow:

  • Put the pod in a Deployment or similar, kubectl delete the pod, let Kubernetes schedule another, work with the new pod. However, this is much slower than a container restart without backoff (and ironically causes more kubelet load than the backoff avoids). It also relies on using kubectl/the Kubernetes API to do the restart, as opposed to just exiting the container.
  • Run the server process as a secondary process in the container rather than the primary process. This means the server can be started/stopped without container backoff, but is trickier to implement and doesn't offer the same isolation guarantees as exiting the container and starting fresh. It also means I probably can't use the same image I deploy to production (because I probably don't want this extra restart-support stuff floating around in the production image).

Environment:

  • Kubernetes version (use kubectl version): v1.8.0
  • Cloud provider or hardware configuration: Minikube 0.23.0 with Virtualbox driver on OSX
@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Dec 18, 2017
@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Dec 18, 2017
@jgiles
Author

jgiles commented Dec 18, 2017

I guess a more direct way to achieve what I am looking for would be a kubectl restart pod_name -c container_name that was explicitly exempted from crash-loop backoff (see #24957 (comment) for related discussion) or some other way to indicate that we're bringing the container down on purpose and are not in an uncontrolled crash loop.

But the 5-minute max backoff / 10-minute backoff reset for image pulls and crash backoff seems far too high for development environments regardless. I'd like to tune those down significantly on my Minikube anyway.

@dims
Member

dims commented Dec 18, 2017

/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Dec 18, 2017
@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Dec 18, 2017
@bryanlarsen

This isn't just an issue for dev, it's also important for some production workloads.

We have some workloads that deliberately exit whenever they get any sort of error, including bad input data, expecting that they'll be restarted in a clean state so that they can continue. Bad input data is ~2% of input, each unit takes ~5 seconds, so those workloads seem to spend more time in CrashLoopBackoff than they do processing jobs. Especially since bad input data tends to be clustered.

@fredsted

fredsted commented Feb 19, 2018

We have this issue with workers that are restarted automatically every 5 minutes to clear any bad database connections, etc. The process quits itself automatically and should then just restart with no delay. It would be nice to be able to just disable this backoff.

Edit: We ended up using the solution from here, which allows us to restart the script every 5 minutes and not worry about CrashLoopBackOff.

@3dbrows

3dbrows commented Feb 26, 2018

It would be ideal if the backoff could be disregarded in cases where the container exited with code 0. One could argue that containers exiting with such a code are not stuck in a "crash loop"; they have merely exited after successfully completing their work, and you want another one to start. This is tantamount to an infinite Job (no completion-count target).

@mcfedr

mcfedr commented Mar 30, 2018

Having a similar issue: I have a worker that I know is unstable, and yes, we are working to make it better, but I really want the deployment to just keep restarting it.

@3dbrows

3dbrows commented Mar 30, 2018

@mcfedr Such a design of K8S would probably strain the masters too much due to continual thrashing of container state. To solve your problem, how about something like this, where your Dockerfile runs a script that continually restarts the process you want to remain alive:

CMD bash /code/start.sh

Where start.sh is something to the effect of

#!/bin/bash

while true; do
    python /code/app.py
done

In other words, upon python exiting, just start it again.
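
(A variant of this sketch, assuming you only want instant restarts for clean exits: loop on exit code 0 but let a real crash terminate the container, so Kubernetes still applies CrashLoopBackOff to genuine failures.)

#!/bin/bash
# Restart the app immediately after a clean exit (code 0), but propagate
# real crashes to Kubernetes so the normal backoff still applies to them.
while true; do
    python /code/app.py
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "app exited with code ${status}; stopping container" >&2
        exit "$status"
    fi
done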

@mcfedr

mcfedr commented Mar 30, 2018

That's basically the system I'm moving to at the moment; well, it's actually where I was before moving to k8s. I assumed k8s would be a better place to handle it, but that would also work.

@eroldan

eroldan commented Apr 11, 2018

Here is my own "small team" scenario:

As k8s does not have any kind of dependency handling, if some frontend pods are in CrashLoopBackoff because another pod isn't ready (e.g. a buggy backend service), then when the backend comes up again the complete app will take 5 more minutes to become available. In this case it would be useful to kill the buggy backend, let the Deployment create a new one (probably pulling a bug-fixed image), and just wait 1 minute for the frontend to reconnect.

@mcfedr

mcfedr commented Apr 13, 2018

@eroldan I have been using init containers to make sure that backend services are running/updated before launching frontends. Of course this won't prevent issues if the backend goes down during work, but maybe you can change the frontend liveness check to report healthy even when the backend is not working.
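
(For reference, the init container in that setup is usually just a small wait loop along these lines; backend-svc, the port, and the /healthz path are placeholders for whatever your backend actually exposes.)

#!/bin/sh
# Init container command: block pod startup until the backend answers.
until wget -q -O /dev/null http://backend-svc:8080/healthz; do
    echo "waiting for backend-svc..."
    sleep 2
done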

@sporkmonger

I have a use case where init containers run into a rate limit enforced by an external system if they go into a crash loop, which just ensures the loop continues. I would like to be able to adjust the crash backoff to avoid hitting that externally enforced rate limit without resorting to something hacky like a sleep inside the init container.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 1, 2018
@grosser

grosser commented Oct 1, 2018

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 1, 2018
@e-pirate

e-pirate commented Oct 8, 2018

We have an even more painful scenario, and a reason MaxContainerBackOff and backOffPeriod must be tunable per node. We have two ingress nodes running Calico and nginx-ingress pods. Besides that, we have Keepalived on the ingress nodes watching for the specified pods to be in the Running state and moving the VIP (external real IP) according to that condition. If all ingress nodes go down, stay unresponsive for some time, and then one of them returns, Kubernetes will try to restart the calico-node pod (the only available restartPolicy in the Calico DS is "Always"), but the pod will stay in the CrashLoopBackOff state for up to 5 minutes, leaving the whole cluster unavailable from outside while Kubernetes simply does nothing and waits for the timeout to expire. Instead, in that particular scenario it is vital to push pods as hard as possible to go through their restart cycle ASAP.

@scher200

scher200 commented Oct 9, 2018

I would even prefer to set this "MaxContainerBackOff and backOffPeriod", or adjust a "crashBackOffPeriod", per container.
Some tasks are isolated processes on purpose that you really don't want to repeat within a container like this (think of privacy concerns):

#!/bin/bash

while true; do
    python /code/isolated-task.py
done

but would rather just specify in your Dockerfile:

CMD ["isolated-task.py"]

@Dag24 this straining of the masters depends on the lifecycle and scale of your tasks, so it must be configurable both ways.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 7, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 6, 2019
@mattysweeps

/remove-lifecycle rotten

I would also like to adjust the CrashLoopBackoff timings, to make them shorter.

@nijave

nijave commented Sep 19, 2023

It might be useful to have this configurable similar to startup/liveness/readiness probes. In my case, I have an application consisting of a few different Deployments and a StatefulSet. This creates a dependency graph and subsequent items crash while waiting for their dependencies to start. Eventually everything starts, but this can lead to 5+ minute delays for the last dependencies depending on how many times they crashed already.

For instance

  1. StatefulSet
  2. Deployment (depends on StatefulSet)
  3. Job (depends on Deployment)

In my case, a gracePeriod could also be useful. I think it's pretty common to let Kubernetes be "eventually consistent" but the backoff logic can lead to some less-than-desirable feedback loops.

This really slows down acceptance test automation (using Tilt and kustomize) where the entire expected runtime is <5 minutes but most of the time is spent waiting on backoff.

@dkrieger

Thought I'd weigh in here years after being on team "let's get this merged" and seeing new people adding the same thing to the discussion that I originally did...

When deploying a workload that is supposed to be able to exit and restart, just accept that it's an application-level concern and use an init process, even if it's a simple bash script. It's conceptually cleaner than treating it as an infrastructure-level concern. With the exception of Jobs, k8s workloads are assumed to run forever, and when they stop running, they're assumed to have crashed. This is OK.

@haqa

haqa commented Sep 19, 2023

Thought I'd weigh in here years after being on team "let's get this merged" and seeing new people adding the same thing to the discussion that I originally did...

When deploying a workload that is supposed to be able to exit and restart, just accept that it's an application-level concern and use an init process, even if it's a simple bash script. It's conceptually cleaner than treating it as an infrastructure-level concern. With the exception of Jobs, k8s workloads are assumed to run forever, and when they stop running, they're assumed to have crashed. This is OK.

I'm glad this works for your workload and usage model. If Kubernetes were only for that type of workload that would be great, or even if it were only for workloads WE wrote, that would be great.

But Kubernetes is for ALL workloads, including unmodifiable images, for coding or licensing reasons. We shouldn't decide how Kubernetes should work based on "it works on my machine"!

This feature should be switchable or configurable for situations where we can't change the code to work without exiting.

@dkrieger

I'm glad this works for your workload and usage model. If Kubernetes were only for that type of workload that would be great, or even if it were only for workloads WE wrote, that would be great.

But Kubernetes is for ALL workloads, including unmodifiable images, for coding or licensing reasons. We shouldn't decide how Kubernetes should work based on "it works on my machine"!

This feature should be switchable or configurable for situations where we can't change the code to work without exiting.

This is a false premise, and I'll explain why.

cmd and args are not part of the artifact, and that's all you need to control how the application is initialized. There's no such thing as a manifest you're not able to control, be it for coding or licensing reasons. If you're talking about a 3rd party helm chart, it would be a bug in the chart (or a missing feature) if it doesn't have an init process suitable to your needs. The way your process is initialized is akin to the env vars you choose to set, it is not part of your application's source, whether it's transparent and modifiable or opaque and unmodifiable.

I'll go a step further and challenge your claim that you cannot legally build your image from another, potentially proprietary image, but I'd like you to describe your particular situation if that is wrong.

The bottom line is there is no scenario where you're "allowed" to configure a yet-to-be implemented crashloopbackoff setting but you're "not allowed" to control the process by which your app is executed, excluding literally a person or group of people you work with that exert arbitrary constraints not founded in technical or legal underpinnings.

@haqa

haqa commented Sep 19, 2023

I'm glad this works for your workload and usage model. If Kubernetes were only for that type of workload that would be great, or even if it were only for workloads WE wrote, that would be great.
But Kubernetes is for ALL workloads, including unmodifiable images, for coding or licensing reasons. We shouldn't decide how Kubernetes should work based on "it works on my machine"!
This feature should be switchable or configurable for situations where we can't change the code to work without exiting.

This is a false premise, and I'll explain why.

cmd and args are not part of the artifact, and that's all you need to control how the application is initialized. There's no such thing as a manifest you're not able to control, be it for coding or licensing reasons. If you're talking about a 3rd party helm chart, it would be a bug in the chart (or a missing feature) if it doesn't have an init process suitable to your needs. The way your process is initialized is akin to the env vars you choose to set, it is not part of your application's source, whether it's transparent and modifiable or opaque and unmodifiable.

I'll go a step further and challenge your claim that you cannot legally build your image from another, potentially proprietary image, but I'd like you to describe your particular situation if that is wrong.

The bottom line is there is no scenario where you're "allowed" to configure a yet-to-be implemented crashloopbackoff setting but you're "not allowed" to control the process by which your app is executed, excluding literally a person or group of people you work with that exert arbitrary constraints not founded in technical or legal underpinnings.

Not all software is open source, or open license, or, in fact, open in any way. Much software is licensed very restrictively. Not all software is packaged with the tools you suggest, and not all software may legally be "from"'d. Are we saying, publicly here, that these applications are not able to be run reliably under Kubernetes? That kubernetes is not for these types of application licensing?

That would be a bold statement indeed.

@dkrieger

dkrieger commented Sep 19, 2023

I'm glad this works for your workload and usage model. If Kubernetes were only for that type of workload that would be great, or even if it were only for workloads WE wrote, that would be great.
But Kubernetes is for ALL workloads, including unmodifiable images, for coding or licensing reasons. We shouldn't decide how Kubernetes should work based on "it works on my machine"!
This feature should be switchable or configurable for situations where we can't change the code to work without exiting.

This is a false premise, and I'll explain why.
cmd and args are not part of the artifact, and that's all you need to control how the application is initialized. There's no such thing as a manifest you're not able to control, be it for coding or licensing reasons. If you're talking about a 3rd party helm chart, it would be a bug in the chart (or a missing feature) if it doesn't have an init process suitable to your needs. The way your process is initialized is akin to the env vars you choose to set, it is not part of your application's source, whether it's transparent and modifiable or opaque and unmodifiable.
I'll go a step further and challenge your claim that you cannot legally build your image from another, potentially proprietary image, but I'd like you to describe your particular situation if that is wrong.
The bottom line is there is no scenario where you're "allowed" to configure a yet-to-be implemented crashloopbackoff setting but you're "not allowed" to control the process by which your app is executed, excluding literally a person or group of people you work with that exert arbitrary constraints not founded in technical or legal underpinnings.

Not all software is open source, or open license, or, in fact, open in any way. Much software is licensed very restrictively. Not all software is packaged with the tools you suggest, and not all software may legally be "from"'d. Are we saying, publicly here, that these applications are not able to be run reliably under Kubernetes? That kubernetes is not for these types of application licensing?

That would be a bold statement indeed.

Can you provide a concrete example of where a docker image cannot be FROM'd? I don't believe that's true. Even still you could use volumes to inject a self-contained init process, then use cmd/args. Open/closed-source is completely irrelevant as nothing I'm proposing involves touching source code. You're describing contrived what-ifs that aren't relevant to the proposition of wrapping an image's default entrypoint with an init process

No commercial license for a container image would be intentionally written such that it prevents running on kubernetes, unless it's intentionally blocking kubernetes usage in which case it's a moot point.

@nediamond

@dkrieger Are you defending the current state of the software? Maybe this is ok as an untuneable parameter for legitimate crash cases, but for situations where a container is exiting with code 0 it seems clearly incorrect, or at best highly unintuitive, to apply this "crash backoff" policy.

@kribor

kribor commented Sep 20, 2023

Manipulating the job to not exit in some cases doesn't make any sense and can be a security vulnerability. E.g. we've set up our CI runner to run as an ever-restarting Deployment. Restarting with clean "storage" is a key security feature to prevent data leaking between jobs or to stop an attack from being persisted. The pod restart itself has lots of useful implications for security and reproducibility. Preventing people from taking advantage of them doesn't make sense.

@dkrieger

dkrieger commented Sep 20, 2023

Manipulating the job to not exit in some cases doesn't make any sense and can be a security vulnerability. E.g. we've set up our CI runner to run as an ever-restarting Deployment. Restarting with clean "storage" is a key security feature to prevent data leaking between jobs or to stop an attack from being persisted. The pod restart itself has lots of useful implications for security and reproducibility. Preventing people from taking advantage of them doesn't make sense.

Your use of the term "job" here is appropriate. Shouldn't this be modeled as a Kubernetes Job, which already does distinguish between exit codes? If you need a CI job scheduler workload, that could be modeled as a deployment with a service account that's allowed to schedule Kubernetes Jobs. This feels relevant: https://xkcd.com/1172/ . If the issue is performance, I don't think adding configuration parameters to better support what amounts to a hack (using deployments for ephemeral workloads, which threatens the stability of the cluster due to thrashing) is the right way to extend the APIs of a massively adopted post-beta platform, and I'm not convinced there's no way to facilitate performant, secure CI runners with the existing APIs. Let me add that I have absolutely no say over whether this feature is added, I'm just voicing a perspective that is likely shared by those who do, and I think commenters lobbying for it need to provide some more compelling arguments than have been seen for the past ~6 years to change the status quo.

@kribor

kribor commented Sep 20, 2023

If you need a CI job scheduler workload, that could be modeled as a deployment with a service account that's allowed to schedule Kubernetes Jobs.

I presume you are a Kubernetes developer used to building apps that interact in complex ways with Kubernetes APIs, hooks, etc. I'm not; I'm a Kubernetes user looking for an orchestration platform that solves complex infrastructure needs so I don't need to code that logic myself. To me, creating some custom piece of complex code to schedule jobs and monitor their completion so new ones start as soon as the last one completes is just not worth the complexity when the use case can be 99% covered with existing Kubernetes features. I don't remember exactly why I arrived at a Deployment a few years ago when setting this up, but looking at the Job docs now I assume it was to enable a constant number of runners and immediate restart on success. Achieving that with a Job doesn't seem possible (e.g. from the Job docs: "Only a RestartPolicy equal to Never or OnFailure is allowed") without such custom scheduling as I think you're suggesting. Deployments make constant rescheduling via replica count trivial, except for the issue of jobs running "too fast" for Kubernetes' default liking. Basically, Deployment is a much closer match for the use case than Job, unless there were some other kind of Job scheduling allowing "restarting forever".

This is not about laziness or inability to build a custom Job scheduler. But I don't believe I should need to build one to support this; based on the popularity of this thread, it's a reasonably common scenario.

@mitar
Contributor

mitar commented Sep 20, 2023

I'm somebody who designed and developed an init specifically for Docker images, built on the idea that, in contrast with traditional init systems, a Docker-specific init should not restart processes itself but should propagate failures to the whole container, so that the container supervisor can take proper corrective action (and log it and collect stats). From our experience, an init inside the container that does too much ends up hiding errors from the supervisor.

So I am a bit surprised that a supervisor like Kubernetes does not have more flexible configuration here for what corrective actions should occur and how. Respecting exit codes seems like something very obvious to support, as does being able to control the backoff.

@dkrieger

If you need a CI job scheduler workload, that could be modeled as a deployment with a service account that's allowed to schedule Kubernetes Jobs.

I presume you are a Kubernetes developer used to building apps that interact in complex ways with Kubernetes APIs, hooks, etc. I'm not; I'm a Kubernetes user looking for an orchestration platform that solves complex infrastructure needs so I don't need to code that logic myself. To me, creating some custom piece of complex code to schedule jobs and monitor their completion so new ones start as soon as the last one completes is just not worth the complexity when the use case can be 99% covered with existing Kubernetes features. I don't remember exactly why I arrived at a Deployment a few years ago when setting this up, but looking at the Job docs now I assume it was to enable a constant number of runners and immediate restart on success. Achieving that with a Job doesn't seem possible (e.g. from the Job docs: "Only a RestartPolicy equal to Never or OnFailure is allowed") without such custom scheduling as I think you're suggesting. Deployments make constant rescheduling via replica count trivial, except for the issue of jobs running "too fast" for Kubernetes' default liking. Basically, Deployment is a much closer match for the use case than Job, unless there were some other kind of Job scheduling allowing "restarting forever".

If you scroll up, multiple ready-made solutions have been linked that will delete pods in CrashLoopBackOff state. The objection that followed, that logs won't be preserved, is not really compelling; if you're not aggregating logs in your cluster, that's the problem to solve. There are already multiple low-effort paths for you to accomplish what you want.
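
(For anyone looking for that kind of workaround, a minimal sketch, assuming kubectl access to the cluster, is just:)

#!/bin/bash
# Delete every pod currently in CrashLoopBackOff so its controller
# recreates it immediately (note: this discards the pod's restart history).
kubectl get pods --all-namespaces --no-headers \
  | awk '$4 == "CrashLoopBackOff" {print $1, $2}' \
  | while read -r ns pod; do
        kubectl delete pod -n "$ns" "$pod"
    done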

If this were a rampant high impact problem, there would be popular CRDs and operators for dealing with this, because k8s was thoughtfully designed for extensibility. There probably are some in the wild. For every use case described in here, there are plenty of other people with the same use case who have found suitable means to accomplish their ends with the current core k8s APIs. They've probably never even seen this gh issue.

@mitar
Contributor

mitar commented Sep 20, 2023

For every use case described in here, there are plenty of other people with the same use case who have found suitable means to accomplish their ends with the current core k8s APIs. They've probably never even seen this gh issue.

This is currently the 5th most upvoted open issue in this repository, so I think your assertion here is false.

@dkrieger

dkrieger commented Sep 20, 2023

For every use case described in here, there are plenty of other people with the same use case who have found suitable means to accomplish their ends with the current core k8s APIs. They've probably never even seen this gh issue.

This is currently the 5th most upvoted open issue in this repository, so I think your assertion here is false.

246 is nothing compared to k8s's user base. 6 years and no RFC (unless I missed it). People are voting with their feet that this isn't necessary. If it were there would be popular operators. Plenty of things that are essential to running a k8s cluster in production are not even part of the core APIs. Having this feature seems like all upside and no downside from a user perspective, but it almost certainly would threaten the reliability of the cluster itself, and keeping it out of out-of-the-box APIs doesn't prevent people from consciously taking that risk. Unlike a dedicated process supervisor, k8s is responsible for the complete distributed system.

@kribor

kribor commented Sep 20, 2023

246 is nothing compared to k8s's user base.

Sure in absolute terms but most people won't act even to upvote. The fact that it's top 5 says a lot more than the absolute number.

@thockin
Member

thockin commented Dec 8, 2023

Lots of strong feelings on this one. I just wanted to chime in on what I think might be palatable.

First, let's acknowledge that real users are really struggling with this. The current design was intended to balance the needs of the few (crashy apps) with the needs of the many (everyone else). Crashing and restarting in a really fast loop usually indicates that something is wrong. But clearly not ALWAYS.

There are a lot of good ideas in this thread. Some that stand out to me, with some commentary.

  1. Revisit the default backoff curve. Perhaps it's just too aggressive and should start slower and peak lower. E.g. it could start at 250ms and climb to 4s and stop there. Or 16s (still much less than 5m). Maybe it uses a 1.2x factor instead of 2x. Or maybe it flattens out again at 4s and you can restart 100 times before it starts climbing again. Or maybe it's 100x at each step.
  2. Don't count a 0 exit as a crash, making it eligible for fast(er) restart. This could mean "immediate" or a very different backoff curve.
  3. Codify the previous into a tool, e.g. restart-on-exit-0 -- mycommand -arg -arg -arg. I don't love this but it is totally back-compatible.
  4. Reset the crash counter when your app successfully passes a startupProbe or readinessProbe (or 3x readiness or something). Or reduce the count for every probe you pass (so starting up, serving for a bit, then exiting would always be fast restart, but crash-crash-crash would slow down quickly).

NOTE: These are not all mutually exclusive!

It's important to repeat - this is designed to protect the system from badly behaving apps. We don't want to remove that safety. If we add an explicit API to bypass it, we can at least let cluster admins install policies that say "oh no you don't". But APIs have costs that we pay forever.

So my strong preference is to do something smart without any API surface, and only if we can't make that work to add API. Some combination of options 1, 2, and 4 seem like we could put a real dent in the problem, and then see what's left.

@tzneal
Contributor

tzneal commented Dec 8, 2023

Revisit the default backoff curve. Perhaps it's just too aggressive and should start slower and peak lower. E.g. it could start at 250ms and climb to 4s and stop there. Or 16s (still much less than 5m). Maybe it uses a 1.2x factor instead of 2x. Or maybe it flattens out again at 4s and you can restart 100 times before it starts climbing again. Or maybe it's 100x at each step.

I really like this one. I've deleted pods before to force them to re-create because I didn't want to wait multiple minutes for a restart. But, if I only had to wait 16s, or even 30s, I would probably just wait instead.
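
(To make the difference concrete, here is a quick illustration, my own sketch of one possible parameterization rather than a decided design, comparing today's curve with a 250ms-start, 2x-factor, 16s-cap curve:)

#!/bin/bash
# Illustration only: current backoff (10s start, 2x, 300s cap) vs a
# hypothetical gentler curve (0.25s start, 2x, 16s cap).
awk 'BEGIN {
    cur = 10; prop = 0.25
    for (i = 1; i <= 8; i++) {
        printf "restart #%d: current %ds, proposed %.2fs\n", i, cur, prop
        cur  = (cur  * 2 > 300) ? 300 : cur  * 2
        prop = (prop * 2 > 16)  ? 16  : prop * 2
    }
}'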

@thockin
Member

thockin commented Dec 8, 2023

Also, it feels like this issue and #50375 are approaching the same problem from two different sides. I much prefer this formulation of the problem to "just let me clear the counter". Do we need both issues? @SergeyKanzhelev

@haqa

haqa commented Dec 9, 2023 via email

@NGTOne

NGTOne commented Jan 18, 2024

I have another use case that I don't think has been mentioned here (and which I will admit is at the far end of the curve). I'm working on some automotive hardware using K3s as a basis. Because it's an automotive application, ungraceful shutdowns are essentially a given, e.g. when the engine is cut or ignition gets cranked. And, when things start back up again, it's a very frequent occurrence for the cluster to treat our Pods as being in CrashLoopBackOff (or for them to enter CrashLoopBackOff because they can't access some shared resource because it's still booting or has itself gone into startup-related CrashLoopBackOff). Having the ability to tune the backoff interval, or disable it entirely, would improve the apparent startup time of our device by, in some cases, several minutes.

@haqa

haqa commented Jan 18, 2024

I feel that there are a lot of use cases here that are (or should be considered) valid for which the current concrete backoff policy is unhelpful. I'm not really a go dev. How do we go from "it needs a change" to "here's a change"?

@rofreytag

rofreytag commented Feb 22, 2024

We have an Akka cluster that only forms correctly if ceil(n/2) pods are available during cluster formation (bootstrap). We configured a Deployment with n nodes (where n is always odd). The bootstrap can fail if a lot of pods are stuck in Pending, or are having other boot issues (e.g. networking issues, other factors), which causes restarts of individual pods.
The restart gives the bootstrap for the whole cluster another chance, and in 99% of cases it works.

The remaining 1% results in an unrecoverable error (manual intervention needed), because the crash-loop backoff starts the pods at different times and eventually only one pod is starting at any given time. I was looking for a backoffLimit or crashLoopBackoffThreshold key in the pod/container spec, which felt natural given that I can configure backoffLimit in a Job.

Why is this not a feature?

@tallclair
Member

/cc @lauralorenz

@AndrewWinterman

we're currently killing pods to make them restart faster, which loses good info on why they crashlooped. Not awesome.

@haqa

haqa commented Apr 3, 2024

I can't help thinking that if a ticket has been active, with people consistently asking for a feature, for over 6 years, then perhaps it might be considered "wanted" and possibly even "due"?

Or are we in a CrashLoopBackOff?

@haqa

haqa commented Apr 3, 2024

It's been observed that the current behaviour is to prevent crashing software from adversely affecting a cluster. However several valid use cases have been presented where a container may exit for a valid reason.

It has been suggested that we should write and employ an operator. It's not an awful idea, but it is fundamentally wrong in this case. Operators are there to do things K8S doesn't do, not to magically fix aberrant behaviour.

This behaviour is, at best, only appropriate for one, bounded (albeit large) use case, while completely ignoring and frustrating tens of equally, or even more (IMHO), valid use cases. A behaviour described in such terms is often referred to using the shorthand terminology "broken", that is: it is not suitable for the majority of observed use cases. That's not a missing behaviour, that's a bug.

As such a (rapid) bug fix would be the appropriate response.

@tallclair
Member

Or are we in a CrashLoopBackOff?

😆

We're planning to publish a KEP soon. Hoping to get an alpha in v1.31

@thockin
Member

thockin commented Apr 3, 2024

I can't help thinking that if a ticket has been active, with people consistently asking for a feature, for over 6 years, then perhaps it might be considered "wanted" and possibly even "due"?

I'd like to remind the audience at large that this is not an awesome way to communicate. The folks who work on this project have 100x more things that we would like to fix than can be reasonably fixed in a given unit of time. We're not ignoring issues because we want to punish you, or because we don't care. The fact that this has not been closed is a pretty straight-forward acknowledgement that it is "real".

I'm not going to play the usual "PRs welcome" response, because I know just how not-simple this issue is. Not everyone has the time to help solve a problem. That said, pounding the table and demanding attention, implying that we are ignoring it on purpose doesn't help. If all you want is to "me, too" it, that's what the emoji votes are for.

I spent 7 minutes writing this reply that I could have spent on a real bug :)
