Make CrashLoopBackoff timing tuneable, or add mechanism to exempt some exits #57291
I guess a more direct way to achieve what I am looking for would be a …. But the 5-minute max backoff / 10-minute backoff reset for image pull and crash backoff seems far too high for development environments regardless. I'd like to tune those down significantly on my Minikube anyway.
/sig node
This isn't just an issue for dev; it's also important for some production workloads. We have some workloads that deliberately exit whenever they get any sort of error, including bad input data, expecting that they'll be restarted in a clean state so that they can continue. Bad input data is ~2% of input and each unit takes ~5 seconds, so those workloads seem to spend more time in CrashLoopBackoff than they do processing jobs, especially since bad input data tends to be clustered.
We have this issue with workers that are restarted automatically every 5 minutes to clear any bad database connections, etc. The process quits itself automatically and then should just restart with no delay. Would be nice to be able to just disable this backoff. Edit: We ended up using the solution from here, which allows us to restart the script every 5 minutes and not worry about CrashLoopBackOff.
It would be ideal if the backoff could be disregarded in cases where the container exited with code 0. One could argue that containers exiting with such a code are not stuck in a "crash loop"; they have merely exited after successfully completing their work, and you want another one to start. This is tantamount to an infinite …
Having a similar issue. I have a worker that I know is unstable, and yes, we are working to make it better, but I really want the deployment to just keep restarting it.
@mcfedr Such a design of K8S would probably strain the masters too much due to continual thrashing of container state. To solve your problem, how about something like the sketch below, where your Dockerfile runs a script that continually restarts the process you want to remain alive? In other words, upon python exiting, just start it again.
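A minimal sketch of such a wrapper, assuming a Python entrypoint (file names are hypothetical):

```dockerfile
FROM python:3
COPY app.py run.sh /
# Run the wrapper, not the app itself, so the container never exits.
CMD ["/run.sh"]
```

with `run.sh` being:

```sh
#!/bin/sh
# Restart the process whenever it exits, for any reason.
while true; do
  python /app.py
done
```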
That's the system I was basically moving to at the moment; it's actually where I was before moving to k8s, and I assumed k8s would be a better place to handle it, but that would also work.
Here is my own "small team" scenario: as k8s does not have any kind of dependency management between pods, if some frontend pods are in CrashLoopBackoff because another pod isn't ready (e.g. a buggy backend service), then when the backend comes up again the complete app will take 5 more minutes to be available. In this case it would be useful to kill the buggy backend, let the Deployment create a new one (probably pulling a bug-fixed image), and just wait 1 minute for the frontend to reconnect.
@eroldan I have been using init containers to make sure that backend services are running/updated before launching frontends (see the sketch below). Of course this won't prevent issues if the backend goes down during work, but maybe you can change the frontend liveness check to report healthy even when the backend is not working.
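A minimal sketch of that init-container gating, assuming a backend Service named `backend` with a health endpoint on port 8080 (names and images are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  initContainers:
  - name: wait-for-backend
    image: busybox
    # Loop (rather than exit) until the backend answers, so the pod
    # waits instead of crash-looping while the backend is down.
    command: ["sh", "-c", "until wget -qO- http://backend:8080/healthz; do sleep 2; done"]
  containers:
  - name: frontend
    image: example/frontend:latest  # hypothetical image
```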
I have a use case where init containers run into a rate limit enforced by an external system if they go into a crash loop, which just ensures the loop continues. I would like to be able to adjust the crash backoff to avoid hitting that externally enforced rate limit without needing to resort to something hacky like a sleep inside the init container.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
We have an even more painful scenario, and a reason MaxContainerBackOff and backOffPeriod must be tunable per node. We have two ingress nodes running Calico and nginx-ingress pods. Besides, we have Keepalived on the ingress nodes watching for specified pods to be in the Running state and moving the VIP (external real IP) according to that condition. In case all ingress nodes go down, stay unresponsive for some time, and then one of them returns, Kubernetes will try to restart the calico-node pod (the only available restartPolicy in the Calico DS is "Always"), but the pod will stay in CrashLoopBackOff for up to 5 minutes, leaving the whole cluster unavailable from outside while Kubernetes simply does nothing and waits for the timeout to expire. Instead, it is vital to push pods as hard as possible to go through their internal cycle ASAP in that particular scenario.
I would even prefer to set these ("MaxContainerBackOff" and "backOffPeriod"), or adjust "crashBackOfPeriod", per container, rather than specifying the restart behaviour within your Dockerfile.
@Dag24 this straining of masters depends on the lifecycle and scale of your tasks, and so it must be configurable both ways.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle rotten

I also would like to adjust the CrashLoopBackoff timings: to make them shorter.
It might be useful to have this configurable, similar to startup/liveness/readiness probes. In my case, I have an application consisting of a few different …. For instance …
In my case, a gracePeriod could also be useful. I think it's pretty common to let Kubernetes be "eventually consistent", but the backoff logic can lead to some less-than-desirable feedback loops. This really slows down acceptance test automation (using …).
Thought I'd weigh in here years after being on team "let's get this merged" and seeing new people adding the same things to the discussion that I originally did... When deploying a workload that is supposed to be able to exit and restart, just accept that it's an application-level concern and use an init process, even if it's a simple bash script. It's conceptually cleaner than treating it as an infrastructure-level concern. With the exception of Jobs, k8s workloads are assumed to run forever, and when they stop running, they're assumed to have crashed. This is OK.

I'm glad this works for your workload and usage model. If Kubernetes were only for that type of workload, that would be great; or even if it were only for workloads WE wrote, that would be great. But Kubernetes is for ALL workloads, including images that are unmodifiable for coding or licensing reasons. We shouldn't decide how Kubernetes should work based on "it works on my machine"! This feature should be switchable or configurable for situations where we can't change the code to work without exiting.
This is a false premise, and I'll explain why.
I'll go a step further and challenge your claim that you cannot legally build your image FROM another, potentially proprietary image; I'd like you to describe your particular situation if that is wrong. The bottom line is that there is no scenario where you're "allowed" to configure a yet-to-be-implemented crashloopbackoff setting but "not allowed" to control the process by which your app is executed, excluding, literally, a person or group of people you work with who exert arbitrary constraints not founded in technical or legal underpinnings.
Not all software is open source, or open license, or, in fact, open in any way. Much software is licensed very restrictively. Not all software is packaged with the tools you suggest, and not all software may legally be FROM'd. Are we saying, publicly, here, that these applications cannot be run reliably under Kubernetes? That Kubernetes is not for these types of application licensing? That would be a bold statement indeed.
Can you provide a concrete example of where a Docker image cannot be FROM'd? I don't believe that's true. Even still, you could use volumes to inject a self-contained init process, then use …. No commercial license for a container image would be intentionally written such that it prevents running on Kubernetes, unless it's intentionally blocking Kubernetes usage, in which case it's a moot point.
@dkrieger Are you defending the current state of the software? Maybe this is OK as an untuneable parameter for legitimate crash cases, but for situations where a container is exiting with code 0, it seems clearly incorrect, or at best highly unintuitive, to apply this "crash backoff" policy.
Manipulating the job to not exit in some cases doesn't make any sense and can be a security vulnerability. E.g. we've set a CI runner to run as an ever-restarting deployment. Restarting with clean "storage" is a key security feature to prevent data leaking between jobs or to block an attack from persisting. The pod restart itself has lots of useful implications for security and reproducibility. Preventing people from taking advantage of them doesn't make sense.
Your use of the term "job" here is appropriate. Shouldn't this be modeled as a Kubernetes Job, which already does distinguish between exit codes? If you need a CI job scheduler workload, that could be modeled as a deployment with a service account that's allowed to schedule Kubernetes Jobs (see the sketch below). This feels relevant: https://xkcd.com/1172/. If the issue is performance, I don't think adding configuration parameters to better support what amounts to a hack (using deployments for ephemeral workloads, which threatens the stability of the cluster due to thrashing) is the right way to extend the APIs of a massively adopted post-beta platform, and I'm not convinced there's no way to facilitate performant, secure CI runners with the existing APIs. Let me add that I have absolutely no say over whether this feature is added; I'm just voicing a perspective that is likely shared by those who do, and I think commenters lobbying for it need to provide more compelling arguments than have been seen for the past ~6 years to change the status quo.
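For illustration, a minimal sketch of a single CI run modeled as a Job (name, image, and entrypoint are hypothetical); a scheduler deployment would create one of these per run via the API:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: ci-run-1            # hypothetical; a scheduler would generate unique names
spec:
  backoffLimit: 0           # don't retry a failed run; schedule a fresh Job instead
  template:
    spec:
      restartPolicy: Never  # Jobs distinguish exit codes: non-zero marks the run failed
      containers:
      - name: runner
        image: example/ci-runner:latest  # hypothetical image
        command: ["run-one-job"]         # hypothetical entrypoint: process exactly one job
```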
If you need a CI job scheduler workload, that could be modeled as a deployment with a service account that's allowed to schedule Kubernetes Jobs. I presume you are a Kubernetes developer used to building apps interacting in complex ways with Kubernetes APIs, hooks, etc. I'm not; I'm a Kubernetes user looking for an orchestration platform that solves complex infrastructure needs so I don't need to code that logic myself. To me, creating some custom piece of complex code to schedule jobs and monitor their completion so new ones start as soon as the last one completes is just not worth the complexity when the use case can be 99% covered with existing Kubernetes features. I don't remember exactly why I arrived at a Deployment a few years ago when setting this up, but looking at the Job docs now I assume it was to enable a constant number of runners and immediate restart on success. Achieving that with a Job doesn't seem possible (e.g. from the Job docs: "Only a RestartPolicy equal to Never or OnFailure is allowed") without such custom scheduling as I think you're suggesting. Deployments make constant rescheduling via replica count trivial, except for the issue of jobs running "too fast" for Kubernetes' default liking. Basically, Deployment is a much closer match to the use case than Job, unless there were some other kind of Job scheduling allowing "restarting forever". This is not about laziness or inability to build a custom Job scheduler. But I don't believe I should need to build a custom job scheduler to support this; based on the popularity of this thread, it's a reasonably common scenario.
I am somebody who designed and developed an init especially for Docker images, with the idea that, in contrast with traditional init systems, a Docker-specific init should in fact not restart processes itself but should propagate failures to the whole container, enabling the container supervisor to take proper corrective action (and to log it and collect stats). From our experience, an init inside containers that does too much hides errors from the supervisor. So I am a bit surprised that a supervisor like Kubernetes does not have more flexible configuration here about what corrective actions should occur and how. Respecting exit codes seems like something very obvious, as does being able to control the backoff.
If you scroll up, multiple ready-made solutions have been linked that will delete pods in CrashLoopBackOff state. The objection that followed, that logs won't be preserved, is not really compelling; if you're not aggregating logs in your cluster, that's the problem to solve. There are already multiple low-effort paths for you to accomplish what you want to accomplish. If this were a rampant, high-impact problem, there would be popular CRDs and operators for dealing with this, because k8s was thoughtfully designed for extensibility. There probably are some in the wild. For every use case described in here, there are plenty of other people with the same use case who have found suitable means to accomplish their ends with the current core k8s APIs. They've probably never even seen this GH issue.

This is currently the 5th most upvoted open issue in this repository, so I think your assertion here is false.

246 is nothing compared to k8s's user base. 6 years and no RFC (unless I missed it). People are voting with their feet that this isn't necessary; if it were, there would be popular operators. Plenty of things that are essential to running a k8s cluster in production are not even part of the core APIs. Having this feature seems like all upside and no downside from a user perspective, but it almost certainly would threaten the reliability of the cluster itself, and keeping it out of out-of-the-box APIs doesn't prevent people from consciously taking that risk. Unlike a dedicated process supervisor, k8s is responsible for the complete distributed system.

246 is nothing compared to k8s's user base. Sure, in absolute terms, but most people won't even act to upvote. The fact that it's top 5 says a lot more than the absolute number.
Lots of strong feelings on this one. I just wanted to chime in on what I think might be palatable. First, let's acknowledge that real users are really struggling with this. The current design was intended to balance the needs of the few (crashy apps) with the needs of the many (everyone else). Crashing and restarting in a really fast loop usually indicates that something is wrong. But clearly not ALWAYS. There are a lot of good ideas in this thread. Some that stand out to me, with some commentary.
NOTE: These are not all mutually exclusive! It's important to repeat - this is designed to protect the system from badly behaving apps. We don't want to remove that safety. If we add an explicit API to bypass it, we can at least let cluster admins install policies that say "oh no you don't". But APIs have costs that we pay forever. So my strong preference is to do something smart without any API surface, and only add API if we can't make that work. With some combination of options 1, 2, and 4, it seems like we could put a real dent in the problem, and then see what's left.

I really like this one. I've deleted pods before to force them to re-create because I didn't want to wait multiple minutes for a restart. But if I only had to wait 16s, or even 30s, I would probably just wait instead.

Also, it feels like this issue and #50375 are approaching the same problem from two different sides. I much prefer this formulation of the problem to "just let me clear the counter". Do we need both issues? @SergeyKanzhelev
I prefer the "make it configurable" option - allow the user to specify either a max backoff time, or a "non-crash exit code". Any single fixed solution is going to work for a few and not work for the many.
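For illustration only, what such per-container knobs might look like; the `restartBackoff` stanza is entirely hypothetical and does not exist in the Pod API:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker
spec:
  restartPolicy: Always
  containers:
  - name: worker
    image: example/worker:latest  # hypothetical image
    restartBackoff:               # hypothetical stanza, not a real field
      maxDelay: 10s               # cap the exponential backoff
      nonCrashExitCodes: [0]      # these exit codes restart immediately
```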
I have another use case that I don't think has been mentioned here (and which I will admit is at the far end of the curve). I'm working on some automotive hardware using K3s as a basis. Because it's an automotive application, ungraceful shutdowns are essentially a given, e.g. when the engine is cut or the ignition gets cranked. And when things start back up again, it's a very frequent occurrence for the cluster to treat our …
I feel that there are a lot of use cases here that are (or should be considered) valid, for which the current concrete backoff policy is unhelpful. I'm not really a Go dev. How do we go from "it needs a change" to "here's a change"?
We have an akka-cluster that only correctly forms if …. The 1% of failures result in an unrecoverable error (manual intervention needed), because the crash-loop backoff starts the pods at different times and eventually only one pod is started at a time. I was looking for a …. Why is this not a feature?
/cc @lauralorenz

We're currently killing pods to make them restart faster, which loses good info on why they crashlooped. Not awesome.
I can't help thinking that if a ticket has been active, with people consistently asking for a feature, for over 6 years, then perhaps it might be considered "wanted" and possibly even "due"? Or are we in a CrashLoopBackOff?
It's been observed that the current behaviour exists to prevent crashing software from adversely affecting a cluster. However, several valid use cases have been presented where a container may exit for a valid reason. It has been suggested that we should write and employ an operator. It's not an awful idea, but it is fundamentally wrong in this case. Operators are there to do things K8S doesn't do, not to magically fix aberrant behaviour. This behaviour is, at best, only appropriate for one bounded (albeit large) use case, while completely ignoring and frustrating tens of equally, or even more (IMHO), valid use cases. A behaviour described in such terms is often referred to using the shorthand terminology "broken": it is not suitable for the majority of observed use cases. That's not a missing behaviour, that's a bug. As such, a (rapid) bug fix would be the appropriate response.
😆 We're planning to publish a KEP soon. Hoping to get an alpha in v1.31.

I'd like to remind the audience at large that this is not an awesome way to communicate. The folks who work on this project have 100x more things that we would like to fix than can reasonably be fixed in a given unit of time. We're not ignoring issues because we want to punish you, or because we don't care. The fact that this has not been closed is a pretty straightforward acknowledgement that it is "real". I'm not going to play the usual "PRs welcome" response, because I know just how not-simple this issue is. Not everyone has the time to help solve a problem. That said, pounding the table and demanding attention, implying that we are ignoring it on purpose, doesn't help. If all you want is to "me, too" it, that's what the emoji votes are for. I spent 7 minutes writing this reply that I could have spent on a real bug :)
Is this a BUG REPORT or FEATURE REQUEST?: Feature request
/kind feature
What happened:
As part of a development workflow, I intentionally killed a container in a pod with `restartPolicy: Always`. The plan was to do this repeatedly, as a quick way to restart the container and clear old state (and, in Minikube, to load image changes). The container went into a crash-loop backoff, making this anything but a quick option.
What you expected to happen:
I expected there to be some configuration allowing me to disable, or at least tune the timing of, the CrashLoopBackoff.
How to reproduce it (as minimally and precisely as possible):
Create a pod with `restartPolicy: Always`, and intentionally exit a container repeatedly; a minimal repro manifest is sketched below.
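A minimal manifest that reproduces this, assuming any image with a shell (busybox here):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: crashloop-demo
spec:
  restartPolicy: Always
  containers:
  - name: exiter
    image: busybox
    # Exits cleanly after a few seconds; the kubelet still applies crash backoff.
    command: ["sh", "-c", "sleep 5; exit 0"]
```

After a few restarts, `kubectl get pod crashloop-demo` shows CrashLoopBackOff with increasing delays between restarts.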
Anything else we need to know?:
I see that the backoff timing parameters are hard-coded constants here:

- pkg/kubelet/kubelet.go, line 121 at commit 5f92042 (`MaxContainerBackOff`, 300 seconds)
- pkg/kubelet/kubelet.go, line 155 at commit 5f92042 (`backOffPeriod`, 10 seconds)
One might reasonably expect these to be configurable, at least at the kubelet level; say, by settings like the hypothetical sketch below. That would be sufficient for my use case (local development with fast restarts), and presumably useful as an advanced configuration setting for production workloads.
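For illustration only, a hypothetical KubeletConfiguration exposing such knobs; neither field exists today, and the comments note the currently hard-coded values:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Both fields below are hypothetical; today these values are constants in
# pkg/kubelet/kubelet.go and cannot be changed without rebuilding the kubelet.
containerBackOffPeriod: "1s"   # hypothetical; the base backoff period is 10s today
maxContainerBackOff: "30s"     # hypothetical; the backoff cap is 5m today
```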
A more aggressive change would allow tuning per-pod.
There are other options for my target workflow.
Environment:
Kubernetes version (use `kubectl version`): v1.8.0