Backoff Limit for Job does not work on Kubernetes 1.10.0 #62382
Comments
/sig scheduling
/sig job
/sig testing
We just updated to 1.10 and seem to be affected by this, too.
I had to add activeDeadlineSeconds to all Jobs and Hooks as a workaround...
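For reference, a minimal sketch of that workaround, assuming a hypothetical Job named `my-job`: `activeDeadlineSeconds` puts a hard cap on the Job's total runtime, so even while the broken controller ignores `backoffLimit` and keeps recreating failed pods, the Job is terminated once the deadline passes.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: my-job                 # hypothetical name for illustration
spec:
  backoffLimit: 4              # ignored by the buggy 1.10.0 controller
  activeDeadlineSeconds: 600   # hard cap: terminate the Job after 10 minutes regardless
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox         # stand-in image for illustration
        command: ["sh", "-c", "echo working"]
```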
Can confirm with 1.10.1.
Cause appears to be that the …
Looking at job_controller.go#L249: we force an immediate resync when we see a pod that has failed, which also clears the key in our queue, but that in turn makes us lose track of the number of requeues.
This bug killed my test cluster over the long weekend (with 20,000+ pods). Luckily we don't yet use CronJobs in prod.
Killed my cluster too :( Still present in Kubernetes 1.10.2.
This issue also can't be mitigated with a resource quota, as terminated pods are not counted against it: #51150
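To make that limitation concrete, here is a sketch (with a hypothetical name) of the kind of quota one might hope would cap the pile-up. The `pods` count only covers pods in a non-terminal state, so the thousands of Failed pods the buggy controller leaves behind never count against it:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pod-cap       # hypothetical name for illustration
spec:
  hard:
    pods: "100"       # counts only non-terminal pods; Failed pods slip past it
```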
This is a pretty terrible bug and it definitely still exists in v1.10.2.
/priority important-soon
@soltysh can you please take a look?
Looking right now...
I've opened #63650 to address the issue. Sorry for the trouble, y'all.
@soltysh — pulled the patch into our current 1.10.2 release and it works! Thanks :)
Any ETA for 1.10.5 yet? Even a ballpark date would be fine.
I'm planning to cut 1.10.5 tomorrow, assuming all tests are green. In case of any last-minute problems it may slip by a day or two, but so far everything looks good for tomorrow.
Thank you to all involved! I really appreciate this being rolled back into 1.10.
Given that the fix is already on the 1.11 and 1.10 branches, this issue should be closed. Thanks for fixing this!
/close
@lbguilherme @soltysh I don't believe this is fixed? Unless I'm doing something wrong. Both resource specs have a … This is in GKE, …
I have the same bug on 1.9, with an OpenShift cluster: …
Any plan on backporting it to 1.9 as well?
@zoobab you are likely encountering a different issue, as the root cause of this bug was introduced in Kubernetes 1.10, not 1.9. Also, Kubernetes 1.9 is now outside of the support window, so no fixes will be backported to the upstream branch.
I am also seeing this bug in Kubernetes 1.12.1.
`backoffLimit` is STILL not taken into account to stop the creation of pods in case of failure?

kubectl version: …
minikube version: v0.30.0

Job YAML file:

```yaml
apiVersion: batch/v1
…
```

Listing pods and job resources:

```
NAME   READY   STATUS   RESTARTS   AGE
…
NAME   DESIRED   SUCCESSFUL   AGE
…
```

Please guide.
Just update to something newer than 1.10.0. That version is ages old and I'm sure it is fixed within the latest 1.10.x release.
In version "v1.10.11" the …
Though we have set the backoffLimit of a Job to 4, pod eviction is still attempted more than 4 times, i.e. 5 times (in one instance it was 3 times). The pod is evicted for a genuine reason: the node's disk space is exhausted when the application within the pod executes, and we know why that happens. The point is that the Job's pod should be attempted 4 times, not 5.
There the pod eviction happens as per the stated literature on eviction with backoffLimit, and we have no issues with that. But since we upgraded to v1.10.11 recently (about two days ago), we have been seeing the reported problem. We would appreciate it if someone on this forum could advise on the root cause and a solution.
1.10.11 is required to avoid a critical security bug.
As noted above, 1.11+ is supported by the community until 1.14 is out:
@shanit-saha I experienced the same behavior with …
We are experiencing the same issue with version 1.11.9-docker-1.
In the meantime we updated the cluster; we are still experiencing the issue.
Hi everyone! 👋 This issue is for a specific problem in Kubernetes 1.10. Right now, the upstream Kubernetes project supports the following three versions: 1.13, 1.14, and 1.15. If you are experiencing a similar issue under any of those versions, please open a new issue and provide all the details requested. As the specific bug in this issue is resolved and on an unsupported version, I'm going to lock this closed issue. Thanks!
Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

`.spec.backoffLimit` for a Job does not work on Kubernetes 1.10.0.

What you expected to happen:

`.spec.backoffLimit` should limit the number of times a pod is restarted when running inside a Job.

How to reproduce it (as minimally and precisely as possible):

Use this resource file:
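The original resource file is not preserved above; the following is a minimal sketch of a Job of the kind described, with a hypothetical name and a container that always fails so the backoff limit is exercised:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: backoff-test             # hypothetical name for illustration
spec:
  backoffLimit: 2                # the Job should fail after a few attempts, not retry forever
  template:
    spec:
      restartPolicy: Never       # each failure produces a new pod for the controller to count
      containers:
      - name: fail
        image: busybox
        command: ["sh", "-c", "exit 1"]   # always fails
```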
If the job is created in Kubernetes 1.9, it soon fails:

When the job is created in Kubernetes 1.10, it is restarted infinitely:
Anything else we need to know?:

Environment:
- Kubernetes version (use `kubectl version`): 1.10.0
- Kernel (e.g. `uname -a`): Linux