
Backoff Limit for Job does not work on Kubernetes 1.10.0 #62382

Closed
keimoon opened this issue Apr 11, 2018 · 54 comments

@keimoon commented Apr 11, 2018

Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug

What happened:

.spec.backoffLimit for a Job does not work on Kubernetes 1.10.0

What you expected to happen:

.spec.backoffLimit should limit the number of times a pod is restarted when running inside a job

How to reproduce it (as minimally and precisely as possible):

Use this resource file:

apiVersion: batch/v1
kind: Job
metadata:
  name: error
spec:
  backoffLimit: 1
  template:
    metadata:
      name: job
    spec:
      restartPolicy: Never
      containers:
        - name: job
          image: ubuntu:16.04
          args:
            - sh
            - -c
            - sleep 5; false

If the job is created in Kubernetes 1.9, it soon fails as expected:

....
status:
    conditions:
    - lastProbeTime: 2018-04-11T10:20:31Z
      lastTransitionTime: 2018-04-11T10:20:31Z
      message: Job has reach the specified backoff limit
      reason: BackoffLimitExceeded
      status: "True"
      type: Failed
    failed: 2
    startTime: 2018-04-11T10:20:00Z
...

When the job is created in Kubernetes 1.10, its pods are restarted indefinitely:

...
status:
  active: 1
  failed: 8
  startTime: 2018-04-11T10:37:48Z
...

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.10.0
  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Ubuntu 16.04
  • Kernel (e.g. uname -a): Linux
  • Install tools: Kubeadm
  • Others:

@keimoon (Author) commented Apr 11, 2018

/sig scheduling

@keimoon (Author) commented Apr 17, 2018

/sig job
/sig lifecycle
/sig apps

@wgliang (Member) commented Apr 18, 2018

/sig testing

@jangrewe commented Apr 25, 2018

We just updated to 1.10 and seem to be affected by this, too.

@xray33 commented Apr 25, 2018

I had to add activeDeadlineSeconds to all Jobs and Hooks as a workaround...
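
Roughly what that looks like, in case it helps anyone else; this is a minimal sketch based on the reproduction manifest above, and the 10-minute deadline is just a placeholder value to tune for your job's expected runtime. activeDeadlineSeconds puts a hard cap on the Job's total runtime, so the controller still marks it failed even while backoffLimit is being ignored:

apiVersion: batch/v1
kind: Job
metadata:
  name: error
spec:
  backoffLimit: 1
  activeDeadlineSeconds: 600   # hard cap: fail the Job after 10 minutes, retries or not
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: job
          image: ubuntu:16.04
          args:
            - sh
            - -c
            - sleep 5; false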

@CallMeFoxie (Contributor) commented Apr 26, 2018

Can confirm with 1.10.1

@ceshihao (Contributor) commented Apr 27, 2018

It seems to be introduced by #60985

cc @soltysh

@foxish (Member) commented May 4, 2018

The cause appears to be that the syncJob method gets exactly one attempt to update the status with the Failed condition; if that update fails due to a resource conflict, the condition is never added. The controller code uses requeues to track failures, but if a sync loop then executes successfully, it loses track of the item ever having been requeued. Looking into a fix now.

@foxish (Member) commented May 4, 2018

Looking at job_controller.go#L249: we force an immediate resync when we see a pod that has failed, which also clears the key in our queue, and that in turn loses the count of requeues.

@drdivano commented May 7, 2018

This bug killed my test cluster over the long weekend (with 20000+ pods). Luckily we don't yet use CronJobs in prod.

@nerumo commented May 7, 2018

Killed my cluster too :( Still present in Kubernetes 1.10.2.

@hanikesn commented May 7, 2018

This issue also can't be mitigated with resource quotas, since terminated pods don't count against the quota: #51150

@nyxi commented May 9, 2018

This is a pretty terrible bug and it definitely still exists in v1.10.2

@dims (Member) commented May 9, 2018

/priority important-soon

@dims (Member) commented May 9, 2018

@soltysh can you please take a look?

@soltysh (Contributor) commented May 10, 2018

Looking right now...

@soltysh (Contributor) commented May 10, 2018

I've opened #63650 to address the issue. Sorry for the troubles y'all.

@CallMeFoxie (Contributor) commented May 10, 2018

@soltysh I pulled the patch into the current 1.10.2 release and it works! Thanks :)

@jangrewe commented Jun 19, 2018

Any ETA for 1.10.5 yet? Even a ballpark date would be fine.

@MaciekPytel (Contributor) commented Jun 19, 2018

I'm planning to cut 1.10.5 tomorrow, assuming all tests are green. In case of any last-minute problems it may slip by a day or two, but so far everything looks good for tomorrow.

@nickschuch commented Jun 19, 2018

Thank you to all involved! I really appreciate this being backported to 1.10.

@lbguilherme commented Jun 20, 2018

Given that the fix is already on the 1.11 and 1.10 branches, this issue should be closed.

Thanks for fixing this!

@soltysh (Contributor) commented Jun 21, 2018

/close

@walterdolce commented Nov 14, 2018

@lbguilherme @soltysh I don't believe this is fixed?

Unless I'm doing something wrong..

NAME                                                          SCHEDULE    SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob.batch/sqlite-database-backup-generator                * * * * *   False     1        2m              17m
cronjob.batch/app-invoices-backup-generator   * * * * *   False     1        1m              17m

NAME                                                                 DESIRED   SUCCESSFUL   AGE
job.batch/sqlite-database-backup-generator-1542233280                1         0            14m
job.batch/sqlite-database-backup-generator-1542233460                1         0            11m
job.batch/sqlite-database-backup-generator-1542233580                1         0            9m
job.batch/sqlite-database-backup-generator-1542233760                1         0            7m
job.batch/sqlite-database-backup-generator-1542233880                1         0            4m
job.batch/sqlite-database-backup-generator-1542234060                1         0            1m
job.batch/app-invoices-backup-generator-1542233280   1         0            14m
job.batch/app-invoices-backup-generator-1542233460   1         0            11m
job.batch/app-invoices-backup-generator-1542233640   1         0            9m
job.batch/app-invoices-backup-generator-1542233760   1         0            6m
job.batch/app-invoices-backup-generator-1542233940   1         0            3m
job.batch/app-invoices-backup-generator-1542234120   1         0            1m

NAME                                                                 READY   STATUS    RESTARTS   AGE
pod/nfs-server-75ccf5786f-p4bjh                                      1/1     Running   0          23h
pod/sqlite-database-backup-generator-1542233280-4n8tr                0/1     Error     0          13m
pod/sqlite-database-backup-generator-1542233280-mszhp                0/1     Error     0          14m
pod/sqlite-database-backup-generator-1542233280-p24s7                0/1     Error     0          14m
pod/sqlite-database-backup-generator-1542233280-qtkfl                0/1     Error     0          13m
pod/sqlite-database-backup-generator-1542233280-tg4ht                0/1     Error     0          14m
pod/sqlite-database-backup-generator-1542233460-2xrqj                0/1     Error     0          11m
pod/sqlite-database-backup-generator-1542233460-7m2dp                0/1     Error     0          11m
pod/sqlite-database-backup-generator-1542233460-s7hrf                0/1     Error     0          11m
pod/sqlite-database-backup-generator-1542233460-vdnsm                0/1     Error     0          11m
pod/sqlite-database-backup-generator-1542233580-4s7dj                0/1     Error     0          8m
pod/sqlite-database-backup-generator-1542233580-j7mm6                0/1     Error     0          9m
pod/sqlite-database-backup-generator-1542233580-kqfhf                0/1     Error     0          9m
pod/sqlite-database-backup-generator-1542233580-n2bvs                0/1     Error     0          9m
pod/sqlite-database-backup-generator-1542233580-t4l94                0/1     Error     0          9m
pod/sqlite-database-backup-generator-1542233760-6bdz2                0/1     Error     0          6m
pod/sqlite-database-backup-generator-1542233760-6bntx                0/1     Error     0          7m
pod/sqlite-database-backup-generator-1542233760-cbrcx                0/1     Error     0          5m
pod/sqlite-database-backup-generator-1542233760-kllnp                0/1     Error     0          7m
pod/sqlite-database-backup-generator-1542233760-nk2t5                0/1     Error     0          6m
pod/sqlite-database-backup-generator-1542233880-gxs24                0/1     Error     0          4m
pod/sqlite-database-backup-generator-1542233880-mpfvl                0/1     Error     0          4m
pod/sqlite-database-backup-generator-1542233880-vchgm                0/1     Error     0          3m
pod/sqlite-database-backup-generator-1542233880-w4g4c                0/1     Error     0          3m

Both resource specs have a backoffLimit: 4 and restartPolicy: Never.

This is on GKE, 1.11.2-gke.18.
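
For reference, the specs look roughly like this; names, schedule, image, and command below are placeholders standing in for the real ones above, and backoffLimit sits under spec.jobTemplate.spec (on the Job template), not on the CronJob spec itself:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: sqlite-database-backup-generator
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      backoffLimit: 4                # on the Job template's spec
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: backup
              image: example/backup:latest           # placeholder image
              command: ["/bin/sh", "-c", "run-backup"]   # placeholder command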

@zoobab commented Dec 18, 2018

I have the same bug on 1.9, with an OpenShift cluster:

$ oc version
oc v3.9.0+ba7faec-1
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://myserver
openshift v3.9.0+ba7faec-1
kubernetes v1.9.1+a0ce1bc657

Any plans to backport it to 1.9 as well?

@ixdy (Member) commented Dec 18, 2018

@zoobab you likely are encountering a different issue, as the root cause of this bug was introduced in kubernetes 1.10, not 1.9.

Also, kubernetes 1.9 is now outside of the support window, so no fixes will be backported to the upstream branch.

@JacoBlunt commented Dec 25, 2018

I also see this bug in Kubernetes 1.12.1:
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.1", GitCommit:"4ed3216f3ec431b140b1d899130a69fc671678f4", GitTreeState:"clean", BuildDate:"2018-10-05T16:46:06Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"} Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.1", GitCommit:"4ed3216f3ec431b140b1d899130a69fc671678f4", GitTreeState:"clean", BuildDate:"2018-10-05T16:36:14Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
When pods fail: with parallelism: 2 and backoffLimit: 0, it destroys the last failed pod and leaves only one pod behind; with parallelism: 2 and backoffLimit: 1, it sometimes keeps 3 pods and sometimes kills the last failed pod, leaving 2.
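
A minimal sketch of the kind of Job spec I mean (the name, image, and command are placeholders; the container just exits non-zero on purpose):

apiVersion: batch/v1
kind: Job
metadata:
  name: parallel-fail-test
spec:
  parallelism: 2
  backoffLimit: 0        # also tried backoffLimit: 1 to compare the leftover pod counts
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: fail
          image: busybox
          command:
            - sh
            - -c
            - exit 1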

@aamirpinger commented Feb 4, 2019

Is backoffLimit STILL not taken into account to stop the creation of pods when they fail?


kubectl version:
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.3", GitCommit:"435f92c719f279a3a67808c80521ea17d5715c66", GitTreeState:"clean", BuildDate:"2018-11-26T12:57:14Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-26T16:44:10Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

minikube version: v0.30.0

Job YAML file

apiVersion: batch/v1
kind: Job
metadata:
  name: whalesay1
spec:
  template:
    spec:
      containers:
        - name: whalesay
          image: docker/whalesay
          command: ["cowsa", "This is a Kubernetes Job!"]
      restartPolicy: Never
  backoffLimit: 2

Listing pods and job resource

NAME READY STATUS RESTARTS AGE
pod/whalesay1-6lt4b 0/1 ContainerCannotRun 0 59s
pod/whalesay1-86tlz 0/1 ContainerCannotRun 0 48s
pod/whalesay1-8v8nr 0/1 ContainerCannotRun 0 1m
pod/whalesay1-8w9k2 0/1 ContainerCannotRun 0 1m
pod/whalesay1-8xxtr 0/1 ContainerCannotRun 0 53s
pod/whalesay1-dl4l8 0/1 ContainerCannotRun 0 1m
pod/whalesay1-fhlhg 0/1 ContainerCannotRun 0 37s
pod/whalesay1-fz8mv 0/1 ContainerCannotRun 0 26s
pod/whalesay1-g2mt6 0/1 ContainerCannotRun 0 9s
pod/whalesay1-hgj95 0/1 ContainerCannotRun 0 1m
pod/whalesay1-hrsrd 0/1 ContainerCannotRun 0 1m
pod/whalesay1-hzm42 0/1 ContainerCannotRun 0 1m
pod/whalesay1-mvctw 0/1 ContainerCannotRun 0 1m
pod/whalesay1-n7zq9 0/1 ContainerCannotRun 0 2m
pod/whalesay1-nl89l 0/1 ContainerCannotRun 0 31s
pod/whalesay1-nqcmg 0/1 ContainerCreating 0 2s
pod/whalesay1-p85rm 0/1 ContainerCannotRun 0 2m
pod/whalesay1-pjzmd 0/1 ContainerCannotRun 0 2m
pod/whalesay1-q2k7p 0/1 ContainerCannotRun 0 43s
pod/whalesay1-rhvz7 0/1 ContainerCannotRun 0 1m
pod/whalesay1-rzrtg 0/1 ContainerCannotRun 0 1m
pod/whalesay1-xp5tr 0/1 ContainerCannotRun 0 1m
pod/whalesay1-zn2s6 0/1 ContainerCannotRun 0 1m
pod/whalesay1-zq4g9 0/1 ContainerCannotRun 0 14s
pod/whalesay1-ztxdv 0/1 ContainerCannotRun 0 19s

NAME DESIRED SUCCESSFUL AGE
job.batch/whalesay1 1 0 2m

Please advise.

@CallMeFoxie (Contributor) commented Feb 4, 2019

Just update to something newer than 1.10.0. That version is ages old and I'm sure it is fixed within the latest 1.10.x release.
(Which is also out of support so update to 1.12/1.13 ASAP.)

@shanit-saha commented Feb 6, 2019

In version "v1.10.11" the backoffLimit though set to 4 for a Job's pod, the pod eviction happens for 5 times and at few times it was for 3 times. Also to note that we also have applied completions: 1 and parallelism: 1
Below is the version details of Kubernetes that we are using

[xuser@tyl20ks9px3as41 ~]$ kubectl version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.11", GitCommit:"637c7e288581ee40ab4ca210618a89a555b6e7e9", GitTreeState:"clean", BuildDate:"2018-11-26T14:38:32Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.12", GitCommit:"c757b93cf034d49af3a3b8ecee3b9639a7a11df7", GitTreeState:"clean", BuildDate:"2018-12-19T11:04:29Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Though we have set the Job's backoffLimit to 4, the pod eviction still happens more than 4 times, i.e. 5 times (and in one instance only 3 times). The evictions happen for a genuine reason: the node's disk space is exhausted when the application inside the pod runs, and we know why that happens. The point is that the Job's pod should be attempted 4 times, not 5.
Strangely, we do not see this problem in another environment running v1.10.5. The version details of that environment are below:

Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.5", GitCommit:"32ac1c9073b132b8ba18aa830f46b77dcceb0723", GitTreeState:"clean", BuildDate:"2018-06-21T11:46:00Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.5", GitCommit:"32ac1c9073b132b8ba18aa830f46b77dcceb0723", GitTreeState:"clean", BuildDate:"2018-06-21T11:34:22Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

There, pod eviction behaves as documented for backoffLimit, and we have no issues. It is only since upgrading to v1.10.11 recently (about two days ago) that we are seeing the reported problem.

Would appreciate it if someone in this forum could advise on the root cause and a solution. Also, does v1.10.5 have any known flaw, such as a security loophole?

@jstric01 commented Feb 6, 2019

1.10.11 is required to avoid a critical security bug.
https://www.zdnet.com/article/kubernetes-first-major-security-hole-discovered/
I imagine we will need to update to fix this, but we will need to find out which of the patched versions has the fix we need. We can then roll this through the environments as a staged update.

@BenTheElder (Member) commented Feb 6, 2019

As noted above, 1.11+ is supported by the community until 1.14 is out:
https://kubernetes.io/docs/setup/version-skew-policy/#supported-versions

@stojan-jovic commented May 20, 2019

@shanit-saha I experienced the same behavior with v1.10.7. Can somebody confirm that this is not happening with versions >= 1.11.x?

@zakthan commented Jul 30, 2019

We are experiencing the same issue with version 1.11.9-docker-1.
In case it is relevant, we have two environments: one is Ubuntu 18.04.2 LTS and the other is CentOS Linux release 7.5.1804.
It happens only in the Ubuntu environment.

@stojan-jovic commented Jul 30, 2019

In the meantime we updated the cluster:
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.7", GitCommit:"65ecaf0671341311ce6aea0edab46ee69f65d59e", GitTreeState:"clean", BuildDate: "2019-01-24T19:32:00Z", GoVersion:"go1.10.7", Compiler:"gc", Platform:"linux/amd64"} Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.8", GitCommit:"a89f8c11a5f4f132503edbc4918c98518fd504e3", GitTreeState:"clean", BuildDate: "2019-04-23T04:41:47Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

We are still experiencing the issue.

@cblecker (Member) commented Jul 30, 2019

Hi everyone! 👋

This issue is for a specific problem in Kubernetes 1.10. Right now, the upstream Kubernetes project supports the following three versions: 1.13, 1.14, 1.15. If you are experiencing a similar issue under any of those three versions, then please open a new issue and provide all the details requested.

As the specific bug in this issue is resolved and on an unsupported version, I'm going to lock this closed issue. Thanks!

The kubernetes organization locked this issue as resolved and limited conversation to collaborators on Jul 30, 2019.