
Backoff Limit for Job does not work on Kubernetes 1.10.0 #62382

Closed
keimoon opened this Issue Apr 11, 2018 · 50 comments

@keimoon commented Apr 11, 2018

Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug

What happened:

.spec.backoffLimit for a Job does not work on Kubernetes 1.10.0

What you expected to happen:

.spec.backoffLimit should limit the number of times a pod is retried when it fails inside a Job

How to reproduce it (as minimally and precisely as possible):

Use this resource file:

apiVersion: batch/v1
kind: Job
metadata:
  name: error
spec:
  backoffLimit: 1
  template:
    metadata:
      name: job
    spec:
      restartPolicy: Never
      containers:
        - name: job
          image: ubuntu:16.04
          args:
            - sh
            - -c
            - sleep 5; false

If the Job is created on Kubernetes 1.9, it soon fails:

....
status:
    conditions:
    - lastProbeTime: 2018-04-11T10:20:31Z
      lastTransitionTime: 2018-04-11T10:20:31Z
      message: Job has reach the specified backoff limit
      reason: BackoffLimitExceeded
      status: "True"
      type: Failed
    failed: 2
    startTime: 2018-04-11T10:20:00Z
...

When the Job is created on Kubernetes 1.10, the pod is recreated indefinitely:

...
status:
  active: 1
  failed: 8
  startTime: 2018-04-11T10:37:48Z
...

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.10.0
  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Ubuntu 16.04
  • Kernel (e.g. uname -a): Linux
  • Install tools: Kubeadm
  • Others:
@keimoon (Author) commented Apr 11, 2018

/sig scheduling

@keimoon (Author) commented Apr 17, 2018

/sig job
/sig lifecycle
/sig apps

@wgliang (Member) commented Apr 18, 2018

/sig testing

@jangrewe commented Apr 25, 2018

We just updated to 1.10 and seem to be affected by this, too.

@xray33 commented Apr 25, 2018

I had to add activeDeadlineSeconds to all Jobs and Hooks as a workaround...
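
For reference, a minimal sketch of that workaround applied to the reproduction Job from the report above; the 120-second deadline is an arbitrary illustrative value, not a recommendation:

apiVersion: batch/v1
kind: Job
metadata:
  name: error
spec:
  backoffLimit: 1
  # Workaround sketch: cap the Job's total runtime so the controller
  # terminates it even while backoffLimit is being ignored.
  # The value below is illustrative only.
  activeDeadlineSeconds: 120
  template:
    metadata:
      name: job
    spec:
      restartPolicy: Never
      containers:
        - name: job
          image: ubuntu:16.04
          args:
            - sh
            - -c
            - sleep 5; false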

@CallMeFoxie (Contributor) commented Apr 26, 2018

Can confirm with 1.10.1

@ceshihao (Contributor) commented Apr 27, 2018

It seems to have been introduced by #60985.

cc @soltysh

@foxish (Member) commented May 4, 2018

The cause appears to be that the syncJob method gets exactly one attempt to update the status with the failed condition; if that update fails due to a resource conflict, the condition is never added. The controller uses requeues to track failures, but once a sync loop executes successfully it loses track of the failed item ever having been requeued. Looking into a fix now.

@foxish (Member) commented May 4, 2018

Looking at job_controller.go#L249: when we see a pod that has failed, we force an immediate resync, which also clears the key in our queue, and that in turn makes us lose the requeue count we rely on to track the number of failures.

@drdivano commented May 7, 2018

This bug killed my test cluster over the long weekend (with 20000+ pods). Luckily we don't yet use CronJobs in prod.

@nerumo commented May 7, 2018

Killed my cluster too :( Still present in Kubernetes 1.10.2.

@hanikesn commented May 7, 2018

This issue also can't be mitigated with quotas, because terminated pods are not counted against the resource quota: #51150

@nyxi commented May 9, 2018

This is a pretty terrible bug and it definitely still exists in v1.10.2

@dims (Member) commented May 9, 2018

/priority important-soon

@dims (Member) commented May 9, 2018

@soltysh can you please take a look?

@soltysh (Contributor) commented May 10, 2018

Looking right now...

@soltysh (Contributor) commented May 10, 2018

I've opened #63650 to address the issue. Sorry for the troubles y'all.

@CallMeFoxie (Contributor) commented May 10, 2018

@soltysh I pulled the patch into the current 1.10.2 release and it works! Thanks :)

@XericZephyr commented Jun 18, 2018

Having the same issue. Waiting for update.

@cblecker (Member) commented Jun 18, 2018

This fix has been cherry picked back to 1.10, and should show up in the next patch release (1.10.5).

@jangrewe commented Jun 19, 2018

Any ETA for 1.10.5 yet? Even a ballpark date would be fine.

@MaciekPytel (Contributor) commented Jun 19, 2018

I'm planning to cut 1.10.5 tomorrow, assuming all tests will be green. In case of any last minute problems it may slip by a day or two, but so far everything looks good for tomorrow.

@nickschuch commented Jun 19, 2018

Thank you to all involved! I really appreciate this being rolled back into 1.10.

@lbguilherme commented Jun 20, 2018

Given that the fix is already on 1.11 and on 1.10 branches, this issue should be closed.

Thanks for fixing this!

@soltysh (Contributor) commented Jun 21, 2018

/close

Workloads automation moved this from Backlog to Done Jun 21, 2018

arithehun added a commit to freenome/k8s-jobs that referenced this issue Jul 3, 2018

Allow users of kbatch to specify a backoff limit
Note: retry limit > 0 is broken for kubernetes < 1.10.5:
kubernetes/kubernetes#62382. So currently you
either set it to 0 for no retries or > 0 for infinite retries

arithehun added a commit to freenome/k8s-jobs that referenced this issue Jul 3, 2018

Allow users of kbatch to specify a backoff limit (#12)
* Allow users of kbatch to specify a backoff limit

Note: retry limit > 0 is broken for kubernetes < 1.10.5:
kubernetes/kubernetes#62382. So currently you
either set it to 0 for no retries or > 0 for infinite retries

* add version

* fix lint

* cast to int

@BEllis referenced this issue Oct 2, 2018: backoffLimit bug in Kubernetes #3250 (closed)
@walterdolce commented Nov 14, 2018

@lbguilherme @soltysh I don't believe this is fixed?

Unless I'm doing something wrong..

NAME                                             SCHEDULE    SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob.batch/sqlite-database-backup-generator   * * * * *   False     1        2m              17m
cronjob.batch/app-invoices-backup-generator      * * * * *   False     1        1m              17m

NAME                                                     DESIRED   SUCCESSFUL   AGE
job.batch/sqlite-database-backup-generator-1542233280    1         0            14m
job.batch/sqlite-database-backup-generator-1542233460    1         0            11m
job.batch/sqlite-database-backup-generator-1542233580    1         0            9m
job.batch/sqlite-database-backup-generator-1542233760    1         0            7m
job.batch/sqlite-database-backup-generator-1542233880    1         0            4m
job.batch/sqlite-database-backup-generator-1542234060    1         0            1m
job.batch/app-invoices-backup-generator-1542233280       1         0            14m
job.batch/app-invoices-backup-generator-1542233460       1         0            11m
job.batch/app-invoices-backup-generator-1542233640       1         0            9m
job.batch/app-invoices-backup-generator-1542233760       1         0            6m
job.batch/app-invoices-backup-generator-1542233940       1         0            3m
job.batch/app-invoices-backup-generator-1542234120       1         0            1m

NAME                                                                 READY   STATUS    RESTARTS   AGE
pod/nfs-server-75ccf5786f-p4bjh                                      1/1     Running   0          23h
pod/sqlite-database-backup-generator-1542233280-4n8tr                0/1     Error     0          13m
pod/sqlite-database-backup-generator-1542233280-mszhp                0/1     Error     0          14m
pod/sqlite-database-backup-generator-1542233280-p24s7                0/1     Error     0          14m
pod/sqlite-database-backup-generator-1542233280-qtkfl                0/1     Error     0          13m
pod/sqlite-database-backup-generator-1542233280-tg4ht                0/1     Error     0          14m
pod/sqlite-database-backup-generator-1542233460-2xrqj                0/1     Error     0          11m
pod/sqlite-database-backup-generator-1542233460-7m2dp                0/1     Error     0          11m
pod/sqlite-database-backup-generator-1542233460-s7hrf                0/1     Error     0          11m
pod/sqlite-database-backup-generator-1542233460-vdnsm                0/1     Error     0          11m
pod/sqlite-database-backup-generator-1542233580-4s7dj                0/1     Error     0          8m
pod/sqlite-database-backup-generator-1542233580-j7mm6                0/1     Error     0          9m
pod/sqlite-database-backup-generator-1542233580-kqfhf                0/1     Error     0          9m
pod/sqlite-database-backup-generator-1542233580-n2bvs                0/1     Error     0          9m
pod/sqlite-database-backup-generator-1542233580-t4l94                0/1     Error     0          9m
pod/sqlite-database-backup-generator-1542233760-6bdz2                0/1     Error     0          6m
pod/sqlite-database-backup-generator-1542233760-6bntx                0/1     Error     0          7m
pod/sqlite-database-backup-generator-1542233760-cbrcx                0/1     Error     0          5m
pod/sqlite-database-backup-generator-1542233760-kllnp                0/1     Error     0          7m
pod/sqlite-database-backup-generator-1542233760-nk2t5                0/1     Error     0          6m
pod/sqlite-database-backup-generator-1542233880-gxs24                0/1     Error     0          4m
pod/sqlite-database-backup-generator-1542233880-mpfvl                0/1     Error     0          4m
pod/sqlite-database-backup-generator-1542233880-vchgm                0/1     Error     0          3m
pod/sqlite-database-backup-generator-1542233880-w4g4c                0/1     Error     0          3m

Both resource specs have a backoffLimit: 4 and restartPolicy: Never.

This is on GKE, 1.11.2-gke.18.
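
Worth double-checking in the CronJob case: backoffLimit is a JobSpec field, so it has to sit under jobTemplate.spec rather than on the CronJob spec itself. A hedged sketch of how the setup described above would typically be written; the name and schedule are taken from the output above, and the container details are placeholders:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: sqlite-database-backup-generator
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      backoffLimit: 4          # must live under jobTemplate.spec
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: backup           # placeholder name
              image: example/backup  # hypothetical image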

@zoobab commented Dec 18, 2018

I have the same bug on 1.9, with an OpenShift cluster:

$ oc version
oc v3.9.0+ba7faec-1
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://myserver
openshift v3.9.0+ba7faec-1
kubernetes v1.9.1+a0ce1bc657

Any plans to backport it to 1.9 as well?

@ixdy (Member) commented Dec 18, 2018

@zoobab you are likely encountering a different issue, as the root cause of this bug was introduced in Kubernetes 1.10, not 1.9.

Also, kubernetes 1.9 is now outside of the support window, so no fixes will be backported to the upstream branch.

@JacoBlunt commented Dec 25, 2018

I also see this bug in Kubernetes 1.12.1:

Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.1", GitCommit:"4ed3216f3ec431b140b1d899130a69fc671678f4", GitTreeState:"clean", BuildDate:"2018-10-05T16:46:06Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.1", GitCommit:"4ed3216f3ec431b140b1d899130a69fc671678f4", GitTreeState:"clean", BuildDate:"2018-10-05T16:36:14Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}

When a pod fails: with parallelism: 2 and backoffLimit: 0, the controller destroys the last failed pod and leaves only one pod behind; with parallelism: 2 and backoffLimit: 1, it sometimes keeps 3 pods and sometimes kills the last failed pod, leaving 2. A minimal spec of this setup is sketched below.
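A sketch of that setup for anyone who wants to reproduce it; the Job name and the deliberately failing command are placeholders, not the reporter's actual manifest:

apiVersion: batch/v1
kind: Job
metadata:
  name: parallel-fail        # hypothetical name
spec:
  parallelism: 2
  backoffLimit: 0            # also try 1 to compare the two behaviours described above
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: fail
          image: busybox
          command: ["sh", "-c", "exit 1"]   # placeholder command that always fails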

@aamirpinger commented Feb 4, 2019

Is backoffLimit STILL not taken into account to stop the creation of pods in case of failure?


kubectl version:
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.3", GitCommit:"435f92c719f279a3a67808c80521ea17d5715c66", GitTreeState:"clean", BuildDate:"2018-11-26T12:57:14Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-26T16:44:10Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

minikube version: v0.30.0

Job YAML file

apiVersion: batch/v1
kind: Job
metadata:
  name: whalesay1
spec:
  template:
    spec:
      containers:
        - name: whalesay
          image: docker/whalesay
          command: ["cowsa", "This is a Kubernetes Job!"]
      restartPolicy: Never
  backoffLimit: 2

Listing pods and job resource

NAME                  READY   STATUS               RESTARTS   AGE
pod/whalesay1-6lt4b   0/1     ContainerCannotRun   0          59s
pod/whalesay1-86tlz   0/1     ContainerCannotRun   0          48s
pod/whalesay1-8v8nr   0/1     ContainerCannotRun   0          1m
pod/whalesay1-8w9k2   0/1     ContainerCannotRun   0          1m
pod/whalesay1-8xxtr   0/1     ContainerCannotRun   0          53s
pod/whalesay1-dl4l8   0/1     ContainerCannotRun   0          1m
pod/whalesay1-fhlhg   0/1     ContainerCannotRun   0          37s
pod/whalesay1-fz8mv   0/1     ContainerCannotRun   0          26s
pod/whalesay1-g2mt6   0/1     ContainerCannotRun   0          9s
pod/whalesay1-hgj95   0/1     ContainerCannotRun   0          1m
pod/whalesay1-hrsrd   0/1     ContainerCannotRun   0          1m
pod/whalesay1-hzm42   0/1     ContainerCannotRun   0          1m
pod/whalesay1-mvctw   0/1     ContainerCannotRun   0          1m
pod/whalesay1-n7zq9   0/1     ContainerCannotRun   0          2m
pod/whalesay1-nl89l   0/1     ContainerCannotRun   0          31s
pod/whalesay1-nqcmg   0/1     ContainerCreating    0          2s
pod/whalesay1-p85rm   0/1     ContainerCannotRun   0          2m
pod/whalesay1-pjzmd   0/1     ContainerCannotRun   0          2m
pod/whalesay1-q2k7p   0/1     ContainerCannotRun   0          43s
pod/whalesay1-rhvz7   0/1     ContainerCannotRun   0          1m
pod/whalesay1-rzrtg   0/1     ContainerCannotRun   0          1m
pod/whalesay1-xp5tr   0/1     ContainerCannotRun   0          1m
pod/whalesay1-zn2s6   0/1     ContainerCannotRun   0          1m
pod/whalesay1-zq4g9   0/1     ContainerCannotRun   0          14s
pod/whalesay1-ztxdv   0/1     ContainerCannotRun   0          19s

NAME                  DESIRED   SUCCESSFUL   AGE
job.batch/whalesay1   1         0            2m

Please guide

@CallMeFoxie (Contributor) commented Feb 4, 2019

Just update to something newer than 1.10.0. That version is ages old and I'm sure it is fixed within the latest 1.10.x release.
(Which is also out of support so update to 1.12/1.13 ASAP.)

@shanit-saha commented Feb 6, 2019

In v1.10.11, although the backoffLimit for a Job is set to 4, the pod is evicted 5 times in some runs and only 3 times in others. Note that we have also applied completions: 1 and parallelism: 1.
Below are the version details of the Kubernetes cluster we are using:

[xuser@tyl20ks9px3as41 ~]$ kubectl version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.11", GitCommit:"637c7e288581ee40ab4ca210618a89a555b6e7e9", GitTreeState:"clean", BuildDate:"2018-11-26T14:38:32Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.12", GitCommit:"c757b93cf034d49af3a3b8ecee3b9639a7a11df7", GitTreeState:"clean", BuildDate:"2018-12-19T11:04:29Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Even though we set the Job's backoffLimit to 4, the pod is still evicted more than 4 times, i.e. 5 times in one instance, and only 3 times in another. The eviction itself happens for a genuine reason: the node's disk space is exhausted while the application inside the pod runs, and we know why that happens. The point is that the Job's pod should be attempted 4 times, not 5.
Strangely, we do not see this problem in another environment running v1.10.5; the version details of that environment are below:

Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.5", GitCommit:"32ac1c9073b132b8ba18aa830f46b77dcceb0723", GitTreeState:"clean", BuildDate:"2018-06-21T11:46:00Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.5", GitCommit:"32ac1c9073b132b8ba18aa830f46b77dcceb0723", GitTreeState:"clean", BuildDate:"2018-06-21T11:34:22Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

There the pod eviction happens exactly as the backoffLimit documentation describes; we have no issues in that environment. The reported problem appeared only after we upgraded to v1.10.11 recently (about two days ago).

I would appreciate it if someone on this forum could advise on the root cause and a solution. Is there any known flaw in v1.10.5, such as a security loophole?
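
For context, a sketch of a Job spec carrying the settings described above (backoffLimit: 4, completions: 1, parallelism: 1); the name, image and everything else are placeholders, not the reporter's actual manifest:

apiVersion: batch/v1
kind: Job
metadata:
  name: disk-heavy-job       # hypothetical name
spec:
  backoffLimit: 4
  completions: 1
  parallelism: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: example/worker   # hypothetical image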

@jstric01 commented Feb 6, 2019

1.10.11 is required to avoid a critical security bug.
https://www.zdnet.com/article/kubernetes-first-major-security-hole-discovered/
I imagine we will need to update to fix this, but we will need to find out which of the patched versions has the fix we need. We can then roll this through the environments as a staged update.

@BenTheElder (Member) commented Feb 6, 2019

As noted above, 1.11+ is supported by the community until 1.14 is out:
https://kubernetes.io/docs/setup/version-skew-policy/#supported-versions
