
Backoff Limit for Job does not work on Kubernetes 1.10.0 #62382

Closed
keimoon opened this Issue Apr 11, 2018 · 50 comments

@keimoon commented Apr 11, 2018

Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug

What happened:

.spec.backoffLimit for a Job does not work on Kubernetes 1.10.0

What you expected to happen:

.spec.backoffLimit should limit the number of times a pod is retried when it fails inside a Job

How to reproduce it (as minimally and precisely as possible):

Use this resource file:

apiVersion: batch/v1
kind: Job
metadata:
  name: error
spec:
  backoffLimit: 1
  template:
    metadata:
      name: job
    spec:
      restartPolicy: Never
      containers:
        - name: job
          image: ubuntu:16.04
          args:
            - sh
            - -c
            - sleep 5; false

If the Job is created on Kubernetes 1.9, it soon fails:

....
status:
    conditions:
    - lastProbeTime: 2018-04-11T10:20:31Z
      lastTransitionTime: 2018-04-11T10:20:31Z
      message: Job has reach the specified backoff limit
      reason: BackoffLimitExceeded
      status: "True"
      type: Failed
    failed: 2
    startTime: 2018-04-11T10:20:00Z
...

When the Job is created on Kubernetes 1.10, the pod is recreated indefinitely:

...
status:
  active: 1
  failed: 8
  startTime: 2018-04-11T10:37:48Z
...

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.10.0
  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Ubuntu 16.04
  • Kernel (e.g. uname -a): Linux
  • Install tools: Kubeadm
  • Others:
@keimoon (Author) commented Apr 11, 2018

/sig scheduling

@keimoon (Author) commented Apr 17, 2018

/sig job
/sig lifecycle
/sig apps

@wgliang (Member) commented Apr 18, 2018

/sig testing

@jangrewe commented Apr 25, 2018

We just updated to 1.10 and seem to be affected by this, too.

@xray33 commented Apr 25, 2018

I had to add activeDeadlineSeconds to all Jobs and Hooks as a workaround...
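
For reference, a minimal sketch of that workaround applied to the reproduction Job from the report above; the 120-second deadline is an arbitrary illustrative value, not a recommendation:

apiVersion: batch/v1
kind: Job
metadata:
  name: error
spec:
  backoffLimit: 1
  # Workaround sketch: cap the Job's total runtime so the controller
  # terminates it even while backoffLimit is being ignored.
  # The value below is illustrative only.
  activeDeadlineSeconds: 120
  template:
    metadata:
      name: job
    spec:
      restartPolicy: Never
      containers:
        - name: job
          image: ubuntu:16.04
          args:
            - sh
            - -c
            - sleep 5; false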

@CallMeFoxie (Contributor) commented Apr 26, 2018

Can confirm with 1.10.1

@ceshihao (Contributor) commented Apr 27, 2018

It seems to have been introduced by #60985.

cc @soltysh

@foxish (Member) commented May 4, 2018

The cause appears to be that the syncJob method gets exactly one attempt to update the status with the failed condition; if that update fails due to a resource conflict, the condition is never added. The controller uses requeues to track failures, but once a sync loop executes successfully it loses track of the failed item ever having been requeued. Looking into a fix now.

@foxish (Member) commented May 4, 2018

Looking at job_controller.go#L249: when we see a pod that has failed, we force an immediate resync, which also clears the key in our queue, and that in turn makes us lose the requeue count we rely on to track the number of failures.

@drdivano commented May 7, 2018

This bug killed my test cluster over the long weekend (with 20000+ pods). Luckily we don't yet use CronJobs in prod.

@nerumo commented May 7, 2018

Killed my cluster too :( Still present in Kubernetes 1.10.2.

@hanikesn commented May 7, 2018

This issue also can't be mitigated with quotas, because terminated pods are not counted against the resource quota: #51150

@nyxi commented May 9, 2018

This is a pretty terrible bug and it definitely still exists in v1.10.2

@dims (Member) commented May 9, 2018

/priority important-soon

@dims (Member) commented May 9, 2018

@soltysh can you please take a look?

@soltysh (Contributor) commented May 10, 2018

Looking right now...

@soltysh (Contributor) commented May 10, 2018

I've opened #63650 to address the issue. Sorry for the troubles y'all.

@CallMeFoxie (Contributor) commented May 10, 2018

@soltysh I pulled the patch into the current 1.10.2 release and it works! Thanks :)

@XericZephyr commented Jun 18, 2018

Having the same issue. Waiting for update.

@cblecker (Member) commented Jun 18, 2018

This fix has been cherry picked back to 1.10, and should show up in the next patch release (1.10.5).

@jangrewe commented Jun 19, 2018

Any ETA for 1.10.5 yet? Even a ballpark date would be fine.

@MaciekPytel (Contributor) commented Jun 19, 2018

I'm planning to cut 1.10.5 tomorrow, assuming all tests will be green. In case of any last minute problems it may slip by a day or two, but so far everything looks good for tomorrow.

@nickschuch commented Jun 19, 2018

Thank you to all involved! I really appreciate this being rolled back into 1.10.

@lbguilherme commented Jun 20, 2018

Given that the fix is already on 1.11 and on 1.10 branches, this issue should be closed.

Thanks for fixing this!

@soltysh (Contributor) commented Jun 21, 2018

/close

Workloads automation moved this from Backlog to Done Jun 21, 2018

arithehun added a commit to freenome/k8s-jobs that referenced this issue Jul 3, 2018

Allow users of kbatch to specify a backoff limit
Note: retry limit > 0 is broken for kubernetes < 1.10.5:
kubernetes/kubernetes#62382. So currently you
either set it to 0 for no retries or > 0 for infinite retries

arithehun added a commit to freenome/k8s-jobs that referenced this issue Jul 3, 2018

Allow users of kbatch to specify a backoff limit (#12)
* Allow users of kbatch to specify a backoff limit

Note: retry limit > 0 is broken for kubernetes < 1.10.5:
kubernetes/kubernetes#62382. So currently you
either set it to 0 for no retries or > 0 for infinite retries

* add version

* fix lint

* cast to int

@BEllis referenced this issue Oct 2, 2018: backoffLimit bug in Kubernetes #3250 (closed)
@walterdolce commented Nov 14, 2018

@lbguilherme @soltysh I don't believe this is fixed?

Unless I'm doing something wrong..

NAME                                             SCHEDULE    SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob.batch/sqlite-database-backup-generator   * * * * *   False     1        2m              17m
cronjob.batch/app-invoices-backup-generator      * * * * *   False     1        1m              17m

NAME                                                     DESIRED   SUCCESSFUL   AGE
job.batch/sqlite-database-backup-generator-1542233280    1         0            14m
job.batch/sqlite-database-backup-generator-1542233460    1         0            11m
job.batch/sqlite-database-backup-generator-1542233580    1         0            9m
job.batch/sqlite-database-backup-generator-1542233760    1         0            7m
job.batch/sqlite-database-backup-generator-1542233880    1         0            4m
job.batch/sqlite-database-backup-generator-1542234060    1         0            1m
job.batch/app-invoices-backup-generator-1542233280       1         0            14m
job.batch/app-invoices-backup-generator-1542233460       1         0            11m
job.batch/app-invoices-backup-generator-1542233640       1         0            9m
job.batch/app-invoices-backup-generator-1542233760       1         0            6m
job.batch/app-invoices-backup-generator-1542233940       1         0            3m
job.batch/app-invoices-backup-generator-1542234120       1         0            1m

NAME                                                                 READY   STATUS    RESTARTS   AGE
pod/nfs-server-75ccf5786f-p4bjh                                      1/1     Running   0          23h
pod/sqlite-database-backup-generator-1542233280-4n8tr                0/1     Error     0          13m
pod/sqlite-database-backup-generator-1542233280-mszhp                0/1     Error     0          14m
pod/sqlite-database-backup-generator-1542233280-p24s7                0/1     Error     0          14m
pod/sqlite-database-backup-generator-1542233280-qtkfl                0/1     Error     0          13m
pod/sqlite-database-backup-generator-1542233280-tg4ht                0/1     Error     0          14m
pod/sqlite-database-backup-generator-1542233460-2xrqj                0/1     Error     0          11m
pod/sqlite-database-backup-generator-1542233460-7m2dp                0/1     Error     0          11m
pod/sqlite-database-backup-generator-1542233460-s7hrf                0/1     Error     0          11m
pod/sqlite-database-backup-generator-1542233460-vdnsm                0/1     Error     0          11m
pod/sqlite-database-backup-generator-1542233580-4s7dj                0/1     Error     0          8m
pod/sqlite-database-backup-generator-1542233580-j7mm6                0/1     Error     0          9m
pod/sqlite-database-backup-generator-1542233580-kqfhf                0/1     Error     0          9m
pod/sqlite-database-backup-generator-1542233580-n2bvs                0/1     Error     0          9m
pod/sqlite-database-backup-generator-1542233580-t4l94                0/1     Error     0          9m
pod/sqlite-database-backup-generator-1542233760-6bdz2                0/1     Error     0          6m
pod/sqlite-database-backup-generator-1542233760-6bntx                0/1     Error     0          7m
pod/sqlite-database-backup-generator-1542233760-cbrcx                0/1     Error     0          5m
pod/sqlite-database-backup-generator-1542233760-kllnp                0/1     Error     0          7m
pod/sqlite-database-backup-generator-1542233760-nk2t5                0/1     Error     0          6m
pod/sqlite-database-backup-generator-1542233880-gxs24                0/1     Error     0          4m
pod/sqlite-database-backup-generator-1542233880-mpfvl                0/1     Error     0          4m
pod/sqlite-database-backup-generator-1542233880-vchgm                0/1     Error     0          3m
pod/sqlite-database-backup-generator-1542233880-w4g4c                0/1     Error     0          3m

Both resource specs have a backoffLimit: 4 and restartPolicy: Never.

This is on GKE, 1.11.2-gke.18.
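
Worth double-checking in the CronJob case: backoffLimit is a JobSpec field, so it has to sit under jobTemplate.spec rather than on the CronJob spec itself. A hedged sketch of how the setup described above would typically be written; the name and schedule are taken from the output above, and the container details are placeholders:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: sqlite-database-backup-generator
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      backoffLimit: 4          # must live under jobTemplate.spec
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: backup           # placeholder name
              image: example/backup  # hypothetical image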

@zoobab commented Dec 18, 2018

I have the same bug on 1.9, with an OpenShift cluster:

$ oc version
oc v3.9.0+ba7faec-1
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://myserver
openshift v3.9.0+ba7faec-1
kubernetes v1.9.1+a0ce1bc657

Any plans to backport it to 1.9 as well?

@ixdy (Member) commented Dec 18, 2018

@zoobab you are likely encountering a different issue, as the root cause of this bug was introduced in Kubernetes 1.10, not 1.9.

Also, kubernetes 1.9 is now outside of the support window, so no fixes will be backported to the upstream branch.

@JacoBlunt commented Dec 25, 2018

I also see this bug in Kubernetes 1.12.1:

Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.1", GitCommit:"4ed3216f3ec431b140b1d899130a69fc671678f4", GitTreeState:"clean", BuildDate:"2018-10-05T16:46:06Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.1", GitCommit:"4ed3216f3ec431b140b1d899130a69fc671678f4", GitTreeState:"clean", BuildDate:"2018-10-05T16:36:14Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}

When a pod fails: with parallelism: 2 and backoffLimit: 0, the controller destroys the last failed pod and leaves only one pod behind; with parallelism: 2 and backoffLimit: 1, it sometimes keeps 3 pods and sometimes kills the last failed pod, leaving 2. A minimal spec of this setup is sketched below.
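A sketch of that setup for anyone who wants to reproduce it; the Job name and the deliberately failing command are placeholders, not the reporter's actual manifest:

apiVersion: batch/v1
kind: Job
metadata:
  name: parallel-fail        # hypothetical name
spec:
  parallelism: 2
  backoffLimit: 0            # also try 1 to compare the two behaviours described above
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: fail
          image: busybox
          command: ["sh", "-c", "exit 1"]   # placeholder command that always fails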

@aamirpinger commented Feb 4, 2019

Is backoffLimit STILL not taken into account to stop the creation of pods in case of failure?


kubectl version:
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.3", GitCommit:"435f92c719f279a3a67808c80521ea17d5715c66", GitTreeState:"clean", BuildDate:"2018-11-26T12:57:14Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-26T16:44:10Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

minikube version: v0.30.0

Job YAML file

apiVersion: batch/v1
kind: Job
metadata:
  name: whalesay1
spec:
  template:
    spec:
      containers:
        - name: whalesay
          image: docker/whalesay
          command: ["cowsa", "This is a Kubernetes Job!"]
      restartPolicy: Never
  backoffLimit: 2

Listing pods and job resource

NAME                  READY   STATUS               RESTARTS   AGE
pod/whalesay1-6lt4b   0/1     ContainerCannotRun   0          59s
pod/whalesay1-86tlz   0/1     ContainerCannotRun   0          48s
pod/whalesay1-8v8nr   0/1     ContainerCannotRun   0          1m
pod/whalesay1-8w9k2   0/1     ContainerCannotRun   0          1m
pod/whalesay1-8xxtr   0/1     ContainerCannotRun   0          53s
pod/whalesay1-dl4l8   0/1     ContainerCannotRun   0          1m
pod/whalesay1-fhlhg   0/1     ContainerCannotRun   0          37s
pod/whalesay1-fz8mv   0/1     ContainerCannotRun   0          26s
pod/whalesay1-g2mt6   0/1     ContainerCannotRun   0          9s
pod/whalesay1-hgj95   0/1     ContainerCannotRun   0          1m
pod/whalesay1-hrsrd   0/1     ContainerCannotRun   0          1m
pod/whalesay1-hzm42   0/1     ContainerCannotRun   0          1m
pod/whalesay1-mvctw   0/1     ContainerCannotRun   0          1m
pod/whalesay1-n7zq9   0/1     ContainerCannotRun   0          2m
pod/whalesay1-nl89l   0/1     ContainerCannotRun   0          31s
pod/whalesay1-nqcmg   0/1     ContainerCreating    0          2s
pod/whalesay1-p85rm   0/1     ContainerCannotRun   0          2m
pod/whalesay1-pjzmd   0/1     ContainerCannotRun   0          2m
pod/whalesay1-q2k7p   0/1     ContainerCannotRun   0          43s
pod/whalesay1-rhvz7   0/1     ContainerCannotRun   0          1m
pod/whalesay1-rzrtg   0/1     ContainerCannotRun   0          1m
pod/whalesay1-xp5tr   0/1     ContainerCannotRun   0          1m
pod/whalesay1-zn2s6   0/1     ContainerCannotRun   0          1m
pod/whalesay1-zq4g9   0/1     ContainerCannotRun   0          14s
pod/whalesay1-ztxdv   0/1     ContainerCannotRun   0          19s

NAME                  DESIRED   SUCCESSFUL   AGE
job.batch/whalesay1   1         0            2m

Please guide

@CallMeFoxie (Contributor) commented Feb 4, 2019

Just update to something newer than 1.10.0. That version is ages old and I'm sure it is fixed within the latest 1.10.x release.
(Which is also out of support so update to 1.12/1.13 ASAP.)

@shanit-saha commented Feb 6, 2019

In v1.10.11, although the backoffLimit for a Job is set to 4, the pod is evicted 5 times in some runs and only 3 times in others. Note that we have also applied completions: 1 and parallelism: 1.
Below are the version details of the Kubernetes cluster we are using:

[xuser@tyl20ks9px3as41 ~]$ kubectl version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.11", GitCommit:"637c7e288581ee40ab4ca210618a89a555b6e7e9", GitTreeState:"clean", BuildDate:"2018-11-26T14:38:32Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.12", GitCommit:"c757b93cf034d49af3a3b8ecee3b9639a7a11df7", GitTreeState:"clean", BuildDate:"2018-12-19T11:04:29Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Even though we set the Job's backoffLimit to 4, the pod is still evicted more than 4 times, i.e. 5 times in one instance, and only 3 times in another. The eviction itself happens for a genuine reason: the node's disk space is exhausted while the application inside the pod runs, and we know why that happens. The point is that the Job's pod should be attempted 4 times, not 5.
Strangely, we do not see this problem in another environment running v1.10.5; the version details of that environment are below:

Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.5", GitCommit:"32ac1c9073b132b8ba18aa830f46b77dcceb0723", GitTreeState:"clean", BuildDate:"2018-06-21T11:46:00Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.5", GitCommit:"32ac1c9073b132b8ba18aa830f46b77dcceb0723", GitTreeState:"clean", BuildDate:"2018-06-21T11:34:22Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

There the pod eviction happens exactly as the backoffLimit documentation describes; we have no issues in that environment. The reported problem appeared only after we upgraded to v1.10.11 recently (about two days ago).

I would appreciate it if someone on this forum could advise on the root cause and a solution. Is there any known flaw in v1.10.5, such as a security loophole?
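
For context, a sketch of a Job spec carrying the settings described above (backoffLimit: 4, completions: 1, parallelism: 1); the name, image and everything else are placeholders, not the reporter's actual manifest:

apiVersion: batch/v1
kind: Job
metadata:
  name: disk-heavy-job       # hypothetical name
spec:
  backoffLimit: 4
  completions: 1
  parallelism: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: example/worker   # hypothetical image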

@jstric01 commented Feb 6, 2019

1.10.11 is required to avoid a critical security bug.
https://www.zdnet.com/article/kubernetes-first-major-security-hole-discovered/
I imagine we will need to update to fix this, but we will need to find out which of the patched versions has the fix we need. We can then roll this through the environments as a staged update.

@BenTheElder (Member) commented Feb 6, 2019

As noted above, 1.11+ is supported by the community until 1.14 is out:
https://kubernetes.io/docs/setup/version-skew-policy/#supported-versions
