
Backoff Limit for Job does not work on Kubernetes 1.10.0 #62382

Closed
keimoon opened this issue Apr 11, 2018 · 54 comments · Fixed by #63650
Labels: kind/bug, priority/critical-urgent, sig/apps

Comments

@keimoon

keimoon commented Apr 11, 2018

Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug

What happened:

.spec.backoffLimit for a Job does not work on Kubernetes 1.10.0

What you expected to happen:

.spec.backoffLimit should limit the number of times a pod is retried when running inside a Job

How to reproduce it (as minimally and precisely as possible):

Use this resource file:

apiVersion: batch/v1
kind: Job
metadata:
  name: error
spec:
  backoffLimit: 1
  template:
    metadata:
      name: job
    spec:
      restartPolicy: Never
      containers:
        - name: job
          image: ubuntu:16.04
          args:
            - sh
            - -c
            - sleep 5; false

If the job is created on Kubernetes 1.9, it soon fails as expected:

....
status:
    conditions:
    - lastProbeTime: 2018-04-11T10:20:31Z
      lastTransitionTime: 2018-04-11T10:20:31Z
      message: Job has reach the specified backoff limit
      reason: BackoffLimitExceeded
      status: "True"
      type: Failed
    failed: 2
    startTime: 2018-04-11T10:20:00Z
...

When the job is created on Kubernetes 1.10, the pod is restarted indefinitely:

...
status:
  active: 1
  failed: 8
  startTime: 2018-04-11T10:37:48Z
...

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.10.0
  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Ubuntu 16.04
  • Kernel (e.g. uname -a): Linux
  • Install tools: Kubeadm
  • Others:
@k8s-ci-robot added the needs-sig and kind/bug labels Apr 11, 2018
@keimoon
Author

keimoon commented Apr 11, 2018

/sig scheduling

@k8s-ci-robot added the sig/scheduling label and removed the needs-sig label Apr 11, 2018
@keimoon
Author

keimoon commented Apr 17, 2018

/sig job
/sig lifecycle
/sig apps

@k8s-ci-robot added the sig/apps label Apr 17, 2018
@wgliang
Contributor

wgliang commented Apr 18, 2018

/sig testing

@k8s-ci-robot added the sig/testing label Apr 18, 2018
@jangrewe

We just updated to 1.10 and seem to be affected by this, too.

@xray33

xray33 commented Apr 25, 2018

I had to add activeDeadlineSeconds to all Jobs and Hooks as a workaround...
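
For anyone looking for the same stopgap, here is a minimal sketch of that workaround applied to the reproduction manifest earlier in this thread; the 300-second deadline is an arbitrary illustrative value and should be sized comfortably above the job's normal runtime:

apiVersion: batch/v1
kind: Job
metadata:
  name: error
spec:
  backoffLimit: 1
  # Workaround: cap the Job's total runtime so it is marked Failed
  # even if the backoff limit is not honored.
  activeDeadlineSeconds: 300
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: job
          image: ubuntu:16.04
          args:
            - sh
            - -c
            - sleep 5; false

activeDeadlineSeconds on the Job spec applies to the Job as a whole: once the deadline passes, the Job gets a Failed condition with reason DeadlineExceeded and its running pods are terminated.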

@CallMeFoxie
Contributor

Can confirm with 1.10.1

@ceshihao
Contributor

It seems to have been introduced by #60985

cc @soltysh

@foxish
Contributor

foxish commented May 4, 2018

The cause appears to be that the syncJob method gets exactly one attempt to update the status with the failed condition; if that update fails due to a resource conflict, the condition is never added. The controller code seems to use requeues to track failures, but if a sync loop then executes successfully, it loses track of the failed item ever having been requeued. Looking into a fix now.

@foxish
Contributor

foxish commented May 4, 2018

Looking at job_controller.go#L249, we force an immediate resync when we see a pod that has failed, which also clears the key in our queue; that in turn makes us lose the count of requeues.

@drdivano

drdivano commented May 7, 2018

This bug killed my test cluster over the long weekend (with 20000+ pods). Luckily we don't yet use CronJobs in prod.

@nerumo

nerumo commented May 7, 2018

Killed my cluster too :( Still present in Kubernetes 1.10.2.

@hanikesn

hanikesn commented May 7, 2018

This issue also can't be mitigated with quotas, as terminated pods are not counted against the resource quota: #51150

@nyxi

nyxi commented May 9, 2018

This is a pretty terrible bug and it definitely still exists in v1.10.2

@dims
Member

dims commented May 9, 2018

/priority important-soon

@k8s-ci-robot added the priority/important-soon label May 9, 2018
@dims
Member

dims commented May 9, 2018

@soltysh can you please take a look?

@soltysh
Contributor

soltysh commented May 10, 2018

Looking right now...

@soltysh
Contributor

soltysh commented May 10, 2018

I've opened #63650 to address the issue. Sorry for the troubles y'all.

@CallMeFoxie
Contributor

@soltysh I pulled the patch into the current 1.10.2 release and it works! Thanks :)

@jangrewe

Any ETA for 1.10.5 yet? Even a ballpark date would be fine.

@MaciekPytel
Contributor

I'm planning to cut 1.10.5 tomorrow, assuming all tests are green. In case of any last-minute problems it may slip by a day or two, but so far everything looks good for tomorrow.

@nickschuch

Thank you to all involved! I really appreciate this being backported to 1.10.

@lbguilherme

Given that the fix is already on the 1.11 and 1.10 branches, this issue should be closed.

Thanks for fixing this!

@soltysh
Contributor

soltysh commented Jun 21, 2018

/close

@walterdolce

@lbguilherme @soltysh I don't believe this is fixed?

Unless I'm doing something wrong..

NAME                                                          SCHEDULE    SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob.batch/sqlite-database-backup-generator                * * * * *   False     1        2m              17m
cronjob.batch/app-invoices-backup-generator   * * * * *   False     1        1m              17m

NAME                                                                 DESIRED   SUCCESSFUL   AGE
job.batch/sqlite-database-backup-generator-1542233280                1         0            14m
job.batch/sqlite-database-backup-generator-1542233460                1         0            11m
job.batch/sqlite-database-backup-generator-1542233580                1         0            9m
job.batch/sqlite-database-backup-generator-1542233760                1         0            7m
job.batch/sqlite-database-backup-generator-1542233880                1         0            4m
job.batch/sqlite-database-backup-generator-1542234060                1         0            1m
job.batch/app-invoices-backup-generator-1542233280   1         0            14m
job.batch/app-invoices-backup-generator-1542233460   1         0            11m
job.batch/app-invoices-backup-generator-1542233640   1         0            9m
job.batch/app-invoices-backup-generator-1542233760   1         0            6m
job.batch/app-invoices-backup-generator-1542233940   1         0            3m
job.batch/app-invoices-backup-generator-1542234120   1         0            1m

NAME                                                                 READY   STATUS    RESTARTS   AGE
pod/nfs-server-75ccf5786f-p4bjh                                      1/1     Running   0          23h
pod/sqlite-database-backup-generator-1542233280-4n8tr                0/1     Error     0          13m
pod/sqlite-database-backup-generator-1542233280-mszhp                0/1     Error     0          14m
pod/sqlite-database-backup-generator-1542233280-p24s7                0/1     Error     0          14m
pod/sqlite-database-backup-generator-1542233280-qtkfl                0/1     Error     0          13m
pod/sqlite-database-backup-generator-1542233280-tg4ht                0/1     Error     0          14m
pod/sqlite-database-backup-generator-1542233460-2xrqj                0/1     Error     0          11m
pod/sqlite-database-backup-generator-1542233460-7m2dp                0/1     Error     0          11m
pod/sqlite-database-backup-generator-1542233460-s7hrf                0/1     Error     0          11m
pod/sqlite-database-backup-generator-1542233460-vdnsm                0/1     Error     0          11m
pod/sqlite-database-backup-generator-1542233580-4s7dj                0/1     Error     0          8m
pod/sqlite-database-backup-generator-1542233580-j7mm6                0/1     Error     0          9m
pod/sqlite-database-backup-generator-1542233580-kqfhf                0/1     Error     0          9m
pod/sqlite-database-backup-generator-1542233580-n2bvs                0/1     Error     0          9m
pod/sqlite-database-backup-generator-1542233580-t4l94                0/1     Error     0          9m
pod/sqlite-database-backup-generator-1542233760-6bdz2                0/1     Error     0          6m
pod/sqlite-database-backup-generator-1542233760-6bntx                0/1     Error     0          7m
pod/sqlite-database-backup-generator-1542233760-cbrcx                0/1     Error     0          5m
pod/sqlite-database-backup-generator-1542233760-kllnp                0/1     Error     0          7m
pod/sqlite-database-backup-generator-1542233760-nk2t5                0/1     Error     0          6m
pod/sqlite-database-backup-generator-1542233880-gxs24                0/1     Error     0          4m
pod/sqlite-database-backup-generator-1542233880-mpfvl                0/1     Error     0          4m
pod/sqlite-database-backup-generator-1542233880-vchgm                0/1     Error     0          3m
pod/sqlite-database-backup-generator-1542233880-w4g4c                0/1     Error     0          3m

Both resource specs have a backoffLimit: 4 and restartPolicy: Never.

This is in GKE, 1.11.2-gke.18
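
For reference, when the Job is created by a CronJob, backoffLimit has to sit on the Job template's spec (spec.jobTemplate.spec), not on the CronJob's own spec. A minimal sketch using the batch/v1beta1 API served by clusters of that era; the container name, image, and command are hypothetical:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: sqlite-database-backup-generator
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      backoffLimit: 4              # Job-level retry limit goes here
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: backup                    # hypothetical
              image: example/backup:latest    # hypothetical
              command: ["sh", "-c", "run-backup"]

If backoffLimit ends up anywhere else, each generated Job falls back to the default of 6 retries.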

@zoobab

zoobab commented Dec 18, 2018

I have the same bug on 1.9, with an openshift cluster:

$ oc version
oc v3.9.0+ba7faec-1
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://myserver
openshift v3.9.0+ba7faec-1
kubernetes v1.9.1+a0ce1bc657

Any plans to backport it to 1.9 as well?

@ixdy
Member

ixdy commented Dec 18, 2018

@zoobab you are likely encountering a different issue, as the root cause of this bug was introduced in Kubernetes 1.10, not 1.9.

Also, kubernetes 1.9 is now outside of the support window, so no fixes will be backported to the upstream branch.

@JacoBlunt

JacoBlunt commented Dec 25, 2018

I also see this bug in Kubernetes 1.12.1:
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.1", GitCommit:"4ed3216f3ec431b140b1d899130a69fc671678f4", GitTreeState:"clean", BuildDate:"2018-10-05T16:46:06Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.1", GitCommit:"4ed3216f3ec431b140b1d899130a69fc671678f4", GitTreeState:"clean", BuildDate:"2018-10-05T16:36:14Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
When pods fail: with parallelism: 2 and backoffLimit: 0, the controller destroys the last failed pod and leaves only one pod behind; with parallelism: 2 and backoffLimit: 1, it sometimes keeps 3 pods and sometimes kills the last failed pod, leaving 2.

@aamirpinger

backoffLimit is STILL not taken into account to stop the creation of new pods when they fail?


kubectl version:
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.3", GitCommit:"435f92c719f279a3a67808c80521ea17d5715c66", GitTreeState:"clean", BuildDate:"2018-11-26T12:57:14Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-26T16:44:10Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

minikube version: v0.30.0

Job YAML file

apiVersion: batch/v1
kind: Job
metadata:
  name: whalesay1
spec:
  template:
    spec:
      containers:
        - name: whalesay
          image: docker/whalesay
          command: ["cowsa", "This is a Kubernetes Job!"]
      restartPolicy: Never
  backoffLimit: 2

Listing pods and job resource

NAME READY STATUS RESTARTS AGE
pod/whalesay1-6lt4b 0/1 ContainerCannotRun 0 59s
pod/whalesay1-86tlz 0/1 ContainerCannotRun 0 48s
pod/whalesay1-8v8nr 0/1 ContainerCannotRun 0 1m
pod/whalesay1-8w9k2 0/1 ContainerCannotRun 0 1m
pod/whalesay1-8xxtr 0/1 ContainerCannotRun 0 53s
pod/whalesay1-dl4l8 0/1 ContainerCannotRun 0 1m
pod/whalesay1-fhlhg 0/1 ContainerCannotRun 0 37s
pod/whalesay1-fz8mv 0/1 ContainerCannotRun 0 26s
pod/whalesay1-g2mt6 0/1 ContainerCannotRun 0 9s
pod/whalesay1-hgj95 0/1 ContainerCannotRun 0 1m
pod/whalesay1-hrsrd 0/1 ContainerCannotRun 0 1m
pod/whalesay1-hzm42 0/1 ContainerCannotRun 0 1m
pod/whalesay1-mvctw 0/1 ContainerCannotRun 0 1m
pod/whalesay1-n7zq9 0/1 ContainerCannotRun 0 2m
pod/whalesay1-nl89l 0/1 ContainerCannotRun 0 31s
pod/whalesay1-nqcmg 0/1 ContainerCreating 0 2s
pod/whalesay1-p85rm 0/1 ContainerCannotRun 0 2m
pod/whalesay1-pjzmd 0/1 ContainerCannotRun 0 2m
pod/whalesay1-q2k7p 0/1 ContainerCannotRun 0 43s
pod/whalesay1-rhvz7 0/1 ContainerCannotRun 0 1m
pod/whalesay1-rzrtg 0/1 ContainerCannotRun 0 1m
pod/whalesay1-xp5tr 0/1 ContainerCannotRun 0 1m
pod/whalesay1-zn2s6 0/1 ContainerCannotRun 0 1m
pod/whalesay1-zq4g9 0/1 ContainerCannotRun 0 14s
pod/whalesay1-ztxdv 0/1 ContainerCannotRun 0 19s

NAME DESIRED SUCCESSFUL AGE
job.batch/whalesay1 1 0 2m

Please guide

@CallMeFoxie
Contributor

Just update to something newer than 1.10.0. That version is ages old and I'm sure it is fixed within the latest 1.10.x release.
(Which is also out of support so update to 1.12/1.13 ASAP.)

@shanit-saha

shanit-saha commented Feb 6, 2019

In version "v1.10.11" the backoffLimit though set to 4 for a Job's pod, the pod eviction happens for 5 times and at few times it was for 3 times. Also to note that we also have applied completions: 1 and parallelism: 1
Below is the version details of Kubernetes that we are using

[xuser@tyl20ks9px3as41 ~]$ kubectl version
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.11", GitCommit:"637c7e288581ee40ab4ca210618a89a555b6e7e9", GitTreeState:"clean", BuildDate:"2018-11-26T14:38:32Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.12", GitCommit:"c757b93cf034d49af3a3b8ecee3b9639a7a11df7", GitTreeState:"clean", BuildDate:"2018-12-19T11:04:29Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Though we have set the Job's backoffLimit to 4, the pod is still evicted more than 4 times, i.e. 5 times in one instance (and in another instance only 3 times). The eviction happens for a genuine reason, namely that the node's disk space is exhausted when the application inside the pod runs, and we know why that is happening. The point is that the Job's pod should be attempted 4 times, not 5.
Please note that, strangely, we don't see this error in another environment running v1.10.5. The version details of that environment are below:

Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.5", GitCommit:"32ac1c9073b132b8ba18aa830f46b77dcceb0723", GitTreeState:"clean", BuildDate:"2018-06-21T11:46:00Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.5", GitCommit:"32ac1c9073b132b8ba18aa830f46b77dcceb0723", GitTreeState:"clean", BuildDate:"2018-06-21T11:34:22Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

There, pod eviction happens as documented for eviction with backoffLimit, and we have no issues. The reported problem appeared only after we upgraded to v1.10.11 recently (about two days back).

We would appreciate it if someone in this forum could advise on the root cause and a solution.
Does version v1.10.5 have any flaw, such as a security loophole?
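
For reference, the combination described above sits together on the Job spec. A minimal sketch; the name, image, command, and restartPolicy are assumptions, and only backoffLimit, completions, and parallelism matter for the behavior discussed:

apiVersion: batch/v1
kind: Job
metadata:
  name: example-job                # hypothetical
spec:
  backoffLimit: 4                  # retries allowed before the Job is marked Failed
  completions: 1
  parallelism: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker                   # hypothetical
          image: example/worker:latest   # hypothetical
          command: ["sh", "-c", "do-work"]

Assuming restartPolicy: Never, each retry is a new pod, so once the limit is reached the Job should receive the BackoffLimitExceeded condition instead of creating pods indefinitely.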

@jstric01

jstric01 commented Feb 6, 2019

1.10.11 is required to avoid a critical security bug.
https://www.zdnet.com/article/kubernetes-first-major-security-hole-discovered/
I imagine we will need to update to fix this, but we will need to find out which of the patched versions has the fix we need. We can then roll this through the environments as a staged update.

@BenTheElder
Member

As noted above, 1.11+ are supported by the community until 1.14 is out:
https://kubernetes.io/docs/setup/version-skew-policy/#supported-versions

@stojan-jovic

stojan-jovic commented May 20, 2019

@shanit-saha I experienced the same behavior with v1.10.7. Can somebody confirm that this is not happening with >= 1.11.x versions?

@zakthan

zakthan commented Jul 30, 2019

We are experiencing the same issue with version 1.11.9-docker-1.
In case it helps, we have 2 environments: one is Ubuntu 18.04.2 LTS and the other is CentOS Linux release 7.5.1804.
It happens only in the Ubuntu environment.

@stojan-jovic

In the meantime we updated the cluster:
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.7", GitCommit:"65ecaf0671341311ce6aea0edab46ee69f65d59e", GitTreeState:"clean", BuildDate: "2019-01-24T19:32:00Z", GoVersion:"go1.10.7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.8", GitCommit:"a89f8c11a5f4f132503edbc4918c98518fd504e3", GitTreeState:"clean", BuildDate: "2019-04-23T04:41:47Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

We are still experiencing the issue.

@cblecker
Member

Hi everyone! 👋

This issue is for a specific problem in Kubernetes 1.10. Right now, the upstream Kubernetes project supports the following three versions: 1.13, 1.14, 1.15. If you are experiencing a similar issue under any of those three versions, then please open a new issue and provide all the details requested.

As the specific bug in this issue is resolved and on an unsupported version, I'm going to lock this closed issue. Thanks!

@kubernetes locked this issue as resolved and limited conversation to collaborators Jul 30, 2019