
Job not failing after backoffLimit is reached #96630

Closed
yacinelazaar opened this issue Nov 17, 2020 · 11 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps.

Comments

@yacinelazaar

yacinelazaar commented Nov 17, 2020

What happened:
I have the following Job running, which randomly fails or succeeds, with completions set to 4 and parallelism set to 1. Notice that the backoffLimit here is 2:

apiVersion: batch/v1
kind: Job
metadata:
  name: test-job
spec:
  backoffLimit: 2
  completions: 4
  parallelism: 1
  template:
    spec:
      containers:
      - name: random-error
        image: busybox
        args:
        - bin/sh
        - -c
        - 'sleep 10; exit $(( ( RANDOM % 2 ) ));'
      restartPolicy: Never

So the job creates multiple pods and then fails:

Events:
  Type     Reason                Age    From            Message
  ----     ------                ----   ----            -------
  Normal   SuccessfulCreate      15m    job-controller  Created pod: test-job-cgwhn
  Normal   SuccessfulCreate      14m    job-controller  Created pod: test-job-96x9t
  Normal   SuccessfulCreate      14m    job-controller  Created pod: test-job-tqkmr
  Normal   SuccessfulCreate      13m    job-controller  Created pod: test-job-stzq2
  Normal   SuccessfulCreate      13m    job-controller  Created pod: test-job-z4z77
  Normal   SuccessfulCreate      12m    job-controller  Created pod: test-job-9m8fl
  Normal   SuccessfulCreate      11m    job-controller  Created pod: test-job-nj2ql
  Normal   SuccessfulCreate      10m    job-controller  Created pod: test-job-b82ts
  Normal   SuccessfulCreate      9m51s  job-controller  Created pod: test-job-llzs5
  Normal   SuccessfulCreate      8m22s  job-controller  Created pod: test-job-9mjtp
  Normal   SuccessfulCreate      7m36s  job-controller  Created pod: test-job-z4vnv
  Warning  BackoffLimitExceeded  6m43s  job-controller  Job has reached the specified backoff limit

But upon checking the created pods:

ubuntu@master-0:~/tests$ k get po --sort-by=.metadata.creationTimestamp  
NAME             READY   STATUS      RESTARTS   AGE
test-job-cgwhn   0/1     Error       0          9m39s
test-job-96x9t   0/1     Error       0          9m13s
test-job-tqkmr   0/1     Completed   0          8m31s
test-job-stzq2   0/1     Completed   0          8m3s
test-job-z4z77   0/1     Error       0          7m29s
test-job-9m8fl   0/1     Completed   0          6m58s
test-job-nj2ql   0/1     Error       0          6m22s
test-job-b82ts   0/1     Error       0          5m6s
test-job-llzs5   0/1     Error       0          4m19s
test-job-9mjtp   0/1     Error       0          2m50s
test-job-z4vnv   0/1     Error       0          2m4s

I noticed that the job did not fail after 2 failed retries, but after 4 (check the last 5 pods).

What you expected to happen:
Job to fail after 2 failed retries
How to reproduce it (as minimally and precisely as possible):
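Apply the manifest above and watch the pods until the backoff limit is hit (assuming the manifest is saved as job.yaml):

# assuming the manifest above is saved as job.yaml
kubectl apply -f job.yaml
kubectl get pods -l job-name=test-job --watch
kubectl describe job test-job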

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): v1.19.3
  • Cloud provider or hardware configuration: Local KVMs
  • OS (e.g: cat /etc/os-release): Ubuntu 18.04.4 LTS (Bionic Beaver)
  • Kernel (e.g. uname -a): Linux master-0 4.15.0-123-generic #126-Ubuntu SMP Wed Oct 21 09:40:11 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: kubeadm
  • Network plugin and version (if this is a network-related bug): Weave
  • Others:
@yacinelazaar yacinelazaar added the kind/bug Categorizes issue or PR as related to a bug. label Nov 17, 2020
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 17, 2020
@k8s-ci-robot
Contributor

@yacinelazaar: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@yacinelazaar
Author

/sig apps

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 17, 2020
@Joseph-Irving
Member

This is expected behaviour. The reason you have more than 2 failed pods is that the backoff count can get reset when a pod exits successfully: a pod fails (backoff = 1), then a pod completes (backoff = 0), then a pod fails (backoff = 1), then a pod completes (backoff = 0), and so on.
See https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-backoff-failure-policy

The back-off count is reset when a Job's Pod is deleted or successful without any other Pods for the Job failing around that time.
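
To make that concrete, the counting behaves roughly like this (an illustrative sketch only, not the actual job controller code):

# Illustration only: consecutive failures increment the counter, a success
# resets it, and the Job is only marked Failed once the counter reaches
# backoffLimit (2 in this Job). The fail/success sequence here is made up.
backoff=0
for result in fail success fail success fail fail; do
  if [ "$result" = fail ]; then
    backoff=$((backoff + 1))
  else
    backoff=0
  fi
  echo "pod $result -> backoff=$backoff"
done

So a run can accumulate well more than 2 failed pods in total before the limit is ever reached, as long as successes keep resetting the counter.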

@yacinelazaar
Author

This is expected behaviour. The reason you have more than 2 failed pods is that the backoff count can get reset when a pod exits successfully: a pod fails (backoff = 1), then a pod completes (backoff = 0), then a pod fails (backoff = 1), then a pod completes (backoff = 0), and so on.
See https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-backoff-failure-policy

The back-off count is reset when a Job's Pod is deleted or successful without any other Pods for the Job failing around that time.

Yes, but the pods are created successively here. Plus, I have sorted the pods by creationTimestamp, and there are still 5 failing pods after the last completed pod, so backoff = 4, which is still greater than 2.

@Joseph-Irving
Member

Ah yes, I see you have four failed pods in a row.
I could not replicate this on my own test cluster on version 1.19.3.
Here you can see it working as expected:

 kubectl describe jobs test-job
Name:           test-job
Namespace:      default
Selector:       controller-uid=9ea8a3a5-0dae-4fba-af33-7124b4148e53
Labels:         controller-uid=9ea8a3a5-0dae-4fba-af33-7124b4148e53
                job-name=test-job
Annotations:    <none>
Parallelism:    1
Completions:    4
Start Time:     Thu, 19 Nov 2020 15:17:19 +0000
Pods Statuses:  0 Running / 2 Succeeded / 3 Failed
Pod Template:
  Labels:  controller-uid=9ea8a3a5-0dae-4fba-af33-7124b4148e53
           job-name=test-job
  Containers:
   random-error:
    Image:      busybox
    Port:       <none>
    Host Port:  <none>
    Args:
      bin/sh
      -c
      sleep 10; exit $(( ( RANDOM % 2 ) ));
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Events:
  Type     Reason                Age    From            Message
  ----     ------                ----   ----            -------
  Normal   SuccessfulCreate      2m25s  job-controller  Created pod: test-job-r5s6b
  Normal   SuccessfulCreate      2m11s  job-controller  Created pod: test-job-4r2wd
  Normal   SuccessfulCreate      117s   job-controller  Created pod: test-job-8926h
  Normal   SuccessfulCreate      104s   job-controller  Created pod: test-job-2crd6
  Normal   SuccessfulCreate      91s    job-controller  Created pod: test-job-dgwqd
  Warning  BackoffLimitExceeded  58s    job-controller  Job has reached the specified backoff limit
NAME             READY   STATUS      RESTARTS   AGE
test-job-r5s6b   0/1     Completed   0          3m34s
test-job-4r2wd   0/1     Error       0          3m20s
test-job-8926h   0/1     Completed   0          3m6s
test-job-2crd6   0/1     Error       0          2m53s
test-job-dgwqd   0/1     Error       0          2m40s

Can you show the full output of kubectl describe? Just to make sure the Pod Statuses look correct.
Can you reliably reproduce this? Your code will only sometimes fail more than the backoff limit.
Were there any error logs in the controller-manager while this happened, specifically from job_controller.go?
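
For reference, something like this should show both; the controller-manager pod name is a placeholder and will differ on your cluster:

kubectl describe job test-job
# <controller-manager-pod> is a placeholder, e.g. kube-controller-manager-<node-name>
kubectl -n kube-system logs <controller-manager-pod> | grep job_controller.go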

@Joseph-Irving
Member

It seems this may be some flaky behaviour that you're seeing.
A fix to make it more reliable was submitted in #93779.

@yacinelazaar
Author

yacinelazaar commented Nov 19, 2020

I ran a couple of trials with a reduced sleep time (3s) and came across this:

ubuntu@master-0:~$ k get po --sort-by=.metadata.creationTimestamp
NAME             READY   STATUS      RESTARTS   AGE
test-job-rk45b   0/1     Completed   0          4m51s
test-job-j8rf5   0/1     Error       0          4m26s
test-job-nrrqn   0/1     Completed   0          4m1s
test-job-bm6nx   0/1     Error       0          3m39s
test-job-5pbnz   0/1     Error       0          3m14s
test-job-c2l2j   0/1     Completed   0          2m44s
test-job-f4qv2   0/1     Error       0          2m29s
test-job-76n59   0/1     Error       0          2m4s

The job failed, but I don't see a reason why. I thought I had to have 3 pods failing in succession to reach that status, but in this case it just failed after 2. Notice it did not fail after the 4th and 5th pods.

Here is the job description:

ubuntu@master-0:~$ k describe job test-job 
Name:           test-job
Namespace:      default
Selector:       controller-uid=ef8172b2-1bbb-4a27-8b49-f96863f43920
Labels:         controller-uid=ef8172b2-1bbb-4a27-8b49-f96863f43920
                job-name=test-job
Annotations:    <none>
Parallelism:    1
Completions:    4
Start Time:     Thu, 19 Nov 2020 17:47:50 +0000
Pods Statuses:  0 Running / 3 Succeeded / 5 Failed
Pod Template:
  Labels:  controller-uid=ef8172b2-1bbb-4a27-8b49-f96863f43920
           job-name=test-job
  Containers:
   random-error:
    Image:      busybox
    Port:       <none>
    Host Port:  <none>
    Args:
      bin/sh
      -c
      sleep 3; exit $(( ( RANDOM % 2 ) ));
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Events:
  Type     Reason                Age    From            Message
  ----     ------                ----   ----            -------
  Normal   SuccessfulCreate      8m31s  job-controller  Created pod: test-job-rk45b
  Normal   SuccessfulCreate      8m6s   job-controller  Created pod: test-job-j8rf5
  Normal   SuccessfulCreate      7m41s  job-controller  Created pod: test-job-nrrqn
  Normal   SuccessfulCreate      7m19s  job-controller  Created pod: test-job-bm6nx
  Normal   SuccessfulCreate      6m54s  job-controller  Created pod: test-job-5pbnz
  Normal   SuccessfulCreate      6m24s  job-controller  Created pod: test-job-c2l2j
  Normal   SuccessfulCreate      6m9s   job-controller  Created pod: test-job-f4qv2
  Normal   SuccessfulCreate      5m44s  job-controller  Created pod: test-job-76n59
  Warning  BackoffLimitExceeded  5m2s   job-controller  Job has reached the specified backoff limit

As for the job controller, it reports these errors whenever a pod fails:

ubuntu@master-0:~$ k logs -f -nkube-system kube-controller-manager-master-0 | grep job_controller.go
E1119 17:48:40.716650       1 job_controller.go:402] Error syncing job: failed pod(s) detected for job key "default/test-job"
E1119 17:49:27.745620       1 job_controller.go:402] Error syncing job: failed pod(s) detected for job key "default/test-job"
E1119 17:49:57.597418       1 job_controller.go:402] Error syncing job: failed pod(s) detected for job key "default/test-job"
E1119 17:50:37.685580       1 job_controller.go:402] Error syncing job: failed pod(s) detected for job key "default/test-job"
E1119 17:50:37.759372       1 job_controller.go:402] Error syncing job: failed pod(s) detected for job key "default/test-job"

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 17, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 19, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
