
Job not failing after backoffLimit is reached #96630

Closed
yacinelazaar opened this issue Nov 17, 2020 · 11 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps.

Comments

@yacinelazaar

yacinelazaar commented Nov 17, 2020

What happened:
I have the following Job running, which randomly fails or succeeds, with completions set to 4 and parallelism set to 1. Notice that the backoffLimit here is 2:

apiVersion: batch/v1
kind: Job
metadata:
  name: test-job
spec:
  backoffLimit: 2
  completions: 4
  parallelism: 1
  template:
    spec:
      containers:
      - name: random-error
        image: busybox
        args:
        - bin/sh
        - -c
        - 'sleep 10; exit $(( ( RANDOM % 2 ) ));'
      restartPolicy: Never

So the job creates multiple pods and then fails:

Events:
  Type     Reason                Age    From            Message
  ----     ------                ----   ----            -------
  Normal   SuccessfulCreate      15m    job-controller  Created pod: test-job-cgwhn
  Normal   SuccessfulCreate      14m    job-controller  Created pod: test-job-96x9t
  Normal   SuccessfulCreate      14m    job-controller  Created pod: test-job-tqkmr
  Normal   SuccessfulCreate      13m    job-controller  Created pod: test-job-stzq2
  Normal   SuccessfulCreate      13m    job-controller  Created pod: test-job-z4z77
  Normal   SuccessfulCreate      12m    job-controller  Created pod: test-job-9m8fl
  Normal   SuccessfulCreate      11m    job-controller  Created pod: test-job-nj2ql
  Normal   SuccessfulCreate      10m    job-controller  Created pod: test-job-b82ts
  Normal   SuccessfulCreate      9m51s  job-controller  Created pod: test-job-llzs5
  Normal   SuccessfulCreate      8m22s  job-controller  Created pod: test-job-9mjtp
  Normal   SuccessfulCreate      7m36s  job-controller  Created pod: test-job-z4vnv
  Warning  BackoffLimitExceeded  6m43s  job-controller  Job has reached the specified backoff limit

But upon checking the created pods:

ubuntu@master-0:~/tests$ k get po --sort-by=.metadata.creationTimestamp  
NAME             READY   STATUS      RESTARTS   AGE
test-job-cgwhn   0/1     Error       0          9m39s
test-job-96x9t   0/1     Error       0          9m13s
test-job-tqkmr   0/1     Completed   0          8m31s
test-job-stzq2   0/1     Completed   0          8m3s
test-job-z4z77   0/1     Error       0          7m29s
test-job-9m8fl   0/1     Completed   0          6m58s
test-job-nj2ql   0/1     Error       0          6m22s
test-job-b82ts   0/1     Error       0          5m6s
test-job-llzs5   0/1     Error       0          4m19s
test-job-9mjtp   0/1     Error       0          2m50s
test-job-z4vnv   0/1     Error       0          2m4s

I noticed that the job did not fail after 2 failed retries, but after 4 (check the last 5 pods).

What you expected to happen:
Job to fail after 2 failed retries
How to reproduce it (as minimally and precisely as possible):
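Apply the manifest above and watch the pods until the backoff limit is hit (assuming the manifest is saved as job.yaml):

# assuming the manifest above is saved as job.yaml
kubectl apply -f job.yaml
kubectl get pods -l job-name=test-job --watch
kubectl describe job test-job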

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): v1.19.3
  • Cloud provider or hardware configuration: Local KVMs
  • OS (e.g: cat /etc/os-release): Ubuntu 18.04.4 LTS (Bionic Beaver)
  • Kernel (e.g. uname -a): Linux master-0 4.15.0-123-generic #126-Ubuntu SMP Wed Oct 21 09:40:11 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: kubeadm
  • Network plugin and version (if this is a network-related bug): Weave
  • Others:
@yacinelazaar yacinelazaar added the kind/bug Categorizes issue or PR as related to a bug. label Nov 17, 2020
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 17, 2020
@k8s-ci-robot
Contributor

@yacinelazaar: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@yacinelazaar
Author

/sig apps

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 17, 2020
@Joseph-Irving
Member

This is expected behaviour. The reason you have more than 2 failed pods is that the backoff count can get reset when a pod exits successfully: a pod fails (backoff = 1), then a pod completes (backoff = 0), then a pod fails (backoff = 1), then a pod completes (backoff = 0), and so on.
See https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-backoff-failure-policy

The back-off count is reset when a Job's Pod is deleted or successful without any other Pods for the Job failing around that time.
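
To make that concrete, the counting behaves roughly like this (an illustrative sketch only, not the actual job controller code):

# Illustration only: consecutive failures increment the counter, a success
# resets it, and the Job is only marked Failed once the counter reaches
# backoffLimit (2 in this Job). The fail/success sequence here is made up.
backoff=0
for result in fail success fail success fail fail; do
  if [ "$result" = fail ]; then
    backoff=$((backoff + 1))
  else
    backoff=0
  fi
  echo "pod $result -> backoff=$backoff"
done

So a run can accumulate well more than 2 failed pods in total before the limit is ever reached, as long as successes keep resetting the counter.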

@yacinelazaar
Author

This is expected behaviour. The reason you have more than 2 failed pods is that the backoff count can get reset when a pod exits successfully: a pod fails (backoff = 1), then a pod completes (backoff = 0), then a pod fails (backoff = 1), then a pod completes (backoff = 0), and so on.
See https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-backoff-failure-policy

The back-off count is reset when a Job's Pod is deleted or successful without any other Pods for the Job failing around that time.

Yes, but the pods are created successively here. Plus, I have sorted the pods by creationTimestamp, and there are still 5 failing pods after the last completed pod, so backoff = 4, which is still greater than 2.

@Joseph-Irving
Member

Ah yes, I see you have four failed pods in a row.
I could not replicate this on my own test cluster on version 1.19.3.
Here you can see it working as expected:

 kubectl describe jobs test-job
Name:           test-job
Namespace:      default
Selector:       controller-uid=9ea8a3a5-0dae-4fba-af33-7124b4148e53
Labels:         controller-uid=9ea8a3a5-0dae-4fba-af33-7124b4148e53
                job-name=test-job
Annotations:    <none>
Parallelism:    1
Completions:    4
Start Time:     Thu, 19 Nov 2020 15:17:19 +0000
Pods Statuses:  0 Running / 2 Succeeded / 3 Failed
Pod Template:
  Labels:  controller-uid=9ea8a3a5-0dae-4fba-af33-7124b4148e53
           job-name=test-job
  Containers:
   random-error:
    Image:      busybox
    Port:       <none>
    Host Port:  <none>
    Args:
      bin/sh
      -c
      sleep 10; exit $(( ( RANDOM % 2 ) ));
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Events:
  Type     Reason                Age    From            Message
  ----     ------                ----   ----            -------
  Normal   SuccessfulCreate      2m25s  job-controller  Created pod: test-job-r5s6b
  Normal   SuccessfulCreate      2m11s  job-controller  Created pod: test-job-4r2wd
  Normal   SuccessfulCreate      117s   job-controller  Created pod: test-job-8926h
  Normal   SuccessfulCreate      104s   job-controller  Created pod: test-job-2crd6
  Normal   SuccessfulCreate      91s    job-controller  Created pod: test-job-dgwqd
  Warning  BackoffLimitExceeded  58s    job-controller  Job has reached the specified backoff limit
NAME             READY   STATUS      RESTARTS   AGE
test-job-r5s6b   0/1     Completed   0          3m34s
test-job-4r2wd   0/1     Error       0          3m20s
test-job-8926h   0/1     Completed   0          3m6s
test-job-2crd6   0/1     Error       0          2m53s
test-job-dgwqd   0/1     Error       0          2m40s

Can you show the full output of kubectl describe? Just to make sure the Pod Statuses look correct.
Can you reliably reproduce this? Your code will only sometimes fail more than the backoff limit.
Were there any error logs in the controller-manager while this happened, specifically from job_controller.go?
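
For reference, something like this should show both; the controller-manager pod name is a placeholder and will differ on your cluster:

kubectl describe job test-job
# <controller-manager-pod> is a placeholder, e.g. kube-controller-manager-<node-name>
kubectl -n kube-system logs <controller-manager-pod> | grep job_controller.go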

@Joseph-Irving
Member

It seems this may be some flaky behaviour that you're seeing.
A fix to make it more reliable was submitted in #93779.

@yacinelazaar
Author

yacinelazaar commented Nov 19, 2020

I ran a couple of trials with a reduced sleep time (3s) and came across this:

ubuntu@master-0:~$ k get po --sort-by=.metadata.creationTimestamp
NAME             READY   STATUS      RESTARTS   AGE
test-job-rk45b   0/1     Completed   0          4m51s
test-job-j8rf5   0/1     Error       0          4m26s
test-job-nrrqn   0/1     Completed   0          4m1s
test-job-bm6nx   0/1     Error       0          3m39s
test-job-5pbnz   0/1     Error       0          3m14s
test-job-c2l2j   0/1     Completed   0          2m44s
test-job-f4qv2   0/1     Error       0          2m29s
test-job-76n59   0/1     Error       0          2m4s

The job failed, but I don't see a reason why. I thought I had to have 3 pods failing in succession to reach that status, but in this case it just failed after 2. Notice it did not fail after the 4th and 5th pods.

Here is the job description:

ubuntu@master-0:~$ k describe job test-job 
Name:           test-job
Namespace:      default
Selector:       controller-uid=ef8172b2-1bbb-4a27-8b49-f96863f43920
Labels:         controller-uid=ef8172b2-1bbb-4a27-8b49-f96863f43920
                job-name=test-job
Annotations:    <none>
Parallelism:    1
Completions:    4
Start Time:     Thu, 19 Nov 2020 17:47:50 +0000
Pods Statuses:  0 Running / 3 Succeeded / 5 Failed
Pod Template:
  Labels:  controller-uid=ef8172b2-1bbb-4a27-8b49-f96863f43920
           job-name=test-job
  Containers:
   random-error:
    Image:      busybox
    Port:       <none>
    Host Port:  <none>
    Args:
      bin/sh
      -c
      sleep 3; exit $(( ( RANDOM % 2 ) ));
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Events:
  Type     Reason                Age    From            Message
  ----     ------                ----   ----            -------
  Normal   SuccessfulCreate      8m31s  job-controller  Created pod: test-job-rk45b
  Normal   SuccessfulCreate      8m6s   job-controller  Created pod: test-job-j8rf5
  Normal   SuccessfulCreate      7m41s  job-controller  Created pod: test-job-nrrqn
  Normal   SuccessfulCreate      7m19s  job-controller  Created pod: test-job-bm6nx
  Normal   SuccessfulCreate      6m54s  job-controller  Created pod: test-job-5pbnz
  Normal   SuccessfulCreate      6m24s  job-controller  Created pod: test-job-c2l2j
  Normal   SuccessfulCreate      6m9s   job-controller  Created pod: test-job-f4qv2
  Normal   SuccessfulCreate      5m44s  job-controller  Created pod: test-job-76n59
  Warning  BackoffLimitExceeded  5m2s   job-controller  Job has reached the specified backoff limit

As for the job controller, it reports these errors whenever a pod fails:

ubuntu@master-0:~$ k logs -f -nkube-system kube-controller-manager-master-0 | grep job_controller.go
E1119 17:48:40.716650       1 job_controller.go:402] Error syncing job: failed pod(s) detected for job key "default/test-job"
E1119 17:49:27.745620       1 job_controller.go:402] Error syncing job: failed pod(s) detected for job key "default/test-job"
E1119 17:49:57.597418       1 job_controller.go:402] Error syncing job: failed pod(s) detected for job key "default/test-job"
E1119 17:50:37.685580       1 job_controller.go:402] Error syncing job: failed pod(s) detected for job key "default/test-job"
E1119 17:50:37.759372       1 job_controller.go:402] Error syncing job: failed pod(s) detected for job key "default/test-job"

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 17, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 19, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
