BackoffLimit does not work for pods in Jobs (Kubernetes 1.19) #101584

kolomenkin · 2021-04-28T11:53:58Z

What happened:

I deployed a CronJob that references a non-existent Docker image.
The Job was created on schedule, and its Pod was created but failed to pull the image.
The Pod has stayed in the "ImagePullBackOff" state for more than 10 hours.
If I deploy a new version of the CronJob with a correct Docker image tag, the new Job never starts.

What you expected to happen:

I expect the Pod to honor the default BackoffLimit described in the documentation:
Default value = 6
Delays = 10 sec, 20 sec, 40 sec, 80 sec, 160 sec and 320 sec
Total delay is less than 12 minutes.
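
As a quick check of the arithmetic above, here is a minimal sketch that rebuilds the doubling schedule and sums it. The base delay (10 seconds) and the retry count (6) are taken from this report, not from the Kubernetes source code, and the function name is made up for illustration.

```python
from typing import List

# Values quoted in the report above; treat them as assumptions about the
# documented behavior, not as constants read from Kubernetes itself.
BASE_DELAY_SECONDS = 10
BACKOFF_LIMIT = 6


def backoff_delays(base_seconds: int, retries: int) -> List[int]:
    """Return the delay before each retry, doubling every time."""
    return [base_seconds * 2 ** attempt for attempt in range(retries)]


delays = backoff_delays(BASE_DELAY_SECONDS, BACKOFF_LIMIT)
total_seconds = sum(delays)

print(delays)              # [10, 20, 40, 80, 160, 320]
print(total_seconds / 60)  # 10.5
```

The total comes to 630 seconds, i.e. 10.5 minutes, which is consistent with the "less than 12 minutes" estimate.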

I expect the Pod to fail about 12 minutes after it was created.
With "restartPolicy: Never", I expect the Job to finish once the Pod fails (after about 12 minutes).
I also expect the scheduler to create a new Job from the updated template when the schedule calls for it
(currently this never happens).

How to reproduce it (as minimally and precisely as possible):

See "What happened"

Anything else we need to know?:

It seems the back-off counter for pull retries is not incremented. Here is part of the Pod event log:

  Warning  Failed   13m (x439 over 113m)    kubelet  Error: ImagePullBackOff
  Normal   BackOff  3m25s (x484 over 113m)  kubelet  Back-off pulling image "my.private.artifactory/myimage:mytag"

It seems to attempt a docker pull every 12 seconds on average, and to wait only the initial 10 seconds between retries.

Here is an earlier version of the same Pod event log:

  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  24m                   default-scheduler  Successfully assigned mynamespace/mynamespace-mycronjob-1619601000-h5cmr to ip-10-0-45-27.eu-west-1.compute.internal
  Normal   Pulling    23m (x4 over 24m)     kubelet            Pulling image "my.private.artifactory/myimage:mytag"
  Warning  Failed     23m (x4 over 24m)     kubelet            Failed to pull image "my.private.artifactory/myimage:mytag": rpc error: code = Unknown desc = Error response from daemon: Get https://my.private.artifactory/v2/myimage/manifests/mytag: unknown: Authentication is required
  Warning  Failed     23m (x4 over 24m)     kubelet            Error: ErrImagePull
  Normal   BackOff    9m44s (x65 over 24m)  kubelet            Back-off pulling image "my.private.artifactory/myimage:mytag"
  Warning  Failed     4m45s (x87 over 24m)  kubelet            Error: ImagePullBackOff
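
The estimate of roughly 12 seconds per retry can be sanity-checked against the BackOff rows in the two listings above. The sketch below assumes the Age column reads as "last seen (xCount over first seen)", and it only yields a rough figure because the ages are rounded and event counts are aggregated; the helper names are made up for illustration.

```python
def to_seconds(minutes: int, seconds: int = 0) -> int:
    """Convert a minutes/seconds age from the event listing to seconds."""
    return minutes * 60 + seconds


# BackOff rows copied from the two event listings above.
earlier = {"count": 65, "first_seen": to_seconds(24), "last_seen": to_seconds(9, 44)}
later = {"count": 484, "first_seen": to_seconds(113), "last_seen": to_seconds(3, 25)}


def last_event_offset(sample: dict) -> int:
    """Seconds between the first and the most recent occurrence of the event."""
    return sample["first_seen"] - sample["last_seen"]


# Time elapsed and events added between the two snapshots.
elapsed = last_event_offset(later) - last_event_offset(earlier)
extra_events = later["count"] - earlier["count"]
mean_interval = elapsed / extra_events

print(round(mean_interval, 1))  # 13.6
```

Under these assumptions the BackOff event recurs about every 13 to 14 seconds, the same order of magnitude as the estimate above and nowhere near the minutes-long delays an exponential schedule would reach.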

Related (similar) issues:

Environment:

Kubernetes version (use kubectl version): 1.19

Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.6-eks-49a6c0", GitCommit:"49a6c0bf091506e7bafcdb1b142351b69363355a", GitTreeState:"clean", BuildDate:"2020-12-23T22:10:21Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider or hardware configuration: Amazon EKS (their SaaS Kubernetes)
OS (e.g: cat /etc/os-release):
Kernel (e.g. uname -a):
Install tools:
Network plugin and version (if this is a network-related bug):
Others:


k8s-ci-robot · 2021-04-28T11:54:05Z

@kolomenkin: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

kolomenkin · 2021-04-28T12:24:26Z

/sig architecture
/wg Reliability
/committee steering

k8s-ci-robot · 2021-04-28T12:25:39Z

@kolomenkin: The label(s) committee/-steering cannot be applied, because the repository doesn't have them.

In response to this:

/committee -steering


kolomenkin · 2021-04-28T12:58:49Z

/sig apps

249043822 · 2021-05-06T07:01:44Z

If ErrImagePull occurs, the Pod's containers are never created and the back-off only retries the image pull, so the Job's BackoffLimit never registers the ImagePullBackOff.
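
The point of this comment can be illustrated with a toy model. This is not the Job controller's code: it simply assumes, for a Job with "restartPolicy: Never", that only Pods in the Failed phase count against the limit and that a Pod stuck on an image pull stays in the Pending phase. The function names are hypothetical.

```python
from typing import List


def failed_pod_count(pod_phases: List[str]) -> int:
    """Count the Pods that have reached the Failed phase."""
    return sum(1 for phase in pod_phases if phase == "Failed")


def job_exceeds_backoff_limit(pod_phases: List[str], backoff_limit: int = 6) -> bool:
    """Toy rule: the Job gives up once failed Pods outnumber the limit."""
    return failed_pod_count(pod_phases) > backoff_limit


# A Pod stuck in ImagePullBackOff never leaves Pending in this model,
# so it contributes nothing to the failure count however long it waits.
stuck_on_image_pull = ["Pending"]
print(job_exceeds_backoff_limit(stuck_on_image_pull))  # False

# Seven Pods whose containers actually ran and failed do exceed the limit.
crashing = ["Failed"] * 7
print(job_exceeds_backoff_limit(crashing))  # True
```

In this simplified picture, no amount of waiting moves the stuck Pod toward the limit, which matches the behavior reported in the issue.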

SergeyKanzhelev · 2021-05-07T21:09:33Z

Glanced over. Is this a dup of #87278?

kolomenkin · 2021-05-08T18:46:38Z

I would say it is not a full duplicate of #87278.

I can imagine someone deciding to leave that issue unfixed for Deployments. I'm not sure whether editing a Deployment's YAML is affected.

But I am pointing out similar behavior in the context of CronJob, i.e. a Job needs to succeed or fail on some schedule.

And it is extremely important when a CronJob is updated (redeployed) with a different, valid image reference: these new changes are ignored indefinitely.

k8s-triage-robot · 2021-08-06T19:17:20Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

k8s-triage-robot · 2021-09-05T19:18:56Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

kouk · 2021-09-12T08:12:02Z

Does anyone have a workaround for this?
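
One possible mitigation, offered only as an untested sketch, is to bound the Job's lifetime with `activeDeadlineSeconds`, a field of the Job spec that marks the Job as failed once the deadline passes, independently of `backoffLimit`. Nobody in this thread has confirmed that it resolves the scenario on 1.19. The CronJob name, schedule, container name and the 720-second deadline below are hypothetical; the image reference is the one from the report. The script prints a JSON manifest that could be piped to `kubectl apply -f -`.

```python
import json

cronjob = {
    # CronJob lived under batch/v1beta1 in Kubernetes 1.19; newer clusters
    # may need a different apiVersion (assumption, check your cluster).
    "apiVersion": "batch/v1beta1",
    "kind": "CronJob",
    "metadata": {"name": "mycronjob"},  # hypothetical name
    "spec": {
        "schedule": "*/30 * * * *",  # hypothetical schedule
        "jobTemplate": {
            "spec": {
                "backoffLimit": 6,
                # Fail the Job after 12 minutes even if its Pod never starts.
                "activeDeadlineSeconds": 720,
                "template": {
                    "spec": {
                        "restartPolicy": "Never",
                        "containers": [
                            {
                                "name": "main",  # hypothetical container name
                                "image": "my.private.artifactory/myimage:mytag",
                            }
                        ],
                    }
                },
            }
        },
    },
}

print(json.dumps(cronjob, indent=2))
```

The deadline is set to the 12 minutes the reporter expected the Pod to take before failing; any value shorter than the schedule interval would serve the same purpose in this sketch.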

k8s-triage-robot · 2021-10-29T23:53:02Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue or PR with /reopen
Mark this issue or PR as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot · 2021-10-29T23:53:19Z

@k8s-triage-robot: Closing this issue.


Cuppojoe · 2022-04-15T17:55:29Z

/reopen

k8s-ci-robot · 2022-04-15T17:55:43Z

@Cuppojoe: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen


kolomenkin added the kind/bug label Apr 28, 2021

k8s-ci-robot added the needs-sig label Apr 28, 2021

k8s-ci-robot added the needs-triage label Apr 28, 2021

k8s-ci-robot added the sig/apps label Apr 28, 2021

kolomenkin changed the title ~~BackoffLimit does not work for pods in Jobs~~ BackoffLimit does not work for pods in Jobs (Kubernetes 1.19) Apr 28, 2021

k8s-ci-robot added the lifecycle/stale label Aug 6, 2021

k8s-ci-robot added lifecycle/rotten and removed lifecycle/stale labels Sep 5, 2021

k8s-ci-robot closed this as completed Oct 29, 2021

nuclearcat mentioned this issue Nov 7, 2022

k8s job deadline kernelci/kernelci-core#1492

Open
