Job spec.backoffLimit does not handle pod garbage collection #111608
Labels
kind/bug
Categorizes issue or PR as related to a bug.
lifecycle/stale
Denotes an issue or PR has remained open with no activity and has become stale.
needs-triage
Indicates an issue or PR lacks a `triage/foo` label and requires one.
sig/apps
Categorizes an issue or PR as relevant to SIG Apps.
What happened?
We created a cronjob with backoffLimit of 6 in a cluster with many cronjobs and frequent schedules (possibly cycling through more than 10k pods within half an hour). The launched pods always fail due to a code issue. Kubernetes is now executing an infinite number of retries for this cronjob, even though retries should have been capped at six.
What did you expect to happen?
We expected the number of retries to be capped at 6.
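For reference, a minimal sketch of the kind of CronJob involved (the name, schedule, and image are placeholders; the failing command simulates our code issue):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: failing-cronjob        # hypothetical name
spec:
  schedule: "*/1 * * * *"      # frequent schedule, as in our cluster
  jobTemplate:
    spec:
      backoffLimit: 6          # retries should stop after 6 failures
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: task
              image: busybox
              command: ["sh", "-c", "exit 1"]  # always fails, standing in for the real code issue
```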
How can we reproduce it (as minimally and precisely as possible)?
This requires either
a) churning through 12500 pods (the default terminated-pod-gc-threshold) in a small time frame, or
b) reducing terminated-pod-gc-threshold (not yet tested, as we are running on GKE and cannot modify this)
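For clusters where the control plane is accessible (e.g. kubeadm-managed; not possible on GKE), option b) could be attempted by lowering the kube-controller-manager's `--terminated-pod-gc-threshold` flag. A sketch, assuming a kubeadm ClusterConfiguration:

```yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controllerManager:
  extraArgs:
    # Garbage-collect terminated pods much sooner than the default of 12500,
    # so failed pods backing the Job's retry count disappear quickly.
    terminated-pod-gc-threshold: "100"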
Anything else we need to know?
This bug seems to have been introduced while trying to solve another related bug in this PR: #93779
Kubernetes version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.13", GitCommit:"80ec6572b15ee0ed2e6efa97a4dcd30f57e68224", GitTreeState:"clean", BuildDate:"2022-05-24T12:40:44Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.12-gke.1700", GitCommit:"6c11aec6ce32cf0d66a2631eed2eb49dd65c89f8", GitTreeState:"clean", BuildDate:"2022-05-13T09:31:25Z", GoVersion:"go1.16.15b7", Compiler:"gc", Platform:"linux/amd64"}
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)