Kubernetes Jobs API rapid-fire scheduling doesn't honor exponential backoff characteristics #114391
Comments
@jayunit100: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/sig apps
/assign
This is broken when the
If the feature gate is disabled, pod retries honor exponential backoff. From a quick glance, there are a few reasons why this happens:
kubernetes/pkg/controller/job/job_controller.go Lines 869 to 884 in bb6edfb
Looking at: kubernetes/pkg/controller/job/job_controller.go Lines 308 to 330 in bb6edfb
We can end up with a scenario where:
I'll send a PR to fix both issues.
/kind regression

Created a PR: #114516
Did you confirm in older versions of k8s?
IIUC, this issue is not exclusive to job tracking with finalizers. Let's backport to all supported versions of k8s (1.23)
Cherry-pick PRs: |
All cherry-picks have merged. Keeping this issue open to fix the backoff for jobs with

/assign @sathyanarays
@nikhita: GitHub didn't allow me to assign the following users: sathyanarays. Note that only kubernetes members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
/assign |
Are master and release branches still in a regressed state, or is this a follow-up to fix/improve a pre-existing issue? If any regressions are now resolved, can we track #114768 in a new issue? |
master and release branches have been fixed. #114768 will fix the issue for
#114768 is a really big change... is there a more scoped change we can backport with lower risk?
#114768 is only expected to land on master and is not intended to be backported. The issue with
The fix for
ref: #114516 (comment)
Happy to track #114768 in a new issue if it's easier though.
That's helpful to know... having one issue for the regression fixed in #114516 and one for the other issue still to be fixed might be helpful |
let's track it separately

/close
@alculquicondor: Closing this issue. In response to this:
What happened?
Problem
The job controller appears to "rapidly reschedule" several pods at the same time (without backing off), which ultimately looks almost like a batch or gang-scheduling workflow.
This is the opposite of what the Job docs say, because they clearly state:
So, rather than,
This happens in the 1.25 release; I'm not sure whether it also occurs earlier. You can see this with the latest kind version.
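For reference, the Job docs describe an exponential back-off for failed pod recreations: delays starting at 10s, doubling on each failure, and capped at six minutes. A quick sketch of that expected schedule:

```shell
# Expected Job retry delays per the docs: start at 10s, double each
# failure, cap at 360s (six minutes).
delay=10
for failure in 1 2 3 4 5 6 7; do
  echo "retry ${failure}: ${delay}s"
  delay=$((delay * 2))
  if [ "$delay" -gt 360 ]; then delay=360; fi
done
```

The issue reported here is that the observed recreations do not follow this schedule.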
What did you expect to happen?
Jobs wouldn't schedule the same pod at rapid-fire intervals.
How can we reproduce it (as minimally and precisely as possible)?
Details
Thanks to Ryan for this reproducer:
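(The original reproducer isn't preserved above. A minimal sketch of a Job that should exercise the same retry path — names and image are hypothetical — would be a Job whose pod always fails, forcing the controller to recreate it repeatedly:)

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: backoff-test        # hypothetical name
spec:
  backoffLimit: 6
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: fail
        image: busybox
        command: ["sh", "-c", "exit 1"]   # always fails, forcing retries
```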
Now once this is running, you can graph the values like so:
And you'll see a table like this, where the integers below are the ages of the pods in seconds. We can see that the pod ages are 3 seconds apart, as opposed to 10, 20, 40, 80, ... or some other exponentially increasing number of seconds apart.
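As a self-contained illustration of the symptom (the sample ages below are hypothetical, mirroring the flat 3-second spacing described above), the gaps between consecutive pod creations come out constant rather than doubling:

```shell
# Hypothetical sample of pod ages in seconds; print the gap between
# consecutive pods. A healthy backoff would show growing gaps; here
# every gap is a flat 3s.
printf '%s\n' 3 6 9 12 15 | awk 'NR > 1 { print $1 - prev } { prev = $1 }'
```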
Anything else we need to know?
No response
Kubernetes version
1.25
Cloud provider
all
OS version
all
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)