Remove terminating count from rmAtLeast #121147
Conversation
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I think this is on the right track.
The reasoning that suggests to me this is correct is as follows. The panic happens when rmAtLeast > len(pods), that is, rmAtLeast > active. Now, substituting active + terminating - wantActive for rmAtLeast, that condition becomes wantActive < terminating. With the proposed change, the condition becomes active - wantActive > active, that is, wantActive < 0 -> not possible :).
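To make the bound concrete, here is a tiny standalone sketch with made-up numbers; all names and values are hypothetical and only mirror the discussion, not the controller code:

```go
package main

import "fmt"

func main() {
	// Hypothetical values: 2 active pods, 3 terminating, and we want 1 active,
	// so wantActive < terminating.
	active, terminating, wantActive := 2, 3, 1

	oldRmAtLeast := active + terminating - wantActive // 4, exceeds active (2)
	newRmAtLeast := active - wantActive               // 1, can never exceed active

	fmt.Println(oldRmAtLeast > active) // true  -> indexing the active pod list out of range
	fmt.Println(newRmAtLeast > active) // false -> would require wantActive < 0
}
```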
parallelism:   4,
completions:   1,
backoffLimit:  6,
initialStatus: &jobInitialStatus{
Drop it.
Drop the initial status? I get an error saying that StartTime is not set unless I have this here.
Seems to depend on the parallelism of 4, dropping.
LGTM. Please update the PR description with a user-oriented release note and some justification for why this helps with the panic.
Co-authored-by: Aldo Culquicondor <1299064+alculquicondor@users.noreply.github.com>
/retest
/approve
/lgtm
LGTM label has been added. Git tree hash: 32b302f5924668c4eedc3e6567b040ef672ba027
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: alculquicondor, kannon92. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
We encountered the following panic. The situation occurred when some nodes on which Job pods were running became NotReady. We think this bug may be fixed by this PR and would like you to consider cherry-picking it to a patch release. The kube-controller-manager version is 1.28.4 and the feature gate is not enabled.
Since the feature gate is not enabled, I don't know if this would actually fix this. Terminating should be 0 with this off. Ideally, we would have a test case where we can reproduce this.
@superbrothers do you have a reproducible scenario for this? If so, it would be great if you could open a dedicated issue to have the discussion there, ideally with the YAMLs and steps necessary to reproduce. For example, it can matter whether this is an Indexed Job or not.
I will try to see if I can reproduce it, but it looks like I won't have time for a while.
Retrospectively describing it as closely as possible, with the YAMLs used to create the API objects, might already be a good start for opening the issue. In the meantime, was this a job with
OK, I will create an issue first. The Job that I believe caused the problem was NonIndexed, but this may not be the cause (not sure, since a lot of jobs are created all the time).
…1147-upstream-release-1.28 Automated cherry pick of #121147: Fix panic if there are more terminating pods than active pods
What type of PR is this?
/kind bug
What this PR does / why we need it:
PodReplacementPolicy was added in 1.28 and it tracks the number of terminating pods. If Failed is specified, we halt replacement until the pod is fully terminal.
We discovered a bug where we were tracking (terminating + active) and using that count as an index into the active pods list. If there are more terminating pods than active ones, it is possible to hit an out-of-bounds error.
@mimowo pointed out that we don't need the terminating count, since we don't need to delete the terminating pods.
This PR drops terminating so we only count active pods.
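To illustrate the out-of-bounds pattern described above, here is a minimal, self-contained sketch; the function and variable names are hypothetical and only mirror the description, not the actual job controller code:

```go
package main

import "fmt"

// removalCandidates slices the active pod list using rmAtLeast as the bound.
// The explicit guard here only exists to surface the failure mode; the real
// code path is different.
func removalCandidates(activePods []string, rmAtLeast int) ([]string, error) {
	if rmAtLeast < 0 {
		rmAtLeast = 0
	}
	if rmAtLeast > len(activePods) {
		// This is the out-of-bounds case: a count that includes terminating
		// pods can exceed the number of active pods.
		return nil, fmt.Errorf("rmAtLeast %d exceeds %d active pods", rmAtLeast, len(activePods))
	}
	return activePods[:rmAtLeast], nil
}

func main() {
	activePods := []string{"pod-a", "pod-b"}
	terminating, wantActive := 3, 1

	// Old count: terminating pods are included even though they are not in activePods.
	fmt.Println(removalCandidates(activePods, len(activePods)+terminating-wantActive)) // error

	// Fixed count: bounded by the number of active pods.
	fmt.Println(removalCandidates(activePods, len(activePods)-wantActive)) // [pod-a] <nil>
}
```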
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: