
Remove terminating count from rmAtLeast #121147

Merged

Conversation

kannon92
Contributor

@kannon92 kannon92 commented Oct 11, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

PodReplacementPolicy was introduced in 1.28 and tracks the number of terminating pods. If Failed is specified, we halt replacement until the pod is fully terminal.

We discovered a bug where we were counting (terminating + active) pods and using that count to index into the active pods list. If there are more terminating pods than active ones, it is possible to hit an out-of-bounds error.

@mimowo pointed out that we don't need the terminating count, since we don't need to delete the terminating pods.

This PR drops the terminating count so that we only count active pods.
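
For illustration, here is a minimal, hypothetical Go sketch of the arithmetic; the names are simplified and do not mirror the controller's actual code in pkg/controller/job/job_controller.go:

package main

import "fmt"

// Hypothetical illustration of the rmAtLeast arithmetic described above;
// the names are simplified and do not mirror the controller's actual code.
func main() {
	activePods := []string{"pod-a", "pod-b"} // 2 active pods
	wantActive := 1                          // desired number of active pods
	terminating := 3                         // more terminating pods than active ones

	// Old formula: terminating pods inflate the count past len(activePods),
	// so slicing activePods[:oldRmAtLeast] would panic with an out-of-range error.
	oldRmAtLeast := len(activePods) + terminating - wantActive
	fmt.Println("old rmAtLeast:", oldRmAtLeast) // 4 > len(activePods)

	// New formula: only active pods are counted, so the bound never exceeds len(activePods).
	newRmAtLeast := len(activePods) - wantActive
	if newRmAtLeast < 0 {
		newRmAtLeast = 0
	}
	fmt.Println("pods to delete:", activePods[:newRmAtLeast]) // [pod-a]
}

With the old formula the slice expression would be activePods[:4] against a slice of length 2, which is exactly the class of panic reported later in this thread.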

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fix panic in the Job controller when podReplacementPolicy: Failed is used and the number of terminating pods exceeds parallelism.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 11, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Oct 11, 2023
@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 11, 2023
@kannon92 kannon92 changed the title WIP: Remove terminating count from rmAtLeast Remove terminating count from rmAtLeast Oct 12, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 12, 2023
@kannon92
Contributor Author

/cc @mimowo

With this PR, I am not sure we need #121009.

I'll verify that this functionally works today.

@kannon92
Contributor Author

I did a manual test to verify that the Failed case works. We also have integration tests for this, but I wanted to make sure.

Contributor

@mimowo mimowo left a comment


I think this is on the right track.

My reasoning for why this is correct is as follows. The panic happens when
rmAtLeast > len(pods), that is, rmAtLeast > active.

Substituting active + terminating - wantActive for rmAtLeast, the panic condition becomes
wantActive < terminating, which is possible.

With the proposed change, the panic condition is active - wantActive > active, that is, wantActive < 0 -> not possible :).
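
Written out compactly (this restates the argument above, using active = len(pods) and assuming wantActive >= 0):

\begin{aligned}
&\text{panic condition: } \texttt{rmAtLeast} > \texttt{active} \\
&\text{old: } \texttt{active} + \texttt{terminating} - \texttt{wantActive} > \texttt{active} \iff \texttt{terminating} > \texttt{wantActive} \quad (\text{possible}) \\
&\text{new: } \texttt{active} - \texttt{wantActive} > \texttt{active} \iff \texttt{wantActive} < 0 \quad (\text{impossible, since } \texttt{wantActive} \ge 0)
\end{aligned}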

pkg/controller/job/job_controller_test.go (outdated, resolved)
parallelism: 4,
completions: 1,
backoffLimit: 6,
initialStatus: &jobInitialStatus{
Contributor


Drop it.

Contributor Author


Drop the initial status? I get an error saying that StartTime is not set unless I have this here.

Contributor Author


Seems to depend on the parallelism of 4; dropping.

pkg/controller/job/job_controller_test.go (outdated, resolved)
pkg/controller/job/job_controller_test.go (resolved)
@kannon92 kannon92 force-pushed the rm-at-least-no-terminating-count branch 2 times, most recently from 6b2920a to 2c27203 Compare October 13, 2023 15:03
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 13, 2023
@kannon92 kannon92 force-pushed the rm-at-least-no-terminating-count branch from 2c27203 to ad6433a Compare October 13, 2023 15:34
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 13, 2023
@kannon92 kannon92 force-pushed the rm-at-least-no-terminating-count branch from ad6433a to feb0790 Compare October 13, 2023 18:02
Contributor

@mimowo mimowo left a comment


LGTM. Please update the PR description with a user-oriented release note and some justification for why this helps with the panic.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 17, 2023
Co-authored-by: Aldo Culquicondor <1299064+alculquicondor@users.noreply.github.com>
@kannon92 kannon92 force-pushed the rm-at-least-no-terminating-count branch from b426e00 to 7a1ac18 Compare October 17, 2023 18:50
@alculquicondor
Member

alculquicondor commented Oct 17, 2023

/retest
(flaky #120080)

@alculquicondor
Member

/approve

@alculquicondor
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 17, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 32b302f5924668c4eedc3e6567b040ef672ba027

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, kannon92

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 17, 2023
@k8s-ci-robot k8s-ci-robot merged commit 6d70013 into kubernetes:master Oct 17, 2023
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.29 milestone Oct 17, 2023
@superbrothers
Member

We encountered the following panic. The situation occurred when some nodes on which Job pods were running became NotReady. We believe this bug may be fixed by this PR and would like you to consider cherry-picking it to a patch release.

The kube-controller-manager version is 1.28.4, and the JobPodReplacementPolicy feature gate is not enabled.

E1206 07:42:50.514010       1 runtime.go:79] Observed a panic: runtime.boundsError{x:5, y:0, signed:true, code:0x2} (runtime error: slice bounds out of range [:5] with capacity 0)
goroutine 3391 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x45ea720?, 0xc0f86db8f0})
     vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0a72ba240?})
     vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75
panic({0x45ea720, 0xc0f86db8f0})
     /usr/local/go/src/runtime/panic.go:884 +0x213
k8s.io/kubernetes/pkg/controller/job.activePodsForRemoval(0xc0fa589900?, {0x0?, 0x74aa080?, 0x74a1198?}, 0x5)
     pkg/controller/job/job_controller.go:1670 +0x26b
k8s.io/kubernetes/pkg/controller/job.(*Controller).manageJob(0xc0038360e0, {0x50b6fe0?, 0xc00042d130}, 0xc0fa589900, 0xc0f7fdcfc0)
     pkg/controller/job/job_controller.go:1500 +0x565
k8s.io/kubernetes/pkg/controller/job.(*Controller).syncJob(0xc0038360e0, {0x50b6fe0, 0xc00042d130}, {0xc0a6931b80, 0x3f})
     pkg/controller/job/job_controller.go:871 +0x18f8
k8s.io/kubernetes/pkg/controller/job.(*Controller).processNextWorkItem(0xc0038360e0, {0x50b6fe0, 0xc00042d130})
     pkg/controller/job/job_controller.go:589 +0x123
k8s.io/kubernetes/pkg/controller/job.(*Controller).worker(...)
     pkg/controller/job/job_controller.go:578
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
     vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:259 +0x25
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
     vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001461bc0?, {0x508ccc0, 0xc0aa5a2000}, 0x1, 0xc000de5440)
     vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x508ccc0?, 0x3b9aca00, 0x0, 0xa0?, 0x1000000000001f4?)
     vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x89
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext({0x50b6fe0, 0xc00042d130}, 0xc06957e860, 0xc0042d5fa0?, 0x170e606?, 0x20?)
     vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:259 +0x99
k8s.io/apimachinery/pkg/util/wait.UntilWithContext({0x50b6fe0?, 0xc00042d130?}, 0xc0004cc3a8?, 0xc0042d5fb8?)
     vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:170 +0x2b
created by k8s.io/kubernetes/pkg/controller/job.(*Controller).Run
     pkg/controller/job/job_controller.go:234 +0x40d
panic: runtime error: slice bounds out of range [:5] with capacity 0 [recovered]
     panic: runtime error: slice bounds out of range [:5] with capacity 0
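
For reference, the runtime error in the trace is an ordinary Go slice-bounds violation; the following standalone snippet (purely illustrative, not the controller's code) reproduces the same error message:

package main

// Minimal standalone reproduction of the same class of runtime error seen in the
// trace above ("slice bounds out of range [:5] with capacity 0"). This is only an
// illustration; it is unrelated to the actual controller code paths.
func main() {
	var pods []string    // empty slice, capacity 0
	rmAtLeast := 5       // a bound larger than the slice's capacity
	_ = pods[:rmAtLeast] // panics: slice bounds out of range [:5] with capacity 0
}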

@kannon92
Contributor Author

kannon92 commented Dec 6, 2023

Since the feature gate is not enabled, I don't know whether this PR would actually fix that case; terminating should be 0 with the gate off. Ideally, we would have a test case that reproduces this.

@mimowo
Contributor

mimowo commented Dec 6, 2023

@superbrothers do you have a reproducible scenario for this? If so, it would be great if you could open a dedicated issue so we can have the discussion there, ideally with the YAMLs and the steps necessary to reproduce it. For example, it can matter whether this is an Indexed Job or not.

@superbrothers
Member

I will try to see if I can reproduce it, but it looks like I won't have time for a while.

@mimowo
Contributor

mimowo commented Dec 7, 2023

Describing it retrospectively as closely as possible, along with the YAMLs used to create the API objects, would already be a good start for opening the issue. In the meantime, was this a Job with completionMode: Indexed, and do you have the complete YAML? Also, was the Job spec edited after creation, for example to scale the parallelism up or down?

@superbrothers
Member

OK, I will create an issue first. The Job that I believe caused the problem was NonIndexed, but it may not be the actual cause (I am not sure, since many Jobs are created all the time).

Labels
approved - Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
kind/bug - Categorizes issue or PR as related to a bug.
lgtm - "Looks good to me", indicates that a PR is ready to be merged.
needs-priority - Indicates a PR lacks a `priority/foo` label and requires one.
needs-triage - Indicates an issue or PR lacks a `triage/foo` label and requires one.
release-note - Denotes a PR that will be considered when it comes time to generate release notes.
sig/apps - Categorizes an issue or PR as relevant to SIG Apps.
size/S - Denotes a PR that changes 10-29 lines, ignoring generated files.