
Remove terminating count from rmAtLeast #121147

Merged

Conversation

kannon92
Contributor

@kannon92 kannon92 commented Oct 11, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

PodReplacementPolicy was introduced in 1.28 and tracks the number of terminating pods. If Failed is specified, we halt replacement until the pod is fully terminal.

We discovered a bug where we were counting (terminating + active) pods and using that count to index into the active pods list. If there are more terminating pods than active ones, it is possible to hit an out-of-bounds error.

@mimowo pointed out that we don't need the terminating count, since we don't need to delete the terminating pods.

This PR drops the terminating count so that we only count active pods.
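
For illustration, here is a minimal, hypothetical Go sketch of the arithmetic; the names are simplified and do not mirror the controller's actual code in pkg/controller/job/job_controller.go:

package main

import "fmt"

// Hypothetical illustration of the rmAtLeast arithmetic described above;
// the names are simplified and do not mirror the controller's actual code.
func main() {
	activePods := []string{"pod-a", "pod-b"} // 2 active pods
	wantActive := 1                          // desired number of active pods
	terminating := 3                         // more terminating pods than active ones

	// Old formula: terminating pods inflate the count past len(activePods),
	// so slicing activePods[:oldRmAtLeast] would panic with an out-of-range error.
	oldRmAtLeast := len(activePods) + terminating - wantActive
	fmt.Println("old rmAtLeast:", oldRmAtLeast) // 4 > len(activePods)

	// New formula: only active pods are counted, so the bound never exceeds len(activePods).
	newRmAtLeast := len(activePods) - wantActive
	if newRmAtLeast < 0 {
		newRmAtLeast = 0
	}
	fmt.Println("pods to delete:", activePods[:newRmAtLeast]) // [pod-a]
}

With the old formula the slice expression would be activePods[:4] against a slice of length 2, which is exactly the class of panic reported later in this thread.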

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fix panic in the Job controller when podReplacementPolicy: Failed is used and the number of terminating pods exceeds parallelism.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 11, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Oct 11, 2023
@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 11, 2023
@kannon92 kannon92 changed the title WIP: Remove terminating count from rmAtLeast Remove terminating count from rmAtLeast Oct 12, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 12, 2023
@kannon92
Contributor Author

/cc @mimowo

With this PR, I am not sure we need #121009.

I'll verify that this functionally works today.

@kannon92
Contributor Author

I did a manual test to verify that the Failed case works. We also have integration tests for this, but I wanted to make sure.

Contributor

@mimowo mimowo left a comment


I think this is on the right track.

My reasoning for why this is correct is as follows. The panic happens when
rmAtLeast > len(pods), that is, rmAtLeast > active.

Substituting active + terminating - wantActive for rmAtLeast, the panic condition becomes
wantActive < terminating, which is possible.

With the proposed change, the panic condition is active - wantActive > active, that is, wantActive < 0 -> not possible :).
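
Written out compactly (this restates the argument above, using active = len(pods) and assuming wantActive >= 0):

\begin{aligned}
&\text{panic condition: } \texttt{rmAtLeast} > \texttt{active} \\
&\text{old: } \texttt{active} + \texttt{terminating} - \texttt{wantActive} > \texttt{active} \iff \texttt{terminating} > \texttt{wantActive} \quad (\text{possible}) \\
&\text{new: } \texttt{active} - \texttt{wantActive} > \texttt{active} \iff \texttt{wantActive} < 0 \quad (\text{impossible, since } \texttt{wantActive} \ge 0)
\end{aligned}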

pkg/controller/job/job_controller_test.go (outdated, resolved)
parallelism: 4,
completions: 1,
backoffLimit: 6,
initialStatus: &jobInitialStatus{
Contributor


Drop it.

Contributor Author


Drop the initial status? I get an error saying that StartTime is not set unless I have this here.

Contributor Author


Seems to depend on the parallelism of 4; dropping.

pkg/controller/job/job_controller_test.go (outdated, resolved)
pkg/controller/job/job_controller_test.go (resolved)
@kannon92 kannon92 force-pushed the rm-at-least-no-terminating-count branch 2 times, most recently from 6b2920a to 2c27203 Compare October 13, 2023 15:03
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 13, 2023
@kannon92 kannon92 force-pushed the rm-at-least-no-terminating-count branch from 2c27203 to ad6433a Compare October 13, 2023 15:34
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 13, 2023
@kannon92 kannon92 force-pushed the rm-at-least-no-terminating-count branch from ad6433a to feb0790 Compare October 13, 2023 18:02
Contributor

@mimowo mimowo left a comment


LGTM. Please update the PR description with a user-oriented release note and some justification for why this helps with the panic.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 17, 2023
Co-authored-by: Aldo Culquicondor <1299064+alculquicondor@users.noreply.github.com>
@kannon92 kannon92 force-pushed the rm-at-least-no-terminating-count branch from b426e00 to 7a1ac18 Compare October 17, 2023 18:50
@alculquicondor
Member

alculquicondor commented Oct 17, 2023

/retest
(flaky #120080)

@alculquicondor
Member

/approve

@alculquicondor
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 17, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 32b302f5924668c4eedc3e6567b040ef672ba027

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, kannon92

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 17, 2023
@k8s-ci-robot k8s-ci-robot merged commit 6d70013 into kubernetes:master Oct 17, 2023
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.29 milestone Oct 17, 2023
@superbrothers
Member

We encountered the following panic. The situation occurred when some nodes on which Job pods were running became NotReady. We believe this bug may be fixed by this PR and would like you to consider cherry-picking it to a patch release.

The kube-controller-manager version is 1.28.4, and the JobPodReplacementPolicy feature gate is not enabled.

E1206 07:42:50.514010       1 runtime.go:79] Observed a panic: runtime.boundsError{x:5, y:0, signed:true, code:0x2} (runtime error: slice bounds out of range [:5] with capacity 0)
goroutine 3391 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x45ea720?, 0xc0f86db8f0})
     vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc0a72ba240?})
     vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75
panic({0x45ea720, 0xc0f86db8f0})
     /usr/local/go/src/runtime/panic.go:884 +0x213
k8s.io/kubernetes/pkg/controller/job.activePodsForRemoval(0xc0fa589900?, {0x0?, 0x74aa080?, 0x74a1198?}, 0x5)
     pkg/controller/job/job_controller.go:1670 +0x26b
k8s.io/kubernetes/pkg/controller/job.(*Controller).manageJob(0xc0038360e0, {0x50b6fe0?, 0xc00042d130}, 0xc0fa589900, 0xc0f7fdcfc0)
     pkg/controller/job/job_controller.go:1500 +0x565
k8s.io/kubernetes/pkg/controller/job.(*Controller).syncJob(0xc0038360e0, {0x50b6fe0, 0xc00042d130}, {0xc0a6931b80, 0x3f})
     pkg/controller/job/job_controller.go:871 +0x18f8
k8s.io/kubernetes/pkg/controller/job.(*Controller).processNextWorkItem(0xc0038360e0, {0x50b6fe0, 0xc00042d130})
     pkg/controller/job/job_controller.go:589 +0x123
k8s.io/kubernetes/pkg/controller/job.(*Controller).worker(...)
     pkg/controller/job/job_controller.go:578
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1()
     vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:259 +0x25
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
     vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc001461bc0?, {0x508ccc0, 0xc0aa5a2000}, 0x1, 0xc000de5440)
     vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x508ccc0?, 0x3b9aca00, 0x0, 0xa0?, 0x1000000000001f4?)
     vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x89
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext({0x50b6fe0, 0xc00042d130}, 0xc06957e860, 0xc0042d5fa0?, 0x170e606?, 0x20?)
     vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:259 +0x99
k8s.io/apimachinery/pkg/util/wait.UntilWithContext({0x50b6fe0?, 0xc00042d130?}, 0xc0004cc3a8?, 0xc0042d5fb8?)
     vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:170 +0x2b
created by k8s.io/kubernetes/pkg/controller/job.(*Controller).Run
     pkg/controller/job/job_controller.go:234 +0x40d
panic: runtime error: slice bounds out of range [:5] with capacity 0 [recovered]
     panic: runtime error: slice bounds out of range [:5] with capacity 0
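
For reference, the runtime error in the trace is an ordinary Go slice-bounds violation; the following standalone snippet (purely illustrative, not the controller's code) reproduces the same error message:

package main

// Minimal standalone reproduction of the same class of runtime error seen in the
// trace above ("slice bounds out of range [:5] with capacity 0"). This is only an
// illustration; it is unrelated to the actual controller code paths.
func main() {
	var pods []string    // empty slice, capacity 0
	rmAtLeast := 5       // a bound larger than the slice's capacity
	_ = pods[:rmAtLeast] // panics: slice bounds out of range [:5] with capacity 0
}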

@kannon92
Contributor Author

kannon92 commented Dec 6, 2023

Since the feature gate is not enabled, I don't know whether this PR would actually fix that case; terminating should be 0 with the gate off. Ideally, we would have a test case that reproduces this.

@mimowo
Contributor

mimowo commented Dec 6, 2023

@superbrothers do you have a reproducible scenario for this? If so, it would be great if you could open a dedicated issue so we can have the discussion there, ideally with the YAMLs and the steps necessary to reproduce it. For example, it can matter whether this is an Indexed Job or not.

@superbrothers
Member

I will try to see if I can reproduce it, but it looks like I won't have time for a while.

@mimowo
Contributor

mimowo commented Dec 7, 2023

Describing it retrospectively as closely as possible, along with the YAMLs used to create the API objects, would already be a good start for opening the issue. In the meantime, was this a Job with completionMode: Indexed, and do you have the complete YAML? Also, was the Job spec edited after creation, for example to scale the parallelism up or down?

@superbrothers
Member

OK, I will create an issue first. The Job that I believe caused the problem was NonIndexed, but it may not be the actual cause (I am not sure, since many Jobs are created all the time).

Labels
approved - Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
kind/bug - Categorizes issue or PR as related to a bug.
lgtm - "Looks good to me", indicates that a PR is ready to be merged.
needs-priority - Indicates a PR lacks a `priority/foo` label and requires one.
needs-triage - Indicates an issue or PR lacks a `triage/foo` label and requires one.
release-note - Denotes a PR that will be considered when it comes time to generate release notes.
sig/apps - Categorizes an issue or PR as relevant to SIG Apps.
size/S - Denotes a PR that changes 10-29 lines, ignoring generated files.