Clean shutdown of kcm, ccm and scheduler #110207

wojtek-t · 2022-05-25T08:04:25Z

NONE

/kind cleanup
/priority important-longterm
/sig apps
/sig scheduler

k8s-ci-robot · 2022-05-25T08:04:31Z

@wojtek-t: The label(s) sig/scheduler cannot be applied, because the repository doesn't have them.

In response to this:

Ref #108483
NONE
/kind cleanup
/priority important-longterm
/sig apps
/sig scheduler

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · 2022-05-25T10:43:16Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cmd/kube-controller-manager/OWNERS~~ [wojtek-t]
~~cmd/kube-scheduler/OWNERS~~ [wojtek-t]
~~pkg/scheduler/OWNERS~~ [wojtek-t]
~~staging/src/k8s.io/apimachinery/pkg/OWNERS~~ [wojtek-t]
~~staging/src/k8s.io/cloud-provider/OWNERS~~ [wojtek-t]
~~test/integration/scheduler/OWNERS~~ [wojtek-t]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

wojtek-t · 2022-05-25T11:50:05Z

/retest

aojea · 2022-05-26T08:46:03Z

cmd/kube-controller-manager/app/controllermanager.go

@@ -227,13 +232,14 @@ func Run(c *config.CompletedConfig, stopCh <-chan struct{}) error {
 		controllerContext.ObjectOrMetadataInformerFactory.Start(stopCh)
 		close(controllerContext.InformersStarted)

-		select {}
+		<-ctx.Done()


Unrelated, but doesn't it feel like this run closure has a mix of contexts and stopCh that is really confusing?
Mainly because, IIUC, this ctx.Done() is derived from the stopCh.

We need to translate between functions that expect stopCh and those that expect ctx (and ctx.Done()).
I think that's fine

aojea · 2022-05-26T08:47:35Z

cmd/kube-controller-manager/app/controllermanager.go

-	select {}
+	<-stopCh
+	return nil


Comment in line 178 says

// Run runs the KubeControllerManagerOptions. This should never exit.

Is it OK to exit now?

Removed the comment.

If one wants it to never exit, they can pass wait.NeverStop as an argument (which we do in line 138).
In tests, we want it to stop actually.

aojea · 2022-05-26T08:49:08Z

cmd/kube-controller-manager/app/options/options.go

-	eventBroadcaster.StartStructuredLogging(0)
-	eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: kubeClient.CoreV1().Events("")})


don't you need these 2 lines too?

We need them - but they were moved to Run() - see lines 185-189 in controllermanager.go above.

aojea · 2022-05-26T08:49:57Z

cmd/kube-scheduler/app/testing/testserver.go

+		if errCh != nil {
+			err, ok := <-errCh
+			if ok && err != nil {
+				klog.ErrorS(err, "Failed to run test server clearly")


s/run/shutdown/ ???

aojea · 2022-05-26T08:52:38Z

pkg/scheduler/scheduler.go

+	// We need to start scheduleOne loop in a dedicated goroutine,
+	// because scheduleOne function hangs on getting the next item
+	// from the SchedulingQueue.
+	// If there are no new pods to schedule, it will be hanging there
+	// and if done in this goroutine it will be blocking closing
+	// SchedulingQueue, in effect causing a deadlock on shutdown.
+	go wait.UntilWithContext(ctx, sched.scheduleOne, 0)


cc: @alculquicondor

aojea · 2022-05-26T08:55:41Z

staging/src/k8s.io/cloud-provider/app/controllermanager.go

-		run(context.TODO(), controllerInitializers)
-		panic("unreachable")
+		ctx, _ := wait.ContextForChannel(stopCh)
+		run(ctx, controllerInitializers)


hehe

https://github.com/kubernetes/kubernetes/blob/c09bf9623ec90b03e35e48af8b48bcb30f38bec9/staging/src/k8s.io/cloud-provider/app/controllermanager.go#L316

It seems that run will keep running forever; startControllers leverages neither the context nor the stopCh ... that is the other funny part, we duplicate stop signals 🙃

They actually do utilize it:

we're passing ctx.Done() to startControllers

they are using it (as stopCh) in two places

But thanks for catching the select{} - fixed that now.

aojea · 2022-05-26T08:59:24Z

staging/src/k8s.io/cloud-provider/options/options.go

-	eventBroadcaster.StartStructuredLogging(0)
-	eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: kubeClient.CoreV1().Events("")})


do we need these too?

We do - we do that now as part of Run() - see lines 146-150 in controllermanager.go above.

wojtek-t

@aojea - comments addressed - PTAL

wojtek-t · 2022-05-26T09:15:46Z

cmd/kube-controller-manager/app/controllermanager.go

@@ -227,13 +232,14 @@ func Run(c *config.CompletedConfig, stopCh <-chan struct{}) error {
 		controllerContext.ObjectOrMetadataInformerFactory.Start(stopCh)
 		close(controllerContext.InformersStarted)

-		select {}
+		<-ctx.Done()


We need to translate between functions that expect stopCh and those that expect ctx (and ctx.Done()).
I think that's fine

wojtek-t · 2022-05-26T09:20:09Z

cmd/kube-controller-manager/app/controllermanager.go

-	select {}
+	<-stopCh
+	return nil


Removed the comment.

If one wants it to never exit, they can pass wait.NeverStop as an argument (which we do in line 138).
In tests, we want it to stop actually.

wojtek-t · 2022-05-26T09:20:56Z

cmd/kube-controller-manager/app/options/options.go

-	eventBroadcaster.StartStructuredLogging(0)
-	eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: kubeClient.CoreV1().Events("")})


We need them - but they were moved to Run() - see lines 185-189 in controllermanager.go above.

wojtek-t · 2022-05-26T09:21:20Z

cmd/kube-scheduler/app/testing/testserver.go

+		if errCh != nil {
+			err, ok := <-errCh
+			if ok && err != nil {
+				klog.ErrorS(err, "Failed to run test server clearly")


wojtek-t · 2022-05-26T09:23:29Z

staging/src/k8s.io/cloud-provider/app/controllermanager.go

-		run(context.TODO(), controllerInitializers)
-		panic("unreachable")
+		ctx, _ := wait.ContextForChannel(stopCh)
+		run(ctx, controllerInitializers)


They actually do utilize it:

we're passing ctx.Done() to startControllers

they are using it (as stopCh) in two places

But thanks for catching the select{} - fixed that now.

wojtek-t · 2022-05-26T09:24:58Z

staging/src/k8s.io/cloud-provider/options/options.go

-	eventBroadcaster.StartStructuredLogging(0)
-	eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: kubeClient.CoreV1().Events("")})


We do - we do that now as part of Run() - see lines 146-150 in controllermanager.go above.

aojea · 2022-05-26T13:19:50Z

/test pull-kubernetes-verify-govet-levee

pod scheduling timeout

aojea · 2022-05-26T13:22:21Z

/lgtm

leilajal · 2022-05-26T15:32:38Z

/triage accepted

wojtek-t assigned aojea May 25, 2022

k8s-ci-robot added the needs-triage and approved labels May 25, 2022

k8s-ci-robot requested review from denkensk and jingxu97 May 25, 2022 08:05

k8s-ci-robot added the area/cloudprovider, sig/api-machinery, sig/cloud-provider, and sig/scheduling labels May 25, 2022

wojtek-t force-pushed the clean_shutdown_managers branch from 424b0a8 to 9e78d54 Compare May 25, 2022 08:17

k8s-ci-robot added the cncf-cla: yes label May 25, 2022

wojtek-t force-pushed the clean_shutdown_managers branch 2 times, most recently from 4c483eb to c09bf96 Compare May 25, 2022 10:42

k8s-ci-robot added the area/test and sig/testing labels May 25, 2022

aojea reviewed May 26, 2022

wojtek-t commented May 26, 2022

wojtek-t force-pushed the clean_shutdown_managers branch from c09bf96 to 4e8ffc6 Compare May 26, 2022 09:25

wojtek-t added 2 commits May 26, 2022 12:35

Make contextForChannel public

55130ae

Clean shutdown of kcm, ccm and scheduler

fe3616c

wojtek-t force-pushed the clean_shutdown_managers branch from 4e8ffc6 to fe3616c Compare May 26, 2022 10:37

k8s-ci-robot added the lgtm label May 26, 2022

k8s-ci-robot merged commit 029b1bb into kubernetes:master May 26, 2022

k8s-ci-robot added this to the v1.25 milestone May 26, 2022

k8s-ci-robot added the triage/accepted label and removed the needs-triage label May 26, 2022

MikeSpreitzer mentioned this pull request May 3, 2023

Improve and simplify maintenance of APF bootstrap objects without type assertions #111422

Merged

sharnoff mentioned this pull request May 5, 2023

scheduler takes a long time to shutdown neondatabase/autoscaling#55

Closed
