
Schedule DaemonSet Pods in scheduler. #63223

Merged
merged 3 commits into from
Jun 2, 2018

Conversation

@k82cn (Member) commented Apr 27, 2018

Signed-off-by: Da K. Ma klaus1982.cn@gmail.com

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
part of #59194

Special notes for your reviewer:

Release note:

`ScheduleDaemonSetPods` is an alpha feature (since v1.11) that causes DaemonSet Pods
to be scheduled by the default scheduler instead of the DaemonSet controller. When it is
enabled, a `NodeAffinity` term (instead of `.spec.nodeName`) is added to the DaemonSet Pods;
this enables the default scheduler to bind the Pod to the target host. If node affinity
of the DaemonSet Pod already exists, it is replaced.

The DaemonSet controller performs these operations only when creating DaemonSet Pods,
and they modify only the Pods of the DaemonSet; no changes are made to the DaemonSet's
`.spec.template`.
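
For illustration, here is a minimal sketch of the kind of required NodeAffinity term described above, pinning a Pod to a single node via the `metadata.name` node field. The function name and package are hypothetical; the PR's actual helper is `ReplaceDaemonSetPodNodeNameNodeAffinity` and its details may differ.

```go
// Sketch only: names here are illustrative, not the merged code.
package daemonutil

import (
	v1 "k8s.io/api/core/v1"
)

// nodeNameNodeAffinity returns a required NodeAffinity that matches only the
// node whose metadata.name equals nodeName, so the default scheduler can bind
// the Pod to that node and no other.
func nodeNameNodeAffinity(nodeName string) *v1.Affinity {
	return &v1.Affinity{
		NodeAffinity: &v1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
				NodeSelectorTerms: []v1.NodeSelectorTerm{{
					// MatchFields selects on node fields; metadata.name
					// uniquely identifies the target node.
					MatchFields: []v1.NodeSelectorRequirement{{
						Key:      "metadata.name",
						Operator: v1.NodeSelectorOpIn,
						Values:   []string{nodeName},
					}},
				}},
			},
		},
	}
}
```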

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Apr 27, 2018
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 27, 2018
kubeletapis "k8s.io/kubernetes/pkg/kubelet/apis"
"k8s.io/kubernetes/pkg/scheduler/algorithm"
)

func newPod(podName string, nodeName string, label map[string]string) *v1.Pod {
Contributor:
ReplaceDaemonSetPodHostnameNodeAffinity maybe ReplaceDaemonSetPodNodeNameNodeAffinity?

@k82cn k82cn force-pushed the kep548_working branch 3 times, most recently from d97fb22 to 8a76cd4 Compare April 27, 2018 14:51
@@ -774,9 +774,14 @@ func (dsc *DaemonSetsController) getNodesToDaemonPods(ds *apps.DaemonSet) (map[s
// Group Pods by Node name.
nodeToDaemonPods := make(map[string][]*v1.Pod)
for _, pod := range claimedPods {
nodeName := pod.Spec.NodeName
nodeName, err := util.GetTargetNodeName(pod)
Member:
Here it seems to be changing the behavior when the feature is disabled, by returning an error if pod.Spec.NodeName is empty.

}
}

return "", fmt.Errorf("no node name found for pod %s/%s", pod.Namespace, pod.Name)
Member:
Seems like here we are changing the behavior when the feature is disabled, by returning this error, which was not checked before.

Member (author):
Updated to read `.spec.nodeName` first, then fall back to the node affinity.
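
A sketch of that resolution order, assuming the helper lives alongside the DaemonSet controller's util code; the error message matches the fragment above, the rest is illustrative:

```go
package daemonutil

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// GetTargetNodeName returns the node a DaemonSet Pod targets: .spec.nodeName
// is consulted first, and only if it is empty do we fall back to the required
// node affinity injected when ScheduleDaemonSetPods is enabled.
func GetTargetNodeName(pod *v1.Pod) (string, error) {
	if len(pod.Spec.NodeName) != 0 {
		return pod.Spec.NodeName, nil
	}

	if pod.Spec.Affinity != nil &&
		pod.Spec.Affinity.NodeAffinity != nil &&
		pod.Spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution != nil {
		terms := pod.Spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms
		for _, term := range terms {
			for _, req := range term.MatchFields {
				// The controller encodes the target node as a single-value
				// In requirement on the metadata.name node field.
				if req.Key == "metadata.name" &&
					req.Operator == v1.NodeSelectorOpIn && len(req.Values) == 1 {
					return req.Values[0], nil
				}
			}
		}
	}

	return "", fmt.Errorf("no node name found for pod %s/%s", pod.Namespace, pod.Name)
}
```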

@k82cn k82cn force-pushed the kep548_working branch 4 times, most recently from 10d48c4 to e333429 Compare April 30, 2018 10:05
@k82cn k82cn changed the title WIP: Schedule DaemonSet Pods in scheduler. Schedule DaemonSet Pods in scheduler. Apr 30, 2018
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 30, 2018
@0xmichalis 0xmichalis removed their request for review May 4, 2018 20:04
@k82cn (Member, author) commented May 7, 2018

/assign @janetkuo @bsalamat

@k82cn (Member, author) commented May 7, 2018

/sig apps

@k8s-ci-robot k8s-ci-robot added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label May 7, 2018
@@ -962,9 +969,11 @@ func (dsc *DaemonSetsController) syncNodes(ds *apps.DaemonSet, podsToDelete, nod

podTemplate := &template

if false /*disabled for 1.10*/ && utilfeature.DefaultFeatureGate.Enabled(features.ScheduleDaemonSetPods) {
if utilfeature.DefaultFeatureGate.Enabled(features.ScheduleDaemonSetPods) {
Member:
We should enable this feature by default in 1.11.

Member:
We can't enable an alpha feature by default.

Member:
This cannot remain disabled in 1.11. Rescheduler is already removed from the codebase. If critical DaemonSets cannot be scheduled, preemption must create room for them, and the DS controller is incapable of performing preemption.

@bsalamat (Member) commented May 31, 2018:

Actually when I thought about it again, I realized that my concern may not be valid. IIUC Rescheduler could not help with scheduling critical DS pods anyway, because DS controller did not create a DS pod before it found a node that could run the pod. So, Rescheduler was not even aware that such critical DS pods needed to be scheduled.
In other words, DS controller never relied on Rescheduler to create room for DS pods. So, the fact that Rescheduler does not exist in 1.11 won't change anything here.

Member:
Thanks, Klaus. So, my initial concern is valid. DS controller does not run "resource check" for critical pods. This means that it creates critical DS Pods regardless of the resources available on the nodes and it relies on "Rescheduler" to free up resources on the nodes if necessary. In the absence of Rescheduler, it is important to let default scheduler schedule DS Pods. Otherwise, critical DS pods may never be scheduled when their corresponding nodes are out of resources.

Member (author):

Your concern is a good point :). I re-checked the code: critical pods (ExperimentalCriticalPodAnnotation) are still an alpha feature (e2e also passed in the PR that removed the rescheduler).

Let me also check whether it is specially enabled in test-infra :). If it is not enabled, I think it is safe for us to remove it; and we would need to update any yaml files for critical pods.

Member:

I arranged with @ravisantoshgudimetla to make Rescheduler aware of Pod priority and add it back to help create room for critical DS Pods. So, this PR can remain as is (no need to enable the feature in 1.11).

Member:

I have to add that it was @ravisantoshgudimetla's idea to add priority awareness and use Rescheduler in 1.11. It removes a blocker in moving priority and preemption to Beta.

@k82cn k82cn force-pushed the kep548_working branch 3 times, most recently from 8329341 to 627baf3 Compare May 31, 2018 07:39
@k82cn (Member, author) commented May 31, 2018

/retest

@bsalamat (Member):
@k82cn Please add a very clear release note about this feature.

@kow3ns kow3ns added this to In Progress in Workloads May 31, 2018
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Jun 1, 2018
@bsalamat (Member) commented Jun 1, 2018

Please change the following phrase in the release notes:
s/that makes DaemonSet Pods are scheduled by default scheduler/that causes DaemonSet Pods to be scheduled by the default scheduler/

otherwise, LGTM

@janetkuo (Member) left a comment:

Just some nits on comments. LGTM otherwise. Please squash commits. Thanks!

// the given "nodeName" in the "affinity" terms.
func ReplaceDaemonSetPodHostnameNodeAffinity(affinity *v1.Affinity, nodename string) *v1.Affinity {
// ReplaceDaemonSetPodNodeNameNodeAffinity replaces the NodeAffinity by a new NodeAffinity with
// the given "nodeName" in the "affinityterms.
Member:

nit:

// ReplaceDaemonSetPodNodeNameNodeAffinity replaces the RequiredDuringSchedulingIgnoredDuringExecution 
// NodeAffinity of the given affinity with a new NodeAffinity that selects the given nodeName.
// Note that this function assumes that no NodeAffinity conflicts with the selected nodeName.

Member (author):

done

podTemplate = template.DeepCopy()
podTemplate.Spec.Affinity = util.ReplaceDaemonSetPodHostnameNodeAffinity(
// The pod's NodeAffinity will be updated to make sure the Pod is bound
// to the target node by default scheduler.
Member:

Add:

// It is safe to do so because there should be no conflicting node affinity with the target node. 

Member (author):

done
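
Putting the two threads above together, the gated path in syncNodes looks roughly like this; a sketch assembled from the names visible in the diffs above, not the verbatim merged code:

```go
podTemplate := &template

if utilfeature.DefaultFeatureGate.Enabled(features.ScheduleDaemonSetPods) {
	// Copy first: only the Pods being created are mutated, never the
	// DaemonSet's .spec.template.
	podTemplate = template.DeepCopy()
	// The pod's NodeAffinity will be updated to make sure the Pod is bound
	// to the target node by the default scheduler. It is safe to do so
	// because there should be no conflicting node affinity with the target
	// node.
	podTemplate.Spec.Affinity = util.ReplaceDaemonSetPodNodeNameNodeAffinity(
		podTemplate.Spec.Affinity, nodesNeedingDaemonPods[ix])
}
```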

@janetkuo (Member) commented Jun 2, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 2, 2018
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, janetkuo, k82cn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 2, 2018
@@ -850,7 +857,7 @@ func (dsc *DaemonSetsController) podsShouldBeOnNode(
// If daemon pod is supposed to be running on node, but more than 1 daemon pod is running, delete the excess daemon pods.
// Sort the daemon pods by creation time, so the oldest is preserved.
Member:

nit: scheduled pod is preserved first; if more than one pod can be preserved, the oldest pod is preserved.
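
A sketch of a comparator implementing that preservation order; the type and package names are hypothetical and the real code may differ. Scheduled Pods sort before unscheduled ones, and ties break on creation time:

```go
package daemon

import (
	v1 "k8s.io/api/core/v1"
)

// byScheduledThenAge sorts daemon Pods so the ones to preserve come first:
// scheduled Pods before unscheduled ones, oldest first among equals.
type byScheduledThenAge []*v1.Pod

func (p byScheduledThenAge) Len() int      { return len(p) }
func (p byScheduledThenAge) Swap(i, j int) { p[i], p[j] = p[j], p[i] }

func (p byScheduledThenAge) Less(i, j int) bool {
	iScheduled := len(p[i].Spec.NodeName) != 0
	jScheduled := len(p[j].Spec.NodeName) != 0
	if iScheduled != jScheduled {
		// A scheduled Pod is preserved before an unscheduled one.
		return iScheduled
	}
	// Otherwise the oldest Pod is preserved.
	return p[i].CreationTimestamp.Time.Before(p[j].CreationTimestamp.Time)
}
```

The controller would then sort the per-node daemon Pods with sort.Sort(byScheduledThenAge(daemonPods)) before deleting the excess.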

@k8s-github-robot:
/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot:
Automatic merge from submit-queue (batch tested with PRs 64057, 63223, 64346, 64562, 64408). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit a0a4cc7 into kubernetes:master Jun 2, 2018
Workloads automation moved this from In Progress to Done Jun 2, 2018
@k8s-ci-robot (Contributor) commented Jun 2, 2018

@k82cn: The following tests failed, say /retest to rerun them all:

Test name | Commit | Details | Rerun command
pull-kubernetes-bazel-build | 9fd848e | link | /test pull-kubernetes-bazel-build
pull-kubernetes-integration | 9fd848e | link | /test pull-kubernetes-integration

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k82cn k82cn deleted the kep548_working branch June 2, 2018 10:47
mgdevstack pushed a commit to mgdevstack/kubernetes that referenced this pull request Jun 4, 2018
…64364-remove-rescheduler

Automatic merge from submit-queue (batch tested with PRs 63453, 64592, 64482, 64618, 64661). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Revert "Remove rescheduler and corresponding tests from master"

Reverts kubernetes#64364

After discussing with @bsalamat how the DS controller (ref: kubernetes#63223 (comment)) cannot create pods when the cluster is at capacity and has to rely on the rescheduler to make room, we thought it is better to:

- Bring rescheduler back.
- Make rescheduler priority aware.
- If the cluster is full and **only** the DS controller is unable to create pods, run the rescheduler and let it evict some lower-priority pods.
- The DS controller's pods will then be scheduled.

So, I am reverting this PR now. Steps 2 and 3 above will be done in the rescheduler.

/cc @bsalamat @aveshagarwal @k82cn 

Please let me know your thoughts on this. 

```release-note
Revert kubernetes#64364 to resurrect rescheduler. More info kubernetes#64725 :)
```
Labels: approved, cncf-cla: yes, kind/feature, lgtm, priority/important-soon, release-note, sig/apps, sig/scheduling, size/XL