
Schedule DaemonSet Pods in scheduler. #63223

Merged
merged 3 commits into from
Jun 2, 2018

Conversation

@k82cn (Member) commented Apr 27, 2018

Signed-off-by: Da K. Ma klaus1982.cn@gmail.com

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
part of #59194

Special notes for your reviewer:

Release note:

`ScheduleDaemonSetPods` is an alpha feature (since v1.11) that causes DaemonSet Pods
to be scheduled by the default scheduler instead of the DaemonSet controller. When it is
enabled, a `NodeAffinity` term (instead of `.spec.nodeName`) is added to the DaemonSet Pods;
this enables the default scheduler to bind the Pod to the target host. If node affinity
of the DaemonSet Pod already exists, it is replaced.

The DaemonSet controller performs these operations only when creating DaemonSet Pods,
and they modify only the Pods of the DaemonSet; no changes are made to the DaemonSet's
`.spec.template`.
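
For illustration, here is a minimal sketch of the kind of required NodeAffinity term described above, pinning a Pod to a single node via the `metadata.name` node field. The function name and package are hypothetical; the PR's actual helper is `ReplaceDaemonSetPodNodeNameNodeAffinity` and its details may differ.

```go
// Sketch only: names here are illustrative, not the merged code.
package daemonutil

import (
	v1 "k8s.io/api/core/v1"
)

// nodeNameNodeAffinity returns a required NodeAffinity that matches only the
// node whose metadata.name equals nodeName, so the default scheduler can bind
// the Pod to that node and no other.
func nodeNameNodeAffinity(nodeName string) *v1.Affinity {
	return &v1.Affinity{
		NodeAffinity: &v1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
				NodeSelectorTerms: []v1.NodeSelectorTerm{{
					// MatchFields selects on node fields; metadata.name
					// uniquely identifies the target node.
					MatchFields: []v1.NodeSelectorRequirement{{
						Key:      "metadata.name",
						Operator: v1.NodeSelectorOpIn,
						Values:   []string{nodeName},
					}},
				}},
			},
		},
	}
}
```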

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Apr 27, 2018
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 27, 2018
kubeletapis "k8s.io/kubernetes/pkg/kubelet/apis"
"k8s.io/kubernetes/pkg/scheduler/algorithm"
)

func newPod(podName string, nodeName string, label map[string]string) *v1.Pod {
Contributor:
ReplaceDaemonSetPodHostnameNodeAffinity maybe ReplaceDaemonSetPodNodeNameNodeAffinity?

@k82cn k82cn force-pushed the kep548_working branch 3 times, most recently from d97fb22 to 8a76cd4 Compare April 27, 2018 14:51
@@ -774,9 +774,14 @@ func (dsc *DaemonSetsController) getNodesToDaemonPods(ds *apps.DaemonSet) (map[s
// Group Pods by Node name.
nodeToDaemonPods := make(map[string][]*v1.Pod)
for _, pod := range claimedPods {
nodeName := pod.Spec.NodeName
nodeName, err := util.GetTargetNodeName(pod)
Member:
Here it seems to be changing the behavior when the feature is disabled, by returning an error if pod.Spec.NodeName is empty.

}
}

return "", fmt.Errorf("no node name found for pod %s/%s", pod.Namespace, pod.Name)
Member:
Seems like here we are changing the behavior when the feature is disabled, by returning this error, which was not checked before.

Member (author):
Updated to read `.spec.nodeName` first, then fall back to the node affinity.
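
A sketch of that resolution order, assuming the helper lives alongside the DaemonSet controller's util code; the error message matches the fragment above, the rest is illustrative:

```go
package daemonutil

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// GetTargetNodeName returns the node a DaemonSet Pod targets: .spec.nodeName
// is consulted first, and only if it is empty do we fall back to the required
// node affinity injected when ScheduleDaemonSetPods is enabled.
func GetTargetNodeName(pod *v1.Pod) (string, error) {
	if len(pod.Spec.NodeName) != 0 {
		return pod.Spec.NodeName, nil
	}

	if pod.Spec.Affinity != nil &&
		pod.Spec.Affinity.NodeAffinity != nil &&
		pod.Spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution != nil {
		terms := pod.Spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms
		for _, term := range terms {
			for _, req := range term.MatchFields {
				// The controller encodes the target node as a single-value
				// In requirement on the metadata.name node field.
				if req.Key == "metadata.name" &&
					req.Operator == v1.NodeSelectorOpIn && len(req.Values) == 1 {
					return req.Values[0], nil
				}
			}
		}
	}

	return "", fmt.Errorf("no node name found for pod %s/%s", pod.Namespace, pod.Name)
}
```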

@k82cn k82cn force-pushed the kep548_working branch 4 times, most recently from 10d48c4 to e333429 Compare April 30, 2018 10:05
@k82cn k82cn changed the title WIP: Schedule DaemonSet Pods in scheduler. Schedule DaemonSet Pods in scheduler. Apr 30, 2018
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 30, 2018
@0xmichalis 0xmichalis removed their request for review May 4, 2018 20:04
@k82cn (Member, author) commented May 7, 2018

/assign @janetkuo @bsalamat

@k82cn (Member, author) commented May 7, 2018

/sig apps

@k8s-ci-robot k8s-ci-robot added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label May 7, 2018
@@ -962,9 +969,11 @@ func (dsc *DaemonSetsController) syncNodes(ds *apps.DaemonSet, podsToDelete, nod

podTemplate := &template

if false /*disabled for 1.10*/ && utilfeature.DefaultFeatureGate.Enabled(features.ScheduleDaemonSetPods) {
if utilfeature.DefaultFeatureGate.Enabled(features.ScheduleDaemonSetPods) {
Member:
We should enable this feature by default in 1.11.

Member:
We can't enable an alpha feature by default.

Member:
This cannot remain disabled in 1.11. Rescheduler is already removed from the codebase. If critical DaemonSets cannot be scheduled, preemption must create room for them, and the DS controller is incapable of performing preemption.

@bsalamat (Member) commented May 31, 2018:

Actually when I thought about it again, I realized that my concern may not be valid. IIUC Rescheduler could not help with scheduling critical DS pods anyway, because DS controller did not create a DS pod before it found a node that could run the pod. So, Rescheduler was not even aware that such critical DS pods needed to be scheduled.
In other words, DS controller never relied on Rescheduler to create room for DS pods. So, the fact that Rescheduler does not exist in 1.11 won't change anything here.

Member:
Thanks, Klaus. So, my initial concern is valid. DS controller does not run "resource check" for critical pods. This means that it creates critical DS Pods regardless of the resources available on the nodes and it relies on "Rescheduler" to free up resources on the nodes if necessary. In the absence of Rescheduler, it is important to let default scheduler schedule DS Pods. Otherwise, critical DS pods may never be scheduled when their corresponding nodes are out of resources.

Member (author):

Your concern is a good point :). I re-checked the code: critical pods (ExperimentalCriticalPodAnnotation) are still an alpha feature (e2e also passed in the PR that removed the rescheduler).

Let me also check whether it is specially enabled in test-infra :). If it is not enabled, I think it is safe for us to remove it; and we would need to update any yaml files for critical pods.

Member:

I arranged with @ravisantoshgudimetla to make Rescheduler aware of Pod priority and add it back to help create room for critical DS Pods. So, this PR can remain as is (no need to enable the feature in 1.11).

Member:

I have to add that it was @ravisantoshgudimetla's idea to add priority awareness and use Rescheduler in 1.11. It removes a blocker in moving priority and preemption to Beta.

@k82cn k82cn force-pushed the kep548_working branch 3 times, most recently from 8329341 to 627baf3 Compare May 31, 2018 07:39
@k82cn (Member, author) commented May 31, 2018

/retest

@bsalamat (Member):
@k82cn Please add a very clear release note about this feature.

@kow3ns kow3ns added this to In Progress in Workloads May 31, 2018
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Jun 1, 2018
@bsalamat (Member) commented Jun 1, 2018

Please change the following phrase in the release notes:
s/that makes DaemonSet Pods are scheduled by default scheduler/that causes DaemonSet Pods to be scheduled by the default scheduler/

otherwise, LGTM

@janetkuo (Member) left a comment:

Just some nits on comments. LGTM otherwise. Please squash commits. Thanks!

// the given "nodeName" in the "affinity" terms.
func ReplaceDaemonSetPodHostnameNodeAffinity(affinity *v1.Affinity, nodename string) *v1.Affinity {
// ReplaceDaemonSetPodNodeNameNodeAffinity replaces the NodeAffinity by a new NodeAffinity with
// the given "nodeName" in the "affinityterms.
Member:

nit:

// ReplaceDaemonSetPodNodeNameNodeAffinity replaces the RequiredDuringSchedulingIgnoredDuringExecution 
// NodeAffinity of the given affinity with a new NodeAffinity that selects the given nodeName.
// Note that this function assumes that no NodeAffinity conflicts with the selected nodeName.

Member (author):

done

podTemplate = template.DeepCopy()
podTemplate.Spec.Affinity = util.ReplaceDaemonSetPodHostnameNodeAffinity(
// The pod's NodeAffinity will be updated to make sure the Pod is bound
// to the target node by default scheduler.
Member:

Add:

// It is safe to do so because there should be no conflicting node affinity with the target node. 

Member (author):

done
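
Putting the two threads above together, the gated path in syncNodes looks roughly like this; a sketch assembled from the names visible in the diffs above, not the verbatim merged code:

```go
podTemplate := &template

if utilfeature.DefaultFeatureGate.Enabled(features.ScheduleDaemonSetPods) {
	// Copy first: only the Pods being created are mutated, never the
	// DaemonSet's .spec.template.
	podTemplate = template.DeepCopy()
	// The pod's NodeAffinity will be updated to make sure the Pod is bound
	// to the target node by the default scheduler. It is safe to do so
	// because there should be no conflicting node affinity with the target
	// node.
	podTemplate.Spec.Affinity = util.ReplaceDaemonSetPodNodeNameNodeAffinity(
		podTemplate.Spec.Affinity, nodesNeedingDaemonPods[ix])
}
```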

@janetkuo (Member) commented Jun 2, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 2, 2018
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, janetkuo, k82cn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 2, 2018
@@ -850,7 +857,7 @@ func (dsc *DaemonSetsController) podsShouldBeOnNode(
// If daemon pod is supposed to be running on node, but more than 1 daemon pod is running, delete the excess daemon pods.
// Sort the daemon pods by creation time, so the oldest is preserved.
Member:

nit: scheduled pod is preserved first; if more than one pod can be preserved, the oldest pod is preserved.
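
A sketch of a comparator implementing that preservation order; the type and package names are hypothetical and the real code may differ. Scheduled Pods sort before unscheduled ones, and ties break on creation time:

```go
package daemon

import (
	v1 "k8s.io/api/core/v1"
)

// byScheduledThenAge sorts daemon Pods so the ones to preserve come first:
// scheduled Pods before unscheduled ones, oldest first among equals.
type byScheduledThenAge []*v1.Pod

func (p byScheduledThenAge) Len() int      { return len(p) }
func (p byScheduledThenAge) Swap(i, j int) { p[i], p[j] = p[j], p[i] }

func (p byScheduledThenAge) Less(i, j int) bool {
	iScheduled := len(p[i].Spec.NodeName) != 0
	jScheduled := len(p[j].Spec.NodeName) != 0
	if iScheduled != jScheduled {
		// A scheduled Pod is preserved before an unscheduled one.
		return iScheduled
	}
	// Otherwise the oldest Pod is preserved.
	return p[i].CreationTimestamp.Time.Before(p[j].CreationTimestamp.Time)
}
```

The controller would then sort the per-node daemon Pods with sort.Sort(byScheduledThenAge(daemonPods)) before deleting the excess.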

@k8s-github-robot:
/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot:
Automatic merge from submit-queue (batch tested with PRs 64057, 63223, 64346, 64562, 64408). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit a0a4cc7 into kubernetes:master Jun 2, 2018
Workloads automation moved this from In Progress to Done Jun 2, 2018
@k8s-ci-robot (Contributor) commented Jun 2, 2018

@k82cn: The following tests failed, say /retest to rerun them all:

Test name | Commit | Details | Rerun command
pull-kubernetes-bazel-build | 9fd848e | link | /test pull-kubernetes-bazel-build
pull-kubernetes-integration | 9fd848e | link | /test pull-kubernetes-integration

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k82cn k82cn deleted the kep548_working branch June 2, 2018 10:47
mgdevstack pushed a commit to mgdevstack/kubernetes that referenced this pull request Jun 4, 2018
…64364-remove-rescheduler

Automatic merge from submit-queue (batch tested with PRs 63453, 64592, 64482, 64618, 64661). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Revert "Remove rescheduler and corresponding tests from master"

Reverts kubernetes#64364

After discussing with @bsalamat how the DS controller (ref: kubernetes#63223 (comment)) cannot create pods when the cluster is at capacity and has to rely on the rescheduler to make room, we thought it is better to:

- Bring rescheduler back.
- Make rescheduler priority aware.
- If the cluster is full and **only** the DS controller is unable to create pods, run the rescheduler and let it evict some lower-priority pods.
- The DS controller's pods will then be scheduled.

So, I am reverting this PR now. Steps 2 and 3 above will be done in the rescheduler.

/cc @bsalamat @aveshagarwal @k82cn 

Please let me know your thoughts on this. 

```release-note
Revert kubernetes#64364 to resurrect rescheduler. More info kubernetes#64725 :)
```
Labels: approved, cncf-cla: yes, kind/feature, lgtm, priority/important-soon, release-note, sig/apps, sig/scheduling, size/XL