Schedule DaemonSet Pods in scheduler. #63223
Conversation
	kubeletapis "k8s.io/kubernetes/pkg/kubelet/apis"
	"k8s.io/kubernetes/pkg/scheduler/algorithm"
)

func newPod(podName string, nodeName string, label map[string]string) *v1.Pod {
ReplaceDaemonSetPodHostnameNodeAffinity: maybe ReplaceDaemonSetPodNodeNameNodeAffinity?
@@ -774,9 +774,14 @@ func (dsc *DaemonSetsController) getNodesToDaemonPods(ds *apps.DaemonSet) (map[s
	// Group Pods by Node name.
	nodeToDaemonPods := make(map[string][]*v1.Pod)
	for _, pod := range claimedPods {
		nodeName := pod.Spec.NodeName
		nodeName, err := util.GetTargetNodeName(pod)
Here it seems to be changing the behavior when the feature is disabled, by returning an error if pod.Spec.NodeName is empty.
		}
	}

	return "", fmt.Errorf("no node name found for pod %s/%s", pod.Namespace, pod.Name)
It seems like here we are changing the behavior when the feature is disabled, by returning this error, which was not checked before.
Updated to get .spec.nodeName first, then nodeAffinity.
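To make that lookup order concrete, here is a minimal sketch of such a helper. It is not the upstream util.GetTargetNodeName itself; the lower-case name and the affinity-walking details are assumptions for illustration, but it follows the order described above: .spec.nodeName first, then the node name selected by the pod's required NodeAffinity.

```go
package util

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// getTargetNodeName is a sketch of the lookup order discussed above.
func getTargetNodeName(pod *v1.Pod) (string, error) {
	// 1. A pod that already has spec.nodeName set (bound, or created by the
	//    legacy code path) wins, keeping behavior unchanged when the
	//    ScheduleDaemonSetPods feature is disabled.
	if len(pod.Spec.NodeName) != 0 {
		return pod.Spec.NodeName, nil
	}

	// 2. Otherwise, look for the metadata.name field selector that the
	//    DaemonSet controller injects into the pod's required NodeAffinity.
	if pod.Spec.Affinity != nil &&
		pod.Spec.Affinity.NodeAffinity != nil &&
		pod.Spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution != nil {
		terms := pod.Spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms
		for _, term := range terms {
			for _, req := range term.MatchFields {
				if req.Key == "metadata.name" && req.Operator == v1.NodeSelectorOpIn && len(req.Values) == 1 {
					return req.Values[0], nil
				}
			}
		}
	}

	// 3. Nothing identifies a target node.
	return "", fmt.Errorf("no node name found for pod %s/%s", pod.Namespace, pod.Name)
}
```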
/sig apps
@@ -962,9 +969,11 @@ func (dsc *DaemonSetsController) syncNodes(ds *apps.DaemonSet, podsToDelete, nod

	podTemplate := &template

	if false /*disabled for 1.10*/ && utilfeature.DefaultFeatureGate.Enabled(features.ScheduleDaemonSetPods) {
	if utilfeature.DefaultFeatureGate.Enabled(features.ScheduleDaemonSetPods) {
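For context, a rough sketch of what this gate toggles. The helper name podTemplateForNode is made up for illustration and the create calls and surrounding loop of the real syncNodes are omitted: with ScheduleDaemonSetPods enabled, the controller copies the template and pins it to the target node via NodeAffinity, leaving binding (and preemption, if needed) to the default scheduler; with it disabled, the pod is bound by the controller itself.

```go
package daemon

import (
	v1 "k8s.io/api/core/v1"
	utilfeature "k8s.io/apiserver/pkg/util/feature"

	"k8s.io/kubernetes/pkg/controller/daemon/util"
	"k8s.io/kubernetes/pkg/features"
)

// podTemplateForNode (illustrative name) returns the pod template the
// controller would create for the given node, depending on the feature gate.
func podTemplateForNode(template v1.PodTemplateSpec, nodeName string) *v1.PodTemplateSpec {
	podTemplate := &template

	if utilfeature.DefaultFeatureGate.Enabled(features.ScheduleDaemonSetPods) {
		// Copy the shared template, then express the target node as required
		// NodeAffinity; spec.nodeName stays empty so the default scheduler
		// performs the actual binding.
		podTemplate = template.DeepCopy()
		podTemplate.Spec.Affinity = util.ReplaceDaemonSetPodNodeNameNodeAffinity(
			podTemplate.Spec.Affinity, nodeName)
	} else {
		// Legacy behavior: the pod is bound by the DaemonSet controller itself
		// (upstream this happens via podControl.CreatePodsOnNode rather than
		// by editing the template; shown here only to contrast the two paths).
		podTemplate.Spec.NodeName = nodeName
	}

	return podTemplate
}
```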
We should enable this feature by default in 1.11.
We can't enable an alpha feature by default.
This cannot remain disabled in 1.11. Rescheduler is already removed from the code-base. If critical daemonsets cannot be scheduled, preemption must create room for them and DS controller is incapable of performing preemption.
Actually when I thought about it again, I realized that my concern may not be valid. IIUC Rescheduler could not help with scheduling critical DS pods anyway, because DS controller did not create a DS pod before it found a node that could run the pod. So, Rescheduler was not even aware that such critical DS pods needed to be scheduled.
In other words, DS controller never relied on Rescheduler to create room for DS pods. So, the fact that Rescheduler does not exist in 1.11 won't change anything here.
@bsalamat, here's the code about critical pods in the DaemonSet controller: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/daemon/daemon_controller.go#L1429
Thanks, Klaus. So, my initial concern is valid. DS controller does not run "resource check" for critical pods. This means that it creates critical DS Pods regardless of the resources available on the nodes and it relies on "Rescheduler" to free up resources on the nodes if necessary. In the absence of Rescheduler, it is important to let default scheduler schedule DS Pods. Otherwise, critical DS pods may never be scheduled when their corresponding nodes are out of resources.
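As a rough illustration of the behavior described here (the helper name is made up, and the exact gate and predicate used by the linked code may differ slightly), the special-casing amounts to something like:

```go
package daemon

import (
	v1 "k8s.io/api/core/v1"
	utilfeature "k8s.io/apiserver/pkg/util/feature"

	"k8s.io/kubernetes/pkg/features"
	kubelettypes "k8s.io/kubernetes/pkg/kubelet/types"
)

// shouldSkipResourceCheck (illustrative name) captures the behavior described
// above: when the alpha critical-pod annotation feature is enabled and the DS
// pod is marked critical, the controller does not run the resource "fit"
// check and creates the pod regardless of free capacity on the node.
func shouldSkipResourceCheck(pod *v1.Pod) bool {
	return utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) &&
		kubelettypes.IsCriticalPod(pod)
}
```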
Your concern is a good point :). I re-checked the code: critical pods (ExperimentalCriticalPodAnnotation) are still an alpha feature (e2e also passed in the PR that removed the re-scheduler).
Let me also check whether it is specially enabled in test-infra :). If it is not enabled, I think it's safe for us to remove it, and we would need to update any yaml files about critical pods.
I arranged with @ravisantoshgudimetla to make Rescheduler aware of Pod priority and add it back to help create room for critical DS Pods. So, this PR can remain as is (no need to enable the feature in 1.11).
I have to add that it was @ravisantoshgudimetla's idea to add priority awareness and use Rescheduler in 1.11. It removes a blocker in moving priority and preemption to Beta.
/retest
@k82cn Please add a very clear release note about this feature.
Please change the following phrase in the release notes: otherwise, LGTM
Just some nits on comments. LGTM otherwise. Please squash commits. Thanks!
// the given "nodeName" in the "affinity" terms. | ||
func ReplaceDaemonSetPodHostnameNodeAffinity(affinity *v1.Affinity, nodename string) *v1.Affinity { | ||
// ReplaceDaemonSetPodNodeNameNodeAffinity replaces the NodeAffinity by a new NodeAffinity with | ||
// the given "nodeName" in the "affinityterms. |
nit:
// ReplaceDaemonSetPodNodeNameNodeAffinity replaces the RequiredDuringSchedulingIgnoredDuringExecution
// NodeAffinity of the given affinity with a new NodeAffinity that selects the given nodeName.
// Note that this function assumes that no NodeAffinity conflicts with the selected nodeName.
done
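For readers of this thread, here is a sketch of what the renamed helper looks like with the suggested comment applied. It is an illustration consistent with the comment above, not necessarily the exact in-tree implementation; the metadata.name field selector key and the nil-handling details are assumptions.

```go
package util

import v1 "k8s.io/api/core/v1"

// ReplaceDaemonSetPodNodeNameNodeAffinity replaces the RequiredDuringSchedulingIgnoredDuringExecution
// NodeAffinity of the given affinity with a new NodeAffinity that selects the given nodeName.
// Note that this function assumes that no NodeAffinity conflicts with the selected nodeName.
func ReplaceDaemonSetPodNodeNameNodeAffinity(affinity *v1.Affinity, nodeName string) *v1.Affinity {
	// A required node selector term that matches exactly the target node by metadata.name.
	nodeSelector := &v1.NodeSelector{
		NodeSelectorTerms: []v1.NodeSelectorTerm{
			{
				MatchFields: []v1.NodeSelectorRequirement{
					{
						Key:      "metadata.name", // field selector key for the node's name
						Operator: v1.NodeSelectorOpIn,
						Values:   []string{nodeName},
					},
				},
			},
		},
	}

	if affinity == nil {
		return &v1.Affinity{
			NodeAffinity: &v1.NodeAffinity{
				RequiredDuringSchedulingIgnoredDuringExecution: nodeSelector,
			},
		}
	}
	if affinity.NodeAffinity == nil {
		affinity.NodeAffinity = &v1.NodeAffinity{}
	}
	// Drop any existing required terms; per the comment above, callers guarantee
	// there is no conflicting NodeAffinity, so replacing is safe.
	affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution = nodeSelector
	return affinity
}
```

A caller such as syncNodes then deep-copies the DaemonSet's pod template and assigns the returned affinity to the copy, as the next diff fragment shows.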
	podTemplate = template.DeepCopy()
	podTemplate.Spec.Affinity = util.ReplaceDaemonSetPodHostnameNodeAffinity(
	// The pod's NodeAffinity will be updated to make sure the Pod is bound
	// to the target node by default scheduler.
Add:
// It is safe to do so because there should be no conflicting node affinity with the target node.
done
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: bsalamat, janetkuo, k82cn. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@@ -850,7 +857,7 @@ func (dsc *DaemonSetsController) podsShouldBeOnNode(
	// If daemon pod is supposed to be running on node, but more than 1 daemon pod is running, delete the excess daemon pods.
	// Sort the daemon pods by creation time, so the oldest is preserved.
nit: scheduled pod is preserved first; if more than one pod can be preserved, the oldest pod is preserved.
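A small sketch of that ordering. The real controller defines its own sort type; the function name and treating a non-empty spec.nodeName as "scheduled" are simplifications for illustration:

```go
package daemon

import (
	"sort"

	v1 "k8s.io/api/core/v1"
)

// sortDaemonPodsToKeepFirst orders duplicate daemon pods so that the pod to
// preserve comes first: scheduled pods before unscheduled ones, and among
// equally-ranked pods the oldest first.
func sortDaemonPodsToKeepFirst(pods []*v1.Pod) {
	sort.Slice(pods, func(i, j int) bool {
		// Pods already assigned to a node sort before unassigned ones.
		if len(pods[i].Spec.NodeName) != 0 && len(pods[j].Spec.NodeName) == 0 {
			return true
		}
		if len(pods[i].Spec.NodeName) == 0 && len(pods[j].Spec.NodeName) != 0 {
			return false
		}
		// Otherwise the older pod sorts first, so it is the one preserved.
		return pods[i].CreationTimestamp.Before(&pods[j].CreationTimestamp)
	})
}
```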
/test all [submit-queue is verifying that this PR is safe to merge]
Automatic merge from submit-queue (batch tested with PRs 64057, 63223, 64346, 64562, 64408). If you want to cherry-pick this change to another branch, please follow the instructions here.
@k82cn: The following tests failed, say /retest to rerun them all:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
…64364-remove-rescheduler Automatic merge from submit-queue (batch tested with PRs 63453, 64592, 64482, 64618, 64661). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Revert "Remove rescheduler and corresponding tests from master"

Reverts kubernetes#64364

After discussing with @bsalamat how DS controllers (ref: kubernetes#63223 (comment)) cannot create pods if the cluster is at capacity and have to rely on the rescheduler to make some space, we thought it is better to:

- Bring rescheduler back.
- Make rescheduler priority aware.
- If the cluster is full and **only** the DS controller is not able to create pods, let rescheduler run and let it evict some pods which have lower priority.
- The DS controller pods will then be scheduled.

So, I am reverting this PR now. Steps 2 and 3 above are going to be in rescheduler.

/cc @bsalamat @aveshagarwal @k82cn Please let me know your thoughts on this.

```release-note
Revert kubernetes#64364 to resurrect rescheduler. More info kubernetes#64725 :)
```
Signed-off-by: Da K. Ma klaus1982.cn@gmail.com
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): part of #59194
Special notes for your reviewer:
Release note: