Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NodeUnschedulable: scheduler queueing hints #119396

Merged
merged 1 commit into from
Sep 14, 2023

Conversation

wackxu
Copy link
Contributor

@wackxu wackxu commented Jul 18, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Part of #118893

Special notes for your reviewer:

Does this PR introduce a user-facing change?

kube-scheduler implements scheduling hints for the NodeUnschedulable plugin.
The scheduling hints allow the scheduler to only retry scheduling a Pod
that was previously rejected by the NodeSchedulable plugin if a new Node or a Node update sets .spec.unschedulable to false.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 18, 2023
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 18, 2023
@wackxu wackxu force-pushed the NodeUnschedulableHintFunc branch from f5fe523 to ea00d6d Compare July 18, 2023 11:32
@wackxu
Copy link
Contributor Author

wackxu commented Jul 18, 2023

/assign @sanposhiho

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jul 19, 2023
Copy link
Member

@sanposhiho sanposhiho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for delay 🙏 (summer holidays + other prioritized tasks)
Looks good overall, just leave several comments to make them cleaner.

Copy link
Member

@sanposhiho sanposhiho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@kubernetes/sig-scheduling-leads @kerthcet
Can someone take another review for /approve? (the bot assigned me to both reviewer/approver 😓 )

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 13, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: befa96a3d596786b219342710518e3eb58ba4ea7

return framework.QueueAfterBackoff
}

originalNodeSchedulable, modifiedNodeSchedulable := false, !modifiedNode.Spec.Unschedulable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need to care about the originalNodeSchedulable for it's always false, if it's true, we'll pass the nodeUnschedulable plugin, which makes the logic simpler.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need originalNodeSchedulable.
Let's say NodeA.Spec.Unschedulable=false and NodeB.Spec.Unschedulable=true, and Pod is failed by this plugin.
In this case, in thisisSchedulableAfterNodeChange, we should ignore all changes to NodeA because NodeA is not related to this plugin's failure. What we need to care about is the change to NodeB only. But, if we don't have originalNodeSchedulable, we always return QueueAfterBackoff for all changes to NodeA.

So, in order to filter out such non-related events for NodeA, we need to have originalNodeSchedulable to return QueueAfterBackoff only when NodeB.Spec.Unschedulable=true → NodeB.Spec.Unschedulable=false.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would be return QueueAfterBackoff if nodeB.Spec.Unchedulable=true?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean QueueAfterBackoff should be returned only when NodeB gets changed from Unschedulable=true to Unschedulable=false.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks for the explanation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't understand why we need to restrict returning QueueAfterBackoff to the transition.

I get that, in general, it shouldn't matter: once there is a transition to Unschedulable=false, the plugin would only return Success, so this function shouldn't be called.

But before #120334, we would call this function again. I think it's actually safer not to restrict passing this check to transitions only.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's say there is only one Node with Unschedulable=true in the cluster but almost all other Nodes are Unschedulable=false. In that case, all Pods would get NodeUnschedulable in the unschedulable plugins.
And, if we didn't restrict returning QueueAfterBackoff to the transition and returned QueueAfterBackoff to all Unschedulable=false Nodes, we would keep requeueing rejected Pods to activeQ/backoffQ with tons of unrelated events to unrelated Nodes (Unschedulable=false from the first). What we should observe is the event to the Node which is/was Unschedulable=true only.

But before #120334, we would call this function again. I think it's actually safer not to restrict passing this check to transitions only.

So... yes. If we want to consider unknown bugs like #120334 which aren't found yet, we should make all plugins to be more conservative, returning QueueAfterBackoff not only to the transition. But, I'm not sure if it's worth doing for unknown bugs.
Like NodeUnschedulable, many Node level scheduling constraint (like NodeTaint, NodeAffinity...etc), the same thing would happen -- like we need to return QueueAfterBackoff to changes to all untainted Nodes, we need to return QueueAfterBackoff to changes to all match Nodes, etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Thanks for the additional context.
Let's pass the check only for transitions then.

return framework.QueueAfterBackoff
}

podToleratesUnschedulable := v1helper.TolerationsTolerateTaint(pod.Spec.Tolerations, &v1.Taint{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the right check order should be:

if newNode.spec.unschedulable == False {
    return queueAfterBackoff
}else {
    verifyTolerations()
}

Because when the node.spec.unschedulable is false, we no longer need to check the taint.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 Always do the faster calculations first

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the logic to check the taint, because Pod.Spec.Tolerations is an immutable field and this plugin never rejects Pod with Toleration for Unschedulable taint. So I think we do not need this logic

@kerthcet
Copy link
Member

@kubernetes/sig-scheduling-leads @kerthcet Can someone take another review for /approve? (the bot assigned me to both reviewer/approver 😓 )

Can we have a sig-scheduling-approvers? Kensei always @ me extraly and I guess it's normal to @kubernetes/sig-scheduling-approvers for kubefolks.

@alculquicondor
Copy link
Member

@sanposhiho
Copy link
Member

Can we have a sig-scheduling-approvers?

+100, let me create the PR.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 14, 2023
@wackxu wackxu force-pushed the NodeUnschedulableHintFunc branch 2 times, most recently from db7ef9c to c123e1b Compare September 14, 2023 02:15
Signed-off-by: wackxu <xushiwei5@huawei.com>
Copy link
Member

@kerthcet kerthcet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve
/hold for @sanposhiho

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 14, 2023
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 14, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: cee7efc4c0b0d4a233f4d9c8f573f707e63e062a

@kerthcet
Copy link
Member

/retest

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 14, 2023
Copy link
Member

@sanposhiho sanposhiho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve
/unhold

Thanks!

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 14, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kerthcet, sanposhiho, wackxu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit fc786dc into kubernetes:master Sep 14, 2023
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.29 milestone Sep 14, 2023
@wackxu
Copy link
Contributor Author

wackxu commented Sep 15, 2023

Thank you for your patience review and guidance @sanposhiho @alculquicondor @kerthcet

@alculquicondor
Copy link
Member

/release-note-edit

kube-scheduler implements scheduling hints for the NodeUnschedulable plugin.
The scheduling hints allow the scheduler to only retry scheduling a Pod
that was previously rejected by the NodeSchedulable plugin if a new Node or a Node update sets .spec.unschedulable to false.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Oct 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants