register unschedulable plugin for those plugins that PreFilter's PreFilterResult filter out some nodes #122251

Open
wants to merge 2 commits into
base: master
from

Conversation

olderTaoist
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

The unschedulable plugin isn't registered for plugins whose PreFilter filters out some Nodes through its PreFilterResult. In this case, changes in the cluster may change the result of those PreFilter plugins and make the Pod schedulable.
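
To make the motivation concrete, here is a toy model of why the recorded plugin set matters. It is only a sketch under assumptions: `pendingPod`, `interestedEvents`, `worthRequeuing` and the event names are hypothetical stand-ins, not the scheduler's real queue types. It assumes a parked Pod is retried only when a cluster event is of interest to one of the plugins recorded as having rejected it.

```go
package main

import "fmt"

// pendingPod is a hypothetical stand-in for a Pod parked as unschedulable,
// together with the plugins recorded as having rejected it.
type pendingPod struct {
	name                 string
	unschedulablePlugins map[string]bool
}

// interestedEvents is an illustrative table (not the scheduler's real one)
// of cluster events that each plugin says might make a Pod schedulable.
var interestedEvents = map[string][]string{
	"NodeAffinity":     {"NodeLabelUpdated"},
	"NodeResourcesFit": {"PodDeleted"},
}

// worthRequeuing reports whether event is of interest to any plugin that
// was recorded as rejecting the Pod.
func worthRequeuing(p pendingPod, event string) bool {
	for plugin := range p.unschedulablePlugins {
		for _, e := range interestedEvents[plugin] {
			if e == event {
				return true
			}
		}
	}
	return false
}

func main() {
	// Scenario: NodeAffinity narrowed the node set in PreFilter, then
	// NodeResourcesFit rejected the remaining nodes in Filter.
	before := pendingPod{"p", map[string]bool{"NodeResourcesFit": true}}
	after := pendingPod{"p", map[string]bool{"NodeAffinity": true, "NodeResourcesFit": true}}

	fmt.Println(worthRequeuing(before, "NodeLabelUpdated")) // false
	fmt.Println(worthRequeuing(after, "NodeLabelUpdated"))  // true
}
```

In this model, a label change that would let NodeAffinity admit more nodes is ignored unless NodeAffinity was recorded, which is the situation the description above refers to.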

Which issue(s) this PR fixes:

Fixes #122018

Special notes for your reviewer:

@sanposhiho

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 10, 2023
@k8s-ci-robot
Contributor

Please note that we're already in Test Freeze for the release-1.29 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.29.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Sun Dec 10 10:17:03 UTC 2023.

@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Dec 10, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Dec 10, 2023
@k8s-ci-robot
Contributor

Hi @olderTaoist. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


@AxeZhan
Member

AxeZhan commented Dec 11, 2023

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 11, 2023
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 20, 2023
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 20, 2023
@olderTaoist olderTaoist force-pushed the unschedulable-plugin branch 3 times, most recently from 7a9b583 to bea0b63 Compare December 21, 2023 01:25
@sanposhiho
Member

/assign

I'll hopefully review this over the weekend, or during the New Year holiday at the latest.

@olderTaoist
Contributor Author

olderTaoist commented Jun 7, 2024

Actually, #119779, which was just merged, changed how we treat PreFilterResult. In this PR, I guess we can do something like:

  1. add diagnosis to the parameters of RunPreFilterPlugins.
  2. update diagnosis.UnschedulablePlugins and diagnosis.NodeToStatusMap when PreFilter returns a PreFilterResult.

Then we can clean up some of the logic in findNodesThatFitPod that modifies diagnosis after RunPreFilterPlugins.

@olderTaoist @sanposhiho

I think this PR needs some rework/refactoring.

This PR makes large changes in order to update diagnosis.NodeToStatusMap when PreFilter returns a PreFilterResult. However, doing so greatly harms performance. We're planning to remove this in a recent PR (#125197 (comment)), because we are trying to implement option 2 proposed here.

So, after #125197, I think all we have to do is:

  1. add diagnosis to the parameters of RunPreFilterPlugins.
  2. update diagnosis.UnschedulablePlugins when PreFilter returns a PreFilterResult whose AllNodes() is false.

if !r.AllNodes() {
	pluginsWithNodes = append(pluginsWithNodes, pl.Name())
}

if !r.AllNodes() {
	pluginsWithNodes = append(pluginsWithNodes, pl.Name())
	diagnosis.UnschedulablePlugins.Insert(pl.Name())
}
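
The two steps above can be sketched in isolation as follows. This is a minimal, self-contained illustration, not the framework code: `preFilterResult`, `diagnosis`, `plugin`, `runPreFilterPlugins` and `sortedNames` are local stand-ins, and each plugin's result is supplied up front instead of being computed by a real PreFilter call.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// preFilterResult is a local stand-in; a nil nodeNames means "all nodes".
type preFilterResult struct{ nodeNames map[string]bool }

// allNodes reports whether the result leaves every node eligible.
func (r *preFilterResult) allNodes() bool { return r == nil || r.nodeNames == nil }

// diagnosis is a local stand-in holding only the plugin set discussed here.
type diagnosis struct{ unschedulablePlugins map[string]bool }

// plugin pairs a name with the result its PreFilter would return.
type plugin struct {
	name   string
	result *preFilterResult
}

// runPreFilterPlugins takes diag as a parameter (step 1) and records every
// plugin whose result restricts the node set (step 2).
func runPreFilterPlugins(plugins []plugin, diag *diagnosis) []string {
	var pluginsWithNodes []string
	for _, pl := range plugins {
		if !pl.result.allNodes() {
			pluginsWithNodes = append(pluginsWithNodes, pl.name)
			diag.unschedulablePlugins[pl.name] = true
		}
	}
	return pluginsWithNodes
}

// sortedNames renders a set deterministically for printing.
func sortedNames(set map[string]bool) string {
	names := make([]string, 0, len(set))
	for n := range set {
		names = append(names, n)
	}
	sort.Strings(names)
	return strings.Join(names, ",")
}

func main() {
	diag := &diagnosis{unschedulablePlugins: map[string]bool{}}
	runPreFilterPlugins([]plugin{
		{name: "NodeAffinity", result: &preFilterResult{nodeNames: map[string]bool{"node1": true}}},
		{name: "NodeResourcesFit"}, // nil result: no restriction
	}, diag)
	fmt.Println(sortedNames(diag.unschedulablePlugins)) // NodeAffinity
}
```

Note that the function mutates its `diag` argument, so the caller has to know which fields are touched.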

I recently read the relevant diagnosis.NodeToStatusMap PR carefully, and I agree with your idea. Please review again, @AxeZhan.

@olderTaoist
Contributor Author

Sorry, I just realized that there are out-of-tree plugins that may return Unschedulable during PreFilter. So should we also update diagnosis.NodeToStatusMap, but only when PreFilter returns an Unschedulable status?

if s.Code() == framework.Unschedulable {
	// In this case, the preemption should happen later in this scheduling cycle.
	// So we need to execute all PreFilter.
	// https://github.com/kubernetes/kubernetes/issues/119770
	returnStatus = s
	continue
}

When all nodes are filtered out by a PreFilter that returns a PreFilterResult whose AllNodes() is false, RunPreFilterPlugins returns an UnschedulableAndUnresolvable status:

if !result.AllNodes() && len(result.NodeNames) == 0 {
	msg := fmt.Sprintf("node(s) didn't satisfy plugin(s) %v simultaneously", pluginsWithNodes)
	if len(pluginsWithNodes) == 1 {
		msg = fmt.Sprintf("node(s) didn't satisfy plugin %v", pluginsWithNodes[0])
	}
	// When PreFilterResult filters out Nodes, the framework considers Nodes that are filtered out as getting "UnschedulableAndUnresolvable".
	return result, framework.NewStatus(framework.UnschedulableAndUnresolvable, msg)
}

The original logic updates diagnosis.NodeToStatusMap regardless of whether the status is UnschedulableAndUnresolvable or Unschedulable. I am unsure whether it is necessary to update diagnosis.NodeToStatusMap when some PreFilter plugins return a PreFilterResult whose AllNodes() is false and a nil status. @alculquicondor @gabesaba, this PR focuses only on diagnosis.UnschedulablePlugins.

@AxeZhan
Member

AxeZhan commented Jun 7, 2024

I am unsure whether it is necessary to update diagnosis.NodeToStatusMap when some PreFilter plugins return a PreFilterResult whose AllNodes() is false and a nil status.

There is no need to update those node statuses; nodes that pass PreFilter get their status updated during the Filter stage.
In PreFilter, we only make sure that, if a PreFilter plugin returned Unschedulable, we set the status for all nodes so that preemption can select candidates.

We'll also need to find a better way than iterating over all nodes and setting each status in a map, but that is out of this PR's scope.

@olderTaoist olderTaoist force-pushed the unschedulable-plugin branch 2 times, most recently from e0db92a to 8a8c040 Compare June 7, 2024 10:14
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 7, 2024
@olderTaoist olderTaoist requested a review from AxeZhan June 7, 2024 10:19
@@ -588,7 +588,7 @@ type Framework interface {
 	// cycle is aborted.
 	// It also returns a PreFilterResult, which may influence what or how many nodes to
 	// evaluate downstream.
-	RunPreFilterPlugins(ctx context.Context, state *CycleState, pod *v1.Pod) (*PreFilterResult, *Status)
+	RunPreFilterPlugins(ctx context.Context, state *CycleState, pod *v1.Pod, diagnosis *Diagnosis) (*PreFilterResult, *Status)
Contributor

Another option would be to add UnschedulablePlugins to the PreFilterResult and have the caller of RunPreFilterPlugins use that result to update diagnosis. It avoids changing the interface and makes the data flow clearer.

Member

Yes, we can do that.
Personally, I prefer the current implementation.

Contributor

Is there an advantage to the current implementation that I'm not seeing which justifies having side effects? My concern is that readers/maintainers of this code will have to read the implementation of RunPreFilterPlugins to see why Diagnosis is passed, and which fields in it are read/mutated. Passing this type also makes it easier in the future for other fields in Diagnosis to be modified without careful consideration, increasing complexity (suppose NodeToStatusMap is also modified inside of RunPreFilterPlugins, rather than just by findNodesThatFitPod).

While the data flow/mutability is my primary concern, this PR also becomes smaller (8 -> 4 files), if we include result in PreFilterResult

Member

I agree with @gabesaba's rationale.
Please update the PR.

Contributor Author

Already updated; please review again, @AxeZhan @gabesaba @alculquicondor.

Member

Can we just return UnschedulablePlugins as a return value from RunPreFilterPlugins, instead of a new field in PreFilterResult?

PreFilterResult is a data structure that we also use between the framework and plugins; that is, it is meant to be exposed to plugin developers, whereas UnschedulablePlugins would not be a field for plugin developers. It seems better not to expose such a field in PreFilterResult unless we have a special reason to do so.
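
A rough sketch of the shape suggested here, with the plugin names returned next to the result instead of inside it. Everything below is a local stand-in chosen for illustration (`preFilterResult`, `plugin`, `runPreFilterPlugins`, `copySet`, `joinSorted`); the real function also returns a status, which is omitted to keep the example short.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// preFilterResult carries only what a plugin reports: the nodes it allows.
// A nil nodeNames means "all nodes".
type preFilterResult struct{ nodeNames map[string]bool }

// plugin pairs a name with the result its PreFilter would return.
type plugin struct {
	name   string
	result *preFilterResult
}

// runPreFilterPlugins returns the names of the filtering plugins as a second
// value, so the plugin-facing result type needs no framework-only field.
func runPreFilterPlugins(plugins []plugin) (*preFilterResult, map[string]bool) {
	var merged *preFilterResult
	unschedulable := map[string]bool{}
	for _, pl := range plugins {
		if pl.result == nil || pl.result.nodeNames == nil {
			continue // this plugin did not restrict the node set
		}
		unschedulable[pl.name] = true
		if merged == nil {
			merged = &preFilterResult{nodeNames: copySet(pl.result.nodeNames)}
			continue
		}
		// Keep only the nodes that this plugin allows as well.
		for n := range merged.nodeNames {
			if !pl.result.nodeNames[n] {
				delete(merged.nodeNames, n)
			}
		}
	}
	return merged, unschedulable
}

// copySet returns an independent copy of s.
func copySet(s map[string]bool) map[string]bool {
	out := make(map[string]bool, len(s))
	for k := range s {
		out[k] = true
	}
	return out
}

// joinSorted renders a set deterministically for printing.
func joinSorted(s map[string]bool) string {
	keys := make([]string, 0, len(s))
	for k := range s {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return strings.Join(keys, ",")
}

func main() {
	res, unschedulable := runPreFilterPlugins([]plugin{
		{name: "A", result: &preFilterResult{nodeNames: map[string]bool{"n1": true, "n2": true}}},
		{name: "B", result: &preFilterResult{nodeNames: map[string]bool{"n2": true, "n3": true}}},
		{name: "C"}, // no restriction
	})
	fmt.Println(joinSorted(res.nodeNames), joinSorted(unschedulable)) // n2 A,B
}
```

The caller can then copy the second return value into its diagnosis, so nothing is mutated behind its back and plugin authors never see the extra set.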

Member

@gabesaba has raised a similar point. #122251 (comment)

We can do this as a follow-up?

Member

I believe this should be done in this PR.

Member

@AxeZhan AxeZhan left a comment

Thanks! Only a few nits.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 11, 2024
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 11, 2024
@olderTaoist olderTaoist requested a review from AxeZhan June 11, 2024 01:29
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: olderTaoist
Once this PR has been reviewed and has the lgtm label, please ask for approval from sanposhiho. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@@ -740,6 +743,7 @@ func (p *PreFilterResult) Merge(in *PreFilterResult) *PreFilterResult {
 	}

 	r.NodeNames = p.NodeNames.Intersection(in.NodeNames)
+	r.UnschedulablePlugins = p.UnschedulablePlugins.Clone()
Contributor

How do we handle the case when `in` has UnschedulablePlugins?

Contributor Author

`in` will have no UnschedulablePlugins.

Member

`in` will have no UnschedulablePlugins.

In the current implementation in this PR, yes. But I would also prefer a union here for future use.
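
For illustration, a merge that takes the union of both sides could look roughly like this. It is a sketch with local stand-in types and helpers (`preFilterResult`, `merge`, `intersect`, `union`, `clone`, `joinSorted`), not the PR's code; it assumes a nil `nodeNames` means "all nodes" and a nil plugin set means "none recorded".

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// preFilterResult is a local stand-in for the type under discussion.
type preFilterResult struct {
	nodeNames            map[string]bool // nil means all nodes
	unschedulablePlugins map[string]bool // nil means none recorded
}

// merge intersects the node sets and unions the plugin sets, treating a nil
// receiver or argument as an empty result.
func (p *preFilterResult) merge(in *preFilterResult) *preFilterResult {
	if p == nil {
		p = &preFilterResult{}
	}
	if in == nil {
		in = &preFilterResult{}
	}
	return &preFilterResult{
		nodeNames:            intersect(p.nodeNames, in.nodeNames),
		unschedulablePlugins: union(p.unschedulablePlugins, in.unschedulablePlugins),
	}
}

// intersect treats nil as "all nodes", so the other side wins unchanged.
func intersect(a, b map[string]bool) map[string]bool {
	if a == nil {
		return clone(b)
	}
	if b == nil {
		return clone(a)
	}
	out := map[string]bool{}
	for k := range a {
		if b[k] {
			out[k] = true
		}
	}
	return out
}

// union keeps names from both sides, and stays nil if both sides are nil.
func union(a, b map[string]bool) map[string]bool {
	if a == nil && b == nil {
		return nil
	}
	out := map[string]bool{}
	for k := range a {
		out[k] = true
	}
	for k := range b {
		out[k] = true
	}
	return out
}

// clone copies a set, preserving nil.
func clone(s map[string]bool) map[string]bool {
	if s == nil {
		return nil
	}
	out := make(map[string]bool, len(s))
	for k := range s {
		out[k] = true
	}
	return out
}

// joinSorted renders a set deterministically for printing.
func joinSorted(s map[string]bool) string {
	keys := make([]string, 0, len(s))
	for k := range s {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return strings.Join(keys, ",")
}

func main() {
	a := &preFilterResult{
		nodeNames:            map[string]bool{"n1": true, "n2": true},
		unschedulablePlugins: map[string]bool{"A": true},
	}
	b := &preFilterResult{
		nodeNames:            map[string]bool{"n2": true},
		unschedulablePlugins: map[string]bool{"B": true},
	}
	m := a.merge(b)
	fmt.Println(joinSorted(m.nodeNames), joinSorted(m.unschedulablePlugins)) // n2 A,B
}
```

With a union, the result no longer depends on which side happens to carry the plugin names.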

@@ -726,10 +726,14 @@ func (f *frameworkImpl) RunPreFilterPlugins(ctx context.Context, state *framewor
}
Contributor

Do we need to record unschedulable plugins in the branches above (lines 715 and 720), or only when the plugin filters out some nodes (as the title of the PR indicates)?

Contributor Author

In the previous implementation, the unschedulable plugin was also added at lines 713 and 714.

Contributor

I think I understand now. We set the plugin name, and later we add it via this call?

But if more than one plugin returns Unschedulable (or one does and another plugin returns UnschedulableAndUnresolvable), we may miss some plugins, right?

Comment on lines 729 to 736
result = result.Merge(r)
if !r.AllNodes() {
	if result.UnschedulablePlugins == nil {
		result.UnschedulablePlugins = sets.New[string]()
	}
	result.UnschedulablePlugins.Insert(pl.Name())
	pluginsWithNodes = append(pluginsWithNodes, pl.Name())
}
Contributor

Suggested change
-result = result.Merge(r)
-if !r.AllNodes() {
-	if result.UnschedulablePlugins == nil {
-		result.UnschedulablePlugins = sets.New[string]()
-	}
-	result.UnschedulablePlugins.Insert(pl.Name())
-	pluginsWithNodes = append(pluginsWithNodes, pl.Name())
-}
+if !r.AllNodes() {
+	r.UnschedulablePlugins = sets.New(pl.Name())
+	pluginsWithNodes = append(pluginsWithNodes, pl.Name())
+}
+result = result.Merge(r)

Or, if we want fewer allocations, create the set outside of the loop and assign it to result before the final return. (This requires a small refactor: break out of the loop on line 744 rather than return.)

Contributor Author

Is there anything wrong with the way I wrote this?

Contributor

The current implementation works, but there is room for simplification (fewer branches) and less code. After all, we just want to do a set union of unschedulable plugin names, not merge some complex data structure.

create set outside of the loop and assign to result before final return

I think this is the simplest option, as we don't have to worry about the merge logic (which, in the general case, becomes complex if we want to properly handle all combinations of nil/non-nil)

Contributor Author

like this?

Contributor

Yes, your new version is even better than what I had in mind :)

@@ -718,6 +718,9 @@ type PreFilterResult struct {
// The set of nodes that should be considered downstream; if nil then
Contributor

Perhaps this is outside the scope of this PR (something we can fix as a follow-up), but a question for @AxeZhan, @alculquicondor: does it make sense to share this result type between PreFilter and RunPreFilterPlugins? As it stands, there is now a field that users of the framework should never need to set.

Member

Could you please elaborate? I don't quite get it.

Contributor

RunPreFilterPlugins acts on all plugins, while PreFilter runs a single plugin, but they both return the same PreFilterResult type. My concerns are:

  1. With the addition to PreFilterResult I suggested, the field UnschedulablePlugins should only be used by RunPreFilterPlugins. As PreFilter shouldn't modify this field, this may be confusing in the interface, and we may have to defensively code against the case when it was modified.
  2. We assume we can fold or "merge" these PreFilterResults into a single PreFilterResult. This may not always be true, and it is already a little clumsy. All we want to do is a set intersection on NodeNames, and a set union on UnschedulablePlugins. We don't need a complex merge function for that, where we have to safely handle one or both sides being nil. We just need two sets in RunPreFilterPlugins.

I don't think this should block this PR, but I wanted to raise it as something we might clean up afterwards, and therefore wanted to get your feedback. If you agree with this, I'm happy to make these changes after this PR merges.

Contributor

@gabesaba gabesaba left a comment

Only a few nits left; otherwise it looks good. Thanks for addressing the comments!

Comment on lines +454 to +456
if preRes != nil && preRes.UnschedulablePlugins != nil {
	diagnosis.UnschedulablePlugins = preRes.UnschedulablePlugins.Clone()
}
Contributor

Suggested change
-if preRes != nil && preRes.UnschedulablePlugins != nil {
-	diagnosis.UnschedulablePlugins = preRes.UnschedulablePlugins.Clone()
-}
+if preRes != nil {
+	diagnosis.UnschedulablePlugins = preRes.UnschedulablePlugins
+}

nit: since we don't use this field elsewhere, we may omit copy (and propagating nil value is fine)
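
As background for the nil remark, a small sketch of how a nil map-backed set behaves in Go (as far as I know, `sets.Set[T]` is itself a map type): reads on a nil map are safe, and only writes need the map to be allocated first. The `diagnosis` type and its `has`/`insert` helpers below are hypothetical stand-ins written for this example.

```go
package main

import "fmt"

// diagnosis is a local stand-in holding a map-backed set that may be nil.
type diagnosis struct {
	unschedulablePlugins map[string]struct{}
}

// has reports membership; reading from a nil map is safe in Go.
func (d *diagnosis) has(name string) bool {
	_, ok := d.unschedulablePlugins[name]
	return ok
}

// insert allocates the set lazily, because writing to a nil map panics.
func (d *diagnosis) insert(name string) {
	if d.unschedulablePlugins == nil {
		d.unschedulablePlugins = map[string]struct{}{}
	}
	d.unschedulablePlugins[name] = struct{}{}
}

func main() {
	var d diagnosis // the set is nil, as if a nil value had been propagated
	fmt.Println(len(d.unschedulablePlugins), d.has("NodeAffinity")) // 0 false
	d.insert("NodeAffinity")
	fmt.Println(len(d.unschedulablePlugins), d.has("NodeAffinity")) // 1 true
}
```

So propagating nil is harmless as long as any later write goes through a path that allocates the set when needed.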

Comment on lines +745 to +746
result.UnschedulablePlugins = sets.New[string]()
result.UnschedulablePlugins.Insert(pluginsWithNodes...)
Contributor

Suggested change
-result.UnschedulablePlugins = sets.New[string]()
-result.UnschedulablePlugins.Insert(pluginsWithNodes...)
+result.UnschedulablePlugins = sets.New(pluginsWithNodes...)

does this compile?
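
On the compile question: with a variadic generic constructor, Go can infer the type parameter from a spread slice argument. The sketch below uses a local `set` type and `newSet` function as stand-ins, on the assumption that `sets.New` has the same general shape (a `comparable` type parameter and `...T` items); it does not import the real package.

```go
package main

import "fmt"

// set is a minimal local stand-in for a generic set type.
type set[T comparable] map[T]struct{}

// newSet builds a set from its arguments; duplicates collapse.
func newSet[T comparable](items ...T) set[T] {
	s := make(set[T], len(items))
	for _, it := range items {
		s[it] = struct{}{}
	}
	return s
}

func main() {
	pluginsWithNodes := []string{"NodeAffinity", "VolumeBinding", "NodeAffinity"}
	// T is inferred as string from the slice's element type, so no explicit
	// [string] instantiation is needed.
	s := newSet(pluginsWithNodes...)
	_, ok := s["VolumeBinding"]
	fmt.Println(len(s), ok) // 2 true
}
```

Called with no arguments there is nothing to infer from, which is why the empty case has to be written with an explicit instantiation such as `newSet[string]()`.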

@@ -718,6 +718,9 @@ type PreFilterResult struct {
// The set of nodes that should be considered downstream; if nil then
// all nodes are eligible.
NodeNames sets.Set[string]

// UnschedulablePlugins are plugins that returns Unschedulable or UnschedulableAndUnresolvable.
Contributor

Suggested change
-// UnschedulablePlugins are plugins that returns Unschedulable or UnschedulableAndUnresolvable.
+// UnschedulablePlugins are plugins which filter out nodes.

If a plugin returns UnschedulableAndUnresolvable or Unschedulable, we don't add it to this set. How about this comment?

Contributor

We should update this comment too, as we will now include plugins which return success, but filtered some nodes.

// UnschedulablePlugins are plugins that returns Unschedulable or UnschedulableAndUnresolvable.
UnschedulablePlugins sets.Set[string]

@sanposhiho
Member

@olderTaoist Also, please fill in the release note, e.g.,

The scheduler now retries Pods rejected with PreFilterResult more appropriately.

(Maybe we can say more, but no need to explain very deeply.)

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
8 participants