Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

run all PreFilter when the preemption will happen later in the same scheduling cycle #119779

Merged
merged 2 commits into from Jan 3, 2024

Conversation

sanposhiho
Copy link
Member

@sanposhiho sanposhiho commented Aug 6, 2023

What type of PR is this?

/kind bug
/triage accepted
/priority important-soon

What this PR does / why we need it:

Not all PreFilter plugins aren't executed either

  • when one PreFilter plugin returns Unschedulable
  • when PreFilter plugin(s) filter out all nodes by PreFilterResult

But maybe their Filter()s are executed in the preemption and could cause error by trying to read data, which is supposed to be stored by PreFilter (actually not), from cycle state.


The former case shouldn't happen in the default scheduler since we don't have PreFilter plugins returning Unschedulable.
But, the latter case may happen in the default scheduler, and cherry-pick is needed.


#119777 adds integ test for this scenario and #119780 proves this patch fixing the bug.

Which issue(s) this PR fixes:

Part of #119770

Special notes for your reviewer:

PreFilterResult was introduced at 1.24. So, we need to cherry-pick this PR for all supported versions.

Does this PR introduce a user-facing change?

When PreFilterResult filters out some Nodes, the scheduling framework assumes them as rejected via `UnschedulableAndUnresolvable`, 
that is those nodes won't be in the candidates of preemption process.
Also, corrected how the scheduling framework handle Unschedulable status from PreFilter. 
Before this PR, if PreFilter return `Unschedulable`, it may result in an unexpected abortion in the preemption, 
which shouldn't happen in the default scheduler, but may happen in schedulers with a custom plugin.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Aug 6, 2023
@k8s-ci-robot
Copy link
Contributor

Please note that we're already in Test Freeze for the release-1.28 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.28.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Sat Aug 5 22:30:03 UTC 2023.

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 6, 2023
@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Aug 6, 2023
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 6, 2023
@sanposhiho
Copy link
Member Author

will soon add UTs

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Aug 6, 2023
@sanposhiho
Copy link
Member Author

/retest

@sanposhiho
Copy link
Member Author

/cc @Huang-Wei

@@ -655,7 +656,16 @@ func (f *frameworkImpl) RunPreFilterPlugins(ctx context.Context, state *framewor
if !s.IsSuccess() {
s.SetFailedPlugin(pl.Name())
if s.IsUnschedulable() {
return nil, s
if s.Code() == framework.UnschedulableAndUnresolvable {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comparing to always continue running PreFilter upon Unschedulable, I'm thinking of conditional continue/stop:

  • option 1: offer a knob in scheduler config to indicate whether it's intended to continue run PreFilter plugin upon the first Unschedulable - it should be a profile-specific parameter.
  • option 2: try to deduce the user's intent w/o offering a config knob. Technically, it's doable to tell whether a PreFilter plugin returns nil in PreFilterExtensions. For example, suppose pluginA and pluginB both implement PreFilter, pluginB is placed after pluginA. When pluginA fails with Unschedulable
    • runPreFilterPlugins() stops if pluginB returns nil PreFilterExtensions
    • runPreFilterPlugins() continues if pluginB returns non-nil PreFilterExtensions

Copy link
Member Author

@sanposhiho sanposhiho Aug 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sg. I prefer option2, as option1 seems a bit difficult for users to understand appropriately, and I don't come up with any scenario that option2 cannot cover, but people want to continue or stop running other PreFilter.

But, can we leave this enhancement to a follow-up so that this PR won't go bigger? because I'm thinking this PR deserves to be cherry-picked to older versions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @alculquicondor @ahg-g @kerthcet A simple approach is to keep running all PreFilter plugins upon Unschedulable, which should be no penalty for in-tree plugins (at least for now).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw: even for out-of-tree plugins, it's still very strict to trigger this issue, basically you have to satisfy the following conditions:

#119770 (comment)

  • an out-of-tree pluginA returns Unschedulable in PreFilter
  • an out-of-tree pluginB is placed after pluginA, and
    • it implements PreFilter, and write its own key to cycleState
    • it implements PreFilterExtensions#Add/RemovePod, and report error upon missing key

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I haven't got time to check the sig-meeting rec, but is the conclusion just to go with keeping running all PreFilter when Unschedulable and leave the idea of conditional continue at least for now?
Also, is there anything to improve on this PR that I need to pile up to get /lgtm?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with the current approach. And it's not that severe comparing to other 2 regressions we recently discussed. Let's polish this (I guess there will be conflicts on tests) after those 2 regression PRs get merged first, and then discuss if we want to cherry-pick.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 16, 2023
@dims dims added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 24, 2023
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 19, 2023
@sanposhiho
Copy link
Member Author

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 19, 2023
@sanposhiho
Copy link
Member Author

Rebased.

@alculquicondor
Copy link
Member

when PreFilter plugin(s) filter out all nodes by PreFilterResult

In which case can this happen in the default scheduler?

@Huang-Wei
Copy link
Member

/retest

@alculquicondor
Copy link
Member

/remove-kind bug
/kind feature

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jan 2, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, sanposhiho

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@alculquicondor
Copy link
Member

Please update the release note to reflect that this is a new feature that external plugins can take advantage of. It's not a bug for the existing in-tree plugins.

/lgtm
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 2, 2024
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 2, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 02649e6e98e33ac1e694b098455063f19ee28f57

@sanposhiho
Copy link
Member Author

sanposhiho commented Jan 3, 2024

@alculquicondor Updated. Also I remarked on a bug that could happen in custom schedulers though, feel free to eliminate that part if you don't prefer to include it.

@alculquicondor
Copy link
Member

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 3, 2024
@k8s-ci-robot k8s-ci-robot merged commit 8609b8c into kubernetes:master Jan 3, 2024
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.30 milestone Jan 3, 2024
@alculquicondor
Copy link
Member

We are going to need an updated release note. And then prepare the cherry-pick, please.

chengjoey added a commit to chengjoey/kubernetes that referenced this pull request Apr 26, 2024
…egal nodes, the pod scheduling flow will abort immediately.

this is a minimum fix from kubernetes#119779

Signed-off-by: joey <zchengjoey@gmail.com>
chengjoey added a commit to chengjoey/kubernetes that referenced this pull request Apr 27, 2024
…egal nodes, the pod scheduling flow will abort immediately.

this is a minimum fix from kubernetes#119779

Signed-off-by: joey <zchengjoey@gmail.com>
chengjoey added a commit to chengjoey/kubernetes that referenced this pull request May 3, 2024
…egal nodes, the pod scheduling flow will abort immediately.

this is a minimum fix from kubernetes#119779

Signed-off-by: joey <zchengjoey@gmail.com>
@@ -2227,7 +2227,7 @@ func TestSchedulerSchedulePod(t *testing.T) {
nodes: []string{"node1", "node2", "node3"},
pod: st.MakePod().Name("test-prefilter").UID("test-prefilter").Obj(),
wantNodes: sets.New("node2"),
wantEvaluatedNodes: ptr.To[int32](1),
wantEvaluatedNodes: ptr.To[int32](3),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this number shouldn't have changed?

Although it's only useful for debugging.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be changed.

Please refer to the change in pkg/scheduler/scheduler.go in this PR: After this PR, EvaluatedNodes contains the number of nodes that filtered out by PreFilterResult.
https://github.com/kubernetes/kubernetes/pull/119779/files#r1439161728

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I understand that it's not a bug. However, it was useful to have the value 1 to know that only one node passed PreFilter.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Then, we have to change how to calculate EvaluatedNodes.
https://github.com/sanposhiho/kubernetes/blob/master/pkg/scheduler/schedule_one.go#L435


Created #124705

chengjoey added a commit to chengjoey/kubernetes that referenced this pull request May 9, 2024
…egal nodes, the pod scheduling flow will abort immediately.

this is a minimum fix from kubernetes#119779

Signed-off-by: joey <zchengjoey@gmail.com>
chengjoey added a commit to chengjoey/kubernetes that referenced this pull request May 9, 2024
…egal nodes, the pod scheduling flow will abort immediately.

this is a minimum fix from kubernetes#119779

Signed-off-by: joey <zchengjoey@gmail.com>
hswong3i pushed a commit to alvistack/kubernetes-kubernetes that referenced this pull request May 15, 2024
…egal nodes, the pod scheduling flow will abort immediately.

this is a minimum fix from kubernetes#119779

Signed-off-by: joey <zchengjoey@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

5 participants