Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cleanup(framework): return earlier if all Score plugins are skipped #118297

Closed
wants to merge 1 commit into from

Conversation

sanposhiho
Copy link
Member

@sanposhiho sanposhiho commented May 27, 2023

What type of PR is this?

/kind bug
/kind cleanup

What this PR does / why we need it:

do an earlier return when all Score plugins returned Skip in PreScore.

This PR originally had the fix for panic bug

The scheduler panics when all these conditions are satisfied:

  • some PreScore plugins which returned Skip, but are disabled in the Score extension point
  • "the number of PreScore plugins that returned Skip and disabled in the Score" is more than the number of all Score plugins enabled in the scheduler

You can reproduce the panic with newly added UT like the following.

Running tool: /home/gitpod/go/bin/go test -timeout 30s -run ^TestRunScorePlugins$ k8s.io/kubernetes/pkg/scheduler/framework/runtime

I0527 12:20:50.858222  101963 framework.go:509] "MultiPoint plugin is explicitly re-configured; overriding" plugin="score-plugin-1"
--- FAIL: TestRunScorePlugins (0.00s)
    --- FAIL: TestRunScorePlugins/PreScore_returns_Skip_but_isn't_registered_in_Score (0.00s)
panic: runtime error: makeslice: cap out of range [recovered]
	panic: runtime error: makeslice: cap out of range

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

This PR originally had the fix for panic bug
=

@kubernetes/sig-scheduling-leads
In my opinion, we don't need to cherry-pick this into v1.27, but please anyone double-check.

This shouldn't break the default Kubernetes scheduler because any in-tree Score plugins don't return Skip in v1.27 in the first place. (ref)
It might break the custom scheduler, for example, the scheduler with some custom PreScore plugins which may return Skip, but are disabled in the Score extension point. Also, to see this bug, "the number of PreScore plugins that returned Skip and disabled in the Score" should be more than the number of all Score plugins enabled in the scheduler, which is an extremely minor case.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 27, 2023
@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 27, 2023
@sanposhiho
Copy link
Member Author

sanposhiho commented May 27, 2023

/priority important-soon
/triage accepted

As described, it's not a critical bug. But, it's a blocker to implement Skip in Score plugins because it's breaking the e2e test in the PR of Skip implementation (e.g., #117024).

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 27, 2023
@@ -1014,49 +1014,57 @@ func (f *frameworkImpl) RunScorePlugins(ctx context.Context, state *framework.Cy
metrics.FrameworkExtensionPointDuration.WithLabelValues(metrics.Score, status.Code().String(), f.profileName).Observe(metrics.SinceInSeconds(startTime))
}()
allNodePluginScores := make([]framework.NodePluginScores, len(nodes))
numPlugins := len(f.scorePlugins) - state.SkipScorePlugins.Len()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The trigger of this bug is that numPlugins becomes less than 0.

@sanposhiho sanposhiho changed the title fix(framework): ignore Skip from PreScore plugins that aren't registered in Score fix(framework): fix a panic rarely caused by Skip from PreScore plugins that aren't registered in Score May 27, 2023
@AxeZhan
Copy link
Member

AxeZhan commented May 27, 2023

I ran into this before in some tests, but is there really a scenario that a plugin is enabled for prescore and disabled for score in reality?

@sanposhiho
Copy link
Member Author

I ran into this before in some tests, but is there really a scenario that a plugin is enabled for prescore and disabled for score in reality?

I'd say 100% No if no one made mistakes in the scheduler component config. This fix is for mistakers.

/remove-priority important-soon

I'll fix the e2e test side to unblock the Skip PR.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels May 27, 2023
@sanposhiho
Copy link
Member Author

But, on second thought, we may just want to accept this bug because this PR will remove pre-allocation for pluginToNodeScores and plugins. 🤔

@sanposhiho
Copy link
Member Author

sanposhiho commented May 27, 2023

But, on second thought, we may just want to accept this bug because this PR will remove pre-allocation for pluginToNodeScores and plugins. 🤔

Let's go with this thought unless other people want this bug fix.
I'll convert this PR to the cleanup one. Feel free to give your thought if you think that was worthwhile to have.

/remove-kind bug
/retitle cleanup(framework): return earlier if all Score plugins are skipped

@k8s-ci-robot k8s-ci-robot changed the title fix(framework): fix a panic rarely caused by Skip from PreScore plugins that aren't registered in Score cleanup(framework): return earlier if all Score plugins are skipped May 27, 2023
@k8s-ci-robot k8s-ci-robot removed the kind/bug Categorizes issue or PR as related to a bug. label May 27, 2023
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note Denotes a PR that will be considered when it comes time to generate release notes. labels May 27, 2023
@sanposhiho
Copy link
Member Author

/retest

@k8s-ci-robot
Copy link
Contributor

@sanposhiho: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-unit 6b6ab1b link true /test pull-kubernetes-unit

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sanposhiho, Tusenka
Once this PR has been reviewed and has the lgtm label, please assign ahg-g for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kerthcet
Copy link
Member

/assign

@kerthcet
Copy link
Member

Can you fix the test?

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 14, 2023
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 21, 2023
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 4, 2023
@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@alculquicondor
Copy link
Member

Are you still planning this?

@sanposhiho
Copy link
Member Author

It's in icebox state in my todo list. I don't think I can take time to take care of it for now. Anyone feel free to raise a similar PR. (or maybe future me will)

/close

@k8s-ci-robot
Copy link
Contributor

@sanposhiho: Closed this PR.

In response to this:

It's in icebox state in my todo list. I don't think I can take time to take care of it for now. Anyone feel free to raise a similar PR. (or maybe future me will)

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note-none Denotes a PR that doesn't merit a release note. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

6 participants