scheduler: support scheduling profile-level configuration parameters #93270

Closed
yuanchen8911 opened this issue Jul 20, 2020 · 40 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@yuanchen8911
Member

yuanchen8911 commented Jul 20, 2020

What would you like to be added:

The current scheduler parameters are set in the scheduler configuration file as global settings. Now that the scheduling framework and multiple scheduling profiles have been introduced, it would be useful to support scheduling profile-level parameters, e.g., percentageOfNodesToScore and the related parameter minFeasibleNodesToFind.

Global configuration

apiVersion: kubescheduler.config.k8s.io/v1alpha2
kind: KubeSchedulerConfiguration
...

profiles:
- schedulerName: batch-scheduler
  plugins:
    ...
- schedulerName: service-scheduler
  plugins:
    ...

percentageOfNodesToScore: 50

Per-profile configuration

apiVersion: kubescheduler.config.k8s.io/v1alpha2
kind: KubeSchedulerConfiguration
...

profiles:
- schedulerName: batch-scheduler
  plugins:
    ...
  percentageOfNodesToScore: 10
- schedulerName: service-scheduler
  plugins:
    ...
  percentageOfNodesToScore: 50

Why is this needed:

Scheduling profile-level configuration parameters would provide a simple way to customize scheduling behavior per profile. For example, different percentageOfNodesToScore thresholds can be used to tune performance and strike a better balance between scheduling speed and quality for different workloads: a long-running service typically cares more about scheduling quality and can use a profile with a high threshold, while a large batch job wants a quick turnaround and can use a profile with a lower threshold for faster scheduling.

@yuanchen8911 yuanchen8911 added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 20, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jul 20, 2020
@k8s-ci-robot
Contributor

@yuanchen8911: The label(s) sig/scheduler cannot be applied, because the repository doesn't have them

In response to this:

/sig scheduler

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@yuanchen8911
Member Author

/sig scheduling

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 20, 2020
@yuanchen8911
Member Author

/assign @Huang-Wei
/assign @ahg-g

@Huang-Wei
Member

a long-running service typically cares more about scheduling quality and can use a profile with a high threshold, while a large batch job wants a quick turnaround and can use a profile with a lower threshold for faster scheduling.

This requirement looks reasonable to me.

/cc @alculquicondor

@ahg-g
Member

ahg-g commented Jul 21, 2020

Moving PercentageOfNodesToScore to inside the profile config makes sense.

@yuanchen8911
Member Author

Thanks, @Huang-Wei @ahg-g. How should we proceed with it?

@yuanchen8911
Member Author

Moving PercentageOfNodesToScore to inside the profile config makes sense.

We should probably also include related parameters such as minFeasibleNodesToFind. The current global setting is 100.

@alculquicondor
Member

The problem with "moving" is that it increases maintenance, as we have to support 2 API versions for some time. Should we keep the global as well?

@alculquicondor
Member

We should probably also include related parameters such as minFeasibleNodesToFind. The current global setting is 100.

Are you sure 100 is too big for your use case? I prefer we don't add parameters if we don't need them.

@ahg-g
Member

ahg-g commented Jul 21, 2020

the moving in this case shouldn't be difficult (we keep both and the per-profile value takes precedence), and we need to support it for two releases only, right?

@ahg-g
Member

ahg-g commented Jul 21, 2020

for Beta it will be 9 months or 3 releases, whichever is longer.

@alculquicondor
Member

the moving in this case shouldn't be difficult (we keep both and the per-profile value takes precedence), and we need to support it for two releases only, right?

If we keep both, we don't need a new API version.

@ahg-g
Member

ahg-g commented Jul 21, 2020

yeah, we could keep both in Beta, and remove the global one in GA.

@alculquicondor
Member

It is my understanding that the strong preference is that there are no changes between Beta and GA, apart from new fields.

@ahg-g
Member

ahg-g commented Jul 21, 2020

True, but I think in this case it is fine because we are not removing functionality and not changing the config significantly.

@yuanchen8911
Member Author

yuanchen8911 commented Jul 21, 2020

We should probably also include related parameters such as minFeasibleNodesToFind. The current global setting is 100.

Are you sure 100 is too big for your use case? I prefer we don't add parameters if we don't need them.

There are use cases where the number of feasible nodes to find is known a priori. For example, when scheduling a Pod onto a specific node (as with the NodeName plugin) or onto any feasible node, the number of nodes to find is 1. A scalable scheduling algorithm such as Sparrow (https://cs.stanford.edu/~matei/papers/2013/sosp_sparrow.pdf) finds two feasible nodes and then chooses the better one; in that case, the number is 2.

In the above cases, the number of feasible nodes to find is a specific small value. Once the specified number of nodes is found, filtering can stop. An optional per-profile parameter like numFeasibleNodesToFind would be useful (better than minFeasibleNodesToFind).
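
As a rough sketch of what such a knob could look like (numFeasibleNodesToFind is hypothetical here, not an existing KubeSchedulerConfiguration field, and the profile names are illustrative):

apiVersion: kubescheduler.config.k8s.io/v1alpha2
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: pin-to-node-scheduler
  # hypothetical field: stop filtering after the first feasible node is found
  numFeasibleNodesToFind: 1
- schedulerName: sparrow-style-scheduler
  # hypothetical field: find two feasible nodes and let scoring pick the better one
  numFeasibleNodesToFind: 2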

@yuanchen8911
Member Author

The problem with "moving" is that it increases maintenance, as we have to support 2 API versions for some time. Should we keep the global as well?

Keeping the global one makes sense. Local ones override the global one.

@yuanchen8911
Member Author

yeah, we could keep both in Beta, and remove the global one in GA.

I feel keeping the global one would work better. The global one is the default for all profiles and a local one overrides it. It makes a lot of sense and backward compatibility is maintained.
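
A minimal sketch of those semantics, assuming a per-profile field is added alongside the existing global one (profile names are illustrative): the global value acts as the default, and a profile-level value, when set, takes precedence for that profile only.

apiVersion: kubescheduler.config.k8s.io/v1alpha2
kind: KubeSchedulerConfiguration
# global default; applies to any profile that does not set its own value
percentageOfNodesToScore: 50
profiles:
- schedulerName: service-scheduler
  # no per-profile value: inherits the global 50
- schedulerName: batch-scheduler
  # per-profile value overrides the global default for this profile only
  percentageOfNodesToScore: 10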

@Huang-Wei
Member

In the above cases, the number of feasible nodes to find is a specific small value. Once the specified number of nodes is found, filtering can stop.

We can continue the discussion in #86630.

@alculquicondor
Member

There are use cases where the number of feasible nodes to find is known a priori. For example, when scheduling a Pod onto a specific node (as with the NodeName plugin) or onto any feasible node, the number of nodes to find is 1. A scalable scheduling algorithm such as Sparrow (https://cs.stanford.edu/~matei/papers/2013/sosp_sparrow.pdf) finds two feasible nodes and then chooses the better one; in that case, the number is 2.

I know this is true, but the question is whether it's worth adding such an optimization. The current scheduler can already handle 100 pods/s in clusters with 5k nodes. What is your target and how far are we from it?

@yuanchen8911
Member Author

yuanchen8911 commented Jul 21, 2020

There are use cases where the number of feasible nodes to find is known a priori. For example, when scheduling a Pod onto a specific node (as with the NodeName plugin) or onto any feasible node, the number of nodes to find is 1. A scalable scheduling algorithm such as Sparrow (https://cs.stanford.edu/~matei/papers/2013/sosp_sparrow.pdf) finds two feasible nodes and then chooses the better one; in that case, the number is 2.

I know this is true, but the question is whether it's worth adding such an optimization. The current scheduler can already handle 100 pods/s in clusters with 5k nodes. What is your target and how far are we from it?

A salient feature of the scheduling framework and its plugins is the ability to customize scheduling for specific workloads and use cases. Providing flexible mechanisms is therefore important for building customizations that meet the needs of diverse workloads.

For scheduling a huge number of small batch jobs/tasks in very large clusters, 100 pods/second may not be enough. Hadoop YARN's throughput is in the high hundreds of tasks per second, and we are aware of batch systems that can schedule more than 1,000 tasks per second. Scheduling scalability can become (if it has not already) a roadblock to running batch jobs on Kubernetes at scale. For the (sub-second) tiny tasks described in the Sparrow paper, even 1,000 pods/second might not be adequate.

Adding a parameter like numFeasibleNodes to each profile is straightforward. It would be worthwhile if it facilitates the development of advanced scheduling such as scalable scheduling. Which optimizations or algorithms to use would be a separate problem.

@yuanchen8911
Member Author

If there's still a concern, we can just add percentageOfNodesToScore as a profile parameter first.

@alculquicondor
Member

@SataQiu are you no longer working on this?

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 31, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 2, 2021
@Jerry-Ge
Contributor

@SataQiu are you no longer working on this?

Is there anyone still working on this issue? I may help.

@Jerry-Ge
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 26, 2021
@alculquicondor
Member

alculquicondor commented Mar 29, 2021

It doesn't look like there is any interest in this anymore

/close

@k8s-ci-robot
Contributor

@alculquicondor: Closing this issue.

In response to this:

It doesn't look like there is any interest in this anymore

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@yuanchen8911
Member Author

Hi @ahg-g @alculquicondor, can we reopen this issue? We discussed the proposed change and everyone agreed with it last year. An implementation PR was filed by @SataQiu, but it was incomplete and was closed for some reason (#97263).

We'd like to submit a PR for it. One use case is to better support workload-specific scheduling with soft InterPodAffinity requirements. With the default percentageOfNodesToScore, the feasible nodes returned by the Filter plugins for different pods may be disjoint sets, so there is no chance the scoring plugins can co-locate the pods on the same node. We want to increase percentageOfNodesToScore for these workloads to improve their chance of co-location. In a multi-tenant cluster, we prefer to raise percentageOfNodesToScore for certain workloads in a separate scheduling profile only, without changing the global value and affecting other workloads. In this case, a per-profile percentageOfNodesToScore will be very useful.

/cc @Huang-Wei
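
For illustration, a hedged sketch of how this use case might be configured once a per-profile field is available (the API version and the colocation-scheduler profile name here are assumptions, not the merged design):

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  # other tenants keep the default behavior
- schedulerName: colocation-scheduler
  # assumed per-profile field: consider all nodes so pods with soft
  # InterPodAffinity have a better chance of landing on the same node
  percentageOfNodesToScore: 100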

@alculquicondor
Member

/reopen

It looks like it was closed by the author because they couldn't continue working on it. I think the agreement was to remove the original percentageOfNodesToScore, but we can no longer do that because the configuration API is now stable.

In any case, we can still add the new field, but we need to properly document what happens when the field outside of the profile is also set.

@k8s-ci-robot k8s-ci-robot reopened this Sep 14, 2022
@k8s-ci-robot
Contributor

@alculquicondor: Reopened this issue.

In response to this:

/reopen

It looks like it was closed by the author because they couldn't continue working on it. I think the agreement was to remove the original percentageOfNodesToScore, but we can no longer do that because the configuration API is now stable.

In any case, we can still add the new field, but we need to properly document what happens when the field outside of the profile is also set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Sep 14, 2022
@Huang-Wei
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 15, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 14, 2022
@kerthcet
Member

Once #112521 is merged, are there any other configurations requested for now?

@Huang-Wei
Member

We can close it now.

/close

@k8s-ci-robot
Contributor

@Huang-Wei: Closing this issue.

In response to this:

We can close it now.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
