scheduler: support scheduling profile-level configuration parameters #93270

Closed
yuanchen8911 opened this issue Jul 20, 2020 · 40 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@yuanchen8911
Member

yuanchen8911 commented Jul 20, 2020

What would you like to be added:

The current scheduler parameters are set in the scheduler configuration file as global settings. Now that the scheduling framework and multiple scheduling profiles have been introduced, it would be useful to support scheduling profile-level parameters, e.g., percentageOfNodesToScore and the related parameter minFeasibleNodesToFind.

Global configuration

apiVersion: kubescheduler.config.k8s.io/v1alpha2
kind: KubeSchedulerConfiguration
...

profiles:
- schedulerName: batch-scheduler
  plugins:
    ...
- schedulerName: service-scheduler
  plugins:
    ...

percentageOfNodesToScore: 50

Per-profile configuration

apiVersion: kubescheduler.config.k8s.io/v1alpha2
kind: KubeSchedulerConfiguration
...

profiles:
- schedulerName: batch-scheduler
  plugins:
    ...
  percentageOfNodesToScore: 10
- schedulerName: service-scheduler
  plugins:
    ...
  percentageOfNodesToScore: 50

Why is this needed:

Scheduling profile-level configuration parameters would provide a simple way to customize scheduling behavior per profile. For example, different percentageOfNodesToScore thresholds can be used to tune performance and strike a better balance between scheduling speed and quality for different workloads: a long-running service typically cares more about scheduling quality and can use a profile with a high threshold, while a large batch job wants a quick turnaround and can use a profile with a lower threshold for faster scheduling.

@yuanchen8911 yuanchen8911 added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 20, 2020
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jul 20, 2020
@k8s-ci-robot
Contributor

@yuanchen8911: The label(s) sig/scheduler cannot be applied, because the repository doesn't have them

In response to this:

/sig scheduler

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@yuanchen8911
Member Author

/sig scheduling

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 20, 2020
@yuanchen8911
Member Author

/assign @Huang-Wei
/assign @ahg-g

@Huang-Wei
Member

a long-running service typically cares more about scheduling quality and can use a profile with a high threshold, while a large batch job wants a quick turnaround and can use a profile with a lower threshold for faster scheduling.

This requirement looks reasonable to me.

/cc @alculquicondor

@ahg-g
Member

ahg-g commented Jul 21, 2020

Moving PercentageOfNodesToScore to inside the profile config makes sense.

@yuanchen8911
Member Author

Thanks, @Huang-Wei @ahg-g. How should we proceed with it?

@yuanchen8911
Member Author

Moving PercentageOfNodesToScore to inside the profile config makes sense.

We should probably also include related parameters such as minFeasibleNodesToFind. The current global setting is 100.

@alculquicondor
Member

The problem with "moving" is that it increases maintenance, as we have to support 2 API versions for some time. Should we keep the global as well?

@alculquicondor
Member

We should probably also include related parameters such as minFeasibleNodesToFind. The current global setting is 100.

Are you sure 100 is too big for your use case? I prefer we don't add parameters if we don't need them.

@ahg-g
Member

ahg-g commented Jul 21, 2020

the moving in this case shouldn't be difficult (we keep both and the per-profile value takes precedence), and we need to support it for two releases only, right?

@ahg-g
Member

ahg-g commented Jul 21, 2020

for Beta it will be 9 months or 3 releases, whichever is longer.

@alculquicondor
Member

the moving in this case shouldn't be difficult (we keep both and the per-profile value takes precedence), and we need to support it for two releases only, right?

If we keep both, we don't need a new API version.

@ahg-g
Member

ahg-g commented Jul 21, 2020

yeah, we could keep both in Beta, and remove the global one in GA.

@alculquicondor
Member

It is my understanding that the strong preference is that there are no changes between Beta and GA, apart from new fields.

@ahg-g
Member

ahg-g commented Jul 21, 2020

True, but I think in this case it is fine because we are not removing functionality and not changing the config significantly.

@yuanchen8911
Member Author

yuanchen8911 commented Jul 21, 2020

We should probably also include related parameters such as minFeasibleNodesToFind. The current global setting is 100.

Are you sure 100 is too big for your use case? I prefer we don't add parameters if we don't need them.

There are use cases where the number of feasible nodes to find is known a priori. For example, when scheduling a Pod onto a specific node (as with the NodeName plugin) or onto any feasible node, the number of nodes to find is 1. A scalable scheduling algorithm such as Sparrow (https://cs.stanford.edu/~matei/papers/2013/sosp_sparrow.pdf) finds two feasible nodes and then chooses the better one; in that case, the number is 2.

In the above cases, the number of feasible nodes to find is a specific small value. Once the specified number of nodes is found, filtering can stop. An optional per-profile parameter like numFeasibleNodesToFind would be useful (better than minFeasibleNodesToFind).
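
As a rough sketch of what such a knob could look like (numFeasibleNodesToFind is hypothetical here, not an existing KubeSchedulerConfiguration field, and the profile names are illustrative):

apiVersion: kubescheduler.config.k8s.io/v1alpha2
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: pin-to-node-scheduler
  # hypothetical field: stop filtering after the first feasible node is found
  numFeasibleNodesToFind: 1
- schedulerName: sparrow-style-scheduler
  # hypothetical field: find two feasible nodes and let scoring pick the better one
  numFeasibleNodesToFind: 2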

@yuanchen8911
Member Author

The problem with "moving" is that it increases maintenance, as we have to support 2 API versions for some time. Should we keep the global as well?

Keeping the global one makes sense. Local ones override the global one.

@yuanchen8911
Member Author

yeah, we could keep both in Beta, and remove the global one in GA.

I feel keeping the global one would work better. The global one is the default for all profiles and a local one overrides it. It makes a lot of sense and backward compatibility is maintained.
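
A minimal sketch of those semantics, assuming a per-profile field is added alongside the existing global one (profile names are illustrative): the global value acts as the default, and a profile-level value, when set, takes precedence for that profile only.

apiVersion: kubescheduler.config.k8s.io/v1alpha2
kind: KubeSchedulerConfiguration
# global default; applies to any profile that does not set its own value
percentageOfNodesToScore: 50
profiles:
- schedulerName: service-scheduler
  # no per-profile value: inherits the global 50
- schedulerName: batch-scheduler
  # per-profile value overrides the global default for this profile only
  percentageOfNodesToScore: 10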

@Huang-Wei
Member

In the above cases, the number of feasible nodes to find is a specific small value. Once the specified number of nodes is found, filtering can stop.

We can continue the discussion in #86630.

@alculquicondor
Member

There are use cases where the number of feasible nodes to find is known a priori. For example, when scheduling a Pod onto a specific node (as with the NodeName plugin) or onto any feasible node, the number of nodes to find is 1. A scalable scheduling algorithm such as Sparrow (https://cs.stanford.edu/~matei/papers/2013/sosp_sparrow.pdf) finds two feasible nodes and then chooses the better one; in that case, the number is 2.

I know this is true, but the question is whether it's worth adding such an optimization. The current scheduler can already handle 100 pods/s in clusters with 5k nodes. What is your target and how far are we from it?

@yuanchen8911
Member Author

yuanchen8911 commented Jul 21, 2020

There are use cases where the number of feasible nodes to find is known a priori. For example, when scheduling a Pod onto a specific node (as with the NodeName plugin) or onto any feasible node, the number of nodes to find is 1. A scalable scheduling algorithm such as Sparrow (https://cs.stanford.edu/~matei/papers/2013/sosp_sparrow.pdf) finds two feasible nodes and then chooses the better one; in that case, the number is 2.

I know this is true, but the question is whether it's worth adding such an optimization. The current scheduler can already handle 100 pods/s in clusters with 5k nodes. What is your target and how far are we from it?

A salient feature of the scheduling framework and its plugins is the ability to customize scheduling for specific workloads and use cases. Providing flexible mechanisms is therefore important for building customizations that meet the needs of diverse workloads.

For scheduling a huge number of small batch jobs/tasks in very large clusters, 100 pods/second may not be enough. Hadoop YARN's throughput is in the high hundreds of tasks per second, and we are aware of batch systems that can schedule more than 1,000 tasks per second. Scheduling scalability can become (if it has not already) a roadblock to running batch jobs on Kubernetes at scale. For the (sub-second) tiny tasks described in the Sparrow paper, even 1,000 pods/second might not be adequate.

Adding a parameter like numFeasibleNodes to each profile is straightforward. It would be worthwhile if it facilitates the development of advanced scheduling such as scalable scheduling. Which optimizations or algorithms to use would be a separate problem.

@yuanchen8911
Member Author

If there's still a concern, we can just add percentageOfNodesToScore as a profile parameter first.

@alculquicondor
Member

@SataQiu are you no longer working on this?

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 31, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 2, 2021
@Jerry-Ge
Contributor

@SataQiu are you no longer working on this?

Is there anyone still working on this issue? I may help.

@Jerry-Ge
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 26, 2021
@alculquicondor
Member

alculquicondor commented Mar 29, 2021

It doesn't look like there is any interest in this anymore

/close

@k8s-ci-robot
Contributor

@alculquicondor: Closing this issue.

In response to this:

It doesn't look like there is any interest in this anymore

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@yuanchen8911
Member Author

Hi @ahg-g @alculquicondor, can we reopen this issue? We discussed the proposed change and everyone agreed with it last year. An implementation PR was filed by @SataQiu, but it was incomplete and was closed for some reason (#97263).

We'd like to submit a PR for it. One use case is to better support workload-specific scheduling with soft InterPodAffinity requirements. With the default percentageOfNodesToScore, the feasible nodes returned by the Filter plugins for different pods may be disjoint sets, so there is no chance the scoring plugins can co-locate the pods on the same node. We want to increase percentageOfNodesToScore for these workloads to improve their chance of co-location. In a multi-tenant cluster, we prefer to raise percentageOfNodesToScore for certain workloads in a separate scheduling profile only, without changing the global value and affecting other workloads. In this case, a per-profile percentageOfNodesToScore will be very useful.

/cc @Huang-Wei
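
For illustration, a hedged sketch of how this use case might be configured once a per-profile field is available (the API version and the colocation-scheduler profile name here are assumptions, not the merged design):

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  # other tenants keep the default behavior
- schedulerName: colocation-scheduler
  # assumed per-profile field: consider all nodes so pods with soft
  # InterPodAffinity have a better chance of landing on the same node
  percentageOfNodesToScore: 100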

@alculquicondor
Member

/reopen

It looks like it was closed by the author because they couldn't continue working on it. I think the agreement was to remove the original percentageOfNodesToScore, but we can no longer do that because the configuration API is now stable.

In any case, we can still add the new field, but we need to properly document what happens when the field outside of the profile is also set.

@k8s-ci-robot k8s-ci-robot reopened this Sep 14, 2022
@k8s-ci-robot
Contributor

@alculquicondor: Reopened this issue.

In response to this:

/reopen

It looks like it was closed by the author because they couldn't continue working on it. I think the agreement was to remove the original percentageOfNodesToScore, but we can no longer do that because the configuration API is now stable.

In any case, we can still add the new field, but we need to properly document what happens when the field outside of the profile is also set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Sep 14, 2022
@Huang-Wei
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 15, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 14, 2022
@kerthcet
Member

Once #112521 is merged, are there any other configurations requested for now?

@Huang-Wei
Member

We can close it now.

/close

@k8s-ci-robot
Contributor

@Huang-Wei: Closing this issue.

In response to this:

We can close it now.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
