Efficient re-queueing of unschedulable workloads #8

Closed
ahg-g opened this issue Feb 17, 2022 · 25 comments

@ahg-g
Contributor

ahg-g commented Feb 17, 2022

Currently we relentlessly keep trying to schedule jobs.

We need to do something similar to what we did in the scheduler: re-queue based on capacity/workload/queue events.

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 17, 2022
@ahg-g ahg-g added priority/backlog Higher priority than priority/awaiting-more-evidence. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Feb 18, 2022
@denkensk
Member

/assign
List the parts that may need to be developed:

  1. Add an UnschedulableQ to store workloads that failed to schedule (a rough sketch follows the list).
  2. Support re-queueing on capacity updates (including capacities in the same Cohort).
  3. Support re-queueing on workload update/delete (in the same Cohort); a conditional check is required.
  4. Support re-queueing on queue events? Maybe when another queue is deleted?
  5. A backoff mechanism is needed.
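
A rough sketch of what such an UnschedulableQ could look like, keyed by Cohort so that an event can move back exactly the workloads it might have made schedulable (all names here are hypothetical placeholders, not actual Kueue types):

```go
package sketch

import "sync"

// workloadKey identifies a queued workload; a placeholder, not a Kueue type.
type workloadKey string

// unschedulableQueue parks workloads whose last scheduling attempt failed,
// grouped by Cohort so that a capacity or workload event in that Cohort can
// move back exactly the workloads it might have made schedulable.
type unschedulableQueue struct {
	mu       sync.Mutex
	byCohort map[string]map[workloadKey]struct{}
}

func newUnschedulableQueue() *unschedulableQueue {
	return &unschedulableQueue{byCohort: map[string]map[workloadKey]struct{}{}}
}

// insert records a workload that failed to schedule.
func (u *unschedulableQueue) insert(cohort string, key workloadKey) {
	u.mu.Lock()
	defer u.mu.Unlock()
	if u.byCohort[cohort] == nil {
		u.byCohort[cohort] = map[workloadKey]struct{}{}
	}
	u.byCohort[cohort][key] = struct{}{}
}

// popCohort removes and returns every workload parked for a cohort; the
// caller pushes them back to the active queue when a relevant event arrives
// (capacity update in the Cohort, workload deletion, etc.).
func (u *unschedulableQueue) popCohort(cohort string) []workloadKey {
	u.mu.Lock()
	defer u.mu.Unlock()
	keys := make([]workloadKey, 0, len(u.byCohort[cohort]))
	for k := range u.byCohort[cohort] {
		keys = append(keys, k)
	}
	delete(u.byCohort, cohort)
	return keys
}
```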

@ahg-g
Contributor Author

ahg-g commented Feb 21, 2022

good initial list, but can we focus on the important-soon ones first? they are relatively simple enhancements with big impact: https://github.com/kubernetes-sigs/kueue/issues?q=is%3Aissue+is%3Aopen+label%3Apriority%2Fimportant-soon

@ahg-g ahg-g added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Feb 22, 2022
@ahg-g
Contributor Author

ahg-g commented Feb 22, 2022

Actually this is important to prevent starvation. A short-term fix is to re-queue based on the last scheduling attempt just to make the system usable, but we should start working on event-based re-queueing now.

I think the solution should ideally avoid time-based backoff: a workload should not be retried at all unless there was an event that could potentially make it schedulable.

@denkensk
Member

put the comments from @ahg-g here #46 (comment)

@denkensk
Member

I think that, in order to avoid blocking the development of other features, we can start with the simple fix you suggested: use the last scheduling attempt timestamp to sort the workloads in the queue first (roughly as sketched below).

Then I will start working on event-driven re-queueing. But that alone is not enough: starvation comes in different forms, like older workloads blocking newer ones (something we can solve with backoff) and many small jobs blocking a large one (a more specific design is needed there).
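
A rough sketch of that ordering (hypothetical field names, not the actual Kueue types):

```go
package sketch

import "time"

// queuedWorkload is a hypothetical wrapper around a workload in a queue.
type queuedWorkload struct {
	CreationTime       time.Time
	LastScheduleFailed time.Time // zero if the workload was never attempted
}

// less orders the queue: workloads whose last failed scheduling attempt is
// older come first (never-attempted workloads sort earliest), falling back to
// creation time, so a workload that just failed moves behind fresher ones
// instead of blocking the head forever.
func less(a, b *queuedWorkload) bool {
	if !a.LastScheduleFailed.Equal(b.LastScheduleFailed) {
		return a.LastScheduleFailed.Before(b.LastScheduleFailed)
	}
	return a.CreationTime.Before(b.CreationTime)
}
```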

@ahg-g
Contributor Author

ahg-g commented Feb 22, 2022

ok, can you start a doc discussing the possible scenarios of starvation so we can evaluate possible solutions?

@denkensk
Member

ok, can you start a doc discussing the possible scenarios of starvation so we can evaluate possible solutions?

Okay, I will draft a doc for discussion.

@alculquicondor
Contributor

use the last scheduling attempt timestamp to sort the workloads in the queue firstly.

I think this should be an option. There are users of kube-scheduler that wanted strict FIFO guarantees. kubernetes/kubernetes#83834

@ahg-g
Contributor Author

ahg-g commented Feb 22, 2022

Strict FIFO as in: even if the next job is schedulable, it shouldn't be assigned until the head is?

Another temporary solution is to take a snapshot of the queues and continue to process them as long as the capacity snapshot hasn't changed (we need a generation id to track that). This would force FIFO as long as the jobs are schedulable.
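
Roughly like this (placeholder names, just to illustrate the generation check):

```go
package sketch

// snapshotScheduler drains a fixed snapshot of the queue and only takes a
// fresh one when the capacity generation changes, so ordering stays FIFO as
// long as the workloads in the snapshot are schedulable.
type snapshotScheduler struct {
	lastGeneration int64
	pending        []string // workload keys from the last queue snapshot
}

// cycle runs one scheduling pass. capacityGeneration increments whenever
// capacity is added or a running workload is deleted; snapshotQueue returns
// the current queue contents, excluding re-queued unschedulable jobs.
func (s *snapshotScheduler) cycle(capacityGeneration int64, snapshotQueue func() []string) {
	if capacityGeneration != s.lastGeneration || len(s.pending) == 0 {
		s.pending = snapshotQueue()
		s.lastGeneration = capacityGeneration
	}
	for len(s.pending) > 0 {
		head := s.pending[0]
		s.pending = s.pending[1:]
		_ = head // try to schedule head; on failure it is not re-added to this snapshot
	}
}
```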

@alculquicondor
Contributor

Strict FIFO as in: even if the next job is schedulable, it shouldn't be assigned until the head is?

Correct, because the old job could be starved by a constant influx of small jobs.

Another temporary solution is to take a snapshot of the queues, continue to process them as long as the capacity snapshot didn't change (need a generation id to track that).

Not sure if I understand. We continue to process them today, and that wouldn't change if we add an "unschedulable" staging area.

However, I think we need to distinguish between two types of requeues, given the current algorithm (a rough sketch follows the list):

  • A workload that didn't fit should go through the unschedulable queue.
  • A workload that could have fit but we didn't assign because other queues were pointing to the same capacity or cohort should go back directly to the queue.
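
A rough sketch of the two paths (placeholder types, not the actual queue manager):

```go
package sketch

// requeueReason says why a scheduling attempt for a workload failed; all
// names here are placeholders, not actual Kueue types.
type requeueReason int

const (
	// didNotFit: the workload does not fit in the available capacity.
	didNotFit requeueReason = iota
	// skippedThisCycle: the workload could have fit, but another queue
	// pointing at the same capacity/cohort was handled in this cycle.
	skippedThisCycle
)

type workload struct{ name string }

// fifo is a minimal stand-in for an ordered workload queue.
type fifo struct{ items []*workload }

func (q *fifo) push(w *workload) { q.items = append(q.items, w) }

// requeue routes a workload after a failed attempt: truly unschedulable
// workloads park in the unschedulable queue until a relevant event, while
// workloads skipped only because of intra-cycle ordering go straight back
// to the active queue.
func requeue(w *workload, reason requeueReason, activeQ, unschedulableQ *fifo) {
	switch reason {
	case didNotFit:
		unschedulableQ.push(w)
	case skippedThisCycle:
		activeQ.push(w)
	}
}
```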

@ahg-g
Contributor Author

ahg-g commented Feb 22, 2022

Correct, because the old job could be starved by a constant influx of small jobs.

That would also lead to starvation of smaller jobs by larger jobs as you can see here.

Not sure if I understand. We continue to process them today, and that wouldn't change if we add an "unschedulable" staging area.

We currently re-queue into the same queue that we continue to process, so we keep processing the same head and the next job never gets a scheduling attempt. The queue snapshot wouldn't include the re-queued unschedulable job; so as long as nothing has changed in the capacity snapshot (meaning no new capacity added and no job deleted), we continue to process the jobs from the last queue snapshot until it is done, then we snapshot again.

I am looking to explore quick fixes that strike a reasonable balance and make the current system a little more usable until we flesh out a comprehensive re-queueing solution. In the taints integration test, for example, a job completely blocks another one that could schedule using capacity not usable by the first.

However, I think we need to distinguish between 2 types of requeues, given the current algorithm:

  • A workload that didn't fit should go through the unschedulable queue.
  • A workload that could have fit but we didn't assign because other queues were pointing to the same capacity or cohort should go back directly to the queue.

Right, the latter is a detail of the current implementation; ideally users shouldn't have to care about or tune it, and it should just be handled in a way aligned with the general queueing semantics promised to the user. What we are discussing here is the re-queueing semantics of a completely unschedulable job.

@denkensk
Member

I think this should be an option. There are users of kube-scheduler that wanted strict FIFO guarantees. kubernetes/kubernetes#83834

Strict FIFO guarantees make sense when all jobs request the same kinds of resources with the same constraints: if we let a new, small job schedule successfully, it may starve an old, big job. But things get more complicated when, say, the old job requests GPUs and the new one only requests CPU. The new job won't influence the scheduling of the old one, so if the new one is schedulable but can't be scheduled only because it's not the head, we waste CPU resources in the cluster. This is very similar to the problem we are facing in #46: strict FIFO is guaranteed at the expense of cluster resource utilization.

Another temporary solution is to take a snapshot of the queues, continue to process them as long as the capacity snapshot didn't change (need a generation id to track that). This will force FIFO as long as the jobs are schedulable.

We currently re-queue into the same queue that we continue to process, so we keep processing the same head and the next job never gets a scheduling attempt. The queue snapshot wouldn't include the re-queued unschedulable job; so as long as nothing has changed in the capacity snapshot (meaning no new capacity added and no job deleted), we continue to process the jobs from the last queue snapshot until it is done, then we snapshot again.

This is probably the same idea as adding an unschedulableQ. We don't include the re-queued unschedulable jobs in the snapshot, which is like putting them in an unschedulableQ, and after new capacity is added or a job is deleted we snapshot again, which is like moving all unschedulable jobs that belong to the same Cohort back to the activeQ. I simulated the implementation and felt that using a snapshot might be more complicated or perform worse, because we at least need a deep copy of the current queue and a version id to track the snapshot.

The easiest and fastest fix I can think of is the one Abdullah mentioned at the beginning: use the last scheduling attempt timestamp to sort the workloads in the queue, as in the default scheduler. This one at least doesn't make people feel weird.

@alculquicondor
Contributor

That would also lead to starvation of smaller jobs by larger jobs

That's working as intended, if the user wants it. I think we should at least allow strict FIFO at the queue level.

But if things get a little more complicated like the old one requests gpu card and the new one only requests cpu.

Maybe you should use a different queue for GPUs?

The easiest and fastest fix I can think of is the one Abdullah mentioned at the beginning: use the last scheduling attempt timestamp to sort the workloads in the queue, as in the default scheduler. This one at least doesn't make people feel weird.

It does: kubernetes/kubernetes#83834

But I'm fine with it as long as it's an option.
The next step would be to add the unschedulableQ and add elements back to regular queues based on events. Maybe we won't ever need a backoffQ.

@ahg-g
Contributor Author

ahg-g commented Feb 23, 2022

Maybe you should use a different queue for GPUs?

I am not sure; I feel this pattern will lead to creating many queues and complicate the job user experience (which queue should I submit to?), and it is not something we should recommend for dealing with taints.

I am fine with making the change as an option; it seems we are all on the same page that this is a temporary "fix" while we work out the details of a more comprehensive re-queueing solution, which I think is better discussed over a Google doc. I would like to see a list of the use cases we intend to cover, because there will likely be conflicting ones that require managing via knobs on either the capacity or the queue objects (and perhaps a cohort one).

@denkensk
Member

I'm now working on providing the quick fix first, but I need suggestions on how to make the option configurable, e.g. CreateTimeFIFO / EnqueueTimeFIFO.

@ahg-g @alculquicondor There are several possible ways:

  1. a feature gate
  2. a command-line flag
  3. a policy field on the Queue API (see the sketch after this list)
  4. a config file like KubeSchedulerConfiguration (this is definitely needed in the future, as we may have many different strategies in different parts)
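
A rough sketch of option 3 (field and value names are hypothetical, to be settled in API review):

```go
package sketch

// QueueingStrategy selects how workloads in a queue are ordered and retried.
// The field and value names here are hypothetical, not a final API.
type QueueingStrategy string

const (
	// StrictFIFO: workloads are attempted strictly in order; the head blocks
	// everything behind it until it schedules.
	StrictFIFO QueueingStrategy = "StrictFIFO"
	// LastAttemptFIFO: workloads are ordered by last scheduling attempt, so
	// a workload that just failed goes to the back of the queue.
	LastAttemptFIFO QueueingStrategy = "LastAttemptFIFO"
)

// QueueSpec shows where such a field could live on the Queue object.
type QueueSpec struct {
	// Capacity that this queue points to.
	Capacity string `json:"capacity"`
	// QueueingStrategy controls ordering of workloads in this queue.
	QueueingStrategy QueueingStrategy `json:"queueingStrategy,omitempty"`
}
```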

@alculquicondor
Contributor

I vote for 3, maybe sortingPolicy or sortingStrategy

@ahg-g
Contributor Author

ahg-g commented Feb 24, 2022

Yeah, this needs to be a field in the API so that users can easily experiment with it in the initial iterations of Kueue.

@denkensk
Member

denkensk commented Mar 9, 2022

Depends on #97. Work will continue after #97 has merged.

@denkensk
Member

denkensk commented Mar 10, 2022

https://docs.google.com/document/d/1VQ0qxWA-jwgvLq_WYG46OkXWW00O6q7b1BsR_Uv-acs/edit?usp=sharing

@ahg-g @alculquicondor
I drafted a doc on the different queueing strategies in Kueue. We can discuss it in the doc.

@alculquicondor
Contributor

You forgot to open the doc for comments :)

@denkensk
Member

You forgot to open the doc for comments :)

Sorry, opened.

@ahg-g
Contributor Author

ahg-g commented Mar 20, 2022

Hi Alex, what is your execution plan for this issue and expected timelines (not that we are late, just asking :))?

@denkensk
Member

denkensk commented Mar 21, 2022

Hi Abdullah @ahg-g, thanks for the reminder. I think it's better to add an umbrella issue here for tracking progress on this feature. :)

@alculquicondor
Contributor

Let's open a different issue for Balanced (although I would wait for user feedback) and consider this one done.

/close

@k8s-ci-robot
Contributor

@alculquicondor: Closing this issue.

In response to this:

Let's open a different issue for Balanced (although I would wait for user feedback) and consider this one done.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
