Efficient re-queueing of unschedulable workloads #8
/assign
good initial list, but can we focus on the important-soon ones first? they are relatively simple enhancements with big impact: https://github.com/kubernetes-sigs/kueue/issues?q=is%3Aissue+is%3Aopen+label%3Apriority%2Fimportant-soon
Actually this is important to prevent starvation. A short-term fix is to re-queue based on the last scheduling attempt just to make the system usable, but we should start working on event-based re-queueing now. I think the solution should ideally avoid time-based backoff: a workload should not be retried at all unless there was an event that could potentially make it schedulable.
put the comments from @ahg-g here #46 (comment)
I think that, in order to avoid blocking the development of other features, we can go with the simple fix you suggested: first, sort the workloads in the queue by the last scheduling attempt timestamp. Then I will start working on event-driven re-queueing. But that alone isn't enough. Starvation comes in different forms: older workloads blocking newer ones (something we can address with backoff) and many small jobs blocking a large one (which needs a more specific design).
ok, can you start a doc discussing the possible scenarios of starvation so we can evaluate possible solutions? |
Okay, I will draft a doc for discussion. |
I think this should be an option. There are users of kube-scheduler that wanted strict FIFO guarantees. kubernetes/kubernetes#83834 |
Strict FIFO as in: even if the next job is schedulable, it shouldn't be assigned until the head is? Another temporary solution is to take a snapshot of the queues and continue to process them as long as the capacity snapshot didn't change (we'd need a generation id to track that). This will force FIFO as long as the jobs are schedulable.
Correct, because the old job could be starved by a constant influx of small jobs.
Not sure if I understand. We continue to process them today, and that wouldn't change if we add an "unschedulable" staging area. However, I think we need to distinguish between 2 types of requeues, given the current algorithm:
That would also lead to starvation of smaller jobs by larger jobs as you can see here.
We currently re-queue in the same queue that we continue to process, so we are indefinitely processing the same head, and the next job never gets a scheduling attempt. The queue snapshot wouldn't include the re-queue of the unschedulable job; so as long as nothing has changed in the capacity snapshot (meaning no new capacity added and no job deleted), we continue to process the jobs from the last queue snapshot until done, then we snapshot again. I am looking to explore quick fixes that strike a reasonable balance and make the current system a little more usable until we flesh out a comprehensive re-queueing solution. In the taints integration test, for example, a job completely blocks another that could schedule using capacity not usable by the first.
Right, the latter is a detail of the current implementation that users ideally shouldn't care about or tune; it should just be handled in a way aligned with the general queueing semantics promised to the user. We are discussing the re-queueing semantics of a completely unschedulable job.
Strict FIFO guarantees make sense when all jobs request the same kinds of resources with the same constraints. If we let a new, small job be scheduled first, it may starve an old, big job. But things get more complicated when, say, the old job requests GPUs and the new one only requests CPU: the new job wouldn't influence the scheduling of the old one, so if the new one is schedulable but can't be scheduled just because it isn't at the head, cluster CPU goes to waste. This is very similar to the problem we are facing in #46. Strict FIFO is guaranteed at the expense of cluster resource utilization.
This is probably the same idea as adding an unschedulableQ. We don't include the re-queue of the unschedulable job in the snapshot, which is like putting it in an unschedulableQ; and after new capacity is added or a job is deleted, we snapshot again, which is like moving all unschedulable jobs belonging to the same Cohort back to the activeQ. I simulated the implementation and felt that using a snapshot might be more complicated or perform worse, because we at least need a deep copy of the current queue plus a version id to track the snapshot. The easiest and fastest fix I can think of is the one Abdullah mentioned at the beginning: sort the workloads in the queue by the last scheduling attempt timestamp, as the default scheduler does. That one at least doesn't surprise anyone.
That's working as intended, if the user wants it. I think we should at least allow strict FIFO at the queue level.
Maybe you should use a different queue for GPUs?
It does: kubernetes/kubernetes#83834. But I'm fine with it as long as it's an option.
I am not sure; I feel this pattern will lead to creating many queues and complicates the job user experience (which queue should I submit to?), and it is not something we should be recommending to deal with taints. I am fine with making the change as an option; it seems we are all on the same page that this is a temporary "fix" while we work out the details of a more comprehensive re-queueing solution, which I think is better discussed over a Google doc. I would like to see a list of the use cases we intend to cover, because there will likely be conflicting ones that will require managing via knobs on either the capacity or the queue objects (and perhaps a cohort one).
I'm now working on providing a quick fix first, but I need suggestions on how to make the option configurable, e.g. CreateTimeFIFO / EnqueueTimeFIFO @ahg-g @alculquicondor There are maybe several ways here:
I vote for 3, maybe
yeah, this needs to be a field in the API so that it is easy to experiment with by users in the initial iterations of Kueue. |
https://docs.google.com/document/d/1VQ0qxWA-jwgvLq_WYG46OkXWW00O6q7b1BsR_Uv-acs/edit?usp=sharing @ahg-g @alculquicondor
You forgot to open the doc for comments :) |
Sorry opened. |
Hi Alex, what is your execution plan for this issue and expected timelines (not that we are late, just asking :))? |
Hi Abdullah @ahg-g Thanks for the reminder. I think it's better to add an umbrella issue here for tracking progress on this feature. :)
Let's open a different issue for Balanced (although I would wait for user feedback) and consider this one done. /close |
@alculquicondor: Closing this issue. In response to this:
Currently we relentlessly keep trying to schedule jobs.
We need to do something similar to what we did in the scheduler: re-queue based on capacity/workload/queue events.
/kind feature