
A summary of Workloads order in queues #168
Closed · Tracked by #974 ... · Fixed by #991
alculquicondor opened this issue Mar 31, 2022 · 16 comments
Labels: kind/feature (Categorizes issue or PR as related to a new feature), kind/ux, lifecycle/frozen (Indicates that an issue or PR should not be auto-closed due to staleness), priority/backlog (Higher priority than priority/awaiting-more-evidence)

Comments

alculquicondor (Contributor) commented Mar 31, 2022

What would you like to be added:

For a pending Workload, I would like to know how many workloads are ahead of it.

Of course, this has to be done in a best-effort fashion, but it could get pretty accurate in a cluster with lots of long-running jobs.

However, this is not trivial to implement, as we use a heap, not a literal queue.

Why is this needed:

To provide some level of prediction to end-users for how long their workload will be queued.

@alculquicondor added the kind/feature label Mar 31, 2022
@ahg-g changed the title from "Include ClusterQueue depth in QueuedWorkload status" to "Include ClusterQueue depth in Workload status" Apr 9, 2022
ahg-g (Contributor) commented Apr 9, 2022

Another concern is that for each update to the queue (new Workload added or removed), many Workloads' statuses would need to be reconciled. This is expensive and can't be batched, since the updates need to be done on individual objects.

Perhaps we can update the Queue status with the order of the workloads, up to a specific limit. The order includes the workload name and its rank in the ClusterQueue.
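
Sketching what such a status could look like (the type and field names here are illustrative only, not an existing Kueue API):

```go
// Illustrative sketch only; not part of the actual Kueue API.
package api

// PendingWorkload identifies a queued workload and its current rank.
type PendingWorkload struct {
	Name      string `json:"name"`
	Namespace string `json:"namespace"`
	// Rank is the position in the ClusterQueue; 1 is the head of the queue.
	Rank int32 `json:"rank"`
}

// ClusterQueueStatus (excerpt) exposing the first N pending workloads.
type ClusterQueueStatus struct {
	// PendingWorkloads lists queued workloads, up to a configured limit,
	// in the order they would be considered for admission.
	PendingWorkloads []PendingWorkload `json:"pendingWorkloads,omitempty"`
}
```

Capping the list at a limit keeps each queue update to a single status write instead of one write per pending Workload.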

@ahg-g added the kind/ux and priority/backlog labels Apr 13, 2022
k8s-triage-robot:

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jul 12, 2022
alculquicondor (Contributor, Author):

/lifecycle frozen

@k8s-ci-robot added the lifecycle/frozen label and removed the lifecycle/stale label Jul 12, 2022
maaft commented Feb 7, 2023

I'm also very interested in this feature. Has there been any work in this direction? Are there workarounds that I can use now?

alculquicondor (Contributor, Author):

No progress yet. The exact number is hard to calculate or even keep accurate, depending on how fast jobs are being created. It might be easier to give a time estimate based on historical data?

Feel free to add proposals.

cc @mimowo

maaft commented Feb 8, 2023

I mean, the obvious solution would be to swap out the heap you mentioned for an actual (distributed) queue.

Out of curiosity, what were the reasons for using a heap data structure when implementing a "kueue"? And what kind of heap are you using? Knowing this, I can think about proposals more concretely.

But another quick suggestion would be to store:

Per Queue:

  • current max position int64 m
  • number of processed elements int64 p
  • when a workload finishes, increment p.

Per Workload:

  • has an index parameter i int64
  • on creation: i = m (also update the queue's max position)

Then, a workload's queue position could be evaluated as pos = i - p.

In this way, we'd only need to update 2 variables per queue instead of every node (see the sketch at the end of this comment).

Of course, one has to figure out what happens when a workload is deleted, but I'm optimistic that a solution can be found. E.g.

  • store indices that were deleted
  • queue position: pos = i - p - number_of_deleted_items_before_the_node
  • some smart mechanism for housekeeping to decrease storage requirements
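
A minimal sketch of this bookkeeping (the names and in-memory representation are mine, just to illustrate the arithmetic; where this state would actually live in Kueue is left open):

```go
package main

import "fmt"

// queueCounters holds the per-queue bookkeeping from the proposal:
// m is the next index to hand out, p counts finished workloads, and
// deleted records indices of workloads removed before they ran.
type queueCounters struct {
	m       int64
	p       int64
	deleted map[int64]bool
}

// enqueue assigns the next index to a newly created workload.
func (q *queueCounters) enqueue() int64 {
	i := q.m
	q.m++
	return i
}

// finish is called when a workload completes.
func (q *queueCounters) finish() { q.p++ }

// remove records the deletion of a workload that never ran.
func (q *queueCounters) remove(i int64) { q.deleted[i] = true }

// position estimates a workload's place in the queue:
// pos = i - p - (number of deleted items with a smaller index).
func (q *queueCounters) position(i int64) int64 {
	var deletedBefore int64
	for d := range q.deleted {
		if d < i {
			deletedBefore++
		}
	}
	return i - q.p - deletedBefore
}

func main() {
	q := &queueCounters{deleted: map[int64]bool{}}
	q.enqueue()      // workload A gets index 0
	b := q.enqueue() // workload B gets index 1
	c := q.enqueue() // workload C gets index 2
	q.finish()       // A finishes
	q.remove(b)      // B is deleted before running
	fmt.Println(q.position(c)) // 2 - 1 - 1 = 0: C is now at the head
}
```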

mimowo (Contributor) commented Feb 8, 2023

@maaft IIUC, the complication with the proposal is that workloads are ordered by priority and creationTimestamp: https://github.com/kubernetes-sigs/kueue/blob/main/pkg/queue/cluster_queue_strict_fifo.go#L48-L58. So for a newly created workload with the highest priority, the position will be pos=1 (not i - p).
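
For reference, that ordering is roughly the following (a simplified restatement, not the actual code behind the link):

```go
package queueorder

import "time"

// lessThan reports whether workload a should be considered before workload b
// under StrictFIFO: higher priority wins, ties are broken by the earlier
// creationTimestamp. Simplified restatement of the linked comparison.
func lessThan(aPriority, bPriority int32, aCreated, bCreated time.Time) bool {
	if aPriority != bPriority {
		return aPriority > bPriority
	}
	return aCreated.Before(bCreated)
}
```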

Two proposals from me (I suppose the ideas can be combined):

  1. extend the kubectl describe command with the computed information. The on-the-fly computation could probably use the cache and would only happen on demand, when kubectl describe is invoked.
  2. a dedicated CRD, say WorkloadQueueOrder, which in its Reconcile loop would update its status with the workload order within a ClusterQueue, similar to what @ahg-g suggested for the Queue status, but extracting the information to a dedicated CRD to avoid conflicts with other BAU updates.

@alculquicondor @ahg-g @maaft WDYT?

alculquicondor (Contributor, Author):

Correct, a heap allows us to efficiently maintain a head that satisfies the ordering criteria: O(log n) insertion, O(1) query of the head. A red-black tree could potentially give us similar performance, but there is no built-in implementation in Go. We probably shouldn't implement our own, but use a well-tested implementation.

A linked-list based queue would probably be too slow in clusters with lots of jobs.
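
To illustrate why the head is cheap to query but arbitrary ranks are not, here is a toy container/heap example using the same priority-then-timestamp ordering (the item fields are simplified stand-ins for the real Workload info, not Kueue's actual types):

```go
package main

import (
	"container/heap"
	"fmt"
	"time"
)

// item is a simplified stand-in for a pending Workload entry.
type item struct {
	name     string
	priority int32
	created  time.Time
}

// workloadHeap orders items by priority (descending), then creation time.
type workloadHeap []item

func (h workloadHeap) Len() int { return len(h) }
func (h workloadHeap) Less(i, j int) bool {
	if h[i].priority != h[j].priority {
		return h[i].priority > h[j].priority
	}
	return h[i].created.Before(h[j].created)
}
func (h workloadHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *workloadHeap) Push(x interface{}) { *h = append(*h, x.(item)) }
func (h *workloadHeap) Pop() interface{} {
	old := *h
	it := old[len(old)-1]
	*h = old[:len(old)-1]
	return it
}

func main() {
	h := &workloadHeap{}
	heap.Push(h, item{name: "a", priority: 0, created: time.Now()})
	heap.Push(h, item{name: "b", priority: 10, created: time.Now()})
	// O(1): the head is always at index 0.
	fmt.Println((*h)[0].name) // "b", the higher-priority workload
	// The rest of the slice is heap-ordered, not sorted, so computing the
	// rank of an arbitrary workload would require draining or sorting it.
}
```

That last point is essentially why exposing a per-workload position is costly: the heap only keeps its head in order.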

Back to proposals from @mimowo:

  1. kubectl describe can't access information from cache or even trigger an on-the-fly computation, unless it's all client-side. Unless I misunderstood the suggestion.
  2. I would not encourage yet another object, for performance reasons of etcd/apiserver. We should probably maintain the status of the Workloads somehow. But ideally it should be best-effort, to avoid consuming valuable QPS.

mimowo (Contributor) commented Feb 8, 2023

  1. kubectl describe can't access information from cache or even trigger an on-the-fly computation, unless it's all client-side. Unless I misunderstood the suggestion.

I was hoping to be able to instantiate a kubectl describe handler for workloads, passing it a pointer to the cache so that it has access. Once created, it would be registered as a handler at the kubectl describe extension point. However, I haven't yet investigated whether the extension point's API actually allows creating such a handler.

ahg-g (Contributor) commented Feb 8, 2023

@mimowo Which cache are you referring to?

mimowo (Contributor) commented Feb 9, 2023

@mimowo Which cache are you referring to?

Kueue's server-side cache. There is server-side printing: https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#additional-printer-columns, but apparently it only allows using JSONPath syntax to compute the value, so that seems like a no-go. I was hoping the API would allow running custom code to compute the value.

alculquicondor (Contributor, Author):

/assign @mimowo

@alculquicondor changed the title from "Include ClusterQueue depth in Workload status" to "A summary of Workloads' depth in queues" Jul 11, 2023
tenzen-y (Member) commented Aug 8, 2023

/reopen

@k8s-ci-robot reopened this Aug 8, 2023
k8s-ci-robot (Contributor):

@tenzen-y: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

alculquicondor (Contributor, Author):

/close

Follow up split in #1657

k8s-ci-robot (Contributor):

@alculquicondor: Closing this issue.

In response to this:

/close

Follow up split in #1657

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
