Support guaranteed resources in kueue #1224

Closed
3 tasks done
Tracked by #1269
kerthcet opened this issue Oct 19, 2023 · 18 comments

@kerthcet
Contributor

What would you like to be added:

Based on the current implementation, one ClusterQueue's resources can be consumed entirely by another ClusterQueue (within the same cohort). We have borrowingLimit, but as a consumer I want to use as many resources as possible, and it's hard to decide how much I want to borrow. Conversely, as a provider, I want some resources that are not shared with others, so I need a guaranteed resource pool.

Why is this needed:

Better resource management among different teams.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

@kerthcet added the kind/feature label Oct 19, 2023
@alculquicondor
Contributor

Are you suggesting the opposite of BorrowingLimit? That is, a ClusterQueue can say: I only want others to borrow up to x of my nominalQuota?

Otherwise, guaranteed resources are already possible in Kueue, for example:

  • Do not put the CQ in a cohort :)
  • Enable preemption (reclaimWithinCohort: Any)
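
For illustration, the second option might look roughly like the ClusterQueue below. This is a minimal sketch, not a tested manifest: the names team-a-cq and shared-cohort and the quota value are placeholders, and only the preemption setting comes from the comment above.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-cq            # hypothetical name
spec:
  cohort: shared-cohort      # hypothetical cohort shared with other ClusterQueues
  namespaceSelector: {}      # match all namespaces
  preemption:
    # Allow this queue to preempt borrowing workloads in the cohort
    # in order to get its nominal quota back.
    reclaimWithinCohort: Any
  resourceGroups:
  - coveredResources: ["cpu"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 9      # placeholder value
```

With this setup other queues can still borrow idle quota, but the borrowed resources come back through preemption when this queue needs them.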

@kerthcet
Contributor Author

> Are you suggesting the opposite of BorrowingLimit? That is, a ClusterQueue can say: I only want others to borrow up to x of my nominalQuota?

No, what I mean is that I want to reserve resources that no other ClusterQueue can borrow.

> Do not put the CQ in a cohort :)

I do want resource sharing, to improve resource utilization.

@KunWuLuan
Contributor

Workloads in a CQ can reclaim resources borrowed by others through preemption.
Is there any problem with getting the resources back that way?

@kerthcet
Contributor Author

What I mean is I want to reserve some resources that can't be borrowed.

@tenzen-y
Member

You mean that a CQ in guaranteed mode can borrow resources from other CQs, while other CQs cannot take resources from it?

@tenzen-y
Member

It might be useful for serving model use cases.
I think serving needs spare resources so it can scale out ahead of request spikes, so we should guarantee resources.

@alculquicondor
Contributor

> > Are you suggesting the opposite of BorrowingLimit? That is, a ClusterQueue can say: I only want others to borrow up to x of my nominalQuota?
>
> No, what I mean is that I want to reserve resources that no other ClusterQueue can borrow.

Then you would set x=0. I'm trying to generalize.
Is that what you want?

@kerthcet
Contributor Author

kerthcet commented Nov 7, 2023

> Then you would set x=0. I'm trying to generalize.

Setting the borrowing limit to zero means I won't borrow any resources from other ClusterQueues. What I want is to reserve x resources that other ClusterQueues cannot borrow. It would look roughly like this:

  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 9
        borrowingLimit: 5
        guaranteedQuota: 8  # proposed field

This means the ClusterQueue has 9 CPUs of its own and can borrow up to 5 more, for at most 14 CPUs. When other ClusterQueues borrow from it, however, only 9 - 8 = 1 CPU is available to them.

@alculquicondor
Contributor

We are suggesting the same thing but with different names :)
Lending limit means how much you allow others to borrow from you.
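
To make the equivalence concrete, the guaranteedQuota example above could be expressed with a lending limit instead. This is only a sketch under assumptions: the field is written as lendingLimit following the naming in this thread, the ClusterQueue and cohort names are made up, and depending on the Kueue version the field may need a feature gate enabled.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-cq            # hypothetical name
spec:
  cohort: shared-cohort      # hypothetical cohort
  namespaceSelector: {}      # match all namespaces
  resourceGroups:
  - coveredResources: ["cpu"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 9
        borrowingLimit: 5    # this queue may use up to 9 + 5 = 14 CPUs
        lendingLimit: 1      # others may borrow at most 1 CPU, so 9 - 1 = 8 stay reserved
```

In other words, a guaranteed amount of 8 out of a nominal quota of 9 corresponds to a lending limit of 1, and a lending limit of 0 reserves the whole nominal quota.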

@B1F030
Member

B1F030 commented Nov 13, 2023

/assign
I'm working on this, writing a KEP based on LendingLimit.

@kerthcet
Contributor Author

By the way, with lendingLimit introduced, do we still need borrowingLimit? A ClusterQueue can claim its guaranteed resources, and all its neighbors can borrow as much as possible of the rest.

@alculquicondor
Contributor

For backwards compatibility, yes.

I think both knobs are useful anyways.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Feb 13, 2024
@tenzen-y
Member

/remove-lifecycle stale

@B1F030 Could you add documentation?

@k8s-ci-robot removed the lifecycle/stale label Feb 13, 2024
@B1F030
Member

B1F030 commented Feb 18, 2024

> @B1F030 Could you add documentation?

Sure.

@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label May 18, 2024
@kerthcet
Contributor Author

/close
As completed.

@k8s-ci-robot
Contributor

@kerthcet: Closing this issue.

In response to this:

> /close
> As completed.

