ClusterQueue not considering next ResourceFlavor when whenCanPreempt: Preempt is set #1344

Closed
nstogner opened this issue Nov 17, 2023 · 6 comments · Fixed by #1366
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@nstogner
Contributor

What happened:

Setting whenCanPreempt: Preempt in a ClusterQueue with 2 resource flavors appears to stop the ClusterQueue from considering the 2nd resource flavor.

How to reproduce it (as minimally and precisely as possible):

  • Create a ClusterQueue with 2 resource flavors.
  • Give each resource flavor quota for 2 CPUs.
  • Submit 3 jobs that require 1 CPU each (via the Job or plain Pod API). 2 are admitted and 1 is not.

NOTE: ClusterQueue is configured as follows:

  preemption:
    reclaimWithinCohort: Any
    withinClusterQueue: LowerOrNewerEqualPriority
  flavorFungibility:
    whenCanBorrow: Borrow
    whenCanPreempt: Preempt
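
For anyone trying to reproduce this, a fuller set of manifests consistent with the steps above might look like the sketch below. These are not the exact manifests from the report: only the flavor name flav-1 (from the Workload condition message) and the preemption and flavorFungibility settings come from this issue. The name flav-2, the queue names, the namespace, the Job spec, and the absence of a cohort are all assumptions.

```yaml
# Sketch only: every name except flav-1 is hypothetical.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: flav-1
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: flav-2            # assumed name for the second flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue     # assumed name
spec:
  namespaceSelector: {}   # admit workloads from all namespaces
  resourceGroups:
  - coveredResources: ["cpu"]
    flavors:
    - name: flav-1
      resources:
      - name: cpu
        nominalQuota: 2
    - name: flav-2
      resources:
      - name: cpu
        nominalQuota: 2
  preemption:
    reclaimWithinCohort: Any
    withinClusterQueue: LowerOrNewerEqualPriority
  flavorFungibility:
    whenCanBorrow: Borrow
    whenCanPreempt: Preempt
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: default
  name: local-queue       # assumed name
spec:
  clusterQueue: cluster-queue
---
# Create this Job three times; each instance requests 1 cpu.
apiVersion: batch/v1
kind: Job
metadata:
  generateName: sample-job-
  namespace: default
  labels:
    kueue.x-k8s.io/queue-name: local-queue
spec:
  suspend: true           # Kueue unsuspends the Job once it is admitted
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox
        command: ["sleep", "60"]
        resources:
          requests:
            cpu: "1"
```

Because the Job uses generateName, it has to be submitted with kubectl create rather than kubectl apply. With 2 CPUs of quota in each flavor, the third Job would be expected to land on the second flavor.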

What you expected to happen:

All 3 jobs should be admitted (quota from resource flavor 2 should be used).

Anything else we need to know?:

Condition of non-admitted Workload:

status:
  conditions:
  - lastTransitionTime: "2023-11-17T22:37:24Z"
    message: 'couldn''t assign flavors to pod set main: borrowing limit for cpu in
      flavor flav-1 exceeded'
    reason: Pending
    status: "False"
    type: QuotaReserved

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.4", GitCommit:"fa3d7990104d7c1f16943a67f11b154b71f6a132", GitTreeState:"clean", BuildDate:"2023-07-19T12:20:54Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"darwin/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3-gke.100", GitCommit:"6466b51b762a5c49ae3fb6c2c7233ffe1c96e48c", GitTreeState:"clean", BuildDate:"2023-06-23T09:27:28Z", GoVersion:"go1.20.5 X:boringcrypto", Compiler:"gc", Platform:"linux/amd64"}
  • Kueue version (use git describe --tags --dirty --always): Built from main (commit 4405e35b51bb153611e0a01f48884aa2c131055c - deployed with manifests from 0.5.0)
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@nstogner added the kind/bug label on Nov 17, 2023
@nstogner
Contributor Author

I plan to submit a test case reproducing this issue at my next chance (possibly Monday) if nobody has gotten to it first.

@alculquicondor
Contributor

/assign

@alculquicondor
Contributor

cc @KunWuLuan if you have any ideas.

@KunWuLuan
Contributor

I will try to reproduce the case today.

@KunWuLuan
Contributor

There seems to be a problem involving cq.AllocatableResourceGeneration and wl.LastAssignment.ClusterQueueGeneration. I am working on it. cc @alculquicondor

@KunWuLuan
Contributor

It seems that the cluster queue is updated and cq.AllocatableResourceGeneration is increased after the third job is scheduled. After I added a check on the RGs, the problem was solved.

3 participants