
ClusterQueue update events continue to fire forever #902

Closed
tenzen-y opened this issue Jun 26, 2023 · 7 comments · Fixed by #907
Labels: kind/bug (Categorizes issue or PR as related to a bug.)

@tenzen-y
Member

What happened:
Because ClusterQueueStatus.flavorsUsage.resources is written in a random order, ClusterQueue update events continue to fire forever and the number of reconciles explodes.
As a result, kueue-controller-manager significantly increases the load on kube-apiserver and etcd.

What you expected to happen:
ClusterQueues aren't updated if the old and the new statuses report the same usage.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): v1.25
  • Kueue version (use git describe --tags --dirty --always): v0.3.2
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@tenzen-y added the kind/bug label on Jun 26, 2023
@tenzen-y
Member Author

/assign

@tenzen-y
Member Author

The problematic part:

kueue/pkg/cache/cache.go

Lines 867 to 881 in b463573

for rName, rQuota := range flvQuotas.Resources {
    used := flvUsage[rName]
    rUsage := kueue.ResourceUsage{
        Name:  rName,
        Total: workload.ResourceQuantity(rName, used),
    }
    // Enforce `borrowed=0` if the clusterQueue doesn't belong to a cohort.
    if cq.Cohort != nil {
        borrowed := used - rQuota.Nominal
        if borrowed > 0 {
            rUsage.Borrowed = workload.ResourceQuantity(rName, borrowed)
        }
    }
    outFlvUsage.Resources = append(outFlvUsage.Resources, rUsage)
}
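
The usage is built while ranging over a Go map, and Go does not guarantee a stable map iteration order, so the resulting Resources slice can come out in a different order on every reconcile. A minimal standalone sketch of that behavior (not Kueue code, just the underlying Go semantics):

package main

import "fmt"

func main() {
    // Map iteration order is unspecified in Go, so building a slice by
    // ranging over a map yields a different ordering from run to run.
    quotas := map[string]int64{"cpu": 4, "memory": 4, "example.com/gpu": 8}
    var resources []string
    for name := range quotas {
        resources = append(resources, name)
    }
    fmt.Println(resources) // e.g. [memory cpu example.com/gpu], another order next time
}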

@alculquicondor
Contributor

Are we triggering a reconcile even if the usage didn't change?

@tenzen-y
Member Author

tenzen-y commented Jun 26, 2023

Are we triggering a reconcile even if the usage didn't change?

Yes. If the order of ClusterQueueStatus.flavorsUsage.resources changes, it fires an update event.

For example, the following two statuses differ only in the order of resources, yet the diff fires an update event:

status:
...
  flavorsUsage:
  - name: foo
    resources:
    - borrowed: "0"
      name: "cpu"
      total: "4"
    - borrowed: "0"
      name: "memory"
      total: "4"
status:
...
  flavorsUsage:
  - name: foo
    resources:
    - borrowed: "0"
      name: "memory"
      total: "4"
    - borrowed: "0"
      name: "cpu"
      total: "4"
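
The two statuses above carry identical usage, but an order-sensitive comparison still sees them as different, which is why the update fires. A small standalone illustration using reflect.DeepEqual (used here only to demonstrate the slice semantics; the exact comparison Kueue performs may differ):

package main

import (
    "fmt"
    "reflect"
)

type resourceUsage struct {
    Name     string
    Borrowed string
    Total    string
}

func main() {
    oldUsage := []resourceUsage{{Name: "cpu", Borrowed: "0", Total: "4"}, {Name: "memory", Borrowed: "0", Total: "4"}}
    newUsage := []resourceUsage{{Name: "memory", Borrowed: "0", Total: "4"}, {Name: "cpu", Borrowed: "0", Total: "4"}}
    // Same elements, different order: slices compare element by element,
    // so the reordered status looks like a real change.
    fmt.Println(reflect.DeepEqual(oldUsage, newUsage)) // false
}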

@tenzen-y
Member Author

This issue will often happen in HPC clusters, because pods there typically request multiple SR-IOV virtual functions in their resource requests, like this:

resources:
  requests:
    cpu: 256
    memory: 512Gi
    example.com/gpu: 8
    example.com/vf-0: 1
    example.com/vf-1: 1
    example.com/vf-2: 1
    example.com/vf-3: 1
    example.com/vf-4: 1
    example.com/vf-5: 1
    example.com/vf-6: 1
    example.com/vf-7: 1

@tenzen-y changed the title from "ClusterQueue update events continue firing forever" to "ClusterQueue update events continue to fire forever" on Jun 26, 2023
@alculquicondor
Contributor

ah I see. we should probably do an internal alphabetical sort

@tenzen-y
Member Author

tenzen-y commented Jun 26, 2023

ah I see. we should probably do an internal alphabetical sort

Right. I will sort outFlvUsage.Resources by resource name in

func (c *Cache) Usage(cqObj *kueue.ClusterQueue) ([]kueue.FlavorUsage, int, error)

Also, I will apply the same approach to the LocalQueue usage.
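
A rough sketch of that sort, placed after the loop that builds outFlvUsage.Resources (using sort.Slice from the standard library; this illustrates the idea, not the exact change merged in #907):

// Sort per-flavor resource usage by resource name so repeated reconciles
// produce an identical, deterministic status.
sort.Slice(outFlvUsage.Resources, func(i, j int) bool {
    return outFlvUsage.Resources[i].Name < outFlvUsage.Resources[j].Name
})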
