ClusterQueue update events continue to fire forever #902
/assign

The problem part: lines 867 to 881 in b463573.

Are we triggering a reconcile even if the usage didn't change?
Yes. For example, the following two statuses carry identical usage but list the resources in a different order, and that diff fires an update event:

```yaml
status:
  ...
  flavorsUsage:
  - name: foo
    resources:
    - borrowed: "0"
      name: "cpu"
      total: "4"
    - borrowed: "0"
      name: "memory"
      total: "4"
```

```yaml
status:
  ...
  flavorsUsage:
  - name: foo
    resources:
    - borrowed: "0"
      name: "memory"
      total: "4"
    - borrowed: "0"
      name: "cpu"
      total: "4"
```
This issue often happens in HPC clusters, because pods there request multiple SR-IOV virtual functions as resources, like this:

```yaml
resources:
  requests:
    cpu: 256
    memory: 512Gi
    example.com/gpu: 8
    example.com/vf-0: 1
    example.com/vf-1: 1
    example.com/vf-2: 1
    example.com/vf-3: 1
    example.com/vf-4: 1
    example.com/vf-5: 1
    example.com/vf-6: 1
    example.com/vf-7: 1
```
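The reordering itself is easy to hit if the usage slice is built by iterating a Go map, since Go deliberately randomizes map iteration order. A minimal sketch of that behavior, assuming a map keyed by resource name (the actual Kueue code path may differ):

```go
package main

import "fmt"

func main() {
	// Go randomizes map iteration order on purpose. With many keys
	// (e.g. eight VFs plus cpu/memory/gpu), two consecutive reconciles
	// almost never produce the same ordering.
	requests := map[string]string{
		"cpu":              "256",
		"memory":           "512Gi",
		"example.com/gpu":  "8",
		"example.com/vf-0": "1",
		"example.com/vf-1": "1",
		"example.com/vf-2": "1",
		"example.com/vf-3": "1",
	}
	for name := range requests {
		fmt.Println(name) // order differs from run to run
	}
}
```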
Ah, I see. We should probably do an internal alphabetical sort.
Right. I will sort it (line 850 in b463573).
I will also apply the same approach to the LocalQueue usage; see the sketch below.
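A minimal sketch of that sorting fix, reusing the simplified stand-in type from above (`sortUsage` is a hypothetical helper, not Kueue's actual function):

```go
package main

import (
	"fmt"
	"sort"
)

// ResourceUsage is the same illustrative stand-in used above.
type ResourceUsage struct {
	Name     string
	Borrowed string
	Total    string
}

// sortUsage orders the per-flavor resource list alphabetically by name,
// so that two statuses with identical usage always compare equal and
// no spurious update event is generated.
func sortUsage(resources []ResourceUsage) {
	sort.Slice(resources, func(i, j int) bool {
		return resources[i].Name < resources[j].Name
	})
}

func main() {
	usage := []ResourceUsage{
		{Name: "memory", Borrowed: "0", Total: "4"},
		{Name: "cpu", Borrowed: "0", Total: "4"},
	}
	sortUsage(usage)
	fmt.Println(usage) // [{cpu 0 4} {memory 0 4}]
}
```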
What happened:
`ClusterQueueStatus.flavorsUsage.resources` in random order continues to fire ClusterQueue update events forever, and the number of reconciles explodes. As a result, kueue-controller-manager significantly increases the load on kube-apiserver and etcd.
What you expected to happen:
ClusterQueues aren't updated if the old and the new ones have the same usage.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
- Kubernetes version (`kubectl version`): v1.25
- Kueue version (`git describe --tags --dirty --always`): v0.3.2
- OS (`cat /etc/os-release`):
- Kernel (`uname -a`):