Add a troubleshooting page #1410

Open
alculquicondor opened this issue Dec 5, 2023 · 9 comments
Labels
kind/documentation Categorizes issue or PR as related to documentation.

Comments

@alculquicondor
Contributor

What would you like to be added:

We can start documenting common user errors. For example:

  • Workload is not admitted
    • Check the ClusterQueue (CQ) status and verify its flavors.
  • Job starts but doesn't have any node selectors
    • Check whether the pod template has any resource requests; without them, the Job won't get assigned a flavor. Alternatively, use quota per pod.
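The second check can be sketched in a few lines. This is only an illustration, assuming the Job manifest has already been loaded into a Python dict (for example from YAML); the helper name `containers_without_requests` is hypothetical and not part of Kueue.

```python
from typing import Any


def containers_without_requests(job: dict[str, Any]) -> list[str]:
    """Return names of containers in a Job's pod template that declare no resources.

    A container that sets only limits is not reported, because Kubernetes
    defaults its requests to the limits.
    """
    pod_spec = job.get("spec", {}).get("template", {}).get("spec", {})
    missing = []
    for key in ("initContainers", "containers"):
        for container in pod_spec.get(key) or []:
            resources = container.get("resources") or {}
            if not (resources.get("requests") or resources.get("limits")):
                missing.append(container.get("name", "<unnamed>"))
    return missing


# Illustrative manifest: the names and values are made up for this example.
job = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {"name": "trainer", "resources": {"requests": {"cpu": "2"}}},
                    {"name": "sidecar"},  # no requests: flagged by the check
                ]
            }
        }
    }
}
print(containers_without_requests(job))
```

An empty result means every container declares requests (or limits), so a missing node selector would have to be explained by something else.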

Why is this needed:

I think one of these scenarios was reported in #1407.

/kind documentation

@alculquicondor alculquicondor added the kind/feature Categorizes issue or PR as related to a new feature. label Dec 5, 2023
@k8s-ci-robot k8s-ci-robot added the kind/documentation Categorizes issue or PR as related to documentation. label Dec 5, 2023
@alculquicondor
Contributor Author

@tenzen-y @kerthcet have you seen any common user errors?

@kerthcet
Contributor

kerthcet commented Dec 6, 2023

If Workload is not admitted, check the workload status.
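As a rough sketch of what "check the workload status" could look like in practice, the snippet below lists every condition on a Workload that is not True, together with its reason and message. It assumes the Workload object has been fetched and parsed into a dict with the usual `status.conditions` list; the helper name and the sample reason and message are made up for illustration.

```python
from typing import Any


def pending_reasons(workload: dict[str, Any]) -> list[str]:
    """Format every condition whose status is not "True" as one readable line."""
    conditions = workload.get("status", {}).get("conditions") or []
    lines = []
    for cond in conditions:
        if cond.get("status") != "True":
            lines.append(
                f"{cond.get('type')}={cond.get('status')} "
                f"({cond.get('reason', '')}): {cond.get('message', '')}"
            )
    return lines


# Sample object; the reason and message text are illustrative, not real Kueue output.
workload = {
    "status": {
        "conditions": [
            {
                "type": "QuotaReserved",
                "status": "False",
                "reason": "Pending",
                "message": "example: insufficient quota for cpu",
            }
        ]
    }
}
for line in pending_reasons(workload):
    print(line)
```

An empty list with no conditions at all suggests the Workload has not been processed yet, which is a different problem from a quota shortage.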

Also, I sometimes need to check the feature gates, the way Kubernetes does with kubectl get --raw /metrics | grep kubernetes_feature_enabled. Maybe we should do the same in Kueue. This is not an error, though.
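For reference, the output of that Kubernetes command is in the Prometheus text format, so it can be turned into a lookup table with a small parser. This is a sketch only: the function name and the gate names in the sample are hypothetical, and whether Kueue exposes an equivalent metric is exactly the open question here.

```python
import re

# Matches lines such as: kubernetes_feature_enabled{name="X",stage="ALPHA"} 1
_LINE = re.compile(r"^kubernetes_feature_enabled\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$")
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def enabled_feature_gates(metrics_text: str) -> dict[str, bool]:
    """Map each feature gate name to True (value 1) or False (any other value)."""
    gates = {}
    for line in metrics_text.splitlines():
        match = _LINE.match(line.strip())
        if not match:
            continue  # skip "# HELP" / "# TYPE" comments and unrelated metrics
        labels = dict(_LABEL.findall(match.group("labels")))
        if "name" in labels:
            gates[labels["name"]] = float(match.group("value")) == 1.0
    return gates


# Sample text; the gate names are placeholders, not real feature gates.
sample = """\
# TYPE kubernetes_feature_enabled gauge
kubernetes_feature_enabled{name="ExampleGateA",stage="BETA"} 1
kubernetes_feature_enabled{name="ExampleGateB",stage="ALPHA"} 0
some_other_metric 42
"""
print(enabled_feature_gates(sample))
```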

The version of the integrated component is also something we should consider. I once had a user complain that Kueue was not working with Kubeflow: they had installed Kubeflow 1.7, but the training-operator was 1.6, and we specifically need 1.7. Maybe we can treat this as a special case.

@tenzen-y
Member

tenzen-y commented Dec 6, 2023

Q1. The desired flavor isn't assigned to the Job.
A1. The flavors in a ClusterQueue are evaluated from top to bottom and assigned to jobs. The highest-priority flavor needs to be placed at the top.
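The effect of the ordering can be shown with a deliberately simplified model: one resource, no borrowing, no preemption, and the first listed flavor with enough remaining quota wins. The function and the flavor names below are hypothetical and only mirror the top-to-bottom rule described in the answer, not Kueue's actual implementation.

```python
from typing import Optional


def pick_flavor(flavors: list[tuple[str, int]], request: int) -> Optional[str]:
    """Return the first flavor, in listed order, whose remaining quota covers the request."""
    for name, remaining in flavors:
        if request <= remaining:
            return name
    return None  # no flavor has enough remaining quota


# Both flavors have room, so the listing order alone decides the outcome.
print(pick_flavor([("spot", 10), ("on-demand", 10)], request=4))
print(pick_flavor([("on-demand", 10), ("spot", 10)], request=4))
```

With identical quotas, swapping the two entries changes which flavor the job receives, which is why the preferred flavor belongs at the top of the list.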

Q2. Although a job has been admitted, its pods are pending.
A2. Kueue considers only the quotas defined in ClusterQueues, not actual cluster usage. Please check whether the cluster has free capacity.
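A toy model makes the gap between quota and capacity concrete. The sketch below assumes a single CPU resource and a naive first-fit placement; both helper names and all numbers are invented for illustration and do not reflect how Kueue or the Kubernetes scheduler are implemented.

```python
def admitted_by_quota(request_cpu: float, used_cpu: float, quota_cpu: float) -> bool:
    """Quota check only: does the request fit under the configured quota?"""
    return used_cpu + request_cpu <= quota_cpu


def pods_that_fit(pod_cpu: float, pod_count: int, node_free_cpu: list[float]) -> int:
    """Count how many pods a naive first-fit placement can put on the nodes."""
    free = list(node_free_cpu)  # copy so the caller's list is not modified
    placed = 0
    for _ in range(pod_count):
        for i, cpu in enumerate(free):
            if cpu >= pod_cpu:
                free[i] = cpu - pod_cpu
                placed += 1
                break
    return placed


# 4 pods x 2 CPU fit under a 16 CPU quota, so the job is admitted ...
print(admitted_by_quota(request_cpu=8, used_cpu=0, quota_cpu=16))
# ... but the nodes only have 3 and 1 CPU free, so just one pod can be placed.
print(pods_that_fit(pod_cpu=2, pod_count=4, node_free_cpu=[3, 1]))
```

In this example the job passes the quota check while three of its four pods would stay pending, which is the situation described in the question.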

Q3. Although sequential admission is enabled, not all pods start; only some of them do.
A3. Kueue isn't a pod scheduler. Kueue doesn't guarantee that all pods start at the same time.

@tenzen-y
Member

tenzen-y commented Dec 6, 2023

/remove-kind feature

@k8s-ci-robot k8s-ci-robot removed the kind/feature Categorizes issue or PR as related to a new feature. label Dec 6, 2023
@PBundyra
Contributor

PBundyra commented Feb 7, 2024

/assign

@alculquicondor
Contributor Author

A state diagram of Workload conditions would be useful. Anecdotally, I just got a question from a developer about what QuotaReserved is.
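Until such a diagram exists, a coarse reading of the conditions can be sketched as below. This is an assumption-laden illustration, not Kueue's state machine: it assumes the Workload carries conditions named Finished, Admitted, and QuotaReserved, checks them in that order, and treats everything else as pending. The function name and the phase labels are hypothetical.

```python
from typing import Any


def workload_phase(workload: dict[str, Any]) -> str:
    """Collapse a Workload's conditions into one coarse, illustrative phase label."""
    conditions = workload.get("status", {}).get("conditions") or []
    # Collect the condition types that are currently True.
    true_types = {c.get("type") for c in conditions if c.get("status") == "True"}
    # Assumed precedence: later stages of the lifecycle win over earlier ones.
    for cond_type, phase in (
        ("Finished", "finished"),
        ("Admitted", "admitted"),
        ("QuotaReserved", "quota reserved, not yet admitted"),
    ):
        if cond_type in true_types:
            return phase
    return "pending"


example = {"status": {"conditions": [{"type": "QuotaReserved", "status": "True"}]}}
print(workload_phase(example))
```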

@alculquicondor
Contributor Author

Another common user error: installing the integration (for example JobSet or KubeRay) after installing Kueue.
In that case, Kueue will not monitor these jobs.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 9, 2024
@alculquicondor
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 10, 2024
6 participants