diff --git a/_attributes/common-attributes.adoc b/_attributes/common-attributes.adoc index fb16a6df39d6..787e5d51b3e0 100644 --- a/_attributes/common-attributes.adoc +++ b/_attributes/common-attributes.adoc @@ -79,7 +79,10 @@ endif::[] :descheduler-operator: Kube Descheduler Operator :cli-manager: CLI Manager Operator :lws-operator: Leader Worker Set Operator -:kueue-prod-name: Red{nbsp}Hat build of Kueue +//Kueue +:kueue-name: Red{nbsp}Hat build of Kueue +:kueue-op: Red Hat Build of Kueue Operator +:ms: Red{nbsp}Hat build of MicroShift (MicroShift) // Backup and restore :launch: image:app-launcher.png[title="Application Launcher"] :mtc-first: Migration Toolkit for Containers (MTC) diff --git a/_topic_maps/_topic_map.yml b/_topic_maps/_topic_map.yml index 7ed495bb482e..d50cc9de345d 100644 --- a/_topic_maps/_topic_map.yml +++ b/_topic_maps/_topic_map.yml @@ -3432,6 +3432,34 @@ Distros: openshift-enterprise Topics: - Name: Overview of AI workloads on OpenShift Container Platform File: index +- Name: Red Hat build of Kueue + Dir: kueue + Distros: openshift-enterprise + Topics: + - Name: Introduction to Red Hat build of Kueue + File: about-kueue + - Name: Release notes + File: release-notes + - Name: Installing Red Hat build of Kueue + File: install-kueue + - Name: Installing Red Hat build of Kueue in a disconnected environment + File: install-disconnected + - Name: Configuring role-based permissions + File: rbac-permissions + - Name: Configuring quotas + File: configuring-quotas + - Name: Managing jobs and workloads + File: managing-workloads + - Name: Using cohorts + File: using-cohorts + - Name: Configuring fair sharing + File: configuring-fairsharing + - Name: Gang scheduling + File: gangscheduling + - Name: Running jobs with quota limits + File: running-kueue-jobs + - Name: Getting support + File: getting-support - Name: Leader Worker Set Operator Dir: leader_worker_set Distros: openshift-enterprise diff --git a/ai_workloads/index.adoc b/ai_workloads/index.adoc index ca7304f3f847..65e4157b2d1a 100644 --- a/ai_workloads/index.adoc +++ b/ai_workloads/index.adoc @@ -15,6 +15,7 @@ include::modules/ai-operators.adoc[leveloffset=+1] [role="_additional-resources"] .Additional resources +* xref:../ai_workloads/kueue/about-kueue.adoc#about-kueue[Introduction to {kueue-name}] * xref:../ai_workloads/leader_worker_set/index.adoc#lws-about[{lws-operator} overview] // Exclude this for now until we can get it reviewed by the RHOAI team diff --git a/ai_workloads/kueue/_attributes b/ai_workloads/kueue/_attributes new file mode 120000 index 000000000000..20cc1dcb77bf --- /dev/null +++ b/ai_workloads/kueue/_attributes @@ -0,0 +1 @@ +../../_attributes/ \ No newline at end of file diff --git a/ai_workloads/kueue/about-kueue.adoc b/ai_workloads/kueue/about-kueue.adoc new file mode 100644 index 000000000000..7008010be355 --- /dev/null +++ b/ai_workloads/kueue/about-kueue.adoc @@ -0,0 +1,52 @@ +:_mod-docs-content-type: ASSEMBLY +include::_attributes/common-attributes.adoc[] +[id="about-kueue"] += Introduction to {kueue-name} +:context: about-kueue + +toc::[] + +{kueue-name} is a Kubernetes-native system that manages access to resources for jobs. +{kueue-name} can determine when a job waits, is admitted to start by creating pods, or should be _preempted_, meaning that active pods for that job are deleted. + +[NOTE] +==== +In the context of {kueue-name}, a job can be defined as a one-time or on-demand task that runs to completion. 
+==== + +{kueue-name} is based on the link:https://kueue.sigs.k8s.io/docs/[Kueue] open source project. + +{kueue-name} is compatible with environments that use heterogeneous, elastic resources. This means that the environment has many different resource types, and those resources are capable of dynamic scaling. + +{kueue-name} does not replace any existing components in a Kubernetes cluster, but instead integrates with the existing Kubernetes API server, scheduler, and cluster autoscaler components. + +{kueue-name} supports all-or-nothing semantics. This means that either an entire job with all of its components is admitted to the cluster, or the entire job is rejected if it does not fit on the cluster. + +// Personas +[id="about-kueue-personas"] +== Personas + +Different personas exist in a {kueue-name} workflow. + +Batch administrators:: Batch administrators manage the cluster infrastructure and establish quotas and queues. +Batch users:: Batch users run jobs on the cluster. Examples of batch users might be researchers, AI/ML engineers, or data scientists. +Serving users:: Serving users run jobs on the cluster. For example, to expose a trained AI/ML model for inference. +Platform developers:: Platform developers integrate {kueue-name} with other software. They might also contribute to the Kueue open source project. + +[id="about-kueue-workflow"] +== Workflow overview + +The {kueue-name} workflow can be described at a high level as follows: + +. Batch administrators create and configure `ResourceFlavor`, `LocalQueue`, and `ClusterQueue` resources. +. User personas create jobs on the cluster. +. The Kubernetes API server validates and accepts job data. +. {kueue-name} admits jobs based on configured options, such as order or quota. It injects affinity into the job by using resource flavors, and creates a `Workload` object that corresponds to each job. +. The applicable controller for the job type creates pods. +. The Kubernetes scheduler assigns pods to a node in the cluster. +. The Kubernetes cluster autoscaler provisions more nodes as required. + +//// +TODO:Add docs explaining different job / workload types +These can be added as we add stories / docs for different use cases +//// diff --git a/ai_workloads/kueue/configuring-fairsharing.adoc b/ai_workloads/kueue/configuring-fairsharing.adoc new file mode 100644 index 000000000000..7b617a8e047c --- /dev/null +++ b/ai_workloads/kueue/configuring-fairsharing.adoc @@ -0,0 +1,18 @@ +:_mod-docs-content-type: ASSEMBLY +include::_attributes/common-attributes.adoc[] +[id="configuring-fairsharing"] += Configuring fair sharing +:context: configuring-fairsharing + +toc::[] + +Fair sharing is a preemption strategy that is used to achieve an equal or weighted share of borrowable resources between the tenants of a cohort. Borrowable resources are the unused nominal quota of all the cluster queues in a cohort. + +You can configure fair sharing by setting the `preemptionPolicy` value in the `Kueue` custom resource (CR) to `FairSharing`. 
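+
+The following example is a minimal sketch of how this setting might appear in a `Kueue` CR, based on the CR format shown in "Creating a `Kueue` custom resource". Adjust the rest of the `spec.config` stanza to match your own configuration.
+
+.Example `Kueue` CR with fair sharing enabled
+[source,yaml]
+----
+apiVersion: kueue.openshift.io/v1
+kind: Kueue
+metadata:
+  name: cluster
+  namespace: openshift-kueue-operator
+spec:
+  managementState: Managed
+  config:
+    preemption:
+      preemptionPolicy: FairSharing
+# ...
+----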
+ +include::modules/kueue-clusterqueue-share-value.adoc[leveloffset=+1] + +[role="_additional-resources"] +[id="additional-resources_{context}"] +== Additional resources +* xref:../../ai_workloads/kueue/install-kueue.adoc#create-kueue-cr_install-kueue[Creating a `Kueue` custom resource] diff --git a/ai_workloads/kueue/configuring-quotas.adoc b/ai_workloads/kueue/configuring-quotas.adoc new file mode 100644 index 000000000000..f44b8c3c99c8 --- /dev/null +++ b/ai_workloads/kueue/configuring-quotas.adoc @@ -0,0 +1,38 @@ +:_mod-docs-content-type: ASSEMBLY +include::_attributes/common-attributes.adoc[] +[id="configuring-quotas"] += Configuring quotas +:context: configuring-quotas + +toc::[] + +As an administrator, you can use {kueue-name} to configure quotas to optimize resource allocation and system throughput for user workloads. +You can configure quotas for compute resources such as CPU, memory, pods, and GPU. + +You can configure quotas in {kueue-name} by completing the following steps: + +. Configure a cluster queue. +. Configure a resource flavor. +. Configure a local queue. + +Users can then submit their workloads to the local queue. + +include::modules/kueue-configuring-clusterqueues.adoc[leveloffset=+1] + +[role="_next-steps"] +[id="clusterqueues-next-steps_{context}"] +.Next steps + +The cluster queue is not ready for use until a xref:../../ai_workloads/kueue/configuring-quotas.adoc#configuring-resourceflavors_configuring-quotas[`ResourceFlavor` object] has also been configured. + +include::modules/kueue-configuring-resourceflavors.adoc[leveloffset=+1] + +include::modules/kueue-configuring-localqueues.adoc[leveloffset=+1] + +include::modules/kueue-configuring-localqueue-defaults.adoc[leveloffset=+1] + +[role="_additional-resources"] +[id="clusterqueues-additional-resources_{context}"] +== Additional resources +* xref:../../ai_workloads/kueue/rbac-permissions.adoc#rbac-permissions[RBAC permissions] +* link:https://kueue.sigs.k8s.io/docs/concepts/cluster_queue/[Kubernetes documentation about cluster queues] diff --git a/ai_workloads/kueue/gangscheduling.adoc b/ai_workloads/kueue/gangscheduling.adoc new file mode 100644 index 000000000000..d6fa36f3fb04 --- /dev/null +++ b/ai_workloads/kueue/gangscheduling.adoc @@ -0,0 +1,27 @@ +:_mod-docs-content-type: ASSEMBLY +include::_attributes/common-attributes.adoc[] +[id="gangscheduling"] += Gang scheduling +:context: gangscheduling + +toc::[] + +Gang scheduling ensures that a group or _gang_ of related jobs only start when all required resources are available. {kueue-name} enables gang scheduling by suspending jobs until the {product-title} cluster can guarantee the capacity to start and execute all of the related jobs in the gang together. This is also known as _all-or-nothing_ scheduling. + +Gang scheduling is important if you are working with expensive, limited resources, such as GPUs. Gang scheduling can prevent jobs from claiming but not using GPUs, which can improve GPU utilization and can reduce running costs. Gang scheduling can also help to prevent issues like resource segmentation and deadlocking. + +include::modules/kueue-configuring-gangscheduling.adoc[leveloffset=+1] + +[role="_additional-resources"] +[id="additional-resources_{context}"] +== Additional resources +* xref:../../ai_workloads/kueue/install-kueue.adoc#create-kueue-cr_install-kueue[Creating a Kueue custom resource] + +//// +// use case - deep learning +One classic example is in deep learning workloads. 
Deep learning frameworks (Tensorflow, PyTorch etc) require all the workers to be running during the training process. + +In this scenario, when you deploy training workloads, all the components should be scheduled and deployed to ensure the training works as expected. + +Gang Scheduling is a critical feature for Deep Learning workloads to enable all-or-nothing scheduling capability, as most DL frameworks requires all workers to be running to start training process. Gang Scheduling avoids resource inefficiency and scheduling deadlock sometimes. +//// diff --git a/ai_workloads/kueue/getting-support.adoc b/ai_workloads/kueue/getting-support.adoc new file mode 100644 index 000000000000..302b9fd71a6e --- /dev/null +++ b/ai_workloads/kueue/getting-support.adoc @@ -0,0 +1,27 @@ +:_mod-docs-content-type: ASSEMBLY +include::_attributes/common-attributes.adoc[] +[id="getting-support"] += Getting support +:context: getting-support + +toc::[] + +If you experience difficulty with a procedure described in this documentation, or with {kueue-name} in general, visit the link:http://access.redhat.com[Red{nbsp}Hat Customer Portal]. + +From the Customer Portal, you can: + +* Search or browse through the Red{nbsp}Hat Knowledgebase of articles and solutions relating to Red{nbsp}Hat products. +* Submit a support case to Red{nbsp}Hat Support. +* Access other product documentation. + +[id="getting-support-rh-kb"] +== About the Red Hat Knowledgebase + +The link:https://access.redhat.com/knowledgebase[Red{nbsp}Hat Knowledgebase] provides rich content aimed at helping you make the most of Red{nbsp}Hat's products and technologies. The Red{nbsp}Hat Knowledgebase consists of articles, product documentation, and videos outlining best practices on installing, configuring, and using Red{nbsp}Hat products. In addition, you can search for solutions to known issues, each providing concise root cause descriptions and remedial steps. + +include::modules/kueue-gathering-cluster-data.adoc[leveloffset=+1] + +[id="getting-support-additional-resources"] +[role="_additional-resources"] +== Additional resources +* xref:../../support/index.adoc#support-overview[Support overview] diff --git a/ai_workloads/kueue/images b/ai_workloads/kueue/images new file mode 120000 index 000000000000..847b03ed0541 --- /dev/null +++ b/ai_workloads/kueue/images @@ -0,0 +1 @@ +../../images/ \ No newline at end of file diff --git a/ai_workloads/kueue/install-disconnected.adoc b/ai_workloads/kueue/install-disconnected.adoc new file mode 100644 index 000000000000..4757cbb4e506 --- /dev/null +++ b/ai_workloads/kueue/install-disconnected.adoc @@ -0,0 +1,31 @@ +:_mod-docs-content-type: ASSEMBLY +include::_attributes/common-attributes.adoc[] +[id="install-disconnected"] += Installing {kueue-name} in a disconnected environment +:context: install-disconnected + +toc::[] + +Before you can install {kueue-name} on a disconnected {product-title} cluster, you must enable {olm-first} in disconnected environments by completing the following steps: + +* Disable the default remote OperatorHub sources for OLM. +* Use a workstation with full internet access to create and push local mirrors of the OperatorHub content to a mirror registry. +* Configure OLM to install and manage Operators from local sources on the mirror registry instead of the default remote sources. + +After enabling OLM in a disconnected environment, you can continue to use your unrestricted workstation to keep your local OperatorHub sources updated as newer versions of Operators are released. 
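+
+For example, the first of these steps is typically performed by patching the `OperatorHub` object to disable the default remote catalog sources. The following command is a sketch of that step only; the linked documentation describes the complete mirroring workflow.
+
+[source,terminal]
+----
+$ oc patch OperatorHub cluster --type json \
+    -p '[{"op": "add", "path": "/spec/disableAllDefaultSources", "value": true}]'
+----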
+ +For full documentation on completing these steps, see the {product-title} documentation on xref:../../disconnected/using-olm.adoc#olm-restricted-networks[Using Operator Lifecycle Manager in disconnected environments]. + +include::modules/kueue-compatible-environments.adoc[leveloffset=+1] + +include::modules/kueue-install-kueue-operator.adoc[leveloffset=+1] + +[role="_additional-resources"] +.Additional resources +* xref:../../security/cert_manager_operator/cert-manager-operator-install.adoc#installing-the-cert-manager-operator-for-red-hat-openshift[Installing the {cert-manager-operator}] + +include::modules/upgrading-kueue.adoc[leveloffset=+1] + +include::modules/kueue-create-kueue-cr.adoc[leveloffset=+1] + +include::modules/kueue-label-namespaces.adoc[leveloffset=+1] diff --git a/ai_workloads/kueue/install-kueue.adoc b/ai_workloads/kueue/install-kueue.adoc new file mode 100644 index 000000000000..bcc16174275a --- /dev/null +++ b/ai_workloads/kueue/install-kueue.adoc @@ -0,0 +1,23 @@ +:_mod-docs-content-type: ASSEMBLY +include::_attributes/common-attributes.adoc[] +[id="install-kueue"] += Installing {kueue-name} +:context: install-kueue + +toc::[] + +You can install {kueue-name} by using the {kueue-op} in OperatorHub. + +include::modules/kueue-compatible-environments.adoc[leveloffset=+1] + +include::modules/kueue-install-kueue-operator.adoc[leveloffset=+1] + +[role="_additional-resources"] +.Additional resources +* xref:../../security/cert_manager_operator/cert-manager-operator-install.adoc#installing-the-cert-manager-operator-for-red-hat-openshift[Installing the {cert-manager-operator}] + +include::modules/upgrading-kueue.adoc[leveloffset=+1] + +include::modules/kueue-create-kueue-cr.adoc[leveloffset=+1] + +include::modules/kueue-label-namespaces.adoc[leveloffset=+1] diff --git a/ai_workloads/kueue/managing-workloads.adoc b/ai_workloads/kueue/managing-workloads.adoc new file mode 100644 index 000000000000..0a4bc3fc1529 --- /dev/null +++ b/ai_workloads/kueue/managing-workloads.adoc @@ -0,0 +1,13 @@ +:_mod-docs-content-type: ASSEMBLY +include::_attributes/common-attributes.adoc[] +[id="managing-workloads"] += Managing jobs and workloads +:context: managing-workloads + +toc::[] + +{kueue-name} does not directly manipulate jobs that are created by users. Instead, Kueue manages `Workload` objects that represent the resource requirements of a job. {kueue-name} automatically creates a workload for each job, and syncs any decisions and statuses between the two objects. + +include::modules/kueue-label-namespaces.adoc[leveloffset=+1] + +include::modules/kueue-configuring-labelpolicy.adoc[leveloffset=+1] diff --git a/ai_workloads/kueue/modules b/ai_workloads/kueue/modules new file mode 120000 index 000000000000..36719b9de743 --- /dev/null +++ b/ai_workloads/kueue/modules @@ -0,0 +1 @@ +../../modules/ \ No newline at end of file diff --git a/ai_workloads/kueue/rbac-permissions.adoc b/ai_workloads/kueue/rbac-permissions.adoc new file mode 100644 index 000000000000..94b36af748d8 --- /dev/null +++ b/ai_workloads/kueue/rbac-permissions.adoc @@ -0,0 +1,26 @@ +:_mod-docs-content-type: ASSEMBLY +include::_attributes/common-attributes.adoc[] +[id="rbac-permissions"] += Configuring role-based permissions +:context: rbac-permissions + +toc::[] + +The following procedures provide information about how you can configure role-based access control (RBAC) for your {kueue-name} deployment. These RBAC permissions determine which types of users can create which types of {kueue-name} objects. 
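+
+After the {kueue-op} is installed, you can confirm that the default cluster roles described in the following section exist by running a command similar to the following. The exact output depends on your cluster, so treat this as a quick check rather than part of the configuration procedure.
+
+[source,terminal]
+----
+$ oc get clusterrole kueue-batch-admin-role kueue-batch-user-role
+----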
+ +[id="authentication-clusterroles"] +== Cluster roles + +The {kueue-name} Operator deploys `kueue-batch-admin-role` and `kueue-batch-user-role` cluster roles by default. + +kueue-batch-admin-role:: This cluster role includes the permissions to manage cluster queues, local queues, workloads, and resource flavors. +kueue-batch-user-role:: This cluster role includes the permissions to manage jobs and to view local queues and workloads. + +include::modules/kueue-configure-rbac-batch-admins.adoc[leveloffset=+1] + +include::modules/kueue-configure-rbac-batch-users.adoc[leveloffset=+1] + +[role="_additional-resources"] +== Additional resources +* xref:../../authentication/using-rbac.adoc#using-rbac[Using RBAC to define and apply permissions] +* xref:../../authentication/index.adoc#openshift-auth-common-terms_overview-of-authentication-authorization[Glossary of common terms for {product-title} authentication and authorization] diff --git a/ai_workloads/kueue/release-notes.adoc b/ai_workloads/kueue/release-notes.adoc new file mode 100644 index 000000000000..b12fe49bde52 --- /dev/null +++ b/ai_workloads/kueue/release-notes.adoc @@ -0,0 +1,17 @@ +:_mod-docs-content-type: ASSEMBLY +include::_attributes/common-attributes.adoc[] +[id="release-notes"] += Release notes +:context: release-notes + +toc::[] + +{kueue-name} is released as an Operator that is supported on {product-title}. + +include::modules/kueue-compatible-environments.adoc[leveloffset=+1] + +include::modules/kueue-release-notes-1.1.adoc[leveloffset=+1] + +include::modules/kueue-release-notes-1.0.1.adoc[leveloffset=+1] + +include::modules/kueue-release-notes-1.0.adoc[leveloffset=+1] diff --git a/ai_workloads/kueue/running-kueue-jobs.adoc b/ai_workloads/kueue/running-kueue-jobs.adoc new file mode 100644 index 000000000000..ecbfede8b51f --- /dev/null +++ b/ai_workloads/kueue/running-kueue-jobs.adoc @@ -0,0 +1,13 @@ +:_mod-docs-content-type: ASSEMBLY +include::_attributes/common-attributes.adoc[] +[id="running-kueue-jobs"] += Running jobs with quota limits +:context: running-kueue-jobs + +toc::[] + +You can run Kubernetes jobs with {kueue-name} enabled to manage resource allocation within defined quota limits. This can help to ensure predictable resource availability, cluster stability, and optimized performance. 
+ +include::modules/kueue-identifying-local-queues.adoc[leveloffset=+1] + +include::modules/kueue-defining-running-jobs.adoc[leveloffset=+1] diff --git a/ai_workloads/kueue/snippets b/ai_workloads/kueue/snippets new file mode 120000 index 000000000000..5a3f5add140e --- /dev/null +++ b/ai_workloads/kueue/snippets @@ -0,0 +1 @@ +../../snippets/ \ No newline at end of file diff --git a/ai_workloads/kueue/troubleshooting.adoc b/ai_workloads/kueue/troubleshooting.adoc new file mode 100644 index 000000000000..6dab1f8bec4f --- /dev/null +++ b/ai_workloads/kueue/troubleshooting.adoc @@ -0,0 +1,31 @@ +:_mod-docs-content-type: ASSEMBLY +include::_attributes/common-attributes.adoc[] +[id="troubleshooting"] += Troubleshooting +:context: troubleshooting + +toc::[] + +// commented out - note for TS docs + +// Troubleshooting installations +// Verifying node health +// Troubleshooting network issues +// Troubleshooting Operator issues +// Investigating pod issues +// Diagnosing CLI issues +//// +Troubleshooting Jobs +Troubleshooting the status of a Job + +Troubleshooting Queues +Troubleshooting the status of a LocalQueue or ClusterQueue + +Troubleshooting Provisioning Request in Kueue +Troubleshooting the status of a Provisioning Request in Kueue + +Troubleshooting Pods +Troubleshooting the status of a Pod or group of Pods + +Troubleshooting delete ClusterQueue +//// diff --git a/ai_workloads/kueue/using-cohorts.adoc b/ai_workloads/kueue/using-cohorts.adoc new file mode 100644 index 000000000000..608dc6c64d9e --- /dev/null +++ b/ai_workloads/kueue/using-cohorts.adoc @@ -0,0 +1,16 @@ +:_mod-docs-content-type: ASSEMBLY +include::_attributes/common-attributes.adoc[] +[id="using-cohorts"] += Using cohorts +:context: using-cohorts + +toc::[] + +You can use cohorts to group cluster queues and determine which cluster queues are able to share borrowable resources with each other. +Borrowable resources are defined as the unused nominal quota of all the cluster queues in a cohort. + +Using cohorts can help to optimize resource utilization by preventing under-utilization and enabling fair sharing configurations. +Cohorts can also help to simplify resource management and allocation between teams, because you can group cluster queues for related workloads or for each team. +You can also use cohorts to set resource quotas at a group level to define the limits for resources that a group of cluster queues can consume. + +include::modules/kueue-clusterqueue-configuring-cohorts-reference.adoc[leveloffset=+1] diff --git a/modules/ai-operators.adoc b/modules/ai-operators.adoc index 20b2fb9aa8fb..957c33d13702 100644 --- a/modules/ai-operators.adoc +++ b/modules/ai-operators.adoc @@ -10,18 +10,16 @@ You can use Operators to run artificial intelligence (AI) and machine learning ( {product-title} provides several Operators that can help you run AI workloads: +{kueue-name}:: +You can use {kueue-name} to provide structured queues and prioritization so that workloads are handled fairly and efficiently. Without proper prioritization, important jobs might be delayed while less critical jobs occupy resources. ++ +For more information, see "Introduction to {kueue-name}". + {lws-operator}:: You can use the {lws-operator} to enable large-scale AI inference workloads to run reliably across nodes with synchronization between leader and worker processes. Without proper coordination, large training runs might fail or stall. + For more information, see "{lws-operator} overview". 
-{kueue-prod-name}:: -You can use {kueue-prod-name} to provide structured queues and prioritization so that workloads are handled fairly and efficiently. Without proper prioritization, important jobs might be delayed while less critical jobs occupy resources. -+ -For more information, see link:https://docs.redhat.com/en/documentation/red_hat_build_of_kueue/latest/html/overview/about-kueue[Introduction to Red Hat build of Kueue] in the {kueue-prod-name} documentation. - -// TODO: Anything else to list yet? - //// Keep for future use (JobSet and DRA) - From Gaurav (PM): AI in OpenShift – Focus Areas diff --git a/modules/kueue-clusterqueue-configuring-cohorts-reference.adoc b/modules/kueue-clusterqueue-configuring-cohorts-reference.adoc new file mode 100644 index 000000000000..5fb8fa4e4530 --- /dev/null +++ b/modules/kueue-clusterqueue-configuring-cohorts-reference.adoc @@ -0,0 +1,25 @@ +// Module included in the following assemblies: +// +// * ai_workloads/kueue/using-cohorts.adoc + +:_mod-docs-content-type: REFERENCE +[id="clusterqueue-configuring-cohorts-reference_{context}"] += Configuring cohorts within a cluster queue spec + +You can add a cluster queue to a cohort by specifying the name of the cohort in the `.spec.cohort` field of the `ClusterQueue` object, as shown in the following example: + +[source,yaml] +---- +apiVersion: kueue.x-k8s.io/v1beta1 +kind: ClusterQueue +metadata: + name: cluster-queue +spec: +# ... + cohort: example-cohort +# ... +---- + +All cluster queues that have a matching `spec.cohort` are part of the same cohort. + +If the `spec.cohort` field is omitted, the cluster queue does not belong to any cohort and cannot access borrowable resources. diff --git a/modules/kueue-clusterqueue-share-value.adoc b/modules/kueue-clusterqueue-share-value.adoc new file mode 100644 index 000000000000..a298573a1916 --- /dev/null +++ b/modules/kueue-clusterqueue-share-value.adoc @@ -0,0 +1,39 @@ +// Module included in the following assemblies: +// +// * ai_workloads/kueue/configuring-fairsharing.adoc + +:_mod-docs-content-type: REFERENCE +[id="clusterqueue-share-value_{context}"] += Cluster queue weights + +After you have enabled fair sharing, you must set share values for each cluster queue before fair sharing can take place. Share values are represented as the `weight` value in a `ClusterQueue` object. + +Share values are important because they allow administrators to prioritize specific job types or teams. Critical applications or high-priority teams can be configured with a weighted value so that they receive a proportionally larger share of the available resources. Configuring weights ensures that unused resources are distributed according to defined organizational or project priorities rather than on a first-come, first-served basis. + +The `weight` value, or share value, defines a comparative advantage for the cluster queue when competing for borrowable resources. Generally, {kueue-name} admits jobs with a lower share value first. Jobs with a higher share value are more likely to be preempted before those with lower share values. 
+ +.Example cluster queue with a fair sharing weight configured +[source,yaml] +---- +apiVersion: kueue.x-k8s.io/v1beta1 +kind: ClusterQueue +metadata: + name: cluster-queue +spec: + namespaceSelector: {} + resourceGroups: + - coveredResources: ["cpu"] + flavors: + - name: default-flavor + resources: + - name: cpu + nominalQuota: 9 + cohort: example-cohort + fairSharing: + weight: 2 +---- + +[id="clusterqueue-share-value-zero_{context}"] +== Zero weight + +A `weight` value of `0` represents an infinite share value. This means that the cluster queue is always at a disadvantage compared to others, and its workloads are always the first to be preempted when fair sharing is enabled. diff --git a/modules/kueue-compatible-environments.adoc b/modules/kueue-compatible-environments.adoc new file mode 100644 index 000000000000..0bf681a6e6ac --- /dev/null +++ b/modules/kueue-compatible-environments.adoc @@ -0,0 +1,34 @@ +// Module included in the following assemblies: +// +// * ai_workloads/kueue/install-kueue.adoc +// * ai_workloads/kueue/install-disconnected.adoc +// * ai_workloads/kueue/release-notes.adoc + +:_mod-docs-content-type: REFERENCE +[id="compatible-environments_{context}"] += Compatible environments + +Before you install {kueue-name}, review this section to ensure that your cluster meets the requirements. + +[id="compatible-environments-arch_{context}"] +== Supported architectures + +{kueue-name} version 1.1 and later is supported on the following architectures: + +* ARM64 +* 64-bit x86 +* ppc64le ({ibm-power-name}) +* s390x ({ibm-z-name}) + +[id="compatible-environments-platforms_{context}"] +== Supported platforms + +{kueue-name} version 1.1 and later is supported on the following platforms: + +* {product-title} +* {hcp-capital} for {product-title} + +[IMPORTANT] +==== +Currently, {kueue-name} is not supported on {ms}. +==== diff --git a/modules/kueue-configure-rbac-batch-admins.adoc b/modules/kueue-configure-rbac-batch-admins.adoc new file mode 100644 index 000000000000..b3a9c205a15e --- /dev/null +++ b/modules/kueue-configure-rbac-batch-admins.adoc @@ -0,0 +1,70 @@ +// Module included in the following assemblies: +// +// * ai_workloads/kueue/rbac-permissions.adoc + +:_mod-docs-content-type: PROCEDURE +[id="configure-rbac-batch-admins_{context}"] += Configuring permissions for batch administrators + +You can configure permissions for batch administrators by binding the `kueue-batch-admin-role` cluster role to a user or group of users. + +.Prerequisites + +include::snippets/prereqs-snippet-yaml-admin.adoc[] + +.Procedure + +. Create a `ClusterRoleBinding` object as a YAML file: ++ +.Example `ClusterRoleBinding` object +[source,yaml] +---- +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRoleBinding +metadata: + name: kueue-admins <1> +subjects: <2> +- kind: User + name: admin@example.com + apiGroup: rbac.authorization.k8s.io +roleRef: <3> + kind: ClusterRole + name: kueue-batch-admin-role + apiGroup: rbac.authorization.k8s.io +---- +<1> Provide a name for the `ClusterRoleBinding` object. +<2> Add details about which user or group of users you want to provide user permissions for. +<3> Add details about the `kueue-batch-admin-role` cluster role. + +. 
Apply the `ClusterRoleBinding` object: ++ +[source,terminal] +---- +$ oc apply -f .yaml +---- + +.Verification + +* You can verify that the `ClusterRoleBinding` object was applied correctly by running the following command and verifying that the output contains the correct information for the `kueue-batch-admin-role` cluster role: ++ +[source,yaml] +---- +$ oc describe clusterrolebinding.rbac +---- ++ +.Example output +[source,terminal] +---- +... +Name: kueue-batch-admin-role +Labels: app.kubernetes.io/name=kueue +Annotations: +Role: + Kind: ClusterRole + Name: kueue-batch-admin-role +Subjects: + Kind Name Namespace + ---- ---- --------- + User admin@example.com admin-namespace +... +---- diff --git a/modules/kueue-configure-rbac-batch-users.adoc b/modules/kueue-configure-rbac-batch-users.adoc new file mode 100644 index 000000000000..b5712830af7f --- /dev/null +++ b/modules/kueue-configure-rbac-batch-users.adoc @@ -0,0 +1,73 @@ +// Module included in the following assemblies: +// +// * ai_workloads/kueue/rbac-permissions.adoc + +:_mod-docs-content-type: PROCEDURE +[id="configure-rbac-batch-users_{context}"] += Configuring permissions for users + +You can configure permissions for {kueue-name} users by binding the `kueue-batch-user-role` cluster role to a user or group of users. + +.Prerequisites + +include::snippets/prereqs-snippet-yaml-admin.adoc[] + +.Procedure + +. Create a `RoleBinding` object as a YAML file: ++ +.Example `ClusterRoleBinding` object +[source,yaml] +---- +apiVersion: rbac.authorization.k8s.io/v1 +kind: RoleBinding +metadata: + name: kueue-users <1> + namespace: user-namespace <2> +subjects: <3> +- kind: Group + name: team-a@example.com + apiGroup: rbac.authorization.k8s.io +roleRef: <4> + kind: ClusterRole + name: kueue-batch-user-role + apiGroup: rbac.authorization.k8s.io + +---- +<1> Provide a name for the `RoleBinding` object. +<2> Add details about which namespace the `RoleBinding` object applies to. +<3> Add details about which user or group of users you want to provide user permissions for. +<4> Add details about the `kueue-batch-user-role` cluster role. + +. Apply the `RoleBinding` object: ++ +[source,terminal] +---- +$ oc apply -f .yaml +---- + +.Verification + +* You can verify that the `RoleBinding` object was applied correctly by running the following command and verifying that the output contains the correct information for the `kueue-batch-user-role` cluster role: ++ +[source,yaml] +---- +$ oc describe rolebinding.rbac +---- ++ +.Example output +[source,terminal] +---- +... +Name: kueue-users +Labels: app.kubernetes.io/name=kueue +Annotations: +Role: + Kind: ClusterRole + Name: kueue-batch-user-role +Subjects: + Kind Name Namespace + ---- ---- --------- + Group team-a@example.com user-namespace +... +---- diff --git a/modules/kueue-configuring-clusterqueues.adoc b/modules/kueue-configuring-clusterqueues.adoc new file mode 100644 index 000000000000..64fee90fb155 --- /dev/null +++ b/modules/kueue-configuring-clusterqueues.adoc @@ -0,0 +1,63 @@ +// Module included in the following assemblies: +// +// * ai_workloads/kueue/configuring-quotas.adoc + +:_mod-docs-content-type: PROCEDURE +[id="configuring-clusterqueues_{context}"] += Configuring a cluster queue + +A cluster queue is a cluster-scoped resource, represented by a `ClusterQueue` object, that governs a pool of resources such as CPU, memory, and pods. +Cluster queues can be used to define usage limits, quotas for resource flavors, order of consumption, and fair sharing rules. 
+ +[NOTE] +==== +The cluster queue is not ready for use until a `ResourceFlavor` object has also been configured. +==== + +.Prerequisites + +include::snippets/prereqs-snippet-yaml.adoc[] + +.Procedure + +. Create a `ClusterQueue` object as a YAML file: ++ +.Example of a basic `ClusterQueue` object using a single resource flavor +[source,yaml] +---- +apiVersion: kueue.x-k8s.io/v1beta1 +kind: ClusterQueue +metadata: + name: cluster-queue +spec: + namespaceSelector: {} # <1> + resourceGroups: + - coveredResources: ["cpu", "memory", "pods", "foo.com/gpu"] # <2> + flavors: + - name: "default-flavor" # <3> + resources: # <4> + - name: "cpu" + nominalQuota: 9 + - name: "memory" + nominalQuota: 36Gi + - name: "pods" + nominalQuota: 5 + - name: "foo.com/gpu" + nominalQuota: 100 +---- +<1> Defines which namespaces can use the resources governed by this cluster queue. An empty `namespaceSelector` as shown in the example means that all namespaces can use these resources. +<2> Defines the resource types governed by the cluster queue. This example `ClusterQueue` object governs CPU, memory, pod, and GPU resources. +<3> Defines the resource flavor that is applied to the resource types listed. In this example, the `default-flavor` resource flavor is applied to CPU, memory, pod, and GPU resources. +<4> Defines the resource requirements for admitting jobs. This example cluster queue only admits jobs if the following conditions are met: ++ +* The sum of the CPU requests is less than or equal to 9. +* The sum of the memory requests is less than or equal to 36Gi. +* The total number of pods is less than or equal to 5. +* The sum of the GPU requests is less than or equal to 100. + +. Apply the `ClusterQueue` object by running the following command: ++ +[source,terminal] +---- +$ oc apply -f .yaml +---- diff --git a/modules/kueue-configuring-gangscheduling.adoc b/modules/kueue-configuring-gangscheduling.adoc new file mode 100644 index 000000000000..a8ec50304141 --- /dev/null +++ b/modules/kueue-configuring-gangscheduling.adoc @@ -0,0 +1,43 @@ +// Module included in the following assemblies: +// +// * ai_workloads/kueue/gangscheduling.adoc + +:_mod-docs-content-type: REFERENCE +[id="configuring-gangscheduling_{context}"] += Configuring gang scheduling + +As a cluster administrator, you can configure gang scheduling by modifying the `gangScheduling` spec in the `Kueue` custom resource (CR). + +.Example `Kueue` CR with gang scheduling configured +[source,yaml] +---- +apiVersion: kueue.openshift.io/v1 +kind: Kueue +metadata: + name: cluster + labels: + app.kubernetes.io/managed-by: kustomize + app.kubernetes.io/name: kueue-operator + namespace: openshift-kueue-operator +spec: + config: + gangScheduling: + policy: ByWorkload # <1> + byWorkload: + admission: Parallel # <2> +# ... +---- +<1> You can set the `policy` value to enable or disable gang scheduling. The possible values are `ByWorkload`, `None`, or empty (`""`). ++ +`ByWorkload`:: When the `policy` value is set to `ByWorkload`, each job is processed and considered for admission as a single unit. If the job does not become ready within the specified time, the entire job is evicted and retried at a later time. ++ +`None`:: When the `policy` value is set to `None`, gang scheduling is disabled. ++ +Empty (`""`):: When the `policy` value is empty or set to `""`, the {kueue-name} Operator determines settings for gang scheduling. Currently, gang scheduling is disabled by default. 
+<2> If the `policy` value is set to `ByWorkload`, you must configure job admission settings. The possible values for the `admission` spec are `Parallel`, `Sequential`, or empty (`""`). ++ +`Parallel`:: When the `admission` value is set to `Parallel`, pods from any job can be admitted at any time. This can cause a deadlock, where jobs are in contention for cluster capacity. When a deadlock occurs, the successful scheduling of pods from another job can prevent the scheduling of pods from the current job. ++ +`Sequential`:: When the `admission` value is set to `Sequential`, only pods from the currently processing job are admitted. After all of the pods from the current job have been admitted and are ready, {kueue-name} processes the next job. Sequential processing can slow down admission when the cluster has sufficient capacity for multiple jobs, but provides a higher likelihood that all of the pods for a job are scheduled together successfully. ++ +Empty (`""`):: When the `admission` value is empty or set to `""`, the {kueue-name} Operator determines job admission settings. Currently, the `admission` value is set to `Parallel` by default. diff --git a/modules/kueue-configuring-labelpolicy.adoc b/modules/kueue-configuring-labelpolicy.adoc new file mode 100644 index 000000000000..2514d172dfc4 --- /dev/null +++ b/modules/kueue-configuring-labelpolicy.adoc @@ -0,0 +1,45 @@ +// Module included in the following assemblies: +// +// * ai_workloads/kueue/managing-workloads.adoc + +:_mod-docs-content-type: REFERENCE +[id="configuring-labelpolicy_{context}"] += Configuring label policies for jobs + +The `spec.config.workloadManagement.labelPolicy` spec in the `Kueue` custom resource (CR) is an optional field that controls how {kueue-name} decides whether to manage or ignore different jobs. The allowed values are `QueueName`, `None` and empty (`""`). + +If the `labelPolicy` setting is omitted or empty (`""`), the default policy is that {kueue-name} manages jobs that have a `kueue.x-k8s.io/queue-name` label, and ignores jobs that do not have the `kueue.x-k8s.io/queue-name` label. This is the same workflow as if the `labelPolicy` is set to `QueueName`. + +If the `labelPolicy` setting is set to `None`, jobs are managed by {kueue-name} even if they do not have the `kueue.x-k8s.io/queue-name` label. + +.Example `workloadManagement` spec configuration +[source,yaml] +---- +apiVersion: kueue.openshift.io/v1 +kind: Kueue +metadata: + labels: + app.kubernetes.io/name: kueue-operator + app.kubernetes.io/managed-by: kustomize + name: cluster + namespace: openshift-kueue-operator +spec: + config: + workloadManagement: + labelPolicy: QueueName +# ... +---- + +.Example user-created `Job` object containing the `kueue.x-k8s.io/queue-name` label +[source,yaml] +---- +apiVersion: batch/v1 +kind: Job +metadata: + generateName: sample-job- + namespace: my-namespace + labels: + kueue.x-k8s.io/queue-name: user-queue +spec: +# ... 
+---- diff --git a/modules/kueue-configuring-localqueue-defaults.adoc b/modules/kueue-configuring-localqueue-defaults.adoc new file mode 100644 index 000000000000..eb323b104484 --- /dev/null +++ b/modules/kueue-configuring-localqueue-defaults.adoc @@ -0,0 +1,50 @@ +// Module included in the following assemblies: +// +// * ai_workloads/kueue/configuring-quotas.adoc + +:_mod-docs-content-type: PROCEDURE +[id="configuring-localqueue-defaults_{context}"] += Configuring a default local queue + +As a cluster administrator, you can improve quota enforcement in your cluster by managing all jobs in selected namespaces without needing to explicitly label each job. You can do this by creating a default local queue. + +A default local queue serves as the local queue for newly created jobs that do not have the `kueue.x-k8s.io/queue-name` label. After you create a default local queue, any new jobs created in the namespace without a `kueue.x-k8s.io/queue-name` label automatically update to have the `kueue.x-k8s.io/queue-name: default` label. + +[IMPORTANT] +==== +Preexisting jobs in a namespace are not affected when you create a default local queue. If jobs already exist in the namespace before you create the default local queue, you must label those jobs explicitly to assign them to a queue. +==== + +.Prerequisites + +include::snippets/prereqs-snippet-yaml-1.1.adoc[] + +* You have created a `ClusterQueue` object. + +.Procedure + +. Create a `LocalQueue` object named `default` as a YAML file: ++ +.Example of a default `LocalQueue` object +[source,yaml] +---- +apiVersion: kueue.x-k8s.io/v1beta1 +kind: LocalQueue +metadata: + namespace: team-namespace + name: default +spec: + clusterQueue: cluster-queue +---- + +. Apply the `LocalQueue` object by running the following command: ++ +[source,terminal] +---- +$ oc apply -f .yaml +---- + +.Verification + +. Create a job in the same namespace as the default local queue. +. Observe that the job updates with the `kueue.x-k8s.io/queue-name: default` label. diff --git a/modules/kueue-configuring-localqueues.adoc b/modules/kueue-configuring-localqueues.adoc new file mode 100644 index 000000000000..af28a3662c09 --- /dev/null +++ b/modules/kueue-configuring-localqueues.adoc @@ -0,0 +1,40 @@ +// Module included in the following assemblies: +// +// * ai_workloads/kueue/configuring-quotas.adoc + +:_mod-docs-content-type: PROCEDURE +[id="configuring-localqueues_{context}"] += Configuring a local queue + +A local queue is a namespaced object, represented by a `LocalQueue` object, that groups closely related workloads that belong to a single namespace. + +As an administrator, you can configure a `LocalQueue` object to point to a cluster queue. This allocates resources from the cluster queue to workloads in the namespace specified in the `LocalQueue` object. + +.Prerequisites + +include::snippets/prereqs-snippet-yaml.adoc[] + +* You have created a `ClusterQueue` object. + +.Procedure + +. Create a `LocalQueue` object as a YAML file: ++ +.Example of a basic `LocalQueue` object +[source,yaml] +---- +apiVersion: kueue.x-k8s.io/v1beta1 +kind: LocalQueue +metadata: + namespace: team-namespace + name: user-queue +spec: + clusterQueue: cluster-queue +---- + +. 
Apply the `LocalQueue` object by running the following command: ++ +[source,terminal] +---- +$ oc apply -f .yaml +---- diff --git a/modules/kueue-configuring-resourceflavors.adoc b/modules/kueue-configuring-resourceflavors.adoc new file mode 100644 index 000000000000..4782de599a99 --- /dev/null +++ b/modules/kueue-configuring-resourceflavors.adoc @@ -0,0 +1,49 @@ +// Module included in the following assemblies: +// +// * ai_workloads/kueue/configuring-quotas.adoc + +:_mod-docs-content-type: PROCEDURE +[id="configuring-resourceflavors_{context}"] += Configuring a resource flavor + +After you have configured a `ClusterQueue` object, you can configure a `ResourceFlavor` object. + +Resources in a cluster are typically not homogeneous. If the resources in your cluster are homogeneous, you can use an empty `ResourceFlavor` instead of adding labels to custom resource flavors. + +You can use a custom `ResourceFlavor` object to represent different resource variations that are associated with cluster nodes through labels, taints, and tolerations. You can then associate workloads with specific node types to enable fine-grained resource management. + +.Prerequisites + +include::snippets/prereqs-snippet-yaml.adoc[] + +.Procedure + +. Create a `ResourceFlavor` object as a YAML file: ++ +.Example of an empty `ResourceFlavor` object +[source,yaml] +---- +apiVersion: kueue.x-k8s.io/v1beta1 +kind: ResourceFlavor +metadata: + name: default-flavor +---- ++ +.Example of a custom `ResourceFlavor` object +[source,yaml] +---- +apiVersion: kueue.x-k8s.io/v1beta1 +kind: ResourceFlavor +metadata: + name: "x86" +spec: + nodeLabels: + cpu-arch: x86 +---- + +. Apply the `ResourceFlavor` object by running the following command: ++ +[source,terminal] +---- +$ oc apply -f .yaml +---- diff --git a/modules/kueue-create-kueue-cr.adoc b/modules/kueue-create-kueue-cr.adoc new file mode 100644 index 000000000000..9aee2637fbda --- /dev/null +++ b/modules/kueue-create-kueue-cr.adoc @@ -0,0 +1,66 @@ +// Module included in the following assemblies: +// +// * ai_workloads/kueue/install-kueue.adoc +// * ai_workloads/kueue/install-disconnected.adoc + +:_mod-docs-content-type: PROCEDURE +[id="create-kueue-cr_{context}"] += Creating a Kueue custom resource + +After you have installed the {kueue-op}, you must create a `Kueue` custom resource (CR) to configure your installation. + +.Prerequisites + +include::snippets/prereqs-snippet-console.adoc[] + +.Procedure + +. In the {product-title} web console, click *Operators* -> *Installed Operators*. +. In the *Provided APIs* table column, click *Kueue*. This takes you to the *Kueue* tab of the *Operator details* page. +. Click *Create Kueue*. This takes you to the *Create Kueue* YAML view. +. Enter the details for your `Kueue` CR. ++ +.Example `Kueue` CR +[source,yaml] +---- +apiVersion: kueue.openshift.io/v1 +kind: Kueue +metadata: + labels: + app.kubernetes.io/name: kueue-operator + app.kubernetes.io/managed-by: kustomize + name: cluster # <1> + namespace: openshift-kueue-operator +spec: + managementState: Managed + config: + integrations: + frameworks: # <2> + - BatchJob + preemption: + preemptionPolicy: Classical # <3> +# ... +---- +<1> The name of the `Kueue` CR must be `cluster`. +<2> If you want to configure {kueue-name} for use with other workload types, add those types here. For the default configuration, only the `BatchJob` type is recommended and supported. +<3> Optional: If you want to configure fair sharing for {kueue-name}, set the `preemptionPolicy` value to `FairSharing`. 
The default setting in the `Kueue` CR is `Classical` preemption. +// Once conceptual docs are added mention those docs here. "For more information about X, see..." + +. Click *Create*. + +.Verification + +* After you create the `Kueue` CR, the web console brings you to the *Operator details* page, where you can see the CR in the list of *Kueues*. +* Optional: If you have the {oc-first} installed, you can run the following command and observe the output to confirm that your `Kueue` CR has been created successfully: ++ +[source,terminal] +---- +$ oc get kueue +---- ++ +.Example output +[source,terminal] +---- +NAME AGE +cluster 4m +---- diff --git a/modules/kueue-defining-running-jobs.adoc b/modules/kueue-defining-running-jobs.adoc new file mode 100644 index 000000000000..08bbc2ead4c4 --- /dev/null +++ b/modules/kueue-defining-running-jobs.adoc @@ -0,0 +1,90 @@ +// Module included in the following assemblies: +// +// * ai_workloads/kueue/running-kueue-jobs.adoc + +:_mod-docs-content-type: PROCEDURE +[id="defining-running-jobs_{context}"] += Defining a job to run with {kueue-name} + +When you are defining a job to run with {kueue-name}, ensure that it meets the following criteria: + +* Specify the local queue to submit the job to, by using the `kueue.x-k8s.io/queue-name` label. +* Include the resource requests for each job pod. + +{kueue-name} suspends the job, and then starts it when resources are available. {kueue-name} creates a corresponding workload, represented as a `Workload` object with a name that matches the job. + +.Prerequisites + +include::snippets/prereqs-snippet-yaml-user.adoc[] +* You have identified the name of the local queue that you want to submit jobs to. + +.Procedure + +. Create a `Job` object. ++ +.Example job +[source,yaml] +---- +apiVersion: batch/v1 +kind: Job # <1> +metadata: + generateName: sample-job- # <2> + namespace: my-namespace + labels: + kueue.x-k8s.io/queue-name: user-queue # <3> +spec: + parallelism: 3 + completions: 3 + template: + spec: + containers: + - name: dummy-job + image: registry.k8s.io/e2e-test-images/agnhost:2.53 + args: ["entrypoint-tester", "hello", "world"] + resources: # <4> + requests: + cpu: 1 + memory: "200Mi" + restartPolicy: Never +---- +<1> Defines the resource type as a `Job` object, which represents a batch computation task. +<2> Provides a prefix for generating a unique name for the job. +<3> Identifies the queue to send the job to. +<4> Defines the resource requests for each pod. + +. 
Run the job by running the following command: ++ +[source,terminal] +---- +$ oc create -f .yaml +---- + +.Verification + +* Verify that pods are running for the job you have created, by running the following command and observing the output: ++ +[source,terminal] +---- +$ oc get job +---- ++ +.Example output +[source,terminal] +---- +NAME STATUS COMPLETIONS DURATION AGE +sample-job-sk42x Suspended 0/1 2m12s +---- + +* Verify that a workload has been created in your namespace for the job, by running the following command and observing the output: ++ +[source,terminal] +---- +$ oc -n get workloads +---- ++ +.Example output +[source,terminal] +---- +NAME QUEUE RESERVED IN ADMITTED FINISHED AGE +job-sample-job-sk42x-77c03 user-queue 3m8s +---- diff --git a/modules/kueue-gathering-cluster-data.adoc b/modules/kueue-gathering-cluster-data.adoc new file mode 100644 index 000000000000..7c0bc9d92c66 --- /dev/null +++ b/modules/kueue-gathering-cluster-data.adoc @@ -0,0 +1,40 @@ +// Module included in the following assemblies: +// +// * ai_workloads/kueue/getting-support.adoc + +:_mod-docs-content-type: PROCEDURE +[id="gathering-cluster-data_{context}"] += Collecting data for Red Hat Support + +You can use the `oc adm must-gather` CLI command to collect the information about your {kueue-name} instance that is most likely needed for debugging issues, including: + +* {kueue-name} custom resources, such as workloads, cluster queues, local queues, resource flavors, admission checks, and their corresponding cluster resource definitions (CRDs) +* Services +* Endpoints +* Webhook configurations +* Logs from the `openshift-kueue-operator` namespace and `kueue-controller-manager` pods + +Collected data is written into a new directory named `must-gather/` in the current working directory by default. + +.Prerequisites + +* The {kueue-name} Operator is installed on your cluster. +* You have installed the {oc-first}. + +.Procedure + +. Navigate to the directory where you want to store the `must-gather` data. + +. Collect `must-gather` data by running the following command: ++ +[source,terminal] +---- +$ oc adm must-gather \ + --image=registry.redhat.io/kueue/kueue-must-gather-rhel9: +---- ++ +Where `` is your current version of {kueue-name}. + +. Create a compressed file from the `must-gather` directory that was just created in your working directory. Make sure you provide the date and cluster ID for the unique `must-gather` data. For more information about how to find the cluster ID, see link:https://access.redhat.com/solutions/5280291[How to find the cluster-id or name on OpenShift cluster]. + +. Attach the compressed file to your support case on the link:https://access.redhat.com/support/cases/#/case/list[the *Customer Support* page] of the Red{nbsp}Hat Customer Portal. diff --git a/modules/kueue-identifying-local-queues.adoc b/modules/kueue-identifying-local-queues.adoc new file mode 100644 index 000000000000..191d1c9a6868 --- /dev/null +++ b/modules/kueue-identifying-local-queues.adoc @@ -0,0 +1,29 @@ +// Module included in the following assemblies: +// +// * ai_workloads/kueue/running-kueue-jobs.adoc + +:_mod-docs-content-type: PROCEDURE +[id="identifying-local-queues_{context}"] += Identifying available local queues + +Before you can submit a job to a queue, you must find the name of the local queue. 
+ +.Prerequisites + +include::snippets/prereqs-snippet-yaml-user.adoc[] + +.Procedure + +* Run the following command to list available local queues in your namespace: ++ +[source,terminal] +---- +$ oc -n get localqueues +---- ++ +.Example output +[source,terminal] +---- +NAME CLUSTERQUEUE PENDING WORKLOADS +user-queue cluster-queue 3 +---- diff --git a/modules/kueue-install-kueue-operator.adoc b/modules/kueue-install-kueue-operator.adoc new file mode 100644 index 000000000000..ff4d997f0312 --- /dev/null +++ b/modules/kueue-install-kueue-operator.adoc @@ -0,0 +1,25 @@ +// Module included in the following assemblies: +// +// * ai_workloads/kueue/install-kueue.adoc +// * ai_workloads/kueue/install-disconnected.adoc + +:_mod-docs-content-type: PROCEDURE +[id="install-kueue-operator_{context}"] += Installing the {kueue-op} + +You can install the {kueue-op} on a {product-title} cluster by using the OperatorHub in the web console. + +.Prerequisites + +* You have administrator permissions on a {product-title} cluster. +* You have access to the {product-title} web console. +* You have installed and configured the {cert-manager-operator} for your cluster. + +.Procedure + +. In the {product-title} web console, click *Operators* -> *OperatorHub*. +. Choose *{kueue-op}* from the list of available Operators, and click *Install*. + +.Verification + +* Go to *Operators* -> *Installed Operators* and confirm that the *{kueue-op}* is listed with *Status* as *Succeeded*. diff --git a/modules/kueue-label-namespaces.adoc b/modules/kueue-label-namespaces.adoc new file mode 100644 index 000000000000..11d808cc0737 --- /dev/null +++ b/modules/kueue-label-namespaces.adoc @@ -0,0 +1,29 @@ +// Module included in the following assemblies: +// +// * ai_workloads/kueue/install-kueue.adoc +// * ai_workloads/kueue/install-disconnected.adoc + +:_mod-docs-content-type: PROCEDURE +[id="label-namespaces_{context}"] += Labeling namespaces to allow {kueue-name} to manage jobs + +The {kueue-name} Operator uses an opt-in webhook mechanism to ensure that policies are only enforced for the jobs and namespaces that it is expected to target. + +You must label the namespaces where you want {kueue-name} to manage jobs with the `kueue.openshift.io/managed=true` label. + +.Prerequisites + +* You have cluster administrator permissions. +* The {kueue-name} Operator is installed on your cluster, and you have created a `Kueue` custom resource (CR). +* You have installed the {oc-first}. + +.Procedure + +* Add the `kueue.openshift.io/managed=true` label to a namespace by running the following command: ++ +[source,terminal] +---- +$ oc label namespace kueue.openshift.io/managed=true +---- + +When you add this label, you instruct the {kueue-name} Operator that the namespace is managed by its webhook admission controllers. As a result, any {kueue-name} resources within that namespace are properly validated and mutated. diff --git a/modules/kueue-release-notes-1.0.1.adoc b/modules/kueue-release-notes-1.0.1.adoc new file mode 100644 index 000000000000..20e35739b421 --- /dev/null +++ b/modules/kueue-release-notes-1.0.1.adoc @@ -0,0 +1,20 @@ +// Module included in the following assemblies: +// +// * ai_workloads/kueue/release-notes.adoc + +:_mod-docs-content-type: REFERENCE +[id="release-notes-1.0.1_{context}"] += Release notes for {kueue-name} version 1.0.1 + +{kueue-name} version 1.0.1 is a patch release that is supported on {product-title} versions 4.18 and 4.19 on the 64-bit x86 architecture. 
+ +{kueue-name} version 1.0.1 uses link:https://kueue.sigs.k8s.io/docs/overview/[Kueue] version 0.11. + +[id="release-notes-1.0.1-bug-fixes_{context}"] +== Bug fixes in {kueue-name} version 1.0.1 + +* Previously, leader election for {kueue-name} was not configured to tolerate disruption, which resulted in frequent crashing. With this release, the leader election values for {kueue-name} have been updated to match the durations recommended for {product-title}. (link:https://issues.redhat.com/browse/OCPBUGS-58496[OCPBUGS-58496]) + +* Previously, the `ReadyReplicas` count was not set in the reconciler, which meant that the {kueue-name} Operator status would report that there were no replicas ready. With this release, the `ReadyReplicas` count is based on the number of ready replicas for the deployment, which ensures that the Operator shows as ready in the {product-title} console when the `kueue-controller-manager` pods are ready. (link:https://issues.redhat.com/browse/OCPBUGS-59261[OCPBUGS-59261]) + +* Previously, when the `Kueue` custom resource (CR) was deleted from the `openshift-kueue-operator` namespace, the `kueue-manager-config` config map was not deleted automatically and could remain in the namespace. With this release, the `kueue-manager-config` config map, `kueue-webhook-server-cert` secret, and `metrics-server-cert` secret are deleted automatically when the `Kueue` CR is deleted. (link:https://issues.redhat.com/browse/OCPBUGS-57960[OCPBUGS-57960]) diff --git a/modules/kueue-release-notes-1.0.adoc b/modules/kueue-release-notes-1.0.adoc new file mode 100644 index 000000000000..0e68a775b37d --- /dev/null +++ b/modules/kueue-release-notes-1.0.adoc @@ -0,0 +1,34 @@ +// Module included in the following assemblies: +// +// * ai_workloads/kueue/release-notes.adoc + +:_mod-docs-content-type: REFERENCE +[id="release-notes-1.0_{context}"] += Release notes for {kueue-name} version 1.0 + +[role="_abstract"] +{kueue-name} version 1.0 is a generally available release that is supported on {product-title} versions 4.18 and 4.19 on the 64-bit x86 architecture. {kueue-name} version 1.0 uses link:https://kueue.sigs.k8s.io/docs/overview/[Kueue] version 0.11. + +[id="release-notes-1.0-new-features_{context}"] +== New features and enhancements + +Role-based access control (RBAC):: Role-based access control (RBAC) enables you to control which types of users can create which types of {kueue-name} resources. + +Configure resource quotas:: Configuring resource quotas by creating cluster queues, resource flavors, and local queues enables you to control the amount of resources used by user-submitted jobs and workloads. + +Control job and workload management:: Labeling namespaces and configuring label policies enable you to control which jobs and workloads are managed by {kueue-name}. + +Share borrowable resources between queues:: Configuring cohorts, fair sharing, and gang scheduling settings enable you to share unused, borrowable resources between queues. + +[id="release-notes-1.0-known-issues_{context}"] +== Known issues + +Jobs in all namespaces are reconciled if they have the `kueue.x-k8s.io/queue-name` label:: {kueue-name} uses the `managedJobsNamespaceSelector` configuration field, so that administrators can configure which namespaces opt in to be managed by {kueue-name}. Because namespaces must be manually configured to opt in to being managed by {kueue-name}, resources in system or third-party namespaces are not impacted or managed by {kueue-name}. 
++
+The behavior in {kueue-name} 1.0 allows reconciliation of `Job` resources that have the `kueue.x-k8s.io/queue-name` label, even if these resources are in namespaces that are not configured to opt in to being managed by {kueue-name}. This is inconsistent with the behavior for other core integrations, such as pods, deployments, and stateful sets, which are reconciled only if they are in namespaces that have been configured to opt in to being managed by {kueue-name}.
++
+(link:https://issues.redhat.com/browse/OCPBUGS-58205[OCPBUGS-58205])
+
+You cannot create a `Kueue` custom resource by using the {product-title} web console:: If you try to use the {product-title} web console to create a `Kueue` custom resource (CR) by using the form view, the web console shows an error and the resource cannot be created. As a workaround, use the YAML view to create a `Kueue` CR instead.
++
+(link:https://issues.redhat.com/browse/OCPBUGS-58118[OCPBUGS-58118])
diff --git a/modules/kueue-release-notes-1.1.adoc b/modules/kueue-release-notes-1.1.adoc
new file mode 100644
index 000000000000..75025213ffc4
--- /dev/null
+++ b/modules/kueue-release-notes-1.1.adoc
@@ -0,0 +1,46 @@
+// Module included in the following assemblies:
+//
+// * ai_workloads/kueue/release-notes.adoc
+
+:_mod-docs-content-type: REFERENCE
+[id="kueue-release-notes-1.1_{context}"]
+= Release notes for {kueue-name} version 1.1
+
+[role="_abstract"]
+{kueue-name} version 1.1 is a generally available release that is supported on {product-title} versions 4.18 and later. {kueue-name} version 1.1 uses link:https://kueue.sigs.k8s.io/docs/overview/[Kueue] version 0.12.
+
+[IMPORTANT]
+====
+If you have a previously installed version of {kueue-name} on your cluster, you must uninstall the Operator and manually install version 1.1. For more information, see xref:../../ai_workloads/kueue/install-kueue.adoc#upgrading-kueue_install-kueue[Upgrading {kueue-name}].
+====
+
+[id="release-notes-1.1-new-features_{context}"]
+== New features and enhancements
+
+Configure a default local queue:: A default local queue serves as the local queue for newly created jobs that do not have the `kueue.x-k8s.io/queue-name` label. After you create a default local queue, any new jobs created in the namespace without a `kueue.x-k8s.io/queue-name` label are automatically updated to have the `kueue.x-k8s.io/queue-name: default` label.
++
+(link:https://issues.redhat.com/browse/RFE-7615[RFE-7615])
+
+Multi-architecture and {hcp-capital} support:: With this release, {kueue-name} is supported on multiple architectures, including ARM64, 64-bit x86, ppc64le ({ibm-power-name}), and s390x ({ibm-z-name}), as well as on {hcp-capital} for {product-title}.
++
+(link:https://issues.redhat.com/browse/OCPSTRAT-2103[OCPSTRAT-2103])
++
+(link:https://issues.redhat.com/browse/OCPSTRAT-2106[OCPSTRAT-2106])
+
+[id="release-notes-1.1-fixed-issues_{context}"]
+== Fixed issues
+
+You can create a `Kueue` custom resource by using the {product-title} web console:: Before this update, if you tried to use the {product-title} web console to create a `Kueue` custom resource (CR) by using the form view, the web console showed an error and the resource could not be created. With this release, the default namespace was removed from the `Kueue` CR template. As a result, you can use the {product-title} web console to create a `Kueue` CR by using the form view.
++
+(link:https://issues.redhat.com/browse/OCPBUGS-58118[OCPBUGS-58118])
+
+[id="release-notes-1.1-known-issues_{context}"]
+== Known issues
+
+`Kueue` CR description reads as "Not available" in the {product-title} web console:: After you install {kueue-name}, in the *Operator details* view, the description for the `Kueue` CR reads as "Not available". This issue does not affect or degrade the {kueue-name} Operator functionality.
++
+(link:https://issues.redhat.com/browse/OCPBUGS-62185[OCPBUGS-62185])
+
+Custom resources are not deleted properly when you uninstall {kueue-name}:: After you uninstall the {kueue-op} by using the *Delete all operand instances for this operator* option in the {product-title} web console, some {kueue-name} custom resources are not fully deleted. These resources can be viewed in the *Installed Operators* view with the status *Resource is being deleted*. As a workaround, you can manually delete the resource finalizers to remove them fully.
++
+(link:https://issues.redhat.com/browse/OCPBUGS-62254[OCPBUGS-62254])
diff --git a/modules/kueue-running-jobs.adoc b/modules/kueue-running-jobs.adoc
new file mode 100644
index 000000000000..517b02bd3620
--- /dev/null
+++ b/modules/kueue-running-jobs.adoc
@@ -0,0 +1,17 @@
+// Module included in the following assemblies:
+//
+// * /develop/running-kueue-jobs.adoc
+
+:_mod-docs-content-type: PROCEDURE
+[id="running-jobs_{context}"]
+= Running jobs with {product-title}
+
+When you run a job, {kueue-name} creates a corresponding workload for the job, which is represented by a `Workload` object.
+
+.Prerequisites
+
+include::snippets/prereqs-snippet-yaml-user.adoc[]
+
+.Procedure
+
+.Verification
diff --git a/modules/upgrading-kueue.adoc b/modules/upgrading-kueue.adoc
new file mode 100644
index 000000000000..1bbbe76811eb
--- /dev/null
+++ b/modules/upgrading-kueue.adoc
@@ -0,0 +1,46 @@
+// Module included in the following assemblies:
+//
+// * ai_workloads/kueue/install-disconnected.adoc
+// * ai_workloads/kueue/install-kueue.adoc
+
+:_mod-docs-content-type: PROCEDURE
+[id="upgrading-kueue_{context}"]
+= Upgrading {kueue-name}
+
+[role="_abstract"]
+If you have previously installed {kueue-name}, you must manually upgrade your deployment to the latest version to use the latest bug fixes and feature enhancements.
+
+.Prerequisites
+
+* You have installed a previous version of {kueue-name}.
+* You are logged in to the {product-title} web console with cluster administrator permissions.
+
+.Procedure
+
+. In the {product-title} web console, click *Operators* -> *Installed Operators*, then select *{kueue-op}* from the list.
+
+. From the *Actions* drop-down menu, select *Uninstall Operator*.
+
+. In the *Uninstall Operator?* dialog box, click *Uninstall*.
++
+[IMPORTANT]
+====
+Selecting the *Delete all operand instances for this operator* checkbox before clicking *Uninstall* deletes all existing resources from the cluster, including:
+
+* The `Kueue` CR
+* Any cluster queues, local queues, or resource flavors that you have created
+
+Leave this checkbox unchecked when upgrading {kueue-name} so that your created resources are retained.
+====
+
+. In the {product-title} web console, click *Operators* -> *OperatorHub*.
+
+. Choose *{kueue-op}* from the list of available Operators, and click *Install*.
+
+.Verification
+
+. Go to *Operators* -> *Installed Operators*.
+
+. Confirm that the *{kueue-op}* is listed with *Status* as *Succeeded*.
+
+. Confirm that the version shown under the Operator name in the list is the latest version.
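+
+Optionally, you can also check the installed Operator version from the CLI. The following example command assumes that the {kueue-op} is installed in the `openshift-kueue-operator` namespace:
+
+[source,terminal]
+----
+$ oc get csv -n openshift-kueue-operator <1>
+----
+<1> Replace `openshift-kueue-operator` if the Operator is installed in a different namespace.
+
+The `VERSION` column in the output shows the installed version of the {kueue-op}.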
diff --git a/snippets/prereqs-snippet-console.adoc b/snippets/prereqs-snippet-console.adoc new file mode 100644 index 000000000000..66b044ed507b --- /dev/null +++ b/snippets/prereqs-snippet-console.adoc @@ -0,0 +1,15 @@ +// Text snippet included in the following modules: +// +// * modules/kueue-create-kueue-cr.adoc +// +// Text snippet included in the following assemblies: +// +// * + +:_mod-docs-content-type: SNIPPET + +Ensure that you have completed the following prerequisites: + +* The {kueue-name} Operator is installed on your cluster. +* You have cluster administrator permissions and the `kueue-batch-admin-role` role. +* You have access to the {product-title} web console. diff --git a/snippets/prereqs-snippet-yaml-1.1.adoc b/snippets/prereqs-snippet-yaml-1.1.adoc new file mode 100644 index 000000000000..82898adea0a9 --- /dev/null +++ b/snippets/prereqs-snippet-yaml-1.1.adoc @@ -0,0 +1,13 @@ +// Text snippet included in the following modules: +// +// * modules/kueue-configuring-localqueue-defaults.adoc +// +// Text snippet included in the following assemblies: +// +// * + +:_mod-docs-content-type: SNIPPET + +* You have installed {kueue-name} version 1.1 on your cluster. +* You have cluster administrator permissions or the `kueue-batch-admin-role` role. +* You have installed the {oc-first}. diff --git a/snippets/prereqs-snippet-yaml-admin.adoc b/snippets/prereqs-snippet-yaml-admin.adoc new file mode 100644 index 000000000000..ece72d415cf4 --- /dev/null +++ b/snippets/prereqs-snippet-yaml-admin.adoc @@ -0,0 +1,14 @@ +// Text snippet included in the following modules: +// +// * modules/kueue-configure-rbac-batch-admins.adoc +// * modules/kueue-configure-rbac-batch-users.adoc +// +// Text snippet included in the following assemblies: +// +// * + +:_mod-docs-content-type: SNIPPET + +* The {kueue-name} Operator is installed on your cluster. +* You have cluster administrator permissions. +* You have installed the {oc-first}. diff --git a/snippets/prereqs-snippet-yaml-user.adoc b/snippets/prereqs-snippet-yaml-user.adoc new file mode 100644 index 000000000000..58a6cf3333d6 --- /dev/null +++ b/snippets/prereqs-snippet-yaml-user.adoc @@ -0,0 +1,14 @@ +// Text snippet included in the following modules: +// +// * kueue-identifying-local-queues.adoc +// * kueue-defining-running-jobs.adoc +// +// Text snippet included in the following assemblies: +// +// * + +:_mod-docs-content-type: SNIPPET + +* A cluster administrator has installed and configured {kueue-name} on your {product-title} cluster. +* A cluster administrator has assigned you the `kueue-batch-user-role` cluster role. +* You have installed the {oc-first}. diff --git a/snippets/prereqs-snippet-yaml.adoc b/snippets/prereqs-snippet-yaml.adoc new file mode 100644 index 000000000000..aedf7cbd9645 --- /dev/null +++ b/snippets/prereqs-snippet-yaml.adoc @@ -0,0 +1,15 @@ +// Text snippet included in the following modules: +// +// * modules/kueue-configuring-clusterqueues.adoc +// * modules/kueue-configuring-localqueues.adoc +// * modules/kueue-configuring-resourceflavors.adoc +// +// Text snippet included in the following assemblies: +// +// * + +:_mod-docs-content-type: SNIPPET + +* The {kueue-name} Operator is installed on your cluster. +* You have cluster administrator permissions or the `kueue-batch-admin-role` role. +* You have installed the {oc-first}.