Showing 3 changed files with 248 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,216 @@ | ||
# KEP-693: MultiKueue | ||
|
||
<!-- toc --> | ||
- [Summary](#summary) | ||
- [Motivation](#motivation) | ||
- [Goals](#goals) | ||
- [Non-Goals](#non-goals) | ||
- [Proposal](#proposal) | ||
- [User Stories (Optional)](#user-stories-optional) | ||
- [Story 1](#story-1) | ||
- [Story 2](#story-2) | ||
- [Risks and Mitigations](#risks-and-mitigations) | ||
- [Design Details](#design-details) | ||
- [Test Plan](#test-plan) | ||
- [Unit Tests](#unit-tests) | ||
- [Integration tests](#integration-tests) | ||
- [Graduation Criteria](#graduation-criteria) | ||
- [Implementation History](#implementation-history) | ||
- [Drawbacks](#drawbacks) | ||
- [Alternatives](#alternatives) | ||
<!-- /toc --> | ||
|
||
## Summary | ||
Introduce a new AdmissionCheck (called MultiKueue) with a dedicated API | |
and controller that will provide multi-cluster capabilities to Kueue. | |
|
||
## Motivation | ||
Many of Kueue's users are running multiple clusters and would like to | ||
have a way to easily distribute batch jobs across them. | ||
|
||
### Goals | ||
* Allow Kueue to distribute batch jobs across multiple clusters, | |
while maintaining the specified quota limits. | |
* Provide users with a single entry point through which the jobs | |
can be submitted and monitored, just as if they were running in | |
a single cluster. | |
* Be compatible with all of Kueue's features (priorities, borrowing, preemptions, etc.) | |
and integrations. | |
|
||
### Non-Goals | ||
* Solve the storage problem. It is assumed that the distributed jobs are | |
either location-flexible or are copying the data as a part of the | ||
startup process. | ||
* Synchronize configuration across the clusters. It is expected that the | ||
user will create the appropriate objects, roles and permissions | ||
in the clusters (manually, using gitops or some 3rd-party tooling). | ||
|
||
## Proposal | ||
|
||
Introduce MultiKueue AdmissionCheck, controller and configuration API. | ||
|
||
Establish the need for a designated management cluster. | ||
|
||
![Architecture](arch.png "Architecture") | ||
|
||
For each workload that arrives at a ClusterQueue (with the MultiKueue AdmissionCheck enabled) | |
in the management cluster and gets past the preadmission phase of the | |
two-phase admission process (meaning that the global quota is ok), the | |
MultiKueue controller will create the workload in the defined worker clusters and wait | |
until the Kueue running in one of them admits it. | |
The controller will then remove the workload from the remaining clusters and allow the | |
single instance of the job to proceed. The workload will also be admitted in | |
the management cluster. | |
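
The admission flow described above can be sketched as follows. This is only an illustration under stated assumptions, not the controller's implementation: `workerCluster`, `dispatch` and `settle` are hypothetical names, the worker clusters are modeled as in-memory maps, and the tie-break by cluster name is an arbitrary choice made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// workerCluster is a hypothetical in-memory stand-in for a remote cluster.
// A real controller would talk to the worker's API server instead.
type workerCluster struct {
	name      string
	workloads map[string]bool // workload name -> admitted by the local Kueue
}

func (w *workerCluster) create(wl string)        { w.workloads[wl] = false }
func (w *workerCluster) remove(wl string)        { delete(w.workloads, wl) }
func (w *workerCluster) admitted(wl string) bool { return w.workloads[wl] }

// dispatch creates the workload in every worker cluster.
func dispatch(wl string, workers []*workerCluster) {
	for _, w := range workers {
		w.create(wl)
	}
}

// settle picks the first cluster (in name order, a tie-break assumed for
// this sketch) that admitted the workload, removes the workload from all
// other clusters and returns the winner's name. It returns "" while no
// cluster has admitted the workload.
func settle(wl string, workers []*workerCluster) string {
	sorted := append([]*workerCluster(nil), workers...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].name < sorted[j].name })
	winner := ""
	for _, w := range sorted {
		if w.admitted(wl) {
			winner = w.name
			break
		}
	}
	if winner == "" {
		return ""
	}
	for _, w := range sorted {
		if w.name != winner {
			w.remove(wl)
		}
	}
	return winner
}

func main() {
	a := &workerCluster{name: "worker-a", workloads: map[string]bool{}}
	b := &workerCluster{name: "worker-b", workloads: map[string]bool{}}
	workers := []*workerCluster{a, b}
	dispatch("job-1", workers)
	b.workloads["job-1"] = true // simulate admission by the Kueue in worker-b
	fmt.Println("running in:", settle("job-1", workers))
}
```

Even when two clusters admit the workload at the same time, `settle` keeps exactly one copy, which is the property the proposal relies on.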
|
||
There will be no job controllers running in the management cluster (or they will be disabled). | |
Only the CRD/job definitions will be deployed there. The MultiKueue controller will copy the status | |
of the job from the worker cluster, so that it will appear that the job | |
is running inside the management cluster. However, as there is no job controller, | |
no pods will be created in the management cluster, and nothing else will overwrite | |
the status copied by the MultiKueue controller. | |
|
||
If the job, for whatever reason, is suspended or deleted in the management cluster, | |
it will be deleted from the worker clusters. Deletion or suspension of the job | |
in a worker cluster only will trigger global requeuing of the job. | |
Once the job finishes in the worker cluster, it will also | |
finish in the management cluster. | |
|
||
### User Stories (Optional) | ||
|
||
#### Story 1 | ||
As a Kueue user I have clusters on different cloud providers and on-prem. | ||
I would like to run computation-heavy jobs across all of them, wherever | ||
I have free resources. | ||
|
||
#### Story 2 | ||
As a Kueue user I have clusters in multiple regions of the same cloud | ||
provider. I would like to run workloads that require the newest GPUs, | ||
whose on-demand availability is very volatile. The GPUs are available | ||
at random times at random regions. I want to use ProvisioningRequest | ||
to try to catch them. | ||
|
||
### Risks and Mitigations | ||
* Disabling the Job controller for all (or selected) objects may be problematic | |
in environments where access to the master configuration is limited (like GKE). | |
We are in talks with SIG-Apps to establish an acceptable way of using a | |
non-default controller (or none at all). | |
|
||
## Design Details | ||
MultiKueue will be enabled on a cluster queue using the admission check fields. | |
Just like ProvisioningRequest, MultiKueue will have its own configuration, | |
MultiKueueConfig, with the following definition. To allow reusing the same clusters | |
across many Kueues, an additional object, MultiKueueCluster, is added. | |
|
||
```go | ||
type MultiKueueConfig struct { | ||
metav1.TypeMeta `json:",inline"` | ||
metav1.ObjectMeta `json:"metadata,omitempty"` | ||
Spec MultiKueueConfigSpec `json:"spec,omitempty"` | ||
} | ||
|
||
type MultiKueueConfigSpec struct { | ||
// Names of the MultiKueueClusters across which the | |
// workloads from the ClusterQueue should be distributed. | |
Clusters []string `json:"clusters,omitempty"` | ||
} | ||
|
||
type MultiKueueCluster struct { | ||
metav1.TypeMeta `json:",inline"` | ||
metav1.ObjectMeta `json:"metadata,omitempty"` | ||
Spec MultiKueueClusterSpec `json:"spec,omitempty"` | ||
Status MultiKueueClusterStatus `json:"status,omitempty"` | ||
} | ||
|
||
// LocationType identifies where the kubeconfig is stored. | |
type LocationType string | |
const ( | |
// Location is the path on the disk. | ||
PathLocationType LocationType = "Path" | ||
|
||
// Location is the name of the secret inside kueue-system namespace. | ||
SecretLocationType LocationType = "Secret" | ||
) | ||
|
||
type MultiKueueClusterSpec struct { | |
// Name of the cluster inside the given KubeConfig. | |
KubeconfigName string `json:"kubeconfigName"` | ||
|
||
// Location of the KubeConfig. | ||
KubeconfigLocation string `json:"kubeconfigLocation"` | ||
|
||
// Type of the KubeConfig location. | ||
KubeconfigLocationType LocationType `json:"kubeconfigLocationType"` | ||
} | ||
|
||
type MultiKueueClusterStatus struct { | |
Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type"` | ||
} | ||
``` | ||
|
||
The MultiKueue controller will monitor all cluster definitions and maintain | |
Kube clients for all of them. Any connectivity problems will be reported in the | |
MultiKueueCluster status as well as in AdmissionCheckStatus and Events. The MultiKueue controller | |
will make sure that whenever the kubeconfig is refreshed, the appropriate | |
clients are also recreated. | |
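
One possible way to detect a refreshed kubeconfig is to compare a digest of its contents, as in the minimal sketch below. The KEP does not prescribe this mechanism: `clientCache`, `remoteClient` and the SHA-256 comparison are assumptions made for illustration, and `remoteClient` is a placeholder rather than a real Kubernetes client.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// remoteClient is a hypothetical placeholder for a real Kubernetes client.
type remoteClient struct {
	cluster    string
	generation int // incremented each time the client is rebuilt
}

// clientCache keeps one client per worker cluster together with the digest
// of the kubeconfig the client was built from.
type clientCache struct {
	clients map[string]*remoteClient
	digests map[string][sha256.Size]byte
}

func newClientCache() *clientCache {
	return &clientCache{
		clients: map[string]*remoteClient{},
		digests: map[string][sha256.Size]byte{},
	}
}

// sync returns the client for the cluster, rebuilding it when the kubeconfig
// bytes differ from those seen last time. The second result reports whether
// a new client was created.
func (c *clientCache) sync(cluster string, kubeconfig []byte) (*remoteClient, bool) {
	digest := sha256.Sum256(kubeconfig)
	old, known := c.clients[cluster]
	if known && c.digests[cluster] == digest {
		return old, false
	}
	generation := 1
	if known {
		generation = old.generation + 1
	}
	client := &remoteClient{cluster: cluster, generation: generation}
	c.clients[cluster] = client
	c.digests[cluster] = digest
	return client, true
}

func main() {
	cache := newClientCache()
	_, created := cache.sync("worker-a", []byte("kubeconfig-v1"))
	fmt.Println("first sync created a client:", created)
	_, created = cache.sync("worker-a", []byte("kubeconfig-v1"))
	fmt.Println("unchanged kubeconfig created a client:", created)
	_, created = cache.sync("worker-a", []byte("kubeconfig-v2"))
	fmt.Println("refreshed kubeconfig created a client:", created)
}
```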
|
||
Creation of kubeconfig files is outside of the MultiKueue scope, and is cloud | |
provider/environment dependent. | |
|
||
The MultiKueue controller, when pushing workloads to the worker clusters, will use the same | |
namespace and local queue names as were used in the management cluster. It is the user's | |
responsibility to set up the appropriate namespaces and local queues. | |
Worker ClusterQueue definitions may differ from those in the management cluster. For example, | |
quota settings may be specific to the given location, and/or the cluster queue may have different | |
admission checks, use ProvisioningRequest, etc. | |
|
||
When distributing the workloads across clusters, the MultiKueue controller will first create | |
Kueue's internal Workload object. Only after the workload is admitted and the other clusters | |
are cleaned up will the real job be created, to match the Workload. That guarantees | |
that the workload will not start in more than one cluster. | |
|
||
While the job is running, the MultiKueue controller will copy its status from the worker cluster | |
to the management cluster, to keep the impression that the job is running in the management | |
cluster. This is needed to allow pipelines and workflow engines to execute against | |
the management cluster with MultiKueue without any extra changes. | |
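
A minimal sketch of the status mirroring, under the assumption that the status is a small comparable struct: `job`, `jobStatus` and `syncStatus` are hypothetical names, and the fields only loosely resemble those of a batch Job status. Skipping the write when nothing changed is a design choice of this example, intended to limit updates when the status changes frequently.

```go
package main

import "fmt"

// jobStatus is a simplified, hypothetical status with a few counters.
type jobStatus struct {
	Active    int32
	Succeeded int32
	Failed    int32
}

// job is a stand-in for the job object present in both clusters.
type job struct {
	Name   string
	Status jobStatus
}

// syncStatus copies the worker's status onto the management copy and reports
// whether an update was needed. Since no job controller runs in the
// management cluster, this copy is the only writer of the status.
func syncStatus(management, worker *job) bool {
	if management.Status == worker.Status {
		return false
	}
	management.Status = worker.Status
	return true
}

func main() {
	mgmt := &job{Name: "job-1"}
	worker := &job{Name: "job-1", Status: jobStatus{Active: 2, Succeeded: 1}}
	fmt.Println("updated:", syncStatus(mgmt, worker))
	fmt.Println("updated again:", syncStatus(mgmt, worker))
}
```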
|
||
### Test Plan | ||
[x] I/we understand the owners of the involved components may require updates to | ||
existing tests to make this code solid enough prior to committing the changes necessary | ||
to implement this enhancement. | ||
|
||
#### Unit Tests | ||
The code will adhere to regular best practices for unit tests and coverage. | ||
|
||
#### Integration tests | ||
Integration tests will be executed against a mocked client that will provide predefined | |
responses and allow testing of various error scenarios, including situations like: | |
* Job is created across multiple clusters and admitted in one. | ||
* Job is admitted at the same time by two clusters. | ||
* Job is rejected by a cluster. | ||
* Worker cluster doesn't have the corresponding namespace. | ||
* Worker cluster doesn't have the corresponding local/cluster queue. | ||
* Worker cluster is unresponsive. | ||
* Worker cluster deletes the job. | ||
* Job is correctly finished. | ||
* Job finishes with an error. | ||
* Job status changes frequently. | ||
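
A mocked client with predefined responses could look roughly like the sketch below. All names here (`workloadCreator`, `fakeClient`, `createWithRetry`, and the two error values) are hypothetical, and the retry-on-unreachable policy is an assumption made only to give the fake something to exercise; the KEP does not define the controller's retry behavior.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical error values used to script failure scenarios.
var (
	errNamespaceNotFound = errors.New("namespace not found")
	errUnreachable       = errors.New("cluster unreachable")
)

// workloadCreator is the narrow, hypothetical slice of client behavior
// that the code under test depends on.
type workloadCreator interface {
	CreateWorkload(namespace, name string) error
}

// fakeClient returns the scripted responses in order and succeeds once the
// script is exhausted.
type fakeClient struct {
	responses []error
	calls     int
}

func (f *fakeClient) CreateWorkload(namespace, name string) error {
	f.calls++
	if len(f.responses) == 0 {
		return nil
	}
	err := f.responses[0]
	f.responses = f.responses[1:]
	return err
}

// createWithRetry retries only while the cluster is unreachable, up to the
// given number of attempts. Any other result is returned immediately.
func createWithRetry(c workloadCreator, namespace, name string, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		err = c.CreateWorkload(namespace, name)
		if !errors.Is(err, errUnreachable) {
			return err
		}
	}
	return err
}

func main() {
	flaky := &fakeClient{responses: []error{errUnreachable, errUnreachable}}
	fmt.Println("flaky cluster:", createWithRetry(flaky, "team-a", "job-1", 3), "calls:", flaky.calls)

	missing := &fakeClient{responses: []error{errNamespaceNotFound}}
	fmt.Println("missing namespace:", createWithRetry(missing, "team-a", "job-1", 3), "calls:", missing.calls)
}
```

Scripting the responses per call makes scenarios such as "worker cluster is unresponsive" or "worker cluster doesn't have the corresponding namespace" deterministic and fast to run.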
|
||
### Graduation Criteria | ||
The feature starts at the alpha level, with a feature gate. | ||
Criteria for graduation to beta: | |
* Positive feedback from users. | |
* Major bugs and deficiencies are either not found or have been fixed. | |
* Roadmap for missing features is defined. | ||
|
||
## Implementation History | ||
* 2023-11-28 Initial KEP. | ||
|
||
## Drawbacks | ||
MultiKueue has some drawbacks. | ||
* Doesn't solve storage problems. | ||
* Requires some manual work to sync configuration and authentication between clusters. | |
* Requires a management cluster. | |
* Requires some external work to disable job controller(s) in the management cluster. | |
|
||
## Alternatives | ||
* Use Armada or Multi Cluster App Dispatcher. | ||
* Use multicluster-specific Job APIs. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
title: MultiKueue | ||
kep-number: 693 | ||
authors: | ||
- "@mwielgus" | ||
status: draft | ||
creation-date: 2023-09-20 | ||
reviewers: | ||
- "@kerthcet" | ||
- "@alculquicondor" | ||
approvers: | ||
- "@alculquicondor" | ||
|
||
# The target maturity stage in the current dev cycle for this KEP. | ||
stage: alpha | |
|
||
# The most recent milestone for which work toward delivery of this KEP has been | ||
# done. This can be the current (upcoming) milestone, if it is being actively | ||
# worked on. | ||
latest-milestone: "v0.6" | ||
|
||
# The milestone at which this feature was, or is targeted to be, at each stage. | ||
milestone: | ||
beta: "v0.7" | ||
|
||
# The following PRR answers are required at alpha release | ||
# List the feature gate name and the components for which it must be enabled | ||
disable-supported: true | ||
|
||
# The following PRR answers are required at beta release | ||
# metrics: | ||
# - my_feature_metric | ||
|