KEP-693: MultiKueue #1380
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
cc: @trasc
cc @dejanzele
I'll do another pass after I'm done with the delegate -> multikueue rename.
keps/693-multikueue/README.md (outdated)
> Then it will remove the workloads from the remaining clusters and allow the single instance of the job to proceed. The workload will be also admitted in the management cluster.
Suggested change:

> Then it will remove the workloads from the remaining remote clusters and allow the single instance of the job to proceed. The local workload will get the admission check set to retry in order to free the local quota.
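As background for the suggestion above, here is a minimal sketch of what setting the admission check to Retry on the local Workload could look like, assuming the kueue v1beta1 Go API and a controller-runtime client. The check name `multikueue` and the function wiring are illustrative assumptions, not the actual implementation.

```go
// Sketch only: marks the "multikueue" admission check on the local Workload as
// Retry, which causes Kueue to requeue the workload and release its quota.
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

func retryLocalWorkload(ctx context.Context, c client.Client, wl *kueue.Workload) error {
	for i := range wl.Status.AdmissionChecks {
		ac := &wl.Status.AdmissionChecks[i]
		if ac.Name != "multikueue" { // hypothetical check name
			continue
		}
		ac.State = kueue.CheckStateRetry
		ac.Message = "remote copy removed; freeing local quota"
		ac.LastTransitionTime = metav1.Now()
	}
	return c.Status().Update(ctx, wl)
}
```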
Why set the admission check to retry? There is no local quota in the management cluster.
How will you handle a worker cluster failing for some reason (network issues, cluster goes down, ...)? Will you have some sort of job leases, with jobs periodically reporting back that they are still executing?
Thanks
LGTM label has been added. Git tree hash: 62b43f0304ec7815182c32fcebbfecfd87bdb00e
@mwielgus Can a management cluster serve concurrently in the role of a worker cluster?
We can imagine the following situation:
- cluster A: management and worker cluster
- cluster B: worker cluster
> When the job is running, the MultiKueue controller will copy its status from the worker cluster to the management cluster, to keep the impression that the job is running in the management cluster. This is needed to allow pipelines and workflow engines to execute against the management cluster with MultiKueue without any extra changes.
What happens when the Kueue manager loses connectivity to a worker cluster after some workloads are admitted?
In particular, what happens with preemption and waitForPodsReady?
We assume total loss of the cluster, and all admitted workloads are suspended/requeued. Once the cluster is reconnected, we remove the duplicated admitted workloads, just as if two of them had been admitted at the same time.
Added to the doc.
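For illustration, a minimal sketch of the deduplication step described above: after a worker cluster reconnects, keep a single remote copy of the workload and delete duplicates from the other worker clusters. The client map, the choice of which copy to keep, and the lookup key are illustrative assumptions.

```go
// Sketch only: removes duplicate remote Workload copies after reconnection,
// keeping only the copy on the cluster named by "keep".
package sketch

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

func dropDuplicateRemoteWorkloads(ctx context.Context, remoteClients map[string]client.Client, keep string, key client.ObjectKey) error {
	for clusterName, c := range remoteClients {
		if clusterName == keep {
			continue // this cluster keeps the running copy
		}
		var wl kueue.Workload
		if err := c.Get(ctx, key, &wl); err != nil {
			if client.IgnoreNotFound(err) == nil {
				continue // nothing to clean up on this cluster
			}
			return err
		}
		if err := c.Delete(ctx, &wl); err != nil {
			return client.IgnoreNotFound(err)
		}
	}
	return nil
}
```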
If preemption targets exist in the cluster that lost connectivity, what happens?
The Kueue scheduler will try to preempt the targets forever, right?
> If preemption targets exist in the cluster that lost connectivity, what happens? The Kueue scheduler will try to preempt the targets forever, right?

I read the updated doc, and now I understand what happens in the above situation.
@tenzen-y Yes, such a configuration will be possible in the future, once we establish kubernetes/enhancements#4370 as a universal standard for selectively disabling controllers for other API/CRD objects. Right now the only option for CRDs in the management cluster is to install the API definitions without the controllers, which prevents a single cluster from serving both roles.
@mwielgus It makes sense. Can we mention that in
@tenzen-y Added.
@mwielgus Thanks! I'm looking forward to MultiKueue 🎉
/lgtm
/approve
LGTM label has been added. Git tree hash: 828418fd930219441c5d1295b67184cce9512373
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, mwielgus, tenzen-y

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
/hold cancel
What type of PR is this?
/kind feature
What this PR does / why we need it:
Introduce MultiCluster support in Kueue.
Which issue(s) this PR fixes:
Fixes #693
Special notes for your reviewer:
Does this PR introduce a user-facing change?