Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-693: MultiKueue #1380

Merged
merged 1 commit into from
Dec 28, 2023
Merged

KEP-693: MultiKueue #1380

merged 1 commit into from
Dec 28, 2023

Conversation

mwielgus
Copy link
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Introduce MultiCluster support in Kueue.

Which issue(s) this PR fixes:

Fixes #693

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 29, 2023
Copy link

netlify bot commented Nov 29, 2023

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit e8c208c
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/658dedb1c39dd70008011116

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 29, 2023
@mwielgus
Copy link
Contributor Author

cc: @trasc

keps/693-multikueue/README.md Outdated Show resolved Hide resolved
keps/693-multikueue/README.md Outdated Show resolved Hide resolved
keps/693-multikueue/README.md Show resolved Hide resolved
keps/693-multikueue/README.md Outdated Show resolved Hide resolved
keps/693-multikueue/README.md Outdated Show resolved Hide resolved
keps/693-multikueue/README.md Show resolved Hide resolved
keps/693-multikueue/README.md Outdated Show resolved Hide resolved
keps/693-multikueue/README.md Show resolved Hide resolved
keps/693-multikueue/kep.yaml Outdated Show resolved Hide resolved
keps/693-multikueue/README.md Outdated Show resolved Hide resolved
@alculquicondor
Copy link
Contributor

cc @nstogner @ahaysx

@alculquicondor
Copy link
Contributor

cc @dejanzele

keps/693-multikueue/README.md Show resolved Hide resolved
keps/693-multikueue/README.md Outdated Show resolved Hide resolved
keps/693-multikueue/README.md Outdated Show resolved Hide resolved
keps/693-multikueue/README.md Outdated Show resolved Hide resolved
keps/693-multikueue/README.md Show resolved Hide resolved
keps/693-multikueue/README.md Outdated Show resolved Hide resolved
keps/693-multikueue/README.md Outdated Show resolved Hide resolved
Copy link
Contributor

@trasc trasc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do another pass after I'm done with the deleate -> multikue rename.

keps/693-multikueue/README.md Outdated Show resolved Hide resolved
keps/693-multikueue/README.md Outdated Show resolved Hide resolved
Comment on lines 67 to 81
Then it will remove the workloads from the remaining clusters and allow the
single instance of the job to proceed. The workload will be also admitted in
the management cluster.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Then it will remove the workloads from the remaining clusters and allow the
single instance of the job to proceed. The workload will be also admitted in
the management cluster.
Then it will remove the workloads from the remaining remote clusters and allow the
single instance of the job to proceed. The local workload will get the admission check set to retry in order to free the local quota.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why to set the setting to retry? There is no local quota in the managment cluster.

keps/693-multikueue/README.md Show resolved Hide resolved
keps/693-multikueue/README.md Show resolved Hide resolved
keps/693-multikueue/README.md Show resolved Hide resolved
keps/693-multikueue/README.md Show resolved Hide resolved
keps/693-multikueue/README.md Show resolved Hide resolved
keps/693-multikueue/README.md Show resolved Hide resolved
keps/693-multikueue/README.md Show resolved Hide resolved
@dejanzele
Copy link

How will you handle if a worker cluster fails for some reason (network issues, cluster goes down...)? Will you have some sort of job leases and job periodically reporting back they are still executing?

Copy link
Contributor

@alculquicondor alculquicondor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

Just incorporate some thoughts based on the last comments from @mimowo.

@tenzen-y anything to add?

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 27, 2023
@tenzen-y
Copy link
Member

tenzen-y commented Dec 28, 2023

/approve

Just incorporate some thoughts based on the last comments from @mimowo.

@tenzen-y anything to add?

I can review this KEP within this week.

@alculquicondor
Copy link
Contributor

Thanks
/lgtm
/hold for @tenzen-y

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 28, 2023
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 28, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 62b43f0304ec7815182c32fcebbfecfd87bdb00e

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mwielgus Can a management cluster serve concurrently in the role of a worker cluster?
We can imagine the following situation:

cluster A: Management And Worker cluster
cluster B: Worker cluster

keps/693-multikueue/README.md Show resolved Hide resolved
Comment on lines +205 to +211
When the job is running MultiKueue controller will copy its status from worker cluster
to the management cluster, to keep the impression that the job is running in the management
cluster. This is needed to allow pipelines and workflow engines to execute against
the management cluster with MultiKueue without any extra changes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when the kueue manager loses connectivity to the worker cluster after some workloads are admitted?

Especially preemption and waitForPodsReady, what happens?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We assume the total loss of the cluster and all admitted workloads are suspended/requeued. Once the cluster is reconnected, we remove duplicated admitted workloads just as if two of them were admitted at the same time.
Added to the doc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If preemption targets exist in the connection loss cluster, what happens?
Kueue scheduler will try to preempt the targets forever, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If preemption targets exist in the connection loss cluster, what happens?
Kueue scheduler will try to preempt the targets forever, right?

I read an updated doc and then I understand what happens in the above situation.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 28, 2023
@tenzen-y
Copy link
Member

@mwielgus Can a management cluster serve concurrently in the role of a worker cluster? We can imagine the following situation:

cluster A: Management And Worker cluster cluster B: Worker cluster

@mwielgus I'm waiting only for this.
If I miss any mentioning, please let me know.

@mwielgus
Copy link
Contributor Author

mwielgus commented Dec 28, 2023

@tenzen-y Yes, such configuration will be possible in the future, once we establish kubernetes/enhancements#4370 as a universal standard for selectively disabling controllers for other API/CRD objects. Right now the only option for CRDs in the management cluster is to install API definitions but without controllers, that prevents allowing two roles inside one cluster.

@tenzen-y
Copy link
Member

@tenzen-y Yes, such configuration will be possible in the future, once we establish kubernetes/enhancements#4370 as a universal standard for disabling controllers for other APIs/CRDs. Right now the only option for CRDs in the management cluster is to install API definitions but without controllers, that prevents allowing two roles inside one cluster.

@mwielgus It makes sense. Can we mention that in NonGoals? I think that we can say that multiple roles inside one cluster without kubernetes/enhancements#4370 can not be supported.

@mwielgus
Copy link
Contributor Author

@tenzen-y Added.

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mwielgus Thanks! I'm looking forward the MultiKueue 🎉
/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 28, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 828418fd930219441c5d1295b67184cce9512373

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, mwielgus, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [alculquicondor,tenzen-y]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tenzen-y
Copy link
Member

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 28, 2023
@k8s-ci-robot k8s-ci-robot merged commit 51a1773 into kubernetes-sigs:main Dec 28, 2023
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.6 milestone Dec 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MVP Multi-cluster support