Support the coscheduling plugin of scheduler-plugins #538

Merged (1 commit into kubeflow:master on Mar 29, 2023)

Conversation

tenzen-y (Member)

I added the implementation for the coscheduling plugin of kubernetes-sigs/scheduler-plugins.

ref: https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/coscheduling

So I modified the following:

Follow ups:

  • Add E2E tests

Part-of: #500

@tenzen-y (Member Author) commented Mar 20, 2023:

I will add the integration tests to this PR.

PTAL
/assign @alculquicondor

k8s.io/klog v1.0.0
k8s.io/kube-openapi v0.0.0-20230109183929-3758b55a6596
k8s.io/utils v0.0.0-20220728103510-ee6ede2d64ed
sigs.k8s.io/controller-runtime v0.13.1
sigs.k8s.io/scheduler-plugins v0.25.7
tenzen-y (Member Author):

We use the newest version since this version introduces a new API group, scheduling.x-k8s.io.

alculquicondor (Collaborator):

there is no 0.26?

tenzen-y (Member Author):

Yes, the newest version is 0.25.
Also, according to kubernetes-sigs/scheduler-plugins#503, scheduler-plugins is released roughly every 3 months, so 0.26 should be released in June:

> Yes, we do. Usually, we cut official releases every ~3 months

@tenzen-y force-pushed the support-scheduler-plugins branch 2 times, most recently from fd79e7d to e5bd5e5, on March 21, 2023 13:46
@tenzen-y changed the title from "WIP: Support the coscheduler plugin of scheduler-plugins" to "Support the coscheduler plugin of scheduler-plugins" on Mar 21, 2023
@tenzen-y changed the title from "Support the coscheduler plugin of scheduler-plugins" to "Support the coscheduling plugin of scheduler-plugins" on Mar 21, 2023
@@ -59,8 +61,8 @@ vet:

.PHONY: test
test:
test: bin/envtest
KUBEBUILDER_ASSETS="$(shell $(ENVTEST) use $(ENVTEST_K8S_VERSION) -p path)" go test -covermode atomic -coverprofile=profile.cov $(shell go list ./... | grep -v '/test/e2e')
test: bin/envtest scheduler-plugins-crd
alculquicondor (Collaborator):

I guess this means that volcano doesn't have integration tests

tenzen-y (Member Author):

Yes, that's right.

(Resolved review threads on cmd/mpi-operator/app/options/options.go and cmd/mpi-operator/app/server.go)

verbs:
- "*"
- apiGroups:
- scheduling.k8s.io
alculquicondor (Collaborator):

both groups?

tenzen-y (Member Author):

Do you mean something like the below?

- apiGroups:
  - scheduling.x-k8s.io
  - scheduling.k8s.io
  resources:
  - podgroups
  - priorityclasses
  verbs:
  - "*"

alculquicondor (Collaborator):

Sorry. I meant to ask if we really need to support both groups or just the newer one. I would vote for just the newer one.

@tenzen-y (Member Author) Mar 22, 2023:

I see.
I added the scheduling.k8s.io group for PriorityClass and the scheduling.x-k8s.io group for PodGroups. This means we support the scheduling.x-k8s.io/v1alpha1 PodGroup and the scheduling.k8s.io/v1 PriorityClass, so we don't support the scheduling.k8s.io/v1alpha1 PodGroup.

@tenzen-y (Member Author) Mar 22, 2023:

Note:

  • OLD: scheduling.k8s.io/v1alpha1 PodGroup
  • NEW: scheduling.x-k8s.io/v1alpha1 PodGroup

@tenzen-y (Member Author) Mar 22, 2023:

IIUC, the PriorityClass belongs to the scheduling.k8s.io API group, right?

alculquicondor (Collaborator):

oops, sorry, I didn't notice it was a different kind.

(Resolved review threads on pkg/controller/mpi_job_controller.go and pkg/controller/mpi_job_controller_test.go)
	continue
}
for i := int32(0); i < *rp.Replicas; i++ {
	for _, c := range rp.Template.Spec.Containers {
alculquicondor (Collaborator):

what about initContainers?

tenzen-y (Member Author):

IMO, it is enough to count only containers, since Pods don't run initContainers and containers simultaneously.

Do you have any concerns?

alculquicondor (Collaborator):

resource calculation in scheduler is: max(initContainers, sum(containers)) + pod overhead.

Also, this is ignoring any limit ranges or defaulting.

But I guess it's ok if this is just an estimate. I would guess that the scheduler ultimately considers the final pods.
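For reference, a minimal Go sketch of that rule (sum of the regular containers vs. the largest init container, plus pod overhead), using only upstream k8s.io/api types. The podRequests helper is illustrative, not code from this PR, and it ignores limit ranges and defaulting, as noted above:

import corev1 "k8s.io/api/core/v1"

// podRequests estimates what the scheduler requests on behalf of one pod:
// max(largest initContainer, sum(containers)) + pod overhead.
func podRequests(spec corev1.PodSpec) corev1.ResourceList {
	reqs := corev1.ResourceList{}
	// Regular containers run concurrently, so their requests are summed.
	for _, c := range spec.Containers {
		for name, q := range c.Resources.Requests {
			total := reqs[name]
			total.Add(q)
			reqs[name] = total
		}
	}
	// Init containers run sequentially, so only the largest request per resource counts.
	for _, c := range spec.InitContainers {
		for name, q := range c.Resources.Requests {
			if cur, ok := reqs[name]; !ok || q.Cmp(cur) > 0 {
				reqs[name] = q
			}
		}
	}
	// Pod overhead, if set, is added on top.
	for name, q := range spec.Overhead {
		total := reqs[name]
		total.Add(q)
		reqs[name] = total
	}
	return reqs
}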

tenzen-y (Member Author):

> But I guess it's ok if this is just an estimate. I would guess that the scheduler ultimately considers the final pods.

I think so too. It's fine to count only containers, since the coscheduling plugin isn't quota-management software :)

@tenzen-y force-pushed the support-scheduler-plugins branch 5 times, most recently from 0efb6d7 to 9450cdc, on March 22, 2023 12:59
@alculquicondor (Collaborator):

I wonder if @denkensk could give a sanity check as one of the authors of the coscheduling plugin.

@tenzen-y (Member Author):

> I wonder if @denkensk could give a sanity check as one of the authors of the coscheduling plugin.

@alculquicondor Thank you for the review!
I'd be happy to have this PR reviewed by @denkensk.

if rp.Replicas == nil {
	continue
}
for i := int32(0); i < *rp.Replicas; i++ {
alculquicondor (Collaborator):

Uhm... can you replace this with a multiplication?

tenzen-y (Member Author):

Does that mean for i := 0; i < (num(containers) × replicas); i++?

alculquicondor (Collaborator):

I mean, avoid iterating over and over the pod containers. You should be able to multiply the resources for 1 pod by the number of replicas
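A rough sketch of that suggestion, assuming a hypothetical addReplicaResources helper (not the code merged in this PR): walk the pod template's containers once to build a per-pod total, then fold that total into the accumulated minimum resources once per replica:

import corev1 "k8s.io/api/core/v1"

// addReplicaResources adds the requests of `replicas` pods built from tmpl into
// minResources, walking the containers only once instead of once per replica.
func addReplicaResources(minResources corev1.ResourceList, tmpl corev1.PodTemplateSpec, replicas int32) {
	// Per-pod total, computed in a single pass over the containers.
	perPod := corev1.ResourceList{}
	for _, c := range tmpl.Spec.Containers {
		for name, q := range c.Resources.Requests {
			total := perPod[name]
			total.Add(q)
			perPod[name] = total
		}
	}
	// Scale by the replica count through repeated addition, which sidesteps the
	// Quantity multiplication issue discussed later in this thread.
	for i := int32(0); i < replicas; i++ {
		for name, q := range perPod {
			total := minResources[name]
			total.Add(q)
			minResources[name] = total
		}
	}
}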

tenzen-y (Member Author):

I see. Makes sense. I will try.

tenzen-y (Member Author):

Done.

@alculquicondor (Collaborator):

/lgtm
/hold
for @denkensk

@@ -374,10 +374,11 @@ func addResources(minResources corev1.ResourceList, resources corev1.ResourceReq
	}
}
for name, quantity := range merged {
	addition := resource.MustParse(strconv.Itoa(int(quantity.Value()) * replicas))
alculquicondor (Collaborator):

Ah, I didn't realize that Quantity doesn't have a multiplication operation.

This is an odd way of calculating it though. Have you verified if this works for something like 1m?

If not, maybe we can revert and follow up with a PR upstream to add multiplication.

@alculquicondor (Collaborator) Mar 23, 2023:

FWIW, in kueue we convert everything to int64 before multiplying, using MilliValue for CPU.
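A minimal sketch of that kueue-style conversion, using resource.NewMilliQuantity from apimachinery; this is illustrative only and is not what was merged here (the multiplication was reverted further down in this thread):

import "k8s.io/apimachinery/pkg/api/resource"

// scaleQuantity multiplies a Quantity by replicas via its int64 milli-value.
// Caveat: going through milli-values can overflow int64 for very large quantities;
// the PR ultimately reverted the multiplication in favor of a follow-up upstream.
func scaleQuantity(q resource.Quantity, replicas int64) *resource.Quantity {
	return resource.NewMilliQuantity(q.MilliValue()*replicas, q.Format)
}

For example, scaling resource.MustParse("500m") by 3 this way yields 1500m, i.e. 1.5 CPU.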

tenzen-y (Member Author):

> Have you verified if this works for something like 1m?

I just verified that case. This way doesn't work well :(

tenzen-y (Member Author):

> FWIW, in kueue we convert everything to int64 before multiplying, using MilliValue for CPU.

I verified that resource.MustParse(strconv.Itoa(int(quantity.MilliValue()) * replicas)) works well.

tenzen-y (Member Author):

However, as you say, this way looks odd.

tenzen-y (Member Author):

> are you sure it gives you the right value?
>
> also, it might not work well with memory.

If we use MilliValue for CPU and memory, it seems to work well. However, it doesn't work well for GPUs.

tenzen-y (Member Author):

However, I'm not sure the above conversion works well in all situations.

So it might be better to revert this change.

alculquicondor (Collaborator):

+1

we should follow up with a multiplication method upstream. It will become increasingly useful for batch operators.

tenzen-y (Member Author):

I will create an issue in k/k tomorrow or next week.

tenzen-y (Member Author):

reverted.

@alculquicondor (Collaborator):

/lgtm

@google-oss-prow bot added the lgtm label on Mar 23, 2023
@denkensk (Member) left a comment:

Thanks @tenzen-y Some small comments.

mpiJobLister listers.MPIJobLister
mpiJobSynced cache.InformerSynced
// podGroupCtrl is a client for PodGroups (volcano and scheduler-plugins).
podGroupCtrl PodGroupControl
denkensk (Member):

Sorry. A little confused. Why do we call it xxControl if it's a client?

@tenzen-y (Member Author) Mar 24, 2023:

I don't have a special reason for calling it "xxControl"; I was inspired by https://github.com/kubeflow/common/tree/83259a047fc47895dab1012728fb829523363a6d/pkg/controller.v1/control. We could also name it "xxClient".


var podGroupCtrl controllersv1.PodGroupControl
if opt.GangSchedulingName == options.GangSchedulerVolcano {
	podGroupCtrl = controllersv1.NewVolcanoCtrl(volcanoClientSet, namespace)
} else if len(opt.GangSchedulingName) != 0 {
denkensk (Member):

How about if I set it to "xxx"? Will it still choose scheduler-plugins? Do we check whether the name is invalid, i.e. neither volcano nor scheduler-plugins?

tenzen-y (Member Author):

If you set any scheduler name other than volcano in the gang-scheduling flag, the mpi-operator uses scheduler-plugins, since we decided not to support arbitrary custom schedulers in kubeflow/common#209 (comment).

return nil, nil, nil, nil, nil, err
}

schedClientSet, err := schedclientset.NewForConfig(restclientset.AddUserAgent(config, "scheduler-plugins"))
denkensk (Member):

Do we need to initialize clientsets for both scheduler-plugins and volcano?

tenzen-y (Member Author):

No, we don't. We may be able to create either clientset in L145-L152 of this file.

tenzen-y (Member Author):

Ah, it might be better to pass the gangSchedulingName to the createClientSets func.

tenzen-y (Member Author):

Done.

if f.gangSchedulingName == options.GangSchedulerVolcano {
	podGroupCtrl = NewVolcanoCtrl(f.volcanoClient, metav1.NamespaceAll)
} else if len(f.gangSchedulingName) != 0 {
	podGroupCtrl = NewSchedulerPluginsCtrl(f.schedClient, metav1.NamespaceAll, "default-scheduler")
denkensk (Member):

is it “default-scheduler”?

tenzen-y (Member Author):

IIUC, if we set up scheduler-plugins in single-scheduler mode, we can use it as part of the default scheduler.

https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/doc/install.md#install-release-v0257-and-use-coscheduling

@google-oss-prow bot removed the lgtm label on Mar 24, 2023
@alculquicondor (Collaborator):

can you squash?

/approve

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
@tenzen-y (Member Author):

@alculquicondor Squashed.

@alculquicondor (Collaborator):

/approve

leaving lgtm to @denkensk

@google-oss-prow (bot):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@denkensk (Member):

/lgtm

Thanks for your contribution. @tenzen-y

@denkensk (Member):

/unhold

@google-oss-prow bot merged commit 2495860 into kubeflow:master on Mar 29, 2023
@tenzen-y deleted the support-scheduler-plugins branch on March 29, 2023 05:10