Support the coscheduling plugin of scheduler-plugins #538
Conversation
I will add the integration tests to this PR. PTAL
k8s.io/klog v1.0.0
k8s.io/kube-openapi v0.0.0-20230109183929-3758b55a6596
k8s.io/utils v0.0.0-20220728103510-ee6ede2d64ed
sigs.k8s.io/controller-runtime v0.13.1
sigs.k8s.io/scheduler-plugins v0.25.7
We use the newest version since this version introduces a new API group, scheduling.x-k8s.io.
there is no 0.26?
Yes, the newest version is 0.25.
Also, according to kubernetes-sigs/scheduler-plugins#503, scheduler-plugins seems to be released every 3 months.
So 0.26 will be released in June.
> Yes, we do. Usually, we cut official releases every ~3 months
Force-pushed from fd79e7d to e5bd5e5.
@@ -59,8 +61,8 @@ vet:

.PHONY: test
-test: bin/envtest
+test: bin/envtest scheduler-plugins-crd
	KUBEBUILDER_ASSETS="$(shell $(ENVTEST) use $(ENVTEST_K8S_VERSION) -p path)" go test -covermode atomic -coverprofile=profile.cov $(shell go list ./... | grep -v '/test/e2e')
I guess this means that volcano doesn't have integration tests
Yes, that's right.
  verbs:
  - "*"
- apiGroups:
  - scheduling.k8s.io
both groups?
Does that mean the below?
- apiGroups:
  - scheduling.x-k8s.io
  - scheduling.k8s.io
  resources:
  - podgroups
  - priorityclasses
  verbs:
  - "*"
Sorry. I meant to ask if we really need to support both groups or just the newer one. I would vote for just the newer one.
I see.
I added scheduling.k8s.io for PriorityClass and scheduling.x-k8s.io for PodGroups. This means we support scheduling.x-k8s.io/v1alpha1 PodGroup and scheduling.k8s.io/v1 PriorityClass. So we don't support scheduling.k8s.io/v1alpha1 PodGroup.
Note:
- OLD: scheduling.k8s.io/v1alpha1 PodGroup
- NEW: scheduling.x-k8s.io/v1alpha1 PodGroup
IIUC, the PriorityClass belongs to the scheduling.k8s.io API group, right?
oops, sorry, I didn't notice it was a different kind.
		continue
	}
	for i := int32(0); i < *rp.Replicas; i++ {
		for _, c := range rp.Template.Spec.Containers {
what about initContainers?
IMO, it is enough to count only containers, since Pods don't run initContainers and containers simultaneously.
Do you have any concerns?
Resource calculation in the scheduler is: max(initContainers, sum(containers)) + pod overhead.
Also, this is ignoring any limit ranges or defaulting.
But I guess it's ok if this is just an estimate. I would guess that the scheduler ultimately considers the final pods.
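For reference, here is a minimal Go sketch of that formula (sum the regular containers, take the element-wise max against each init container, then add pod overhead). The helper names are hypothetical; this is only an illustration of the formula above, not the PR's code:

```go
package estimate

import (
	corev1 "k8s.io/api/core/v1"
)

// podRequests estimates one pod's requests the way the scheduler does:
// max(initContainers, sum(containers)) + pod overhead.
func podRequests(spec *corev1.PodSpec) corev1.ResourceList {
	reqs := corev1.ResourceList{}
	// Regular containers run together, so their requests are summed.
	for _, c := range spec.Containers {
		addList(reqs, c.Resources.Requests)
	}
	// Init containers run one at a time, so only the element-wise maximum matters.
	for _, c := range spec.InitContainers {
		maxList(reqs, c.Resources.Requests)
	}
	// Pod overhead, if set, is added on top.
	addList(reqs, spec.Overhead)
	return reqs
}

// addList adds src into dst, quantity by quantity.
func addList(dst, src corev1.ResourceList) {
	for name, q := range src {
		if cur, ok := dst[name]; ok {
			cur.Add(q)
			dst[name] = cur
		} else {
			dst[name] = q.DeepCopy()
		}
	}
}

// maxList keeps, per resource name, the larger of dst and src.
func maxList(dst, src corev1.ResourceList) {
	for name, q := range src {
		if cur, ok := dst[name]; !ok || q.Cmp(cur) > 0 {
			dst[name] = q.DeepCopy()
		}
	}
}
```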
> But I guess it's ok if this is just an estimate. I would guess that the scheduler ultimately considers the final pods.

I think so too. It's fine with only containers since the coscheduling plugin isn't quota management software :)
Force-pushed from 0efb6d7 to 9450cdc.
I wonder if @denkensk could give a sanity check as one of the authors of the coscheduling plugin.
@alculquicondor Thank you for the review!
	if rp.Replicas == nil {
		continue
	}
	for i := int32(0); i < *rp.Replicas; i++ {
Uhm... can you replace this with a multiplication?
Does that mean for i := 0; i < (num(containers) × replicas); i++?
I mean, avoid iterating over the pod containers over and over. You should be able to multiply the resources for 1 pod by the number of replicas.
I see. Makes sense. I will try.
Done.
/lgtm
Force-pushed from 5945ac5 to 1fb5394.
pkg/controller/podgroup.go (Outdated)
@@ -374,10 +374,11 @@ func addResources(minResources corev1.ResourceList, resources corev1.ResourceReq
		}
	}
	for name, quantity := range merged {
		addition := resource.MustParse(strconv.Itoa(int(quantity.Value()) * replicas))
Ah, I didn't realize that Quantity doesn't have a multiplication operation.
This is an odd way of calculating it, though. Have you verified whether this works for something like 1m?
If not, maybe we can revert and follow up with a PR upstream to add multiplication.
FWIW, in kueue we convert everything to int64 before multiplying, using MilliValue for CPU.
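For illustration, here is a small sketch of that approach (my assumption of the shape, not kueue's or this PR's exact code): scale a Quantity by going through int64, using MilliValue for CPU and Value for everything else.

```go
package scale

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// scaleQuantity multiplies q by replicas via int64. CPU goes through MilliValue
// to keep milli-precision; other resources (memory, GPU, ...) use Value.
func scaleQuantity(name corev1.ResourceName, q resource.Quantity, replicas int64) resource.Quantity {
	if name == corev1.ResourceCPU {
		return *resource.NewMilliQuantity(q.MilliValue()*replicas, q.Format)
	}
	return *resource.NewQuantity(q.Value()*replicas, q.Format)
}
```

With a helper like this, the per-pod ResourceList computed once could simply be scaled by the replica count instead of being re-accumulated per replica, which is the suggestion made above.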
> Have you verified if this works for something like 1m?

I verified that case just now. This way doesn't work well :(
> FWIW, in kueue we convert everything to int64 before multiplying, using MilliValue for CPU.

I verified that resource.MustParse(strconv.Itoa(int(quantity.MilliValue()) * replicas)) works well.
However, as you say, this way looks odd.
are you sure it gives you the right value?
also, it might not work well with memory.
If we use MilliValue for CPU and Memory, it seems to work well. However, for GPU, it doesn't work well.
However, I'm not sure the above conversion works well in all situations.
So it might be better to revert this change.
+1
We should follow up with a multiplication method upstream. It will become increasingly useful for batch operators.
I will create an issue in k/k tomorrow or next week.
Reverted.
Force-pushed from 1fb5394 to 9450cdc.
/lgtm
Thanks @tenzen-y. Some small comments.
	mpiJobLister listers.MPIJobLister
	mpiJobSynced cache.InformerSynced
	// podGroupCtrl is a client for PodGroups (volcano and scheduler-plugins).
	podGroupCtrl PodGroupControl
Sorry. A little confused. Why do we call it xxControl if it's a client?
I don't have a special reason for calling it "xxControl"; I was inspired by https://github.com/kubeflow/common/tree/83259a047fc47895dab1012728fb829523363a6d/pkg/controller.v1/control. So we could also name it "xxClient".
This was probably inspired by podControl in KCM:
https://github.com/kubernetes/kubernetes/blob/c3e7eca7fd38454200819b60e58144d5727f1bbc/pkg/controller/job/job_controller.go#L91
	var podGroupCtrl controllersv1.PodGroupControl
	if opt.GangSchedulingName == options.GangSchedulerVolcano {
		podGroupCtrl = controllersv1.NewVolcanoCtrl(volcanoClientSet, namespace)
	} else if len(opt.GangSchedulingName) != 0 {
What if I set it to "xxx"? Will it still choose scheduler-plugins? Will we check whether the name is invalid, i.e., neither volcano nor scheduler-plugins?
If you set any scheduler name other than volcano in the gang-scheduling flag, the mpi-operator uses scheduler-plugins, since we decided not to support arbitrary custom schedulers in kubeflow/common#209 (comment).
cmd/mpi-operator/app/server.go (Outdated)
		return nil, nil, nil, nil, nil, err
	}

	schedClientSet, err := schedclientset.NewForConfig(restclientset.AddUserAgent(config, "scheduler-plugins"))
Do we need to initialize clientsets for both scheduler-plugins and volcano?
No, we don't. We may be able to create only one of the clientsets in L145-L152 of this file.
Ah, it might be better to pass the gangSchedulingName to the createClientSets func.
Done.
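For reference, a rough sketch of what that could look like; the import paths, the helper name createGangClientSet, and the literal "volcano" (standing in for options.GangSchedulerVolcano) are assumptions, not necessarily what the PR ends up with:

```go
package app

import (
	restclientset "k8s.io/client-go/rest"

	// Assumed import paths for the two generated clientsets.
	schedclientset "sigs.k8s.io/scheduler-plugins/pkg/generated/clientset/versioned"
	volcanoclient "volcano.sh/apis/pkg/client/clientset/versioned"
)

// createGangClientSet builds only the clientset that the selected gang scheduler needs.
func createGangClientSet(config *restclientset.Config, gangSchedulingName string) (volcanoclient.Interface, schedclientset.Interface, error) {
	switch {
	case gangSchedulingName == "volcano": // i.e. options.GangSchedulerVolcano
		c, err := volcanoclient.NewForConfig(restclientset.AddUserAgent(config, "volcano"))
		return c, nil, err
	case gangSchedulingName != "":
		c, err := schedclientset.NewForConfig(restclientset.AddUserAgent(config, "scheduler-plugins"))
		return nil, c, err
	default:
		return nil, nil, nil
	}
}
```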
	if f.gangSchedulingName == options.GangSchedulerVolcano {
		podGroupCtrl = NewVolcanoCtrl(f.volcanoClient, metav1.NamespaceAll)
	} else if len(f.gangSchedulingName) != 0 {
		podGroupCtrl = NewSchedulerPluginsCtrl(f.schedClient, metav1.NamespaceAll, "default-scheduler")
is it “default-scheduler”?
IIUC, if we set up scheduler-plugins in single-scheduler mode, we can use scheduler-plugins as part of the default scheduler.
can you squash? /approve
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Force-pushed from b41aeea to 58f0dfd.
@alculquicondor Squashed.
/approve, leaving lgtm to @denkensk
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: alculquicondor. The full list of commands accepted by this bot can be found here. The pull request process is described here.
/lgtm Thanks for your contribution. @tenzen-y
/unhold
I added the implementation for the coscheduling plugin of kubernetes-sigs/scheduler-plugins.
ref: https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/coscheduling
So I modified the following:
Follow ups:
Part-of: #500