Respect SchedulingPolicy #520
Conversation
Force-pushed c8045ae to 2bc2b65.

Can you summarize the changes for somebody that isn't familiar with volcano? 😅

Sure. I will send a ping to you once this PR description is ready.
```go
if ok := cache.WaitForCacheSync(stopCh, c.podgroupsSynced); !ok {
	return fmt.Errorf("failed to wait for podgroup caches to sync")
}
synced = append(synced, c.podgroupsSynced)
```
we also don't need the priorityClass data unless there is gangSchedulerName enabled?

Yes, that's right. It would be good to sync the priorityclass cache only when gangSchedulerName is set, the same as the podgroup cache.
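A minimal sketch of the conditional sync being discussed. All names here (`gangSchedulerName`, `podgroupsSynced`, `priorityClassesSynced`) are illustrative, and plain bools stand in for client-go's `cache.InformerSynced` functions:

```go
package main

import "fmt"

// controller mimics the relevant MPIJobController fields; names are illustrative.
type controller struct {
	gangSchedulerName     string
	podsSynced            bool
	podgroupsSynced       bool
	priorityClassesSynced bool
}

// cachesToWaitFor collects only the caches the controller actually needs:
// the PodGroup and PriorityClass caches matter only when gang scheduling
// is enabled via gangSchedulerName.
func (c *controller) cachesToWaitFor() []bool {
	synced := []bool{c.podsSynced}
	if c.gangSchedulerName != "" {
		synced = append(synced, c.podgroupsSynced, c.priorityClassesSynced)
	}
	return synced
}

func main() {
	plain := controller{podsSynced: true}
	gang := controller{gangSchedulerName: "volcano", podsSynced: true}
	fmt.Println(len(plain.cachesToWaitFor()), len(gang.cachesToWaitFor()))
}
```

With this shape, a controller running without a gang scheduler never touches the scheduler-specific informers at all.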
I will change the above logic.

still not changed

Done.
pkg/controller/podgroup.go (Outdated)

```go
	}
}
if minResources == nil {
	minResources = c.calcPGMinResources(minMember, mpiJob.Spec.MPIReplicaSpecs)
```
Do we need this? We were not calculating this before, so it wasn't working?

Yes. Previously, we could not use minResources, since we didn't calculate and set it:

mpi-operator/pkg/controller/mpi_job_controller.go
Lines 1299 to 1324 in c21942d
```go
// newPodGroup creates a new PodGroup for an MPIJob
// resource. It also sets the appropriate OwnerReferences on the resource so
// handleObject can discover the MPIJob resource that 'owns' it.
func newPodGroup(mpiJob *kubeflow.MPIJob, minAvailableReplicas int32) *podgroupv1beta1.PodGroup {
	var pName string
	if l := mpiJob.Spec.MPIReplicaSpecs[kubeflow.MPIReplicaTypeLauncher]; l != nil {
		pName = l.Template.Spec.PriorityClassName
		if w := mpiJob.Spec.MPIReplicaSpecs[kubeflow.MPIReplicaTypeWorker]; pName == "" && w != nil {
			pName = w.Template.Spec.PriorityClassName
		}
	}
	return &podgroupv1beta1.PodGroup{
		ObjectMeta: metav1.ObjectMeta{
			Name:      mpiJob.Name,
			Namespace: mpiJob.Namespace,
			OwnerReferences: []metav1.OwnerReference{
				*metav1.NewControllerRef(mpiJob, kubeflow.SchemeGroupVersionKind),
			},
		},
		Spec: podgroupv1beta1.PodGroupSpec{
			MinMember:         minAvailableReplicas,
			Queue:             mpiJob.Annotations[podgroupv1beta1.QueueNameAnnotationKey],
			PriorityClassName: pName,
		},
	}
}
```
In elastic mode, if we make minResources configurable, scheduling only part of the Workers to Nodes is useful when the number of Workers is huge.
```go
var minResources *corev1.ResourceList
if schedulingPolicy := mpiJob.Spec.RunPolicy.SchedulingPolicy; schedulingPolicy != nil {
	if schedulingPolicy.MinAvailable != nil {
		minMember = *schedulingPolicy.MinAvailable
```
what if volcano schedules a number of pods that is less than workerReplicas(mpiJob)? The mpirun command might get stuck waiting for all the workers (unless it's elastic horovod, which has native support for changing the number of workers). We need a way to restrict the usage of this field, or clear documentation that this is not supported for all kinds of MPI applications.
> what if volcano schedules a number of pods that is less than workerReplicas(mpiJob)? The mpirun command might get stuck waiting for all the workers (unless it's elastic horovod that has native support for changing number of workers).

In that case, volcano schedules all Pods. If the cluster does not have enough resources for all Pods, some pods will be marked as Pending.

> We need a way to restrict the usage of this field or have clear documentation that this is not supported for all kinds of MPI applications.

I agree. I think we have 2 options, since we don't have a field indicating whether the MPIJob uses elastic horovod:

- As you say, have clear documentation.
- Add a new member `Elastic bool` (with tag `json:"elastic,omitempty"`) to `RunPolicy` to represent whether the MPIJob uses elastic mode, and then add validations for `minMember` and `minResources`.

I like the second option. What do you think? @alculquicondor
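A sketch of what option 2's validation could look like. The `Elastic` field and the `validateMinAvailable` helper are hypothetical, not part of the actual API:

```go
package main

import "fmt"

// runPolicy is a stand-in for kubeflow's RunPolicy with the proposed Elastic flag.
type runPolicy struct {
	Elastic      bool   // hypothetical: whether the MPIJob uses elastic training
	MinAvailable *int32
}

// validateMinAvailable rejects a MinAvailable below the full replica count
// unless the job declares itself elastic, since a non-elastic mpirun would
// hang waiting for the missing workers.
func validateMinAvailable(rp runPolicy, workerReplicas int32) error {
	if rp.MinAvailable == nil || rp.Elastic {
		return nil
	}
	total := workerReplicas + 1 // workers plus the launcher
	if *rp.MinAvailable < total {
		return fmt.Errorf("minAvailable %d is below %d replicas and requires elastic mode", *rp.MinAvailable, total)
	}
	return nil
}

func main() {
	two := int32(2)
	fmt.Println(validateMinAvailable(runPolicy{MinAvailable: &two}, 4))
	fmt.Println(validateMinAvailable(runPolicy{Elastic: true, MinAvailable: &two}, 4))
}
```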
Are there similar fields in the training operator?

No, most custom job controllers have no field indicating whether the CustomJob uses elastic mode. This means users can set any value in `schedulingPolicy.MinMember`. The pytorchjob-controller has `ElasticPolicy`, although that field doesn't affect schedulingPolicy.

in that case, I think we should proceed with option 1 for now.

That said, there is currently no documentation about SchedulingPolicy overall :)
> in that case, I think we should proceed with option 1 for now.

I agree. Maybe we can add documentation to https://www.kubeflow.org/docs/components/training/job-scheduling/.

> That said, there is currently no documentation about SchedulingPolicy overall

You're right...

another thought: even if we add an "elastic" field, it doesn't mean that the application will support it. So we need to clarify it in the documentation anyways.

> So we need to clarify it in the documentation anyways.

I agree. We can probably discuss whether to add an `elastic` field elsewhere (an issue or PR). In this PR, we can just pass `SchedulingPolicy.MinAvailable` to `PodGroup.Spec.MinMember` and then add docs about that.
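The resulting behavior can be sketched as follows (the helper name is illustrative, not the controller's actual code): an explicit `MinAvailable` wins, and otherwise `minMember` covers every worker plus the launcher.

```go
package main

import "fmt"

// calcMinMember models the propagation being agreed on here:
// SchedulingPolicy.MinAvailable is passed through to PodGroup.Spec.MinMember
// when set; otherwise the PodGroup requires all workers plus the launcher.
func calcMinMember(minAvailable *int32, workerReplicas int32) int32 {
	if minAvailable != nil {
		return *minAvailable
	}
	return workerReplicas + 1 // workers + launcher
}

func main() {
	fmt.Println(calcMinMember(nil, 4)) // defaults to all replicas
	two := int32(2)
	fmt.Println(calcMinMember(&two, 4)) // explicit MinAvailable wins
}
```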
pkg/controller/podgroup.go (Outdated)

```go
	return ""
}

func (c *MPIJobController) calcPGMinResources(
```
add a comment for what's happening here

Sure.

Still waiting. What does volcano use that field for? If it's all added up, how does it know whether the resources are consumed by one or multiple pods?
> What does volcano use that field for? If it's all added up, how does it know whether they are consumed by one or multiple pods?

I tried to dive into volcano. Volcano has a Queue resource, like ClusterQueue + LocalQueue in Kueue. If the Queue doesn't have enough resources for the `minResources` requested in the PodGroup, the volcano scheduler doesn't schedule the Pods to Nodes. Does that make sense? My understanding might be incomplete; I will try to investigate volcano more.
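A toy model of that gang-admission behavior, with CPU millicores standing in for a full ResourceList. All types and names here are illustrative, not volcano's actual implementation:

```go
package main

import "fmt"

// queue is a toy stand-in for a volcano Queue with some free capacity.
type queue struct {
	freeCPUMilli int64
}

// admit models all-or-nothing admission: a PodGroup is only admitted when the
// Queue can cover its entire minResources request; otherwise none of its pods
// are scheduled.
func (q *queue) admit(minResourcesCPUMilli int64) bool {
	if minResourcesCPUMilli > q.freeCPUMilli {
		return false
	}
	q.freeCPUMilli -= minResourcesCPUMilli
	return true
}

func main() {
	q := &queue{freeCPUMilli: 1000}
	fmt.Println(q.admit(600), q.admit(600)) // the second group no longer fits
}
```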
makes sense... it looks like a quota system, just like kueue. I wonder if they would be interested in reviewing this PR. Do you know anybody?

Maybe @shinytang6 is also interested in this PR.
cc: @shinytang6
@alculquicondor They don't seem to have enough bandwidth to review this PR. So I'm thinking of removing `calcPGMinResources` and instead passing `SchedulingPolicy.MinResources` to `PodGroup.Spec.MinResources`. Also, it would be good to create an issue to keep tracking this. WDYT?

SGTM, we don't want to introduce functionality we aren't sure about.

> we don't want to introduce functionality we aren't sure about.

I agree.

Done.
@alculquicondor Updated PR description.
```go
if ok := cache.WaitForCacheSync(stopCh, c.podgroupsSynced); !ok {
	return fmt.Errorf("failed to wait for podgroup caches to sync")
}
synced = append(synced, c.podgroupsSynced)
```
still not changed
```go
// newPodGroup creates a new PodGroup for an MPIJob
// resource. It also sets the appropriate OwnerReferences on the resource so
// handleObject can discover the MPIJob resource that 'owns' it.
```
explain how minResources is calculated. What does volcano use that field for? If it's all added up, how does it know whether they are consumed by one or multiple pods?

> explain how minResources is calculated.

Sure.

> What does volcano use that field? If it's all added up, how does it know whether they are consumed by one or multiple pods?

I will answer in another thread.
add in the comment the fields that this function populates and how

Done.
pkg/controller/podgroup.go (Outdated)

```go
	return ""
}

func (c *MPIJobController) calcPGMinResources(
```
Still waiting. What does volcano use that field for? If it's all added up, how does it know whether they are consumed by one or multiple pods?
Force-pushed 792c859 to 37e464d.

@alculquicondor I have addressed your comments. Please take another look.
Force-pushed 299b426 to 0eebc9a.
cmd/mpi-operator/app/server.go (Outdated)

```diff
@@ -46,6 +46,7 @@ import (

 	"github.com/kubeflow/mpi-operator/cmd/mpi-operator/app/options"
+	mpijobclientset "github.com/kubeflow/mpi-operator/pkg/client/clientset/versioned"
 	kubeflowScheme "github.com/kubeflow/mpi-operator/pkg/client/clientset/versioned/scheme"
```
Suggested change:

```diff
-	kubeflowScheme "github.com/kubeflow/mpi-operator/pkg/client/clientset/versioned/scheme"
+	kubeflowscheme "github.com/kubeflow/mpi-operator/pkg/client/clientset/versioned/scheme"
```

Done.
```go
	ScheduleTimeoutSeconds *int32 `json:"scheduleTimeoutSeconds,omitempty"`
	// MinAvailable defines the minimal number of members to run the PodGroup.
	// If the gang-scheduling is set to volcano,
	// input is passed to `.spec.minMember` in PodGroup for volcano.
```
Explain that this will only work if the MPI application supports resizing.

Sounds good.

Done.

If not set, it defaults to the number of workers
pkg/controller/mpi_job_controller.go (Outdated)

```go
	AddFunc:    controller.handleObject,
	UpdateFunc: controller.handleObjectUpdate,
	DeleteFunc: controller.handleObject,
```
Does it make sense to add an event handler for priorityClass? They are not owned by the MPIJob.

You're right.

Done.
```go
// newPodGroup creates a new PodGroup for an MPIJob
// resource. It also sets the appropriate OwnerReferences on the resource so
// handleObject can discover the MPIJob resource that 'owns' it.
```
add in the comment the fields that this function populates and how
pkg/controller/podgroup.go (Outdated)

```go
	schedulingPolicy *kubeflow.SchedulingPolicy,
) string {
	if schedulingPolicy != nil && len(schedulingPolicy.PriorityClass) != 0 &&
		c.priorityClassExist(schedulingPolicy.PriorityClass) {
```
Should MPIJob care if it exists? I think it's up to Volcano to decide what to do.

Makes sense.

We can leave it to volcano.

Then let's clean up the informer.

Sure.

Done.
Force-pushed 35b124c to 55a104c.

@alculquicondor Updated. PTAL.
pkg/apis/kubeflow/v2beta1/types.go (Outdated)

```go
	// MinAvailable defines the minimal number of members to run the PodGroup.
	// If the gang-scheduling is set to volcano,
	// input is passed to `.spec.minMember` in PodGroup for volcano.
	// Also, this parameter will function properly only when we use elastic training (e.g., Elastic Horovod).
```
Suggested change:

```diff
-	// Also, this parameter will function properly only when we use elastic training (e.g., Elastic Horovod).
+	// When using this field, you need to make sure the application supports resizing (e.g., Elastic Horovod).
```
```go
	ScheduleTimeoutSeconds *int32 `json:"scheduleTimeoutSeconds,omitempty"`
	// MinAvailable defines the minimal number of members to run the PodGroup.
	// If the gang-scheduling is set to volcano,
	// input is passed to `.spec.minMember` in PodGroup for volcano.
```

If not set, it defaults to the number of workers
Force-pushed 500beb3 to c292f7e.

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>

Force-pushed c292f7e to 68b2861.
@alculquicondor I addressed your suggestions and squashed commits into one. Also, I will add docs for the schedulingPolicy to https://www.kubeflow.org/.

/lgtm

@alculquicondor Created kubeflow/website#3453.

@alculquicondor Do we have anything blocking merging?

oops, no, I just forgot to approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: alculquicondor. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Signed-off-by: Yuki Iwai yuki.iwai.tz@gmail.com
I changed the logic for the gang-scheduling so that the mpi-operator respects SchedulingPolicy when creating the PodGroup. Mainly, I modified the following:

3. Set `PodGroupSpec.MinResources`:
   a. iff `SchedulingPolicy.MinAvailable` isn't empty, propagate that to the PodGroup.
   b. In the case of `PodGroupSpec.MinMember < MPIJobSpec.MPIReplicaSpecs[Worker].Replicas + 1`, sort `MPIJobSpec.MPIReplicaSpecs` in descending order according to priorityClass, and then add container resources to `PodGroupSpec.MinResources`. However, the total value of `MPIJobSpec.MPIReplicaSpec.Replicas` to be added must not exceed `PodGroupSpec.MinMember`.

Fixes: #518
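The minResources calculation in step 3b can be sketched like this, with a single CPU quantity in millicores standing in for a full corev1.ResourceList. The types and names here are illustrative, not the actual controller code:

```go
package main

import (
	"fmt"
	"sort"
)

// replica is a simplified MPIReplicaSpec: a priority value, a replica count,
// and a per-pod CPU request in millicores.
type replica struct {
	priority int32
	replicas int32
	cpuMilli int64
}

// calcPGMinResources sorts replicas by priority (highest first) and sums
// per-pod requests until minMember pods have been counted, so minResources
// only covers the highest-priority minMember pods.
func calcPGMinResources(minMember int32, specs []replica) int64 {
	sort.Slice(specs, func(i, j int) bool { return specs[i].priority > specs[j].priority })
	var total int64
	var counted int32
	for _, s := range specs {
		for r := int32(0); r < s.replicas && counted < minMember; r++ {
			total += s.cpuMilli
			counted++
		}
	}
	return total
}

func main() {
	specs := []replica{
		{priority: 5, replicas: 4, cpuMilli: 200},  // workers
		{priority: 10, replicas: 1, cpuMilli: 100}, // launcher
	}
	fmt.Println(calcPGMinResources(3, specs)) // launcher + two workers
}
```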
/assign @alculquicondor