
Respect SchedulingPolicy #520

Merged
merged 1 commit on Feb 28, 2023

Conversation

tenzen-y
Member

@tenzen-y tenzen-y commented Feb 8, 2023

Signed-off-by: Yuki Iwai yuki.iwai.tz@gmail.com

I changed the gang-scheduling logic so that the mpi-operator respects SchedulingPolicy when creating a PodGroup.
Mainly, I modified the following:

  1. When setting a priorityClass on the PodGroup, "SchedulingPolicy.PriorityClass" is preferred over the priorityClassName in "ReplicaSpecs.PodTemplateSpec".
  2. When setting a queueName on the PodGroup, "SchedulingPolicy.Queue" is preferred over the annotation "scheduling.volcano.sh/queue-name" (both sketched below).
  3. Set "PodGroupSpec.MinResources":
     a. If "SchedulingPolicy.MinAvailable" isn't empty, propagate that to the PodGroup.
     b. If PodGroupSpec.MinMember < MPIJobSpec.MPIReplicaSpecs[Worker].Replicas + 1, sort "MPIJobSpec.MPIReplicaSpecs" in descending order by priorityClass and then add container resources to "PodGroupSpec.MinResources"; however, the total number of "MPIJobSpec.MPIReplicaSpec.Replicas" added must not exceed "PodGroupSpec.MinMember".
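
For reviewers unfamiliar with the codebase, a minimal sketch of the precedence in items 1 and 2. The helper names, package name, and import paths are illustrative assumptions, not the exact code in this PR:

package controller

import (
	kubeflow "github.com/kubeflow/mpi-operator/pkg/apis/kubeflow/v2beta1"
	podgroupv1beta1 "volcano.sh/apis/pkg/apis/scheduling/v1beta1"
)

// calcPriorityClassName prefers SchedulingPolicy.PriorityClass and falls back to
// the priorityClassName on the launcher's, then the worker's, pod template.
func calcPriorityClassName(mpiJob *kubeflow.MPIJob) string {
	if sp := mpiJob.Spec.RunPolicy.SchedulingPolicy; sp != nil && len(sp.PriorityClass) != 0 {
		return sp.PriorityClass
	}
	if l := mpiJob.Spec.MPIReplicaSpecs[kubeflow.MPIReplicaTypeLauncher]; l != nil && len(l.Template.Spec.PriorityClassName) != 0 {
		return l.Template.Spec.PriorityClassName
	}
	if w := mpiJob.Spec.MPIReplicaSpecs[kubeflow.MPIReplicaTypeWorker]; w != nil {
		return w.Template.Spec.PriorityClassName
	}
	return ""
}

// calcQueueName prefers SchedulingPolicy.Queue and falls back to the
// scheduling.volcano.sh/queue-name annotation.
func calcQueueName(mpiJob *kubeflow.MPIJob) string {
	if sp := mpiJob.Spec.RunPolicy.SchedulingPolicy; sp != nil && len(sp.Queue) != 0 {
		return sp.Queue
	}
	return mpiJob.Annotations[podgroupv1beta1.QueueNameAnnotationKey]
}

Falling back to the pod-template values keeps existing MPIJobs that don't set a SchedulingPolicy working as before.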

Fixes: #518

/assign @alculquicondor

@tenzen-y tenzen-y force-pushed the respect-scheduling-policy branch 4 times, most recently from c8045ae to 2bc2b65 on February 9, 2023 at 06:57
@alculquicondor
Collaborator

Can you summarize the changes for somebody that isn't familiar with volcano? 😅

@tenzen-y
Member Author

tenzen-y commented Feb 9, 2023

Can you summarize the changes for somebody that isn't familiar with volcano? 😅

Sure. I'll ping you once the PR description is ready.

if ok := cache.WaitForCacheSync(stopCh, c.podgroupsSynced); !ok {
	return fmt.Errorf("failed to wait for podgroup caches to sync")
}
synced = append(synced, c.podgroupsSynced)
Collaborator

we also don't need the priorityClass data unless gangSchedulerName is enabled?

Member Author

Yes, that's right.
It would be good to sync priorityClass only when gangSchedulerName is set, the same as we do for podgroups.

Member Author

I will change the above logic.

Collaborator

still not changed

Member Author

Done.

}
}
if minResources == nil {
minResources = c.calcPGMinResources(minMember, mpiJob.Spec.MPIReplicaSpecs)
Collaborator

Do we need this? We were not calculating this before, so it wasn't working?

Member Author

Yes. Previously, we could not use minResources since we didn't calculate or set it:

// newPodGroup creates a new PodGroup for an MPIJob
// resource. It also sets the appropriate OwnerReferences on the resource so
// handleObject can discover the MPIJob resource that 'owns' it.
func newPodGroup(mpiJob *kubeflow.MPIJob, minAvailableReplicas int32) *podgroupv1beta1.PodGroup {
	var pName string
	if l := mpiJob.Spec.MPIReplicaSpecs[kubeflow.MPIReplicaTypeLauncher]; l != nil {
		pName = l.Template.Spec.PriorityClassName
		if w := mpiJob.Spec.MPIReplicaSpecs[kubeflow.MPIReplicaTypeWorker]; pName == "" && w != nil {
			pName = w.Template.Spec.PriorityClassName
		}
	}
	return &podgroupv1beta1.PodGroup{
		ObjectMeta: metav1.ObjectMeta{
			Name:      mpiJob.Name,
			Namespace: mpiJob.Namespace,
			OwnerReferences: []metav1.OwnerReference{
				*metav1.NewControllerRef(mpiJob, kubeflow.SchemeGroupVersionKind),
			},
		},
		Spec: podgroupv1beta1.PodGroupSpec{
			MinMember:         minAvailableReplicas,
			Queue:             mpiJob.Annotations[podgroupv1beta1.QueueNameAnnotationKey],
			PriorityClassName: pName,
		},
	}
}

Member Author
@tenzen-y tenzen-y Feb 10, 2023

In elastic mode, if we make minResources configurable, being able to schedule only part of the Workers onto Nodes is useful when the number of Workers is huge.
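
For context, a rough sketch of the calculation described in item 3b of the PR description. The function and parameter names, the priorityValueFor lookup, and the import paths are assumptions for illustration, and the PR later drops this calculation (see below):

package controller

import (
	"sort"

	corev1 "k8s.io/api/core/v1"

	kubeflow "github.com/kubeflow/mpi-operator/pkg/apis/kubeflow/v2beta1"
)

// calcPGMinResources orders the replica types by the value of their
// PriorityClass (highest first) and accumulates the container requests of up to
// minMember pods, so MinResources only covers the pods volcano must be able to
// place. priorityValueFor is a hypothetical lookup, e.g. backed by a
// PriorityClass lister.
func calcPGMinResources(
	mpiJob *kubeflow.MPIJob,
	minMember int32,
	priorityValueFor func(priorityClassName string) int32,
) *corev1.ResourceList {
	type group struct {
		priority int32
		count    int32
		podSpec  corev1.PodSpec
	}
	var groups []group
	for _, rs := range mpiJob.Spec.MPIReplicaSpecs {
		if rs == nil {
			continue
		}
		count := int32(1)
		if rs.Replicas != nil {
			count = *rs.Replicas
		}
		groups = append(groups, group{
			priority: priorityValueFor(rs.Template.Spec.PriorityClassName),
			count:    count,
			podSpec:  rs.Template.Spec,
		})
	}
	// Highest priority first, so the most important pods are covered first.
	sort.Slice(groups, func(i, j int) bool { return groups[i].priority > groups[j].priority })

	minResources := corev1.ResourceList{}
	remaining := minMember
	for _, g := range groups {
		if g.count > remaining {
			g.count = remaining
		}
		for i := int32(0); i < g.count; i++ {
			for _, c := range g.podSpec.Containers {
				for name, q := range c.Resources.Requests {
					acc := minResources[name]
					acc.Add(q)
					minResources[name] = acc
				}
			}
		}
		remaining -= g.count
		if remaining == 0 {
			break
		}
	}
	return &minResources
}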

var minResources *corev1.ResourceList
if schedulingPolicy := mpiJob.Spec.RunPolicy.SchedulingPolicy; schedulingPolicy != nil {
	if schedulingPolicy.MinAvailable != nil {
		minMember = *schedulingPolicy.MinAvailable
Collaborator

what if volcano schedules a number of pods that is less than workerReplicas(mpiJob)?
The mpirun command might get stuck waiting for all the workers (unless it's elastic Horovod, which has native support for changing the number of workers).

We need a way to restrict the usage of this field or have clear documentation that this is not supported for all kinds of MPI applications.

Member Author

what if volcano schedules a number of pods that is less than workerReplicas(mpiJob)?
The mpirun command might get stuck waiting for all the workers (unless it's elastic Horovod, which has native support for changing the number of workers).

In that case, volcano schedules all Pods. If the cluster does not have enough resources for all Pods, some pods will be marked as Pending.

We need a way to restrict the usage of this field or have clear documentation that this is not supported for all kinds of MPI applications.

I agree. Since we don't have a field that indicates whether the MPIJob uses elastic Horovod, I think we have 2 options:

  1. As you say, have clear documentation.
  2. Add a new member, 'Elastic bool json:elastic,omitempty', to RunPolicy to represent whether the MPIJob uses elastic mode, and then add validations for minMember and minResources.

I like the second option. What do you think? @alculquicondor

Collaborator

Are there similar fields in the training operator?

Member Author

No, most custom-job controllers have no field indicating whether the CustomJob uses elastic mode, which means users can set any value for schedulingPolicy.MinMember. The pytorchjob-controller has ElasticPolicy, although that field doesn't affect schedulingPolicy.

https://github.com/kubeflow/training-operator/blob/6ce98387a1a68b79365495a6344f7f19b48557d3/pkg/apis/kubeflow.org/v1/pytorch_types.go#L79

Collaborator

in that case, I think we should proceed with option 1 for now.

Collaborator

That said, there is currently no documentation about SchedulingPolicy overall :)

Member Author

in that case, I think we should proceed with option 1 for now.

I agree. Maybe we can add documentation to https://www.kubeflow.org/docs/components/training/job-scheduling/.

That said, there is currently no documentation about SchedulingPolicy overall

You're right...

Collaborator

another thought: even if we add an "elastic" field, it doesn't mean that the application will support it. So we need to clarify it in the documentation anyways.

Member Author

So we need to clarify it in the documentation anyways.

I agree.

We can probably discuss whether to add an elastic field elsewhere (in another issue or PR).
In this PR, we can just pass SchedulingPolicy.MinAvailable to PodGroup.Spec.MinMember and then add docs about that.
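
A minimal sketch of that, assuming a helper of this shape (the name and import path are illustrative, not the PR's actual code):

package controller

import (
	kubeflow "github.com/kubeflow/mpi-operator/pkg/apis/kubeflow/v2beta1"
)

// calcMinMember defaults the PodGroup's MinMember to the worker count plus one
// (the launcher) and lets SchedulingPolicy.MinAvailable override it.
func calcMinMember(mpiJob *kubeflow.MPIJob, workerReplicas int32) int32 {
	minMember := workerReplicas + 1
	if sp := mpiJob.Spec.RunPolicy.SchedulingPolicy; sp != nil && sp.MinAvailable != nil {
		minMember = *sp.MinAvailable
	}
	return minMember
}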

return ""
}

func (c *MPIJobController) calcPGMinResources(
Collaborator

add a comment for what's happening here

Member Author

Sure.

Collaborator

Still waiting.

What does volcano use that field for? If it's all added up, how does it know whether the resources are consumed by one or multiple pods?

Member Author
@tenzen-y tenzen-y Feb 13, 2023

What does volcano use that field for? If it's all added up, how does it know whether the resources are consumed by one or multiple pods?

I tried to dive into volcano. Volcano has a Queue resource, similar to ClusterQueue + LocalQueue in Kueue.
If the Queue doesn't have enough resources for the minResources requested by the PodGroup, the volcano-scheduler doesn't schedule the Pods to Nodes.

Does that make sense?

My understanding might be incomplete. I will try to investigate volcano more.
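
To make that concrete, an illustrative PodGroup showing how Queue and MinResources interact: volcano admits the group, and therefore its pods, only when the named Queue has at least MinResources to spare. The queue name and quantities below are made up, not output from the controller:

package controller

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	podgroupv1beta1 "volcano.sh/apis/pkg/apis/scheduling/v1beta1"
)

// examplePodGroup is illustrative only: the rough shape of the PodGroup the
// controller creates for an MPIJob when gang-scheduling with volcano.
func examplePodGroup() *podgroupv1beta1.PodGroup {
	return &podgroupv1beta1.PodGroup{
		ObjectMeta: metav1.ObjectMeta{Name: "pi", Namespace: "default"},
		Spec: podgroupv1beta1.PodGroupSpec{
			MinMember:         3, // launcher + 2 workers
			Queue:             "team-a",
			PriorityClassName: "high-priority",
			MinResources: &corev1.ResourceList{
				corev1.ResourceCPU:    resource.MustParse("6"),
				corev1.ResourceMemory: resource.MustParse("12Gi"),
			},
		},
	}
}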

Collaborator

Makes sense... it looks like a quota system, just like Kueue. I wonder if they would be interested in reviewing this PR. Do you know anybody?

Member Author

Maybe, @shinytang6 is also interested in this PR.
cc: @shinytang6

Member Author
@tenzen-y tenzen-y Feb 17, 2023

@alculquicondor They don't seem to have enough bandwidth to review this PR.
So, I'm thinking of removing func calcPGMinResources and then passing SchedulingPolicy.MinResources to PodGroup.Spec.MinResources.

Also, it would be good to create an issue to keep track of this.

WDYT?
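
A minimal sketch of that passthrough, assuming SchedulingPolicy.MinResources shares the *corev1.ResourceList type used in the controller snippet earlier in this thread (the helper name and import paths are illustrative):

package controller

import (
	corev1 "k8s.io/api/core/v1"

	kubeflow "github.com/kubeflow/mpi-operator/pkg/apis/kubeflow/v2beta1"
)

// minResourcesFor forwards SchedulingPolicy.MinResources unchanged and returns
// nil when the user did not set it, instead of computing it from the replicas.
func minResourcesFor(mpiJob *kubeflow.MPIJob) *corev1.ResourceList {
	if sp := mpiJob.Spec.RunPolicy.SchedulingPolicy; sp != nil {
		return sp.MinResources
	}
	return nil
}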

Collaborator

SGTM, we don't want to introduce functionality we aren't sure about.

Member Author

we don't want to introduce functionality we aren't sure about.

I agree.

Member Author

Done.

@tenzen-y
Member Author

@alculquicondor Updated PR description.

if ok := cache.WaitForCacheSync(stopCh, c.podgroupsSynced); !ok {
	return fmt.Errorf("failed to wait for podgroup caches to sync")
}
synced = append(synced, c.podgroupsSynced)
Collaborator

still not changed


// newPodGroup creates a new PodGroup for an MPIJob
// resource. It also sets the appropriate OwnerReferences on the resource so
// handleObject can discover the MPIJob resource that 'owns' it.
Collaborator

explain how minResources is calculated.

What does volcano use that field for? If it's all added up, how does it know whether the resources are consumed by one or multiple pods?

Member Author

explain how minResources is calculated.

Sure.

What does volcano use that field for? If it's all added up, how does it know whether the resources are consumed by one or multiple pods?

I will answer in another thread.

Collaborator

add in the comment the fields that this function populates and how

Member Author

Done.

return ""
}

func (c *MPIJobController) calcPGMinResources(
Collaborator

Still waiting.

What does volcano use that field for? If it's all added up, how does it know whether the resources are consumed by one or multiple pods?

@tenzen-y tenzen-y changed the title Respect SchedulingPolicy WIP: Respect SchedulingPolicy Feb 17, 2023
@tenzen-y tenzen-y force-pushed the respect-scheduling-policy branch 2 times, most recently from 792c859 to 37e464d on February 17, 2023 at 20:52
@tenzen-y tenzen-y changed the title WIP: Respect SchedulingPolicy Respect SchedulingPolicy Feb 18, 2023
@tenzen-y
Member Author

@alculquicondor I have addressed your comments. Please take another look.

@tenzen-y tenzen-y force-pushed the respect-scheduling-policy branch 2 times, most recently from 299b426 to 0eebc9a on February 18, 2023 at 12:19
@@ -46,6 +46,7 @@ import (

	"github.com/kubeflow/mpi-operator/cmd/mpi-operator/app/options"
	mpijobclientset "github.com/kubeflow/mpi-operator/pkg/client/clientset/versioned"
	kubeflowScheme "github.com/kubeflow/mpi-operator/pkg/client/clientset/versioned/scheme"
Collaborator

Suggested change
-	kubeflowScheme "github.com/kubeflow/mpi-operator/pkg/client/clientset/versioned/scheme"
+	kubeflowscheme "github.com/kubeflow/mpi-operator/pkg/client/clientset/versioned/scheme"

Member Author

Done.

ScheduleTimeoutSeconds *int32 `json:"scheduleTimeoutSeconds,omitempty"`
// MinAvailable defines the minimal number of member to run the PodGroup.
// If the gang-scheduling is set to the volcano,
// input is passed to `.spec.minMember` in PodGroup for the volcano.
Collaborator

Explain that this will only work if the MPI application supports resizing.

Member Author

Sounds good.

Member Author

Done.

Collaborator

If not set, it defaults to the number of workers

Comment on lines 381 to 383
	AddFunc:    controller.handleObject,
	UpdateFunc: controller.handleObjectUpdate,
	DeleteFunc: controller.handleObject,
Collaborator

Does it make sense to add an event handler for priorityClass? They are not owned by the MPIJob.

Member Author

You're right.

Member Author

Done.


// newPodGroup creates a new PodGroup for an MPIJob
// resource. It also sets the appropriate OwnerReferences on the resource so
// handleObject can discover the MPIJob resource that 'owns' it.
Collaborator

add in the comment the fields that this function populates and how

	schedulingPolicy *kubeflow.SchedulingPolicy,
) string {
	if schedulingPolicy != nil && len(schedulingPolicy.PriorityClass) != 0 &&
		c.priorityClassExist(schedulingPolicy.PriorityClass) {
Collaborator

Should MPIJob care if it exists? I think it's up to Volcano to decide what to do.

Member Author

Makes sense.

Member Author

We can leave it to volcano.

Collaborator

Then let's clean up the informer.

Member Author

Sure.

Member Author

Done.

@tenzen-y tenzen-y force-pushed the respect-scheduling-policy branch 3 times, most recently from 35b124c to 55a104c on February 21, 2023 at 20:07
@google-oss-prow google-oss-prow bot added size/L and removed size/XL labels Feb 21, 2023
@tenzen-y
Member Author

@alculquicondor Updated. PTAL.

// MinAvailable defines the minimal number of member to run the PodGroup.
// If the gang-scheduling is set to the volcano,
// input is passed to `.spec.minMember` in PodGroup for the volcano.
// Also, this parameter will function properly only when we use elastic training (e.g., Elastic Horovod).
Collaborator

Suggested change
- // Also, this parameter will function properly only when we use elastic training (e.g., Elastic Horovod).
+ // When using this field, you need to make sure the application supports resizing (e.g., Elastic Horovod).

ScheduleTimeoutSeconds *int32 `json:"scheduleTimeoutSeconds,omitempty"`
// MinAvailable defines the minimal number of member to run the PodGroup.
// If the gang-scheduling is set to the volcano,
// input is passed to `.spec.minMember` in PodGroup for the volcano.
Collaborator

If not set, it defaults to the number of workers

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
@tenzen-y
Member Author

@alculquicondor I addressed your suggestions and squashed commits into one.

@tenzen-y
Member Author

Also, I will add docs for the schedulingPolicy to https://www.kubeflow.org/.

@alculquicondor
Collaborator

/lgtm

@tenzen-y
Member Author

@alculquicondor Created kubeflow/website#3453.

@tenzen-y
Member Author

@alculquicondor Is anything blocking the merge?

@alculquicondor
Collaborator

oops, no, I just forgot to approve
/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
