
Respect SchedulingPolicy #520

Merged
merged 1 commit on Feb 28, 2023

Conversation

tenzen-y
Member

@tenzen-y tenzen-y commented Feb 8, 2023

Signed-off-by: Yuki Iwai yuki.iwai.tz@gmail.com

I changed the gang-scheduling logic so that the mpi-operator respects SchedulingPolicy when creating a PodGroup.
Mainly, I modified the following:

  1. When setting a priorityClass on the PodGroup, "SchedulingPolicy.PriorityClass" is preferred over the priorityClassName in "ReplicaSpecs.PodTemplateSpec".
  2. When setting a queueName on the PodGroup, "SchedulingPolicy.Queue" is preferred over the annotation "scheduling.volcano.sh/queue-name" (both sketched below).
  3. Set "PodGroupSpec.MinResources":
     a. If "SchedulingPolicy.MinAvailable" isn't empty, propagate that to the PodGroup.
     b. If PodGroupSpec.MinMember < MPIJobSpec.MPIReplicaSpecs[Worker].Replicas + 1, sort "MPIJobSpec.MPIReplicaSpecs" in descending order by priorityClass and then add container resources to "PodGroupSpec.MinResources"; however, the total number of "MPIJobSpec.MPIReplicaSpec.Replicas" added must not exceed "PodGroupSpec.MinMember".
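
For reviewers unfamiliar with the codebase, a minimal sketch of the precedence in items 1 and 2. The helper names, package name, and import paths are illustrative assumptions, not the exact code in this PR:

package controller

import (
	kubeflow "github.com/kubeflow/mpi-operator/pkg/apis/kubeflow/v2beta1"
	podgroupv1beta1 "volcano.sh/apis/pkg/apis/scheduling/v1beta1"
)

// calcPriorityClassName prefers SchedulingPolicy.PriorityClass and falls back to
// the priorityClassName on the launcher's, then the worker's, pod template.
func calcPriorityClassName(mpiJob *kubeflow.MPIJob) string {
	if sp := mpiJob.Spec.RunPolicy.SchedulingPolicy; sp != nil && len(sp.PriorityClass) != 0 {
		return sp.PriorityClass
	}
	if l := mpiJob.Spec.MPIReplicaSpecs[kubeflow.MPIReplicaTypeLauncher]; l != nil && len(l.Template.Spec.PriorityClassName) != 0 {
		return l.Template.Spec.PriorityClassName
	}
	if w := mpiJob.Spec.MPIReplicaSpecs[kubeflow.MPIReplicaTypeWorker]; w != nil {
		return w.Template.Spec.PriorityClassName
	}
	return ""
}

// calcQueueName prefers SchedulingPolicy.Queue and falls back to the
// scheduling.volcano.sh/queue-name annotation.
func calcQueueName(mpiJob *kubeflow.MPIJob) string {
	if sp := mpiJob.Spec.RunPolicy.SchedulingPolicy; sp != nil && len(sp.Queue) != 0 {
		return sp.Queue
	}
	return mpiJob.Annotations[podgroupv1beta1.QueueNameAnnotationKey]
}

Falling back to the pod-template values keeps existing MPIJobs that don't set a SchedulingPolicy working as before.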

Fixes: #518

/assign @alculquicondor

@tenzen-y tenzen-y force-pushed the respect-scheduling-policy branch 4 times, most recently from c8045ae to 2bc2b65 on February 9, 2023 at 06:57
@alculquicondor
Collaborator

Can you summarize the changes for somebody that isn't familiar with volcano? 😅

@tenzen-y
Member Author

tenzen-y commented Feb 9, 2023

Can you summarize the changes for somebody that isn't familiar with volcano? 😅

Sure. I'll ping you once the PR description is ready.

if ok := cache.WaitForCacheSync(stopCh, c.podgroupsSynced); !ok {
	return fmt.Errorf("failed to wait for podgroup caches to sync")
}
synced = append(synced, c.podgroupsSynced)
Collaborator

we also don't need the priorityClass data unless gangSchedulerName is enabled?

Member Author

Yes, that's right.
It would be good to sync priorityClass only when gangSchedulerName is set, the same as we do for podgroups.

Member Author

I will change the above logic.

Collaborator

still not changed

Member Author

Done.

}
}
if minResources == nil {
minResources = c.calcPGMinResources(minMember, mpiJob.Spec.MPIReplicaSpecs)
Collaborator

Do we need this? We were not calculating this before, so it wasn't working?

Member Author

Yes. Previously, we could not use minResources since we didn't calculate or set it:

// newPodGroup creates a new PodGroup for an MPIJob
// resource. It also sets the appropriate OwnerReferences on the resource so
// handleObject can discover the MPIJob resource that 'owns' it.
func newPodGroup(mpiJob *kubeflow.MPIJob, minAvailableReplicas int32) *podgroupv1beta1.PodGroup {
	var pName string
	if l := mpiJob.Spec.MPIReplicaSpecs[kubeflow.MPIReplicaTypeLauncher]; l != nil {
		pName = l.Template.Spec.PriorityClassName
		if w := mpiJob.Spec.MPIReplicaSpecs[kubeflow.MPIReplicaTypeWorker]; pName == "" && w != nil {
			pName = w.Template.Spec.PriorityClassName
		}
	}
	return &podgroupv1beta1.PodGroup{
		ObjectMeta: metav1.ObjectMeta{
			Name:      mpiJob.Name,
			Namespace: mpiJob.Namespace,
			OwnerReferences: []metav1.OwnerReference{
				*metav1.NewControllerRef(mpiJob, kubeflow.SchemeGroupVersionKind),
			},
		},
		Spec: podgroupv1beta1.PodGroupSpec{
			MinMember:         minAvailableReplicas,
			Queue:             mpiJob.Annotations[podgroupv1beta1.QueueNameAnnotationKey],
			PriorityClassName: pName,
		},
	}
}

Member Author
@tenzen-y tenzen-y Feb 10, 2023

In elastic mode, if we make minResources configurable, being able to schedule only part of the Workers onto Nodes is useful when the number of Workers is huge.
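
For context, a rough sketch of the calculation described in item 3b of the PR description. The function and parameter names, the priorityValueFor lookup, and the import paths are assumptions for illustration, and the PR later drops this calculation (see below):

package controller

import (
	"sort"

	corev1 "k8s.io/api/core/v1"

	kubeflow "github.com/kubeflow/mpi-operator/pkg/apis/kubeflow/v2beta1"
)

// calcPGMinResources orders the replica types by the value of their
// PriorityClass (highest first) and accumulates the container requests of up to
// minMember pods, so MinResources only covers the pods volcano must be able to
// place. priorityValueFor is a hypothetical lookup, e.g. backed by a
// PriorityClass lister.
func calcPGMinResources(
	mpiJob *kubeflow.MPIJob,
	minMember int32,
	priorityValueFor func(priorityClassName string) int32,
) *corev1.ResourceList {
	type group struct {
		priority int32
		count    int32
		podSpec  corev1.PodSpec
	}
	var groups []group
	for _, rs := range mpiJob.Spec.MPIReplicaSpecs {
		if rs == nil {
			continue
		}
		count := int32(1)
		if rs.Replicas != nil {
			count = *rs.Replicas
		}
		groups = append(groups, group{
			priority: priorityValueFor(rs.Template.Spec.PriorityClassName),
			count:    count,
			podSpec:  rs.Template.Spec,
		})
	}
	// Highest priority first, so the most important pods are covered first.
	sort.Slice(groups, func(i, j int) bool { return groups[i].priority > groups[j].priority })

	minResources := corev1.ResourceList{}
	remaining := minMember
	for _, g := range groups {
		if g.count > remaining {
			g.count = remaining
		}
		for i := int32(0); i < g.count; i++ {
			for _, c := range g.podSpec.Containers {
				for name, q := range c.Resources.Requests {
					acc := minResources[name]
					acc.Add(q)
					minResources[name] = acc
				}
			}
		}
		remaining -= g.count
		if remaining == 0 {
			break
		}
	}
	return &minResources
}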

var minResources *corev1.ResourceList
if schedulingPolicy := mpiJob.Spec.RunPolicy.SchedulingPolicy; schedulingPolicy != nil {
	if schedulingPolicy.MinAvailable != nil {
		minMember = *schedulingPolicy.MinAvailable
Collaborator

what if volcano schedules a number of pods that is less than workerReplicas(mpiJob)?
The mpirun command might get stuck waiting for all the workers (unless it's elastic Horovod, which has native support for changing the number of workers).

We need a way to restrict the usage of this field or have clear documentation that this is not supported for all kinds of MPI applications.

Member Author

what if volcano schedules a number of pods that is less than workerReplicas(mpiJob)?
The mpirun command might get stuck waiting for all the workers (unless it's elastic Horovod, which has native support for changing the number of workers).

In that case, volcano schedules all Pods. If the cluster does not have enough resources for all Pods, some pods will be marked as Pending.

We need a way to restrict the usage of this field or have clear documentation that this is not supported for all kinds of MPI applications.

I agree. Since we don't have a field that indicates whether the MPIJob uses elastic Horovod, I think we have 2 options:

  1. As you say, have clear documentation.
  2. Add a new member, 'Elastic bool json:elastic,omitempty', to RunPolicy to represent whether the MPIJob uses elastic mode, and then add validations for minMember and minResources.

I like the second option. What do you think? @alculquicondor

Collaborator

Are there similar fields in the training operator?

Member Author

No, most custom-job controllers have no field indicating whether the CustomJob uses elastic mode, which means users can set any value for schedulingPolicy.MinMember. The pytorchjob-controller has ElasticPolicy, although that field doesn't affect schedulingPolicy.

https://github.com/kubeflow/training-operator/blob/6ce98387a1a68b79365495a6344f7f19b48557d3/pkg/apis/kubeflow.org/v1/pytorch_types.go#L79

Collaborator

in that case, I think we should proceed with option 1 for now.

Collaborator

That said, there is currently no documentation about SchedulingPolicy overall :)

Member Author

in that case, I think we should proceed with option 1 for now.

I agree. Maybe we can add documentation to https://www.kubeflow.org/docs/components/training/job-scheduling/.

That said, there is currently no documentation about SchedulingPolicy overall

You're right...

Collaborator

another thought: even if we add an "elastic" field, it doesn't mean that the application will support it. So we need to clarify it in the documentation anyways.

Member Author

So we need to clarify it in the documentation anyways.

I agree.

We can probably discuss whether to add an elastic field elsewhere (in another issue or PR).
In this PR, we can just pass SchedulingPolicy.MinAvailable to PodGroup.Spec.MinMember and then add docs about that.
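
A minimal sketch of that, assuming a helper of this shape (the name and import path are illustrative, not the PR's actual code):

package controller

import (
	kubeflow "github.com/kubeflow/mpi-operator/pkg/apis/kubeflow/v2beta1"
)

// calcMinMember defaults the PodGroup's MinMember to the worker count plus one
// (the launcher) and lets SchedulingPolicy.MinAvailable override it.
func calcMinMember(mpiJob *kubeflow.MPIJob, workerReplicas int32) int32 {
	minMember := workerReplicas + 1
	if sp := mpiJob.Spec.RunPolicy.SchedulingPolicy; sp != nil && sp.MinAvailable != nil {
		minMember = *sp.MinAvailable
	}
	return minMember
}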

return ""
}

func (c *MPIJobController) calcPGMinResources(
Collaborator

add a comment for what's happening here

Member Author

Sure.

Collaborator

Still waiting.

What does volcano use that field for? If it's all added up, how does it know whether the resources are consumed by one or multiple pods?

Member Author
@tenzen-y tenzen-y Feb 13, 2023

What does volcano use that field for? If it's all added up, how does it know whether the resources are consumed by one or multiple pods?

I tried to dive into volcano. Volcano has a Queue resource, similar to ClusterQueue + LocalQueue in Kueue.
If the Queue doesn't have enough resources for the minResources requested by the PodGroup, the volcano-scheduler doesn't schedule the Pods to Nodes.

Does that make sense?

My understanding might be incomplete. I will try to investigate volcano more.
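
To make that concrete, an illustrative PodGroup showing how Queue and MinResources interact: volcano admits the group, and therefore its pods, only when the named Queue has at least MinResources to spare. The queue name and quantities below are made up, not output from the controller:

package controller

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	podgroupv1beta1 "volcano.sh/apis/pkg/apis/scheduling/v1beta1"
)

// examplePodGroup is illustrative only: the rough shape of the PodGroup the
// controller creates for an MPIJob when gang-scheduling with volcano.
func examplePodGroup() *podgroupv1beta1.PodGroup {
	return &podgroupv1beta1.PodGroup{
		ObjectMeta: metav1.ObjectMeta{Name: "pi", Namespace: "default"},
		Spec: podgroupv1beta1.PodGroupSpec{
			MinMember:         3, // launcher + 2 workers
			Queue:             "team-a",
			PriorityClassName: "high-priority",
			MinResources: &corev1.ResourceList{
				corev1.ResourceCPU:    resource.MustParse("6"),
				corev1.ResourceMemory: resource.MustParse("12Gi"),
			},
		},
	}
}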

Collaborator

Makes sense... it looks like a quota system, just like Kueue. I wonder if they would be interested in reviewing this PR. Do you know anybody?

Member Author

Maybe, @shinytang6 is also interested in this PR.
cc: @shinytang6

Member Author
@tenzen-y tenzen-y Feb 17, 2023

@alculquicondor They don't seem to have enough bandwidth to review this PR.
So, I'm thinking of removing func calcPGMinResources and then passing SchedulingPolicy.MinResources to PodGroup.Spec.MinResources.

Also, it would be good to create an issue to keep track of this.

WDYT?
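
A minimal sketch of that passthrough, assuming SchedulingPolicy.MinResources shares the *corev1.ResourceList type used in the controller snippet earlier in this thread (the helper name and import paths are illustrative):

package controller

import (
	corev1 "k8s.io/api/core/v1"

	kubeflow "github.com/kubeflow/mpi-operator/pkg/apis/kubeflow/v2beta1"
)

// minResourcesFor forwards SchedulingPolicy.MinResources unchanged and returns
// nil when the user did not set it, instead of computing it from the replicas.
func minResourcesFor(mpiJob *kubeflow.MPIJob) *corev1.ResourceList {
	if sp := mpiJob.Spec.RunPolicy.SchedulingPolicy; sp != nil {
		return sp.MinResources
	}
	return nil
}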

Collaborator

SGTM, we don't want to introduce functionality we aren't sure about.

Member Author

we don't want to introduce functionality we aren't sure about.

I agree.

Member Author

Done.

@tenzen-y
Member Author

@alculquicondor Updated PR description.

if ok := cache.WaitForCacheSync(stopCh, c.podgroupsSynced); !ok {
	return fmt.Errorf("failed to wait for podgroup caches to sync")
}
synced = append(synced, c.podgroupsSynced)
Collaborator

still not changed


// newPodGroup creates a new PodGroup for an MPIJob
// resource. It also sets the appropriate OwnerReferences on the resource so
// handleObject can discover the MPIJob resource that 'owns' it.
Collaborator

explain how minResources is calculated.

What does volcano use that field for? If it's all added up, how does it know whether the resources are consumed by one or multiple pods?

Member Author

explain how minResources is calculated.

Sure.

What does volcano use that field for? If it's all added up, how does it know whether the resources are consumed by one or multiple pods?

I will answer in another thread.

Collaborator

add in the comment the fields that this function populates and how

Member Author

Done.

return ""
}

func (c *MPIJobController) calcPGMinResources(
Collaborator

Still waiting.

What does volcano use that field for? If it's all added up, how does it know whether the resources are consumed by one or multiple pods?

@tenzen-y tenzen-y changed the title Respect SchedulingPolicy WIP: Respect SchedulingPolicy Feb 17, 2023
@tenzen-y tenzen-y force-pushed the respect-scheduling-policy branch 2 times, most recently from 792c859 to 37e464d on February 17, 2023 at 20:52
@tenzen-y tenzen-y changed the title WIP: Respect SchedulingPolicy Respect SchedulingPolicy Feb 18, 2023
@tenzen-y
Member Author

@alculquicondor I have addressed your comments. Please take another look.

@tenzen-y tenzen-y force-pushed the respect-scheduling-policy branch 2 times, most recently from 299b426 to 0eebc9a on February 18, 2023 at 12:19
@@ -46,6 +46,7 @@ import (

	"github.com/kubeflow/mpi-operator/cmd/mpi-operator/app/options"
	mpijobclientset "github.com/kubeflow/mpi-operator/pkg/client/clientset/versioned"
	kubeflowScheme "github.com/kubeflow/mpi-operator/pkg/client/clientset/versioned/scheme"
Collaborator

Suggested change
-	kubeflowScheme "github.com/kubeflow/mpi-operator/pkg/client/clientset/versioned/scheme"
+	kubeflowscheme "github.com/kubeflow/mpi-operator/pkg/client/clientset/versioned/scheme"

Member Author

Done.

ScheduleTimeoutSeconds *int32 `json:"scheduleTimeoutSeconds,omitempty"`
// MinAvailable defines the minimal number of member to run the PodGroup.
// If the gang-scheduling is set to the volcano,
// input is passed to `.spec.minMember` in PodGroup for the volcano.
Collaborator

Explain that this will only work if the MPI application supports resizing.

Member Author

Sounds good.

Member Author

Done.

Collaborator

If not set, it defaults to the number of workers

Comment on lines 381 to 383
	AddFunc:    controller.handleObject,
	UpdateFunc: controller.handleObjectUpdate,
	DeleteFunc: controller.handleObject,
Collaborator

Does it make sense to add an event handler for priorityClass? They are not owned by the MPIJob.

Member Author

You're right.

Member Author

Done.


// newPodGroup creates a new PodGroup for an MPIJob
// resource. It also sets the appropriate OwnerReferences on the resource so
// handleObject can discover the MPIJob resource that 'owns' it.
Collaborator

add in the comment the fields that this function populates and how

	schedulingPolicy *kubeflow.SchedulingPolicy,
) string {
	if schedulingPolicy != nil && len(schedulingPolicy.PriorityClass) != 0 &&
		c.priorityClassExist(schedulingPolicy.PriorityClass) {
Collaborator

Should MPIJob care if it exists? I think it's up to Volcano to decide what to do.

Member Author

Makes sense.

Member Author

We can leave it to volcano.

Collaborator

Then let's clean up the informer.

Member Author

Sure.

Member Author

Done.

@tenzen-y tenzen-y force-pushed the respect-scheduling-policy branch 3 times, most recently from 35b124c to 55a104c on February 21, 2023 at 20:07
@google-oss-prow google-oss-prow bot added size/L and removed size/XL labels Feb 21, 2023
@tenzen-y
Member Author

@alculquicondor Updated. PTAL.

// MinAvailable defines the minimal number of member to run the PodGroup.
// If the gang-scheduling is set to the volcano,
// input is passed to `.spec.minMember` in PodGroup for the volcano.
// Also, this parameter will function properly only when we use elastic training (e.g., Elastic Horovod).
Collaborator

Suggested change
- // Also, this parameter will function properly only when we use elastic training (e.g., Elastic Horovod).
+ // When using this field, you need to make sure the application supports resizing (e.g., Elastic Horovod).

ScheduleTimeoutSeconds *int32 `json:"scheduleTimeoutSeconds,omitempty"`
// MinAvailable defines the minimal number of member to run the PodGroup.
// If the gang-scheduling is set to the volcano,
// input is passed to `.spec.minMember` in PodGroup for the volcano.
Collaborator

If not set, it defaults to the number of workers

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
@tenzen-y
Member Author

@alculquicondor I addressed your suggestions and squashed commits into one.

@tenzen-y
Member Author

Also, I will add docs for the schedulingPolicy to https://www.kubeflow.org/.

@alculquicondor
Collaborator

/lgtm

@tenzen-y
Member Author

@alculquicondor Created kubeflow/website#3453.

@tenzen-y
Member Author

@alculquicondor Is anything blocking the merge?

@alculquicondor
Collaborator

oops, no, I just forgot to approve
/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
