
Feature/support pytorchjob set queue of volcano #1415

Merged

Conversation

qiankunli
Contributor

I want to set the queue of the PodGroup created by a PyTorchJob, but there is no SchedulingPolicy in the PyTorchJob struct, so I tried setting the queue name in the scheduling.volcano.sh/queue-name annotation of the PyTorchJob.

It is related to issue #1414.
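
For reference, a minimal sketch of that annotation-based approach, assuming a hypothetical helper in the PyTorch controller package (the helper name is mine, not from this PR; commonv1 is github.com/kubeflow/common/pkg/apis/common/v1):

	package pytorch

	import commonv1 "github.com/kubeflow/common/pkg/apis/common/v1"

	// Annotation named in the description above.
	const volcanoQueueAnnotation = "scheduling.volcano.sh/queue-name"

	// schedulingPolicyFromAnnotations (hypothetical name) reads the Volcano
	// queue from the job's annotations and exposes it as a SchedulingPolicy
	// so the common controller can set it on the generated PodGroup.
	func schedulingPolicyFromAnnotations(annotations map[string]string) *commonv1.SchedulingPolicy {
		if queue, ok := annotations[volcanoQueueAnnotation]; ok && queue != "" {
			return &commonv1.SchedulingPolicy{Queue: queue}
		}
		return nil
	}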

@google-cla

google-cla bot commented Sep 22, 2021

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.



@aws-kf-ci-bot
Contributor

Hi @qiankunli. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@qiankunli
Contributor Author

@googlebot I signed it!

@@ -155,13 +156,18 @@ func (r *PyTorchJobReconciler) Reconcile(ctx context.Context, req ctrl.Request)
// Set default priorities to pytorch job
r.Scheme.Default(pytorchjob)

// parse volcano Queue from pytorchjob Annotation
Member

what about other jobs?

@@ -155,13 +156,18 @@ func (r *PyTorchJobReconciler) Reconcile(ctx context.Context, req ctrl.Request)
// Set default priorities to pytorch job
r.Scheme.Default(pytorchjob)

// parse volcano Queue from pytorchjob Annotation
schedulingPolicy := &commonv1.SchedulingPolicy{
Member

Since the PyTorch spec embeds RunPolicy, can we get the scheduling policy directly from pytorchjob.Spec.RunPolicy.SchedulingPolicy?
@qiankunli

Member

We should use pytorchjob.Spec.RunPolicy as the argument to reconcile the jobs.


Contributor Author

@Jeffwan

Right now SchedulingPolicy is always nil in pytorch-operator; if SchedulingPolicy is set, it is fine to use pytorchjob.Spec.RunPolicy.SchedulingPolicy directly:

	// github.com/kubeflow/tf-operator/pkg/controller.v1/pytorch/pytorchjob_controller.go
	runPolicy := &commonv1.RunPolicy{
		CleanPodPolicy:          pytorchjob.Spec.RunPolicy.CleanPodPolicy,
		TTLSecondsAfterFinished: pytorchjob.Spec.RunPolicy.TTLSecondsAfterFinished,
		ActiveDeadlineSeconds:   pytorchjob.Spec.RunPolicy.ActiveDeadlineSeconds,
		BackoffLimit:            pytorchjob.Spec.RunPolicy.BackoffLimit,
		SchedulingPolicy:        nil,
	}

	// Use common to reconcile the job related pod and service
	err = r.ReconcileJobs(pytorchjob, pytorchjob.Spec.PyTorchReplicaSpecs, pytorchjob.Status, runPolicy)

Contributor Author

@Jeffwan I updated the PR:

	runPolicy := &commonv1.RunPolicy{
		CleanPodPolicy:          pytorchjob.Spec.RunPolicy.CleanPodPolicy,
		TTLSecondsAfterFinished: pytorchjob.Spec.RunPolicy.TTLSecondsAfterFinished,
		ActiveDeadlineSeconds:   pytorchjob.Spec.RunPolicy.ActiveDeadlineSeconds,
		BackoffLimit:            pytorchjob.Spec.RunPolicy.BackoffLimit,
		SchedulingPolicy:        pytorchjob.Spec.RunPolicy.SchedulingPolicy,
	}

Member

@qiankunli

  1. Can you help update this for the MXNet job as well?
  2. Actually, since pytorchjob.Spec.RunPolicy is a commonv1.RunPolicy, we can pass &pytorchjob.Spec.RunPolicy instead of constructing a new one. See the xgboost example (a sketch of the resulting call follows the link):

https://github.com/kubeflow/tf-operator/blob/acba15e644b4c4d4fe6b68664407e4ea588d4458/pkg/controller.v1/xgboost/xgboostjob_controller.go#L176
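
A sketch of what that direct call could look like in PyTorchJobReconciler.Reconcile, assuming Spec.RunPolicy is an embedded commonv1.RunPolicy value as in the xgboost controller linked above:

	// Pass the job's own RunPolicy by address instead of copying its
	// fields into a freshly constructed commonv1.RunPolicy.
	err = r.ReconcileJobs(pytorchjob, pytorchjob.Spec.PyTorchReplicaSpecs,
		pytorchjob.Status, &pytorchjob.Spec.RunPolicy)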

Member

> Can you help update this for the MXNet job as well?

Should we make it in another PR?

@gaocegege (Member) left a comment

LGTM

Should we add some unit test cases?
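
A minimal sketch of such a test, reusing the hypothetical schedulingPolicyFromAnnotations helper sketched earlier (not code from this PR):

	package pytorch

	import "testing"

	// Checks that the queue annotation is translated into a SchedulingPolicy
	// and that a missing annotation yields nil.
	func TestSchedulingPolicyFromAnnotations(t *testing.T) {
		annotations := map[string]string{
			"scheduling.volcano.sh/queue-name": "default",
		}
		if sp := schedulingPolicyFromAnnotations(annotations); sp == nil || sp.Queue != "default" {
			t.Fatalf("expected queue %q, got %+v", "default", sp)
		}
		if sp := schedulingPolicyFromAnnotations(nil); sp != nil {
			t.Fatalf("expected nil SchedulingPolicy, got %+v", sp)
		}
	}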

@qiankunli
Contributor Author

@Jeffwan I updated the PR:

  1. Use &pytorchjob.Spec.RunPolicy directly in PyTorchJobReconciler.Reconcile.
  2. Update the MXNet job the same way as PyTorchJobReconciler.Reconcile (sketched below).
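
For item 2, a sketch of the analogous MXNet change (the MXJob field names are assumed from the common pattern, not quoted from the PR diff):

	// Same pattern in the MXNet reconciler: pass the job's RunPolicy directly.
	err = r.ReconcileJobs(mxjob, mxjob.Spec.MXReplicaSpecs, mxjob.Status, &mxjob.Spec.RunPolicy)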

@gaocegege (Member) left a comment

/ok-to-test

@Jeffwan
Member

Jeffwan commented Sep 24, 2021

/lgtm

@google-oss-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jeffwan

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@qiankunli
Contributor Author

/retest

@gaocegege
Member

/retest

@Jeffwan
Member

Jeffwan commented Sep 24, 2021

/test kubeflow-tf-operator-presubmit

@google-oss-robot merged commit 557ba80 into kubeflow:master Sep 24, 2021
Jeffwan pushed a commit to Jeffwan/tf-operator that referenced this pull request Oct 3, 2021
* support pytorch use volcano-queue

* support pytorch use volcano-queue

Signed-off-by: bert.li <qiankun.li@qq.com>

* set SchedulingPolicy for runPolicy

Signed-off-by: bert.li <qiankun.li@qq.com>

* use pytorchjob.Spec.RunPolicy directly
Jeffwan added a commit to Jeffwan/tf-operator that referenced this pull request Oct 3, 2021
* support pytorch use volcano-queue

* support pytorch use volcano-queue

Signed-off-by: bert.li <qiankun.li@qq.com>

* set SchedulingPolicy for runPolicy

Signed-off-by: bert.li <qiankun.li@qq.com>

* use pytorchjob.Spec.RunPolicy directly
google-oss-robot pushed a commit that referenced this pull request Oct 3, 2021
* Feature/support pytorchjob set queue of volcano (#1415)

* support pytorch use volcano-queue

* support pytorch use volcano-queue

Signed-off-by: bert.li <qiankun.li@qq.com>

* set SchedulingPolicy for runPolicy

Signed-off-by: bert.li <qiankun.li@qq.com>

* use pytorchjob.Spec.RunPolicy directly

* fix hyperlinks in the 'overview' section (#1418)

hyperlinks now point to the latest api reference files.
issue - #1411