--enable-gang-scheduling=true doesn't work for MPIJob #1548

Closed
cheimu opened this issue Mar 15, 2022 · 1 comment · Fixed by #1557

Comments

@cheimu
Member

cheimu commented Mar 15, 2022

Hi training-operator experts.

When I set --enable-gang-scheduling=true for training-operator, it works for TFJob, PyTorchJob, etc., but it doesn't work for MPIJob.

2022-03-15T03:30:18.191Z        DEBUG   controller-runtime.manager.events       
Warning {"object": {"kind":"MPIJob","namespace":"pytorch-job-test","name":"transformer-glue-4node","uid":"593289df-28b6-4f78-aee3-d6d6a3728cce","apiVersion":"kubeflow.org/v1","resourceVersion":"728417214"}, 
"reason": "SettedPodTemplateSchedulerName", "message": "Another scheduler is specified when gang-scheduling is enabled and it will not be overwritten"}

However, the manifest that I submitted doesn't set any SchedulerName.

My training-operator version is [training-operator:6c115f6e00e3f2c979c6aa4bf2d93906a646b99d]. Could anyone give a hint or help? Thank you in advance!
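
For anyone checking the same thing: the warning is about a scheduler name on the pod template, so one way to confirm that a submitted manifest really sets none is to walk its replica specs. Below is a minimal sketch, assuming the manifest has already been loaded into a Python dict. The helper name `replicas_with_scheduler_name` is hypothetical, and the default gang scheduler name `"volcano"` is an assumption, not something confirmed in this thread. It only inspects the manifest as submitted, so it would not catch a schedulerName added later by a webhook or by the operator itself.

```python
from typing import Any, Dict, List


def replicas_with_scheduler_name(
    job: Dict[str, Any], gang_scheduler: str = "volcano"
) -> List[str]:
    """List replica types whose pod template names a scheduler other than gang_scheduler.

    gang_scheduler defaults to "volcano", which is an assumption for illustration.
    """
    found = []
    for key, value in (job.get("spec") or {}).items():
        # Replica maps are keyed like mpiReplicaSpecs, tfReplicaSpecs, pytorchReplicaSpecs.
        if not key.endswith("ReplicaSpecs") or not isinstance(value, dict):
            continue
        for replica_type, replica in value.items():
            template = (replica or {}).get("template") or {}
            pod_spec = template.get("spec") or {}
            name = pod_spec.get("schedulerName", "")
            if name and name != gang_scheduler:
                found.append(f"{replica_type}: {name}")
    return found


# Illustrative manifest only; it is not the one from this issue.
mpijob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "MPIJob",
    "spec": {
        "slotsPerWorker": 1,
        "mpiReplicaSpecs": {
            "Launcher": {"replicas": 1, "template": {"spec": {"containers": []}}},
            "Worker": {
                "replicas": 4,
                "template": {
                    "spec": {"schedulerName": "default-scheduler", "containers": []}
                },
            },
        },
    },
}

print(replicas_with_scheduler_name(mpijob))
```

An empty list means no replica in the submitted manifest names a different scheduler, which would point toward the field being set somewhere after submission.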

@zw0610
Member

zw0610 commented Mar 16, 2022

Hi, I double-checked with the latest master branch (1f3f29a0bbac00e0410b8fc9b0f32cdf710dda30) and was not able to reproduce the issue. After setting --enable-gang-scheduling=true, I submitted the MPIJob from the example directory and observed the PodGroup being created.
[Screenshot: Screen Shot 2022-03-16 at 20 54 22]
