
Plans for mpi-operator in Kubeflow 0.5? #66

Closed
jlewi opened this issue Jan 7, 2019 · 13 comments

Comments

@jlewi
Contributor

jlewi commented Jan 7, 2019

What are the plans for the mpi-operator in 0.5?

/cc @everpeace @rongou

@jlewi jlewi changed the title Plans for mpi-operator in 0.5? Plans for mpi-operator in Kubeflow 0.5? Jan 7, 2019
@rongou
Member

rongou commented Jan 7, 2019 via email

@k82cn

k82cn commented Jan 8, 2019

@rongou, are you going to build a new implementation of gang scheduling, or leverage kube-batch?

@rongou
Member

rongou commented Jan 8, 2019

This work is mostly done by the Nvidia GPU Cloud (NGC) team. They have an internal scheduler, but they are also looking at kube-batch, so I guess it's still to be determined.
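For context on what leveraging kube-batch would mean: gang scheduling there is typically wired up by creating a PodGroup with a minimum member count and pointing the job's pods at the kube-batch scheduler, so that no pod of the job starts until all of them can be placed. A minimal sketch (the API group, annotation name, and all object names here are assumptions from how kube-batch looked around this time, not part of the mpi-operator):

```yaml
# PodGroup: schedule all members of the job, or none of them.
apiVersion: scheduling.incubator.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: my-mpi-job              # hypothetical job name
spec:
  minMember: 3                  # e.g. launcher + 2 workers
---
# Each pod of the job opts in to kube-batch and names its group.
apiVersion: v1
kind: Pod
metadata:
  name: my-mpi-job-worker-0     # hypothetical pod name
  annotations:
    scheduling.k8s.io/group-name: my-mpi-job   # ties the pod to the PodGroup
spec:
  schedulerName: kube-batch     # bypass the default scheduler
  containers:
  - name: worker
    image: mpi-worker:latest    # hypothetical image
```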

@k82cn

k82cn commented Jan 9, 2019

Got it; if there's anything I can help with, please let me know :)

@k82cn

k82cn commented Jan 10, 2019

> They have an internal scheduler,

BTW, if the scheduler is internal, how can others use it?

@terrytangyuan
Member

terrytangyuan commented Jan 10, 2019

Not sure if these will fit in Kubeflow 0.5 but just want to add a couple related issues for discussion here:

@everpeace

Some users without GPUs would probably like to configure a custom slots= clause in the hostfile:

https://github.com/kubeflow/mpi-operator/blob/master/pkg/controllers/mpi_job_controller.go#L743-L747
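For readers unfamiliar with the clause in question: an Open MPI hostfile lists one host per line, with a slots= entry controlling how many ranks land on that host. The linked controller code derives the slot count from the GPU count, which leaves CPU-only users without a knob. A sketch of what the generated file looks like (pod names hypothetical):

```
# hostfile generated by the controller (names hypothetical)
my-mpi-job-worker-0 slots=4
my-mpi-job-worker-1 slots=4
```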

@rongou
Member

rongou commented Jan 11, 2019

@k82cn they have plans to eventually open source it, but for now it's on NGC only.

@k82cn

k82cn commented Jan 11, 2019

> they have plans to eventually open source it, but for now it's on NGC only.

Got it :)

@terrytangyuan
Member

terrytangyuan commented Jan 11, 2019

@everpeace Thanks. That's good to know. Let's continue non-GPU specific discussion on the PR #75.

@cheyang
Contributor

cheyang commented Jan 19, 2019

@rongou, are there any plans to update the API spec? For example, the launcher and workers currently share the same pod spec. I suggest using a separate role spec for the launcher and the workers, for two reasons:

  1. The launcher also needs resource requests/limits, but they should not equal the workers' requests, because the launcher does not participate in the computation in the current design.

  2. To save compute resources, some users want the launcher to also participate in the computation as worker 0. If the launcher and workers have different specs, it is easy to decide whether the launcher joins the computation.

Thanks.
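One possible shape for such a split is per-role replica specs on the MPIJob, each with its own pod template and resources. A sketch only (field names and values here are illustrative, not the final API; see the discussion referenced below for what was actually adopted):

```yaml
apiVersion: kubeflow.org/v1alpha2        # illustrative version
kind: MPIJob
metadata:
  name: tensorflow-benchmarks            # hypothetical job name
spec:
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: mpi-bench:latest      # hypothetical image
            resources:
              limits:
                cpu: "1"                 # small: launcher only runs mpirun
                memory: 2Gi
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: worker
            image: mpi-bench:latest
            resources:
              limits:
                nvidia.com/gpu: 2        # workers do the actual compute
```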

@rongou
Member

rongou commented Jan 21, 2019

@cheyang see #54.

@cheyang
Contributor

cheyang commented Jan 21, 2019

Thanks, got it!
