
Plans for mpi-operator in Kubeflow 0.5? #66

Closed
jlewi opened this issue Jan 7, 2019 · 13 comments

Comments

@jlewi
Contributor

jlewi commented Jan 7, 2019

What are the plans for the mpi-operator in 0.5?

/cc @everpeace @rongou

@jlewi jlewi changed the title Plans for mpi-operator in 0.5? Plans for mpi-operator in Kubeflow 0.5? Jan 7, 2019
@rongou
Member

rongou commented Jan 7, 2019 via email

@k82cn

k82cn commented Jan 8, 2019

@rongou, are you going to build a new implementation of gang scheduling, or leverage kube-batch?

@rongou
Member

rongou commented Jan 8, 2019

This work is mostly done by the Nvidia GPU Cloud (NGC) team. They have an internal scheduler, but they are also looking at kube-batch, so I guess it's still to be determined.
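For context on what leveraging kube-batch would mean: gang scheduling there is typically wired up by creating a PodGroup with a minimum member count and pointing the job's pods at the kube-batch scheduler, so that no pod of the job starts until all of them can be placed. A minimal sketch (the API group, annotation name, and all object names here are assumptions from how kube-batch looked around this time, not part of the mpi-operator):

```yaml
# PodGroup: schedule all members of the job, or none of them.
apiVersion: scheduling.incubator.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: my-mpi-job              # hypothetical job name
spec:
  minMember: 3                  # e.g. launcher + 2 workers
---
# Each pod of the job opts in to kube-batch and names its group.
apiVersion: v1
kind: Pod
metadata:
  name: my-mpi-job-worker-0     # hypothetical pod name
  annotations:
    scheduling.k8s.io/group-name: my-mpi-job   # ties the pod to the PodGroup
spec:
  schedulerName: kube-batch     # bypass the default scheduler
  containers:
  - name: worker
    image: mpi-worker:latest    # hypothetical image
```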

@k82cn

k82cn commented Jan 9, 2019

Got it; if there's anything I can help with, please let me know :)

@k82cn

k82cn commented Jan 10, 2019

> They have an internal scheduler,

BTW, if the scheduler is internal, how can others use it?

@terrytangyuan
Member

terrytangyuan commented Jan 10, 2019

Not sure if these will fit in Kubeflow 0.5 but just want to add a couple related issues for discussion here:

@everpeace

Some users without GPUs would probably like to configure a custom slots= clause in the hostfile:

https://github.com/kubeflow/mpi-operator/blob/master/pkg/controllers/mpi_job_controller.go#L743-L747
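For readers unfamiliar with the clause in question: an Open MPI hostfile lists one host per line, with a slots= entry controlling how many ranks land on that host. The linked controller code derives the slot count from the GPU count, which leaves CPU-only users without a knob. A sketch of what the generated file looks like (pod names hypothetical):

```
# hostfile generated by the controller (names hypothetical)
my-mpi-job-worker-0 slots=4
my-mpi-job-worker-1 slots=4
```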

@rongou
Member

rongou commented Jan 11, 2019

@k82cn they have plans to eventually open source it, but for now it's on NGC only.

@k82cn

k82cn commented Jan 11, 2019

> they have plans to eventually open source it, but for now it's on NGC only.

Got it :)

@terrytangyuan
Member

terrytangyuan commented Jan 11, 2019

@everpeace Thanks. That's good to know. Let's continue non-GPU specific discussion on the PR #75.

@cheyang
Contributor

cheyang commented Jan 19, 2019

@rongou, are there any plans to update the API spec? For example, the launcher and workers currently share the same pod spec. I suggest using a separate role spec for the launcher and the workers, for two reasons:

  1. The launcher also needs resource requests/limits, but they should not equal the workers' requests, because the launcher does not participate in the computation in the current design.

  2. To save compute resources, some users want the launcher to also participate in the computation as worker 0. If the launcher and workers have different specs, it is easy to decide whether the launcher joins the computation.

Thanks.
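One possible shape for such a split is per-role replica specs on the MPIJob, each with its own pod template and resources. A sketch only (field names and values here are illustrative, not the final API; see the discussion referenced below for what was actually adopted):

```yaml
apiVersion: kubeflow.org/v1alpha2        # illustrative version
kind: MPIJob
metadata:
  name: tensorflow-benchmarks            # hypothetical job name
spec:
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: mpi-bench:latest      # hypothetical image
            resources:
              limits:
                cpu: "1"                 # small: launcher only runs mpirun
                memory: 2Gi
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: worker
            image: mpi-bench:latest
            resources:
              limits:
                nvidia.com/gpu: 2        # workers do the actual compute
```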

@rongou
Member

rongou commented Jan 21, 2019

@cheyang see #54.

@cheyang
Contributor

cheyang commented Jan 21, 2019

Thanks, got it!
