
[discussion] specify total GPU count for distributed training #384

Closed
pineking opened this issue Feb 11, 2018 · 6 comments

@pineking
Member

I am not sure whether this can be discussed here.

Suppose we have a k8s cluster with 5 nodes, each with 8 GPUs, so 40 GPUs in total, and a user wants to start a distributed training job that uses 20 GPUs.

What we expect:
the user just specifies the number 20 and does not need to split the GPU request across pods manually; a controller (or something similar) does the split automatically according to the cluster's currently free GPU resources,
e.g. 20 = 8 + 8 + 2 + 2.
At the same time, when the training ends, all the pods can be deleted by this controller.
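
Roughly what I have in mind, as a sketch only; the function and node names below are hypothetical, not part of any existing operator:

```python
def split_gpu_request(total_gpus, free_gpus_per_node):
    """Greedily pack a total GPU request onto the nodes that have free GPUs.

    free_gpus_per_node: mapping of node name -> free GPU count.
    Returns a list of (node, gpu_count) pairs, one entry per worker pod.
    """
    remaining = total_gpus
    plan = []
    # Prefer the nodes with the most free GPUs to keep the number of pods small.
    for node, free in sorted(free_gpus_per_node.items(), key=lambda kv: -kv[1]):
        if remaining == 0:
            break
        take = min(free, remaining)
        if take > 0:
            plan.append((node, take))
            remaining -= take
    if remaining > 0:
        raise RuntimeError("not enough free GPUs in the cluster")
    return plan

# 20 GPUs on a cluster whose free GPUs are 8, 8, 2, 2, 0 per node:
# [('node-1', 8), ('node-2', 8), ('node-3', 2), ('node-4', 2)]
print(split_gpu_request(20, {"node-1": 8, "node-2": 8, "node-3": 2, "node-4": 2, "node-5": 0}))
```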

Does tensorflow/k8s or another operator already have this functionality?

@gaocegege
Member

gaocegege commented Feb 11, 2018

Thanks for your issue.

I am not sure if I understand the idea. Do you mean that the operator should support assigning GPUs to PS and workers automatically?

@pineking
Member Author

Kubernetes can already assign GPUs to a pod/worker automatically if the `nvidia.com/gpu` limit is specified in the pod YAML file. But for distributed training this is neither easy nor user-friendly: we have to create each worker/pod and set the GPU count for each pod separately.

What I mean is that, for distributed training,
the user should not need to know how many workers (pods) run the training, nor the `nvidia.com/gpu` limit for each worker/pod.
The only thing they need to set is the total GPU count used across all workers.

I think that would make it much easier for the user to create a distributed training job.
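
To make the contrast concrete, here is a rough illustration of the fields involved (plain dicts standing in for the relevant YAML, not a real API): today the user spells out the split per worker, while what I am asking for is a single total.

```python
# Today: the user splits the request manually and repeats it for every worker pod.
workers_today = [
    {"name": "worker-0", "resources": {"limits": {"nvidia.com/gpu": 8}}},
    {"name": "worker-1", "resources": {"limits": {"nvidia.com/gpu": 8}}},
    {"name": "worker-2", "resources": {"limits": {"nvidia.com/gpu": 2}}},
    {"name": "worker-3", "resources": {"limits": {"nvidia.com/gpu": 2}}},
]

# What I would like: one knob, with the per-pod split done by a controller.
desired_request = {"total_gpus": 20}
```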

@gaocegege
Member

I do not think we should hide the PS/worker split from users at the operator level. Maybe we could build a config generator on top of tf-operator, which accepts the user code and the number of GPUs and generates the TFJob config.
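
As a sketch of what such a generator might look like (the split strategy is deliberately naive and the output is an abbreviated dict, not the real TFJob schema):

```python
def generate_tfjob_config(name, image, total_gpus, gpus_per_node=8):
    """Turn (user image, total GPU count) into an abbreviated TFJob-like config.

    Split strategy: fill whole nodes first, then put the remainder in one last worker.
    """
    counts = [gpus_per_node] * (total_gpus // gpus_per_node)
    if total_gpus % gpus_per_node:
        counts.append(total_gpus % gpus_per_node)

    workers = [
        {
            "name": "%s-worker-%d" % (name, i),
            "image": image,
            "resources": {"limits": {"nvidia.com/gpu": n}},
        }
        for i, n in enumerate(counts)
    ]
    # Abbreviated on purpose: a real TFJob would also describe PS replicas,
    # restart policy, volumes, and so on.
    return {"kind": "TFJob", "metadata": {"name": name}, "spec": {"workers": workers}}

# generate_tfjob_config("mnist", "user/mnist:latest", 20) -> workers with 8 + 8 + 4 GPUs
```

The point is only that the user-facing input shrinks to an image plus one number, while the operator itself keeps its explicit PS/worker spec.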

@pineking
Member Author

I agree we can implement this on top of tf-operator.
We use https://github.com/uber/horovod in MPI mode instead of TensorFlow parameter servers; it is easier to use and faster. I have tested it with k8s, and perhaps it can also work with tf-operator in the future.
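
For context, the per-worker script changes are small; this is roughly the standard Horovod TensorFlow setup (the optimizer and learning rate below are placeholders, and the hooks would go to a MonitoredTrainingSession):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

# One MPI process per GPU; Horovod averages gradients with ring-allreduce,
# so no parameter servers are needed.
hvd.init()

# Pin each process to a single GPU on its node.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Scale the learning rate by the number of workers and wrap the optimizer.
opt = tf.train.MomentumOptimizer(0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)

# Broadcast the initial variables from rank 0 so every worker starts identically;
# pass these hooks (plus the session config) to tf.train.MonitoredTrainingSession.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
```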

@jlewi
Contributor

jlewi commented Feb 11, 2018

Horovod looks promising. Is there something we could build to make it easier to use on K8s?

@jlewi
Contributor

jlewi commented Feb 4, 2019

I'm going to close this issue out because of lack of activity.

jlewi closed this as completed on Feb 4, 2019