
[discussion] specify total GPU count for distributed training #384

Closed
pineking opened this issue Feb 11, 2018 · 6 comments

@pineking
Member

I am not sure whether this can be discussed here.

Suppose we have a k8s cluster with 5 nodes, each with 8 GPUs, so 40 GPUs in total, and a user wants to start a distributed training job that uses 20 GPUs.

What we expect:
the user just specifies the number 20 and does not need to split the GPU request across pods manually; a controller (or something similar) does the split automatically according to the cluster's currently free GPU resources,
e.g. 20 = 8 + 8 + 2 + 2.
At the same time, when the training ends, all the pods can be deleted by this controller.
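
Roughly what I have in mind, as a sketch only; the function and node names below are hypothetical, not part of any existing operator:

```python
def split_gpu_request(total_gpus, free_gpus_per_node):
    """Greedily pack a total GPU request onto the nodes that have free GPUs.

    free_gpus_per_node: mapping of node name -> free GPU count.
    Returns a list of (node, gpu_count) pairs, one entry per worker pod.
    """
    remaining = total_gpus
    plan = []
    # Prefer the nodes with the most free GPUs to keep the number of pods small.
    for node, free in sorted(free_gpus_per_node.items(), key=lambda kv: -kv[1]):
        if remaining == 0:
            break
        take = min(free, remaining)
        if take > 0:
            plan.append((node, take))
            remaining -= take
    if remaining > 0:
        raise RuntimeError("not enough free GPUs in the cluster")
    return plan

# 20 GPUs on a cluster whose free GPUs are 8, 8, 2, 2, 0 per node:
# [('node-1', 8), ('node-2', 8), ('node-3', 2), ('node-4', 2)]
print(split_gpu_request(20, {"node-1": 8, "node-2": 8, "node-3": 2, "node-4": 2, "node-5": 0}))
```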

Does tensorflow/k8s or another operator already have this functionality?

@gaocegege
Member

gaocegege commented Feb 11, 2018

Thanks for your issue.

I am not sure if I understand the idea. Do you mean that the operator should support assigning GPUs to PS and workers automatically?

@pineking
Member Author

Kubernetes can already assign GPUs to a pod/worker automatically if the `nvidia.com/gpu` limit is specified in the pod YAML file. But for distributed training this is neither easy nor user-friendly: we have to create each worker/pod and set the GPU count for each pod separately.

What I mean is that, for distributed training,
the user should not need to know how many workers (pods) run the training, nor the `nvidia.com/gpu` limit for each worker/pod.
The only thing they need to set is the total GPU count used across all workers.

I think that would make it much easier for the user to create a distributed training job.
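
To make the contrast concrete, here is a rough illustration of the fields involved (plain dicts standing in for the relevant YAML, not a real API): today the user spells out the split per worker, while what I am asking for is a single total.

```python
# Today: the user splits the request manually and repeats it for every worker pod.
workers_today = [
    {"name": "worker-0", "resources": {"limits": {"nvidia.com/gpu": 8}}},
    {"name": "worker-1", "resources": {"limits": {"nvidia.com/gpu": 8}}},
    {"name": "worker-2", "resources": {"limits": {"nvidia.com/gpu": 2}}},
    {"name": "worker-3", "resources": {"limits": {"nvidia.com/gpu": 2}}},
]

# What I would like: one knob, with the per-pod split done by a controller.
desired_request = {"total_gpus": 20}
```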

@gaocegege
Member

I do not think we should hide the PS/worker split from users at the operator level. Maybe we could build a config generator on top of tf-operator, which accepts the user code and the number of GPUs and generates the TFJob config.
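
As a sketch of what such a generator might look like (the split strategy is deliberately naive and the output is an abbreviated dict, not the real TFJob schema):

```python
def generate_tfjob_config(name, image, total_gpus, gpus_per_node=8):
    """Turn (user image, total GPU count) into an abbreviated TFJob-like config.

    Split strategy: fill whole nodes first, then put the remainder in one last worker.
    """
    counts = [gpus_per_node] * (total_gpus // gpus_per_node)
    if total_gpus % gpus_per_node:
        counts.append(total_gpus % gpus_per_node)

    workers = [
        {
            "name": "%s-worker-%d" % (name, i),
            "image": image,
            "resources": {"limits": {"nvidia.com/gpu": n}},
        }
        for i, n in enumerate(counts)
    ]
    # Abbreviated on purpose: a real TFJob would also describe PS replicas,
    # restart policy, volumes, and so on.
    return {"kind": "TFJob", "metadata": {"name": name}, "spec": {"workers": workers}}

# generate_tfjob_config("mnist", "user/mnist:latest", 20) -> workers with 8 + 8 + 4 GPUs
```

The point is only that the user-facing input shrinks to an image plus one number, while the operator itself keeps its explicit PS/worker spec.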

@pineking
Member Author

I agree we can implement this on top of tf-operator.
We use https://github.com/uber/horovod in MPI mode instead of TensorFlow parameter servers; it is easier to use and faster. I have tested it with k8s, and perhaps it can also work with tf-operator in the future.
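
For context, the per-worker script changes are small; this is roughly the standard Horovod TensorFlow setup (the optimizer and learning rate below are placeholders, and the hooks would go to a MonitoredTrainingSession):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

# One MPI process per GPU; Horovod averages gradients with ring-allreduce,
# so no parameter servers are needed.
hvd.init()

# Pin each process to a single GPU on its node.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Scale the learning rate by the number of workers and wrap the optimizer.
opt = tf.train.MomentumOptimizer(0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)

# Broadcast the initial variables from rank 0 so every worker starts identically;
# pass these hooks (plus the session config) to tf.train.MonitoredTrainingSession.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
```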

@jlewi
Contributor

jlewi commented Feb 11, 2018

Horovod looks promising. Is there something we could build to make it easier to use on K8s?

@jlewi
Contributor

jlewi commented Feb 4, 2019

I'm going to close this issue out because of lack of activity.

jlewi closed this as completed on Feb 4, 2019