
support for tensorflow distribution #92

Closed
cloustone opened this issue Jun 9, 2018 · 7 comments

Comments

@cloustone

Could you please tell us how to run distributed TensorFlow training in FfDL?
As we know, TensorFlow distributed training uses worker tasks and parameter-server tasks. When using FfDL, do we need to specify this information explicitly to FfDL?

Thanks

@Tomcli
Contributor

Tomcli commented Jun 11, 2018

Yes, we need to define which nodes to use as parameter servers and which as workers when we run a distributed TensorFlow job.

Currently we have an example in one of our pull requests, https://github.com/fplk/FfDL/tree/merge_20180514_1536/etc/examples/tf-distributed, where we provide the launcher.py script to configure the networking and communication between the different nodes, so users need to do minimal work to enable distributed training.

We will merge this pull request soon, once we finish our code scan and reviews. Sorry for the inconvenience.
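
For reference, a distributed TF 1.x job is wired up from a cluster spec that lists the ps and worker addresses. This is only a minimal sketch with placeholder host names; in FfDL the actual wiring is handled by the launcher.py mentioned above:

```python
import tensorflow as tf

# Placeholder addresses -- in FfDL the launcher script would fill these in
# from the deployment's learner/ps endpoints.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Each replica starts a server for its own role and index.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# A parameter-server replica would instead call server.join(); workers place
# shared variables on the ps tasks via replica_device_setter.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    global_step = tf.train.get_or_create_global_step()
```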

@cloustone
Author

@Tomcli Thanks for your good work. However, according to the example at https://github.com/fplk/FfDL/tree/merge_20180514_1536/etc/examples/tf-distributed, the parameter server also takes up GPU resources. Since the learner container in the native TensorFlow framework does not distinguish between ps and worker tasks, this will be a problem. How should we deal with it?

@atinsood

atinsood commented Jun 13, 2018

@cloustone Unfortunately yes, the ps also takes a GPU resource, so in the current implementation you end up wasting a GPU running the parameter server even though it's not required.

I guess we could probably do interesting things like starting a sidecar container in these pods and having the ps run in the sidecar containers on some of the pods.

We are still thinking through the best approach to implementing this. The above is one option; another is to use the same container and expose the ps as a separate process; another is to run the ps as separate standalone pod(s). We have also thought about enhancing our job monitor component and maybe having it take on the responsibility of the ps.

As you can see, nothing is concrete here yet, since the TF distributed approach itself also seems to be evolving rapidly.

Honestly, we are thinking about the best way to capture all of this in a clean manner (see https://www.youtube.com/watch?v=bRMGoPqsn20), especially the mirrored strategy, which brings ring all-reduce natively to TF.
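
(For context, the mirrored strategy in that era was exposed under tf.contrib roughly as in the sketch below; the tiny model_fn/input_fn are placeholders I made up, not FfDL code, and the example assumes two local GPUs.)

```python
import tensorflow as tf

def model_fn(features, labels, mode):
    # Minimal placeholder model: one dense layer with squared loss.
    predictions = tf.layers.dense(features, 1)
    loss = tf.losses.mean_squared_error(labels, predictions)
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

def input_fn():
    # Tiny constant dataset, batched so each replica gets a slice.
    features = [[1.0], [2.0], [3.0], [4.0]]
    labels = [[2.0], [4.0], [6.0], [8.0]]
    return tf.data.Dataset.from_tensor_slices((features, labels)).repeat().batch(2)

# MirroredStrategy replicates the model across local GPUs and combines
# gradients with all-reduce (tf.contrib in TF 1.8/1.9, later tf.distribute).
strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=2)
config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)
estimator.train(input_fn=input_fn, steps=10)
```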

I am not sure I follow the point about "the learner container doesn't distinguish ps and worker, it will be a problem. how to deal with it?" The launcher.py code is responsible for that (https://github.com/fplk/FfDL/blob/merge_20180514_1536/etc/examples/tf-distributed/launcher.py#L42): it basically marks the first n (by default 1) replicas as ps and the others as workers.
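
(A rough illustration of that role-assignment idea, not the actual launcher.py code; the environment variable names below are assumptions.)

```python
import os

# Hypothetical environment variables standing in for whatever the real
# launcher reads from the FfDL deployment (replica index and ps count).
replica_index = int(os.environ.get("LEARNER_ID", "1"))  # 1-based replica index
num_ps = int(os.environ.get("NUM_PS", "1"))             # first n replicas become ps

if replica_index <= num_ps:
    job_name, task_index = "ps", replica_index - 1
else:
    job_name, task_index = "worker", replica_index - num_ps - 1

print("this replica will run as %s task %d" % (job_name, task_index))
```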

I can provide more in-depth detail if the above does not make sense.

@animeshsingh

Thanks @atinsood

@cloustone In short, yes, in the current implementation there is a penalty to be paid for using a GPU for the PS. We would love to schedule a call and discuss options with your team.

@cloustone
Author

Thanks @animeshsingh @atinsood
We understand the current status and hope a good PR or improvement will be available soon.

Our team wants to push FfDL into a production environment, but there is still a lot of work to be done. We would like to stay in close touch and hope to get help from the FfDL team.
That said, it's a little difficult for our team to speak English fluently, so mail or WeChat would be a better way to communicate :)

My email address: jenson.zuo@qq.com

@animeshsingh

Thanks @cloustone
The next PR we have in the pipeline for the 0.1 release candidate, #79, is much closer to what you need. I will send a mail to your address and we can iterate from there.

@Tomcli
Contributor

Tomcli commented Jul 9, 2018

Closed with #79

Tomcli closed this as completed Jul 9, 2018