
support for tensorflow distribution #92

Closed
cloustone opened this issue Jun 9, 2018 · 7 comments

Comments

@cloustone

Could you please tell us how to run distributed TensorFlow training in FfDL?
As we know, TensorFlow distributed training uses worker tasks and parameter-server tasks. When using FfDL, do we need to specify this information explicitly to FfDL?

Thanks

@Tomcli
Contributor

Tomcli commented Jun 11, 2018

Yes, we need to define which nodes to use as parameter servers and which as workers when we run a distributed TensorFlow job.

Currently we have an example in one of our pull requests, https://github.com/fplk/FfDL/tree/merge_20180514_1536/etc/examples/tf-distributed, where we provide the launcher.py script to configure the networking and communication between the different nodes, so users need to do minimal work to enable distributed training.

We will merge this pull request soon, once we finish our code scan and reviews. Sorry for the inconvenience.
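
For reference, a distributed TF 1.x job is wired up from a cluster spec that lists the ps and worker addresses. This is only a minimal sketch with placeholder host names; in FfDL the actual wiring is handled by the launcher.py mentioned above:

```python
import tensorflow as tf

# Placeholder addresses -- in FfDL the launcher script would fill these in
# from the deployment's learner/ps endpoints.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Each replica starts a server for its own role and index.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# A parameter-server replica would instead call server.join(); workers place
# shared variables on the ps tasks via replica_device_setter.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    global_step = tf.train.get_or_create_global_step()
```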

@cloustone
Author

@Tomcli Thanks for your good work. However, according to the example at https://github.com/fplk/FfDL/tree/merge_20180514_1536/etc/examples/tf-distributed, the parameter server also takes up GPU resources. Since the learner container in the native TensorFlow framework does not distinguish between ps and worker tasks, this will be a problem. How should we deal with it?

@atinsood

atinsood commented Jun 13, 2018

@cloustone Unfortunately yes, the ps also takes a GPU resource, so in the current implementation you end up wasting a GPU running the parameter server even though it's not required.

I guess we could probably do interesting things like starting a sidecar container in these pods and having the ps run in the sidecar containers on some of the pods.

We are still thinking through the best approach to implementing this. The above is one option; another is to use the same container and expose the ps as a separate process; another is to run the ps as separate standalone pod(s). We have also thought about enhancing our job monitor component and maybe having it take on the responsibility of the ps.

As you can see, nothing is concrete here yet, since the TF distributed approach itself also seems to be evolving rapidly.

Honestly, we are thinking about the best way to capture all of this in a clean manner (see https://www.youtube.com/watch?v=bRMGoPqsn20), especially the mirrored strategy, which brings ring all-reduce natively to TF.
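
(For context, the mirrored strategy in that era was exposed under tf.contrib roughly as in the sketch below; the tiny model_fn/input_fn are placeholders I made up, not FfDL code, and the example assumes two local GPUs.)

```python
import tensorflow as tf

def model_fn(features, labels, mode):
    # Minimal placeholder model: one dense layer with squared loss.
    predictions = tf.layers.dense(features, 1)
    loss = tf.losses.mean_squared_error(labels, predictions)
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

def input_fn():
    # Tiny constant dataset, batched so each replica gets a slice.
    features = [[1.0], [2.0], [3.0], [4.0]]
    labels = [[2.0], [4.0], [6.0], [8.0]]
    return tf.data.Dataset.from_tensor_slices((features, labels)).repeat().batch(2)

# MirroredStrategy replicates the model across local GPUs and combines
# gradients with all-reduce (tf.contrib in TF 1.8/1.9, later tf.distribute).
strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=2)
config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)
estimator.train(input_fn=input_fn, steps=10)
```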

I am not sure I follow the point about "the learner container doesn't distinguish ps and worker, it will be a problem. how to deal with it?" The launcher.py code is responsible for that (https://github.com/fplk/FfDL/blob/merge_20180514_1536/etc/examples/tf-distributed/launcher.py#L42): it basically marks the first n (by default 1) replicas as ps and the others as workers.
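
(A rough illustration of that role-assignment idea, not the actual launcher.py code; the environment variable names below are assumptions.)

```python
import os

# Hypothetical environment variables standing in for whatever the real
# launcher reads from the FfDL deployment (replica index and ps count).
replica_index = int(os.environ.get("LEARNER_ID", "1"))  # 1-based replica index
num_ps = int(os.environ.get("NUM_PS", "1"))             # first n replicas become ps

if replica_index <= num_ps:
    job_name, task_index = "ps", replica_index - 1
else:
    job_name, task_index = "worker", replica_index - num_ps - 1

print("this replica will run as %s task %d" % (job_name, task_index))
```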

I can provide more in-depth detail if the above does not make sense.

@animeshsingh

Thanks @atinsood

@cloustone In short, yes, in the current implementation there is a penalty to be paid for using a GPU for the PS. We would love to schedule a call and discuss options with your team.

@cloustone
Author

Thanks @animeshsingh @atinsood
We understand the current status and hope a good PR or improvement will be available soon.

Our team wants to push FfDL into a production environment, but there is still a lot of work to be done. We would like to stay in close touch and hope to get help from the FfDL team.
That said, it's a little difficult for our team to speak English fluently, so mail or WeChat would be a better way to communicate :)

My email address: jenson.zuo@qq.com

@animeshsingh

Thanks @cloustone
The next PR we have in the pipeline for the 0.1 release candidate, #79, is much closer to what you need. I will send a mail to your address and we can iterate from there.

@Tomcli
Contributor

Tomcli commented Jul 9, 2018

Closed with #79

Tomcli closed this as completed Jul 9, 2018