Update documentation #10
alsrgv commented on Aug 21, 2017
- Add motivation in the beginning.
- Clarify how processes are assigned GPUs with `visible_device_list`.
- Add quick guide to install Open MPI. Add missing `mpicxx` in PATH to troubleshooting.
- Add `-x LD_LIBRARY_PATH` and other useful env vars to the multi-node `mpirun` example.
- Add quick guide to install NCCL, link to nv_peer_mem for GPUDirect, and `/etc/init.d/nv_peer_mem start`. Add missing NCCL error to troubleshooting.
- Add Travis CI & license links.
- Open MPI with InfiniBand should use `-mca btl_openib_receive_queues P,128,32:P,2048,32:P,12288,32:P,131072,32`, which greatly improves performance.
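To make the `-x LD_LIBRARY_PATH` and `btl_openib_receive_queues` items in the list above concrete, here is a minimal sketch of how such a multi-node `mpirun` command line might be assembled. The host names, the slot count of 4 per host, and the `train.py` script name are hypothetical placeholders, not values taken from this PR; only the receive-queue string is copied from the description.

```python
import shlex

# Receive-queue setting quoted in the PR description.
RECEIVE_QUEUES = "P,128,32:P,2048,32:P,12288,32:P,131072,32"


def build_mpirun_command(hosts, script, slots_per_host=4):
    """Assemble an mpirun argument list.

    `hosts`, `script`, and `slots_per_host` are illustrative assumptions.
    """
    total_processes = len(hosts) * slots_per_host
    host_arg = ",".join(f"{host}:{slots_per_host}" for host in hosts)
    return [
        "mpirun",
        "-np", str(total_processes),       # one process per GPU slot
        "-H", host_arg,                    # host:slots pairs
        "-x", "LD_LIBRARY_PATH",           # export this env var to remote ranks
        "-mca", "btl_openib_receive_queues", RECEIVE_QUEUES,
        "python", script,
    ]


# Hypothetical two-server layout, used only for illustration.
command = build_mpirun_command(["server1", "server2"], "train.py")
print(" ".join(shlex.quote(arg) for arg in command))
```

Building the command as a list keeps each flag and its value together, which makes it easier to add further `-x` exports if other environment variables are needed.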
README.md
Outdated
Horovod is a distributed training framework for TensorFlow. The goal of Horovod is to make distributed Deep Learning
fast and easy to use.

# Why not traditional Distributed TensorFlow?

The primary motivation for this project is to make it easy to take single GPU TensorFlow program and successfully train
take a single GPU
README.md
Outdated
The primary motivation for this project is to make it easy to take single GPU TensorFlow program and successfully train
it on many GPUs faster. This has two aspects:

1. How much modifications does one have to make to program to make it distributed, and how easy is it to run it.
to a program
README.md
Outdated
1. How much modifications does one have to make to program to make it distributed, and how easy is it to run it.
2. How much faster would it run in distributed mode?

Internally at Uber we found that it's much easier for people to understand MPI model that requires minimal changes to
an MPI model
README.md
Outdated
To give some perspective on that, [this commit](https://github.com/alsrgv/benchmarks/commit/86bf2f9269dbefb4e57a8b66ed260c8fab84d6c7)
into our fork of TF Benchmarks shows how much code can be removed if one doesn't need to worry about towers and manually
averaging gradients across them, `tf.Server()`, `tf.ClusterSpec()`, `tf.train.SyncReplicasOptimizer()`,
`tf.train.replicas_device_setter()` and etc. If none of this things makes sense to you - don't worry, you don't have to
replace "etc." with "so on".
Also, "If none of these things"
README.md
Outdated
learn them if you use Horovod.

While installing MPI itself may seem like an extra hassle, it only needs to be done once and by one group of people,
while everyone else in the company who are building the models can enjoy simplicity of training them at scale.
"who are building" ==> "who builds"
README.md
Outdated
@@ -53,7 +91,8 @@ To use Horovod, make the following additions to your program:
1. Run `hvd.init()`.

2. Pin a server GPU to be used by this process using `config.gpu_options.visible_device_list`.
With the typical setup of one GPU per process, this can be set to *local rank*.
With the typical setup of one GPU per process, this can be set to *local rank*. In that case, first process on the
the first process
README.md
Outdated
@@ -53,7 +91,8 @@ To use Horovod, make the following additions to your program:
1. Run `hvd.init()`.

2. Pin a server GPU to be used by this process using `config.gpu_options.visible_device_list`.
With the typical setup of one GPU per process, this can be set to *local rank*.
With the typical setup of one GPU per process, this can be set to *local rank*. In that case, first process on the
server will be allocated first GPU, second process will be allocated second GPU and so forth.
allocated the first GPU, the second process will be allocated the second GPU, and so forth.
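For anyone reading this thread who wants to see the local-rank mapping spelled out, here is a small, self-contained sketch. It does not call Horovod or TensorFlow; the `assign_gpus` helper is hypothetical, and it assumes ranks fill each server in order (by slot), which may not match every launcher configuration.

```python
def assign_gpus(processes_per_server, num_servers):
    """Return (global_rank, server_index, local_rank, visible_device_list) tuples.

    Hypothetical helper: assumes ranks are placed server by server.
    """
    assignments = []
    for global_rank in range(processes_per_server * num_servers):
        server_index = global_rank // processes_per_server
        # Local rank is the process's position on its own server.
        local_rank = global_rank % processes_per_server
        # In a real program this string would be the value assigned to
        # config.gpu_options.visible_device_list.
        assignments.append((global_rank, server_index, local_rank, str(local_rank)))
    return assignments


# Example layout: 2 servers with 4 GPUs each (illustrative numbers only).
for rank, server, local_rank, device_list in assign_gpus(4, 2):
    print(f"rank {rank}: server {server}, local rank {local_rank}, "
          f"visible_device_list='{device_list}'")
```

The point of the sketch is that the global rank keeps growing across servers while the local rank restarts at 0 on each one, so every process sees exactly one GPU on its own machine.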
README.md
Outdated
1. Is MPI in PATH?

If you see error message below, it means `mpicxx` was not found in PATH. Typically `mpicxx` is located in the same
see the error
README.md
Outdated
1. Is MPI in PATH?

If you see error message below, it means `mpicxx` was not found in PATH. Typically `mpicxx` is located in the same
directory as `mpirun`. Please add directory containing `mpicxx` to PATH before installing Horovod.
Please add a directory
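A quick way to check this troubleshooting step before installing is sketched below, using only the Python standard library. The helper names and the `/opt/openmpi/bin` directory are hypothetical; substitute the directory where your MPI installation actually keeps `mpicxx`.

```python
import os
import shutil


def find_mpi_compiler(name="mpicxx", path=None):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name, path=path)


def path_with_directory(directory, current_path):
    """Return a PATH string with `directory` prepended, unless already present."""
    entries = current_path.split(os.pathsep) if current_path else []
    if directory in entries:
        return current_path
    return os.pathsep.join([directory] + entries)


location = find_mpi_compiler()
if location is None:
    # Hypothetical install directory, shown only as an example.
    suggested = path_with_directory("/opt/openmpi/bin", os.environ.get("PATH", ""))
    print("mpicxx not found; a PATH that includes it might look like:")
    print(suggested)
else:
    print(f"mpicxx found at {location}")
```

Note that changing PATH inside a Python process does not affect the shell that later runs the installer, so the suggested value still has to be exported in the shell itself.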
README.md
Outdated
### NCCL 2 is not found

If you see error message below, it means NCCL 2 was not found in standard libraries location. If you have directory
see the error
If you have a directory