
Update documentation #10

Merged
merged 3 commits on Aug 21, 2017

Conversation

Member

@alsrgv commented Aug 21, 2017

1. Add motivation in the beginning.
2. Clarify how processes are assigned GPUs with `visible_device_list` (see the sketch after this list).
3. Add a quick guide to installing Open MPI. Add missing `mpicxx` in `PATH` to troubleshooting.
4. Add `-x LD_LIBRARY_PATH` and other useful environment variables to the multi-node `mpirun` example.
5. Add a quick guide to installing NCCL, link to `nv_peer_mem` for GPUDirect and `/etc/init.d/nv_peer_mem start`. Add the missing NCCL error to troubleshooting.
6. Add Travis CI & license links.
7. Open MPI over InfiniBand should use `-mca btl_openib_receive_queues P,128,32:P,2048,32:P,12288,32:P,131072,32`, which greatly improves performance.

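For context, here is a rough sketch of the per-process changes the updated README walks through, assuming the `horovod.tensorflow` API (`hvd.init`, `hvd.local_rank`, `hvd.DistributedOptimizer`, `hvd.BroadcastGlobalVariablesHook`); the optimizer and the omitted training loop are placeholders, not part of this PR.

```python
# Sketch only: the minimal additions Horovod asks for in a TF 1.x program.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # 1. Initialize Horovod (typically one process per GPU under mpirun).

# 2. Pin one GPU per process via its local rank.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# 3. Wrap the regular optimizer so gradients are averaged across all workers.
opt = hvd.DistributedOptimizer(tf.train.AdagradOptimizer(0.01))

# 4. Broadcast initial variable states from rank 0 so all workers start in sync.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
```

Launched with something like `mpirun -np 4 python train.py`, each of the four processes then trains on its own GPU while Horovod averages the gradients.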
@alsrgv self-assigned this Aug 21, 2017
README.md Outdated
Horovod is a distributed training framework for TensorFlow. The goal of Horovod is to make distributed Deep Learning
fast and easy to use.

# Why not traditional Distributed TensorFlow?

The primary motivation for this project is to make it easy to take single GPU TensorFlow program and successfully train
Collaborator

take a single GPU

README.md Outdated
The primary motivation for this project is to make it easy to take single GPU TensorFlow program and successfully train
it on many GPUs faster. This has two aspects:

1. How much modifications does one have to make to program to make it distributed, and how easy is it to run it.
Collaborator

to a program

README.md Outdated
1. How much modifications does one have to make to program to make it distributed, and how easy is it to run it.
2. How much faster would it run in distributed mode?

Internally at Uber we found that it's much easier for people to understand MPI model that requires minimal changes to
Collaborator

an MPI model

README.md Outdated
To give some perspective on that, [this commit](https://github.com/alsrgv/benchmarks/commit/86bf2f9269dbefb4e57a8b66ed260c8fab84d6c7)
into our fork of TF Benchmarks shows how much code can be removed if one doesn't need to worry about towers and manually
averaging gradients across them, `tf.Server()`, `tf.ClusterSpec()`, `tf.train.SyncReplicasOptimizer()`,
`tf.train.replicas_device_setter()` and etc. If none of this things makes sense to you - don't worry, you don't have to
Collaborator

replace "etc." with "so on".

Collaborator

Also, "If none of these things"

README.md Outdated
learn them if you use Horovod.

While installing MPI itself may seem like an extra hassle, it only needs to be done once and by one group of people,
while everyone else in the company who are building the models can enjoy simplicity of training them at scale.
Collaborator

"who are building" ==> "who builds"

README.md Outdated
@@ -53,7 +91,8 @@ To use Horovod, make the following additions to your program:
1. Run `hvd.init()`.

2. Pin a server GPU to be used by this process using `config.gpu_options.visible_device_list`.
With the typical setup of one GPU per process, this can be set to *local rank*.
With the typical setup of one GPU per process, this can be set to *local rank*. In that case, first process on the
Collaborator

the first process

README.md Outdated
@@ -53,7 +91,8 @@ To use Horovod, make the following additions to your program:
1. Run `hvd.init()`.

2. Pin a server GPU to be used by this process using `config.gpu_options.visible_device_list`.
With the typical setup of one GPU per process, this can be set to *local rank*.
With the typical setup of one GPU per process, this can be set to *local rank*. In that case, first process on the
server will be allocated first GPU, second process will be allocated second GPU and so forth.
Collaborator

allocated the first GPU, the second process will be allocated the second GPU, and so forth.
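To make the suggested wording concrete, here is a small illustration of the mapping it describes, assuming one launched process per GPU; the four-GPU numbers in the comments are just an example.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# On a server with 4 GPUs and 4 local processes, hvd.local_rank() returns
# 0, 1, 2 and 3 respectively, so the first process sees only the first GPU,
# the second process only the second GPU, and so forth.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
```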

README.md Outdated

1. Is MPI in PATH?

If you see error message below, it means `mpicxx` was not found in PATH. Typically `mpicxx` is located in the same
Collaborator

see the error

README.md Outdated
1. Is MPI in PATH?

If you see error message below, it means `mpicxx` was not found in PATH. Typically `mpicxx` is located in the same
directory as `mpirun`. Please add directory containing `mpicxx` to PATH before installing Horovod.
Collaborator

Please add a directory

README.md Outdated

### NCCL 2 is not found

If you see error message below, it means NCCL 2 was not found in standard libraries location. If you have directory
Collaborator

see the error

Collaborator

If you have a directory
