Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added Gloo controller #1181

Merged
merged 8 commits into from Aug 10, 2019

Conversation

@zsh-thu
Copy link
Contributor

commented Jun 28, 2019

In our last pr, all MPI related operations have been decoupled from the control plane logic and abstracted as an MPI controller. This pr derives a new Gloo controller, which implemented all control plane logic including rendezvous independent of MPI environment. NCCL is supported for GPU data transferring.
The Gloo library already finished part of the rendezvous job so that we only have to provide a key-value store that is accessible by all participating processes. Considering the distributed environment, we decided to use an HTTP server that serves as the key/value store. Each process communicates with it through a designed protocol. This part of logic is wrapped in horovodrun, which automatically launches the HTTP server and performs rendezvous.
To enable Gloo controller at runtime, users can simply use a command like horovodrun --gloo ....

Preliminary Benchmark
Calculating graident on GPU and performing all-reduce directly on GPU using NCCL.
(Performance metric: number of processed images per second when training ResNet50 )

num of nodes num of GPU per node Speed per process (MPI) Speed total (MPI) Speed per process (Gloo) Speed total (Gloo)
1 1 189.4 +-1.0 189.4 +-1.0 189.4 +-1.0 189.4 +-1.0
1 4 154.4 +-3.7 617.6 +-14.8 154.4 +-4.3 617.8 +-17.2
2 4 143.7 +-7.7 1149.3 +-61.9 145.5 +-4.6 1163.9 +-37.0
4 4 145.1 +-7.8 2321.3 +-125.6 144.9 +-7.6 2319.5 +-108.1

@zsh-thu zsh-thu force-pushed the zsh-thu:gloo_control branch from 44a09fe to 824029f Jun 28, 2019

@zsh-thu zsh-thu force-pushed the zsh-thu:gloo_control branch from 824029f to 1686866 Jul 11, 2019

@zsh-thu zsh-thu changed the title Added Gloo controller [WIP]Added Gloo controller Jul 11, 2019

@zsh-thu zsh-thu force-pushed the zsh-thu:gloo_control branch 4 times, most recently from e995d8b to 6b01831 Jul 12, 2019

for t in threads:
t.join()

have_alive_chile = True

This comment has been minimized.

Copy link
@abditag2

abditag2 Jul 16, 2019

Collaborator

have_alive_chile -> have_alive_child ?

@@ -454,25 +466,40 @@ def run():
'training script using the standard way provided by your'
' MPI distribution (usually mpirun, srun, or jsrun).')

# Pass all the env variables to the mpirun command.

This comment has been minimized.

Copy link
@abditag2

abditag2 Jul 16, 2019

Collaborator

Can we also pack all the MPI run code (essentially rest of this file) into a mpi_run file, similar to gloo_run?

# ignore strings with any none word chars (e.g. '<')
re.compile(r'[^\w/.:]'),
# Do not overwrite instance id on each instance
re.compile('MICHELANGELO_INSTANCE_ID'),

This comment has been minimized.

Copy link
@abditag2

abditag2 Jul 16, 2019

Collaborator

Instead of hardcoding these variables, can we put them in a config file? That will allow users outside of Uber to customize them for their environment.

@zsh-thu zsh-thu force-pushed the zsh-thu:gloo_control branch from 6b01831 to 93efbd0 Jul 16, 2019

@zsh-thu zsh-thu force-pushed the zsh-thu:gloo_control branch 17 times, most recently from 309353b to ce78097 Jul 17, 2019

@zsh-thu zsh-thu force-pushed the zsh-thu:gloo_control branch 2 times, most recently from 0058048 to ea6c88a Jul 25, 2019

zsh-thu added some commits Jul 30, 2019

fix2
Signed-off-by: Sihan Zeng <zsh@uber.com>
add unit tests
Signed-off-by: Sihan Zeng <zsh@uber.com>

@zsh-thu zsh-thu force-pushed the zsh-thu:gloo_control branch 3 times, most recently from d73839d to 5f3941e Aug 7, 2019

fix review for python part
Signed-off-by: Sihan Zeng <zsh@uber.com>

@zsh-thu zsh-thu force-pushed the zsh-thu:gloo_control branch from 5f3941e to 2213083 Aug 8, 2019

horovod/run/common/util/env.py Outdated Show resolved Hide resolved

@zsh-thu zsh-thu force-pushed the zsh-thu:gloo_control branch from aa6f050 to df3d769 Aug 8, 2019

refractor HTTP rendezvous
Signed-off-by: Sihan Zeng <zsh@uber.com>

@zsh-thu zsh-thu force-pushed the zsh-thu:gloo_control branch from df3d769 to feb8819 Aug 8, 2019

fix review 3
Signed-off-by: Sihan Zeng <zsh@uber.com>

@zsh-thu zsh-thu force-pushed the zsh-thu:gloo_control branch from 123cb6b to 568af0b Aug 9, 2019

horovod/run/run.py Outdated Show resolved Hide resolved
horovod/run/gloo_run.py Outdated Show resolved Hide resolved
fix nit
Signed-off-by: Sihan Zeng <zsh@uber.com>

@zsh-thu zsh-thu force-pushed the zsh-thu:gloo_control branch from d45f583 to 3b693f3 Aug 9, 2019

@tgaddair
Copy link
Collaborator

left a comment

LGTM!

@tgaddair tgaddair merged commit 669eb9e into horovod:master Aug 10, 2019

2 checks passed

DCO DCO
Details
buildkite/horovod/pr Build #913 passed (1 hour, 25 minutes, 20 seconds)
Details
@alsrgv

This comment has been minimized.

Copy link
Member

commented Aug 10, 2019

🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉 🎉

@@ -28,7 +28,9 @@ class MPIController : public Controller {
Timeline& timeline, ParameterManager& parameter_manager,
MPIContext& mpi_ctx)
: Controller(response_cache, tensor_queue, timeline, parameter_manager),
mpi_ctx_(mpi_ctx) {}
mpi_ctx_(mpi_ctx) {
LOG(INFO) << "MPI Controller Initialized.";

This comment has been minimized.

Copy link
@alsrgv

alsrgv Aug 10, 2019

Member

@tgaddair, I think most of these INFOs should be DEBUGs.

host_name = alloc_info.hostname

env = os.environ.copy()
local_command = '{horovod_env} {env} {run_command}' .format(

This comment has been minimized.

Copy link
@alsrgv

alsrgv Aug 11, 2019

Member

@tgaddair @zsh-thu, did we verify that this uses the current working directory of the user calling horovodrun? E.g. if you call horovodrun from /path/to/my/script, would it cd /path/to/my/script before running the python code on other machines? This is necessary to have parity with MPI behavior.

This comment has been minimized.

Copy link
@tgaddair

tgaddair Aug 12, 2019

Collaborator

Good question, I'll double check this tomorrow.

This comment has been minimized.

Copy link
@zsh-thu

zsh-thu Aug 12, 2019

Author Contributor

No, that's not supported. Maybe we can get the absolute path from $PWD and add a cd $PWD command before running the actual command.

This comment has been minimized.

Copy link
@tgaddair
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
4 participants
You can’t perform that action at this time.