Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Gloo api for data layer #1120

Merged
merged 28 commits into from Jun 18, 2019

Conversation

3 participants
@zsh-thu
Copy link
Contributor

commented Jun 3, 2019

Updated the AllGather(), AllReduce(), and Broadcast() operations in the data layer using unified gloo apis.

Benchmark with MPI (Matric: processing image per second based on ResNet50)

Machine Num Process per Machine MPI per Process MPI Total Utilization GLOO Per Process GLOO Total Utilization%
1 1 2.8 +-0.1 2.8 +-0.1 100% 3.0 +-0.1 3.0 +-0.1 100%
2 1 2.2 +-0.0 4.4 +-0.1 78% 2.2 +-0.0 4.4 +-0.1 78%
4 1 1.5 +-0.1 5.9 +-0.3 52% 1.7 +-0.0 6.8 +-0.1 61%
8 1 1.3 +-0.0 10.5 +-0.2 46% 1.3 +-0.0 10.6 +-0.1 47%
16 1 1.3 +-0.0 20.9 +-0.4 47% 1.3 +-0.0 21.4 +-0.3 48%
8 2 1.3 +-0.0 21.0 +-0.4 47% 1.3 +-0.0 21.5 +-0.4 48%

Above results are executing entirely on CPU. For better demonstration, we calculate the gradient on GPU and do allreduce on CPU. The results are as follow:

Machine Num Process per Machine MPI per Process MPI Total Utilization GLOO Per Process GLOO Total Utilization%
1 1 211.3 +-0.5 211.3 +-0.5 100% 211.5 +-0.7 211.5 +-0.7 100%
1 2 164.1 +-4.6 328.2 +-9.3 77% 116.5 +-5.2 233.1 +-10.3 55%
1 4 130.9 +-4.0 523.5 +-16.0 61% 114.4 +-5.6 457.8 +-22.6 54%
2 1 135.5 +-7.4 271.0 +-14.8 64% 103.0 +-7.3 206.1 +-14.7 48%
2 2 94.9 +-14.4 379.8 +-57.7 45% 105.4 +-3.3 421.8 +-13.3 50%
4 2 77.5 +-13.2 620.2 +-105.6 36% 99.1 +-8.8 792.6 +-70.3 47%

@zsh-thu zsh-thu force-pushed the zsh-thu:gloo branch from 49ff244 to 2a58d0a Jun 3, 2019

@alsrgv

This comment has been minimized.

Copy link
Member

commented Jun 3, 2019

I think you want to rebase and force-push this, since it has a lot of changes that are already in master.

@zsh-thu zsh-thu force-pushed the zsh-thu:gloo branch 9 times, most recently from 7bb3c82 to 4b479d9 Jun 3, 2019

@zsh-thu zsh-thu changed the base branch from gloo to master Jun 4, 2019

@tgaddair tgaddair marked this pull request as ready for review Jun 4, 2019

@zsh-thu zsh-thu force-pushed the zsh-thu:gloo branch from c9daeb4 to 76dd703 Jun 4, 2019

@zsh-thu zsh-thu closed this Jun 4, 2019

@zsh-thu zsh-thu reopened this Jun 4, 2019

@zsh-thu

This comment has been minimized.

Copy link
Contributor Author

commented Jun 4, 2019

I think you want to rebase and force-push this, since it has a lot of changes that are already in master.

Already rebased on June 3rd's commit.

@zsh-thu zsh-thu force-pushed the zsh-thu:gloo branch 3 times, most recently from c0c3f1a to 92154d5 Jun 4, 2019

@tgaddair tgaddair requested a review from alsrgv Jun 6, 2019

Show resolved Hide resolved MANIFEST.in Outdated
@tgaddair
Copy link
Collaborator

left a comment

Nice work! This looks great. Just a few comments on variable naming and organization.


void GlooContext::InitializeFromMPI(const MPI_Comm& mpi_comm) {
gloo::transport::tcp::attr attr;
attr.iface = "eth0";

This comment has been minimized.

Copy link
@tgaddair

tgaddair Jun 6, 2019

Collaborator

We should find a way to infer this from MPI. @alsrgv do you know if MPI has a function for determining the network interface it's using?

This comment has been minimized.

Copy link
@tgaddair

tgaddair Jun 7, 2019

Collaborator

After our discussion this mornong: This should be a private member variable of the GlooContext. It should be set in operations.cc from an environment variable like HOROVOD_GLOO_IFACE. That environment variable will be set by horovodrun after it figures out which interfaces are routable. Alternatively, the user can set it themselves if they aren't using horovodrun.

Show resolved Hide resolved horovod/common/operations.cc
Show resolved Hide resolved horovod/common/operations.cc Outdated
@@ -1109,6 +1132,10 @@ void BackgroundThreadLoop(HorovodGlobalState& state, MPIContext& ctx) {
state.message_table = std::unique_ptr<MessageTable>(new MessageTable());
}

#if COMPILE_WITH_GLOO
gloo_context.InitializeFromMPI(ctx.mpi_comm);

This comment has been minimized.

Copy link
@tgaddair

tgaddair Jun 6, 2019

Collaborator

We should also check if Gloo is enabled here to avoid creating Gloo communicator we aren't going to use.

One thing we can do is add a field to the GlooContext called enabled_cpu_operations. We can set this once during initialization at the same place in BackgroundThreadLoop we check the other environment variables.

Then for initialization and Finalize() below, we just need to check against gloo_context.enabled_cpu_operations.

This comment has been minimized.

Copy link
@zsh-thu

zsh-thu Jun 6, 2019

Author Contributor

I have created two flags indicating whether gloo will be used for the data layer and the control layer. And the initialization and finalization functions will check these flags to determine whether we should actually initialize or finalize.

This part of code is in gloo_context.cc && gloo_context.h.

@@ -1149,6 +1176,10 @@ void BackgroundThreadLoop(HorovodGlobalState& state, MPIContext& ctx) {
cb(SHUT_DOWN_ERROR);
}

#if COMPILE_WITH_GLOO

This comment has been minimized.

Copy link
@tgaddair

tgaddair Jun 6, 2019

Collaborator

See comment above. We should check against the environment variable that tells us if Gloo is being used at runtime here too.

This comment has been minimized.

Copy link
@zsh-thu

zsh-thu Jun 6, 2019

Author Contributor

See above.

Show resolved Hide resolved horovod/common/operations.cc Outdated
Show resolved Hide resolved horovod/common/ops/gloo_operations.cc Outdated
Show resolved Hide resolved horovod/common/ops/gloo_operations.cc Outdated

@zsh-thu zsh-thu force-pushed the zsh-thu:gloo branch 6 times, most recently from af513bd to df0db5f Jun 6, 2019

@alsrgv
Copy link
Member

left a comment

Thanks @zsh-thu!

Left a few comments, and have a question about the benchmark: is it done on CPU servers? The values look too small. Does it mean that with batch size 64 you'd only allreduce every 20 seconds? If that's the case, the variability is likely not due to the allreduce algorithm, but due to variability of execution on CPUs (due to thread starvation or noisy neighbors).

I'd suggest running benchmarks on GPU (but keeping allreduce on CPU).

Show resolved Hide resolved horovod/common/operations.cc Outdated
Show resolved Hide resolved horovod/common/gloo_context.cc
Show resolved Hide resolved horovod/common/gloo_context.h Outdated
Show resolved Hide resolved setup.py Outdated
Show resolved Hide resolved horovod/common/operations.cc Outdated

tgaddair and others added some commits Mar 11, 2019

Added float16 support for Gloo
Signed-off-by: Travis Addair <taddair@uber.com>
Use allgatherv
Signed-off-by: Travis Addair <taddair@uber.com>
Use GlooAllgather by default
Signed-off-by: Travis Addair <taddair@uber.com>
Pulled down update to gloo
Signed-off-by: Sihan Zeng <zsh@uber.com>
update allgather allreduce and broadcast for unified gloo api
Signed-off-by: Sihan Zeng <zsh@uber.com>
update setup.py & MANIFEST.in
Signed-off-by: Sihan Zeng <zsh@uber.com>
Add runtime flag to support switching betwee gloo and mpi
Signed-off-by: Sihan Zeng <zsh@uber.com>
Resolve review
Signed-off-by: Sihan Zeng <zsh@uber.com>
fix iface issue
Signed-off-by: Sihan Zeng <zsh@uber.com>
set Gloo to be automatically compiled except on MacOS
Signed-off-by: Sihan Zeng <zsh@uber.com>
fix code style
Signed-off-by: Sihan Zeng <zsh@uber.com>
integrate compile flag
Signed-off-by: Sihan Zeng <zsh@uber.com>
fixed reviews
Signed-off-by: Sihan Zeng <zsh@uber.com>
remove cmake from require list if system has cmake installed
Signed-off-by: Sihan Zeng <zsh@uber.com>
cmake becomes a blocking issue, temporarily work it around by skip co…
…mpiling gloo if cmake is not installed.

Signed-off-by: Sihan Zeng <zsh@uber.com>

@zsh-thu zsh-thu force-pushed the zsh-thu:gloo branch 2 times, most recently from 63b6cd4 to f7e6d4f Jun 15, 2019

rebase on the latest master
Signed-off-by: Sihan Zeng <zsh@uber.com>

@zsh-thu zsh-thu force-pushed the zsh-thu:gloo branch from f7e6d4f to 7b311e2 Jun 15, 2019

Show resolved Hide resolved setup.py Outdated
Show resolved Hide resolved setup.py Outdated
Show resolved Hide resolved setup.py Outdated
Show resolved Hide resolved setup.py
Show resolved Hide resolved setup.py Outdated
Show resolved Hide resolved setup.py Outdated
Show resolved Hide resolved setup.py Outdated

@zsh-thu zsh-thu force-pushed the zsh-thu:gloo branch from 5b1b5d3 to a21d803 Jun 16, 2019

remove chmod related code
Signed-off-by: Sihan Zeng <zsh@uber.com>

@zsh-thu zsh-thu force-pushed the zsh-thu:gloo branch from a21d803 to b58cb7c Jun 16, 2019

setup.py Outdated
@@ -956,11 +1020,68 @@ def build_torch_extension_v2(build_ext, options, torch_version):
build_ext.build_extension(torch_mpi_lib_v2)


def build_cmake(build_ext, ext, output_dir, options):

This comment has been minimized.

Copy link
@alsrgv

alsrgv Jun 17, 2019

Member

nit: Remove empty line

@alsrgv

alsrgv approved these changes Jun 17, 2019

Copy link
Member

left a comment

LGTM

final fix up
Signed-off-by: Sihan Zeng <zsh@uber.com>

@alsrgv alsrgv merged commit 31408cc into horovod:master Jun 18, 2019

2 checks passed

DCO DCO
Details
buildkite/horovod/pr Build #528 passed (51 minutes, 34 seconds)
Details

sblotner added a commit to sblotner/horovod that referenced this pull request Jun 19, 2019

Update Gloo api for data layer (horovod#1120)
* Added gloo as a submodule

Signed-off-by: Travis Addair <taddair@uber.com>

* Added cmake build for gloo

Signed-off-by: Travis Addair <taddair@uber.com>

* Added allreduce and broadcast ops for Gloo

Signed-off-by: Travis Addair <taddair@uber.com>

* Enable MPI

Signed-off-by: Travis Addair <taddair@uber.com>

* Fixed transport

Signed-off-by: Travis Addair <taddair@uber.com>

* Use MPI comm from Horovod

Signed-off-by: Travis Addair <taddair@uber.com>

* Changed gloo allreduce to always make use of fusion buffer

Signed-off-by: Travis Addair <taddair@uber.com>

* Copy directly to output buffer

Signed-off-by: Travis Addair <taddair@uber.com>

* Unique ptr to shared ptr

Signed-off-by: Travis Addair <taddair@uber.com>

* Fixed root pointer rank

Signed-off-by: Travis Addair <taddair@uber.com>

* Added float16 support for Gloo

Signed-off-by: Travis Addair <taddair@uber.com>

* Use allgatherv

Signed-off-by: Travis Addair <taddair@uber.com>

* Use GlooAllgather by default

Signed-off-by: Travis Addair <taddair@uber.com>

* Pulled down update to gloo

Signed-off-by: Sihan Zeng <zsh@uber.com>

* update allgather allreduce and broadcast for unified gloo api

Signed-off-by: Sihan Zeng <zsh@uber.com>

* update setup.py & MANIFEST.in

Signed-off-by: Sihan Zeng <zsh@uber.com>

* Add runtime flag to support switching betwee gloo and mpi

Signed-off-by: Sihan Zeng <zsh@uber.com>

* Resolve review

Signed-off-by: Sihan Zeng <zsh@uber.com>

* fix iface issue

Signed-off-by: Sihan Zeng <zsh@uber.com>

* set Gloo to be automatically compiled except on MacOS

Signed-off-by: Sihan Zeng <zsh@uber.com>

* fix code style

Signed-off-by: Sihan Zeng <zsh@uber.com>

* integrate compile flag

Signed-off-by: Sihan Zeng <zsh@uber.com>

* fixed reviews

Signed-off-by: Sihan Zeng <zsh@uber.com>

* remove cmake from require list if system has cmake installed

Signed-off-by: Sihan Zeng <zsh@uber.com>

* cmake becomes a blocking issue, temporarily work it around by skip compiling gloo if cmake is not installed.

Signed-off-by: Sihan Zeng <zsh@uber.com>

* rebase on the latest master

Signed-off-by: Sihan Zeng <zsh@uber.com>

* remove chmod related code

Signed-off-by: Sihan Zeng <zsh@uber.com>

* final fix up

Signed-off-by: Sihan Zeng <zsh@uber.com>
Signed-off-by: Stephanie Blotner <sblotner@uber.com>

alsrgv added a commit that referenced this pull request Jun 20, 2019

Restructure Horovod doc landing page (#1158)
* Edit Horovod docs (#1119)

Signed-off-by: Stephanie Blotner <sblotner@uber.com>

* Fix tf-nightly-gpu break (#1124)

Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
Signed-off-by: Stephanie Blotner <sblotner@uber.com>

* Fix minor spacing issue in GPU and summary docs

Signed-off-by: Stephanie Blotner <sblotner@uber.com>

* Update Horovod doc site font, add landing page accordion

Signed-off-by: Stephanie Blotner <sblotner@uber.com>

* Reorder and clean up titles in left navigation

Signed-off-by: Stephanie Blotner <sblotner@uber.com>

* Add basic pages for keras, mxnet, pytorch, and tensorflow

Signed-off-by: Stephanie Blotner <sblotner@uber.com>

* Fix mpirun 4 GPU example

Signed-off-by: Stephanie Blotner <sblotner@uber.com>

* Minor updates to README

Signed-off-by: Stephanie Blotner <sblotner@uber.com>

* Make Open MPI an advanced topic

Signed-off-by: Stephanie Blotner <sblotner@uber.com>

* Fix REAMDE TF link

Signed-off-by: Stephanie Blotner <sblotner@uber.com>

* Resolve conflict

Signed-off-by: Stephanie Blotner <sblotner@uber.com>

* Fix minor spacing issue in GPU and summary docs (#1127)

Signed-off-by: Stephanie Blotner <sblotner@uber.com>

* Updated horovodrun command (#1126)

Signed-off-by: Carsten Jacobsen <carstenj@uber.com>
Signed-off-by: Stephanie Blotner <sblotner@uber.com>

* Ask GCC version when filling out the issue (#1133)

Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
Signed-off-by: Stephanie Blotner <sblotner@uber.com>

* Pin tf-nightly to 1.14.1.dev20190606 & remove torchvision-nightly (#1137)

* Pin tf-nightly to 1.14.1.dev20190606

Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>

* torchvision-nightly -> torchvision

Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>

* Pin torchvision to a version that does not require CUDA

Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
Signed-off-by: Stephanie Blotner <sblotner@uber.com>

* Replace .step(synchronize=False) with optimizer.skip_synchronize() (#1132)

* Replace .step(synchronize=False) with optimizer.already_synchronized()

Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>

* Fix docs

Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>

* Rename to skip_synchronize() and fix test

Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
Signed-off-by: Stephanie Blotner <sblotner@uber.com>

* Unpin tf-nightly version (#1140)

Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
Signed-off-by: Stephanie Blotner <sblotner@uber.com>

* Bump version to 0.16.4 (#1139)

Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
Signed-off-by: Stephanie Blotner <sblotner@uber.com>

* remove MSHADOW_USE_F16C (#1141)

Signed-off-by: Lin Yuan <apeforest@gmail.com>
Signed-off-by: Stephanie Blotner <sblotner@uber.com>

* Adding support for multiple CUDA streams for NCCL operations. (#1128)

* Adding support for multiple CUDA streams for NCCL operations.

Signed-off-by: Josh Romero <joshr@nvidia.com>

* Fix compilation without CUDA or NCCL enabled.

Signed-off-by: Josh Romero <joshr@nvidia.com>

* Updating variable names.

Signed-off-by: Josh Romero <joshr@nvidia.com>
Signed-off-by: Stephanie Blotner <sblotner@uber.com>

* Add Singularity example page (#1149)

Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
Signed-off-by: Stephanie Blotner <sblotner@uber.com>

* Update Gloo api for data layer (#1120)

* Added gloo as a submodule

Signed-off-by: Travis Addair <taddair@uber.com>

* Added cmake build for gloo

Signed-off-by: Travis Addair <taddair@uber.com>

* Added allreduce and broadcast ops for Gloo

Signed-off-by: Travis Addair <taddair@uber.com>

* Enable MPI

Signed-off-by: Travis Addair <taddair@uber.com>

* Fixed transport

Signed-off-by: Travis Addair <taddair@uber.com>

* Use MPI comm from Horovod

Signed-off-by: Travis Addair <taddair@uber.com>

* Changed gloo allreduce to always make use of fusion buffer

Signed-off-by: Travis Addair <taddair@uber.com>

* Copy directly to output buffer

Signed-off-by: Travis Addair <taddair@uber.com>

* Unique ptr to shared ptr

Signed-off-by: Travis Addair <taddair@uber.com>

* Fixed root pointer rank

Signed-off-by: Travis Addair <taddair@uber.com>

* Added float16 support for Gloo

Signed-off-by: Travis Addair <taddair@uber.com>

* Use allgatherv

Signed-off-by: Travis Addair <taddair@uber.com>

* Use GlooAllgather by default

Signed-off-by: Travis Addair <taddair@uber.com>

* Pulled down update to gloo

Signed-off-by: Sihan Zeng <zsh@uber.com>

* update allgather allreduce and broadcast for unified gloo api

Signed-off-by: Sihan Zeng <zsh@uber.com>

* update setup.py & MANIFEST.in

Signed-off-by: Sihan Zeng <zsh@uber.com>

* Add runtime flag to support switching betwee gloo and mpi

Signed-off-by: Sihan Zeng <zsh@uber.com>

* Resolve review

Signed-off-by: Sihan Zeng <zsh@uber.com>

* fix iface issue

Signed-off-by: Sihan Zeng <zsh@uber.com>

* set Gloo to be automatically compiled except on MacOS

Signed-off-by: Sihan Zeng <zsh@uber.com>

* fix code style

Signed-off-by: Sihan Zeng <zsh@uber.com>

* integrate compile flag

Signed-off-by: Sihan Zeng <zsh@uber.com>

* fixed reviews

Signed-off-by: Sihan Zeng <zsh@uber.com>

* remove cmake from require list if system has cmake installed

Signed-off-by: Sihan Zeng <zsh@uber.com>

* cmake becomes a blocking issue, temporarily work it around by skip compiling gloo if cmake is not installed.

Signed-off-by: Sihan Zeng <zsh@uber.com>

* rebase on the latest master

Signed-off-by: Sihan Zeng <zsh@uber.com>

* remove chmod related code

Signed-off-by: Sihan Zeng <zsh@uber.com>

* final fix up

Signed-off-by: Sihan Zeng <zsh@uber.com>
Signed-off-by: Stephanie Blotner <sblotner@uber.com>

* MLSL: move mlsl_init before mpi_init, add mlsl_finalize call (#1156)

Signed-off-by: Mikhail Shiryaev <mikhail.shiryaev@intel.com>
Signed-off-by: Stephanie Blotner <sblotner@uber.com>

* Update Horovod GitHub URL (#1147)

Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
Signed-off-by: Stephanie Blotner <sblotner@uber.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.