
Update Gloo api for data layer (#1120)

* Added gloo as a submodule

Signed-off-by: Travis Addair <taddair@uber.com>

* Added cmake build for gloo

Signed-off-by: Travis Addair <taddair@uber.com>

* Added allreduce and broadcast ops for Gloo

Signed-off-by: Travis Addair <taddair@uber.com>

* Enable MPI

Signed-off-by: Travis Addair <taddair@uber.com>

* Fixed transport

Signed-off-by: Travis Addair <taddair@uber.com>

* Use MPI comm from Horovod

Signed-off-by: Travis Addair <taddair@uber.com>

* Changed gloo allreduce to always make use of fusion buffer

Signed-off-by: Travis Addair <taddair@uber.com>

* Copy directly to output buffer

Signed-off-by: Travis Addair <taddair@uber.com>

* Unique ptr to shared ptr

Signed-off-by: Travis Addair <taddair@uber.com>

* Fixed root pointer rank

Signed-off-by: Travis Addair <taddair@uber.com>

* Added float16 support for Gloo

Signed-off-by: Travis Addair <taddair@uber.com>

* Use allgatherv

Signed-off-by: Travis Addair <taddair@uber.com>

* Use GlooAllgather by default

Signed-off-by: Travis Addair <taddair@uber.com>

* Pulled down update to gloo

Signed-off-by: Sihan Zeng <zsh@uber.com>

* update allgather, allreduce, and broadcast for unified gloo api

Signed-off-by: Sihan Zeng <zsh@uber.com>

* update setup.py & MANIFEST.in

Signed-off-by: Sihan Zeng <zsh@uber.com>

* Add runtime flag to support switching between gloo and mpi

Signed-off-by: Sihan Zeng <zsh@uber.com>

* Resolve review

Signed-off-by: Sihan Zeng <zsh@uber.com>

* fix iface issue

Signed-off-by: Sihan Zeng <zsh@uber.com>

* set Gloo to be automatically compiled except on MacOS

Signed-off-by: Sihan Zeng <zsh@uber.com>

* fix code style

Signed-off-by: Sihan Zeng <zsh@uber.com>

* integrate compile flag

Signed-off-by: Sihan Zeng <zsh@uber.com>

* fixed reviews

Signed-off-by: Sihan Zeng <zsh@uber.com>

* remove cmake from require list if system has cmake installed

Signed-off-by: Sihan Zeng <zsh@uber.com>

* cmake is a blocking issue; temporarily work around it by skipping the gloo build if cmake is not installed.

Signed-off-by: Sihan Zeng <zsh@uber.com>

* rebase on the latest master

Signed-off-by: Sihan Zeng <zsh@uber.com>

* remove chmod related code

Signed-off-by: Sihan Zeng <zsh@uber.com>

* final fix up

Signed-off-by: Sihan Zeng <zsh@uber.com>
zsh-thu authored and alsrgv committed Jun 18, 2019
1 parent 599d911 commit 31408cc1d0eec5614882d47c2eeba01954966f39
@@ -46,3 +46,6 @@
[submodule "third_party/flatbuffers"]
path = third_party/flatbuffers
url = https://github.com/google/flatbuffers.git
[submodule "third_party/gloo"]
path = third_party/gloo
url = https://github.com/facebookincubator/gloo.git
@@ -14,3 +14,8 @@ exclude third_party/eigen/Eigen/SparseLU
exclude third_party/eigen/Eigen/src/IterativeSolvers/*
exclude third_party/eigen/Eigen/src/OrderingMethods/Amd.h
exclude third_party/eigen/Eigen/src/SparseCholesky/*

# include cmake related files for submodule gloo
graft third_party/gloo/cmake
recursive-include third_party/gloo CMakeLists.txt
recursive-include third_party/gloo *.in
@@ -53,6 +53,12 @@ namespace common {
#define MLSL_ALLREDUCE "MLSL_ALLREDUCE"
#define MLSL_ALLGATHER "MLSL_ALLGATHER"
#define MLSL_BCAST "MLSL_BCAST"
#define GLOO_ALLREDUCE "GLOO_ALLREDUCE"
#define GLOO_ALLGATHER "GLOO_ALLGATHER"
#define GLOO_BCAST "GLOO_BCAST"

// String constant for gloo interface.
#define GLOO_DEFAULT_IFACE "eth0"

// Device ID used for CPU.
#define CPU_DEVICE_ID (-1)
@@ -143,6 +143,9 @@ struct HorovodGlobalState {
// Index of current CUDA stream to use
int current_nccl_stream = 0;

// A string indicating what framework we are using to perform CPU operations.
std::string cpu_operation;

~HorovodGlobalState() {
// Make sure that the destructor of the background thread is safe to
// call. If a thread is still joinable (not detached or complete) its
@@ -0,0 +1,43 @@
// Copyright 2019 Uber Technologies, Inc. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
// ============================================================================

#include "gloo_context.h"

#include "gloo/mpi/context.h"
#include "gloo/transport/tcp/device.h"

namespace horovod {
namespace common {

void GlooContext::InitializeFromMPI(const MPI_Comm& mpi_comm,
                                    const char* gloo_iface) {
  gloo::transport::tcp::attr attr;
  // TODO(sihan): add interface load balancing after
  // https://github.com/facebookincubator/gloo/issues/183 is resolved
  attr.iface = gloo_iface;
  attr.ai_family = AF_UNSPEC;
  auto dev = gloo::transport::tcp::CreateDevice(attr);

  auto context = std::make_shared<gloo::mpi::Context>(mpi_comm);
  context->connectFullMesh(dev);
  ctx = context;
}

void GlooContext::Finalize() {
  ctx.reset();
}

} // namespace common
} // namespace horovod
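
Note on the ownership pattern in gloo_context.cc: InitializeFromMPI builds the derived gloo::mpi::Context, runs the derived-only setup on it, and then stores it through the base-typed shared_ptr member; Finalize drops that reference. The sketch below reproduces only that shape so it can be compiled without Gloo or MPI. BaseContext, DerivedContext, ContextHolder and the live counter are hypothetical stand-ins, not Gloo or Horovod types, and the assert guarding double initialization is an addition that the commit does not contain.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for gloo::Context (illustration only).
struct BaseContext {
  virtual ~BaseContext() = default;
};

// Hypothetical stand-in for gloo::mpi::Context. The static counter makes
// the point at which the object is destroyed observable.
struct DerivedContext : BaseContext {
  static int live;
  DerivedContext() { ++live; }
  ~DerivedContext() override { --live; }
};
int DerivedContext::live = 0;

// Same shape as GlooContext: the member is typed on the base class.
struct ContextHolder {
  void Initialize() {
    // Assumption: guard against double initialization (not in the commit).
    assert(!ctx && "context already initialized");
    auto context = std::make_shared<DerivedContext>();
    // Derived-only setup (connectFullMesh in the commit) would run here,
    // while the derived type is still visible.
    ctx = context;  // shared_ptr<Derived> converts to shared_ptr<Base>
  }

  // Drops this holder's reference; the object is destroyed once no other
  // shared_ptr still owns it.
  void Finalize() { ctx.reset(); }

  std::shared_ptr<BaseContext> ctx;
};
```

Because ctx is a shared_ptr rather than a unique_ptr, any code that copied the pointer before Finalize keeps the context alive until it releases its own copy.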
@@ -0,0 +1,38 @@
// Copyright 2019 Uber Technologies, Inc. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
// =============================================================================

#ifndef HOROVOD_GLOO_CONTEXT_H
#define HOROVOD_GLOO_CONTEXT_H

#include "gloo/context.h"
#include "mpi.h"

#include "common.h"

namespace horovod {
namespace common {

struct GlooContext {
  void InitializeFromMPI(const MPI_Comm &mpi_comm, const char* gloo_iface);

  void Finalize();

  std::shared_ptr<gloo::Context> ctx;
};

} // namespace common
} // namespace horovod

#endif //HOROVOD_GLOO_CONTEXT_H
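
Note on the runtime flag: the hunks shown here add the cpu_operation string to HorovodGlobalState and the GLOO_DEFAULT_IFACE constant, but not the code that fills them in. One plausible way to resolve both values from environment settings is sketched below. The CpuBackend enum, both function names, the case-insensitive matching, and the choice of MPI as the fallback are assumptions made for illustration; only GLOO_DEFAULT_IFACE and its "eth0" value come from the diff.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Taken from the common.h hunk in this commit.
#define GLOO_DEFAULT_IFACE "eth0"

// Hypothetical enum; the commit itself stores the choice as a std::string.
enum class CpuBackend { MPI, GLOO };

// `raw` is the value of an environment variable, or nullptr when unset.
// Falls back to the default interface when the value is missing or empty.
inline std::string ResolveGlooIface(const char* raw) {
  return (raw != nullptr && *raw != '\0') ? std::string(raw)
                                          : std::string(GLOO_DEFAULT_IFACE);
}

// Picks Gloo only when it is requested explicitly. Treating every other
// value (including unset) as MPI is an assumption of this sketch.
inline CpuBackend ResolveCpuBackend(const char* raw) {
  if (raw == nullptr) {
    return CpuBackend::MPI;
  }
  std::string value(raw);
  std::transform(value.begin(), value.end(), value.begin(),
                 [](unsigned char c) {
                   return static_cast<char>(std::tolower(c));
                 });
  return value == "gloo" ? CpuBackend::GLOO : CpuBackend::MPI;
}
```

A caller would pass in the result of std::getenv, for example ResolveCpuBackend(std::getenv("HOROVOD_CPU_OPERATIONS")) and ResolveGlooIface(std::getenv("HOROVOD_GLOO_IFACE")); both variable names are assumed here, since they do not appear in the hunks above. Taking the raw value as a parameter keeps the two functions free of global state.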
