
MLSL operation implementation #908

Merged
merged 1 commit into from May 30, 2019

Conversation

5 participants
@shirosankaku
Contributor

commented Mar 13, 2019

This PR adds integration with the Intel(R) MLSL library at the operation level.

@shirosankaku shirosankaku force-pushed the SmorkalovME:mlsl_op branch 2 times, most recently from 8324054 to eadb1c4 Mar 13, 2019

@alsrgv alsrgv requested review from tgaddair, alsrgv and abditag2 Mar 15, 2019

@shirosankaku shirosankaku force-pushed the SmorkalovME:mlsl_op branch from eadb1c4 to 351b7ec Mar 22, 2019

@alsrgv

Collaborator

commented Mar 28, 2019

@tgaddair, could you review for alignment with ops.cc refactor?

@alsrgv

Collaborator

commented Mar 28, 2019

@shirosankaku, could you rebase? We've recently fixed CI, and it would be great to see a green build.

@alsrgv
Collaborator

left a comment

Apologies for the delay. Added a first pass of comments. Could you add a test to .travis.yml that would reinstall Horovod with MLSL support and re-run some of the tests? How long does the MLSL build/installation take?

Resolved review threads on outdated code: horovod/common/common.h (2 threads), horovod/common/operations.cc (5 threads)
@alsrgv

Collaborator

commented Apr 3, 2019

@AlekseyMarchuk, thanks for the updates! Could you sign DCO, and add MLSL tests to .travis.yml?

@AlekseyMarchuk AlekseyMarchuk force-pushed the SmorkalovME:mlsl_op branch 4 times, most recently from 8b568b9 to f40a59d Apr 3, 2019

@AlekseyMarchuk


commented Apr 8, 2019

@alsrgv I've added the MLSL tests to the new CI, but the build fails with a strange error, /bin/bash: pip: command not found, even though pip was used earlier in the script. Could you please take a look at it?

@alsrgv

Collaborator

commented Apr 8, 2019

@AlekseyMarchuk, not sure - is it possible that MLSL installation overwrites the contents of /usr/local? Can you try manually running commands from Dockerfile.test.cpu until you discover the issue?

@AlekseyMarchuk AlekseyMarchuk force-pushed the SmorkalovME:mlsl_op branch from bf6af5d to ed65651 Apr 9, 2019

@alsrgv

Collaborator

commented Apr 9, 2019

@AlekseyMarchuk, can you rebase on latest master? We've made a fix to the build break you're seeing.

@AlekseyMarchuk


commented Apr 9, 2019

@alsrgv sure, one moment. You were right - the MLSL installation script overwrites the destination directory. I completely forgot about that.

@AlekseyMarchuk


commented Apr 9, 2019

The MLSL build now fails with:

ERROR: denied: The repository with name 'buildkite' in registry with id '823773083436' already has the maximum allowed number of images which is '1000'

Exited with 1

It seems to be a CI issue.

@alsrgv

Collaborator

commented Apr 9, 2019

@AlekseyMarchuk, sorry about that - fixed the CI.

@AlekseyMarchuk AlekseyMarchuk force-pushed the SmorkalovME:mlsl_op branch 4 times, most recently from 2ea5d39 to 2f917bb Apr 9, 2019

@alsrgv

Collaborator

commented Apr 10, 2019

@AlekseyMarchuk, we made another CI fix yesterday, could you rebase and push?

@AlekseyMarchuk


commented Apr 11, 2019

@alsrgv I rebased, but several CI tests failed with the following error stack:

Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(474)...:
MPID_Init(190)..........: channel initialization failed
MPIDI_CH3_Init(89)......:
MPID_nem_init(413)......:
MPIDI_nem_ckpt_init(170): BLCR kernel module not present
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(474)...:
MPID_Init(190)..........: channel initialization failed
MPIDI_CH3_Init(89)......:
MPID_nem_init(413)......:
MPIDI_nem_ckpt_init(170): BLCR kernel module not present

Have you seen this issue before?

@alsrgv

Collaborator

commented Apr 11, 2019

@AlekseyMarchuk, no - we don't use MPICH internally, so I don't have experience with it beyond the other unit tests, which run successfully. Does MLSL work with Open MPI?

@AlekseyMarchuk AlekseyMarchuk force-pushed the SmorkalovME:mlsl_op branch 2 times, most recently from 21f765f to 5e946b5 Apr 15, 2019

@shirosankaku shirosankaku force-pushed the SmorkalovME:mlsl_op branch from 1135cad to d49a3de May 17, 2019

@shirosankaku

Contributor Author

commented May 17, 2019

@alsrgv Sorry for the delay. We've implemented the latest round of fixes.

@@ -854,7 +870,63 @@ void CoordinateCacheAndState(CacheCoordinator& cache_coordinator,
// otherwise we may end up dispatching many blocked threads and never make
// progress if we have a thread pool limit.
bool RunLoopOnce(HorovodGlobalState& state, MPIContext& ctx, bool is_coordinator);

#if HAVE_MLSL


tgaddair May 20, 2019

Collaborator

Can we move this into MLSLContext? One of our goals with operations.cc is to minimize the amount of "framework specific" code in here. Ideally, we want to isolate framework code within specific places like mlsl_context.h or mlsl_operations.h.


shirosankaku May 21, 2019

Author Contributor

Moved to MLSLContext

void BackgroundThreadLoop(HorovodGlobalState& state, MPIContext& ctx) {
#if HAVE_MLSL


tgaddair May 20, 2019

Collaborator

I feel that this adds a lot of complication to the BackgroundThreadLoop. I'm wondering if some of this can be pulled out into MLSLContext in some way? Background thread initialization is something that will ultimately need to be abstracted between MLSL, MPI, Gloo, etc.


shirosankaku May 21, 2019

Author Contributor

Moved init and finalize logic to MLSLContext

@shirosankaku shirosankaku force-pushed the SmorkalovME:mlsl_op branch 2 times, most recently from 958b2cc to c6377c3 May 21, 2019

@@ -1581,6 +1603,14 @@ void InitializeHorovodOnce(const int* ranks, int nranks) {
while (!horovod_global.initialization_done) {
std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

#if HAVE_MLSL
LOG(DEBUG) << "BG-thread init done";


tgaddair May 21, 2019

Collaborator

Could this be useful information for non-MLSL scenarios as well? Looks generic enough and could potentially provide insight into certain types of failures. Thoughts, @alsrgv?


alsrgv May 21, 2019

Collaborator

Yeah, I'm fine with removing the #if around this log message.


shirosankaku May 22, 2019

Author Contributor

Done

@shirosankaku shirosankaku force-pushed the SmorkalovME:mlsl_op branch from c6377c3 to 4b03200 May 22, 2019

@shirosankaku

Contributor Author

commented May 23, 2019

@alsrgv Hello! The last build with our changes was successful, but this time the PyTorch-related tests failed with errors like:

ImportError: /usr/local/lib/python3.5/dist-packages/torchvision/_C.cpython-35m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail36_typeMetaDataInstance_preallocated_7E

Could you please have a look?

@alsrgv

Collaborator

commented May 26, 2019

@shirosankaku, could you rebase on latest master? It fixes recent torchvision compatibility issues.

@shirosankaku shirosankaku force-pushed the SmorkalovME:mlsl_op branch from bc87ec9 to 7895304 May 27, 2019

@shirosankaku

Contributor Author

commented May 27, 2019

@alsrgv Done.

@alsrgv

Collaborator

commented May 27, 2019

Thanks! @tgaddair, could you do another pass?

@@ -75,7 +77,11 @@ def test_horovod_allreduce_cpu(self):
"""Test on CPU that the allreduce correctly sums 1D, 2D, 3D tensors."""
hvd.init()
size = hvd.size()
dtypes = [tf.int32, tf.int64, tf.float16, tf.float32, tf.float64]
# MLSL supports only byte, float and double data types


abditag2 May 27, 2019

Collaborator

Do we need some sort of a check in runtime to throw an exception if the user requests unsupported compression with MLSL?


shirosankaku May 28, 2019

Author Contributor

Actually, we do have that kind of check; only Allreduce needs it. The check lives in the GetMLSLDataType function in mlsl_operations.cc, which is called when MLSL_Allreduce is executed. The reason for the additional checks in the test_*.py files is to keep the tests from failing on the data types that MLSL_Allreduce does not support (that exception is expected, so without the filtering the tests would fail on it).
Or would you prefer to see this check in the ConstructResponse function in operations.cc?
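For illustration, here is a minimal Python sketch of the kind of runtime check described above. The real check is C++ code in GetMLSLDataType in mlsl_operations.cc; the function name, constant, and dtype strings below are hypothetical, based only on the "byte, float and double" comment in the test diff.

```python
# Hypothetical Python analogue of the runtime data-type check described
# above; the real check is the C++ GetMLSLDataType in mlsl_operations.cc.
# MLSL supports only byte, float and double data types.
MLSL_SUPPORTED_DTYPES = {'uint8', 'float32', 'float64'}

def check_mlsl_dtype(dtype_name):
    """Raise ValueError when MLSL_Allreduce cannot handle the data type."""
    if dtype_name not in MLSL_SUPPORTED_DTYPES:
        raise ValueError('MLSL does not support data type: %s' % dtype_name)
    return dtype_name
```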

@@ -1587,6 +1609,13 @@ void InitializeHorovodOnce(const int* ranks, int nranks) {
while (!horovod_global.initialization_done) {
std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

LOG(DEBUG) << "Background thread init done";
#if HAVE_MLSL


tgaddair May 27, 2019

Collaborator

Can you leave a comment explaining what this is signaling in the context of MLSL, and why we only log it when using MLSL?


shirosankaku May 28, 2019

Author Contributor

I guess it's better to remove it.

setup.py Outdated
@@ -35,6 +35,11 @@
torch_mpi_lib_v2 = Extension('horovod.torch.mpi_lib_v2', [])
mxnet_mpi_lib = Extension('horovod.mxnet.mpi_lib', [])

have_mlsl = False
mlsl_root = os.environ.get('MLSL_ROOT')
if mlsl_root is not None:


tgaddair May 27, 2019

Collaborator

Nit: simplify to have_mlsl = mlsl_root is not None


shirosankaku May 28, 2019

Author Contributor

Thank you. Done
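The agreed-upon setup.py change boils down to a simple environment check. A self-contained sketch (the helper function name is mine, not from the PR; setup.py uses the values at module level):

```python
import os

def detect_mlsl(environ=None):
    # Presence of the MLSL_ROOT environment variable decides whether
    # Horovod is built with MLSL support, per the simplification above.
    environ = os.environ if environ is None else environ
    mlsl_root = environ.get('MLSL_ROOT')
    return mlsl_root is not None
```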

@@ -44,7 +44,10 @@ def test_timeline(self):
hvd.init()

# Perform a simple allreduce operation
hvd.allreduce(torch.tensor([1, 2, 3]), name='test_allreduce')
if 'MLSL_ROOT' in os.environ:


tgaddair May 27, 2019

Collaborator

Can we just remove the else statement and always test against float32? That way we don't need to add additional coupling with MLSL here.


shirosankaku May 28, 2019

Author Contributor

Done

if _fp16_supported:
dtypes += [torch.HalfTensor]
# MLSL supports only byte, float and double data types
if 'MLSL_ROOT' in os.environ:


tgaddair May 27, 2019

Collaborator

Looks like almost every test has been updated with this check. I think we can simplify this and make it more extensible to other frameworks with data type restrictions by having a single utility function that filters the list of dtypes down to only the ones supported by the current framework.

For example:

MLSL_TYPES = set([torch.FloatTensor, torch.DoubleTensor])

def filter_supported_types(dtypes):
    if 'MLSL_ROOT' in os.environ:
        dtypes = [dtype for dtype in dtypes if dtype in MLSL_TYPES]
    return dtypes

...
dtypes = filter_supported_types([torch.IntTensor, torch.LongTensor,
                                 torch.FloatTensor, torch.DoubleTensor])


shirosankaku May 28, 2019

Author Contributor

Thank you. Done

@@ -44,8 +44,11 @@ def test_horovod_allreduce(self):
"""Test that the allreduce correctly sums 1D, 2D, 3D tensors."""
hvd.init()
size = hvd.size()
dtypes = ['int32', 'int64',
'float32', 'float64']
if 'MLSL_ROOT' in os.environ:


tgaddair May 28, 2019

Collaborator

Please use the same filter_supported_types pattern here for consistency.


shirosankaku May 29, 2019

Author Contributor

Done
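For tests that name dtypes as strings, as in the diff above, the same pattern might look like the sketch below. The supported set is an assumption based on the "byte, float and double" comment; among the string dtypes listed in this diff, only the float types would pass the filter.

```python
import os

# filter_supported_types adapted to string dtype names; the supported set
# is an assumption based on MLSL supporting byte, float and double types.
MLSL_SUPPORTED = {'float32', 'float64'}

def filter_supported_types(dtypes):
    # When building against MLSL, drop dtypes its allreduce cannot handle.
    if 'MLSL_ROOT' in os.environ:
        dtypes = [dtype for dtype in dtypes if dtype in MLSL_SUPPORTED]
    return dtypes

dtypes = filter_supported_types(['int32', 'int64', 'float32', 'float64'])
```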

@tgaddair
Collaborator

left a comment

LGTM! Nice work.

@alsrgv

alsrgv approved these changes May 29, 2019

Collaborator

left a comment

LGTM, one minor comment.

.travis.yml Outdated
@@ -0,0 +1,180 @@
dist: trusty


alsrgv May 29, 2019

Collaborator

Can you remove this file? We have removed Travis CI integration in favor of Buildkite.

@alsrgv

Collaborator

commented May 30, 2019

Also, it looks like the most recent commits caused a merge conflict with test_torch; could you resolve it?

@alsrgv

Collaborator

commented May 30, 2019

@shirosankaku, can you rebase on master instead of merging? The PR now contains a lot of unrelated recent changes (e.g. the whole .md to .rst migration), which makes it hard to review.

@shirosankaku shirosankaku force-pushed the SmorkalovME:mlsl_op branch from 8e54a52 to 10c7329 May 30, 2019

MLSL integration
Signed-off-by: Yana Shchyokotova <yana.shchyokotova@intel.com>

@shirosankaku shirosankaku force-pushed the SmorkalovME:mlsl_op branch from 10c7329 to 196f338 May 30, 2019

@shirosankaku

Contributor Author

commented May 30, 2019

We've squashed all the commits into one.

@alsrgv

Collaborator

commented May 30, 2019

Perfect, thanks!

@alsrgv alsrgv merged commit 7d9d053 into horovod:master May 30, 2019

2 checks passed:

DCO - passed
buildkite/horovod/pr - Build #390 passed (48 minutes, 46 seconds)