
MLSL operation implementation #908

Merged
merged 1 commit into from
May 30, 2019
Conversation

shirosankaku
Contributor

This PR adds integration with the Intel(R) MLSL library at the operation level.

@alsrgv
Member

alsrgv commented Mar 28, 2019

@tgaddair, could you review for alignment with ops.cc refactor?

@alsrgv
Member

alsrgv commented Mar 28, 2019

@shirosankaku, could you rebase? We've recently fixed CI, would be great to see a green build.

Member

@alsrgv alsrgv left a comment

Apologies for the delay. Added a first pass of comments. Could you add a test to .travis.yml that reinstalls Horovod with MLSL support and re-runs some of the tests? How long does the MLSL build/installation take?

(Resolved review threads on horovod/common/common.h and horovod/common/operations.cc)
@alsrgv
Member

alsrgv commented Apr 3, 2019

@AlekseyMarchuk, thanks for the updates! Could you sign the DCO and add MLSL tests to .travis.yml?

@AlekseyMarchuk

@alsrgv I've added MLSL tests to the new CI, but the build fails with a strange error, /bin/bash: pip: command not found, even though pip was used earlier. Could you please take a look at the script?

@alsrgv
Member

alsrgv commented Apr 8, 2019

@AlekseyMarchuk, not sure - is it possible that the MLSL installation overwrites the contents of /usr/local? Can you try manually running the commands from Dockerfile.test.cpu until you discover the issue?

@alsrgv
Member

alsrgv commented Apr 9, 2019

@AlekseyMarchuk, can you rebase on the latest master? We've fixed the build break you're seeing.

@AlekseyMarchuk

@alsrgv sure, one moment. You were right - the MLSL installation script overwrites the destination directory. I completely forgot about that.

@AlekseyMarchuk

The MLSL build now fails with:

ERROR: denied: The repository with name 'buildkite' in registry with id '823773083436' already has the maximum allowed number of images which is '1000'

Exited with 1

It seems to be a CI issue.

@alsrgv
Member

alsrgv commented Apr 9, 2019

@AlekseyMarchuk, sorry about that - fixed the CI.

@alsrgv
Member

alsrgv commented Apr 10, 2019

@AlekseyMarchuk, we made another CI fix yesterday, could you rebase and push?

@AlekseyMarchuk

@alsrgv I rebased, but several CI tests failed with the following error stack:

Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(474)...:
MPID_Init(190)..........: channel initialization failed
MPIDI_CH3_Init(89)......:
MPID_nem_init(413)......:
MPIDI_nem_ckpt_init(170): BLCR kernel module not present
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(474)...:
MPID_Init(190)..........: channel initialization failed
MPIDI_CH3_Init(89)......:
MPID_nem_init(413)......:
MPIDI_nem_ckpt_init(170): BLCR kernel module not present

Have you seen this issue before?

@alsrgv
Member

alsrgv commented Apr 11, 2019

@AlekseyMarchuk, no - we don't use MPICH internally, and I don't have experience with this beyond the other unit tests, which run successfully. Does MLSL work with Open MPI?

@shirosankaku
Contributor Author

@alsrgv Sorry for the delay. The latest fixes have been applied.

@@ -854,7 +870,63 @@ void CoordinateCacheAndState(CacheCoordinator& cache_coordinator,
// otherwise we may end up dispatching many blocked threads and never make
// progress if we have a thread pool limit.
bool RunLoopOnce(HorovodGlobalState& state, MPIContext& ctx, bool is_coordinator);

#if HAVE_MLSL
Collaborator

Can we move this into MLSLContext? One of our goals with operations.cc is to minimize the amount of "framework specific" code in here. Ideally, we want to isolate framework code within specific places like mlsl_context.h or mlsl_operations.h.

Contributor Author

Moved to MLSLContext

void BackgroundThreadLoop(HorovodGlobalState& state, MPIContext& ctx) {
#if HAVE_MLSL
Collaborator

I feel that this adds a lot of complication to the BackgroundThreadLoop. I'm wondering if some of this can be pulled out into MLSLContext in some way? Background thread initialization is something that will ultimately need to be abstracted between MLSL, MPI, Gloo, etc.

Contributor Author

Moved init and finalize logic to MLSLContext

@shirosankaku shirosankaku force-pushed the mlsl_op branch 2 times, most recently from 958b2cc to c6377c3 on May 21, 2019 15:37
@@ -1581,6 +1603,14 @@ void InitializeHorovodOnce(const int* ranks, int nranks) {
while (!horovod_global.initialization_done) {
std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

#if HAVE_MLSL
LOG(DEBUG) << "BG-thread init done";
Collaborator

Could this be useful information for non-MLSL scenarios as well? Looks generic enough and could potentially provide insight into certain types of failures. Thoughts, @alsrgv?

Member

Yeah, I'm fine with removing the #if around this log message.

Contributor Author

Done

@shirosankaku
Contributor Author

@alsrgv Hello! The last build with our changes was successful, but this time PyTorch-related tests failed due to errors like:
ImportError: /usr/local/lib/python3.5/dist-packages/torchvision/_C.cpython-35m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail36_typeMetaDataInstance_preallocated_7E

Could you please have a look?

@alsrgv
Member

alsrgv commented May 26, 2019

@shirosankaku, could you rebase on latest master? It fixes recent torchvision compatibility issues.

@shirosankaku
Contributor Author

@alsrgv Done.

@alsrgv
Member

alsrgv commented May 27, 2019

Thanks! @tgaddair, could you do another pass?

@@ -75,7 +77,11 @@ def test_horovod_allreduce_cpu(self):
"""Test on CPU that the allreduce correctly sums 1D, 2D, 3D tensors."""
hvd.init()
size = hvd.size()
dtypes = [tf.int32, tf.int64, tf.float16, tf.float32, tf.float64]
# MLSL supports only byte, float and double data types
Collaborator

Do we need some sort of runtime check that throws an exception if the user requests unsupported compression with MLSL?

Contributor Author

@shirosankaku shirosankaku May 28, 2019

Actually, we have that kind of check; only Allreduce needs it.
The check is in the GetMLSLDataType function in mlsl_operations.cc, which is called when MLSL_Allreduce is executed. The reason for the checks in the test_*.py files is to keep the tests from failing on types that the MLSL_Allreduce operation does not support (that sort of exception is not expected by the tests, so they would fail).
Or would you prefer seeing this check in the ConstructResponse function in operations.cc?
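For illustration, here is a minimal Python sketch of the kind of runtime guard described above. The function and set names are invented for this sketch, and the supported-type list is taken from the "byte, float and double" comment in the tests; the real check lives in GetMLSLDataType in mlsl_operations.cc.

```python
# Hypothetical sketch of a runtime dtype guard like the one described above.
# The names and the exact supported-type list are assumptions for illustration.
MLSL_SUPPORTED_DTYPES = {'uint8', 'float32', 'float64'}

def check_mlsl_dtype(dtype_name):
    """Raise a ValueError if MLSL does not support the requested dtype."""
    if dtype_name not in MLSL_SUPPORTED_DTYPES:
        raise ValueError(
            'MLSL does not support dtype %s; supported types: %s'
            % (dtype_name, sorted(MLSL_SUPPORTED_DTYPES)))
    return dtype_name
```

Raising early like this surfaces a clear error to the user instead of an unexpected failure deep inside the allreduce call.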

@@ -1587,6 +1609,13 @@ void InitializeHorovodOnce(const int* ranks, int nranks) {
while (!horovod_global.initialization_done) {
std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

LOG(DEBUG) << "Background thread init done";
#if HAVE_MLSL
Collaborator

Can you leave a comment explaining what this signals in the context of MLSL, and why we only log it when using MLSL?

Contributor Author

I guess it's better to remove it.

setup.py
@@ -35,6 +35,11 @@
torch_mpi_lib_v2 = Extension('horovod.torch.mpi_lib_v2', [])
mxnet_mpi_lib = Extension('horovod.mxnet.mpi_lib', [])

have_mlsl = False
mlsl_root = os.environ.get('MLSL_ROOT')
if mlsl_root is not None:
Collaborator

Nit: Simplify to have_mlsl = mlsl_root is not None
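A sketch of the simplified detection this nit suggests, assuming the same MLSL_ROOT environment variable used in the diff above:

```python
import os

# Detect MLSL by the presence of the MLSL_ROOT environment variable,
# collapsing the if-statement into a single boolean expression.
mlsl_root = os.environ.get('MLSL_ROOT')
have_mlsl = mlsl_root is not None
```

This keeps mlsl_root available for later build steps while deriving the boolean flag in one line.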

Contributor Author

Thank you. Done

@@ -44,7 +44,10 @@ def test_timeline(self):
hvd.init()

# Perform a simple allreduce operation
hvd.allreduce(torch.tensor([1, 2, 3]), name='test_allreduce')
if 'MLSL_ROOT' in os.environ:
Collaborator

Can we just remove the else statement and always test against float32? That way we don't need to add additional coupling with MLSL here.

Contributor Author

Done

if _fp16_supported:
dtypes += [torch.HalfTensor]
# MLSL supports only byte, float and double data types
if 'MLSL_ROOT' in os.environ:
Collaborator

Looks like almost every test has been updated with this check. I think we can simplify this and make it more extensible to other frameworks with data type restrictions by having a single utility function that filters the list of dtypes down to only the ones supported by the current framework.

For example:

MLSL_TYPES = set([torch.FloatTensor, torch.DoubleTensor])

def filter_supported_types(dtypes):
    if 'MLSL_ROOT' in os.environ:
        dtypes = [dtype for dtype in dtypes if dtype in MLSL_TYPES]
    return dtypes

...
dtypes = filter_supported_types([torch.IntTensor, torch.LongTensor,
                                 torch.FloatTensor, torch.DoubleTensor])

Contributor Author

Thank you. Done

@@ -44,8 +44,11 @@ def test_horovod_allreduce(self):
"""Test that the allreduce correctly sums 1D, 2D, 3D tensors."""
hvd.init()
size = hvd.size()
dtypes = ['int32', 'int64',
'float32', 'float64']
if 'MLSL_ROOT' in os.environ:
Collaborator

Please use the same filter_supported_types pattern here for consistency.

Contributor Author

Done

Collaborator

@tgaddair tgaddair left a comment

LGTM! Nice work.

Member

@alsrgv alsrgv left a comment

LGTM, one minor comment.

.travis.yml
@@ -0,0 +1,180 @@
dist: trusty
Member

Can you remove this file? We have removed Travis CI integration in favor of Buildkite.

@alsrgv
Member

alsrgv commented May 30, 2019

Also, it looks like the most recent commits caused a merge conflict with test_torch; could you resolve it?

@alsrgv
Member

alsrgv commented May 30, 2019

@shirosankaku, can you rebase on master instead of merging? The PR now contains a lot of recent changes (e.g. the whole .md -> .rst migration), which makes it hard to review.

Signed-off-by: Yana Shchyokotova <yana.shchyokotova@intel.com>
@shirosankaku
Contributor Author

We've squashed all the commits into one.

@alsrgv
Member

alsrgv commented May 30, 2019

Perfect, thanks!

@alsrgv alsrgv merged commit 7d9d053 into horovod:master May 30, 2019
jeffdaily pushed a commit to ROCm/horovod that referenced this pull request Nov 27, 2019
Signed-off-by: Yana Shchyokotova <yana.shchyokotova@intel.com>