
MLSL operation implementation #908

Merged
merged 1 commit into from
May 30, 2019
Conversation

shirosankaku
Contributor

This PR adds integration with the Intel(R) MLSL library at the operation level.

@alsrgv
Member

alsrgv commented Mar 28, 2019

@tgaddair, could you review for alignment with ops.cc refactor?

@alsrgv
Member

alsrgv commented Mar 28, 2019

@shirosankaku, could you rebase? We've recently fixed CI, would be great to see a green build.

Member

@alsrgv alsrgv left a comment

Apologies for the delay. Added a first pass of comments. Could you add a test to .travis.yml that reinstalls Horovod with MLSL support and re-runs some of the tests? How long does the MLSL build/installation take?

(Resolved review threads on horovod/common/common.h and horovod/common/operations.cc)
@alsrgv
Member

alsrgv commented Apr 3, 2019

@AlekseyMarchuk, thanks for the updates! Could you sign the DCO and add MLSL tests to .travis.yml?

@AlekseyMarchuk

@alsrgv I've added MLSL tests to the new CI, but the build fails with a strange error, /bin/bash: pip: command not found, even though pip was used earlier. Could you please take a look at the script?

@alsrgv
Member

alsrgv commented Apr 8, 2019

@AlekseyMarchuk, not sure - is it possible that the MLSL installation overwrites the contents of /usr/local? Can you try manually running the commands from Dockerfile.test.cpu until you discover the issue?

@alsrgv
Member

alsrgv commented Apr 9, 2019

@AlekseyMarchuk, can you rebase on the latest master? We've fixed the build break you're seeing.

@AlekseyMarchuk

@alsrgv sure, one moment. You were right - the MLSL installation script overwrites the destination directory. I completely forgot about that.

@AlekseyMarchuk

The MLSL build now fails with:

ERROR: denied: The repository with name 'buildkite' in registry with id '823773083436' already has the maximum allowed number of images which is '1000'

Exited with 1

It seems to be a CI issue.

@alsrgv
Member

alsrgv commented Apr 9, 2019

@AlekseyMarchuk, sorry about that - fixed the CI.

@alsrgv
Member

alsrgv commented Apr 10, 2019

@AlekseyMarchuk, we made another CI fix yesterday, could you rebase and push?

@AlekseyMarchuk

@alsrgv I rebased, but several CI tests failed with the following error stack:

Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(474)...:
MPID_Init(190)..........: channel initialization failed
MPIDI_CH3_Init(89)......:
MPID_nem_init(413)......:
MPIDI_nem_ckpt_init(170): BLCR kernel module not present
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(474)...:
MPID_Init(190)..........: channel initialization failed
MPIDI_CH3_Init(89)......:
MPID_nem_init(413)......:
MPIDI_nem_ckpt_init(170): BLCR kernel module not present

Have you seen this issue before?

@alsrgv
Member

alsrgv commented Apr 11, 2019

@AlekseyMarchuk, no - we don't use MPICH internally, and I don't have experience with this beyond the other unit tests, which run successfully. Does MLSL work with Open MPI?

@shirosankaku
Contributor Author

@alsrgv Sorry for the delay. The latest fixes have been applied.

@@ -854,7 +870,63 @@ void CoordinateCacheAndState(CacheCoordinator& cache_coordinator,
// otherwise we may end up dispatching many blocked threads and never make
// progress if we have a thread pool limit.
bool RunLoopOnce(HorovodGlobalState& state, MPIContext& ctx, bool is_coordinator);

#if HAVE_MLSL
Collaborator

Can we move this into MLSLContext? One of our goals with operations.cc is to minimize the amount of "framework specific" code in here. Ideally, we want to isolate framework code within specific places like mlsl_context.h or mlsl_operations.h.

Contributor Author

Moved to MLSLContext

void BackgroundThreadLoop(HorovodGlobalState& state, MPIContext& ctx) {
#if HAVE_MLSL
Collaborator

I feel that this adds a lot of complication to the BackgroundThreadLoop. I'm wondering if some of this can be pulled out into MLSLContext in some way? Background thread initialization is something that will ultimately need to be abstracted between MLSL, MPI, Gloo, etc.

Contributor Author

Moved init and finalize logic to MLSLContext

@shirosankaku shirosankaku force-pushed the mlsl_op branch 2 times, most recently from 958b2cc to c6377c3 on May 21, 2019 15:37
@@ -1581,6 +1603,14 @@ void InitializeHorovodOnce(const int* ranks, int nranks) {
while (!horovod_global.initialization_done) {
std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

#if HAVE_MLSL
LOG(DEBUG) << "BG-thread init done";
Collaborator

Could this be useful information for non-MLSL scenarios as well? Looks generic enough and could potentially provide insight into certain types of failures. Thoughts, @alsrgv?

Member

Yeah, I'm fine with removing the #if around this log message.

Contributor Author

Done

@shirosankaku
Contributor Author

@alsrgv Hello! The last build with our changes was successful, but this time PyTorch-related tests failed due to errors like:
ImportError: /usr/local/lib/python3.5/dist-packages/torchvision/_C.cpython-35m-x86_64-linux-gnu.so: undefined symbol: _ZN6caffe26detail36_typeMetaDataInstance_preallocated_7E

Could you please have a look?

@alsrgv
Member

alsrgv commented May 26, 2019

@shirosankaku, could you rebase on latest master? It fixes recent torchvision compatibility issues.

@shirosankaku
Contributor Author

@alsrgv Done.

@alsrgv
Member

alsrgv commented May 27, 2019

Thanks! @tgaddair, could you do another pass?

@@ -75,7 +77,11 @@ def test_horovod_allreduce_cpu(self):
"""Test on CPU that the allreduce correctly sums 1D, 2D, 3D tensors."""
hvd.init()
size = hvd.size()
dtypes = [tf.int32, tf.int64, tf.float16, tf.float32, tf.float64]
# MLSL supports only byte, float and double data types
Collaborator

Do we need some sort of runtime check that throws an exception if the user requests unsupported compression with MLSL?

Contributor Author

@shirosankaku shirosankaku May 28, 2019

Actually, we have that kind of check; only Allreduce needs it.
The check is in the GetMLSLDataType function in mlsl_operations.cc, which is called when MLSL_Allreduce is executed. The reason for the checks in the test_*.py files is to keep the tests from failing on types that the MLSL_Allreduce operation does not support (that sort of exception is not expected by the tests, so they would fail).
Or would you prefer seeing this check in the ConstructResponse function in operations.cc?
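For illustration, here is a minimal Python sketch of the kind of runtime guard described above. The function and set names are invented for this sketch, and the supported-type list is taken from the "byte, float and double" comment in the tests; the real check lives in GetMLSLDataType in mlsl_operations.cc.

```python
# Hypothetical sketch of a runtime dtype guard like the one described above.
# The names and the exact supported-type list are assumptions for illustration.
MLSL_SUPPORTED_DTYPES = {'uint8', 'float32', 'float64'}

def check_mlsl_dtype(dtype_name):
    """Raise a ValueError if MLSL does not support the requested dtype."""
    if dtype_name not in MLSL_SUPPORTED_DTYPES:
        raise ValueError(
            'MLSL does not support dtype %s; supported types: %s'
            % (dtype_name, sorted(MLSL_SUPPORTED_DTYPES)))
    return dtype_name
```

Raising early like this surfaces a clear error to the user instead of an unexpected failure deep inside the allreduce call.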

@@ -1587,6 +1609,13 @@ void InitializeHorovodOnce(const int* ranks, int nranks) {
while (!horovod_global.initialization_done) {
std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

LOG(DEBUG) << "Background thread init done";
#if HAVE_MLSL
Collaborator

Can you leave a comment explaining what this signals in the context of MLSL, and why we only log it when using MLSL?

Contributor Author

I guess it's better to remove it.

setup.py
@@ -35,6 +35,11 @@
torch_mpi_lib_v2 = Extension('horovod.torch.mpi_lib_v2', [])
mxnet_mpi_lib = Extension('horovod.mxnet.mpi_lib', [])

have_mlsl = False
mlsl_root = os.environ.get('MLSL_ROOT')
if mlsl_root is not None:
Collaborator

Nit: Simplify to have_mlsl = mlsl_root is not None
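A sketch of the simplified detection this nit suggests, assuming the same MLSL_ROOT environment variable used in the diff above:

```python
import os

# Detect MLSL by the presence of the MLSL_ROOT environment variable,
# collapsing the if-statement into a single boolean expression.
mlsl_root = os.environ.get('MLSL_ROOT')
have_mlsl = mlsl_root is not None
```

This keeps mlsl_root available for later build steps while deriving the boolean flag in one line.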

Contributor Author

Thank you. Done

@@ -44,7 +44,10 @@ def test_timeline(self):
hvd.init()

# Perform a simple allreduce operation
hvd.allreduce(torch.tensor([1, 2, 3]), name='test_allreduce')
if 'MLSL_ROOT' in os.environ:
Collaborator

Can we just remove the else statement and always test against float32? That way we don't need to add additional coupling with MLSL here.

Contributor Author

Done

if _fp16_supported:
dtypes += [torch.HalfTensor]
# MLSL supports only byte, float and double data types
if 'MLSL_ROOT' in os.environ:
Collaborator

Looks like almost every test has been updated with this check. I think we can simplify this and make it more extensible to other frameworks with data type restrictions by having a single utility function that filters the list of dtypes down to only the ones supported by the current framework.

For example:

MLSL_TYPES = set([torch.FloatTensor, torch.DoubleTensor])

def filter_supported_types(dtypes):
    if 'MLSL_ROOT' in os.environ:
        dtypes = [dtype for dtype in dtypes if dtype in MLSL_TYPES]
    return dtypes

...
dtypes = filter_supported_types([torch.IntTensor, torch.LongTensor,
                                 torch.FloatTensor, torch.DoubleTensor])

Contributor Author

Thank you. Done

@@ -44,8 +44,11 @@ def test_horovod_allreduce(self):
"""Test that the allreduce correctly sums 1D, 2D, 3D tensors."""
hvd.init()
size = hvd.size()
dtypes = ['int32', 'int64',
'float32', 'float64']
if 'MLSL_ROOT' in os.environ:
Collaborator

Please use the same filter_supported_types pattern here for consistency.

Contributor Author

Done

Collaborator

@tgaddair tgaddair left a comment

LGTM! Nice work.

Member

@alsrgv alsrgv left a comment

LGTM, one minor comment.

.travis.yml
@@ -0,0 +1,180 @@
dist: trusty
Member

Can you remove this file? We have removed Travis CI integration in favor of Buildkite.

@alsrgv
Member

alsrgv commented May 30, 2019

Also, it looks like the most recent commits caused a merge conflict with test_torch; could you resolve it?

@alsrgv
Member

alsrgv commented May 30, 2019

@shirosankaku, can you rebase on master instead of merging? The PR now contains a lot of recent changes (e.g. the whole .md -> .rst migration), which makes it hard to review.

Signed-off-by: Yana Shchyokotova <yana.shchyokotova@intel.com>
@shirosankaku
Contributor Author

We've squashed all the commits into one.

@alsrgv
Member

alsrgv commented May 30, 2019

Perfect, thanks!

@alsrgv alsrgv merged commit 7d9d053 into horovod:master May 30, 2019
jeffdaily pushed a commit to ROCm/horovod that referenced this pull request Nov 27, 2019
Signed-off-by: Yana Shchyokotova <yana.shchyokotova@intel.com>