MLSL operation implementation #908
Conversation
Force-pushed from 8324054 to eadb1c4
@tgaddair, could you review for alignment with the ops.cc refactor?
@shirosankaku, could you rebase? We've recently fixed CI; it would be great to see a green build.
Apologies for the delay. Added a first pass of comments. Could you add a test to .travis.yml that would reinstall Horovod with MLSL support and re-run some of the tests? How long does the MLSL build/installation take?
@AlekseyMarchuk, thanks for the updates! Could you sign the DCO and add MLSL tests to the new CI?
Force-pushed from 8b568b9 to f40a59d
@alsrgv I've added MLSL tests to the new CI, but the build fails with a strange error.
@AlekseyMarchuk, not sure - is it possible that the MLSL installation overwrites the contents of the destination directory?
@AlekseyMarchuk, can you rebase on latest master? We've made a fix to the build break you're seeing.
@alsrgv sure, one moment. You were right: the MLSL installation script overwrites the destination directory. I completely forgot about that.
The MLSL build now fails; it seems to be a CI issue.
@AlekseyMarchuk, sorry about that - fixed the CI.
Force-pushed from 2ea5d39 to 2f917bb
@AlekseyMarchuk, we made another CI fix yesterday, could you rebase and push?
@alsrgv I did rebase, but several CI tests failed with the following call stack:
Have you seen this issue before?
@AlekseyMarchuk, no - we don't use MPICH internally, and I don't have insight beyond the other unit tests, which run successfully. Does MLSL work with Open MPI?
Force-pushed from 21f765f to 5e946b5
@alsrgv Sorry for the delay. The latest fixes have been implemented.
horovod/common/operations.cc (Outdated)
@@ -854,7 +870,63 @@ void CoordinateCacheAndState(CacheCoordinator& cache_coordinator,
// otherwise we may end up dispatching many blocked threads and never make
// progress if we have a thread pool limit.
bool RunLoopOnce(HorovodGlobalState& state, MPIContext& ctx, bool is_coordinator);

#if HAVE_MLSL
Can we move this into MLSLContext? One of our goals with operations.cc is to minimize the amount of "framework specific" code in here. Ideally, we want to isolate framework code within specific places like mlsl_context.h or mlsl_operations.h.
Moved to MLSLContext
horovod/common/operations.cc (Outdated)
void BackgroundThreadLoop(HorovodGlobalState& state, MPIContext& ctx) {
#if HAVE_MLSL
I feel that this adds a lot of complication to BackgroundThreadLoop. I'm wondering if some of this can be pulled out into MLSLContext in some way? Background thread initialization is something that will ultimately need to be abstracted across MLSL, MPI, Gloo, etc.
Moved init and finalize logic to MLSLContext
Force-pushed from 958b2cc to c6377c3
horovod/common/operations.cc (Outdated)
@@ -1581,6 +1603,14 @@ void InitializeHorovodOnce(const int* ranks, int nranks) {
while (!horovod_global.initialization_done) {
  std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

#if HAVE_MLSL
LOG(DEBUG) << "BG-thread init done";
Could this be useful information for non-MLSL scenarios as well? Looks generic enough and could potentially provide insight into certain types of failures. Thoughts, @alsrgv?
Yeah, I'm fine with removing the #if around this log message.
Done
@alsrgv Hello! The last build with our changes was successful, but this time it shows that PyTorch-related tests failed due to errors. Could you please have a look?
@shirosankaku, could you rebase on latest master? It fixes recent torchvision compatibility issues.
@alsrgv Done.
Thanks! @tgaddair, could you do another pass?
test/test_tensorflow.py (Outdated)
@@ -75,7 +77,11 @@ def test_horovod_allreduce_cpu(self):
"""Test on CPU that the allreduce correctly sums 1D, 2D, 3D tensors."""
hvd.init()
size = hvd.size()
dtypes = [tf.int32, tf.int64, tf.float16, tf.float32, tf.float64]
# MLSL supports only byte, float and double data types
Do we need some sort of runtime check to throw an exception if the user requests an unsupported compression with MLSL?
Actually, we have that kind of check; only Allreduce needs it.
The check is in the mlsl_operations.cc file, in the GetMLSLDataType function, which is called when MLSL_Allreduce is executed. The reason for these checks in the test_*.py files is to keep the tests from failing because of the types that are not supported by the MLSL_Allreduce operation (that sort of exception is not a planned one, so the test would fail).
Or would you prefer to see this check in operations.cc, in the ConstructResponse function?
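A minimal Python sketch of the kind of runtime check described above (the actual check lives in GetMLSLDataType in mlsl_operations.cc and is written in C++; the function name `check_mlsl_dtype` and the string dtype names here are illustrative assumptions, not Horovod API):

```python
import os

# Per the comment above, MLSL_Allreduce supports only byte, float, and double.
MLSL_SUPPORTED_DTYPES = {"uint8", "float32", "float64"}

def check_mlsl_dtype(dtype_name):
    """Raise if MLSL is enabled and the dtype is unsupported (illustrative helper)."""
    if "MLSL_ROOT" in os.environ and dtype_name not in MLSL_SUPPORTED_DTYPES:
        raise ValueError("dtype %r is not supported by MLSL_Allreduce" % dtype_name)
    return dtype_name
```

This mirrors the trade-off discussed: failing fast at dispatch time rather than letting an unplanned exception surface inside a test.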
horovod/common/operations.cc (Outdated)
@@ -1587,6 +1609,13 @@ void InitializeHorovodOnce(const int* ranks, int nranks) {
while (!horovod_global.initialization_done) {
  std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

LOG(DEBUG) << "Background thread init done";
#if HAVE_MLSL
Can you leave a comment explaining what this signals in the context of MLSL, and why we only log it when using MLSL?
I guess it's better to remove it.
setup.py (Outdated)
@@ -35,6 +35,11 @@
torch_mpi_lib_v2 = Extension('horovod.torch.mpi_lib_v2', [])
mxnet_mpi_lib = Extension('horovod.mxnet.mpi_lib', [])

have_mlsl = False
mlsl_root = os.environ.get('MLSL_ROOT')
if mlsl_root is not None:
Nit: Simplify to have_mlsl = mlsl_root is not None
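The suggested simplification as a standalone sketch: setup.py decides whether to build with MLSL support purely from the MLSL_ROOT environment variable. The helper name `detect_mlsl` is hypothetical (the real setup.py does this inline); it is wrapped in a function here only so the logic is testable:

```python
import os

def detect_mlsl(environ=os.environ):
    # Nit applied: have_mlsl is True iff MLSL_ROOT is set in the environment,
    # replacing the three-line have_mlsl = False / get / if pattern above.
    mlsl_root = environ.get('MLSL_ROOT')
    return mlsl_root is not None
```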
Thank you. Done
test/test_timeline.py (Outdated)
@@ -44,7 +44,10 @@ def test_timeline(self):
hvd.init()

# Perform a simple allreduce operation
hvd.allreduce(torch.tensor([1, 2, 3]), name='test_allreduce')
if 'MLSL_ROOT' in os.environ:
Can we just remove the else statement and always test against float32? That way we don't need to add additional coupling with MLSL here.
Done
test/test_torch.py (Outdated)
if _fp16_supported:
    dtypes += [torch.HalfTensor]
# MLSL supports only byte, float and double data types
if 'MLSL_ROOT' in os.environ:
Looks like almost every test has been updated with this check. I think we can simplify this and make it more extensible to other frameworks with data type restrictions by having a single utility function that filters the list of dtypes down to only the ones supported by the current framework.
For example:
MLSL_TYPES = set([torch.FloatTensor, torch.DoubleTensor])

def filter_supported_types(dtypes):
    if 'MLSL_ROOT' in os.environ:
        dtypes = [dtype for dtype in dtypes if dtype in MLSL_TYPES]
    return dtypes

...

dtypes = filter_supported_types([torch.IntTensor, torch.LongTensor,
                                 torch.FloatTensor, torch.DoubleTensor])
Thank you. Done
test/test_mxnet.py (Outdated)
@@ -44,8 +44,11 @@ def test_horovod_allreduce(self):
"""Test that the allreduce correctly sums 1D, 2D, 3D tensors."""
hvd.init()
size = hvd.size()
dtypes = ['int32', 'int64',
          'float32', 'float64']
if 'MLSL_ROOT' in os.environ:
Please use the same filter_supported_types pattern here for consistency.
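The same pattern, sketched for MXNet's string dtype names; the contents of MLSL_TYPES are assumed from the "byte, float and double" comment in the diff above:

```python
import os

# MXNet dtypes are plain strings, so the filter compares string names
# rather than torch tensor classes.
MLSL_TYPES = {'float32', 'float64'}

def filter_supported_types(dtypes):
    if 'MLSL_ROOT' in os.environ:
        dtypes = [dtype for dtype in dtypes if dtype in MLSL_TYPES]
    return dtypes
```

When MLSL_ROOT is unset, the list passes through unchanged, so non-MLSL CI runs keep full dtype coverage.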
Done
LGTM! Nice work.
LGTM, one minor comment.
.travis.yml (Outdated)
@@ -0,0 +1,180 @@
dist: trusty
Can you remove this file? We have removed Travis CI integration in favor of Buildkite.
Also, it looks like the most recent commits caused a merge conflict with test_torch; could you resolve it?
@shirosankaku, can you rebase on master instead of merging? The PR now contains a lot of recent changes (e.g. the whole .md -> .rst migration), which makes it hard to review.
Signed-off-by: Yana Shchyokotova <yana.shchyokotova@intel.com>
We've merged all the commits into one.
Perfect, thanks!
Signed-off-by: Yana Shchyokotova <yana.shchyokotova@intel.com>
This PR adds integration with the Intel(R) MLSL library at the operation level.