Fix root rank output handling bug in MXNet out-of-place broadcast. #1740

Merged: 2 commits into horovod:master on Feb 27, 2020

Conversation

@romerojosh (Collaborator)

Recent testing has shown an issue with the existing MXNet out-of-place broadcast implementation in how it sets the output result on the root rank. The implementation has a race condition that can return a zero tensor instead of the expected output if the result is queried quickly after the hvd.broadcast call. For instance, in this example:

    import numpy as np
    from mxnet import nd
    import horovod.mxnet as hvd  # hvd.init() is assumed to have been called
    tensor = nd.full((1), 42, dtype=np.int32)
    output = hvd.broadcast(tensor=tensor, root_rank=0)
    result = output[0].asscalar()

we see that result on the root rank can sometimes be 0 instead of the expected value 42.

The issue is due to the special handling for the root rank and the call to TensorUtil::Copy here: https://github.com/horovod/horovod/blob/master/horovod/mxnet/mpi_ops.cc#L101

The problem arises because TensorUtil::Copy launches an MXNet op (CopyFromTo) from within the Horovod op to copy the root rank input to the output. This creates a race condition: if output.asscalar() is scheduled on the Python main thread before CopyFromTo is scheduled by the engine worker thread, the read returns the output tensor's contents before the copy has been carried out, yielding an incorrect zero tensor.

This is fixed by moving the input-to-output tensor copy on the root rank into the hvd.broadcast function in Python.
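
For illustration, here is a minimal sketch of that approach (not the actual Horovod source; the function name, the explicit rank argument, and the non-root buffer handling are assumptions made for this example):

    import mxnet as mx

    def broadcast_sketch(tensor, root_rank, my_rank):
        # Root rank: produce the output by copying the input from the Python
        # frontend. The copy registers a write dependency on the returned
        # NDArray before control returns, so a later read such as asscalar()
        # waits for the copy to complete instead of racing with it.
        if my_rank == root_rank:
            return tensor.copy()
        # Non-root ranks: allocate an output buffer that the underlying
        # out-of-place Horovod broadcast op would fill in (placeholder here).
        output = mx.nd.zeros_like(tensor)
        # ... enqueue the Horovod broadcast op to write into `output` ...
        return output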

cc @ptrendx

Signed-off-by: Josh Romero <joshr@nvidia.com>
@romerojosh (Collaborator, Author)

@apeforest @eric-haibin-lin Can you please take a look at this PR?

@apeforest (Member) left a comment


Could you add a unit test?

Signed-off-by: Josh Romero <joshr@nvidia.com>
@romerojosh (Collaborator, Author)

@apeforest Since this is a race condition, it is difficult to catch with a unit test. In my testing, I was not able to trigger the erroneous behavior on a 2-GPU system, even with many repeated runs. The issue was only observed, with varying frequency, on systems with 8 or more GPUs.

@apeforest (Member)

@romerojosh This change looks good to me. However, looking at the broadcast_parameters() API, it actually calls the in-place broadcast_(), so I am curious how you encountered such a problem during model training.

@romerojosh (Collaborator, Author)

@apeforest We ran into this issue outside of training, when broadcasting a random seed using the out-of-place implementation. This was not an issue on GPUs before NCCL broadcast support was added, since the DoHorovodOperationCudaOnCPU path used previously did not have the race condition.
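
For context, that kind of out-of-place usage looks roughly like the following (illustrative only; the seed-handling code is not part of this PR, and the op name "seed" is an arbitrary choice):

    import numpy as np
    import mxnet as mx
    import horovod.mxnet as hvd

    hvd.init()

    # Rank 0 draws a random seed; all ranks obtain it via the out-of-place
    # hvd.broadcast (i.e. outside of broadcast_parameters / broadcast_).
    seed_tensor = mx.nd.array([np.random.randint(0, 2**31 - 1)], dtype='int32')
    seed = hvd.broadcast(seed_tensor, root_rank=0, name="seed").asscalar()
    mx.random.seed(int(seed))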

@apeforest (Member) left a comment


LGTM

@romerojosh merged commit ff74540 into horovod:master on Feb 27, 2020