Fix root rank output handling bug in MXNet out-of-place broadcast. #1740

Merged: 2 commits into horovod:master on Feb 27, 2020

Conversation

@romerojosh (Collaborator)

Recent testing has shown an issue with the existing MXNet out-of-place broadcast implementation in how it sets the output result on the root rank. The implementation has a race condition that can return a zero tensor instead of the expected output if the result is queried quickly after the hvd.broadcast call. For instance, in this example:

    import numpy as np
    from mxnet import nd
    import horovod.mxnet as hvd  # hvd.init() is assumed to have been called
    tensor = nd.full((1), 42, dtype=np.int32)
    output = hvd.broadcast(tensor=tensor, root_rank=0)
    result = output[0].asscalar()

we see that result on the root rank can sometimes be 0 instead of the expected value 42.

The issue is due to the special handling for the root rank and the call to TensorUtil::Copy here: https://github.com/horovod/horovod/blob/master/horovod/mxnet/mpi_ops.cc#L101

The problem arises because TensorUtil::Copy launches an MXNet op (CopyFromTo) from within the Horovod op to copy the root rank input to the output. This creates a race condition: if output.asscalar() is scheduled on the Python main thread before CopyFromTo is scheduled by the engine worker thread, the read returns the output tensor's contents before the copy has been carried out, yielding an incorrect zero tensor.

This is fixed by moving the input-to-output tensor copy on the root rank into the hvd.broadcast function in Python.
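
For illustration, here is a minimal sketch of that approach (not the actual Horovod source; the function name, the explicit rank argument, and the non-root buffer handling are assumptions made for this example):

    import mxnet as mx

    def broadcast_sketch(tensor, root_rank, my_rank):
        # Root rank: produce the output by copying the input from the Python
        # frontend. The copy registers a write dependency on the returned
        # NDArray before control returns, so a later read such as asscalar()
        # waits for the copy to complete instead of racing with it.
        if my_rank == root_rank:
            return tensor.copy()
        # Non-root ranks: allocate an output buffer that the underlying
        # out-of-place Horovod broadcast op would fill in (placeholder here).
        output = mx.nd.zeros_like(tensor)
        # ... enqueue the Horovod broadcast op to write into `output` ...
        return output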

cc @ptrendx

Signed-off-by: Josh Romero <joshr@nvidia.com>
@romerojosh (Collaborator, Author)

@apeforest @eric-haibin-lin Can you please take a look at this PR?

@apeforest (Member) left a comment


Could you add a unit test?

Signed-off-by: Josh Romero <joshr@nvidia.com>
@romerojosh (Collaborator, Author)

@apeforest Since this is a race condition, it is difficult to catch with a unit test. In my testing, I was not able to trigger the erroneous behavior on a 2-GPU system, even with many repeated runs. The issue was only observed, with varying frequency, on systems with 8 or more GPUs.

@apeforest (Member)

@romerojosh This change looks good to me. However, looking at the broadcast_parameters() API, it actually calls the in-place broadcast_(), so I am curious how you encountered such a problem during model training.

@romerojosh (Collaborator, Author)

@apeforest We ran into this issue outside of training, when broadcasting a random seed using the out-of-place implementation. This was not an issue on GPUs before NCCL broadcast support was added, since the DoHorovodOperationCudaOnCPU path used previously did not have the race condition.
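
For context, that kind of out-of-place usage looks roughly like the following (illustrative only; the seed-handling code is not part of this PR, and the op name "seed" is an arbitrary choice):

    import numpy as np
    import mxnet as mx
    import horovod.mxnet as hvd

    hvd.init()

    # Rank 0 draws a random seed; all ranks obtain it via the out-of-place
    # hvd.broadcast (i.e. outside of broadcast_parameters / broadcast_).
    seed_tensor = mx.nd.array([np.random.randint(0, 2**31 - 1)], dtype='int32')
    seed = hvd.broadcast(seed_tensor, root_rank=0, name="seed").asscalar()
    mx.random.seed(int(seed))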

@apeforest (Member) left a comment


LGTM

@romerojosh merged commit ff74540 into horovod:master on Feb 27, 2020