Bug report:
Please describe erroneous behavior you're observing and steps to reproduce it.
The following variation of TorchTests::test_horovod_broadcast_duplicate_name_error has been observed to trigger segmentation faults with PyTorch 1.10 and OpenMPI (see this comment on #3300).
def test_horovod_broadcast_duplicate_name_error(self):
    """Test that the broadcast raises an error if there are
    two concurrent operations with the same name."""
    hvd.init()
    size = hvd.size()
    rank = hvd.rank()

    # This test does not apply if there is only one worker.
    if size == 1:
        self.skipTest("Only one worker available")

    dims = [17] * 3
    tensor = torch.FloatTensor(*dims)

    if rank == 0:
        hvd.broadcast_async(tensor, name='duplicate_name', root_rank=0)
        try:
            hvd.broadcast_async(tensor, name='duplicate_name', root_rank=0)
            assert False, 'hvd.broadcast_async did not throw error'
        except (torch.FatalError, ValueError):
            pass
    hvd.barrier()
    ## Workaround:
    # hvd.allreduce(torch.FloatTensor([1]), name="synch1")

    if rank > 0:
        hvd.broadcast_async(tensor, name='duplicate_name', root_rank=0)
        try:
            hvd.broadcast_async(tensor, name='duplicate_name', root_rank=0)
            assert False, 'hvd.broadcast_async did not throw error'
        except (torch.FatalError, ValueError):
            pass
    hvd.barrier()
    ## Workaround:
    # hvd.allreduce(torch.FloatTensor([2]), name="synch2")
The same applies to the equivalent tests for the other collective ops. As a workaround, we use allreduce instead of barrier for now.
When only the second barrier was replaced by an allreduce but the first was kept, we still observed some test failures. Maybe those were caused by undefined behavior?
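For reference, here is the workaround in isolation: a minimal sketch, assuming a named allreduce on a dummy tensor is an acceptable substitute for the barrier as a synchronization point (the tensor value and name are arbitrary).

import torch
import horovod.torch as hvd

hvd.init()

# Workaround: synchronize the workers with a named dummy allreduce instead of
# hvd.barrier(), which triggers the segfault described above.
# hvd.barrier()
hvd.allreduce(torch.FloatTensor([1]), name="synch1")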
When tensor-less ops like join or barrier are passed into the fusion check, the checker uses the tensor pointer in the response to determine whether the entry fits into the fusion buffer. Since those responses don't have a tensor, a segmentation fault occurs when the checker dereferences a nullptr, e.g. in Controller::TotalByteSizeOfAllgatherOutput:
// entry.tensor is null for tensor-less ops (join, barrier),
// so this dereference is what segfaults.
for (int i = 1; i < entry.tensor->shape().dims(); ++i) {
  total_count_of_output_entries *= entry.tensor->shape().dim_size(i);
}
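A minimal sketch of one possible guard, not Horovod's actual patch: skip entries whose tensor pointer is null before computing sizes. The types below are simplified stand-ins for Horovod's internals, and the function is reduced to the element-count loop from Controller::TotalByteSizeOfAllgatherOutput.

#include <cstdint>
#include <memory>
#include <vector>

// Simplified stand-ins for Horovod's TensorShape / Tensor / TensorTableEntry;
// names and layout are illustrative, not the real API.
struct TensorShape {
  std::vector<int64_t> dims_;
  int dims() const { return static_cast<int>(dims_.size()); }
  int64_t dim_size(int i) const { return dims_[i]; }
};

struct Tensor {
  TensorShape shape_;
  const TensorShape& shape() const { return shape_; }
};

struct TableEntry {
  std::shared_ptr<Tensor> tensor;  // null for tensor-less ops (join, barrier)
};

// Guarded version of the size computation: entries without a tensor are
// skipped instead of dereferenced.
int64_t TotalCountOfOutputEntries(const std::vector<TableEntry>& entries) {
  int64_t total = 0;
  for (const auto& entry : entries) {
    if (entry.tensor == nullptr) {
      continue;  // tensor-less response: nothing to fuse, nothing to count
    }
    int64_t count = 1;
    for (int i = 1; i < entry.tensor->shape().dims(); ++i) {
      count *= entry.tensor->shape().dim_size(i);
    }
    total += count;
  }
  return total;
}

int main() {
  std::vector<TableEntry> entries(2);
  entries[0].tensor = std::make_shared<Tensor>(Tensor{TensorShape{{4, 3, 2}}});
  // entries[1].tensor stays null, as for a barrier/join response.
  return TotalCountOfOutputEntries(entries) == 6 ? 0 : 1;  // 3 * 2 = 6
}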
Environment: PyTorch 1.10 with OpenMPI (as above).