Segmentation fault with hvd.barrier #3308

Closed
maxhgerlach opened this issue Dec 9, 2021 · 1 comment · Fixed by #3313

@maxhgerlach (Collaborator)
Environment:

  1. Framework: PyTorch
  2. Framework version: 1.10
  3. Horovod version: master

Bug report:

The following variation of TorchTests::test_horovod_broadcast_duplicate_name_error has been observed to trigger segmentation faults with PyTorch 1.10 and Open MPI (see this comment in #3300):

    def test_horovod_broadcast_duplicate_name_error(self):
        """Test that the broadcast raises an error if there are
        two concurrent operations with the same name."""
        hvd.init()
        size = hvd.size()
        rank = hvd.rank()

        # This test does not apply if there is only one worker.
        if size == 1:
            self.skipTest("Only one worker available")

        dims = [17] * 3
        tensor = torch.FloatTensor(*dims)

        if rank == 0:
            hvd.broadcast_async(tensor, name='duplicate_name', root_rank=0)
            try:
                hvd.broadcast_async(tensor, name='duplicate_name', root_rank=0)
                assert False, 'hvd.broadcast_async did not throw error'
            except (torch.FatalError, ValueError):
                pass
        hvd.barrier()
        ## Workaround:
        # hvd.allreduce(torch.FloatTensor([1]), name="synch1")
        if rank > 0:
            hvd.broadcast_async(tensor, name='duplicate_name', root_rank=0)
            try:
                hvd.broadcast_async(tensor, name='duplicate_name', root_rank=0)
                assert False, 'hvd.broadcast_async did not throw error'
            except (torch.FatalError, ValueError):
                pass
        hvd.barrier()
        ## Workaround:
        # hvd.allreduce(torch.FloatTensor([2]), name="synch2")

The same applies to the equivalent tests for the other collective ops. As a workaround, we use allreduce instead of barrier for now.

When only the second barrier was replaced by an allreduce but the first was kept, we still observed some test failures. Maybe those were caused by undefined behavior?

@Tixxx (Collaborator) commented Dec 9, 2021

When tensor-less ops like join or barrier are passed into the fusion check, the checker uses the tensor pointer in the response to determine whether an entry fits. Since those responses don't carry a tensor, a segmentation fault occurs because a field is accessed through a nullptr, e.g. in Controller::TotalByteSizeOfAllgatherOutput:

    for (int i = 1; i < entry.tensor->shape().dims(); ++i) {
      total_count_of_output_entries *= entry.tensor->shape().dim_size(i);
    }

I'm looking into this.
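
For illustration, here is a minimal, self-contained sketch of the kind of null check that would avoid this crash. The TensorShape, Tensor, and TensorTableEntry types below are simplified stand-ins for Horovod's, and this is not necessarily the approach taken in #3313:

    #include <memory>
    #include <vector>

    // Simplified stand-ins for Horovod's tensor types, just enough to
    // illustrate the nullptr hazard described above.
    struct TensorShape {
      std::vector<long> dims_;
      int dims() const { return static_cast<int>(dims_.size()); }
      long dim_size(int i) const { return dims_[i]; }
    };

    struct Tensor {
      TensorShape shape_;
      const TensorShape& shape() const { return shape_; }
    };

    struct TensorTableEntry {
      std::shared_ptr<Tensor> tensor;  // null for tensor-less ops like Join/Barrier
    };

    long OutputEntryCount(const TensorTableEntry& entry) {
      // Guard against tensor-less entries: dereferencing entry.tensor
      // without this check is what produces the segmentation fault.
      if (!entry.tensor) {
        return 0;
      }
      long count = 1;
      for (int i = 1; i < entry.tensor->shape().dims(); ++i) {
        count *= entry.tensor->shape().dim_size(i);
      }
      return count;
    }

An equivalent guard in the size-accounting path would let Barrier and Join responses skip the tensor-size computation entirely instead of dereferencing a null tensor pointer.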
