Segmentation fault with hvd.barrier #3308

Closed
maxhgerlach opened this issue Dec 9, 2021 · 1 comment · Fixed by #3313

@maxhgerlach (Collaborator)
Environment:

  1. Framework: PyTorch
  2. Framework version: 1.10
  3. Horovod version: master

Bug report:

The following variation of TorchTests::test_horovod_broadcast_duplicate_name_error has been observed to trigger segmentation faults with PyTorch 1.10 and Open MPI (see this comment in #3300):

    def test_horovod_broadcast_duplicate_name_error(self):
        """Test that the broadcast raises an error if there are
        two concurrent operations with the same name."""
        hvd.init()
        size = hvd.size()
        rank = hvd.rank()

        # This test does not apply if there is only one worker.
        if size == 1:
            self.skipTest("Only one worker available")

        dims = [17] * 3
        tensor = torch.FloatTensor(*dims)

        if rank == 0:
            hvd.broadcast_async(tensor, name='duplicate_name', root_rank=0)
            try:
                hvd.broadcast_async(tensor, name='duplicate_name', root_rank=0)
                assert False, 'hvd.broadcast_async did not throw error'
            except (torch.FatalError, ValueError):
                pass
        hvd.barrier()
        ## Workaround:
        # hvd.allreduce(torch.FloatTensor([1]), name="synch1")
        if rank > 0:
            hvd.broadcast_async(tensor, name='duplicate_name', root_rank=0)
            try:
                hvd.broadcast_async(tensor, name='duplicate_name', root_rank=0)
                assert False, 'hvd.broadcast_async did not throw error'
            except (torch.FatalError, ValueError):
                pass
        hvd.barrier()
        ## Workaround:
        # hvd.allreduce(torch.FloatTensor([2]), name="synch2")

The same applies to the equivalent tests for the other collective ops. As a workaround, we use allreduce instead of barrier for now.

When only the second barrier was replaced by an allreduce but the first was kept, we still observed some test failures. Maybe those were caused by undefined behavior?

@Tixxx (Collaborator) commented Dec 9, 2021

When tensor-less ops like join or barrier are passed into the fusion check, the checker uses the tensor pointer in the response to determine whether an entry fits. Since those responses don't carry a tensor, a segmentation fault occurs because a field is accessed through a nullptr, e.g. in Controller::TotalByteSizeOfAllgatherOutput:

    for (int i = 1; i < entry.tensor->shape().dims(); ++i) {
      total_count_of_output_entries *= entry.tensor->shape().dim_size(i);
    }

I'm looking into this.
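
For illustration, here is a minimal, self-contained sketch of the kind of null check that would avoid this crash. The TensorShape, Tensor, and TensorTableEntry types below are simplified stand-ins for Horovod's, and this is not necessarily the approach taken in #3313:

    #include <memory>
    #include <vector>

    // Simplified stand-ins for Horovod's tensor types, just enough to
    // illustrate the nullptr hazard described above.
    struct TensorShape {
      std::vector<long> dims_;
      int dims() const { return static_cast<int>(dims_.size()); }
      long dim_size(int i) const { return dims_[i]; }
    };

    struct Tensor {
      TensorShape shape_;
      const TensorShape& shape() const { return shape_; }
    };

    struct TensorTableEntry {
      std::shared_ptr<Tensor> tensor;  // null for tensor-less ops like Join/Barrier
    };

    long OutputEntryCount(const TensorTableEntry& entry) {
      // Guard against tensor-less entries: dereferencing entry.tensor
      // without this check is what produces the segmentation fault.
      if (!entry.tensor) {
        return 0;
      }
      long count = 1;
      for (int i = 1; i < entry.tensor->shape().dims(); ++i) {
        count *= entry.tensor->shape().dim_size(i);
      }
      return count;
    }

An equivalent guard in the size-accounting path would let Barrier and Join responses skip the tensor-size computation entirely instead of dereferencing a null tensor pointer.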
