Add synchronization barriers to the ends of the test_*_duplicate_name_error tests

Without this, deadlocks in the subsequent test were possible: one process would
already have enqueued a collective op like hvd.broadcast(), while the other
would still be blocked in hvd.init() [specifically in _get_process_set_ids_and_ranks()].

I could not use hvd.barrier() for this second barrier because that would somehow
cause a segmentation fault, so I went with an allreduce instead.
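
For context, a minimal sketch of the allreduce-based barrier pattern these tests now end with, assuming Horovod's PyTorch API with hvd.init() already called; the helper name allreduce_barrier is illustrative and not part of the commit:

    import torch
    import horovod.torch as hvd

    def allreduce_barrier(name="synch"):
        # The tensor value is irrelevant; the collective simply forces every
        # rank to reach this point before any rank proceeds to the next test,
        # which avoids the hvd.init()/hvd.broadcast() deadlock described above.
        hvd.allreduce(torch.FloatTensor([0]), name=name)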

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
maxhgerlach committed Dec 5, 2021
1 parent 952d17a commit 6593da5
Showing 1 changed file with 3 additions and 0 deletions.
test/parallel/test_torch.py (3 additions, 0 deletions)

@@ -631,6 +631,7 @@ def test_horovod_allreduce_duplicate_name_error(self):
             assert False, 'hvd.allreduce_async did not throw error'
         except (torch.FatalError, ValueError):
             pass
+        hvd.allreduce(torch.FloatTensor([0]), name="synch")

     def test_horovod_allreduce_grad(self):
         """Test the correctness of the allreduce gradient."""

@@ -1246,6 +1247,7 @@ def test_horovod_allgather_duplicate_name_error(self):
             assert False, 'hvd.allgather_async did not throw error'
         except (torch.FatalError, ValueError):
             pass
+        hvd.allreduce(torch.FloatTensor([0]), name="synch")

     def test_horovod_allgather_grad(self):
         """Test the correctness of the allgather gradient."""

@@ -1565,6 +1567,7 @@ def test_horovod_broadcast_duplicate_name_error(self):
             assert False, 'hvd.broadcast_async did not throw error'
         except (torch.FatalError, ValueError):
             pass
+        hvd.allreduce(torch.FloatTensor([0]), name="synch")

     def test_horovod_broadcast_grad(self):
         """Test the correctness of the broadcast gradient."""
