Fix hvd.barrier() tensor queue management and torch test failures from op name mismatches #3300
Conversation
…ank is guaranteed to enqueue the same number of Horovod ops. This is meant to prevent later tests from stalling due to op name mismatches when these names are automatically generated from a handle integer. These handles can be different on different ranks if they did not enqueue the same number of Horovod ops. Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
If a barrier was enqueued together with an allreduce, the barrier request would erroneously be pushed again and again into the tensor queue each cycle even after the barrier was completed. Additionally, this could cause a mismatch in GetTensorEntriesFromResponse where a tensor from the response would not be present in the tensor table. This caused segmentation faults in the test_horovod_join_allgather unit test although the situation was established earlier by a hvd.barrier() in the test_horovod_allreduce_duplicate_name_error unit test. Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Unit Test Results (with flaky tests): 950 files (+28), 950 suites (+28), 9h 35m 6s ⏱️ (+7m 49s). For more details on these failures, see this check. Results for commit 6bb255d. ± Comparison against base commit 5af1e22. ♻️ This comment has been updated with latest results.
@maxhgerlach this should fix the torch 1.10 issues? I will merge this into that branch to test it.
@EnricoMi, yes it definitely looked like that when I tested locally with PyTorch 1.10.
…_error Without this, deadlocks in the subsequent test were possible: One process would already have enqueued a collective op like hvd.broadcast(), while the other would still block in hvd.init() [specifically in _get_process_set_ids_and_ranks()]. I could not use hvd.barrier() for this second barrier because that would somehow cause a segmentation fault. Went for an allreduce instead. Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
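A minimal sketch of the allreduce-as-barrier workaround described in this commit (illustrative only; the tensor and name are assumed, not taken from the actual test code):

```python
import torch
import horovod.torch as hvd

hvd.init()

# A trivial named allreduce forces all ranks to rendezvous, serving as a
# synchronization point in place of hvd.barrier(), which caused a
# segmentation fault in this particular context.
hvd.allreduce(torch.tensor(0.0), name="barrier_substitute_after_init")
```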
Added an additional commit that fixes deadlocks, which were observed on the torch 1.10 branch (see #3291).
I didn't investigate why a second `hvd.barrier()` would cause a segmentation fault there.
Thanks for catching this! I think the second seg fault is caused by calling the TotalByteSizeOfAllgatherOutput function: the tensor pointer for this response is null, which is not checked before the call. Feel free to merge this in; I will open a separate PR to address the seg fault. I also need to check whether there are other places that don't null-check the tensor pointer, since that could cause a lot of issues for tensor-less ops like join or barrier.
I have marked this as a draft since #3291 is still failing in some cases.
Thanks for checking this out and looking into that other segfault, @Tixxx! I agree with Enrico that we should fix the remaining test failure observed with the PyTorch 1.10 CI branch before merging.
…y allreduces Apparently there is still UB with hvd.barrier() that may trigger failures under certain conditions. Let's avoid that for now. Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
With the fixes from this PR's branch, all tests passed on #3291 now, too. PyTorch 1.10 just needed an extra workaround that's independent of the changes here. Understanding and resolving the underlying issue that required that workaround would be material for an additional PR.
Excellent work, thanks for looking into this! Minor question.
LGTM!
Thanks for looking this over. I wrote an issue about the remaining segmentation fault with `hvd.barrier()`.
…m op name mismatches (horovod#3300) Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
…llowing horovod#3300, horovod#3313 Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Checklist before submitting
Description
1) Fix PyTorch unit test failures from op name mismatches
When a Horovod operation is not given an explicit name in PyTorch code, it receives an automatically generated name that encodes an internal handle number. That handle, however, is a local concept and it can differ between Horovod processes if they enqueue different numbers of operations. Name mismatches then lead to stalls.
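For illustration, here is a minimal sketch (not from the test suite; tensor shapes and names are made up) of how an explicit name sidesteps the handle-based auto-naming:

```python
import torch
import horovod.torch as hvd

hvd.init()
x = torch.ones(4)

# Without an explicit name, Horovod derives the op name from a local handle
# counter, which can diverge between ranks that have enqueued different
# numbers of ops.
hvd.allreduce_(x)

# With an explicit name, all ranks agree on the op name regardless of how
# many ops each of them has enqueued before.
hvd.allreduce_(x, name="grad.layer1.weight")
```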
This situation was occasionally triggered in the PyTorch unit tests when running with MPI. For instance, I would see the following:
Here we see the influence of the previously executed test function: rank 1 had enqueued one `broadcast.duplicate_name` op more than rank 0. Then in this test the autogenerated names don't match up because the handles 3259 and 3260 are different, causing a stall.

Even without lingering `duplicate_name` ops such a stall could be triggered. For example, I saw errors like the following in the CI logs for PR #3199, which were also caused by previous tests:
(When running with Gloo instead of MPI, Horovod is shut down and reinitialized between each test function, so we don't see these cross effects between separate tests as often).
To fix this situation I have rewritten the offending tests: those that test duplicate name error handling and those that test `hvd.join()`. In the reformulated tests it is ensured that each rank always enqueues the same number of Horovod ops. This should prevent stalls in subsequent tests; a hedged sketch of the pattern is shown below.

It remains an open question whether a better mechanism could be found that generally avoids these name mismatches.
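Roughly, the reformulated tests follow this pattern (a sketch only; names, shapes, and exception handling are illustrative, not copied from the test code):

```python
import torch
import horovod.torch as hvd

hvd.init()
tensor = torch.ones(17)

# Every rank submits the first op under the shared name.
handle = hvd.allreduce_async(tensor, name="duplicate_name")

# Every rank also submits the duplicate, so all ranks enqueue the same
# number of ops; the duplicate submission is expected to be rejected.
raised = False
try:
    hvd.allreduce_async(tensor, name="duplicate_name")
except Exception:  # the exact exception type depends on the framework/version
    raised = True
assert raised, "expected a duplicate-name error"

hvd.synchronize(handle)

# Re-align all ranks with a trivial named collective so that the handle
# counters agree before the next test runs.
hvd.allreduce(torch.tensor(0.0), name="sync_after_duplicate_name_test")
```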
2) Fix `hvd.barrier()` tensor queue management

The rewritten `test_horovod_allreduce_duplicate_name_error` etc. now employ a `hvd.barrier()` call, and that unveiled another bug.

If a `barrier` was enqueued together with an `allreduce`, the barrier request would erroneously be pushed again and again into the tensor queue in each cycle, even after the barrier was completed. Additionally, this could cause a mismatch in `GetTensorEntriesFromResponse`, where a tensor from the response would not be present in the tensor table. That mismatch caused segmentation faults in the `test_horovod_join_allgather` unit test, although the situation was established earlier by an `hvd.barrier()` in the `test_horovod_allreduce_duplicate_name_error` unit test.
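For context, here is a Python-level sketch (assumed, not the actual test code) of the kind of op combination that exposed the bug, i.e. a barrier submitted while an allreduce is still pending:

```python
import torch
import horovod.torch as hvd

hvd.init()

# An allreduce that is still in flight when the barrier is enqueued.
handle = hvd.allreduce_async(torch.ones(1024), name="inflight_allreduce")

# Before this fix, the barrier request could be re-pushed into the tensor
# queue on every cycle even after the barrier had completed.
hvd.barrier()

hvd.synchronize(handle)
```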
Review process to land