
Fix hvd.barrier() tensor queue management and torch test failures from op name mismatches #3300

Merged · 4 commits into horovod:master on Dec 9, 2021

Conversation

@maxhgerlach (Collaborator) commented Dec 3, 2021

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

1) Fix PyTorch unit test failures from op name mismatches

When a Horovod operation is not given an explicit name in PyTorch code, it receives an automatically generated name that encodes an internal handle number. That handle, however, is a rank-local concept: it can differ between Horovod processes if they have enqueued different numbers of operations. Name mismatches then lead to stalls.

This situation was occasionally triggered in the PyTorch unit tests when running with MPI. For instance, I would see the following:

test_torch.py::TorchTests::test_horovod_broadcast_error 
[horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Missing ranks:
0: [broadcast.noname.3259]
1: [broadcast.duplicate_name, broadcast.noname.3260]

Here we see the influence of the previously executed test function: rank 1 had enqueued one broadcast.duplicate_name op more than rank 0. In this test the autogenerated names therefore don't match up, because the handles 3259 and 3260 differ, and the result is a stall.
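As a hedged illustration (not code from this PR): passing an explicit name sidesteps the handle-based autogenerated names entirely, so matching ops on different ranks always agree on the name.

```python
import torch
import horovod.torch as hvd

hvd.init()
t = torch.zeros(10)

# Without a name, Horovod generates something like "broadcast.noname.<handle>".
# The handle counter is local to each process, so the generated name can
# diverge between ranks that have enqueued different numbers of ops.
handle = hvd.broadcast_async(t, root_rank=0)
hvd.synchronize(handle)

# With an explicit name, the ops match up on every rank regardless of how many
# ops each rank has enqueued before.
handle = hvd.broadcast_async(t, root_rank=0, name='my_broadcast')
hvd.synchronize(handle)
```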

Even without lingering duplicate_name ops, such a stall could be triggered. For example, I saw errors like the following in the CI logs for PR #3199, which were also caused by preceding tests:

test_torch.py::TorchTests::test_horovod_reducescatter 
[horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
Missing ranks:
0: [reducescatter.noname.3236]
1: [reducescatter.noname.3234]

(When running with Gloo instead of MPI, Horovod is shut down and reinitialized between test functions, so we don't see these cross-effects between separate tests as often.)

To fix this, I have rewritten the offending tests: those that test duplicate-name error handling and those that test hvd.join(). The reformulated tests ensure that each rank always enqueues the same number of Horovod ops, which should prevent stalls in subsequent tests; a sketch of the pattern follows below.

It remains an open question whether a better mechanism could be found that avoids these name mismatches in general.
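For illustration, the rewritten tests follow roughly this pattern (a simplified sketch, not the actual test code added in this PR; the concrete exception type may vary between Horovod versions):

```python
import torch
import horovod.torch as hvd

hvd.init()
tensor = torch.zeros(10)

# Every rank runs the identical sequence, so the rank-local handle counters
# stay in sync even though the second submission is expected to fail.
handle = hvd.allreduce_async(tensor, name='duplicate_name')

raised = False
try:
    # Submitting a second op under the same name while the first is still
    # pending should raise the duplicate-name error on every rank alike.
    hvd.allreduce_async(tensor, name='duplicate_name')
except Exception:
    raised = True
assert raised, 'expected a duplicate-name error'

hvd.synchronize(handle)

# A final, explicitly named collective realigns all ranks before the next test.
hvd.allreduce(torch.zeros(1), name='sync_after_duplicate_name_test')
```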

2) Fix hvd.barrier() tensor queue management

The rewritten test_horovod_allreduce_duplicate_name_error etc. now employ an hvd.barrier() call, and that unveiled another bug.

If a barrier was enqueued together with an allreduce, the barrier request would erroneously be pushed into the tensor queue again and again on every cycle, even after the barrier had completed. Additionally, this could cause a mismatch in GetTensorEntriesFromResponse, where a tensor from the response would not be present in the tensor table. That mismatch caused segmentation faults in the test_horovod_join_allgather unit test, even though the faulty state had been set up earlier by an hvd.barrier() in the test_horovod_allreduce_duplicate_name_error unit test.
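For context, a minimal sketch of the kind of pattern that triggers this (illustrative only, not a verified reproducer):

```python
import torch
import horovod.torch as hvd

hvd.init()

# Put an allreduce request in flight so that a tensor op and the barrier end
# up in the tensor queue during the same background cycle.
handle = hvd.allreduce_async(torch.ones(1000), name='inflight_allreduce')

# Before this fix, the barrier request could be re-enqueued on every cycle
# even after it had completed, and the stale entry could later produce a
# response referencing a tensor that is no longer in the tensor table.
hvd.barrier()

hvd.synchronize(handle)
```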

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

…ank is guaranteed to enqueue the same number of Horovod ops.

This is meant to prevent later tests from stalling due to op name mismatches
when these names are automatically generated from a handle integer. These
handles can be different on different ranks if they did not enqueue the same
number of Horovod ops.

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
If a barrier was enqueued together with an allreduce, the barrier request
would erroneously be pushed again and again into the tensor queue each
cycle even after the barrier was completed. Additionally, this could
cause a mismatch in GetTensorEntriesFromResponse where a tensor from
the response would not be present in the tensor table. This caused
segmentation faults in the test_horovod_join_allgather unit test
although the situation was established earlier by a hvd.barrier() in the
test_horovod_allreduce_duplicate_name_error unit test.

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
@github-actions (bot) commented Dec 4, 2021

Unit Test Results

826 files (+28) · 826 suites (+28) · 8h 23m 22s ⏱️ (+18m 8s)
716 tests (±0): 672 passed ✔️ (±0), 44 skipped 💤 (±0), 0 failed (±0)
17 834 runs (+662): 12 540 passed ✔️ (+418), 5 294 skipped 💤 (+244), 0 failed (±0)

Results for commit 6bb255d. ± Comparison against base commit 5af1e22.

♻️ This comment has been updated with latest results.

@github-actions (bot) commented Dec 4, 2021

Unit Test Results (with flaky tests)

950 files (+28) · 950 suites (+28) · 9h 35m 6s ⏱️ (+7m 49s)
716 tests (±0): 669 passed ✔️ (-1), 44 skipped 💤 (±0), 3 failed (+1)
20 824 runs (+662): 14 536 passed ✔️ (+402), 6 283 skipped 💤 (+259), 5 failed (+1)

For more details on these failures, see this check.

Results for commit 6bb255d. ± Comparison against base commit 5af1e22.

♻️ This comment has been updated with latest results.

@EnricoMi (Collaborator) commented Dec 4, 2021

@maxhgerlach this should fix the torch 1.10 issues? I will merge this into that branch to test it.

@maxhgerlach (Collaborator, Author):
@EnricoMi, yes it definitely looked like that when I tested locally with PyTorch 1.10.

…_error

Without this, deadlocks in the subsequent test were possible: One process would
already have enqueued a collective op like hvd.broadcast(), while the other
would still block in hvd.init() [specifically in _get_process_set_ids_and_ranks()].

I could not use hvd.barrier() for this second barrier because that would somehow
cause a segmentation fault. Went for an allreduce instead.

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
@maxhgerlach (Collaborator, Author):
Added an additional commit that fixes deadlocks, which were observed on the torch 1.10 branch (see #3291).

Add synchronization barriers to the ends of the test_*_duplicate_name_error

Without this, deadlocks in the subsequent test were possible: One process would
already have enqueued a collective op like hvd.broadcast(), while the other
would still block in hvd.init() [specifically in _get_process_set_ids_and_ranks()].

I could not use hvd.barrier() for this second barrier because that would somehow
cause a segmentation fault. Went for an allreduce instead.

I didn't investigate why a second hvd.barrier() would trigger a segmentation fault there.
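For illustration, the workaround boils down to something like this (a sketch with a hypothetical helper name, not the exact test code; it assumes hvd.init() has already been called):

```python
import torch
import horovod.torch as hvd

def sync_all_ranks(tag):
    # A blocking allreduce on a dummy tensor acts as a barrier: it can only
    # complete once every rank has submitted the matching, identically named op.
    hvd.allreduce(torch.zeros(1), name='barrier_via_allreduce.' + tag)

# Called at the end of the test so that no rank races ahead into the next one.
sync_all_ranks('duplicate_name_error_test')
```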

@Tixxx (Collaborator) commented Dec 6, 2021

Thanks for catching this! I think the second seg fault is caused by calling the TotalByteSizeOfAllgatherOutput function: the tensor pointer for this response is null, and it is not checked before the call. Feel free to merge this in; I will open a separate PR to address the seg fault. I also need to check whether there are other places that don't null-check the tensor pointer, since that could cause a lot of issues for tensor-less ops like join or barrier.

@EnricoMi EnricoMi marked this pull request as draft December 6, 2021 07:37
@EnricoMi (Collaborator) commented Dec 6, 2021

I have marked this as a draft because #3291 is still failing in some cases.

@maxhgerlach (Collaborator, Author) commented Dec 6, 2021

Thanks for checking this out and looking into that other segfault, @Tixxx! The control flow in ComputeResponseList() is not that easy to follow, so I wasn't sure whether I might have misunderstood something.

I agree with Enrico that we should fix the remaining test failure observed with the PyTorch 1.10 CI branch before merging.

…y allreduces

Apparently there is still UB with hvd.barrier() that may trigger
failures under certain conditions. Let's avoid that for now.

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
@maxhgerlach (Collaborator, Author):
With the fixes from this PR's branch, all tests passed on #3291 now, too. PyTorch 1.10 just needed an extra workaround that is independent of the changes here. Understanding and resolving the underlying issue that required that workaround would be material for an additional PR.

@EnricoMi, would you agree to merge this one (#3300) now?

@EnricoMi (Collaborator) left a review comment:

Excellent work, thanks for looking into this! Minor question.

Review comment on horovod/common/controller.cc (resolved).
@EnricoMi (Collaborator) left a review comment:

LGTM!

@maxhgerlach merged commit be3b72d into horovod:master on Dec 9, 2021
@maxhgerlach (Collaborator, Author):
Thanks for looking this over.

I filed issue #3308 to track the remaining segmentation fault with hvd.barrier().

tkhanna1996 pushed a commit to tkhanna1996/horovod that referenced this pull request Dec 16, 2021
…m op name mismatches (horovod#3300)

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
maxhgerlach added a commit to maxhgerlach/horovod that referenced this pull request Dec 17, 2021
…llowing horovod#3300, horovod#3313

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>