Fix hvd.barrier() tensor queue management and torch test failures from op name mismatches #3300
Conversation
…ank is guaranteed to enqueue the same number of Horovod ops. This is meant to prevent later tests from stalling due to op name mismatches when these names are automatically generated from a handle integer. These handles can be different on different ranks if they did not enqueue the same number of Horovod ops. Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
If a barrier was enqueued together with an allreduce, the barrier request would erroneously be pushed again and again into the tensor queue each cycle even after the barrier was completed. Additionally, this could cause a mismatch in GetTensorEntriesFromResponse where a tensor from the response would not be present in the tensor table. This caused segmentation faults in the test_horovod_join_allgather unit test although the situation was established earlier by a hvd.barrier() in the test_horovod_allreduce_duplicate_name_error unit test. Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Unit Test Results (with flaky tests): 950 files (+28), 950 suites (+28), 9h 35m 6s ⏱️ (+7m 49s). For more details on these failures, see this check. Results for commit 6bb255d. ± Comparison against base commit 5af1e22. ♻️ This comment has been updated with latest results.
@maxhgerlach this should fix the torch 1.10 issues? I will merge this into that branch to test it.
@EnricoMi, yes it definitely looked like that when I tested locally with PyTorch 1.10.
…_error Without this, deadlocks in the subsequent test were possible: One process would already have enqueued a collective op like hvd.broadcast(), while the other would still block in hvd.init() [specifically in _get_process_set_ids_and_ranks()]. I could not use hvd.barrier() for this second barrier because that would somehow cause a segmentation fault. Went for an allreduce instead. Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
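A minimal sketch of the allreduce-as-barrier workaround described in this commit (illustrative only; the tensor and name are assumed, not taken from the actual test code):

```python
import torch
import horovod.torch as hvd

hvd.init()

# A trivial named allreduce forces all ranks to rendezvous, serving as a
# synchronization point in place of hvd.barrier(), which caused a
# segmentation fault in this particular context.
hvd.allreduce(torch.tensor(0.0), name="barrier_substitute_after_init")
```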
Added an additional commit that fixes deadlocks, which were observed on the torch 1.10 branch (see #3291).
I didn't investigate why a second `hvd.barrier()` would cause a segmentation fault there.
Thanks for catching this! I think the second seg fault is caused by calling the TotalByteSizeOfAllgatherOutput function: the tensor pointer for this response is null, which is not checked before the call. Feel free to merge this in; I will open a separate PR to address the seg fault. I also need to check whether there are other places that don't null-check the tensor pointer, since that could cause a lot of issues for tensor-less ops like join or barrier.
I have marked this as a draft since #3291 is still failing in some cases.
Thanks for checking this out and looking into that other segfault, @Tixxx! I agree with Enrico that we should fix the remaining test failure observed with the PyTorch 1.10 CI branch before merging.
…y allreduces Apparently there is still UB with hvd.barrier() that may trigger failures under certain conditions. Let's avoid that for now. Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
With the fixes from this PR's branch, all tests passed on #3291 now, too. PyTorch 1.10 just needed an extra workaround that's independent of the changes here. Understanding and resolving the underlying issue that required that workaround would be material for an additional PR.
Excellent work, thanks for looking into this! Minor question.
LGTM!
Thanks for looking this over. I wrote an issue about the remaining segmentation fault with `hvd.barrier()`.
…m op name mismatches (horovod#3300) Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
…llowing horovod#3300, horovod#3313 Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
Checklist before submitting
Description
1) Fix PyTorch unit test failures from op name mismatches
When a Horovod operation is not given an explicit name in PyTorch code, it receives an automatically generated name that encodes an internal handle number. That handle, however, is a local concept and it can differ between Horovod processes if they enqueue different numbers of operations. Name mismatches then lead to stalls.
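For illustration, here is a minimal sketch (not from the test suite; tensor shapes and names are made up) of how an explicit name sidesteps the handle-based auto-naming:

```python
import torch
import horovod.torch as hvd

hvd.init()
x = torch.ones(4)

# Without an explicit name, Horovod derives the op name from a local handle
# counter, which can diverge between ranks that have enqueued different
# numbers of ops.
hvd.allreduce_(x)

# With an explicit name, all ranks agree on the op name regardless of how
# many ops each of them has enqueued before.
hvd.allreduce_(x, name="grad.layer1.weight")
```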
This situation was occasionally triggered in the PyTorch unit tests when running with MPI. For instance, I would see the following:
Here we see the influence of the previously executed test function: rank 1 had enqueued one `broadcast.duplicate_name` op more than rank 0. Then in this test the autogenerated names don't match up because the handles 3259 and 3260 are different, causing a stall.

Even without lingering `duplicate_name` ops such a stall could be triggered. For example, I saw errors like the following in the CI logs for PR #3199, which were also caused by previous tests:
(When running with Gloo instead of MPI, Horovod is shut down and reinitialized between each test function, so we don't see these cross effects between separate tests as often).
To fix this situation I have rewritten the offending tests: those that test duplicate name error handling and those that test `hvd.join()`. In the reformulated tests it is ensured that each rank always enqueues the same number of Horovod ops. This should prevent stalls in subsequent tests; a hedged sketch of the pattern is shown below.

It remains an open question whether a better mechanism could be found that generally avoids these name mismatches.
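Roughly, the reformulated tests follow this pattern (a sketch only; names, shapes, and exception handling are illustrative, not copied from the test code):

```python
import torch
import horovod.torch as hvd

hvd.init()
tensor = torch.ones(17)

# Every rank submits the first op under the shared name.
handle = hvd.allreduce_async(tensor, name="duplicate_name")

# Every rank also submits the duplicate, so all ranks enqueue the same
# number of ops; the duplicate submission is expected to be rejected.
raised = False
try:
    hvd.allreduce_async(tensor, name="duplicate_name")
except Exception:  # the exact exception type depends on the framework/version
    raised = True
assert raised, "expected a duplicate-name error"

hvd.synchronize(handle)

# Re-align all ranks with a trivial named collective so that the handle
# counters agree before the next test runs.
hvd.allreduce(torch.tensor(0.0), name="sync_after_duplicate_name_test")
```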
2) Fix `hvd.barrier()` tensor queue management

The rewritten `test_horovod_allreduce_duplicate_name_error` etc. now employ a `hvd.barrier()` call, and that unveiled another bug.

If a `barrier` was enqueued together with an `allreduce`, the barrier request would erroneously be pushed again and again into the tensor queue in each cycle, even after the barrier was completed. Additionally, this could cause a mismatch in `GetTensorEntriesFromResponse`, where a tensor from the response would not be present in the tensor table. That mismatch caused segmentation faults in the `test_horovod_join_allgather` unit test, although the situation was established earlier by an `hvd.barrier()` in the `test_horovod_allreduce_duplicate_name_error` unit test.
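For context, here is a Python-level sketch (assumed, not the actual test code) of the kind of op combination that exposed the bug, i.e. a barrier submitted while an allreduce is still pending:

```python
import torch
import horovod.torch as hvd

hvd.init()

# An allreduce that is still in flight when the barrier is enqueued.
handle = hvd.allreduce_async(torch.ones(1024), name="inflight_allreduce")

# Before this fix, the barrier request could be re-pushed into the tensor
# queue on every cycle even after the barrier had completed.
hvd.barrier()

hvd.synchronize(handle)
```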
Review process to land