Fix finalization of ProcessSetTable and some test flakiness with PyTorch 1.10.1 #3351
Checklist before submitting
Description
I've been looking at some test failures in PR #3261 that popped up after rebasing to master: CI logs for test-cpu-gloo-py3_7-tf2_7_0-keras2_7_0-torch1_10_1-mxnet1_9_0-pyspark2_4_8. As far as I can tell, these are not directly related to the proposed CMake changes. Perhaps the recent update to PyTorch 1.10.1 has made it more likely to run into them.
These related fixes come with this PR:
- ProcessSetTable finalization when finalizing Horovod: In debug builds with Gloo, where Horovod is re-initialized after each torch test, an assertion would otherwise fail on such a re-init. Internal process set IDs now get more reasonable values, but I don't think this caused further bugs.
- hvd.shutdown() in TorchTests.tearDown(): In tests like test_horovod_alltoall_equal_split_length_error it was possible that rank 0 had already finished the test function and triggered shutting down Horovod before rank 1 had a chance to call alltoall (which would exit quickly to report an error if the test worked as intended). Rank 1 would then crash. See the sketch after this list.
- TorchTests::test_broadcast_state: I am not sure, but hangs might have been caused by wrongly associated autogenerated names like broadcast.noname.1114.
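To make the tearDown() race more concrete, here is a minimal sketch of the kind of synchronization that avoids it. This is an illustrative assumption, not necessarily the exact change in this PR; it assumes hvd.barrier() (or any blocking collective that forces all ranks to rendezvous) is available in the installed Horovod version.

```python
import unittest

import horovod.torch as hvd


class TorchTests(unittest.TestCase):
    """Sketch of a test harness that re-initializes Horovod per test."""

    def setUp(self):
        hvd.init()

    def tearDown(self):
        # Without a synchronization point, rank 0 can reach hvd.shutdown()
        # while rank 1 is still inside the test body, e.g. about to call
        # hvd.alltoall() in test_horovod_alltoall_equal_split_length_error.
        # Once rank 0 has torn down its background loop, rank 1's pending
        # collective can no longer complete and that rank crashes.
        hvd.barrier()   # assumed available; any blocking collective would do
        hvd.shutdown()
```

The key point is only that every rank passes a common synchronization point before any rank shuts down the Horovod background loop.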