Fix finalization of ProcessSetTable and some test flakiness with PyTorch 1.10.1 #3351
Checklist before submitting
Description
I've been looking at some test failures in PR #3261 that popped up after rebasing to master: CI logs for test-cpu-gloo-py3_7-tf2_7_0-keras2_7_0-torch1_10_1-mxnet1_9_0-pyspark2_4_8. As far as I can tell, these are not directly related to the proposed CMake changes. Perhaps the recent update to PyTorch 1.10.1 has made it more likely to run into them.
These related fixes come with this PR:
- ProcessSetTable finalization when finalizing Horovod: In debug builds with Gloo, where Horovod is re-initialized after each torch test, an assertion would otherwise fail on such a re-init. Internal process set IDs now get more reasonable values, but I don't think this caused further bugs.
- hvd.shutdown() in TorchTests.tearDown(): In tests like test_horovod_alltoall_equal_split_length_error it was possible that rank 0 had already finished the test function and triggered shutting down Horovod before rank 1 had a chance to call alltoall (which would exit quickly to report an error if the test worked as intended). Rank 1 would then crash. See the sketch after this list.
- TorchTests::test_broadcast_state: I am not sure, but hangs might have been caused by wrongly associated autogenerated names like broadcast.noname.1114.
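To make the tearDown() race more concrete, here is a minimal sketch of the kind of synchronization that avoids it. This is an illustrative assumption, not necessarily the exact change in this PR; it assumes hvd.barrier() (or any blocking collective that forces all ranks to rendezvous) is available in the installed Horovod version.

```python
import unittest

import horovod.torch as hvd


class TorchTests(unittest.TestCase):
    """Sketch of a test harness that re-initializes Horovod per test."""

    def setUp(self):
        hvd.init()

    def tearDown(self):
        # Without a synchronization point, rank 0 can reach hvd.shutdown()
        # while rank 1 is still inside the test body, e.g. about to call
        # hvd.alltoall() in test_horovod_alltoall_equal_split_length_error.
        # Once rank 0 has torn down its background loop, rank 1's pending
        # collective can no longer complete and that rank crashes.
        hvd.barrier()   # assumed available; any blocking collective would do
        hvd.shutdown()
```

The key point is only that every rank passes a common synchronization point before any rank shuts down the Horovod background loop.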