Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix finalization of ProcessSetTable and some test flakiness with PyTorch 1.10.1 #3351

Merged
merged 3 commits into from Jan 10, 2022

Conversation

maxhgerlach
Copy link
Collaborator

@maxhgerlach maxhgerlach commented Jan 7, 2022

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

I've been looking at some test failures in PR #3261, that popped up after rebasing to master: CI logs for test-cpu-gloo-py3_7-tf2_7_0-keras2_7_0-torch1_10_1-mxnet1_9_0-pyspark2_4_8. As far as I can tell, these are not directly related to the proposed CMake changes. Perhaps the recent update to PyTorch 1.10.1 has made it more likely to run into them.

These related fixes come with this PR:

  1. Properly reset internal state of ProcessSetTable when finalizing Horovod: In debug builds with Gloo, where Horovod is re-initialized after each torch test, an assertion would fail otherwise on such a re-init. Internal process set IDs now get more reasonable values, but I don't think this caused further bugs.
  2. Add a synchronization barrier before hvd.shutdown() in TorchTests.tearDown(): In tests like test_horovod_alltoall_equal_split_length_error it would be possible that rank 0 has already finished the test function and has triggered shutting down Horovod, before rank 1 has had a chance to call alltoall (which would exit quickly to report an error if the test worked as intended). Rank 1 would then crash.
  3. Add explicit names to broadcast operations in TorchTests::test_broadcast_state: I am not sure, but hangs might have been caused by wrongly associated autogenerated names like broadcast.noname.1114.

In debug mode with Gloo  an assertion would fail otherwise when
Horovod is reinitialized.

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
In tests like `test_horovod_alltoall_equal_split_length_error` it would be possible
that rank 0 has already finished the test function and has triggered shutting down
Horovod, before rank 1 has had a chance to call `alltoall` (which would exit quickly to
report an error if the test worked as intended). Rank 1 would then crash.

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
…rations

I am not sure, but hangs might have been caused by wrongly associated
autogenerated names like `broadcast.noname.1114`.

Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
@github-actions
Copy link

github-actions bot commented Jan 7, 2022

Unit Test Results

     830 files  ±0       830 suites  ±0   8h 41m 7s ⏱️ - 36m 42s
     717 tests ±0       672 ✔️ ±  0       45 💤 ±  0  0 ±0 
17 988 runs  ±0  12 658 ✔️ +14  5 330 💤  - 14  0 ±0 

Results for commit abea4a0. ± Comparison against base commit 976a879.

@github-actions
Copy link

github-actions bot commented Jan 7, 2022

Unit Test Results (with flaky tests)

     924 files   -   32       924 suites   - 32   9h 18m 40s ⏱️ - 58m 9s
     717 tests ±    0       670 ✔️ +    1       45 💤 ±    0  2  - 1 
20 054 runs   - 778  13 966 ✔️  - 486  6 086 💤  - 290  2  - 2 

For more details on these failures, see this check.

Results for commit abea4a0. ± Comparison against base commit 976a879.

@maxhgerlach maxhgerlach merged commit 45e1af6 into horovod:master Jan 10, 2022
@maxhgerlach maxhgerlach deleted the some-fixes-torch-1.10.1 branch January 10, 2022 19:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants