Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

re-enable delta optimizer test and fix a bug in adasum communicator init logic #3379

Merged
merged 1 commit into from Jan 24, 2022

Conversation

Tixxx
Copy link
Collaborator

@Tixxx Tixxx commented Jan 22, 2022

re-enable delta optimizer test and fix a bug in adasum communicator init logic
Signed-off-by: TJ tix@uber.com

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

This addresses issue 3314. Torch's delta_optimizer test was disabled due to a deadlock issue with torch 1.10, this is to enable the test with a barrier at the end to properly synchronize. While the main root cause was due to the python GIL which was fixed by this pr, another bug with adasum reduction comm init logic was discovered during debugging, adasum was always initialized relative to mpi_world, the correct relative communicator should be mpi_context->mpi_comm.

Fixes # (issue).
#3314

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

…nit logic

Signed-off-by: TJ <tix@uber.com>
@Tixxx Tixxx linked an issue Jan 22, 2022 that may be closed by this pull request
@github-actions
Copy link

Unit Test Results

     845 files       845 suites   8h 51m 10s ⏱️
     717 tests      675 ✔️      42 💤 0
18 192 runs  12 674 ✔️ 5 518 💤 0

Results for commit a9a0d3e.

@github-actions
Copy link

Unit Test Results (with flaky tests)

     967 files       967 suites   10h 23m 29s ⏱️
     717 tests      672 ✔️      42 💤 3
20 770 runs  14 281 ✔️ 6 484 💤 5

For more details on these failures, see this check.

Results for commit a9a0d3e.

Copy link
Collaborator

@maxhgerlach maxhgerlach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, good to hear that this hang was also fixed by releasing the GIL.

@Tixxx Tixxx merged commit a1f17d8 into master Jan 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

TorchTests::test_delta_optimizer causes deadlocks in later tests
2 participants