torch: sync values_handle firstly in sparse_allreduce_async #2965

Merged: 1 commit into horovod:master on Jun 10, 2021

@chongxiaoc (Collaborator) commented Jun 10, 2021

The root cause has not been found yet. However, if values_handle is not
synchronized first, sparse_allreduce_async hangs in the unit test.

Signed-off-by: Chongxiao Cao <chongxiaoc@uber.com>

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

Fixes #2961
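
For readers unfamiliar with the pattern, here is a minimal sketch of the ordering change, not the actual Horovod code. It assumes the sparse tensor's indices and values are each communicated through their own asynchronous handle; only `values_handle` and `sparse_allreduce_async` are named in this PR. The helpers `fake_allgather_async` and `fake_synchronize`, the `indices_handle` name, and the use of `concurrent.futures` futures in place of Horovod handles are all hypothetical stand-ins chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins: none of these helpers are Horovod APIs.
# A Future plays the role of the handle returned by an async collective.
sync_order = []  # records the order in which handles are waited on


def fake_allgather_async(executor, data, world_size=2):
    # Pretend every rank contributed the same data and concatenate it.
    return executor.submit(lambda: list(data) * world_size)


def fake_synchronize(label, handle):
    # Block until the async operation behind the handle has finished.
    sync_order.append(label)
    return handle.result()


def sparse_allreduce_sketch(executor, indices, values, world_size=2):
    # Launch both async operations up front (assumed structure).
    indices_handle = fake_allgather_async(executor, indices, world_size)
    values_handle = fake_allgather_async(executor, values, world_size)

    def handle():
        # Wait on values_handle first, then on indices_handle.
        gathered_values = fake_synchronize("values", values_handle)
        gathered_indices = fake_synchronize("indices", indices_handle)
        return gathered_indices, gathered_values

    return handle


with ThreadPoolExecutor(max_workers=2) as pool:
    result_indices, result_values = sparse_allreduce_sketch(
        pool, [0, 3], [2.0, 4.0]
    )()
```

The sketch only shows the order in which the two handles are waited on. It does not reproduce the hang, whose root cause is still unknown.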

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

github-actions bot commented Jun 10, 2021

Unit Test Results

     783 files  ±0       783 suites  ±0   6h 12m 25s ⏱️ ±0s
     601 tests ±0       566 ✔️ ±0       35 💤 ±0  0 ❌ ±0 
16 311 runs  ±0  12 291 ✔️ ±0  4 020 💤 ±0  0 ❌ ±0 

Results for commit b01bc54. ± Comparison against base commit b01bc54.

♻️ This comment has been updated with latest results.

@tgaddair (Collaborator) left a comment

Not sure why it works, but nice job!

@tgaddair tgaddair merged commit b01bc54 into horovod:master Jun 10, 2021

Successfully merging this pull request may close these issues.

PyTorch sparse allreduce fails with torch nightly