Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix issue 3264 #3267

Conversation

woodlgz
Copy link
Contributor

@woodlgz woodlgz commented Nov 10, 2021

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

This is a PR from JIZHI Team & Taiji AI platform in Tencent.
Fixes #3264.
avoid enqueuing more collective ops while horovod is shutting down.

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

@github-actions
Copy link

github-actions bot commented Nov 10, 2021

Unit Test Results

     718 files   -   28       718 suites   - 28   5h 56m 57s ⏱️ + 9m 40s
     591 tests ±    0       552 ✔️ ±    0       34 💤 ±    0    5 ±0 
15 375 runs   - 662  10 705 ✔️  - 512  4 652 💤  - 150  18 ±0 

For more details on these failures, see this check.

Results for commit 280a716. ± Comparison against base commit 28eaf2b.

♻️ This comment has been updated with latest results.

@github-actions
Copy link

github-actions bot commented Nov 10, 2021

Unit Test Results (with flaky tests)

     942 files  ±0       942 suites  ±0   6h 44m 51s ⏱️ + 15m 35s
     591 tests ±0       551 ✔️  - 1       34 💤 ±0    6 +1 
20 671 runs  ±0  14 214 ✔️  - 5  6 402 💤 +4  55 +1 

For more details on these failures, see this check.

Results for commit 280a716. ± Comparison against base commit 28eaf2b.

♻️ This comment has been updated with latest results.

Copy link
Collaborator

@tgaddair tgaddair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @woodlgz. Can you rebase off master to pick up any test fixes?

@romerojosh can you double check the changes to checking the shut down state here? It looks like it should be fine, but wanted to get a second opinion.

@romerojosh
Copy link
Collaborator

The changes look fine to me, but I'm not exactly sure what we expect to happen in an elastic scenario, which seems to be what this issue is about.

In an elastic case, do we expect these workers to resubmit these collective requests? If so, then this change looks good to me.

@ashahab
Copy link
Collaborator

ashahab commented Nov 11, 2021

@romerojosh I have a clarifying question: is horovod_global.shut_down a worker-only state or a shared state among all workers? If it's the former then yes, the remaining elastic workers should resubmit the collective ops(after forming a new ring).

@woodlgz
Copy link
Contributor Author

woodlgz commented Nov 11, 2021

@ashahab yes, horovod_global.shut_down is a worker only state, but somehow all workers remaining will reach a common shutdown state and in elastic training they will form a new ring in next loop.
@romerojosh current training loop is wrapped in a so-called elastic loop in https://github.com/horovod/horovod/blob/master/horovod/common/elastic.py#L151. so in next loop,user training function will try to redo the calculation and resubmit the ops.

Signed-off-by: guoze.lin <guozelin@tencent.com>
@woodlgz woodlgz force-pushed the lgz/fix-enqueuing-tensor-while-doing-finalizaion branch from e5d126e to 280a716 Compare November 11, 2021 04:11
Copy link
Collaborator

@tgaddair tgaddair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Looks like the failing tests are head versions, which have some known issues being addressed. So we can go ahead and land this.

@tgaddair tgaddair merged commit 32c278b into horovod:master Nov 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

All workers failed on failure with Elastic Horovod
4 participants