fix issue 3264 #3267
Unit Test Results: 718 files (-28), 718 suites (-28), 5h 56m 57s ⏱️ (+9m 40s). For more details on these failures, see this check. Results for commit 280a716. ± Comparison against base commit 28eaf2b.
Unit Test Results (with flaky tests): 942 files (±0), 942 suites (±0), 6h 44m 51s ⏱️ (+15m 35s). For more details on these failures, see this check. Results for commit 280a716. ± Comparison against base commit 28eaf2b.
Thanks for the PR @woodlgz. Can you rebase off master to pick up any test fixes?
@romerojosh can you double check the changes to checking the shut down state here? It looks like it should be fine, but wanted to get a second opinion.
The changes look fine to me, but I'm not exactly sure what we expect to happen in an elastic scenario, which seems to be what this issue is about. In an elastic case, do we expect these workers to resubmit these collective requests? If so, then this change looks good to me.
@romerojosh I have a clarifying question: is |
@ashahab yes, horovod_global.shut_down is a worker-only state, but all remaining workers will somehow reach a common shutdown state, and in elastic training they will form a new ring in the next loop.
Signed-off-by: guoze.lin <guozelin@tencent.com>
e5d126e to 280a716
LGTM! Looks like the failing tests are head versions, which have some known issues being addressed. So we can go ahead and land this.
Checklist before submitting
Description
This is a PR from JIZHI Team & Taiji AI platform in Tencent.
Fixes #3264.
Avoid enqueuing more collective ops while Horovod is shutting down.
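The idea can be illustrated with a small, self-contained sketch. This is not the code changed in this PR and not Horovod's API: `CollectiveQueue`, `ShutdownError`, and the op names are hypothetical, and the `_shut_down` attribute only plays the role that the per-worker `horovod_global.shut_down` flag plays in the discussion above. It assumes the flag and the pending queue are guarded by the same lock, so an op cannot slip into the queue after shutdown has started.

```python
import threading
from collections import deque


class ShutdownError(RuntimeError):
    """Hypothetical error raised when an op is submitted during shutdown."""


class CollectiveQueue:
    """Toy stand-in for a worker's collective-op queue (not Horovod's API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = deque()
        # Per-worker flag, analogous in spirit to horovod_global.shut_down.
        self._shut_down = False

    def shutdown(self):
        """Mark the worker as shutting down and return the ops that were dropped."""
        with self._lock:
            self._shut_down = True
            dropped = list(self._pending)
            self._pending.clear()
        return dropped

    def enqueue(self, op_name):
        """Queue an op, or refuse it if shutdown has already begun."""
        with self._lock:
            # Check the flag under the same lock that protects the queue,
            # so no op can be added once shutdown() has run.
            if self._shut_down:
                raise ShutdownError(f"cannot enqueue {op_name!r}: shutting down")
            self._pending.append(op_name)

    def pending(self):
        """Return a snapshot of the ops still waiting to run."""
        with self._lock:
            return list(self._pending)
```

In an elastic setup, a caller that receives the rejection would presumably resubmit its ops after the remaining workers re-form the ring, as discussed in the review comments; that retry logic is outside the scope of this sketch.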
Review process to land