New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Un-pin torch-nightly #2829
Un-pin torch-nightly #2829
Conversation
This comment has been minimized.
This comment has been minimized.
d51e15a
to
f9e5840
Compare
This comment has been minimized.
This comment has been minimized.
f9e5840
to
ff4c324
Compare
Unit Test Results 783 files ±0 783 suites ±0 5h 28m 17s ⏱️ ±0s For more details on these failures, see this check. Results for commit a74e42b. ± Comparison against base commit a74e42b. ♻️ This comment has been updated with latest results. |
Raised issue pytorch/pytorch#57900 with pytorch. |
ae031fc
to
2ca8194
Compare
We have to un-pin torch-head because the pinned nightly version has been cleaned up, so that master has started to break over the weekend. We cannot un-pin it because our The change replaces mean arguments with sum and count. So there is an additional count tensor required in 1.9.0, that I am trying to provide (2ca8194) but it fails with
Maybe someone can give me a hint? |
687ae0d
to
c5cf6fb
Compare
Which unit test/example can reproduce this error? @EnricoMi |
I found it in buildkite log
|
@chongxiaoc my last commit "Method signature of non-public torch function changed in 1.9.0" is the one that tries to fix the broken |
They have change the signature of |
@EnricoMi Please try with below diff and rebase.
I tested with
Result:
@tgaddair FYI |
@chongxiaoc looks like I had been pretty close. Thanks for fixing this. I'll have a look in the remaining new issues. |
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Travis Addair <tgaddair@gmail.com>
94c95ba
to
6deaa8b
Compare
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
6deaa8b
to
6a4b4e6
Compare
e3450ec
to
b783e73
Compare
The last remaining issue is puzzling. The elastic torch test makes the training script fail with SIGKILL on process with rank one, which should make elastic Horovod downscale the cluster to 3 training Horovod processes. From the log and exist codes we can see the other three training processes also fail, due to an uncaught Gloo IO Exception:
Here are the exit codes of the four Horovod processes:
Exit code 137 indicates a SIGKILL, exit code 134 indicates a SIGABRT. This only occurs with torch head on GPU (torch-1.9.0.dev20210514+cu111), not torch head on CPU (torch-1.9.0.dev20210514+cpu). @tgaddair this feels like we have seen this some time ago. Shouldn't that Gloo UI Exception been caught in the main elastic loop? |
Hmm, seems like there may be some flakiness here, as I don't see what would make this error specific to GPUs. Maybe we can create an issue and skip that test for now so we can unblock. |
Elastic tests are more and more degrading... I have created #2908. |
|
||
@mock.patch('horovod.runner.elastic.driver.DISCOVER_HOSTS_FREQUENCY_SECS', 0.01) | ||
@mock.patch('horovod.runner.gloo_run._get_min_start_hosts', return_value=1) | ||
def test_fault_tolerance_without_scaling(self, mock_get_min_start_hosts): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could further restrict this to torch head and see if it is exclusively an issue for >= 1.9.0
:
if LooseVersion(torch.__version__) >= LooseVersion('1.9.0'):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good!
2f61304
to
e7ca8c3
Compare
Signed-off-by: Travis Addair <tgaddair@gmail.com>
e7ca8c3
to
acf8994
Compare
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
acf8994
to
d232de8
Compare
In #2730 we have pinned torch nightly, this un-pins it by adjusting for upstream changed.
torch.batch_norm_backward_elemt
changes in 1.9.0.Unrelated changes but fixed in this PR:
BatchedDataLoader
constructor break master, so limiting topetastorm<0.11.0
(Make Horovod work with petastorm >= 0.11.0 #2909).Not handled in this PR: