Un-pin torch-nightly #2829

Merged: 6 commits merged into master from branch-unpin-torchhead on May 17, 2021
Conversation

@EnricoMi (Collaborator) commented Apr 8, 2021

In #2730 we pinned torch nightly; this PR un-pins it by adjusting for upstream changes.

  • The API of torch.batch_norm_backward_elemt changed in 1.9.0.

Changes unrelated to the un-pin but fixed in this PR:

Not handled in this PR:

@EnricoMi EnricoMi changed the title Branch unpin torchhead unpin torch-head Apr 8, 2021
@EnricoMi EnricoMi changed the title unpin torch-head Unpin torch-head Apr 8, 2021
@EnricoMi EnricoMi changed the title Unpin torch-head Un-pin torch-head Apr 8, 2021
@EnricoMi EnricoMi changed the title Un-pin torch-head Un-pin torch-nightly Apr 8, 2021

@github-actions bot commented May 4, 2021

Unit Test Results

     783 files  ±0       783 suites  ±0   5h 28m 17s ⏱️ ±0s
     592 tests ±0       557 ✔️ ±0       34 💤 ±0  1 ❌ ±0 
16 212 runs  ±0  12 247 ✔️ ±0  3 963 💤 ±0  2 ❌ ±0 

For more details on these failures, see this check.

Results for commit a74e42b. ± Comparison against base commit a74e42b.

♻️ This comment has been updated with latest results.

@EnricoMi (Collaborator, Author) commented May 8, 2021

Raised issue pytorch/pytorch#57900 with pytorch.

@EnricoMi EnricoMi force-pushed the branch-unpin-torchhead branch 2 times, most recently from ae031fc to 2ca8194 Compare May 9, 2021 19:14
@EnricoMi (Collaborator, Author)

We have to un-pin torch-head because the pinned nightly version has been cleaned up, so master started to break over the weekend. We cannot simply un-pin it, though, because our SyncBatchNorm uses the non-public torch method batch_norm_backward_elemt, whose signature changed in the unreleased 1.9.0 (torch-head): pytorch/pytorch#46906

The change replaces the mean arguments with sums and a count, so 1.9.0 requires an additional count tensor. I am trying to provide it (2ca8194), but it fails with:

Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cpu! (when checking arugment for argument count in method wrapper_batch_norm_backward_elemt)

Maybe someone can give me a hint?
@tgaddair @romerojosh @nvcastet @abditag2
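
For reference, a rough sketch of the call-site difference described in pytorch/pytorch#46906 (argument names such as saved_input, mean_dy and sum_dy are illustrative, not the exact SyncBatchNorm variables):

# torch < 1.9.0: the element-wise backward kernel takes per-element means
grad_input = torch.batch_norm_backward_elemt(
    grad_output, saved_input, mean, invstd, weight, mean_dy, mean_dy_xmu)

# torch >= 1.9.0: it takes per-element sums plus an int count tensor,
# which has to live on the same device as the other tensors
grad_input = torch.batch_norm_backward_elemt(
    grad_output, saved_input, mean, invstd, weight, sum_dy, sum_dy_xmu, count)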

@EnricoMi EnricoMi force-pushed the branch-unpin-torchhead branch 5 times, most recently from 687ae0d to c5cf6fb Compare May 14, 2021 09:05
@chongxiaoc (Collaborator) commented May 14, 2021

Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cpu! (when checking arugment for argument count in method wrapper_batch_norm_backward_elemt)

Which unit test/example can reproduce this error? @EnricoMi

@chongxiaoc (Collaborator)

Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cpu! (when checking arugment for argument count in method wrapper_batch_norm_backward_elemt)

Which unit test/example can reproduce this error? @EnricoMi

I found it in the Buildkite log:

[0]<stdout>:FAILED test_torch.py::TorchTests::test_horovod_sync_batch_norm - AssertionErr...

@EnricoMi (Collaborator, Author)

@chongxiaoc my last commit "Method signature of non-public torch function changed in 1.9.0" is the one that tries to fix the broken SyncBatchNorm implementation. Undo as much as you need to.

@EnricoMi (Collaborator, Author)

They have changed the signature of torch.batch_norm_backward_elemt from taking mean values to taking sums and a count: pytorch/pytorch#46906
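
For context, a minimal sketch of how the two calling conventions can be distinguished (the _SYNC_BN_V4 flag name follows the diff below; tying it to version 1.9.0 is an assumption based on this thread):

from distutils.version import LooseVersion

import torch

# From 1.9.0 on, batch_norm_backward_elemt expects sums and a count tensor
_SYNC_BN_V4 = LooseVersion(torch.__version__) >= LooseVersion('1.9.0')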

@chongxiaoc (Collaborator) commented May 14, 2021

@EnricoMi Please try the diff below and rebase.

(deepeat-py3.6-env) [root@/mnt/share/chongxiaoc/git/horovod #]git diff .
diff --git a/horovod/torch/sync_batch_norm.py b/horovod/torch/sync_batch_norm.py
index 19f666d..7467802 100644
--- a/horovod/torch/sync_batch_norm.py
+++ b/horovod/torch/sync_batch_norm.py
@@ -168,9 +168,8 @@ class _SyncBatchNorm(Function):

             if _SYNC_BN_V4:
                 # from 1.9.0 on we need a count tensor on all devices
-                count_all_handle = allreduce_async(count_all, op=Sum, name='sync_batch_norm.count_all')
-                count_all = synchronize(count_all_handle)
-                count_all = count_all.view(-1).int().to(grad_output.device)
+                # count_all is calculated as total count across all ranks in forward function
+                count_all = count_all.to(dtype=torch.int, device=grad_output.device)
             elif _SYNC_BN_V2 or _SYNC_BN_V3:
                 # before 1.9.0 we need the count as an integer to compute means values
                 count = count_all.sum()

I tested with torch==1.9.0.dev20210514 and torchvision==0.10.0.dev20210514, installed as below, on our 2-GPU system:

pip install torch==1.9.0.dev20210514 -f https://download.pytorch.org/whl/nightly/cu102/torch_nightly.html
pip install torchvision==0.10.0.dev20210514 -f https://download.pytorch.org/whl/nightly/cu102/torch_nightly.html
MAKEFLAGS="-j1" HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_NCCL_HOME=/usr/local/nccl HOROVOD_WITHOUT_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 python setup.py install

Result:

(deepeat-py3.6-env) [root@/mnt/share/chongxiaoc/git/horovod/test/parallel #]horovodrun -np 2 -H worker-0:2 pytest -v -s  test_torch.py::TorchTests::test_horovod_sync_batch_norm
2021-05-14 22:20:23.331191: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/gcc-7.4/lib64:/usr/local/cuda/compat:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server:/opt/hadoop/latest/lib/native:/usr/local/cuda/extras/CUPTI/lib64:/opt/michelangelo/python_code/lib:/usr/local/openmpi/lib:/usr/local/lib
2021-05-14 22:20:23.331243: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
[1,1]<stdout>:============================= test session starts ==============================
[1,1]<stdout>:platform linux -- Python 3.6.9, pytest-6.2.1, py-1.10.0, pluggy-0.13.1 -- /mnt/share/chongxiaoc/deepeat-py3.6-env/bin/python
[1,0]<stdout>:============================= test session starts ==============================
[1,0]<stdout>:platform linux -- Python 3.6.9, pytest-6.2.1, py-1.10.0, pluggy-0.13.1 -- /mnt/share/chongxiaoc/deepeat-py3.6-env/bin/python
[1,0]<stdout>:cachedir: .pytest_cache
[1,0]<stdout>:rootdir: /mnt/share/chongxiaoc/git/horovod, configfile: setup.cfg
[1,0]<stdout>:plugins: cov-2.11.1, forked-1.3.0, xdist-2.2.1, repeat-0.9.1, pycharm-0.7.0, logger-0.5.1, timeout-1.4.2
[1,1]<stdout>:cachedir: .pytest_cache
[1,1]<stdout>:rootdir: /mnt/share/chongxiaoc/git/horovod, configfile: setup.cfg
[1,1]<stdout>:plugins: cov-2.11.1, forked-1.3.0, xdist-2.2.1, repeat-0.9.1, pycharm-0.7.0, logger-0.5.1, timeout-1.4.2
collected 1 item                                                               [1,1]<stdout>:
collected 1 item                                                               [1,0]<stdout>:
[1,1]<stdout>:
[1,1]<stdout>:test_torch.py::TorchTests::test_horovod_sync_batch_norm [1,0]<stdout>:
[1,0]<stdout>:test_torch.py::TorchTests::test_horovod_sync_batch_norm [1,1]<stdout>:PASSED[1,0]<stdout>:PASSED[1,0]<stdout>:
[1,0]<stdout>:
[1,0]<stdout>:============================== 1 passed in 7.96s ===============================
[1,1]<stdout>:
[1,1]<stdout>:
[1,1]<stdout>:============================== 1 passed in 7.96s ===============================
(deepeat-py3.6-env) [root@/mnt/share/chongxiaoc/git/horovod/test/parallel #]

@tgaddair FYI

@EnricoMi (Collaborator, Author)

Please try the diff below and rebase.

@chongxiaoc looks like I was pretty close. Thanks for fixing this. I'll have a look at the remaining new issues.

EnricoMi and others added 3 commits May 15, 2021 11:34
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Travis Addair <tgaddair@gmail.com>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
@EnricoMi EnricoMi force-pushed the branch-unpin-torchhead branch 2 times, most recently from e3450ec to b783e73 Compare May 15, 2021 17:22
@EnricoMi (Collaborator, Author)

The last remaining issue is puzzling. The elastic torch test makes the training script fail with SIGKILL on the process with rank one, which should make elastic Horovod scale the cluster down to 3 training processes.

From the log and exit codes we can see that the other three training processes also fail, due to an uncaught gloo::IoException:

[0]<stderr>:RuntimeError: check_rank and exit epoch=1 batch=0 start_rank=0 rank=0
[0]<stderr>:Killed

[1]<stderr>:Terminated
[1]<stderr>:terminate called after throwing an instance of 'gloo::IoException'
[1]<stderr>:  what():  [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:589] Read error [127.0.0.1]:3977: Connection reset by peer
[1]<stderr>:Aborted (core dumped)

[2]<stderr>:terminate called after throwing an instance of 'gloo::IoException'
[2]<stderr>:  what():  [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [127.0.0.1]:54168
[2]<stderr>:Aborted (core dumped)

[3]<stderr>:terminate called after throwing an instance of 'gloo::IoException'
[3]<stderr>:  what():  [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:589] Read error [127.0.0.1]:52398: Connection reset by peer
[3]<stderr>:Aborted (core dumped)

Here are the exit codes of the four Horovod processes:

Process 0 exit with status code 137.
Process 3 exit with status code 134.
Process 2 exit with status code 134.
Process 1 exit with status code 134.

Exit code 137 indicates a SIGKILL, exit code 134 indicates a SIGABRT.
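
For reference, these can be decoded with the usual Linux convention of 128 plus the signal number:

import signal

# 137 = 128 + 9 (SIGKILL), 134 = 128 + 6 (SIGABRT)
for code in (137, 134):
    print(code, '->', signal.Signals(code - 128).name)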

This only occurs with torch head on GPU (torch-1.9.0.dev20210514+cu111), not torch head on CPU (torch-1.9.0.dev20210514+cpu).

@tgaddair this feels like something we have seen some time ago. Shouldn't that gloo::IoException have been caught in the main elastic loop?

@tgaddair (Collaborator)

Hmm, seems like there may be some flakiness here, as I don't see what would make this error specific to GPUs. Maybe we can create an issue and skip that test for now so we can unblock.

@EnricoMi (Collaborator, Author) commented May 16, 2021

Elastic tests are degrading more and more... I have created #2908.


@mock.patch('horovod.runner.elastic.driver.DISCOVER_HOSTS_FREQUENCY_SECS', 0.01)
@mock.patch('horovod.runner.gloo_run._get_min_start_hosts', return_value=1)
def test_fault_tolerance_without_scaling(self, mock_get_min_start_hosts):
@EnricoMi (Collaborator, Author) commented on the test above:
we could further restrict this to torch head and see if it is exclusively an issue for >= 1.9.0:

if LooseVersion(torch.__version__) >= LooseVersion('1.9.0'):
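
A minimal sketch of how that check could be combined with a test skip (placing it inside test_fault_tolerance_without_scaling, the skip message, and torch being importable in the test module are assumptions):

from distutils.version import LooseVersion

import torch


def test_fault_tolerance_without_scaling(self, mock_get_min_start_hosts):
    # Skip only on torch >= 1.9.0 until the gloo::IoException failure (#2908) is understood
    if LooseVersion(torch.__version__) >= LooseVersion('1.9.0'):
        self.skipTest('elastic fault tolerance is flaky with torch >= 1.9.0 on GPU, see #2908')
    ...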

A collaborator replied:

Sounds good!

Signed-off-by: Travis Addair <tgaddair@gmail.com>
@EnricoMi EnricoMi marked this pull request as ready for review May 16, 2021 18:45
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
@EnricoMi EnricoMi merged commit a74e42b into master May 17, 2021