Un-pin torch-nightly #2829

Merged: 6 commits merged into master from branch-unpin-torchhead on May 17, 2021
Conversation

@EnricoMi (Collaborator) commented Apr 8, 2021

In #2730 we pinned torch nightly; this PR un-pins it by adjusting for upstream changes.

  • The API of torch.batch_norm_backward_elemt changed in 1.9.0.

Changes unrelated to the un-pin but fixed in this PR:

Not handled in this PR:

@EnricoMi EnricoMi changed the title Branch unpin torchhead unpin torch-head Apr 8, 2021
@EnricoMi EnricoMi changed the title unpin torch-head Unpin torch-head Apr 8, 2021
@EnricoMi EnricoMi changed the title Unpin torch-head Un-pin torch-head Apr 8, 2021
@EnricoMi EnricoMi changed the title Un-pin torch-head Un-pin torch-nightly Apr 8, 2021

@github-actions bot commented May 4, 2021

Unit Test Results

     783 files  ±0       783 suites  ±0   5h 28m 17s ⏱️ ±0s
     592 tests ±0       557 ✔️ ±0       34 💤 ±0  1 ❌ ±0 
16 212 runs  ±0  12 247 ✔️ ±0  3 963 💤 ±0  2 ❌ ±0 

For more details on these failures, see this check.

Results for commit a74e42b. ± Comparison against base commit a74e42b.

♻️ This comment has been updated with latest results.

@EnricoMi (Collaborator, Author) commented May 8, 2021

Raised issue pytorch/pytorch#57900 with pytorch.

@EnricoMi EnricoMi force-pushed the branch-unpin-torchhead branch 2 times, most recently from ae031fc to 2ca8194 Compare May 9, 2021 19:14
@EnricoMi (Collaborator, Author)

We have to un-pin torch-head because the pinned nightly version has been cleaned up, so master started to break over the weekend. We cannot simply un-pin it, though, because our SyncBatchNorm uses the non-public torch method batch_norm_backward_elemt, whose signature changed in the unreleased 1.9.0 (torch-head): pytorch/pytorch#46906

The change replaces the mean arguments with sums and a count, so 1.9.0 requires an additional count tensor. I am trying to provide it (2ca8194), but it fails with:

Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cpu! (when checking arugment for argument count in method wrapper_batch_norm_backward_elemt)

Maybe someone can give me a hint?
@tgaddair @romerojosh @nvcastet @abditag2
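
For reference, a rough sketch of the call-site difference described in pytorch/pytorch#46906 (argument names such as saved_input, mean_dy and sum_dy are illustrative, not the exact SyncBatchNorm variables):

# torch < 1.9.0: the element-wise backward kernel takes per-element means
grad_input = torch.batch_norm_backward_elemt(
    grad_output, saved_input, mean, invstd, weight, mean_dy, mean_dy_xmu)

# torch >= 1.9.0: it takes per-element sums plus an int count tensor,
# which has to live on the same device as the other tensors
grad_input = torch.batch_norm_backward_elemt(
    grad_output, saved_input, mean, invstd, weight, sum_dy, sum_dy_xmu, count)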

@EnricoMi EnricoMi force-pushed the branch-unpin-torchhead branch 5 times, most recently from 687ae0d to c5cf6fb Compare May 14, 2021 09:05
@chongxiaoc (Collaborator) commented May 14, 2021

Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cpu! (when checking arugment for argument count in method wrapper_batch_norm_backward_elemt)

Which unit test/example can reproduce this error? @EnricoMi

@chongxiaoc (Collaborator)

Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cpu! (when checking arugment for argument count in method wrapper_batch_norm_backward_elemt)

Which unit test/example can reproduce this error? @EnricoMi

I found it in the Buildkite log:

[0]<stdout>:FAILED test_torch.py::TorchTests::test_horovod_sync_batch_norm - AssertionErr...

@EnricoMi (Collaborator, Author)

@chongxiaoc my last commit "Method signature of non-public torch function changed in 1.9.0" is the one that tries to fix the broken SyncBatchNorm implementation. Undo as much as you need to.

@EnricoMi (Collaborator, Author)

They have changed the signature of torch.batch_norm_backward_elemt from taking mean values to taking sums and a count: pytorch/pytorch#46906
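
For context, a minimal sketch of how the two calling conventions can be distinguished (the _SYNC_BN_V4 flag name follows the diff below; tying it to version 1.9.0 is an assumption based on this thread):

from distutils.version import LooseVersion

import torch

# From 1.9.0 on, batch_norm_backward_elemt expects sums and a count tensor
_SYNC_BN_V4 = LooseVersion(torch.__version__) >= LooseVersion('1.9.0')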

@chongxiaoc (Collaborator) commented May 14, 2021

@EnricoMi Please try the diff below and rebase.

(deepeat-py3.6-env) [root@/mnt/share/chongxiaoc/git/horovod #]git diff .
diff --git a/horovod/torch/sync_batch_norm.py b/horovod/torch/sync_batch_norm.py
index 19f666d..7467802 100644
--- a/horovod/torch/sync_batch_norm.py
+++ b/horovod/torch/sync_batch_norm.py
@@ -168,9 +168,8 @@ class _SyncBatchNorm(Function):

             if _SYNC_BN_V4:
                 # from 1.9.0 on we need a count tensor on all devices
-                count_all_handle = allreduce_async(count_all, op=Sum, name='sync_batch_norm.count_all')
-                count_all = synchronize(count_all_handle)
-                count_all = count_all.view(-1).int().to(grad_output.device)
+                # count_all is calculated as total count across all ranks in forward function
+                count_all = count_all.to(dtype=torch.int, device=grad_output.device)
             elif _SYNC_BN_V2 or _SYNC_BN_V3:
                 # before 1.9.0 we need the count as an integer to compute means values
                 count = count_all.sum()

I tested with torch==1.9.0.dev20210514 and torchvision==0.10.0.dev20210514, installed as below, on our 2-GPU system:

pip install torch==1.9.0.dev20210514 -f https://download.pytorch.org/whl/nightly/cu102/torch_nightly.html
pip install torchvision==0.10.0.dev20210514 -f https://download.pytorch.org/whl/nightly/cu102/torch_nightly.html
MAKEFLAGS="-j1" HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_NCCL_HOME=/usr/local/nccl HOROVOD_WITHOUT_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 python setup.py install

Result:

(deepeat-py3.6-env) [root@/mnt/share/chongxiaoc/git/horovod/test/parallel #]horovodrun -np 2 -H worker-0:2 pytest -v -s  test_torch.py::TorchTests::test_horovod_sync_batch_norm
2021-05-14 22:20:23.331191: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/gcc-7.4/lib64:/usr/local/cuda/compat:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server:/opt/hadoop/latest/lib/native:/usr/local/cuda/extras/CUPTI/lib64:/opt/michelangelo/python_code/lib:/usr/local/openmpi/lib:/usr/local/lib
2021-05-14 22:20:23.331243: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
[1,1]<stdout>:============================= test session starts ==============================
[1,1]<stdout>:platform linux -- Python 3.6.9, pytest-6.2.1, py-1.10.0, pluggy-0.13.1 -- /mnt/share/chongxiaoc/deepeat-py3.6-env/bin/python
[1,0]<stdout>:============================= test session starts ==============================
[1,0]<stdout>:platform linux -- Python 3.6.9, pytest-6.2.1, py-1.10.0, pluggy-0.13.1 -- /mnt/share/chongxiaoc/deepeat-py3.6-env/bin/python
[1,0]<stdout>:cachedir: .pytest_cache
[1,0]<stdout>:rootdir: /mnt/share/chongxiaoc/git/horovod, configfile: setup.cfg
[1,0]<stdout>:plugins: cov-2.11.1, forked-1.3.0, xdist-2.2.1, repeat-0.9.1, pycharm-0.7.0, logger-0.5.1, timeout-1.4.2
[1,1]<stdout>:cachedir: .pytest_cache
[1,1]<stdout>:rootdir: /mnt/share/chongxiaoc/git/horovod, configfile: setup.cfg
[1,1]<stdout>:plugins: cov-2.11.1, forked-1.3.0, xdist-2.2.1, repeat-0.9.1, pycharm-0.7.0, logger-0.5.1, timeout-1.4.2
collected 1 item                                                               [1,1]<stdout>:
collected 1 item                                                               [1,0]<stdout>:
[1,1]<stdout>:
[1,1]<stdout>:test_torch.py::TorchTests::test_horovod_sync_batch_norm [1,0]<stdout>:
[1,0]<stdout>:test_torch.py::TorchTests::test_horovod_sync_batch_norm [1,1]<stdout>:PASSED[1,0]<stdout>:PASSED[1,0]<stdout>:
[1,0]<stdout>:
[1,0]<stdout>:============================== 1 passed in 7.96s ===============================
[1,1]<stdout>:
[1,1]<stdout>:
[1,1]<stdout>:============================== 1 passed in 7.96s ===============================
(deepeat-py3.6-env) [root@/mnt/share/chongxiaoc/git/horovod/test/parallel #]

@tgaddair FYI

@EnricoMi (Collaborator, Author)

Please try the diff below and rebase.

@chongxiaoc looks like I was pretty close. Thanks for fixing this. I'll have a look at the remaining new issues.

EnricoMi and others added 3 commits May 15, 2021 11:34
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Travis Addair <tgaddair@gmail.com>
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
@EnricoMi EnricoMi force-pushed the branch-unpin-torchhead branch 2 times, most recently from e3450ec to b783e73 Compare May 15, 2021 17:22
@EnricoMi (Collaborator, Author)

The last remaining issue is puzzling. The elastic torch test makes the training script fail with SIGKILL on the process with rank one, which should make elastic Horovod scale the cluster down to 3 training processes.

From the log and exit codes we can see that the other three training processes also fail, due to an uncaught gloo::IoException:

[0]<stderr>:RuntimeError: check_rank and exit epoch=1 batch=0 start_rank=0 rank=0
[0]<stderr>:Killed

[1]<stderr>:Terminated
[1]<stderr>:terminate called after throwing an instance of 'gloo::IoException'
[1]<stderr>:  what():  [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:589] Read error [127.0.0.1]:3977: Connection reset by peer
[1]<stderr>:Aborted (core dumped)

[2]<stderr>:terminate called after throwing an instance of 'gloo::IoException'
[2]<stderr>:  what():  [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [127.0.0.1]:54168
[2]<stderr>:Aborted (core dumped)

[3]<stderr>:terminate called after throwing an instance of 'gloo::IoException'
[3]<stderr>:  what():  [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:589] Read error [127.0.0.1]:52398: Connection reset by peer
[3]<stderr>:Aborted (core dumped)

Here are the exit codes of the four Horovod processes:

Process 0 exit with status code 137.
Process 3 exit with status code 134.
Process 2 exit with status code 134.
Process 1 exit with status code 134.

Exit code 137 indicates a SIGKILL, exit code 134 indicates a SIGABRT.
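
For reference, these can be decoded with the usual Linux convention of 128 plus the signal number:

import signal

# 137 = 128 + 9 (SIGKILL), 134 = 128 + 6 (SIGABRT)
for code in (137, 134):
    print(code, '->', signal.Signals(code - 128).name)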

This only occurs with torch head on GPU (torch-1.9.0.dev20210514+cu111), not torch head on CPU (torch-1.9.0.dev20210514+cpu).

@tgaddair this feels like something we have seen some time ago. Shouldn't that gloo::IoException have been caught in the main elastic loop?

@tgaddair (Collaborator)

Hmm, seems like there may be some flakiness here, as I don't see what would make this error specific to GPUs. Maybe we can create an issue and skip that test for now so we can unblock.

@EnricoMi (Collaborator, Author) commented May 16, 2021

Elastic tests are degrading more and more... I have created #2908.


@mock.patch('horovod.runner.elastic.driver.DISCOVER_HOSTS_FREQUENCY_SECS', 0.01)
@mock.patch('horovod.runner.gloo_run._get_min_start_hosts', return_value=1)
def test_fault_tolerance_without_scaling(self, mock_get_min_start_hosts):
@EnricoMi (Collaborator, Author) commented on the test above:
we could further restrict this to torch head and see if it is exclusively an issue for >= 1.9.0:

if LooseVersion(torch.__version__) >= LooseVersion('1.9.0'):
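
A minimal sketch of how that check could be combined with a test skip (placing it inside test_fault_tolerance_without_scaling, the skip message, and torch being importable in the test module are assumptions):

from distutils.version import LooseVersion

import torch


def test_fault_tolerance_without_scaling(self, mock_get_min_start_hosts):
    # Skip only on torch >= 1.9.0 until the gloo::IoException failure (#2908) is understood
    if LooseVersion(torch.__version__) >= LooseVersion('1.9.0'):
        self.skipTest('elastic fault tolerance is flaky with torch >= 1.9.0 on GPU, see #2908')
    ...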

A collaborator replied:

Sounds good!

Signed-off-by: Travis Addair <tgaddair@gmail.com>
@EnricoMi EnricoMi marked this pull request as ready for review May 16, 2021 18:45
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
@EnricoMi EnricoMi merged commit a74e42b into master May 17, 2021