Bug report:
When NCCL communication is enabled in Horovod, in an elastic scenario, evicting a worker instance may cause the NCCL communicator to abort on the other workers.
Specifically, there are three problems in this scenario:
1. The GPU operations event error check (run on a thread-pool thread) simply throws exceptions that are never handled, in which case the program aborts. This behaviour defeats the purpose of elastic Horovod, which is supposed to recover from such failures.
2. Compounding problem 1, the background loop's shutdown path, which cleans up NCCL resources by calling ncclCommDestroy (when the controller detects a shutdown condition, for instance peers exiting), can race with the thread-pool error check, which calls ncclCommAbort, potentially causing a double-free corruption (see the sketch after this list).
3. In the elastic eviction scenario, a rank that somehow fails to detect that its NCCL communicator is broken is prone to hanging while reporting 100% GPU utilization.
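To make the abort/destroy race in problem 2 concrete, here is a minimal sketch, not Horovod's actual code (the GuardedNcclComm class is hypothetical), of how the two release paths could be serialized so that the error-check thread's ncclCommAbort and the background loop's ncclCommDestroy never both free the same communicator:

```cpp
// Hypothetical guard around a NCCL communicator: whichever of Abort()/Destroy()
// runs first releases the communicator, and the other becomes a no-op, which
// avoids the double free shown in the backtrace below.
#include <mutex>
#include <nccl.h>

class GuardedNcclComm {
 public:
  explicit GuardedNcclComm(ncclComm_t comm) : comm_(comm) {}

  // Intended for the GPU-event error-check path (thread-pool thread). The
  // error must be handled here: an exception escaping a thread-pool thread
  // reaches std::terminate and aborts the whole process (problem 1).
  void Abort() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (comm_ != nullptr) {
      ncclCommAbort(comm_);
      comm_ = nullptr;  // keep a later Destroy() from freeing it again
    }
  }

  // Intended for the background loop's shutdown path (e.g. NCCLContext::ShutDown).
  void Destroy() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (comm_ != nullptr) {
      ncclCommDestroy(comm_);
      comm_ = nullptr;
    }
  }

 private:
  std::mutex mutex_;
  ncclComm_t comm_;
};
```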
One can reproduce this issue with the elastic example pytorch_synthetic_benchmark_elastic.py.
In my setup, I launched 2 workers, each with 2 GPUs (4 slots in total), and killed one rank in the middle of training. Much of the time, doing so causes some of the remaining 3 ranks to crash and the others to hang at 100% GPU utilization.
A log similar to the following can be retrieved:
INFO:root:record state: ts-a2e9e55167624c36a999c60507804f01-worker-0.titan-test.svc.cluster.local[1] = FAILURE
Process 1 exit with status code 137.
Wed Aug 18 10:51:34 2021[2]<stderr>:[2021-08-18 10:51:34.286842: E /data/guozelin/test-horovod/old/horovod-0.22.1/horovod/common/operations.cc:634] [2]: Horovod background loop uncaught exception: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] Timed out waiting 30000ms for recv operation to complete
Wed Aug 18 10:51:34 2021[2]<stderr>:*** Error in `python3': double free or corruption (out): 0x00007fce170a1100 ***
Wed Aug 18 10:51:34 2021[2]<stderr>:======= Backtrace: =========
Wed Aug 18 10:51:34 2021[2]<stderr>:/usr/lib64/libc.so.6(+0x7c619)[0x7fcf8e509619]
Wed Aug 18 10:51:34 2021[2]<stderr>:/usr/local/nccl_2.9.6-1+cuda11.0_x86_64/lib/libnccl.so.2(+0x31b6c)[0x7fce21b28b6c]
Wed Aug 18 10:51:34 2021[2]<stderr>:/usr/local/nccl_2.9.6-1+cuda11.0_x86_64/lib/libnccl.so.2(ncclCommDestroy+0x82)[0x7fce21b2ef42]
Wed Aug 18 10:51:34 2021[3]<stderr>:[2021-08-18 10:51:34.287533: E /apdcephfs/private_guozelin/test-horovod/old/horovod-0.22.1/horovod/common/operations.cc:634] [3]: Horovod background loop uncaught exception: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] Timed out waiting 30000ms for recv operation to complete
Wed Aug 18 10:51:34 2021[2]<stderr>:/usr/local/lib64/python3.6/site-packages/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common11NCCLContext8ShutDownEv+0x41)[0x7fcedf450e51]
Wed Aug 18 10:51:34 2021[2]<stderr>:/usr/local/lib64/python3.6/site-packages/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(+0x94d93)[0x7fcedf3ddd93]
Wed Aug 18 10:51:34 2021[2]<stderr>:/usr/lib64/libstdc++.so.6(+0xba1bf)[0x7fcf82c4f1bf]
Wed Aug 18 10:51:34 2021[2]<stderr>:/usr/lib64/libpthread.so.0(+0x7e25)[0x7fcf8ef61e25]
Wed Aug 18 10:51:34 2021[2]<stderr>:/usr/lib64/libc.so.6(clone+0x6d)[0x7fcf8e58535d]
In some cases where a core dump does not occur, an NCCL async error may instead be reported from some of the remaining 3 ranks, with the others reporting 100% GPU utilization:
INFO:root:record state: ts-a2e9e55167624c36a999c60507804f01-worker-0.titan-test.svc.cluster.local[1] = FAILURE
Process 3 exit with status code 137.
Wed Aug 18 11:35:36 2021[0]<stderr>:terminate called after throwing an instance of 'std::logic_error'
Wed Aug 18 11:35:36 2021[0]<stderr>: what(): NCCL async error: unhandled system error
Wed Aug 18 11:35:38 2021[1]<stderr>:[2021-08-18 11:35:38.692416: E /apdcephfs/private_guozelin/test-horovod/old/horovod-0.22.1/horovod/common/operations.cc:634] [1]: Horovod background loop uncaught exception: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] Timed out waiting 30000ms for recv operation to complete
INFO:root:record state: ts-a2e9e55167624c36a999c60507804f01-launcher.titan-test.svc.cluster.local[0] = FAILURE
Process 0 exit with status code 134.
To reproduce the NCCL async error and the abort with exit code 134 (SIGABRT), one can manually add some latency in the background thread loop before nccl_context.ShutDown().
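For illustration only, here is a sketch of the kind of local change meant above; the function name ShutdownWithArtificialLatency is made up, and in Horovod the delay would sit in the background loop in operations.cc, immediately before the existing nccl_context.ShutDown() call:

```cpp
// Hypothetical reproduction aid, not a real Horovod patch: delay the
// background loop's shutdown path so the GPU-event error-check thread has
// time to react to the broken communicator first, making the abort/destroy
// race (and the resulting SIGABRT) much easier to hit.
#include <chrono>
#include <thread>

void ShutdownWithArtificialLatency() {
  // Artificial latency before NCCL resources are torn down.
  std::this_thread::sleep_for(std::chrono::seconds(5));
  // nccl_context.ShutDown();  // the existing shutdown call would follow here
}
```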
Environment:
Checklist: