
nccl error handling in elastic scenario is buggy #3111

Closed
woodlgz opened this issue Aug 16, 2021 · 1 comment · Fixed by #3112
woodlgz commented Aug 16, 2021

Environment:

  1. Framework: (TensorFlow, Keras, PyTorch, MXNet): PyTorch
  2. Framework version: 1.7.0
  3. Horovod version: 0.22.1
  4. MPI version: 4.0.3
  5. CUDA version: 10.2
  6. NCCL version: 2.9.6
  7. Python version: 3.6

Checklist:

  1. Did you search issues to find if somebody asked this question before?
  2. If your question is about hang, did you read this doc?
  3. If your question is about docker, did you read this doc?
  4. Did you check if your question is answered in the troubleshooting guide?

Bug report:
When NCCL communication is enabled in Horovod, evicting a worker instance in an elastic scenario may cause the NCCL communicators on the other workers to abort.

Specifically, there are three problems in this scenario:

  1. The GPU-operation event error check runs on a thread-pool thread and throws exceptions that are never handled there, so the program aborts. This defeats the purpose of elastic Horovod, which should survive worker failures (see the sketch after this list).
  2. Compounding problem 1, the background loop's shutdown path, which cleans up NCCL resources by calling ncclCommDestroy once the controller detects a shutdown condition (for instance, peers exiting), can race with the thread-pool error check, which calls ncclCommAbort on the same communicator, potentially causing a double-free corruption.
  3. In the elastic eviction scenario, a rank that fails to detect that its NCCL communicator is broken is prone to hang while reporting 100% GPU utilization.
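
The first two problems boil down to two general C++ hazards: an exception that escapes a thread-pool worker terminates the whole process, and two threads racing to release the same NCCL communicator corrupt the heap. Below is a minimal standalone sketch of the direction a fix could take, not Horovod's actual code: the error is captured inside the worker instead of escaping, and abort/destroy of the communicator is serialized so it runs exactly once. ErrorSink and GuardedComm are illustrative names of my own, not Horovod or NCCL APIs.

// Standalone sketch, not Horovod code: ErrorSink and GuardedComm are
// hypothetical stand-ins used only to illustrate the two points above.
#include <exception>
#include <functional>
#include <iostream>
#include <mutex>
#include <stdexcept>
#include <thread>

// Problem 1: an exception escaping a thread-pool worker calls std::terminate(),
// i.e. the SIGABRT seen above. Catch it in the worker and hand it back so the
// coordinator can trigger a graceful elastic reset instead.
class ErrorSink {
 public:
  void Record(std::exception_ptr e) {
    std::lock_guard<std::mutex> lock(mu_);
    if (first_error_ == nullptr) first_error_ = e;
  }
  std::exception_ptr Get() {
    std::lock_guard<std::mutex> lock(mu_);
    return first_error_;
  }
 private:
  std::mutex mu_;
  std::exception_ptr first_error_;
};

void RunEventCheck(ErrorSink& sink, std::function<void()> check) {
  try {
    check();  // e.g. poll the CUDA event / NCCL result here
  } catch (...) {
    sink.Record(std::current_exception());  // do not let it escape the thread
  }
}

// Problem 2: serialize the error path (ncclCommAbort) and the shutdown path
// (ncclCommDestroy) so the communicator is released exactly once.
class GuardedComm {
 public:
  void Abort()   { Release("abort"); }
  void Destroy() { Release("destroy"); }
 private:
  void Release(const char* who) {
    std::lock_guard<std::mutex> lock(mu_);
    if (released_) return;  // the second caller becomes a no-op
    released_ = true;
    std::cout << "communicator released via " << who << std::endl;
    // real code would call ncclCommAbort/ncclCommDestroy here, exactly once
  }
  std::mutex mu_;
  bool released_ = false;
};

int main() {
  ErrorSink sink;
  std::thread worker(RunEventCheck, std::ref(sink), [] {
    throw std::runtime_error("NCCL async error: unhandled system error");
  });
  worker.join();
  if (sink.Get() != nullptr) {
    std::cout << "error captured, can request an elastic reset" << std::endl;
  }

  GuardedComm comm;
  std::thread t1([&] { comm.Abort(); });    // error-check thread
  std::thread t2([&] { comm.Destroy(); });  // background-loop shutdown
  t1.join();
  t2.join();
  return 0;
}

For problem 3, one possible direction is to poll NCCL's ncclCommGetAsyncError() on the cached communicators, so a rank notices the breakage and aborts the communicator instead of hanging at 100% GPU utilization.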

One can reproduce this issue with the elastic example pytorch_synthetic_benchmark_elastic.py.
In my setup, I launched 2 workers with 2 GPUs each, yielding 4 slots in total, and killed one rank in the middle of training. Most of the time, doing so causes some of the remaining 3 ranks to crash with memory corruption and the others to hang at 100% GPU utilization.
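For reference, an elastic launch along these lines should reproduce the setup described above (the flag values and the discovery script path are placeholders for my environment):

horovodrun -np 4 --min-np 2 --max-np 4 \
    --host-discovery-script ./discover_hosts.sh \
    python pytorch_synthetic_benchmark_elastic.py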
A log similar to the following can be retrieved:

INFO:root:record state: ts-a2e9e55167624c36a999c60507804f01-worker-0.titan-test.svc.cluster.local[1] = FAILURE
Process 1 exit with status code 137.
Wed Aug 18 10:51:34 2021[2]<stderr>:[2021-08-18 10:51:34.286842: E /data/guozelin/test-horovod/old/horovod-0.22.1/horovod/common/operations.cc:634] [2]: Horovod background loop uncaught exception: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] Timed out waiting 30000ms for recv operation to complete
Wed Aug 18 10:51:34 2021[2]<stderr>:*** Error in `python3': double free or corruption (out): 0x00007fce170a1100 ***
Wed Aug 18 10:51:34 2021[2]<stderr>:======= Backtrace: =========
Wed Aug 18 10:51:34 2021[2]<stderr>:/usr/lib64/libc.so.6(+0x7c619)[0x7fcf8e509619]
Wed Aug 18 10:51:34 2021[2]<stderr>:/usr/local/nccl_2.9.6-1+cuda11.0_x86_64/lib/libnccl.so.2(+0x31b6c)[0x7fce21b28b6c]
Wed Aug 18 10:51:34 2021[2]<stderr>:/usr/local/nccl_2.9.6-1+cuda11.0_x86_64/lib/libnccl.so.2(ncclCommDestroy+0x82)[0x7fce21b2ef42]
Wed Aug 18 10:51:34 2021[3]<stderr>:[2021-08-18 10:51:34.287533: E /apdcephfs/private_guozelin/test-horovod/old/horovod-0.22.1/horovod/common/operations.cc:634] [3]: Horovod background loop uncaught exception: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] Timed out waiting 30000ms for recv operation to complete
Wed Aug 18 10:51:34 2021[2]<stderr>:/usr/local/lib64/python3.6/site-packages/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common11NCCLContext8ShutDownEv+0x41)[0x7fcedf450e51]
Wed Aug 18 10:51:34 2021[2]<stderr>:/usr/local/lib64/python3.6/site-packages/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(+0x94d93)[0x7fcedf3ddd93]
Wed Aug 18 10:51:34 2021[2]<stderr>:/usr/lib64/libstdc++.so.6(+0xba1bf)[0x7fcf82c4f1bf]
Wed Aug 18 10:51:34 2021[2]<stderr>:/usr/lib64/libpthread.so.0(+0x7e25)[0x7fcf8ef61e25]
Wed Aug 18 10:51:34 2021[2]<stderr>:/usr/lib64/libc.so.6(clone+0x6d)[0x7fcf8e58535d]

In cases where a core dump does not happen, an NCCL async error may instead be reported by some of the remaining 3 ranks, while the others hang at 100% GPU utilization.

INFO:root:record state: ts-a2e9e55167624c36a999c60507804f01-worker-0.titan-test.svc.cluster.local[1] = FAILURE
Process 3 exit with status code 137.
Wed Aug 18 11:35:36 2021[0]<stderr>:terminate called after throwing an instance of 'std::logic_error'
Wed Aug 18 11:35:36 2021[0]<stderr>:  what():  NCCL async error: unhandled system error
Wed Aug 18 11:35:38 2021[1]<stderr>:[2021-08-18 11:35:38.692416: E /apdcephfs/private_guozelin/test-horovod/old/horovod-0.22.1/horovod/common/operations.cc:634] [1]: Horovod background loop uncaught exception: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] Timed out waiting 30000ms for recv operation to complete
INFO:root:record state: ts-a2e9e55167624c36a999c60507804f01-launcher.titan-test.svc.cluster.local[0] = FAILURE
Process 0 exit with status code 134.

To reproduce the NCCL async error and the abort with exit code 134, one can manually add some latency to the background thread loop before nccl_context.ShutDown():

#if HAVE_NCCL
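  // Artificial delay, added only to reproduce the bug: leaving the NCCL
  // communicators alive after a peer has exited presumably gives the
  // error-check thread time to hit the broken communicator and throw the
  // uncaught 'NCCL async error', aborting the rank with exit code 134.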
  using namespace std::chrono_literals;
  std::this_thread::sleep_for(60s);
  nccl_context.ShutDown();
#endif

woodlgz commented Sep 2, 2021

@romerojosh @tgaddair any idea?
