
nccl error handling in elastic scenario is buggy #3111

Closed
woodlgz opened this issue Aug 16, 2021 · 1 comment · Fixed by #3112
woodlgz commented Aug 16, 2021

Environment:

  1. Framework: (TensorFlow, Keras, PyTorch, MXNet): PyTorch
  2. Framework version: 1.7.0
  3. Horovod version: 0.22.1
  4. MPI version: 4.0.3
  5. CUDA version: 10.2
  6. NCCL version: 2.9.6
  7. Python version: 3.6

Checklist:

  1. Did you search issues to find if somebody asked this question before?
  2. If your question is about hang, did you read this doc?
  3. If your question is about docker, did you read this doc?
  4. Did you check if your question is answered in the troubleshooting guide?

Bug report:
When NCCL communication is enabled in Horovod, evicting a worker instance in an elastic scenario may cause the NCCL communicators on the other workers to abort.

Specifically, there are three problems in this scenario:

  1. The GPU-operation event error check runs on a thread-pool thread and throws exceptions that are never handled there, so the program aborts. This defeats the purpose of elastic Horovod, which should survive worker failures (see the sketch after this list).
  2. Compounding problem 1, the background loop's shutdown path, which cleans up NCCL resources by calling ncclCommDestroy once the controller detects a shutdown condition (for instance, peers exiting), can race with the thread-pool error check, which calls ncclCommAbort on the same communicator, potentially causing a double-free corruption.
  3. In the elastic eviction scenario, a rank that fails to detect that its NCCL communicator is broken is prone to hang while reporting 100% GPU utilization.
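
The first two problems boil down to two general C++ hazards: an exception that escapes a thread-pool worker terminates the whole process, and two threads racing to release the same NCCL communicator corrupt the heap. Below is a minimal standalone sketch of the direction a fix could take, not Horovod's actual code: the error is captured inside the worker instead of escaping, and abort/destroy of the communicator is serialized so it runs exactly once. ErrorSink and GuardedComm are illustrative names of my own, not Horovod or NCCL APIs.

// Standalone sketch, not Horovod code: ErrorSink and GuardedComm are
// hypothetical stand-ins used only to illustrate the two points above.
#include <exception>
#include <functional>
#include <iostream>
#include <mutex>
#include <stdexcept>
#include <thread>

// Problem 1: an exception escaping a thread-pool worker calls std::terminate(),
// i.e. the SIGABRT seen above. Catch it in the worker and hand it back so the
// coordinator can trigger a graceful elastic reset instead.
class ErrorSink {
 public:
  void Record(std::exception_ptr e) {
    std::lock_guard<std::mutex> lock(mu_);
    if (first_error_ == nullptr) first_error_ = e;
  }
  std::exception_ptr Get() {
    std::lock_guard<std::mutex> lock(mu_);
    return first_error_;
  }
 private:
  std::mutex mu_;
  std::exception_ptr first_error_;
};

void RunEventCheck(ErrorSink& sink, std::function<void()> check) {
  try {
    check();  // e.g. poll the CUDA event / NCCL result here
  } catch (...) {
    sink.Record(std::current_exception());  // do not let it escape the thread
  }
}

// Problem 2: serialize the error path (ncclCommAbort) and the shutdown path
// (ncclCommDestroy) so the communicator is released exactly once.
class GuardedComm {
 public:
  void Abort()   { Release("abort"); }
  void Destroy() { Release("destroy"); }
 private:
  void Release(const char* who) {
    std::lock_guard<std::mutex> lock(mu_);
    if (released_) return;  // the second caller becomes a no-op
    released_ = true;
    std::cout << "communicator released via " << who << std::endl;
    // real code would call ncclCommAbort/ncclCommDestroy here, exactly once
  }
  std::mutex mu_;
  bool released_ = false;
};

int main() {
  ErrorSink sink;
  std::thread worker(RunEventCheck, std::ref(sink), [] {
    throw std::runtime_error("NCCL async error: unhandled system error");
  });
  worker.join();
  if (sink.Get() != nullptr) {
    std::cout << "error captured, can request an elastic reset" << std::endl;
  }

  GuardedComm comm;
  std::thread t1([&] { comm.Abort(); });    // error-check thread
  std::thread t2([&] { comm.Destroy(); });  // background-loop shutdown
  t1.join();
  t2.join();
  return 0;
}

For problem 3, one possible direction is to poll NCCL's ncclCommGetAsyncError() on the cached communicators, so a rank notices the breakage and aborts the communicator instead of hanging at 100% GPU utilization.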

One can reproduce this issue with the elastic example pytorch_synthetic_benchmark_elastic.py.
In my setup, I launched 2 workers with 2 GPUs each, yielding 4 slots in total, and killed one rank in the middle of training. Most of the time, doing so causes some of the remaining 3 ranks to crash with memory corruption and the others to hang at 100% GPU utilization.
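For reference, an elastic launch along these lines should reproduce the setup described above (the flag values and the discovery script path are placeholders for my environment):

horovodrun -np 4 --min-np 2 --max-np 4 \
    --host-discovery-script ./discover_hosts.sh \
    python pytorch_synthetic_benchmark_elastic.py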
A log similar to the following can be retrieved:

INFO:root:record state: ts-a2e9e55167624c36a999c60507804f01-worker-0.titan-test.svc.cluster.local[1] = FAILURE
Process 1 exit with status code 137.
Wed Aug 18 10:51:34 2021[2]<stderr>:[2021-08-18 10:51:34.286842: E /data/guozelin/test-horovod/old/horovod-0.22.1/horovod/common/operations.cc:634] [2]: Horovod background loop uncaught exception: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] Timed out waiting 30000ms for recv operation to complete
Wed Aug 18 10:51:34 2021[2]<stderr>:*** Error in `python3': double free or corruption (out): 0x00007fce170a1100 ***
Wed Aug 18 10:51:34 2021[2]<stderr>:======= Backtrace: =========
Wed Aug 18 10:51:34 2021[2]<stderr>:/usr/lib64/libc.so.6(+0x7c619)[0x7fcf8e509619]
Wed Aug 18 10:51:34 2021[2]<stderr>:/usr/local/nccl_2.9.6-1+cuda11.0_x86_64/lib/libnccl.so.2(+0x31b6c)[0x7fce21b28b6c]
Wed Aug 18 10:51:34 2021[2]<stderr>:/usr/local/nccl_2.9.6-1+cuda11.0_x86_64/lib/libnccl.so.2(ncclCommDestroy+0x82)[0x7fce21b2ef42]
Wed Aug 18 10:51:34 2021[3]<stderr>:[2021-08-18 10:51:34.287533: E /apdcephfs/private_guozelin/test-horovod/old/horovod-0.22.1/horovod/common/operations.cc:634] [3]: Horovod background loop uncaught exception: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] Timed out waiting 30000ms for recv operation to complete
Wed Aug 18 10:51:34 2021[2]<stderr>:/usr/local/lib64/python3.6/site-packages/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common11NCCLContext8ShutDownEv+0x41)[0x7fcedf450e51]
Wed Aug 18 10:51:34 2021[2]<stderr>:/usr/local/lib64/python3.6/site-packages/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(+0x94d93)[0x7fcedf3ddd93]
Wed Aug 18 10:51:34 2021[2]<stderr>:/usr/lib64/libstdc++.so.6(+0xba1bf)[0x7fcf82c4f1bf]
Wed Aug 18 10:51:34 2021[2]<stderr>:/usr/lib64/libpthread.so.0(+0x7e25)[0x7fcf8ef61e25]
Wed Aug 18 10:51:34 2021[2]<stderr>:/usr/lib64/libc.so.6(clone+0x6d)[0x7fcf8e58535d]

In cases where a core dump does not happen, an NCCL async error may instead be reported by some of the remaining 3 ranks, while the others hang at 100% GPU utilization.

INFO:root:record state: ts-a2e9e55167624c36a999c60507804f01-worker-0.titan-test.svc.cluster.local[1] = FAILURE
Process 3 exit with status code 137.
Wed Aug 18 11:35:36 2021[0]<stderr>:terminate called after throwing an instance of 'std::logic_error'
Wed Aug 18 11:35:36 2021[0]<stderr>:  what():  NCCL async error: unhandled system error
Wed Aug 18 11:35:38 2021[1]<stderr>:[2021-08-18 11:35:38.692416: E /apdcephfs/private_guozelin/test-horovod/old/horovod-0.22.1/horovod/common/operations.cc:634] [1]: Horovod background loop uncaught exception: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] Timed out waiting 30000ms for recv operation to complete
INFO:root:record state: ts-a2e9e55167624c36a999c60507804f01-launcher.titan-test.svc.cluster.local[0] = FAILURE
Process 0 exit with status code 134.

To reproduce the NCCL async error and the abort with exit code 134, one can manually add some latency to the background thread loop before nccl_context.ShutDown():

#if HAVE_NCCL
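  // Artificial delay, added only to reproduce the bug: leaving the NCCL
  // communicators alive after a peer has exited presumably gives the
  // error-check thread time to hit the broken communicator and throw the
  // uncaught 'NCCL async error', aborting the rank with exit code 134.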
  using namespace std::chrono_literals;
  std::this_thread::sleep_for(60s);
  nccl_context.ShutDown();
#endif

woodlgz commented Sep 2, 2021

@romerojosh @tgaddair any idea?
