All workers failed on failure with Elastic Horovod #3264

Closed
jasperzhong opened this issue Nov 8, 2021 · 4 comments · Fixed by #3267

jasperzhong commented Nov 8, 2021

Environment:

  1. Framework: (TensorFlow, Keras, PyTorch, MXNet): PyTorch
  2. Framework version: 1.10
  3. Horovod version: latest master (3efc229)
  4. MPI version:
  5. CUDA version: 10.2
  6. NCCL version: 2.7.6
  7. Python version: 3.6
  8. Spark / PySpark version:
  9. Ray version:
  10. OS and version: Ubuntu 18.04 LTS
  11. GCC version: 7.5
  12. CMake version: 3.21.3

Bug report:

We found that PR #3112 solved some of the NCCL problems, so we tested it with the latest code. However, we found that sometimes all workers fail after a single worker failure.

We have two machines, each equipped with two P100 GPUs. We ran the program on the master node (10.28.1.16) with the following command:

horovodrun -np 4 --min-np 2 -H 10.28.1.16:2,10.28.1.17:2 --start-timeout 600 python pytorch_synthetic_benchmark_elastic.py

During execution, we intentionally killed the workers on the second host (10.28.1.17) with pkill python. The workers on that host died immediately.

However, sometimes the workers on the master host also failed, exiting with status code 134. From the log, it seems that these workers did not re-initialize, since initialization_done is false. This is strange because the surviving workers should re-initialize (https://github.com/horovod/horovod/blob/master/horovod/torch/elastic/__init__.py#L48).
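
For reference, this is roughly the recovery behaviour we expect from the elastic wrapper: a failed collective should surface as a HorovodInternalError, and the surviving workers should restore their last committed state and re-initialize instead of aborting. The sketch below is a simplified illustration only, not Horovod's actual implementation; `train_step`, `state`, and `reset` are placeholders.

```python
# Simplified sketch of the expected elastic recovery loop (illustrative only,
# not Horovod's implementation). `train_step`, `state`, and `reset` stand in
# for the wrapped training function, the elastic state object, and the
# re-initialization hook.
from horovod.common.exceptions import HorovodInternalError

def elastic_retry(train_step, state, reset):
    while True:
        try:
            return train_step(state)
        except HorovodInternalError:
            # A peer failed mid-collective: roll back to the last committed
            # state before rebuilding the communicator below.
            state.restore()
        reset()
        state.on_reset()
```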

(d2l) ➜  pytorch git:(master) ✗ horovodrun -np 4 --min-np 2 --host-discovery-script ./discover_hosts.sh --start-timeout 600 --network-interface eth2 python pytorch_synthetic_benchmark_elastic.py
[0]<stdout>:Model: resnet50
[0]<stdout>:Batch size: 32
[0]<stdout>:Number of GPUs: 4
[0]<stdout>:Running warmup...
[0]<stdout>:Running benchmark...
[0]<stdout>:Iter #0: 48.1 img/sec per GPU
[0]<stderr>:[2021-11-08 15:08:30.430065: E /home/gmsheng/horovod/horovod/common/operations.cc:654] [0]: Horovod background loop uncaught exception: [/opt/conda/conda-bld/pytorch_1634272115665/work/third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [10.28.1.17]:52145
[1]<stderr>:[2021-11-08 15:08:30.430133: E /home/gmsheng/horovod/horovod/common/operations.cc:654] [1]: Horovod background loop uncaught exception: [/opt/conda/conda-bld/pytorch_1634272115665/work/third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [10.28.1.17]:49285
Process 2 exit with status code 255.
Process 3 exit with status code 255.
[0]<stderr>:python: /home/gmsheng/horovod/horovod/common/process_set.cc:20: bool horovod::common::ProcessSet::IsCurrentProcessIncluded() const: Assertion `initialization_done' failed.
[1]<stderr>:python: /home/gmsheng/horovod/horovod/common/process_set.cc:20: bool horovod::common::ProcessSet::IsCurrentProcessIncluded() const: Assertion `initialization_done' failed.
[1]<stderr>:Aborted (core dumped)
Process 1 exit with status code 134.
[0]<stderr>:Aborted (core dumped)
Process 0 exit with status code 134.
ERROR:root:failure count == 4 -> stop running
Traceback (most recent call last):
  File "/home/gmsheng/.conda/envs/d2l/bin/horovodrun", line 33, in <module>
    sys.exit(load_entry_point('horovod', 'console_scripts', 'horovodrun')())
  File "/home/gmsheng/horovod/horovod/runner/launch.py", line 770, in run_commandline
    _run(args)
  File "/home/gmsheng/horovod/horovod/runner/launch.py", line 758, in _run
    return _run_elastic(args)
  File "/home/gmsheng/horovod/horovod/runner/launch.py", line 668, in _run_elastic
    gloo_run_elastic(settings, env, args.command)
  File "/home/gmsheng/horovod/horovod/runner/gloo_run.py", line 350, in gloo_run_elastic
    launch_gloo_elastic(command, exec_command, settings, env, get_common_interfaces, rendezvous)
  File "/home/gmsheng/horovod/horovod/runner/gloo_run.py", line 337, in launch_gloo_elastic
    .format(name=name, code=exit_code))
RuntimeError: Horovod detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: 10.28.1.17[0]
Exit code: 255

Expected behaviour:

The program should continue to execute normally on the master host. Sometimes it does succeed, as shown in the log below.

(d2l) ➜  pytorch git:(master) ✗ horovodrun -np 4 --min-np 2 --host-discovery-script ./discover_hosts.sh --start-timeout 600 --network-interface eth2 python pytorch_synthetic_benchmark_elastic.py
[0]<stdout>:Model: resnet50
[0]<stdout>:Batch size: 32
[0]<stdout>:Number of GPUs: 4
[0]<stdout>:Running warmup...
[0]<stdout>:Running benchmark...
[0]<stdout>:Iter #0: 52.9 img/sec per GPU
[0]<stdout>:Iter #1: 30.3 img/sec per GPU
[0]<stderr>:[2021-11-08 13:56:12. 47430: E /home/gmsheng/horovod/horovod/common/operations.cc:654] [0]: Horovod background loop uncaught exception: [/opt/conda/conda-bld/pytorch_1634272115665/work/third_party/gloo/gloo/transport/tcp/pair.cc:589] Read error [10.28.1.17]:58789: Connection reset by peer
[1]<stderr>:[2021-11-08 13:56:12.115609: E /home/gmsheng/horovod/horovod/common/operations.cc:654] [1]: Horovod background loop uncaught exception: [/opt/conda/conda-bld/pytorch_1634272115665/work/third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [10.28.1.17]:26242
Process 3 exit with status code 255.
Process 2 exit with status code 255.
WARNING:root:blacklist failing host: 10.28.1.17
[0]<stdout>:Iter #1: 140.5 img/sec per GPU
[0]<stdout>:Iter #2: 142.1 img/sec per GPU
[0]<stdout>:Iter #3: 141.8 img/sec per GPU
[0]<stdout>:Iter #4: 139.9 img/sec per GPU
[0]<stdout>:Iter #5: 138.3 img/sec per GPU
[0]<stdout>:Iter #6: 135.8 img/sec per GPU
[0]<stdout>:Iter #7: 139.1 img/sec per GPU
[0]<stdout>:Iter #8: 139.2 img/sec per GPU
[0]<stdout>:Iter #9: 138.0 img/sec per GPU
[0]<stdout>:Img/sec per GPU: 121.6 +-74.6
[0]<stdout>:Total img/sec on 2 GPU(s): 243.3 +-149.2

Other information:

We built Horovod from source. Here is our installation command.

HOROVOD_DEBUG=1 CXX=/usr/bin/g++ CC=/usr/bin/gcc HOROVOD_WITHOUT_MPI=1 HOROVOD_WITH_GLOO=1  HOROVOD_NCCL_HOME=/usr/local/cuda HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_WITHOUT_MXNET=1 pip install --no-cache-dir -v -e .
~ horovodrun --check-build
Horovod v0.23.0:

Available Frameworks:
    [ ] TensorFlow
    [X] PyTorch
    [ ] MXNet

Available Controllers:
    [ ] MPI
    [X] Gloo

Available Tensor Operations:
    [X] NCCL
    [ ] DDL
    [ ] CCL
    [ ] MPI
    [X] Gloo

cc: @woodlgz @tgaddair

jasperzhong added the bug label Nov 8, 2021

jasperzhong commented Nov 9, 2021

FYI: we added a log statement where the worker catches HorovodInternalError (https://github.com/horovod/horovod/blob/master/horovod/common/elastic.py#L165). When the bug occurred, that log line was never printed, which indicates the worker never raised a HorovodInternalError at all.
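
For context, the instrumentation was of this form (a minimal sketch; the handler structure is paraphrased from horovod/common/elastic.py, and `func` / `state` are placeholders rather than the real signatures):

```python
# Minimal sketch of the debug logging described above (illustrative only).
import logging

from horovod.common.exceptions import HorovodInternalError

def instrumented_step(func, state):
    try:
        return func(state)
    except HorovodInternalError:
        # If the background loop's failure had propagated as an exception,
        # this message would appear in the worker's log before recovery.
        logging.warning('caught HorovodInternalError, restoring state')
        state.restore()
        raise
```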


woodlgz commented Nov 10, 2021

@vycezhong sorry for the late reply, I will take a look at this as soon as possible.


woodlgz commented Nov 10, 2021

This is probably because the background thread is finalizing while the training program keeps enqueuing tensors; during enqueue, ProcessSet::IsCurrentProcessIncluded() is called and asserts that process_set.initialization_done == true.
A failed assertion aborts the program rather than throwing an exception.
I will verify a fix for this scenario later.
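
That would also explain the exit codes and the missing HorovodInternalError log noted above: a failed C/C++ assert() calls abort(), which kills the whole process with SIGABRT before any Python-level except block can run. A small sketch of the exit-code arithmetic (illustrative, not Horovod code):

```python
import signal

# Shells and launchers conventionally report a process killed by signal N as
# exit status 128 + N. SIGABRT is 6 on Linux, matching "status code 134".
assert 128 + signal.SIGABRT == 134

# An exception raised from the background loop (e.g. HorovodInternalError),
# by contrast, can propagate into the Python training loop and be caught by
# the elastic wrapper, letting the surviving workers recover.
```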

woodlgz mentioned this issue Nov 10, 2021
@jasperzhong

@woodlgz I have tested your PR and it works. Thanks for your help!
