All workers failed on failure with Elastic Horovod #3264

Closed
jasperzhong opened this issue Nov 8, 2021 · 4 comments · Fixed by #3267

jasperzhong commented Nov 8, 2021

Environment:

  1. Framework: (TensorFlow, Keras, PyTorch, MXNet): PyTorch
  2. Framework version: 1.10
  3. Horovod version: latest master (3efc229)
  4. MPI version:
  5. CUDA version: 10.2
  6. NCCL version: 2.7.6
  7. Python version: 3.6
  8. Spark / PySpark version:
  9. Ray version:
  10. OS and version: Ubuntu 18.04 LTS
  11. GCC version: 7.5
  12. CMake version: 3.21.3

Bug report:

We found that PR #3112 solved some of the NCCL problems, so we tested it with the latest code. However, we found that sometimes all workers fail after a single worker failure.

We have two machines, each equipped with two P100 GPUs. We ran the program on the master node (10.28.1.16) with the following command:

horovodrun -np 4 --min-np 2 -H 10.28.1.16:2,10.28.1.17:2 --start-timeout 600 python pytorch_synthetic_benchmark_elastic.py

During execution, we intentionally killed the workers on the second host (10.28.1.17) with pkill python. The workers on that host died immediately.

However, sometimes the workers on the master host also failed, exiting with status code 134. From the log, it seems that these workers did not re-initialize, since initialization_done is false. This is strange because the surviving workers should re-initialize (https://github.com/horovod/horovod/blob/master/horovod/torch/elastic/__init__.py#L48).
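
For reference, this is roughly the recovery behaviour we expect from the elastic wrapper: a failed collective should surface as a HorovodInternalError, and the surviving workers should restore their last committed state and re-initialize instead of aborting. The sketch below is a simplified illustration only, not Horovod's actual implementation; `train_step`, `state`, and `reset` are placeholders.

```python
# Simplified sketch of the expected elastic recovery loop (illustrative only,
# not Horovod's implementation). `train_step`, `state`, and `reset` stand in
# for the wrapped training function, the elastic state object, and the
# re-initialization hook.
from horovod.common.exceptions import HorovodInternalError

def elastic_retry(train_step, state, reset):
    while True:
        try:
            return train_step(state)
        except HorovodInternalError:
            # A peer failed mid-collective: roll back to the last committed
            # state before rebuilding the communicator below.
            state.restore()
        reset()
        state.on_reset()
```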

(d2l) ➜  pytorch git:(master) ✗ horovodrun -np 4 --min-np 2 --host-discovery-script ./discover_hosts.sh --start-timeout 600 --network-interface eth2 python pytorch_synthetic_benchmark_elastic.py
[0]<stdout>:Model: resnet50
[0]<stdout>:Batch size: 32
[0]<stdout>:Number of GPUs: 4
[0]<stdout>:Running warmup...
[0]<stdout>:Running benchmark...
[0]<stdout>:Iter #0: 48.1 img/sec per GPU
[0]<stderr>:[2021-11-08 15:08:30.430065: E /home/gmsheng/horovod/horovod/common/operations.cc:654] [0]: Horovod background loop uncaught exception: [/opt/conda/conda-bld/pytorch_1634272115665/work/third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [10.28.1.17]:52145
[1]<stderr>:[2021-11-08 15:08:30.430133: E /home/gmsheng/horovod/horovod/common/operations.cc:654] [1]: Horovod background loop uncaught exception: [/opt/conda/conda-bld/pytorch_1634272115665/work/third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [10.28.1.17]:49285
Process 2 exit with status code 255.
Process 3 exit with status code 255.
[0]<stderr>:python: /home/gmsheng/horovod/horovod/common/process_set.cc:20: bool horovod::common::ProcessSet::IsCurrentProcessIncluded() const: Assertion `initialization_done' failed.
[1]<stderr>:python: /home/gmsheng/horovod/horovod/common/process_set.cc:20: bool horovod::common::ProcessSet::IsCurrentProcessIncluded() const: Assertion `initialization_done' failed.
[1]<stderr>:Aborted (core dumped)
Process 1 exit with status code 134.
[0]<stderr>:Aborted (core dumped)
Process 0 exit with status code 134.
ERROR:root:failure count == 4 -> stop running
Traceback (most recent call last):
  File "/home/gmsheng/.conda/envs/d2l/bin/horovodrun", line 33, in <module>
    sys.exit(load_entry_point('horovod', 'console_scripts', 'horovodrun')())
  File "/home/gmsheng/horovod/horovod/runner/launch.py", line 770, in run_commandline
    _run(args)
  File "/home/gmsheng/horovod/horovod/runner/launch.py", line 758, in _run
    return _run_elastic(args)
  File "/home/gmsheng/horovod/horovod/runner/launch.py", line 668, in _run_elastic
    gloo_run_elastic(settings, env, args.command)
  File "/home/gmsheng/horovod/horovod/runner/gloo_run.py", line 350, in gloo_run_elastic
    launch_gloo_elastic(command, exec_command, settings, env, get_common_interfaces, rendezvous)
  File "/home/gmsheng/horovod/horovod/runner/gloo_run.py", line 337, in launch_gloo_elastic
    .format(name=name, code=exit_code))
RuntimeError: Horovod detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: 10.28.1.17[0]
Exit code: 255

Expected behaviour:

The program should continue to execute normally on the master host. Sometimes it does succeed, as shown in the log below.

(d2l) ➜  pytorch git:(master) ✗ horovodrun -np 4 --min-np 2 --host-discovery-script ./discover_hosts.sh --start-timeout 600 --network-interface eth2 python pytorch_synthetic_benchmark_elastic.py
[0]<stdout>:Model: resnet50
[0]<stdout>:Batch size: 32
[0]<stdout>:Number of GPUs: 4
[0]<stdout>:Running warmup...
[0]<stdout>:Running benchmark...
[0]<stdout>:Iter #0: 52.9 img/sec per GPU
[0]<stdout>:Iter #1: 30.3 img/sec per GPU
[0]<stderr>:[2021-11-08 13:56:12. 47430: E /home/gmsheng/horovod/horovod/common/operations.cc:654] [0]: Horovod background loop uncaught exception: [/opt/conda/conda-bld/pytorch_1634272115665/work/third_party/gloo/gloo/transport/tcp/pair.cc:589] Read error [10.28.1.17]:58789: Connection reset by peer
[1]<stderr>:[2021-11-08 13:56:12.115609: E /home/gmsheng/horovod/horovod/common/operations.cc:654] [1]: Horovod background loop uncaught exception: [/opt/conda/conda-bld/pytorch_1634272115665/work/third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [10.28.1.17]:26242
Process 3 exit with status code 255.
Process 2 exit with status code 255.
WARNING:root:blacklist failing host: 10.28.1.17
[0]<stdout>:Iter #1: 140.5 img/sec per GPU
[0]<stdout>:Iter #2: 142.1 img/sec per GPU
[0]<stdout>:Iter #3: 141.8 img/sec per GPU
[0]<stdout>:Iter #4: 139.9 img/sec per GPU
[0]<stdout>:Iter #5: 138.3 img/sec per GPU
[0]<stdout>:Iter #6: 135.8 img/sec per GPU
[0]<stdout>:Iter #7: 139.1 img/sec per GPU
[0]<stdout>:Iter #8: 139.2 img/sec per GPU
[0]<stdout>:Iter #9: 138.0 img/sec per GPU
[0]<stdout>:Img/sec per GPU: 121.6 +-74.6
[0]<stdout>:Total img/sec on 2 GPU(s): 243.3 +-149.2

Other information:

We built Horovod from source. Here is our installation command.

HOROVOD_DEBUG=1 CXX=/usr/bin/g++ CC=/usr/bin/gcc HOROVOD_WITHOUT_MPI=1 HOROVOD_WITH_GLOO=1  HOROVOD_NCCL_HOME=/usr/local/cuda HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_WITHOUT_MXNET=1 pip install --no-cache-dir -v -e .
~ horovodrun --check-build
Horovod v0.23.0:

Available Frameworks:
    [ ] TensorFlow
    [X] PyTorch
    [ ] MXNet

Available Controllers:
    [ ] MPI
    [X] Gloo

Available Tensor Operations:
    [X] NCCL
    [ ] DDL
    [ ] CCL
    [ ] MPI
    [X] Gloo

cc: @woodlgz @tgaddair

jasperzhong added the bug label Nov 8, 2021

jasperzhong commented Nov 9, 2021

FYI: we added a log statement where the worker catches HorovodInternalError (https://github.com/horovod/horovod/blob/master/horovod/common/elastic.py#L165). When the bug occurred, that log line was never printed, which indicates the worker never raised a HorovodInternalError at all.
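
For context, the instrumentation was of this form (a minimal sketch; the handler structure is paraphrased from horovod/common/elastic.py, and `func` / `state` are placeholders rather than the real signatures):

```python
# Minimal sketch of the debug logging described above (illustrative only).
import logging

from horovod.common.exceptions import HorovodInternalError

def instrumented_step(func, state):
    try:
        return func(state)
    except HorovodInternalError:
        # If the background loop's failure had propagated as an exception,
        # this message would appear in the worker's log before recovery.
        logging.warning('caught HorovodInternalError, restoring state')
        state.restore()
        raise
```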


woodlgz commented Nov 10, 2021

@vycezhong sorry for the late reply, I will take a look at this as soon as possible.


woodlgz commented Nov 10, 2021

This is probably because the background thread is finalizing while the training program keeps enqueuing tensors; during enqueue, ProcessSet::IsCurrentProcessIncluded() is called and asserts that process_set.initialization_done == true.
A failed assertion aborts the program rather than throwing an exception.
I will verify a fix for this scenario later.
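
That would also explain the exit codes and the missing HorovodInternalError log noted above: a failed C/C++ assert() calls abort(), which kills the whole process with SIGABRT before any Python-level except block can run. A small sketch of the exit-code arithmetic (illustrative, not Horovod code):

```python
import signal

# Shells and launchers conventionally report a process killed by signal N as
# exit status 128 + N. SIGABRT is 6 on Linux, matching "status code 134".
assert 128 + signal.SIGABRT == 134

# An exception raised from the background loop (e.g. HorovodInternalError),
# by contrast, can propagate into the Python training loop and be caught by
# the elastic wrapper, letting the surviving workers recover.
```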

woodlgz mentioned this issue Nov 10, 2021
@jasperzhong

@woodlgz I have tested your PR and it works. Thanks for your help!
