
Unexpected Worker Failure when using Elastic Horovod + Process Sets #4021

Pranavug opened this issue Feb 7, 2024 · 0 comments
Pranavug commented Feb 7, 2024

Environment:

  1. Framework: PyTorch
  2. Framework version: 1.9.0+cu102
  3. Horovod version: 0.28.1
  4. MPI version: N/A
  5. CUDA version: cu102
  6. NCCL version: 2708
  7. Python version: 3.9.18
  8. Spark / PySpark version: N/A
  9. Ray version: N/A
  10. OS and version: Linux SMP x86_64 x86_64 x86_64 GNU/Linux
  11. GCC version: 7.3.1
  12. CMake version: 3.14

Bug report:

import horovod.torch as hvd
import time

# Two dynamic process sets: rank 1 by itself, and ranks 0 and 2 together.
worker_1_process_set = hvd.ProcessSet([1])
worker_2_process_set = hvd.ProcessSet([0, 2])

# Dynamic process sets must be enabled at init time before they can be added.
hvd.init(process_sets="dynamic")
hvd.add_process_set(worker_1_process_set)
hvd.add_process_set(worker_2_process_set)

@hvd.elastic.run
def main(state):
    rank = hvd.rank()

    # Every rank just loops forever so that a single worker can be killed
    # from another terminal while the others keep running.
    while True:
        print(f"Sleeping for 1 second: {rank}", flush=True)
        time.sleep(1)


if __name__ == '__main__':
    print(f"Initialized with rank {hvd.rank()}", flush=True)

    # Initialize the TorchState
    state = hvd.elastic.TorchState()

    print(f"Running main with rank {hvd.rank()}", flush=True)
    main(state)
    print(f"Finished running main with rank {hvd.rank()}", flush=True)

    print(f"Joined with rank {hvd.rank()}", flush=True)

I am running the code above with elastic Horovod and dynamic process sets, launching all 3 workers on a single node with the command shown below. After I kill one of the worker processes from another terminal, all of the remaining processes get terminated as well, as shown in the log below. If I run the same workflow with the same command but WITHOUT process sets, killing one worker leaves the remaining 2 workers running. My expectation was that with elastic Horovod a single worker failure should not bring down the remaining workers, and that is exactly what happens without process sets, but not with them. What could be the reason? Is this a bug, or am I missing something in how process sets should be used with elastic mode? Any help would be appreciated.
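
For comparison, the variant WITHOUT process sets that survives a single worker being killed is the same script with the process set setup removed; a minimal sketch:

import horovod.torch as hvd
import time

# Plain init: no process_sets="dynamic" and no add_process_set calls.
hvd.init()

@hvd.elastic.run
def main(state):
    rank = hvd.rank()
    while True:
        print(f"Sleeping for 1 second: {rank}", flush=True)
        time.sleep(1)

if __name__ == '__main__':
    state = hvd.elastic.TorchState()
    main(state)

With this version, after killing one worker the remaining 2 workers keep running, as described above.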

Similar issues:

  1. uncaught exception in elastic mode with pytorch #2484
(horovod-setup) (miniconda3) [pgadikar@ip-10-20-1-15 experiments]$ horovodrun -np 3 --min-np 2 --host-discovery-script discover-hosts.sh --elastic-timeout 5 --network-interfaces eth0,lo python master-child-exp.py
[1]<stdout>:Initialized with rank 1
[1]<stdout>:Running main with rank 1
[2]<stdout>:Initialized with rank 2
[2]<stdout>:Running main with rank 2
[0]<stdout>:Initialized with rank 0
[0]<stdout>:Running main with rank 0
[1]<stdout>:Sleeping for 1 second: 1
[2]<stdout>:Sleeping for 1 second: 2
[0]<stdout>:Sleeping for 1 second: 0
[2]<stderr>:[2024-02-07 04:16:27.910743: E /tmp/pip-install-ozxdndi9/horovod_e8e6eba6ed5e495cb7b495d7bb552c01/horovod/common/operations.cc:697] [2]: Horovod background loop uncaught exception: [/tmp/pip-install-ozxdndi9/horovod_e8e6eba6ed5e495cb7b495d7bb552c01/third_party/compatible_gloo/gloo/transport/tcp/pair.cc:589] Read error [10.20.1.15]:20903: Connection reset by peer
[0]<stderr>:[2024-02-07 04:16:27.910752: E /tmp/pip-install-ozxdndi9/horovod_e8e6eba6ed5e495cb7b495d7bb552c01/horovod/common/operations.cc:697] [0]: Horovod background loop uncaught exception: [/tmp/pip-install-ozxdndi9/horovod_e8e6eba6ed5e495cb7b495d7bb552c01/third_party/compatible_gloo/gloo/transport/tcp/pair.cc:589] Read error [10.20.1.15]:49541: Connection reset by peer
[2]<stderr>:terminate called after throwing an instance of 'gloo::IoException'
[0]<stderr>:terminate called after throwing an instance of 'gloo::IoException'
[2]<stderr>:  what():  [/tmp/pip-install-ozxdndi9/horovod_e8e6eba6ed5e495cb7b495d7bb552c01/third_party/compatible_gloo/gloo/transport/tcp/pair.cc:589] Read error [10.20.1.15]:20903: Connection reset by peer
[0]<stderr>:  what():  [/tmp/pip-install-ozxdndi9/horovod_e8e6eba6ed5e495cb7b495d7bb552c01/third_party/compatible_gloo/gloo/transport/tcp/pair.cc:589] Read error [10.20.1.15]:49541: Connection reset by peer
Process 1 exit with status code 143.
Process 2 exit with status code 134.
Process 0 exit with status code 134.
ERROR:root:failure count == 3 -> stop running
Traceback (most recent call last):
  File "/home/pgadikar/miniconda3/envs/horovod-setup/bin/horovodrun", line 8, in <module>
    sys.exit(run_commandline())
  File "/home/pgadikar/miniconda3/envs/horovod-setup/lib/python3.9/site-packages/horovod/runner/launch.py", line 837, in run_commandline
    _run(args)
  File "/home/pgadikar/miniconda3/envs/horovod-setup/lib/python3.9/site-packages/horovod/runner/launch.py", line 825, in _run
    return _run_elastic(args)
  File "/home/pgadikar/miniconda3/envs/horovod-setup/lib/python3.9/site-packages/horovod/runner/launch.py", line 738, in _run_elastic
    return gloo_run_elastic(settings, env, args.run_func if args.run_func else args.command, executable)
  File "/home/pgadikar/miniconda3/envs/horovod-setup/lib/python3.9/site-packages/horovod/runner/gloo_run.py", line 380, in gloo_run_elastic
    return launch_gloo_elastic(command_or_func, exec_command, settings, env, get_common_interfaces, rendezvous, executable)
  File "/home/pgadikar/miniconda3/envs/horovod-setup/lib/python3.9/site-packages/horovod/runner/gloo_run.py", line 351, in launch_gloo_elastic
    raise RuntimeError('Horovod detected that one or more processes exited with non-zero '
RuntimeError: Horovod detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: ip-10-20-1-15.us-east-2.compute.internal[1]
Exit code: 143

(horovod-setup) (miniconda3) [pgadikar@ip-10-20-1-15 experiments]$ 
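
For completeness, discover-hosts.sh is presumably just a static single-node discovery script. Elastic Horovod expects it to print one "hostname:slots" line per available host, so for this run it would be along these lines (the hostname and the 3 slots are assumptions matching the log and -np 3):

#!/bin/bash
# Static host discovery: one "hostname:slots" line per available host.
echo "ip-10-20-1-15:3"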
Pranavug added the bug label on Feb 7, 2024