
Process on second node is being killed after exactly 15 minutes #1114

Closed
odellus opened this issue Feb 25, 2023 · 8 comments

odellus commented Feb 25, 2023

System Info

- `Accelerate` version: 0.16.0
- Platform: Linux-5.14.0-1051-oem-x86_64-with-glibc2.31
- Python version: 3.10.9
- Numpy version: 1.23.5
- PyTorch version (GPU?): 1.13.1 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - mixed_precision: fp16
        - use_cpu: False
        - dynamo_backend: NO
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 2
        - main_process_ip: 192.168.5.5
        - main_process_port: 2333
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - deepspeed_config: {'deepspeed_multinode_launcher': 'standard', 'gradient_accumulation_steps': 0, 'gradient_clipping': 1.0, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'cpu', 'zero3_init_flag': True, 'zero3_save_16bit_model': True, 'zero_stage': 3}
        - fsdp_config: {}
        - megatron_lm_config: {}
        - downcast_bf16: no

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Steps to reproduce the behavior (a rough command sketch of steps 5–7 follows the list):

  1. Buy two Dell workstations with a single A6000 each
  2. Enable passwordless ssh between the two
  3. Install miniconda on both machines and create mirrored virtual environments on both
  4. Clone peft
  5. Run accelerate config --config_file <config_out.yaml> on the main node
  6. Change node rank in config file and scp this configuration onto the second node
  7. Run accelerate launch --config_file <config_out.yaml> examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py
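For reference, a rough sketch of steps 5–7 as shell commands; the config file name ds_multinode.yaml and the hostname second-node are placeholders rather than values from this setup:

# on the main node (machine_rank 0)
accelerate config --config_file ds_multinode.yaml
# copy the config to the second node and flip only its machine rank
scp ds_multinode.yaml second-node:~/ds_multinode.yaml
ssh second-node "sed -i 's/machine_rank: 0/machine_rank: 1/' ~/ds_multinode.yaml"
# launch from the main node; the DeepSpeed pdsh launcher reaches the second node over ssh
accelerate launch --config_file ds_multinode.yaml \
    examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py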

Expected behavior

The process on the second node is not killed by launch.py after 15 minutes.

odellus commented Feb 25, 2023

I'm unable to figure out why the process running on the second node is killed by launch.py after 15 minutes whenever I try to run the conditional generation script in the peft examples directory. It doesn't seem to be an issue with peft. A second, highly related issue is that I'm not able to find any logs explaining why the process is being killed. Is there a health-check endpoint I can send a GET request to? I go into more detail about the issue here.

muellerzr (Collaborator) commented

@odellus you can't use the exact same config file on each node; there is a machine_rank entry that needs to be 1 on the second node: https://github.com/huggingface/accelerate/blob/main/tests/test_configs/0_12_0.yaml#L6

Are you setting that on the non-main machine?
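For illustration, a quick check that the two copies differ only in the rank; the file name ds_multinode.yaml is the placeholder from the sketch above, and the values mirror the config reported in the system info:

# on the main node
grep -E 'machine_rank|num_machines|main_process_ip' ds_multinode.yaml
machine_rank: 0
num_machines: 2
main_process_ip: 192.168.5.5
# the same grep on the second node should print machine_rank: 1 and otherwise identical values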

odellus commented Feb 25, 2023 via email

odellus commented Feb 25, 2023

Unfortunately, I'm seeing the same error even when I correctly set the node rank in the second machine's config.

lifebioprodai: [2023-02-25 12:46:44,946] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3877160
lifebioprodai: [2023-02-25 12:46:44,946] [ERROR] [launch.py:324:sigkill_handler] ['/home/thomas/miniconda3/envs/nlp/bin/python', '-u', '/home/thomas/src/peft/examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py'] exits with return code = -6
pdsh@lifebiodevai: lifebioprodai: ssh exited with exit code 250

ssh-ing into the second node and running ps aux | grep conda shows that the second node is getting the correct rank:

/home/thomas/miniconda3/envs/nlp/bin/python -u \
-m deepspeed.launcher.launch \
--world_info=eyJsaWZlYmlvZGV2YWkiOiBbMF0sICJsaWZlYmlvcHJvZGFpIjogWzBdfQ== \
--node_rank=1 \
--master_addr=192.168.6.5 \
--master_port=8989 \
--no_local_rank \
/home/thomas/src/peft/examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py

Is there a good way to debug this remote process on the second node? I've always gotten by with copying and pasting problematic code into the REPL and haven't built up much experience using pdb, so if you have any helpful hints on debugging remote Python processes launched by accelerate, I'd love to hear them!
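One thing I may try in the meantime is dumping the stuck rank's Python stack with py-spy; a rough sketch, assuming I can install it into the same environment on the second node:

pip install py-spy
# on the second node, find the training process and dump its current stack
ps aux | grep peft_lora_seq2seq_accelerate_ds_zero3_offload.py
py-spy dump --pid <PID>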

odellus commented Mar 13, 2023

@muellerzr how do you debug remote processes launched with pdsh? I've found a few examples of how to attach pdb to running processes on Stack Overflow, but any help you could give on how you do it personally would be well received.

odellus commented Mar 14, 2023

I tried it with examples/by_feature/deepspeed_with_config_support.py in the accelerate repo. I ran accelerate config on both machines and gave the same answers except for the machine rank; I also exported the accelerate configuration to YAML, edited the rank before scp-ing it to the second node, then changed the rank back again on the first. I fed the script the stanfordnlp/SHP dataset to train on just for fun. Same behavior as the PEFT example from before: subprocess killed, return code = -6.

This seems to be connected to NCCL timing out during preprocessing. Is there a way to bump up the timeout? How does the accelerate team debug remote processes launched with pdsh? Any help would be most welcome.
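In the meantime I'll try to surface NCCL's own logs before the kill; a sketch of the environment variables I plan to set on both nodes before relaunching (the interface name here is a guess for this setup):

export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
export NCCL_SOCKET_IFNAME=eno1   # guess: the NIC carrying the 192.168.x.x addresses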

accelerate launch --config_file /home/thomas/src/accelerate/examples/by_feature/ds_zero3_cpu.yaml /home/thomas/src/accelerate/examples/by_feature/deepspeed_with_config_support.py --dataset_name stanfordnlp/SHP
[2023-03-14 19:08:40,589] [INFO] [runner.py:454:main] Using IP address of 192.168.6.5 for node lifebiodevai
[2023-03-14 19:08:40,591] [INFO] [multinode_runner.py:65:get_cmd] Running on the following workers: lifebiodevai,lifebioprodai
[2023-03-14 19:08:40,591] [INFO] [runner.py:548:main] cmd = pdsh -S -f 1024 -w lifebiodevai,lifebioprodai export PYTHONPATH=/home/thomas/src/accelerate/examples/by_feature; export SHELL=/bin/bash; export COLORTERM=truecolor; export TERM_PROGRAM_VERSION=1.70.2; export CONDA_EXE=/home/thomas/miniconda3/bin/conda; export _CE_M=; export PWD=/home/thomas/src/accelerate/examples/by_feature; export LOGNAME=thomas; export XDG_SESSION_TYPE=tty; export CONDA_PREFIX=/home/thomas/miniconda3/envs/nlp; export JUPYTER_SERVER_URL=http://lifebiodevai:8872/; export VSCODE_GIT_ASKPASS_NODE=/home/thomas/.vscode-server/bin/e4503b30fc78200f846c62cf8091b76ff5547662/node; export MOTD_SHOWN=pam; export LINES=29; export HOME=/home/thomas; export LANG=en_US.UTF-8; export COLUMNS=120; export GIT_ASKPASS=/home/thomas/.vscode-server/bin/e4503b30fc78200f846c62cf8091b76ff5547662/extensions/git/dist/askpass.sh; export PYDEVD_USE_FRAME_EVAL=NO; export VSCODE_GIT_ASKPASS_EXTRA_ARGS=; export XDG_SESSION_CLASS=user; export JUPYTER_SERVER_ROOT=/home/thomas; export TERM=xterm-256color; export _CE_CONDA=; export USER=thomas; export VSCODE_GIT_IPC_HANDLE=/run/user/1002/vscode-git-e710c8e75d.sock; export CONDA_SHLVL=2; export SHLVL=1; export PYXTERM_DIMENSIONS=80x25; export XDG_SESSION_ID=605; export CONDA_PYTHON_EXE=/home/thomas/miniconda3/bin/python; export LD_LIBRARY_PATH=:/usr/local/cuda/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib64; export XDG_RUNTIME_DIR=/run/user/1002; export CONDA_DEFAULT_ENV=nlp; export VSCODE_GIT_ASKPASS_MAIN=/home/thomas/.vscode-server/bin/e4503b30fc78200f846c62cf8091b76ff5547662/extensions/git/dist/askpass-main.js; export XDG_DATA_DIRS=/usr/local/share:/usr/share:/var/lib/snapd/desktop; export BROWSER=/home/thomas/.vscode-server/bin/e4503b30fc78200f846c62cf8091b76ff5547662/bin/helpers/browser.sh; export PATH=/home/thomas/miniconda3/envs/nlp/bin:/home/thomas/miniconda3/condabin:/home/thomas/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda/bin:/opt/mssql-tools/bin; export DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1002/bus; export CONDA_PREFIX_1=/home/thomas/miniconda3; export OLDPWD=/home/thomas/src; export TERM_PROGRAM=vscode; export VSCODE_IPC_HOOK_CLI=/run/user/1002/vscode-ipc-f03caaee-1d2c-4b01-8f00-57b95d1a0043.sock; export _=/home/thomas/miniconda3/envs/nlp/bin/accelerate; export ACCELERATE_MIXED_PRECISION=bf16; export ACCELERATE_CONFIG_DS_FIELDS=deepspeed_hostfile,deepspeed_multinode_launcher,gradient_accumulation_steps,gradient_clipping,offload_optimizer_device,offload_param_device,zero3_init_flag,zero3_save_16bit_model,zero_stage,mixed_precision; export ACCELERATE_USE_DEEPSPEED=true; export ACCELERATE_DEEPSPEED_ZERO_STAGE=3; export ACCELERATE_GRADIENT_ACCUMULATION_STEPS=4; export ACCELERATE_GRADIENT_CLIPPING=1.0; export ACCELERATE_DEEPSPEED_OFFLOAD_OPTIMIZER_DEVICE=cpu; export ACCELERATE_DEEPSPEED_OFFLOAD_PARAM_DEVICE=cpu; export ACCELERATE_DEEPSPEED_ZERO3_INIT=true; export ACCELERATE_DEEPSPEED_ZERO3_SAVE_16BIT_MODEL=true;  cd /home/thomas/src/accelerate/examples/by_feature; /home/thomas/miniconda3/envs/nlp/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsaWZlYmlvZGV2YWkiOiBbMF0sICJsaWZlYmlvcHJvZGFpIjogWzBdfQ== --node_rank=%n --master_addr=192.168.6.5 --master_port=8989 --no_local_rank /home/thomas/src/accelerate/examples/by_feature/deepspeed_with_config_support.py --dataset_name 'stanfordnlp/SHP'
lifebiodevai: [2023-03-14 19:08:42,572] [INFO] [launch.py:142:main] WORLD INFO DICT: {'lifebiodevai': [0], 'lifebioprodai': [0]}
lifebiodevai: [2023-03-14 19:08:42,573] [INFO] [launch.py:148:main] nnodes=2, num_local_procs=1, node_rank=0
lifebiodevai: [2023-03-14 19:08:42,573] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'lifebiodevai': [0], 'lifebioprodai': [1]})
lifebiodevai: [2023-03-14 19:08:42,573] [INFO] [launch.py:162:main] dist_world_size=2
lifebiodevai: [2023-03-14 19:08:42,573] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
lifebioprodai: [2023-03-14 19:08:43,049] [INFO] [launch.py:142:main] WORLD INFO DICT: {'lifebiodevai': [0], 'lifebioprodai': [0]}
lifebioprodai: [2023-03-14 19:08:43,049] [INFO] [launch.py:148:main] nnodes=2, num_local_procs=1, node_rank=1
lifebioprodai: [2023-03-14 19:08:43,050] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'lifebiodevai': [0], 'lifebioprodai': [1]})
lifebioprodai: [2023-03-14 19:08:43,050] [INFO] [launch.py:162:main] dist_world_size=2
lifebioprodai: [2023-03-14 19:08:43,050] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
lifebiodevai: [2023-03-14 19:08:45,784] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
lifebiodevai: 03/14/2023 19:08:46 - INFO - __main__ - Distributed environment: DEEPSPEED  Backend: nccl
lifebiodevai: Num processes: 2
lifebiodevai: Process index: 0
lifebiodevai: Local process index: 0
lifebiodevai: Device: cuda:0
lifebiodevai: Mixed precision type: bf16
lifebiodevai: ds_config: {'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 4, 'zero_optimization': {'stage': 3, 'offload_optimizer': {'device': 'cpu'}, 'offload_param': {'device': 'cpu'}, 'stage3_gather_16bit_weights_on_model_save': True}, 'gradient_clipping': 1.0, 'steps_per_print': inf, 'bf16': {'enabled': True}, 'fp16': {'enabled': False}}
lifebiodevai: 
lifebioprodai: 03/14/2023 19:08:46 - INFO - __main__ - Distributed environment: DEEPSPEED  Backend: nccl
lifebioprodai: Num processes: 2
lifebioprodai: Process index: 1
lifebioprodai: Local process index: 0
lifebioprodai: Device: cuda:0
lifebioprodai: Mixed precision type: bf16
lifebioprodai: ds_config: {'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 4, 'zero_optimization': {'stage': 3, 'offload_optimizer': {'device': 'cpu'}, 'offload_param': {'device': 'cpu'}, 'stage3_gather_16bit_weights_on_model_save': True}, 'gradient_clipping': 1.0, 'steps_per_print': inf, 'bf16': {'enabled': True}, 'fp16': {'enabled': False}}
lifebioprodai: 
lifebioprodai: [2023-03-14 19:23:47,131] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 137679
lifebioprodai: [2023-03-14 19:23:47,131] [ERROR] [launch.py:324:sigkill_handler] ['/home/thomas/miniconda3/envs/nlp/bin/python', '-u', '/home/thomas/src/accelerate/examples/by_feature/deepspeed_with_config_support.py', '--dataset_name', 'stanfordnlp/SHP'] exits with return code = -6
pdsh@lifebiodevai: lifebioprodai: ssh exited with exit code 250

odellus commented Mar 16, 2023

Looks like debugging remote subprocesses is not something the accelerate team has experience with. I will work on better logging using what I found in #1178, an issue I also hit when trying to debug this timeout. By the way, I can run the PEFT example in single-node mode on both nodes, but for some reason they're failing to communicate with each other, and after 900 seconds NCCL (my best guess) decides that's enough. I'm going to have a look at nccl-tests (rough sketch below) to see if the problem lies there. Thank you!
#1067 (comment)
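A rough sketch of the nccl-tests run I have in mind; the MPI_HOME path is a guess and needs to match the local MPI install:

git clone https://github.com/NVIDIA/nccl-tests.git && cd nccl-tests
make MPI=1 MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi
# one process per node, one GPU each, sweeping message sizes from 8 B to 128 MB
mpirun -np 2 -H lifebiodevai,lifebioprodai ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1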

github-actions bot commented Apr 9, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
