Process on second node is being killed after exactly 15 minutes #1114
I'm unable to figure out why the process running on the second node is killed by launch.py after exactly 15 minutes whenever I try to run the conditional generation script in the peft examples directory. It doesn't seem to be an issue with peft itself. A second, closely related problem is that I can't find any logs explaining why the process is being killed. Is there a health-check endpoint I can send a GET request to? I go into more detail about the issue here.
@odellus you can't use the exact same config file on each node; there is a machine_rank entry that needs to be 1 on the second node: https://github.com/huggingface/accelerate/blob/main/tests/test_configs/0_12_0.yaml#L6. Are you setting that on the non-main machine?
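For concreteness, a sketch of what the second node's config might look like (field names as produced by accelerate config; the IP and port here are taken from the launch logs later in this thread, and everything else is an assumption for this two-node setup):

```yaml
# Second node's config: identical to the main node's file except machine_rank.
distributed_type: DEEPSPEED
num_machines: 2
machine_rank: 1            # 0 on the main node, 1 here
main_process_ip: 192.168.6.5
main_process_port: 8989
```

Every other field (mixed precision, DeepSpeed stage, offload settings) should match the main node's config exactly.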
Derp. Will try again with this insight later today. Thank you!
Unfortunately I'm seeing the same error even when I correctly set the node rank in the second machine's config:

lifebioprodai: [2023-02-25 12:46:44,946] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3877160
lifebioprodai: [2023-02-25 12:46:44,946] [ERROR] [launch.py:324:sigkill_handler] ['/home/thomas/miniconda3/envs/nlp/bin/python', '-u', '/home/thomas/src/peft/examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py'] exits with return code = -6
pdsh@lifebiodevai: lifebioprodai: ssh exited with exit code 250
/home/thomas/miniconda3/envs/nlp/bin/python -u \
-m deepspeed.launcher.launch \
--world_info=eyJsaWZlYmlvZGV2YWkiOiBbMF0sICJsaWZlYmlvcHJvZGFpIjogWzBdfQ== \
--node_rank=1 \
--master_addr=192.168.6.5 \
--master_port=8989 \
--no_local_rank \
  /home/thomas/src/peft/examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py

Is there a good way to debug this remote process on the second node? I've always gotten by with copying and pasting problematic code into the REPL and haven't built up much experience using remote debuggers.
@muellerzr how do you debug remote processes launched with pdsh? I've found a few examples of how to attach pdb to running processes on Stack Overflow, but any help you could give on how you do it personally would be well received.
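One low-tech option for a hung remote rank (this is my own suggestion, not a documented accelerate workflow) is the standard-library faulthandler module: register a signal handler in the training script on every node, then from a plain ssh session on the stuck machine run kill -USR1 <pid> to dump every thread's Python stack to stderr without killing the process. The 840-second value below is an assumption chosen to fire just under the observed 15-minute kill:

```python
import faulthandler
import signal

# Add near the top of the training script on every node. Afterwards,
# `kill -USR1 <pid>` from any ssh session prints all thread stacks to
# stderr while the process keeps running.
faulthandler.register(signal.SIGUSR1)

# Also dump automatically if the process is still alive after 840 s,
# just before the suspected 900 s collective timeout; exit=False means
# the process is left running after the dump.
faulthandler.dump_traceback_later(timeout=840, exit=False)
```

The dump lands in the pdsh-captured stderr, so it shows up in the same stream as the launch.py log lines above.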
I tried it again. This seems to be connected to NCCL timing out during preprocessing. Is there a way to bump up the timeout? How does the accelerate team debug remote processes launched with pdsh? Any help would be most welcome.

accelerate launch --config_file /home/thomas/src/accelerate/examples/by_feature/ds_zero3_cpu.yaml /home/thomas/src/accelerate/examples/by_feature/deepspeed_with_config_support.py --dataset_name stanfordnlp/SHP
[2023-03-14 19:08:40,589] [INFO] [runner.py:454:main] Using IP address of 192.168.6.5 for node lifebiodevai
[2023-03-14 19:08:40,591] [INFO] [multinode_runner.py:65:get_cmd] Running on the following workers: lifebiodevai,lifebioprodai
[2023-03-14 19:08:40,591] [INFO] [runner.py:548:main] cmd = pdsh -S -f 1024 -w lifebiodevai,lifebioprodai export PYTHONPATH=/home/thomas/src/accelerate/examples/by_feature; export SHELL=/bin/bash; export COLORTERM=truecolor; export TERM_PROGRAM_VERSION=1.70.2; export CONDA_EXE=/home/thomas/miniconda3/bin/conda; export _CE_M=; export PWD=/home/thomas/src/accelerate/examples/by_feature; export LOGNAME=thomas; export XDG_SESSION_TYPE=tty; export CONDA_PREFIX=/home/thomas/miniconda3/envs/nlp; export JUPYTER_SERVER_URL=http://lifebiodevai:8872/; export VSCODE_GIT_ASKPASS_NODE=/home/thomas/.vscode-server/bin/e4503b30fc78200f846c62cf8091b76ff5547662/node; export MOTD_SHOWN=pam; export LINES=29; export HOME=/home/thomas; export LANG=en_US.UTF-8; export COLUMNS=120; export GIT_ASKPASS=/home/thomas/.vscode-server/bin/e4503b30fc78200f846c62cf8091b76ff5547662/extensions/git/dist/askpass.sh; export PYDEVD_USE_FRAME_EVAL=NO; export VSCODE_GIT_ASKPASS_EXTRA_ARGS=; export XDG_SESSION_CLASS=user; export JUPYTER_SERVER_ROOT=/home/thomas; export TERM=xterm-256color; export _CE_CONDA=; export USER=thomas; export VSCODE_GIT_IPC_HANDLE=/run/user/1002/vscode-git-e710c8e75d.sock; export CONDA_SHLVL=2; export SHLVL=1; export PYXTERM_DIMENSIONS=80x25; export XDG_SESSION_ID=605; export CONDA_PYTHON_EXE=/home/thomas/miniconda3/bin/python; export LD_LIBRARY_PATH=:/usr/local/cuda/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib64; export XDG_RUNTIME_DIR=/run/user/1002; export CONDA_DEFAULT_ENV=nlp; export VSCODE_GIT_ASKPASS_MAIN=/home/thomas/.vscode-server/bin/e4503b30fc78200f846c62cf8091b76ff5547662/extensions/git/dist/askpass-main.js; export XDG_DATA_DIRS=/usr/local/share:/usr/share:/var/lib/snapd/desktop; export BROWSER=/home/thomas/.vscode-server/bin/e4503b30fc78200f846c62cf8091b76ff5547662/bin/helpers/browser.sh; export 
PATH=/home/thomas/miniconda3/envs/nlp/bin:/home/thomas/miniconda3/condabin:/home/thomas/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda/bin:/opt/mssql-tools/bin; export DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1002/bus; export CONDA_PREFIX_1=/home/thomas/miniconda3; export OLDPWD=/home/thomas/src; export TERM_PROGRAM=vscode; export VSCODE_IPC_HOOK_CLI=/run/user/1002/vscode-ipc-f03caaee-1d2c-4b01-8f00-57b95d1a0043.sock; export _=/home/thomas/miniconda3/envs/nlp/bin/accelerate; export ACCELERATE_MIXED_PRECISION=bf16; export ACCELERATE_CONFIG_DS_FIELDS=deepspeed_hostfile,deepspeed_multinode_launcher,gradient_accumulation_steps,gradient_clipping,offload_optimizer_device,offload_param_device,zero3_init_flag,zero3_save_16bit_model,zero_stage,mixed_precision; export ACCELERATE_USE_DEEPSPEED=true; export ACCELERATE_DEEPSPEED_ZERO_STAGE=3; export ACCELERATE_GRADIENT_ACCUMULATION_STEPS=4; export ACCELERATE_GRADIENT_CLIPPING=1.0; export ACCELERATE_DEEPSPEED_OFFLOAD_OPTIMIZER_DEVICE=cpu; export ACCELERATE_DEEPSPEED_OFFLOAD_PARAM_DEVICE=cpu; export ACCELERATE_DEEPSPEED_ZERO3_INIT=true; export ACCELERATE_DEEPSPEED_ZERO3_SAVE_16BIT_MODEL=true; cd /home/thomas/src/accelerate/examples/by_feature; /home/thomas/miniconda3/envs/nlp/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsaWZlYmlvZGV2YWkiOiBbMF0sICJsaWZlYmlvcHJvZGFpIjogWzBdfQ== --node_rank=%n --master_addr=192.168.6.5 --master_port=8989 --no_local_rank /home/thomas/src/accelerate/examples/by_feature/deepspeed_with_config_support.py --dataset_name 'stanfordnlp/SHP'
lifebiodevai: [2023-03-14 19:08:42,572] [INFO] [launch.py:142:main] WORLD INFO DICT: {'lifebiodevai': [0], 'lifebioprodai': [0]}
lifebiodevai: [2023-03-14 19:08:42,573] [INFO] [launch.py:148:main] nnodes=2, num_local_procs=1, node_rank=0
lifebiodevai: [2023-03-14 19:08:42,573] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'lifebiodevai': [0], 'lifebioprodai': [1]})
lifebiodevai: [2023-03-14 19:08:42,573] [INFO] [launch.py:162:main] dist_world_size=2
lifebiodevai: [2023-03-14 19:08:42,573] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
lifebioprodai: [2023-03-14 19:08:43,049] [INFO] [launch.py:142:main] WORLD INFO DICT: {'lifebiodevai': [0], 'lifebioprodai': [0]}
lifebioprodai: [2023-03-14 19:08:43,049] [INFO] [launch.py:148:main] nnodes=2, num_local_procs=1, node_rank=1
lifebioprodai: [2023-03-14 19:08:43,050] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'lifebiodevai': [0], 'lifebioprodai': [1]})
lifebioprodai: [2023-03-14 19:08:43,050] [INFO] [launch.py:162:main] dist_world_size=2
lifebioprodai: [2023-03-14 19:08:43,050] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
lifebiodevai: [2023-03-14 19:08:45,784] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
lifebiodevai: 03/14/2023 19:08:46 - INFO - __main__ - Distributed environment: DEEPSPEED Backend: nccl
lifebiodevai: Num processes: 2
lifebiodevai: Process index: 0
lifebiodevai: Local process index: 0
lifebiodevai: Device: cuda:0
lifebiodevai: Mixed precision type: bf16
lifebiodevai: ds_config: {'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 4, 'zero_optimization': {'stage': 3, 'offload_optimizer': {'device': 'cpu'}, 'offload_param': {'device': 'cpu'}, 'stage3_gather_16bit_weights_on_model_save': True}, 'gradient_clipping': 1.0, 'steps_per_print': inf, 'bf16': {'enabled': True}, 'fp16': {'enabled': False}}
lifebiodevai:
lifebioprodai: 03/14/2023 19:08:46 - INFO - __main__ - Distributed environment: DEEPSPEED Backend: nccl
lifebioprodai: Num processes: 2
lifebioprodai: Process index: 1
lifebioprodai: Local process index: 0
lifebioprodai: Device: cuda:0
lifebioprodai: Mixed precision type: bf16
lifebioprodai: ds_config: {'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 4, 'zero_optimization': {'stage': 3, 'offload_optimizer': {'device': 'cpu'}, 'offload_param': {'device': 'cpu'}, 'stage3_gather_16bit_weights_on_model_save': True}, 'gradient_clipping': 1.0, 'steps_per_print': inf, 'bf16': {'enabled': True}, 'fp16': {'enabled': False}}
lifebioprodai:
lifebioprodai: [2023-03-14 19:23:47,131] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 137679
lifebioprodai: [2023-03-14 19:23:47,131] [ERROR] [launch.py:324:sigkill_handler] ['/home/thomas/miniconda3/envs/nlp/bin/python', '-u', '/home/thomas/src/accelerate/examples/by_feature/deepspeed_with_config_support.py', '--dataset_name', 'stanfordnlp/SHP'] exits with return code = -6
pdsh@lifebiodevai: lifebioprodai: ssh exited with exit code 250
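One way to get more signal before the silent kill (a sketch under the assumption that NCCL is what's timing out) is to export NCCL's documented debug variables on both nodes before launching, so the hanging collective logs its own setup and errors:

```python
import os

# Sketch, not a confirmed fix: build an environment with NCCL's debug
# variables set, then hand it to whatever launches the training process
# on each node.
env = dict(os.environ)
env["NCCL_DEBUG"] = "INFO"             # log transport/ring setup and errors
env["NCCL_DEBUG_SUBSYS"] = "INIT,NET"  # focus on init and networking

# e.g. (commented out because it depends on the local paths):
# import subprocess
# subprocess.run(["accelerate", "launch",
#                 "--config_file", "ds_zero3_cpu.yaml",
#                 "deepspeed_with_config_support.py"], env=env)
```

Separately, if memory serves, accelerate lets you raise the collective timeout itself by passing an InitProcessGroupKwargs handler with a larger timeout to the Accelerator constructor; worth checking the accelerate docs for the exact signature.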
Looks like debugging remote subprocesses is not something the accelerate team has experience with. I will work on better logging with what I found in #1178, an issue I also faced while trying to debug the timeout. BTW, I can run the PEFT example in single-node mode on both nodes, but for some reason they fail to communicate, and after 900 seconds NCCL (best guess) decides that's enough. I'm going to have a look at nccl-tests to see if the problem lies there. Thank you!
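Before reaching for nccl-tests, a quicker sanity check for "failing to communicate" is to confirm from the second node that the master's rendezvous port even accepts TCP connections (the host and port below are taken from the launch command earlier in the thread; this checks basic reachability only, not NCCL's own data-plane transports):

```python
import socket

def reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Run from the second node while the main node's launcher is waiting:
# print(reachable("192.168.6.5", 8989))
```

If this returns False, the problem is firewalling or routing between the nodes rather than anything in accelerate or DeepSpeed.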
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info

Information

Tasks

- `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)

Reproduction

Steps to reproduce the behavior:

1. `accelerate config --config_file <config_out.yaml>` on the main node
2. `accelerate launch --config_file <config_out.yaml> examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py`

Expected behavior