
Process on second node is being killed after exactly 15 minutes #1114

Closed
odellus opened this issue Feb 25, 2023 · 8 comments

odellus commented Feb 25, 2023

System Info

- `Accelerate` version: 0.16.0
- Platform: Linux-5.14.0-1051-oem-x86_64-with-glibc2.31
- Python version: 3.10.9
- Numpy version: 1.23.5
- PyTorch version (GPU?): 1.13.1 (True)
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - mixed_precision: fp16
        - use_cpu: False
        - dynamo_backend: NO
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 2
        - main_process_ip: 192.168.5.5
        - main_process_port: 2333
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - deepspeed_config: {'deepspeed_multinode_launcher': 'standard', 'gradient_accumulation_steps': 0, 'gradient_clipping': 1.0, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'cpu', 'zero3_init_flag': True, 'zero3_save_16bit_model': True, 'zero_stage': 3}
        - fsdp_config: {}
        - megatron_lm_config: {}
        - downcast_bf16: no

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Steps to reproduce the behavior (a rough command sketch of steps 5–7 follows the list):

  1. Buy two Dell workstations with a single A6000 each
  2. Enable passwordless ssh between the two
  3. Install miniconda on both machines and create mirrored virtual environments on both
  4. Clone peft
  5. Run accelerate config --config_file <config_out.yaml> on the main node
  6. Change node rank in config file and scp this configuration onto the second node
  7. Run accelerate launch --config_file <config_out.yaml> examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py
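For reference, a rough sketch of steps 5–7 as shell commands; the config file name ds_multinode.yaml and the hostname second-node are placeholders rather than values from this setup:

# on the main node (machine_rank 0)
accelerate config --config_file ds_multinode.yaml
# copy the config to the second node and flip only its machine rank
scp ds_multinode.yaml second-node:~/ds_multinode.yaml
ssh second-node "sed -i 's/machine_rank: 0/machine_rank: 1/' ~/ds_multinode.yaml"
# launch from the main node; the DeepSpeed pdsh launcher reaches the second node over ssh
accelerate launch --config_file ds_multinode.yaml \
    examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py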

Expected behavior

The process on the second node is not killed by launch.py after 15 minutes.

odellus commented Feb 25, 2023

I'm unable to figure out why the process running on the second node is killed by launch.py after 15 minutes whenever I try to run the conditional generation script in the peft examples directory. It doesn't seem to be an issue with peft. A second, highly related issue is that I'm not able to find any logs explaining why the process is being killed. Is there a health-check endpoint I can send a GET request to? I go into more detail about the issue here.

muellerzr (Collaborator) commented

@odellus you can't use the exact same config file on each node; there is a machine_rank entry that needs to be 1 on the second node: https://github.com/huggingface/accelerate/blob/main/tests/test_configs/0_12_0.yaml#L6

Are you setting that on the non-main machine?
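For illustration, a quick check that the two copies differ only in the rank; the file name ds_multinode.yaml is the placeholder from the sketch above, and the values mirror the config reported in the system info:

# on the main node
grep -E 'machine_rank|num_machines|main_process_ip' ds_multinode.yaml
machine_rank: 0
num_machines: 2
main_process_ip: 192.168.5.5
# the same grep on the second node should print machine_rank: 1 and otherwise identical values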

odellus commented Feb 25, 2023 via email

odellus commented Feb 25, 2023

Unfortunately, I'm seeing the same error even when I correctly set the node rank in the second machine's config.

lifebioprodai: [2023-02-25 12:46:44,946] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3877160
lifebioprodai: [2023-02-25 12:46:44,946] [ERROR] [launch.py:324:sigkill_handler] ['/home/thomas/miniconda3/envs/nlp/bin/python', '-u', '/home/thomas/src/peft/examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py'] exits with return code = -6
pdsh@lifebiodevai: lifebioprodai: ssh exited with exit code 250

ssh-ing into the second node and running ps aux | grep conda shows that the second node is getting the correct rank:

/home/thomas/miniconda3/envs/nlp/bin/python -u \
-m deepspeed.launcher.launch \
--world_info=eyJsaWZlYmlvZGV2YWkiOiBbMF0sICJsaWZlYmlvcHJvZGFpIjogWzBdfQ== \
--node_rank=1 \
--master_addr=192.168.6.5 \
--master_port=8989 \
--no_local_rank \
/home/thomas/src/peft/examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py

Is there a good way to debug this remote process on the second node? I've always gotten by with copying and pasting problematic code into the REPL and haven't built up much experience using pdb, so if you have any helpful hints on debugging remote Python processes launched by accelerate, I'd love to hear them!
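One thing I may try in the meantime is dumping the stuck rank's Python stack with py-spy; a rough sketch, assuming I can install it into the same environment on the second node:

pip install py-spy
# on the second node, find the training process and dump its current stack
ps aux | grep peft_lora_seq2seq_accelerate_ds_zero3_offload.py
py-spy dump --pid <PID>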

odellus commented Mar 13, 2023

@muellerzr how do you debug remote processes launched with pdsh? I've found a few examples of how to attach pdb to running processes on Stack Overflow, but any help you could give on how you do it personally would be well received.

odellus commented Mar 14, 2023

I tried it with examples/by_feature/deepspeed_with_config_support.py in the accelerate repo. I ran accelerate config on both machines and gave the same answers except for the machine rank; I also exported the accelerate configuration to YAML, edited the rank before scp-ing it to the second node, then changed the rank back again on the first. I fed the script the stanfordnlp/SHP dataset to train on just for fun. Same behavior as the PEFT example from before: subprocess killed, return code = -6.

This seems to be connected to NCCL timing out during preprocessing. Is there a way to bump up the timeout? How does the accelerate team debug remote processes launched with pdsh? Any help would be most welcome.
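In the meantime I'll try to surface NCCL's own logs before the kill; a sketch of the environment variables I plan to set on both nodes before relaunching (the interface name here is a guess for this setup):

export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
export NCCL_SOCKET_IFNAME=eno1   # guess: the NIC carrying the 192.168.x.x addresses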

accelerate launch --config_file /home/thomas/src/accelerate/examples/by_feature/ds_zero3_cpu.yaml /home/thomas/src/accelerate/examples/by_feature/deepspeed_with_config_support.py --dataset_name stanfordnlp/SHP
[2023-03-14 19:08:40,589] [INFO] [runner.py:454:main] Using IP address of 192.168.6.5 for node lifebiodevai
[2023-03-14 19:08:40,591] [INFO] [multinode_runner.py:65:get_cmd] Running on the following workers: lifebiodevai,lifebioprodai
[2023-03-14 19:08:40,591] [INFO] [runner.py:548:main] cmd = pdsh -S -f 1024 -w lifebiodevai,lifebioprodai export PYTHONPATH=/home/thomas/src/accelerate/examples/by_feature; export SHELL=/bin/bash; export COLORTERM=truecolor; export TERM_PROGRAM_VERSION=1.70.2; export CONDA_EXE=/home/thomas/miniconda3/bin/conda; export _CE_M=; export PWD=/home/thomas/src/accelerate/examples/by_feature; export LOGNAME=thomas; export XDG_SESSION_TYPE=tty; export CONDA_PREFIX=/home/thomas/miniconda3/envs/nlp; export JUPYTER_SERVER_URL=http://lifebiodevai:8872/; export VSCODE_GIT_ASKPASS_NODE=/home/thomas/.vscode-server/bin/e4503b30fc78200f846c62cf8091b76ff5547662/node; export MOTD_SHOWN=pam; export LINES=29; export HOME=/home/thomas; export LANG=en_US.UTF-8; export COLUMNS=120; export GIT_ASKPASS=/home/thomas/.vscode-server/bin/e4503b30fc78200f846c62cf8091b76ff5547662/extensions/git/dist/askpass.sh; export PYDEVD_USE_FRAME_EVAL=NO; export VSCODE_GIT_ASKPASS_EXTRA_ARGS=; export XDG_SESSION_CLASS=user; export JUPYTER_SERVER_ROOT=/home/thomas; export TERM=xterm-256color; export _CE_CONDA=; export USER=thomas; export VSCODE_GIT_IPC_HANDLE=/run/user/1002/vscode-git-e710c8e75d.sock; export CONDA_SHLVL=2; export SHLVL=1; export PYXTERM_DIMENSIONS=80x25; export XDG_SESSION_ID=605; export CONDA_PYTHON_EXE=/home/thomas/miniconda3/bin/python; export LD_LIBRARY_PATH=:/usr/local/cuda/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib64; export XDG_RUNTIME_DIR=/run/user/1002; export CONDA_DEFAULT_ENV=nlp; export VSCODE_GIT_ASKPASS_MAIN=/home/thomas/.vscode-server/bin/e4503b30fc78200f846c62cf8091b76ff5547662/extensions/git/dist/askpass-main.js; export XDG_DATA_DIRS=/usr/local/share:/usr/share:/var/lib/snapd/desktop; export BROWSER=/home/thomas/.vscode-server/bin/e4503b30fc78200f846c62cf8091b76ff5547662/bin/helpers/browser.sh; export PATH=/home/thomas/miniconda3/envs/nlp/bin:/home/thomas/miniconda3/condabin:/home/thomas/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda/bin:/opt/mssql-tools/bin; export DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1002/bus; export CONDA_PREFIX_1=/home/thomas/miniconda3; export OLDPWD=/home/thomas/src; export TERM_PROGRAM=vscode; export VSCODE_IPC_HOOK_CLI=/run/user/1002/vscode-ipc-f03caaee-1d2c-4b01-8f00-57b95d1a0043.sock; export _=/home/thomas/miniconda3/envs/nlp/bin/accelerate; export ACCELERATE_MIXED_PRECISION=bf16; export ACCELERATE_CONFIG_DS_FIELDS=deepspeed_hostfile,deepspeed_multinode_launcher,gradient_accumulation_steps,gradient_clipping,offload_optimizer_device,offload_param_device,zero3_init_flag,zero3_save_16bit_model,zero_stage,mixed_precision; export ACCELERATE_USE_DEEPSPEED=true; export ACCELERATE_DEEPSPEED_ZERO_STAGE=3; export ACCELERATE_GRADIENT_ACCUMULATION_STEPS=4; export ACCELERATE_GRADIENT_CLIPPING=1.0; export ACCELERATE_DEEPSPEED_OFFLOAD_OPTIMIZER_DEVICE=cpu; export ACCELERATE_DEEPSPEED_OFFLOAD_PARAM_DEVICE=cpu; export ACCELERATE_DEEPSPEED_ZERO3_INIT=true; export ACCELERATE_DEEPSPEED_ZERO3_SAVE_16BIT_MODEL=true;  cd /home/thomas/src/accelerate/examples/by_feature; /home/thomas/miniconda3/envs/nlp/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsaWZlYmlvZGV2YWkiOiBbMF0sICJsaWZlYmlvcHJvZGFpIjogWzBdfQ== --node_rank=%n --master_addr=192.168.6.5 --master_port=8989 --no_local_rank /home/thomas/src/accelerate/examples/by_feature/deepspeed_with_config_support.py --dataset_name 'stanfordnlp/SHP'
lifebiodevai: [2023-03-14 19:08:42,572] [INFO] [launch.py:142:main] WORLD INFO DICT: {'lifebiodevai': [0], 'lifebioprodai': [0]}
lifebiodevai: [2023-03-14 19:08:42,573] [INFO] [launch.py:148:main] nnodes=2, num_local_procs=1, node_rank=0
lifebiodevai: [2023-03-14 19:08:42,573] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'lifebiodevai': [0], 'lifebioprodai': [1]})
lifebiodevai: [2023-03-14 19:08:42,573] [INFO] [launch.py:162:main] dist_world_size=2
lifebiodevai: [2023-03-14 19:08:42,573] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
lifebioprodai: [2023-03-14 19:08:43,049] [INFO] [launch.py:142:main] WORLD INFO DICT: {'lifebiodevai': [0], 'lifebioprodai': [0]}
lifebioprodai: [2023-03-14 19:08:43,049] [INFO] [launch.py:148:main] nnodes=2, num_local_procs=1, node_rank=1
lifebioprodai: [2023-03-14 19:08:43,050] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'lifebiodevai': [0], 'lifebioprodai': [1]})
lifebioprodai: [2023-03-14 19:08:43,050] [INFO] [launch.py:162:main] dist_world_size=2
lifebioprodai: [2023-03-14 19:08:43,050] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
lifebiodevai: [2023-03-14 19:08:45,784] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
lifebiodevai: 03/14/2023 19:08:46 - INFO - __main__ - Distributed environment: DEEPSPEED  Backend: nccl
lifebiodevai: Num processes: 2
lifebiodevai: Process index: 0
lifebiodevai: Local process index: 0
lifebiodevai: Device: cuda:0
lifebiodevai: Mixed precision type: bf16
lifebiodevai: ds_config: {'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 4, 'zero_optimization': {'stage': 3, 'offload_optimizer': {'device': 'cpu'}, 'offload_param': {'device': 'cpu'}, 'stage3_gather_16bit_weights_on_model_save': True}, 'gradient_clipping': 1.0, 'steps_per_print': inf, 'bf16': {'enabled': True}, 'fp16': {'enabled': False}}
lifebiodevai: 
lifebioprodai: 03/14/2023 19:08:46 - INFO - __main__ - Distributed environment: DEEPSPEED  Backend: nccl
lifebioprodai: Num processes: 2
lifebioprodai: Process index: 1
lifebioprodai: Local process index: 0
lifebioprodai: Device: cuda:0
lifebioprodai: Mixed precision type: bf16
lifebioprodai: ds_config: {'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 4, 'zero_optimization': {'stage': 3, 'offload_optimizer': {'device': 'cpu'}, 'offload_param': {'device': 'cpu'}, 'stage3_gather_16bit_weights_on_model_save': True}, 'gradient_clipping': 1.0, 'steps_per_print': inf, 'bf16': {'enabled': True}, 'fp16': {'enabled': False}}
lifebioprodai: 
lifebioprodai: [2023-03-14 19:23:47,131] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 137679
lifebioprodai: [2023-03-14 19:23:47,131] [ERROR] [launch.py:324:sigkill_handler] ['/home/thomas/miniconda3/envs/nlp/bin/python', '-u', '/home/thomas/src/accelerate/examples/by_feature/deepspeed_with_config_support.py', '--dataset_name', 'stanfordnlp/SHP'] exits with return code = -6
pdsh@lifebiodevai: lifebioprodai: ssh exited with exit code 250

odellus commented Mar 16, 2023

Looks like debugging remote subprocesses is not something the accelerate team has experience with. I will work on better logging using what I found in #1178, an issue I also hit when trying to debug this timeout. By the way, I can run the PEFT example in single-node mode on both nodes, but for some reason they're failing to communicate with each other, and after 900 seconds NCCL (my best guess) decides that's enough. I'm going to have a look at nccl-tests (rough sketch below) to see if the problem lies there. Thank you!
#1067 (comment)
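A rough sketch of the nccl-tests run I have in mind; the MPI_HOME path is a guess and needs to match the local MPI install:

git clone https://github.com/NVIDIA/nccl-tests.git && cd nccl-tests
make MPI=1 MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi
# one process per node, one GPU each, sweeping message sizes from 8 B to 128 MB
mpirun -np 2 -H lifebiodevai,lifebioprodai ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1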

github-actions bot commented Apr 9, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
