[BUG] DeepSpeed hangs during evaluation under multi-GPU #5394
Comments
The same problem.
@kai-0430 Can you provide the output of
@jomayeri Sure. In the 4x A100 setup, the GPUs are interconnected with NVLink, but the hang always occurs whether or not NCCL_P2P_DISABLE=1 is set. Here is another py-spy dump, taken with setting (1):
Thread 36455 (idle): "MainThread"
backward (torch/autograd/__init__.py:266)
backward (torch/_tensor.py:522)
backward (deepspeed/runtime/fp16/loss_scaler.py:63)
backward (deepspeed/runtime/zero/stage_1_and_2.py:2051)
backward (deepspeed/runtime/engine.py:1976)
wrapped_fn (deepspeed/utils/nvtx.py:15)
backward (accelerate/utils/deepspeed.py:166)
backward (accelerate/accelerator.py:1995)
training_step (transformers/trainer.py:3045)
_inner_training_loop (transformers/trainer.py:2118)
train (transformers/trainer.py:1780)
main (llama2_ds_v3.py:232)
<module> (llama2_ds_v3.py:240)
Thread 36635 (idle): "Thread-1"
wait (threading.py:331)
wait (threading.py:629)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 1020 (idle): "Thread-11"
wait (threading.py:331)
wait (threading.py:629)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 2867 (idle): "Thread-15 (_pin_memory_loop)"
select (selectors.py:415)
wait (multiprocessing/connection.py:947)
_poll (multiprocessing/connection.py:440)
poll (multiprocessing/connection.py:257)
get (multiprocessing/queues.py:113)
do_one_step (torch/utils/data/_utils/pin_memory.py:30)
_pin_memory_loop (torch/utils/data/_utils/pin_memory.py:53)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 2938 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 2939 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 2940 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 2941 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 2942 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 2943 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 2944 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 2945 (idle): "QueueFeederThread"
wait (threading.py:327)
_feed (multiprocessing/queues.py:231)
run (threading.py:982)
_bootstrap_inner (threading.py:1045)
_bootstrap (threading.py:1002)
Thread 2963 (idle)
Thread 2964 (idle)
Thread 2962 (active)
_flatten_dense_tensors (torch/_utils.py:526)
allreduce_bucket (deepspeed/runtime/zero/stage_1_and_2.py:1477)
allreduce_and_copy_with_multiple_ranks (deepspeed/runtime/zero/stage_1_and_2.py:1000)
allreduce_and_scatter (deepspeed/runtime/zero/stage_1_and_2.py:1027)
average_tensor (deepspeed/runtime/zero/stage_1_and_2.py:1123)
reduce_ipg_grads (deepspeed/runtime/zero/stage_1_and_2.py:1363)
reduce_independent_p_g_buckets_and_remove_grads (deepspeed/runtime/zero/stage_1_and_2.py:928)
reduce_ready_partitions_and_remove_grads (deepspeed/runtime/zero/stage_1_and_2.py:1412)
reduce_partition_and_remove_grads (deepspeed/runtime/zero/stage_1_and_2.py:899)
backward (torch/autograd/__init__.py:266)
backward (torch/utils/checkpoint.py:319)
apply (torch/autograd/function.py:289)
Thread 2965 (idle)
This seems to be a systems issue. If you run without DeepSpeed, does the hang also occur?
Thanks for your reply, @jomayeri!
I guess the issue might actually be in accelerate / transformers, so I filed a related issue here. Have you tried using PyTorch's native FSDP API for distributed parallel training?
No, I haven't. Maybe I'll try it in the next few days.
Edit - 1
The same problem occurs when using ZeRO-2 with offloading.
Describe the bug
I am trying to train Llama2-7B in fp16 on 4 V100 GPUs.
I use ZeRO-3 without offloading, with the HuggingFace Trainer.
However, training hangs during the first evaluation (validation), or right after the first evaluation completes.
The process is then killed due to an NCCL timeout:
It is strange, since the forward/backward passes during training look fine. Why does it hang during evaluation?
What I've tried:
To Reproduce
Here is the ds_config.json file:
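(The original config was not captured in this excerpt. Below is a representative ZeRO-3 config without offloading, with fp16 enabled and "auto" batch sizes for use with the HuggingFace Trainer. It is only an illustrative sketch, not the exact file used in the report.)

{
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto"
}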
Here is my code:
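(The original llama2_ds_v3.py is likewise not reproduced here. Below is a minimal sketch of a comparable setup: a causal LM trained in fp16 with periodic evaluation, and the DeepSpeed config passed through TrainingArguments. The model name, dummy data, and hyperparameters are placeholders, not the values actually used.)

from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"    # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tiny dummy dataset so the sketch is self-contained; the real data is not shown in the issue.
raw = Dataset.from_dict({"text": ["hello world"] * 64})
tokenized = raw.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                    remove_columns=["text"])
train_ds = tokenized
eval_ds = tokenized.select(range(8))

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    fp16=True,
    evaluation_strategy="steps",   # the hang is reported around the first evaluation
    eval_steps=50,
    deepspeed="ds_config.json",    # ZeRO-3 config as sketched above
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()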
Expected behavior
Train and validate without error.
ds_report output
System info (please complete the following information):
Launcher context
Here is how I run my code.
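(The actual launch command is not shown in this excerpt. A typical invocation for a 4-GPU run with the DeepSpeed launcher would look roughly like the line below; whether the config path is passed on the command line or set inside the script depends on how llama2_ds_v3.py parses its arguments.)

deepspeed --num_gpus=4 llama2_ds_v3.py --deepspeed ds_config.json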
Additional context
Using the py-spy command
pgrep -P $(pgrep -o deepspeed) | xargs -I {} py-spy dump --pid {}
it shows the following:
I also tried my script on 2 A100 GPUs, but the same problem happens.
I also tried the accelerate launcher, and the problem still occurs.
I am not sure where the problem comes from:
I saw the note "On Linux with kernel version < 5.5, hanging processes have been reported. To avoid this problem, upgrade your system to a later kernel version." on this website. My kernel version is 3.10.0, and I'm not sure whether that is the cause. In my case, installing a new kernel is not an easy solution.
Any suggestions are greatly appreciated.