DPO training hangs under the deepspeed zero3 config. What is going on? #1775

Closed
suparek opened this issue Dec 8, 2023 · 5 comments
Labels
solved This problem has been already solved.

Comments


suparek commented Dec 8, 2023

Reminder

  • I have read the README and searched the existing issues.

Reproduction

The ds_config.json configuration is as follows:

{
     "fp16": {
         "enabled": "auto",
         "loss_scale": 0,
         "loss_scale_window": 1000,
         "initial_scale_power": 16,
         "hysteresis": 2,
         "min_loss_scale": 1
     },
     "zero_optimization": {
         "stage": 3,
         "allgather_partitions": true,
         "allgather_bucket_size": 500000000,
         "overlap_comm": true,
         "reduce_scatter": true,
         "reduce_bucket_size": 500000000,
         "contiguous_gradients": true
     },
     "flops_profiler": {
         "enabled": true,
         "profile_step": 1,
         "module_depth": -1,
         "top_modules": 1,
         "detailed": true,
         "output_file": null
     },
     "comms_logger": {
         "enabled": true,
         "verbose": false,
         "prof_all": true,
         "debug": false
     },
     "gradient_accumulation_steps": "auto",
     "gradient_clipping": "auto",
     "steps_per_print": 10,
     "train_batch_size": "auto",
     "train_micro_batch_size_per_gpu": "auto",
     "wall_clock_breakdown": true
 }

The training launch command is as follows:

WANDB_DISABLED=True nohup deepspeed --hostfile=hostfile.txt --master_port=24818 src/train_bash.py \
     --stage dpo \
     --deepspeed ds_config.json \
     --model_name_or_path $MODEL_NAME_OR_PATH \
     --use_fast_tokenizer true \
     --do_train \
     --finetuning_type full \
     --dataset dpo_data.jsonl \
     --dpo_beta 0.1 \
     --output_dir $OUTPUT_DIR \
     --overwrite_cache \
     --per_device_train_batch_size 1 \
     --cutoff_len 4096 \
     --gradient_accumulation_steps 4 \
     --preprocessing_num_workers 32 \
     --save_strategy "epoch" \
     --logging_steps 1 \
     --save_steps 50 \
     --lr_scheduler_type cosine \
     --learning_rate 5e-7 \
     --warmup_steps 5 \
     --num_train_epochs 6.0 \
     --plot_loss \
     --gradient_checkpointing \
     --template qwen \
     --flash_attn \
     --fp16 > qwen_train.log  &

The model being trained is the result of SFT on Qwen-14B.

My prompts are on the longer side, and the problem reproduces every time, on both single-node multi-GPU and multi-node multi-GPU setups. GPU memory is not full and power draw sits at 100%, yet the whole training process is completely stuck: no error, no crash, and it never recovers no matter how long I wait.
It hangs right after training starts.

PS: I have already set NCCL and other environment variables, with no improvement.
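
One generic way to see where a silently hung rank is stuck is to register Python's faulthandler near the top of the training entry point and signal the process later; this is a minimal sketch, assuming a Unix system, not something suggested in this thread:

import faulthandler
import signal

# Dump the Python stack of every thread to stderr whenever the process
# receives SIGUSR1, e.g. by running `kill -USR1 <pid>` on the hung rank.
faulthandler.register(signal.SIGUSR1, all_threads=True)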

However, if I limit the token length inside the preprocess_pairwise_dataset function in src/llmtuner/data/preprocess.py:

total_len = len(prompt_ids) + max(len(chosen_ids), len(rejected_ids))
if total_len > (data_args.cutoff_len / 2):  # cutoff_len is set to 4096; skip over-long pairs
    continue

then the problem is solved: the process no longer hangs and training runs normally.
If I do not divide by 2 in the code above and only cap the length at 4096, it still hangs.

My suspicion is that overly long samples cause the inter-GPU communication to deadlock. Has anyone else run into this? What exactly is the cause?
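
For illustration, here is a self-contained sketch of the same kind of length filter, written against a plain list of prompt/chosen/rejected records rather than the real preprocess_pairwise_dataset internals; the function name filter_long_pairs, the record field names, and max_total_len are assumptions, not code from this repository:

from transformers import AutoTokenizer

def filter_long_pairs(examples, tokenizer, max_total_len=2048):
    """Keep only pairs whose prompt plus longest response fits within max_total_len tokens."""
    kept = []
    for ex in examples:
        prompt_ids = tokenizer.encode(ex["prompt"], add_special_tokens=False)
        chosen_ids = tokenizer.encode(ex["chosen"], add_special_tokens=False)
        rejected_ids = tokenizer.encode(ex["rejected"], add_special_tokens=False)
        total_len = len(prompt_ids) + max(len(chosen_ids), len(rejected_ids))
        if total_len > max_total_len:  # mirrors the cutoff_len / 2 workaround above
            continue
        kept.append(ex)
    return kept

# Usage sketch: drop over-long pairs before the regular preprocessing step.
# tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-14B", trust_remote_code=True)
# train_pairs = filter_long_pairs(raw_pairs, tokenizer, max_total_len=4096 // 2)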

Expected behavior

No response

System Info

No response

Others

No response

hiyouga (Owner) commented Dec 8, 2023

Try #1683 (comment)

@hiyouga hiyouga added the pending This problem is yet to be addressed. label Dec 8, 2023
suparek (Author) commented Dec 8, 2023

Try #1683 (comment)

export NCCL_P2P_LEVEL=NVL

I have already tried this; it does not help...

hiyouga (Owner) commented Dec 8, 2023

ZeRO-3 is communication-heavy; without NVLink it tends to hang.
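
One way to check whether the GPUs in a node can reach each other over peer-to-peer links at all is to query CUDA peer access; a minimal sketch, assuming only PyTorch (the helper report_p2p_access is illustrative, not part of this repo):

import torch

def report_p2p_access():
    """Print, for every pair of local GPUs, whether direct peer-to-peer access is possible."""
    n = torch.cuda.device_count()
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'yes' if ok else 'no'}")

if __name__ == "__main__":
    report_p2p_access()

Running `nvidia-smi topo -m` reports the same topology and additionally shows whether each link is NVLink (NV#) or PCIe.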

suparek (Author) commented Dec 8, 2023

ZeRO-3 is communication-heavy; without NVLink it tends to hang.

Someone has raised a related issue on trl; I will update and try again once trl ships a new release.

huggingface/trl#1011 (comment)

@hiyouga hiyouga added solved This problem has been already solved. and removed pending This problem is yet to be addressed. labels Dec 16, 2023
@hiyouga hiyouga closed this as completed Dec 16, 2023
@phphappy

ZeRO-3 is communication-heavy; without NVLink it tends to hang.

Someone has raised a related issue on trl; I will update and try again once trl ships a new release.

huggingface/trl#1011 (comment)

This does not help.
