The ds_config.json configuration is as follows:
{ "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "zero_optimization": { "stage": 3, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true }, "flops_profiler": { "enabled": true, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null }, "comms_logger": { "enabled": true, "verbose": false, "prof_all": true, "debug": false }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 10, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": true }
The training launch command is as follows:
```shell
WANDB_DISABLED=True nohup deepspeed --hostfile=hostfile.txt --master_port=24818 src/train_bash.py \
    --stage dpo \
    --deepspeed ds_config.json \
    --model_name_or_path $MODEL_NAME_OR_PATH \
    --use_fast_tokenizer true \
    --do_train \
    --finetuning_type full \
    --dataset dpo_data.jsonl \
    --dpo_beta 0.1 \
    --output_dir $OUTPUT_DIR \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --cutoff_len 4096 \
    --gradient_accumulation_steps 4 \
    --preprocessing_num_workers 32 \
    --save_strategy "epoch" \
    --logging_steps 1 \
    --save_steps 50 \
    --lr_scheduler_type cosine \
    --learning_rate 5e-7 \
    --warmup_steps 5 \
    --num_train_epochs 6.0 \
    --plot_loss \
    --gradient_checkpointing \
    --template qwen \
    --flash_attn \
    --fp16 > qwen_train.log &
```
The model being trained is the result of SFT on Qwen-14B.
My prompts are fairly long, and the problem reproduces every time, on both single-node multi-GPU and multi-node multi-GPU setups. GPU memory is not full and power draw sits at 100%, but the whole training process hangs completely, with no error and no crash, no matter how long I wait. It gets stuck right after training starts.
P.S. I have already configured NCCL and related environment variables, with no improvement.
However, if I limit the token length in the preprocess_pairwise_dataset method in src/llmtuner/data/preprocess.py:
```python
total_len = len(prompt_ids) + max(len(chosen_ids), len(rejected_ids))
if total_len > (data_args.cutoff_len / 2):  # cutoff_len is set to 4096, so samples over 2048 tokens are skipped
    continue
```
the problem goes away: the process no longer hangs and training runs normally. If the code above does not divide by 2 and only limits the length to 4096, it still hangs.
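To get a feel for how many samples actually exceed that threshold before settling on a cutoff, a quick count along these lines can help. This is only an illustrative sketch, not code from the project: the prompt/chosen/rejected field names and the tokenizer checkpoint are placeholders and need to be adjusted to the actual schema of dpo_data.jsonl.

```python
# Illustrative sketch: count DPO samples whose total token length exceeds a budget.
# The field names and tokenizer checkpoint below are assumptions, not the real schema.
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-14B", trust_remote_code=True)

limit = 4096 // 2  # the threshold that made training stable in this report
too_long = total = 0
with open("dpo_data.jsonl", encoding="utf-8") as f:
    for line in f:
        sample = json.loads(line)
        prompt_len = len(tokenizer.encode(sample["prompt"]))
        chosen_len = len(tokenizer.encode(sample["chosen"]))
        rejected_len = len(tokenizer.encode(sample["rejected"]))
        total += 1
        if prompt_len + max(chosen_len, rejected_len) > limit:
            too_long += 1

print(f"{too_long}/{total} samples exceed {limit} tokens")
```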
My suspicion is that overly long samples cause inter-GPU communication to deadlock. Has anyone else run into this, and what is actually causing it?
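One way to probe that hypothesis independently of DeepSpeed is a minimal NCCL collective smoke test run on the same hosts with torchrun. This is a hedged sketch, not part of the original report; the tensor size is an arbitrary illustrative value.

```python
# Minimal NCCL all_reduce smoke test (illustrative only).
# Launch with: torchrun --nnodes=<N> --nproc_per_node=<G> nccl_smoke_test.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # ~500 MB of fp32 per rank; adjust upward to probe where communication stalls.
    x = torch.randn(125_000_000, device="cuda")
    for step in range(10):
        dist.all_reduce(x)
        torch.cuda.synchronize()
        if dist.get_rank() == 0:
            print(f"all_reduce step {step} completed")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If this hangs as well, the problem is likely in the interconnect or NCCL configuration rather than in the training code.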
Try #1683 (comment)
export NCCL_P2P_LEVEL=NVL
Already tried that, it doesn't help...
ZeRO-3 is communication-heavy; without NVLink it tends to hang.
Someone opened a related issue on trl; once trl ships a new release I'll update and try again.
huggingface/trl#1011 (comment)
That doesn't work either.