The ds_config.json configuration is as follows:
{ "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "zero_optimization": { "stage": 3, "allgather_partitions": true, "allgather_bucket_size": 500000000, "overlap_comm": true, "reduce_scatter": true, "reduce_bucket_size": 500000000, "contiguous_gradients": true }, "flops_profiler": { "enabled": true, "profile_step": 1, "module_depth": -1, "top_modules": 1, "detailed": true, "output_file": null }, "comms_logger": { "enabled": true, "verbose": false, "prof_all": true, "debug": false }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "steps_per_print": 10, "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "wall_clock_breakdown": true }
The training launch command is as follows:
```shell
WANDB_DISABLED=True nohup deepspeed --hostfile=hostfile.txt --master_port=24818 src/train_bash.py \
    --stage dpo \
    --deepspeed ds_config.json \
    --model_name_or_path $MODEL_NAME_OR_PATH \
    --use_fast_tokenizer true \
    --do_train \
    --finetuning_type full \
    --dataset dpo_data.jsonl \
    --dpo_beta 0.1 \
    --output_dir $OUTPUT_DIR \
    --overwrite_cache \
    --per_device_train_batch_size 1 \
    --cutoff_len 4096 \
    --gradient_accumulation_steps 4 \
    --preprocessing_num_workers 32 \
    --save_strategy "epoch" \
    --logging_steps 1 \
    --save_steps 50 \
    --lr_scheduler_type cosine \
    --learning_rate 5e-7 \
    --warmup_steps 5 \
    --num_train_epochs 6.0 \
    --plot_loss \
    --gradient_checkpointing \
    --template qwen \
    --flash_attn \
    --fp16 > qwen_train.log &
```
The model being trained is the result of SFT on Qwen-14B.
My prompts are fairly long, and the problem reproduces every time, on both single-node multi-GPU and multi-node multi-GPU setups. GPU memory is not full and power draw sits at 100%, but the whole training process hangs completely, with no error and no crash, no matter how long I wait. It gets stuck right after training starts.
P.S. I have already configured NCCL and related environment variables, with no improvement.
However, if I limit the token length in the preprocess_pairwise_dataset method in src/llmtuner/data/preprocess.py:
```python
total_len = len(prompt_ids) + max(len(chosen_ids), len(rejected_ids))
if total_len > (data_args.cutoff_len / 2):  # cutoff_len is set to 4096, so samples over 2048 tokens are skipped
    continue
```
the problem goes away: the process no longer hangs and training runs normally. If the code above does not divide by 2 and only limits the length to 4096, it still hangs.
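To get a feel for how many samples actually exceed that threshold before settling on a cutoff, a quick count along these lines can help. This is only an illustrative sketch, not code from the project: the prompt/chosen/rejected field names and the tokenizer checkpoint are placeholders and need to be adjusted to the actual schema of dpo_data.jsonl.

```python
# Illustrative sketch: count DPO samples whose total token length exceeds a budget.
# The field names and tokenizer checkpoint below are assumptions, not the real schema.
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-14B", trust_remote_code=True)

limit = 4096 // 2  # the threshold that made training stable in this report
too_long = total = 0
with open("dpo_data.jsonl", encoding="utf-8") as f:
    for line in f:
        sample = json.loads(line)
        prompt_len = len(tokenizer.encode(sample["prompt"]))
        chosen_len = len(tokenizer.encode(sample["chosen"]))
        rejected_len = len(tokenizer.encode(sample["rejected"]))
        total += 1
        if prompt_len + max(chosen_len, rejected_len) > limit:
            too_long += 1

print(f"{too_long}/{total} samples exceed {limit} tokens")
```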
My suspicion is that overly long samples cause inter-GPU communication to deadlock. Has anyone else run into this, and what is actually causing it?
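One way to probe that hypothesis independently of DeepSpeed is a minimal NCCL collective smoke test run on the same hosts with torchrun. This is a hedged sketch, not part of the original report; the tensor size is an arbitrary illustrative value.

```python
# Minimal NCCL all_reduce smoke test (illustrative only).
# Launch with: torchrun --nnodes=<N> --nproc_per_node=<G> nccl_smoke_test.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # ~500 MB of fp32 per rank; adjust upward to probe where communication stalls.
    x = torch.randn(125_000_000, device="cuda")
    for step in range(10):
        dist.all_reduce(x)
        torch.cuda.synchronize()
        if dist.get_rank() == 0:
            print(f"all_reduce step {step} completed")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If this hangs as well, the problem is likely in the interconnect or NCCL configuration rather than in the training code.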
Try #1683 (comment)
export NCCL_P2P_LEVEL=NVL
Already tried that, it doesn't help...
ZeRO-3 is communication-heavy; without NVLink it tends to hang.
Someone opened a related issue on trl; once trl ships a new release I'll update and try again.
huggingface/trl#1011 (comment)
That doesn't work either.