Description
Describe the bug
What the bug is and how to reproduce it, preferably with screenshots
Training script
swift rlhf
--rlhf_type grpo
# Base model
--model "${ACTOR_MODEL}"
# Load the custom reward function
--external_plugins orm.py
--reward_funcs avqa
# Use an external vLLM rollout server
--use_vllm true
--vllm_mode server
--vllm_use_async_engine false
# Training type: LoRA fine-tuning
--train_type lora
--lora_rank ${LORA_RANK}
--lora_alpha ${LORA_ALPHA}
--lora_dropout ${LORA_DROPOUT}
--target_modules "${LORA_TARGET_MODULES}"
--torch_dtype bfloat16
# Dataset
--dataset "${TRAIN_DATASET}"
--load_from_cache_file false
# Generation config
--max_completion_length 2048
--temperature 0.9
--repetition_penalty 1.1
# Training config
# Changed the parallel strategy; a single replica on 8 GPUs is kept unchanged
--num_train_epochs 1
--per_device_train_batch_size 1
--gradient_accumulation_steps 1
--gradient_checkpointing true
--learning_rate "${LORA_LEARNING_RATE}"
--warmup_ratio 0.05
# GRPO
--num_generations 8
--num_iterations 1
--beta 0.001
--max_grad_norm 0.5
--async_generate false
# Save strategy
--save_strategy steps
--save_steps 100
--save_total_limit 5
# Evaluation strategy
--eval_strategy steps
--eval_steps 50
--per_device_eval_batch_size 1
# Logging
--logging_steps 1
--log_completions true
# Output
--output_dir "${OUTPUT_DIR}"
# Data loading
--dataloader_num_workers 0
--dataset_num_proc 1
# DeepSpeed ZeRO-3
--deepspeed zero3_offload
# Wandb - only rank 0 reports
--report_to "${REPORT_TO}"
# Vision encoder config
--freeze_vit true
--freeze_aligner true
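For reference, the batch-related flags above interact with `--num_generations`. The following is a minimal sketch of that arithmetic, assuming ms-swift follows the convention used by TRL's GRPO trainer: the number of completions per generation step is the per-device batch size times the number of training processes times the steps per generation, and it must be divisible by `num_generations`. The function name `grpo_batch_layout` and the `world_size=16` value are hypothetical and only serve as an illustration.

```python
def grpo_batch_layout(per_device_batch_size, world_size,
                      steps_per_generation, num_generations):
    """Return how many completions and unique prompts one generation step covers.

    Assumption (not confirmed against the ms-swift source): the trainer
    requires the completion count to be divisible by num_generations.
    """
    completions = per_device_batch_size * world_size * steps_per_generation
    if completions % num_generations != 0:
        raise ValueError(
            f"{completions} completions per step is not divisible by "
            f"num_generations={num_generations}"
        )
    return {
        "completions_per_step": completions,
        "unique_prompts_per_step": completions // num_generations,
    }


if __name__ == "__main__":
    # Values from the script above, with a hypothetical world size of 16.
    print(grpo_batch_layout(per_device_batch_size=1, world_size=16,
                            steps_per_generation=1, num_generations=8))
```

With these inputs each step would cover 16 completions, that is, 2 unique prompts with 8 completions each.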
Rollout deployment script
swift rollout
--model "${ROLLOUT_MODEL}"
--infer_backend vllm
--host "${ROLLOUT_HOST}"
--port "${ROLLOUT_PORT}"
--vllm_tensor_parallel_size 8
--vllm_gpu_memory_utilization 0.85
--vllm_max_model_len 16384
--vllm_enforce_eager true
--vllm_enable_prefix_caching false
--served_model_name "${ROLLOUT_SERVED_NAME}"
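Since the trainer talks to this server over the network, a basic reachability check from the training node can rule out connectivity problems. The sketch below only opens a TCP connection using the Python standard library; it does not call any ms-swift or vLLM endpoint. The helper name `can_connect` is hypothetical, and the host and port are whatever `ROLLOUT_HOST` and `ROLLOUT_PORT` are set to.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection resolves the host and connects within `timeout` seconds.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Run from the training node, for example `can_connect("<ROLLOUT_HOST>", <ROLLOUT_PORT>)`. A `True` result only shows that the port is open, not that the rollout engine is healthy.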
The training node log shows:
Train: 0%| | 0/25826 [00:00<?, ?it/s]
The rollout log shows:
First:
INFO: 10.120.5.11:43236 - "POST /update_flattened_params/ HTTP/1.1" 200 OK
After that, it printed several lines of:
qwen-vl-utils using decord to read video.
qwen-vl-utils using decord to read video.
qwen-vl-utils using decord to read video.
This suggests the rollout node may have received something?
Then it started loading something I cannot identify:
0%| | 0/64 [00:00<?, ?it/s]
2%|▏ | 1/64 [00:06<06:18, 6.01s/it]
3%|▎ | 2/64 [00:06<02:40, 2.58s/it]
5%|▍ | 3/64 [00:06<01:28, 1.46s/it]
6%|▋ | 4/64 [00:06<00:56, 1.07it/s]
11%|█ | 7/64 [00:06<00:21, 2.70it/s]
14%|█▍ | 9/64 [00:06<00:15, 3.60it/s]
17%|█▋ | 11/64 [00:07<00:13, 4.07it/s]
19%|█▉ | 12/64 [00:07<00:11, 4.55it/s]
23%|██▎ | 15/64 [00:07<00:06, 7.04it/s]
27%|██▋ | 17/64 [00:07<00:06, 7.38it/s]
31%|███▏ | 20/64 [00:07<00:05, 8.31it/s]
36%|███▌ | 23/64 [00:08<00:03, 11.15it/s]
41%|████ | 26/64 [00:08<00:02, 13.37it/s]
44%|████▍ | 28/64 [00:08<00:04, 8.98it/s]
48%|████▊ | 31/64 [00:08<00:02, 11.36it/s]
52%|█████▏ | 33/64 [00:09<00:02, 10.94it/s]
56%|█████▋ | 36/64 [00:09<00:02, 12.91it/s]
59%|█████▉ | 38/64 [00:09<00:02, 12.46it/s]
62%|██████▎ | 40/64 [00:09<00:01, 12.72it/s]
67%|██████▋ | 43/64 [00:09<00:01, 13.47it/s]
70%|███████ | 45/64 [00:10<00:02, 8.79it/s]
73%|███████▎ | 47/64 [00:10<00:01, 10.15it/s]
77%|███████▋ | 49/64 [00:10<00:01, 8.86it/s]
80%|███████▉ | 51/64 [00:10<00:01, 10.39it/s]
83%|████████▎ | 53/64 [00:10<00:01, 8.88it/s]
86%|████████▌ | 55/64 [00:11<00:01, 7.77it/s]
88%|████████▊ | 56/64 [00:11<00:01, 4.53it/s]
===== It hangs here; no update for at least 12 hours =====
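To pin down where the rollout stalls, the last progress line can be extracted from a saved log. This is a minimal sketch that assumes the log contains tqdm-style lines like the ones above; the helper names `parse_progress` and `last_progress` are hypothetical.

```python
import re

# Matches tqdm-style lines such as " 88%|████████▊ | 56/64 [00:11<00:01, 4.53it/s]".
PROGRESS_RE = re.compile(r"(\d+)%\|.*?\|\s*(\d+)/(\d+)\s*\[")


def parse_progress(line):
    """Return (done, total) for a tqdm-style line, or None if it is not one."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    return int(match.group(2)), int(match.group(3))


def last_progress(lines):
    """Return the (done, total) pair of the last progress line, or None."""
    last = None
    for line in lines:
        progress = parse_progress(line)
        if progress is not None:
            last = progress
    return last
```

Applied to the rollout log above, the last parsed pair is 56 of 64, which is the point where the output stops.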
Your hardware and system info
Write your system info like CUDA version/system/GPU/torch version here
GPU: the swift rlhf nodes are 2*8 H100; the rollout node is 1*8 H100
Environment:
name: qwen
channels:
- defaults
dependencies:
- _libgcc_mutex=0.1=main
- _openmp_mutex=5.1=1_gnu
- bzip2=1.0.8=h5eee18b_6
- ca-certificates=2025.11.4=h06a4308_0
- expat=2.7.3=h3385a95_0
- ld_impl_linux-64=2.44=h153f514_2
- libffi=3.4.4=h6a678d5_1
- libgcc=15.2.0=h69a1729_7
- libgcc-ng=15.2.0=h166f726_7
- libgomp=15.2.0=h4751f2c_7
- libnsl=2.0.0=h5eee18b_0
- libstdcxx=15.2.0=h39759b7_7
- libstdcxx-ng=15.2.0=hc03a8fd_7
- libuuid=1.41.5=h5eee18b_0
- libxcb=1.17.0=h9b100fa_0
- libzlib=1.3.1=hb25bd0a_0
- ncurses=6.5=h7934f7d_0
- openssl=3.0.18=hd6dcaed_0
- pip=25.2=pyhc872135_1
- pthread-stubs=0.3=h0ce48e5_1
- python=3.10.19=h6fa692b_0
- readline=8.3=hc2a1206_0
- sqlite=3.51.0=h2a70700_0
- tk=8.6.15=h54e0aa7_0
- wheel=0.45.1=py310h06a4308_0
- xorg-libx11=1.8.12=h9b100fa_1
- xorg-libxau=1.0.12=h9b100fa_0
- xorg-libxdmcp=1.1.5=h9b100fa_0
- xorg-xorgproto=2024.1=h5eee18b_1
- xz=5.6.4=h5eee18b_1
- zlib=1.3.1=hb25bd0a_0
- pip:
- absl-py==2.3.1
- accelerate==1.11.0
- addict==2.4.0
- aiofiles==24.1.0
- aiohappyeyeballs==2.6.1
- aiohttp==3.13.2
- aiosignal==1.4.0
- airportsdata==20250909
- aliyun-python-sdk-core==2.16.0
- aliyun-python-sdk-kms==2.16.5
- annotated-doc==0.0.4
- annotated-types==0.7.0
- antlr4-python3-runtime==4.9.3
- anyio==4.11.0
- apex==0.1
- astor==0.8.1
- async-timeout==5.0.1
- attrdict==2.0.1
- attrs==25.4.0
- audioread==3.1.0
- av==16.0.1
- binpacking==1.5.2
- blake3==1.0.8
- brotli==1.2.0
- cachetools==6.2.1
- cbor2==5.7.1
- certifi==2025.10.5
- cffi==2.0.0
- charset-normalizer==3.4.4
- click==8.2.1
- cloudpickle==3.1.2
- cmake==4.1.2
- compressed-tensors==0.11.0
- contourpy==1.3.2
- cpm-kernels==1.0.11
- crcmod==1.7
- cryptography==46.0.3
- cupy-cuda12x==13.6.0
- cycler==0.12.1
- dacite==1.9.2
- datasets==3.6.0
- decorator==5.2.1
- decord==0.6.0
- deepspeed==0.18.2
- depyf==0.19.0
- dill==0.3.8
- diskcache==5.6.3
- distro==1.9.0
- dnspython==2.8.0
- dora-search==0.1.12
- einops==0.8.1
- email-validator==2.3.0
- exceptiongroup==1.3.0
- fastapi==0.121.1
- fastapi-cli==0.0.16
- fastapi-cloud-cli==0.3.1
- fastrlock==0.8.3
- ffmpy==1.0.0
- filelock==3.20.0
- flash-attn==2.8.3
- fonttools==4.60.1
- frozendict==2.4.7
- frozenlist==1.8.0
- fsspec==2025.3.0
- future==1.0.0
- gguf==0.17.1
- gitdb==4.0.12
- gitpython==3.1.45
- gradio==6.0.0
- gradio-client==2.0.0.dev3
- groovy==0.1.2
- grpcio==1.76.0
- h11==0.16.0
- hf-xet==1.2.0
- hjson==3.1.0
- httpcore==1.0.9
- httptools==0.7.1
- httpx==0.28.1
- huggingface-hub==0.36.0
- idna==3.11
- importlib-metadata==8.7.0
- interegular==0.3.3
- jieba==0.42.1
- jinja2==3.1.6
- jiter==0.12.0
- jmespath==0.10.0
- joblib==1.5.2
- json-repair==0.54.1
- jsonschema==4.25.1
- jsonschema-specifications==2025.9.1
- julius==0.2.7
- kiwisolver==1.4.9
- lameenc==1.8.1
- lark==1.2.2
- latex2sympy2-extended==1.10.2
- lazy-loader==0.4
- librosa==0.11.0
- llguidance==0.7.30
- llvmlite==0.44.0
- lm-format-enforcer==0.11.3
- markdown==3.10
- markdown-it-py==4.0.0
- markupsafe==3.0.3
- math-verify==0.8.0
- matplotlib==3.10.7
- mdurl==0.1.2
- megatron-core==0.14.1
- mistral-common==1.8.5
- ml-dtypes==0.5.4
- modelscope==1.31.0
- mpmath==1.3.0
- msgpack==1.1.2
- msgspec==0.19.0
- multidict==6.7.0
- multiprocess==0.70.16
- nest-asyncio==1.6.0
- networkx==3.4.2
- ninja==1.13.0
- nltk==3.9.2
- numba==0.61.2
- numpy==1.26.4
- nvidia-cublas-cu12==12.8.4.1
- nvidia-cuda-cupti-cu12==12.8.90
- nvidia-cuda-nvrtc-cu12==12.8.93
- nvidia-cuda-runtime-cu12==12.8.90
- nvidia-cudnn-cu12==9.10.2.21
- nvidia-cufft-cu12==11.3.3.83
- nvidia-cufile-cu12==1.13.1.3
- nvidia-curand-cu12==10.3.9.90
- nvidia-cusolver-cu12==11.7.3.90
- nvidia-cusparse-cu12==12.5.8.93
- nvidia-cusparselt-cu12==0.7.1
- nvidia-nccl-cu12==2.27.3
- nvidia-nvjitlink-cu12==12.8.93
- nvidia-nvtx-cu12==12.8.90
- omegaconf==2.3.0
- onnx==1.19.1
- onnx-ir==0.1.12
- onnxscript==0.5.6
- openai==2.8.1
- openai-harmony==0.0.8
- opencv-python-headless==4.12.0.88
- openunmix==1.3.0
- orjson==3.11.4
- oss2==2.19.1
- outlines==0.1.11
- outlines-core==0.1.26
- packaging==25.0
- pandas==2.3.3
- partial-json-parser==0.2.1.1.post6
- peft==0.17.1
- pillow==11.3.0
- platformdirs==4.5.0
- pooch==1.8.2
- prometheus-client==0.23.1
- prometheus-fastapi-instrumentator==7.1.0
- propcache==0.4.1
- protobuf==6.33.0
- psutil==7.1.3
- py-cpuinfo==9.0.0
- pyarrow==22.0.0
- pybase64==1.4.2
- pybind11==3.0.1
- pycountry==24.6.1
- pycparser==2.23
- pycryptodome==3.23.0
- pydantic==2.12.4
- pydantic-core==2.41.5
- pydantic-extra-types==2.10.6
- pydub==0.25.1
- pygments==2.19.2
- pyparsing==3.2.5
- python-dateutil==2.9.0.post0
- python-dotenv==1.2.1
- python-json-logger==4.0.0
- python-multipart==0.0.20
- pytz==2025.2
- pyyaml==6.0.3
- pyzmq==27.1.0
- qwen-omni-utils==0.0.8
- ray==2.51.1
- referencing==0.37.0
- regex==2025.11.3
- requests==2.32.5
- retrying==1.4.2
- rich==14.2.0
- rich-toolkit==0.15.1
- rignore==0.7.6
- rouge==1.0.1
- rpds-py==0.28.0
- safehttpx==0.1.7
- safetensors==0.6.2
- scikit-learn==1.7.2
- scipy==1.15.3
- semantic-version==2.10.0
- sentencepiece==0.2.1
- sentry-sdk==2.43.0
- setproctitle==1.3.7
- setuptools==79.0.1
- setuptools-scm==9.2.2
- shellingham==1.5.4
- simplejson==3.20.2
- six==1.17.0
- smmap==5.0.2
- sniffio==1.3.1
- socksio==1.0.0
- sortedcontainers==2.4.0
- soundfile==0.13.1
- soxr==1.0.0
- starlette==0.49.3
- submitit==1.5.3
- sympy==1.14.0
- tensorboard==2.20.0
- tensorboard-data-server==0.7.2
- threadpoolctl==3.6.0
- tiktoken==0.12.0
- tokenizers==0.22.1
- tomli==2.3.0
- tomlkit==0.13.3
- torch==2.8.0
- torchaudio==2.8.0
- torchvision==0.23.0
- tqdm==4.67.1
- transformer-engine==2.9.0
- transformer-engine-cu12==2.9.0
- transformer-engine-torch==2.9.0
- transformers==4.57.1
- transformers-stream-generator==0.0.5
- treetable==0.2.6
- triton==3.4.0
- trl==0.24.0
- typer==0.20.0
- typer-slim==0.20.0
- typing-extensions==4.15.0
- typing-inspection==0.4.2
- tzdata==2025.2
- urllib3==2.5.0
- uv==0.9.8
- uvicorn==0.38.0
- uvloop==0.22.1
- vllm==0.10.2
- wandb==0.23.0
- watchfiles==1.1.1
- websockets==15.0.1
- werkzeug==3.1.3
- xformers==0.0.32.post1
- xgrammar==0.1.23
- xxhash==3.6.0
- yarl==1.22.0
- zipp==3.23.0
- zstandard==0.25.0
prefix: /root/miniconda3/envs/qwen
ms_swift
Version: 3.11.0.dev0
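The versions most relevant to this setup can also be gathered programmatically instead of pasting the full environment. The sketch below uses only `importlib.metadata` from the standard library; the helper name `collect_versions` and the chosen package list are illustrative assumptions.

```python
from importlib import metadata


def collect_versions(packages):
    """Map each distribution name to its installed version, or 'not installed'."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    # Packages picked for illustration; adjust the list as needed.
    wanted = ["ms-swift", "torch", "vllm", "trl", "transformers", "deepspeed"]
    for name, version in collect_versions(wanted).items():
        print(f"{name}=={version}")
```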
Additional context
Add any other context about the problem here
I tried the various fixes suggested in https://github.com/modelscope/ms-swift/issues/6759, but none of them resolved the problem.