GRPO training on Qwen3-Omni hangs at the start of training #6882

@aprylewu

Describe the bug
What the bug is, and how to reproduce, better with screenshots

Training script

swift rlhf
--rlhf_type grpo
# Base model
--model "${ACTOR_MODEL}"
# Load the custom reward function
--external_plugins orm.py
--reward_funcs avqa
# Use an external rollout vLLM server
--use_vllm true
--vllm_mode server
--vllm_use_async_engine false
# Training type: LoRA fine-tuning
--train_type lora
--lora_rank ${LORA_RANK}
--lora_alpha ${LORA_ALPHA}
--lora_dropout ${LORA_DROPOUT}
--target_modules "${LORA_TARGET_MODULES}"
--torch_dtype bfloat16
# Dataset
--dataset "${TRAIN_DATASET}"
--load_from_cache_file false
# Generation config
--max_completion_length 2048
--temperature 0.9
--repetition_penalty 1.1
# Training config
# Parallel strategy changed; a single replica still uses 8 GPUs
--num_train_epochs 1
--per_device_train_batch_size 1
--gradient_accumulation_steps 1
--gradient_checkpointing true
--learning_rate "${LORA_LEARNING_RATE}"
--warmup_ratio 0.05
# GRPO
--num_generations 8
--num_iterations 1
--beta 0.001
--max_grad_norm 0.5
--async_generate false
# Save strategy
--save_strategy steps
--save_steps 100
--save_total_limit 5
# Evaluation strategy
--eval_strategy steps
--eval_steps 50
--per_device_eval_batch_size 1
# Logging
--logging_steps 1
--log_completions true
# Output
--output_dir "${OUTPUT_DIR}"
# Data loading
--dataloader_num_workers 0
--dataset_num_proc 1
# DeepSpeed ZeRO-3
--deepspeed zero3_offload
# Wandb: only rank 0 reports
--report_to "${REPORT_TO}"
# Vision encoder config
--freeze_vit true
--freeze_aligner true
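
A side note on the script layout: in bash, a `#` comment line placed between backslash-continued flags ends the command, so the flags after it are not passed. One way to keep the comments is to collect the flags in an array first. This is a minimal sketch assuming bash, showing only a subset of the flags above. `GRPO_ARGS` and the placeholder default for `ACTOR_MODEL` are names introduced here for illustration, not part of ms-swift.

```shell
#!/usr/bin/env bash
# Sketch: collect flags in a bash array so comments can sit between them.
# GRPO_ARGS is a hypothetical name; the flags are copied from the script above.
ACTOR_MODEL="${ACTOR_MODEL:-/path/to/model}"   # placeholder default (assumption)

GRPO_ARGS=(
  --rlhf_type grpo
  # Base model
  --model "${ACTOR_MODEL}"
  # Use an external rollout vLLM server
  --use_vllm true
  --vllm_mode server
  # GRPO
  --num_generations 8
  --beta 0.001
)

# Print one argument per line to inspect what would be passed.
printf '%s\n' "${GRPO_ARGS[@]}"

# Then launch with: swift rlhf "${GRPO_ARGS[@]}"
```

Quoting `"${GRPO_ARGS[@]}"` keeps each flag and value as a separate argument, even if a value contains spaces.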

Rollout deployment script
swift rollout
--model "${ROLLOUT_MODEL}"
--infer_backend vllm
--host "${ROLLOUT_HOST}"
--port "${ROLLOUT_PORT}"
--vllm_tensor_parallel_size 8
--vllm_gpu_memory_utilization 0.85
--vllm_max_model_len 16384
--vllm_enforce_eager true
--vllm_enable_prefix_caching false
--served_model_name "${ROLLOUT_SERVED_NAME}"

The training node log shows:
Train: 0%| | 0/25826 [00:00<?, ?it/s]

The rollout log shows the following. First:
INFO: 10.120.5.11:43236 - "POST /update_flattened_params/ HTTP/1.1" 200 OK

After that it printed several lines like:
qwen-vl-utils using decord to read video.
qwen-vl-utils using decord to read video.
qwen-vl-utils using decord to read video.
This suggests the rollout node may have received something?

Then it started loading something, and I cannot tell what:
0%| | 0/64 [00:00<?, ?it/s]
2%|▏ | 1/64 [00:06<06:18, 6.01s/it]
3%|▎ | 2/64 [00:06<02:40, 2.58s/it]
5%|▍ | 3/64 [00:06<01:28, 1.46s/it]
6%|▋ | 4/64 [00:06<00:56, 1.07it/s]
11%|█ | 7/64 [00:06<00:21, 2.70it/s]
14%|█▍ | 9/64 [00:06<00:15, 3.60it/s]
17%|█▋ | 11/64 [00:07<00:13, 4.07it/s]
19%|█▉ | 12/64 [00:07<00:11, 4.55it/s]
23%|██▎ | 15/64 [00:07<00:06, 7.04it/s]
27%|██▋ | 17/64 [00:07<00:06, 7.38it/s]
31%|███▏ | 20/64 [00:07<00:05, 8.31it/s]
36%|███▌ | 23/64 [00:08<00:03, 11.15it/s]
41%|████ | 26/64 [00:08<00:02, 13.37it/s]
44%|████▍ | 28/64 [00:08<00:04, 8.98it/s]
48%|████▊ | 31/64 [00:08<00:02, 11.36it/s]
52%|█████▏ | 33/64 [00:09<00:02, 10.94it/s]
56%|█████▋ | 36/64 [00:09<00:02, 12.91it/s]
59%|█████▉ | 38/64 [00:09<00:02, 12.46it/s]
62%|██████▎ | 40/64 [00:09<00:01, 12.72it/s]
67%|██████▋ | 43/64 [00:09<00:01, 13.47it/s]
70%|███████ | 45/64 [00:10<00:02, 8.79it/s]
73%|███████▎ | 47/64 [00:10<00:01, 10.15it/s]
77%|███████▋ | 49/64 [00:10<00:01, 8.86it/s]
80%|███████▉ | 51/64 [00:10<00:01, 10.39it/s]
83%|████████▎ | 53/64 [00:10<00:01, 8.88it/s]
86%|████████▌ | 55/64 [00:11<00:01, 7.77it/s]
88%|████████▊ | 56/64 [00:11<00:01, 4.53it/s]
===== Hangs here; no update for at least 12 hours =====
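
To quantify a stall like this from a saved log, here is a minimal sketch assuming bash with GNU `grep` and `stat`. `last_progress`, `seconds_since_update`, and the `rollout.log` path are hypothetical names introduced here, not part of ms-swift. The first helper extracts the last tqdm-style `done/total` counter, and the second reports how long ago the log file was last written.

```shell
#!/usr/bin/env bash
# Sketch: inspect a tqdm-style log to see where progress stopped and for how long.
# Both helper names are hypothetical.

last_progress() {
  # Print the last "done/total" counter that is followed by " [" (tqdm layout).
  # -a forces text mode so the block characters in the bar do not matter.
  grep -aoE '[0-9]+/[0-9]+ \[' "$1" | tail -n 1 | cut -d' ' -f1
}

seconds_since_update() {
  # Seconds since the file was last modified.
  local now mtime
  now="$(date +%s)"
  mtime="$(stat -c %Y "$1")"   # GNU stat; on BSD/macOS use: stat -f %m
  echo $(( now - mtime ))
}

# Usage, assuming the rollout output was redirected to rollout.log:
# last_progress rollout.log          # for the log above this prints 56/64
# seconds_since_update rollout.log
```

If `py-spy` is available (it is not in the environment listed below), `py-spy dump --pid <PID>` prints the Python stack of a hung process, which may help show where the rollout worker is blocked.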

Your hardware and system info
Write your system info like CUDA version/system/GPU/torch version here

GPU: the swift rlhf nodes are 2*8 H100; the rollout node is 1*8 H100

Environment:
name: qwen
channels:
  • defaults
dependencies:
  • _libgcc_mutex=0.1=main
  • _openmp_mutex=5.1=1_gnu
  • bzip2=1.0.8=h5eee18b_6
  • ca-certificates=2025.11.4=h06a4308_0
  • expat=2.7.3=h3385a95_0
  • ld_impl_linux-64=2.44=h153f514_2
  • libffi=3.4.4=h6a678d5_1
  • libgcc=15.2.0=h69a1729_7
  • libgcc-ng=15.2.0=h166f726_7
  • libgomp=15.2.0=h4751f2c_7
  • libnsl=2.0.0=h5eee18b_0
  • libstdcxx=15.2.0=h39759b7_7
  • libstdcxx-ng=15.2.0=hc03a8fd_7
  • libuuid=1.41.5=h5eee18b_0
  • libxcb=1.17.0=h9b100fa_0
  • libzlib=1.3.1=hb25bd0a_0
  • ncurses=6.5=h7934f7d_0
  • openssl=3.0.18=hd6dcaed_0
  • pip=25.2=pyhc872135_1
  • pthread-stubs=0.3=h0ce48e5_1
  • python=3.10.19=h6fa692b_0
  • readline=8.3=hc2a1206_0
  • sqlite=3.51.0=h2a70700_0
  • tk=8.6.15=h54e0aa7_0
  • wheel=0.45.1=py310h06a4308_0
  • xorg-libx11=1.8.12=h9b100fa_1
  • xorg-libxau=1.0.12=h9b100fa_0
  • xorg-libxdmcp=1.1.5=h9b100fa_0
  • xorg-xorgproto=2024.1=h5eee18b_1
  • xz=5.6.4=h5eee18b_1
  • zlib=1.3.1=hb25bd0a_0
  • pip:
    • absl-py==2.3.1
    • accelerate==1.11.0
    • addict==2.4.0
    • aiofiles==24.1.0
    • aiohappyeyeballs==2.6.1
    • aiohttp==3.13.2
    • aiosignal==1.4.0
    • airportsdata==20250909
    • aliyun-python-sdk-core==2.16.0
    • aliyun-python-sdk-kms==2.16.5
    • annotated-doc==0.0.4
    • annotated-types==0.7.0
    • antlr4-python3-runtime==4.9.3
    • anyio==4.11.0
    • apex==0.1
    • astor==0.8.1
    • async-timeout==5.0.1
    • attrdict==2.0.1
    • attrs==25.4.0
    • audioread==3.1.0
    • av==16.0.1
    • binpacking==1.5.2
    • blake3==1.0.8
    • brotli==1.2.0
    • cachetools==6.2.1
    • cbor2==5.7.1
    • certifi==2025.10.5
    • cffi==2.0.0
    • charset-normalizer==3.4.4
    • click==8.2.1
    • cloudpickle==3.1.2
    • cmake==4.1.2
    • compressed-tensors==0.11.0
    • contourpy==1.3.2
    • cpm-kernels==1.0.11
    • crcmod==1.7
    • cryptography==46.0.3
    • cupy-cuda12x==13.6.0
    • cycler==0.12.1
    • dacite==1.9.2
    • datasets==3.6.0
    • decorator==5.2.1
    • decord==0.6.0
    • deepspeed==0.18.2
    • depyf==0.19.0
    • dill==0.3.8
    • diskcache==5.6.3
    • distro==1.9.0
    • dnspython==2.8.0
    • dora-search==0.1.12
    • einops==0.8.1
    • email-validator==2.3.0
    • exceptiongroup==1.3.0
    • fastapi==0.121.1
    • fastapi-cli==0.0.16
    • fastapi-cloud-cli==0.3.1
    • fastrlock==0.8.3
    • ffmpy==1.0.0
    • filelock==3.20.0
    • flash-attn==2.8.3
    • fonttools==4.60.1
    • frozendict==2.4.7
    • frozenlist==1.8.0
    • fsspec==2025.3.0
    • future==1.0.0
    • gguf==0.17.1
    • gitdb==4.0.12
    • gitpython==3.1.45
    • gradio==6.0.0
    • gradio-client==2.0.0.dev3
    • groovy==0.1.2
    • grpcio==1.76.0
    • h11==0.16.0
    • hf-xet==1.2.0
    • hjson==3.1.0
    • httpcore==1.0.9
    • httptools==0.7.1
    • httpx==0.28.1
    • huggingface-hub==0.36.0
    • idna==3.11
    • importlib-metadata==8.7.0
    • interegular==0.3.3
    • jieba==0.42.1
    • jinja2==3.1.6
    • jiter==0.12.0
    • jmespath==0.10.0
    • joblib==1.5.2
    • json-repair==0.54.1
    • jsonschema==4.25.1
    • jsonschema-specifications==2025.9.1
    • julius==0.2.7
    • kiwisolver==1.4.9
    • lameenc==1.8.1
    • lark==1.2.2
    • latex2sympy2-extended==1.10.2
    • lazy-loader==0.4
    • librosa==0.11.0
    • llguidance==0.7.30
    • llvmlite==0.44.0
    • lm-format-enforcer==0.11.3
    • markdown==3.10
    • markdown-it-py==4.0.0
    • markupsafe==3.0.3
    • math-verify==0.8.0
    • matplotlib==3.10.7
    • mdurl==0.1.2
    • megatron-core==0.14.1
    • mistral-common==1.8.5
    • ml-dtypes==0.5.4
    • modelscope==1.31.0
    • mpmath==1.3.0
    • msgpack==1.1.2
    • msgspec==0.19.0
    • multidict==6.7.0
    • multiprocess==0.70.16
    • nest-asyncio==1.6.0
    • networkx==3.4.2
    • ninja==1.13.0
    • nltk==3.9.2
    • numba==0.61.2
    • numpy==1.26.4
    • nvidia-cublas-cu12==12.8.4.1
    • nvidia-cuda-cupti-cu12==12.8.90
    • nvidia-cuda-nvrtc-cu12==12.8.93
    • nvidia-cuda-runtime-cu12==12.8.90
    • nvidia-cudnn-cu12==9.10.2.21
    • nvidia-cufft-cu12==11.3.3.83
    • nvidia-cufile-cu12==1.13.1.3
    • nvidia-curand-cu12==10.3.9.90
    • nvidia-cusolver-cu12==11.7.3.90
    • nvidia-cusparse-cu12==12.5.8.93
    • nvidia-cusparselt-cu12==0.7.1
    • nvidia-nccl-cu12==2.27.3
    • nvidia-nvjitlink-cu12==12.8.93
    • nvidia-nvtx-cu12==12.8.90
    • omegaconf==2.3.0
    • onnx==1.19.1
    • onnx-ir==0.1.12
    • onnxscript==0.5.6
    • openai==2.8.1
    • openai-harmony==0.0.8
    • opencv-python-headless==4.12.0.88
    • openunmix==1.3.0
    • orjson==3.11.4
    • oss2==2.19.1
    • outlines==0.1.11
    • outlines-core==0.1.26
    • packaging==25.0
    • pandas==2.3.3
    • partial-json-parser==0.2.1.1.post6
    • peft==0.17.1
    • pillow==11.3.0
    • platformdirs==4.5.0
    • pooch==1.8.2
    • prometheus-client==0.23.1
    • prometheus-fastapi-instrumentator==7.1.0
    • propcache==0.4.1
    • protobuf==6.33.0
    • psutil==7.1.3
    • py-cpuinfo==9.0.0
    • pyarrow==22.0.0
    • pybase64==1.4.2
    • pybind11==3.0.1
    • pycountry==24.6.1
    • pycparser==2.23
    • pycryptodome==3.23.0
    • pydantic==2.12.4
    • pydantic-core==2.41.5
    • pydantic-extra-types==2.10.6
    • pydub==0.25.1
    • pygments==2.19.2
    • pyparsing==3.2.5
    • python-dateutil==2.9.0.post0
    • python-dotenv==1.2.1
    • python-json-logger==4.0.0
    • python-multipart==0.0.20
    • pytz==2025.2
    • pyyaml==6.0.3
    • pyzmq==27.1.0
    • qwen-omni-utils==0.0.8
    • ray==2.51.1
    • referencing==0.37.0
    • regex==2025.11.3
    • requests==2.32.5
    • retrying==1.4.2
    • rich==14.2.0
    • rich-toolkit==0.15.1
    • rignore==0.7.6
    • rouge==1.0.1
    • rpds-py==0.28.0
    • safehttpx==0.1.7
    • safetensors==0.6.2
    • scikit-learn==1.7.2
    • scipy==1.15.3
    • semantic-version==2.10.0
    • sentencepiece==0.2.1
    • sentry-sdk==2.43.0
    • setproctitle==1.3.7
    • setuptools==79.0.1
    • setuptools-scm==9.2.2
    • shellingham==1.5.4
    • simplejson==3.20.2
    • six==1.17.0
    • smmap==5.0.2
    • sniffio==1.3.1
    • socksio==1.0.0
    • sortedcontainers==2.4.0
    • soundfile==0.13.1
    • soxr==1.0.0
    • starlette==0.49.3
    • submitit==1.5.3
    • sympy==1.14.0
    • tensorboard==2.20.0
    • tensorboard-data-server==0.7.2
    • threadpoolctl==3.6.0
    • tiktoken==0.12.0
    • tokenizers==0.22.1
    • tomli==2.3.0
    • tomlkit==0.13.3
    • torch==2.8.0
    • torchaudio==2.8.0
    • torchvision==0.23.0
    • tqdm==4.67.1
    • transformer-engine==2.9.0
    • transformer-engine-cu12==2.9.0
    • transformer-engine-torch==2.9.0
    • transformers==4.57.1
    • transformers-stream-generator==0.0.5
    • treetable==0.2.6
    • triton==3.4.0
    • trl==0.24.0
    • typer==0.20.0
    • typer-slim==0.20.0
    • typing-extensions==4.15.0
    • typing-inspection==0.4.2
    • tzdata==2025.2
    • urllib3==2.5.0
    • uv==0.9.8
    • uvicorn==0.38.0
    • uvloop==0.22.1
    • vllm==0.10.2
    • wandb==0.23.0
    • watchfiles==1.1.1
    • websockets==15.0.1
    • werkzeug==3.1.3
    • xformers==0.0.32.post1
    • xgrammar==0.1.23
    • xxhash==3.6.0
    • yarl==1.22.0
    • zipp==3.23.0
    • zstandard==0.25.0
prefix: /root/miniconda3/envs/qwen

ms_swift
Version: 3.11.0.dev0

Additional context
Add any other context about the problem here
I tried the various fixes suggested in https://github.com/modelscope/ms-swift/issues/6759, but none of them resolved the problem.
