GRPO on Qwen3-Omni hangs at the start of training #6882

**Describe the bug**
What the bug is and how to reproduce it, preferably with screenshots.

_Training script_

swift rlhf
    --rlhf_type grpo
    # Base model
    --model "${ACTOR_MODEL}"
    # Load the custom reward function
    --external_plugins orm.py
    --reward_funcs avqa
    # Use an external rollout vLLM server
    --use_vllm true
    --vllm_mode server
    --vllm_use_async_engine false
    # Training type: LoRA fine-tuning
    --train_type lora
    --lora_rank ${LORA_RANK}
    --lora_alpha ${LORA_ALPHA}
    --lora_dropout ${LORA_DROPOUT}
    --target_modules "${LORA_TARGET_MODULES}"
    --torch_dtype bfloat16
    # Dataset
    --dataset "${TRAIN_DATASET}"
    --load_from_cache_file false
    # Generation settings
    --max_completion_length 2048
    --temperature 0.9
    --repetition_penalty 1.1
    # Training settings
    # Parallel strategy modified; a single replica on 8 GPUs is kept unchanged
    --num_train_epochs 1
    --per_device_train_batch_size 1
    --gradient_accumulation_steps 1
    --gradient_checkpointing true
    --learning_rate "${LORA_LEARNING_RATE}"
    --warmup_ratio 0.05
    # GRPO
    --num_generations 8
    --num_iterations 1
    --beta 0.001
    --max_grad_norm 0.5
    --async_generate false
    # Checkpoint saving
    --save_strategy steps
    --save_steps 100
    --save_total_limit 5
    # Evaluation
    --eval_strategy steps
    --eval_steps 50
    --per_device_eval_batch_size 1
    # Logging
    --logging_steps 1
    --log_completions true
    # Output
    --output_dir "${OUTPUT_DIR}"
    # Data loading
    --dataloader_num_workers 0
    --dataset_num_proc 1
    # DeepSpeed ZeRO-3
    --deepspeed zero3_offload
    # Wandb: only rank 0 reports
    --report_to "${REPORT_TO}"
    # Vision encoder settings
    --freeze_vit true
    --freeze_aligner true
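
As pasted here, the command has comment lines between its arguments and no line continuations, so it would not run verbatim (these were probably lost when the script was pasted). For anyone trying to reproduce this, here is a minimal sketch, assuming bash, of one way to keep per-argument comments: collect the flags in an array and expand it at the end. Only a subset of the flags above is shown, and the default value for `ACTOR_MODEL` is a placeholder, not the path used in this report.

```shell
#!/usr/bin/env bash
# Sketch only: a subset of the flags from the training script above.
# Comment lines are allowed inside a bash array literal, unlike inside a
# backslash-continued command.
ACTOR_MODEL="${ACTOR_MODEL:-/path/to/model}"  # placeholder default (assumption)

args=(
    --rlhf_type grpo
    # base model
    --model "${ACTOR_MODEL}"
    # external rollout vLLM server
    --use_vllm true
    --vllm_mode server
    # GRPO
    --num_generations 8
)

# Dry run: print the command instead of launching it.
# Remove the leading `echo` to actually start training.
echo swift rlhf "${args[@]}"
```

Quoting the expansion as `"${args[@]}"` keeps each value as a single argument even if it contains spaces.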

_Rollout deployment script_

swift rollout
    --model "${ROLLOUT_MODEL}"
    --infer_backend vllm
    --host "${ROLLOUT_HOST}"
    --port "${ROLLOUT_PORT}"
    --vllm_tensor_parallel_size 8
    --vllm_gpu_memory_utilization 0.85
    --vllm_max_model_len 16384
    --vllm_enforce_eager true
    --vllm_enable_prefix_caching false
    --served_model_name "${ROLLOUT_SERVED_NAME}"

The training node log shows:
Train: 0%| | 0/25826 [00:00<?, ?it/s]

The rollout log shows the following. First:
INFO:     10.120.5.11:43236 - "POST /update_flattened_params/ HTTP/1.1" 200 OK

After that it printed several lines of:
qwen-vl-utils using decord to read video.
qwen-vl-utils using decord to read video.
qwen-vl-utils using decord to read video.
which suggests the rollout node may have received some requests?

Then it started a progress bar, and I cannot tell what it is loading:
  0%|          | 0/64 [00:00<?, ?it/s]
  2%|▏         | 1/64 [00:06<06:18,  6.01s/it]
  3%|▎         | 2/64 [00:06<02:40,  2.58s/it]
  5%|▍         | 3/64 [00:06<01:28,  1.46s/it]
  6%|▋         | 4/64 [00:06<00:56,  1.07it/s]
 11%|█         | 7/64 [00:06<00:21,  2.70it/s]
 14%|█▍        | 9/64 [00:06<00:15,  3.60it/s]
 17%|█▋        | 11/64 [00:07<00:13,  4.07it/s]
 19%|█▉        | 12/64 [00:07<00:11,  4.55it/s]
 23%|██▎       | 15/64 [00:07<00:06,  7.04it/s]
 27%|██▋       | 17/64 [00:07<00:06,  7.38it/s]
 31%|███▏      | 20/64 [00:07<00:05,  8.31it/s]
 36%|███▌      | 23/64 [00:08<00:03, 11.15it/s]
 41%|████      | 26/64 [00:08<00:02, 13.37it/s]
 44%|████▍     | 28/64 [00:08<00:04,  8.98it/s]
 48%|████▊     | 31/64 [00:08<00:02, 11.36it/s]
 52%|█████▏    | 33/64 [00:09<00:02, 10.94it/s]
 56%|█████▋    | 36/64 [00:09<00:02, 12.91it/s]
 59%|█████▉    | 38/64 [00:09<00:02, 12.46it/s]
 62%|██████▎   | 40/64 [00:09<00:01, 12.72it/s]
 67%|██████▋   | 43/64 [00:09<00:01, 13.47it/s]
 70%|███████   | 45/64 [00:10<00:02,  8.79it/s]
 73%|███████▎  | 47/64 [00:10<00:01, 10.15it/s]
 77%|███████▋  | 49/64 [00:10<00:01,  8.86it/s]
 80%|███████▉  | 51/64 [00:10<00:01, 10.39it/s]
 83%|████████▎ | 53/64 [00:10<00:01,  8.88it/s]
 86%|████████▌ | 55/64 [00:11<00:01,  7.77it/s]
 88%|████████▊ | 56/64 [00:11<00:01,  4.53it/s]
===== Hung here; no update for at least 12 hours =====

**Your hardware and system info**
Write your system info here, such as CUDA version, OS, GPU model, and torch version.

GPU: the swift rlhf nodes are 2*8*H100; the rollout node is 1*8*H100

Environment:
name: qwen
channels:
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - _openmp_mutex=5.1=1_gnu
  - bzip2=1.0.8=h5eee18b_6
  - ca-certificates=2025.11.4=h06a4308_0
  - expat=2.7.3=h3385a95_0
  - ld_impl_linux-64=2.44=h153f514_2
  - libffi=3.4.4=h6a678d5_1
  - libgcc=15.2.0=h69a1729_7
  - libgcc-ng=15.2.0=h166f726_7
  - libgomp=15.2.0=h4751f2c_7
  - libnsl=2.0.0=h5eee18b_0
  - libstdcxx=15.2.0=h39759b7_7
  - libstdcxx-ng=15.2.0=hc03a8fd_7
  - libuuid=1.41.5=h5eee18b_0
  - libxcb=1.17.0=h9b100fa_0
  - libzlib=1.3.1=hb25bd0a_0
  - ncurses=6.5=h7934f7d_0
  - openssl=3.0.18=hd6dcaed_0
  - pip=25.2=pyhc872135_1
  - pthread-stubs=0.3=h0ce48e5_1
  - python=3.10.19=h6fa692b_0
  - readline=8.3=hc2a1206_0
  - sqlite=3.51.0=h2a70700_0
  - tk=8.6.15=h54e0aa7_0
  - wheel=0.45.1=py310h06a4308_0
  - xorg-libx11=1.8.12=h9b100fa_1
  - xorg-libxau=1.0.12=h9b100fa_0
  - xorg-libxdmcp=1.1.5=h9b100fa_0
  - xorg-xorgproto=2024.1=h5eee18b_1
  - xz=5.6.4=h5eee18b_1
  - zlib=1.3.1=hb25bd0a_0
  - pip:
      - absl-py==2.3.1
      - accelerate==1.11.0
      - addict==2.4.0
      - aiofiles==24.1.0
      - aiohappyeyeballs==2.6.1
      - aiohttp==3.13.2
      - aiosignal==1.4.0
      - airportsdata==20250909
      - aliyun-python-sdk-core==2.16.0
      - aliyun-python-sdk-kms==2.16.5
      - annotated-doc==0.0.4
      - annotated-types==0.7.0
      - antlr4-python3-runtime==4.9.3
      - anyio==4.11.0
      - apex==0.1
      - astor==0.8.1
      - async-timeout==5.0.1
      - attrdict==2.0.1
      - attrs==25.4.0
      - audioread==3.1.0
      - av==16.0.1
      - binpacking==1.5.2
      - blake3==1.0.8
      - brotli==1.2.0
      - cachetools==6.2.1
      - cbor2==5.7.1
      - certifi==2025.10.5
      - cffi==2.0.0
      - charset-normalizer==3.4.4
      - click==8.2.1
      - cloudpickle==3.1.2
      - cmake==4.1.2
      - compressed-tensors==0.11.0
      - contourpy==1.3.2
      - cpm-kernels==1.0.11
      - crcmod==1.7
      - cryptography==46.0.3
      - cupy-cuda12x==13.6.0
      - cycler==0.12.1
      - dacite==1.9.2
      - datasets==3.6.0
      - decorator==5.2.1
      - decord==0.6.0
      - deepspeed==0.18.2
      - depyf==0.19.0
      - dill==0.3.8
      - diskcache==5.6.3
      - distro==1.9.0
      - dnspython==2.8.0
      - dora-search==0.1.12
      - einops==0.8.1
      - email-validator==2.3.0
      - exceptiongroup==1.3.0
      - fastapi==0.121.1
      - fastapi-cli==0.0.16
      - fastapi-cloud-cli==0.3.1
      - fastrlock==0.8.3
      - ffmpy==1.0.0
      - filelock==3.20.0
      - flash-attn==2.8.3
      - fonttools==4.60.1
      - frozendict==2.4.7
      - frozenlist==1.8.0
      - fsspec==2025.3.0
      - future==1.0.0
      - gguf==0.17.1
      - gitdb==4.0.12
      - gitpython==3.1.45
      - gradio==6.0.0
      - gradio-client==2.0.0.dev3
      - groovy==0.1.2
      - grpcio==1.76.0
      - h11==0.16.0
      - hf-xet==1.2.0
      - hjson==3.1.0
      - httpcore==1.0.9
      - httptools==0.7.1
      - httpx==0.28.1
      - huggingface-hub==0.36.0
      - idna==3.11
      - importlib-metadata==8.7.0
      - interegular==0.3.3
      - jieba==0.42.1
      - jinja2==3.1.6
      - jiter==0.12.0
      - jmespath==0.10.0
      - joblib==1.5.2
      - json-repair==0.54.1
      - jsonschema==4.25.1
      - jsonschema-specifications==2025.9.1
      - julius==0.2.7
      - kiwisolver==1.4.9
      - lameenc==1.8.1
      - lark==1.2.2
      - latex2sympy2-extended==1.10.2
      - lazy-loader==0.4
      - librosa==0.11.0
      - llguidance==0.7.30
      - llvmlite==0.44.0
      - lm-format-enforcer==0.11.3
      - markdown==3.10
      - markdown-it-py==4.0.0
      - markupsafe==3.0.3
      - math-verify==0.8.0
      - matplotlib==3.10.7
      - mdurl==0.1.2
      - megatron-core==0.14.1
      - mistral-common==1.8.5
      - ml-dtypes==0.5.4
      - modelscope==1.31.0
      - mpmath==1.3.0
      - msgpack==1.1.2
      - msgspec==0.19.0
      - multidict==6.7.0
      - multiprocess==0.70.16
      - nest-asyncio==1.6.0
      - networkx==3.4.2
      - ninja==1.13.0
      - nltk==3.9.2
      - numba==0.61.2
      - numpy==1.26.4
      - nvidia-cublas-cu12==12.8.4.1
      - nvidia-cuda-cupti-cu12==12.8.90
      - nvidia-cuda-nvrtc-cu12==12.8.93
      - nvidia-cuda-runtime-cu12==12.8.90
      - nvidia-cudnn-cu12==9.10.2.21
      - nvidia-cufft-cu12==11.3.3.83
      - nvidia-cufile-cu12==1.13.1.3
      - nvidia-curand-cu12==10.3.9.90
      - nvidia-cusolver-cu12==11.7.3.90
      - nvidia-cusparse-cu12==12.5.8.93
      - nvidia-cusparselt-cu12==0.7.1
      - nvidia-nccl-cu12==2.27.3
      - nvidia-nvjitlink-cu12==12.8.93
      - nvidia-nvtx-cu12==12.8.90
      - omegaconf==2.3.0
      - onnx==1.19.1
      - onnx-ir==0.1.12
      - onnxscript==0.5.6
      - openai==2.8.1
      - openai-harmony==0.0.8
      - opencv-python-headless==4.12.0.88
      - openunmix==1.3.0
      - orjson==3.11.4
      - oss2==2.19.1
      - outlines==0.1.11
      - outlines-core==0.1.26
      - packaging==25.0
      - pandas==2.3.3
      - partial-json-parser==0.2.1.1.post6
      - peft==0.17.1
      - pillow==11.3.0
      - platformdirs==4.5.0
      - pooch==1.8.2
      - prometheus-client==0.23.1
      - prometheus-fastapi-instrumentator==7.1.0
      - propcache==0.4.1
      - protobuf==6.33.0
      - psutil==7.1.3
      - py-cpuinfo==9.0.0
      - pyarrow==22.0.0
      - pybase64==1.4.2
      - pybind11==3.0.1
      - pycountry==24.6.1
      - pycparser==2.23
      - pycryptodome==3.23.0
      - pydantic==2.12.4
      - pydantic-core==2.41.5
      - pydantic-extra-types==2.10.6
      - pydub==0.25.1
      - pygments==2.19.2
      - pyparsing==3.2.5
      - python-dateutil==2.9.0.post0
      - python-dotenv==1.2.1
      - python-json-logger==4.0.0
      - python-multipart==0.0.20
      - pytz==2025.2
      - pyyaml==6.0.3
      - pyzmq==27.1.0
      - qwen-omni-utils==0.0.8
      - ray==2.51.1
      - referencing==0.37.0
      - regex==2025.11.3
      - requests==2.32.5
      - retrying==1.4.2
      - rich==14.2.0
      - rich-toolkit==0.15.1
      - rignore==0.7.6
      - rouge==1.0.1
      - rpds-py==0.28.0
      - safehttpx==0.1.7
      - safetensors==0.6.2
      - scikit-learn==1.7.2
      - scipy==1.15.3
      - semantic-version==2.10.0
      - sentencepiece==0.2.1
      - sentry-sdk==2.43.0
      - setproctitle==1.3.7
      - setuptools==79.0.1
      - setuptools-scm==9.2.2
      - shellingham==1.5.4
      - simplejson==3.20.2
      - six==1.17.0
      - smmap==5.0.2
      - sniffio==1.3.1
      - socksio==1.0.0
      - sortedcontainers==2.4.0
      - soundfile==0.13.1
      - soxr==1.0.0
      - starlette==0.49.3
      - submitit==1.5.3
      - sympy==1.14.0
      - tensorboard==2.20.0
      - tensorboard-data-server==0.7.2
      - threadpoolctl==3.6.0
      - tiktoken==0.12.0
      - tokenizers==0.22.1
      - tomli==2.3.0
      - tomlkit==0.13.3
      - torch==2.8.0
      - torchaudio==2.8.0
      - torchvision==0.23.0
      - tqdm==4.67.1
      - transformer-engine==2.9.0
      - transformer-engine-cu12==2.9.0
      - transformer-engine-torch==2.9.0
      - transformers==4.57.1
      - transformers-stream-generator==0.0.5
      - treetable==0.2.6
      - triton==3.4.0
      - trl==0.24.0
      - typer==0.20.0
      - typer-slim==0.20.0
      - typing-extensions==4.15.0
      - typing-inspection==0.4.2
      - tzdata==2025.2
      - urllib3==2.5.0
      - uv==0.9.8
      - uvicorn==0.38.0
      - uvloop==0.22.1
      - vllm==0.10.2
      - wandb==0.23.0
      - watchfiles==1.1.1
      - websockets==15.0.1
      - werkzeug==3.1.3
      - xformers==0.0.32.post1
      - xgrammar==0.1.23
      - xxhash==3.6.0
      - yarl==1.22.0
      - zipp==3.23.0
      - zstandard==0.25.0
prefix: /root/miniconda3/envs/qwen

ms_swift
Version: 3.11.0.dev0


**Additional context**
Add any other context about the problem here.
I tried the various fixes suggested in https://github.com/modelscope/ms-swift/issues/6759, but none of them resolved the problem.
