
Bug: Training Qwen-2.5-omni with GRPO and multimodal (audio+video) input results in 0 loss and empty output #6069

@cheliu-computation

Description


Describe the bug

When fine-tuning the Qwen-2.5-omni model using the GRPO algorithm on a dataset with both audio and visual inputs from the same video source, the training metrics show no signs of learning. The loss, grad_norm, and reward all remain at 0.0 from the very first step.

Consequently, when the model is prompted to generate output during or after this training run, it produces an empty string. This behavior points to a fundamental issue in how the data is processed, how gradients are calculated, or how the reward is computed for this specific multimodal setup.
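
One quick way to narrow this down is to call the reward function directly on a known-good completion, outside the trainer, and confirm it returns non-zero values. The sketch below is hypothetical: `box_match_reward` and the `solution` column are stand-ins for the custom BoxMatch reward used in this run, and the signature (a list of completion strings plus dataset columns passed as keyword arguments) is an assumption, not taken from the MS-Swift docs.

```python
# Hypothetical standalone sanity check for a custom BoxMatch-style reward.
# The function name, signature, and "solution" column are assumptions; adapt
# them to however the reward is actually registered in your setup.

def box_match_reward(completions, solution, **kwargs):
    """Toy stand-in: reward 1.0 if the ground-truth box string appears verbatim."""
    return [1.0 if sol in comp else 0.0 for comp, sol in zip(completions, solution)]

# A completion that should obviously score > 0 against its ground truth.
completions = ["The object is at <box>[12, 34, 56, 78]</box>."]
solution = ["<box>[12, 34, 56, 78]</box>"]

print(box_match_reward(completions, solution=solution))
# Expected [1.0]; if the real reward still returns 0.0 on a case like this,
# the problem is in reward parsing rather than in the training loop itself.
```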

To Reproduce

Steps to reproduce the behavior:

  1. Set up a training environment using the MS-Swift framework with the versions specified below.

  2. Configure the training script to use qwen-2.5-omni as the base model.

  3. Set the fine-tuning algorithm to sft_type: 'grpo'.

  4. Use a custom dataset where each sample contains paired video frames and audio waveforms from a single video (a sketch of one such sample follows this list).

  5. Launch the fine-tuning process.

  6. Observe the training logs, which will display the metrics as shown below.

  7. Attempt to run inference with the saved checkpoints, which will result in an empty string output.
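
For reference, each training sample follows roughly the layout below. This is a minimal sketch assuming MS-Swift's multimodal JSONL format, where `<video>`/`<audio>` tags in the message content map positionally to the `videos`/`audios` path lists; the paths, question, and answer are illustrative only.

```python
# Minimal sketch of one training sample in an assumed MS-Swift multimodal
# JSONL layout. Paths, question, and answer text are placeholders.
import json

sample = {
    "messages": [
        {"role": "user",
         "content": "<video><audio>Where is the speaker located in the scene?"},
        {"role": "assistant",
         "content": "<box>[120, 80, 340, 260]</box>"},
    ],
    "videos": ["clips/clip_0001.mp4"],  # frames extracted from the source video
    "audios": ["clips/clip_0001.wav"],  # waveform from the same video
}

with open("train.jsonl", "w") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```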

Expected behavior

The model should learn from the multimodal inputs. Specifically:

  • The loss should be a non-zero value and should decrease as training progresses.

  • The reward metric for GRPO should be a meaningful, non-zero value.

  • The grad_norm should be non-zero, indicating that gradients are flowing.

  • The model should be able to generate relevant, non-empty text when prompted.

Training Logs

The following log is consistently produced at each step:

```
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 4e-08, 'reward': 0.0, 'reward_std': 0.0, 'frac_reward_zero_std': 1.0, 'rewards/BoxMatch/mean': 0.0, 'rewards/BoxMatch/std': 0.0, 'completions/mean_length': 1699.9375, 'completions/min_length': 2.0, 'completions/max_length': 4097.0, 'completions/clipped_ratio': 0.375, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0, 'clip_ratio/high_max': 0.0, 'clip_ratio/region_mean': 0.0, 'epoch': 0.0, 'global_step/max_steps': '1/2219', 'percentage': '0.05%', 'elapsed_time': '1m 50s', 'remaining_time': '2d 19h 56m 55s', 'memory(GiB)': 24.14, 'train_speed(iter/s)': 0.009067}
```
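
The `frac_reward_zero_std: 1.0` entry is the telling symptom: every sampled group has identical (here, all-zero) rewards, so the group-normalized advantages are zero and the GRPO policy-gradient term vanishes, which matches loss and grad_norm staying at 0.0. A minimal sketch of that arithmetic, assuming the usual group-relative advantage formula:

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Group-relative advantages as in GRPO: (r - mean) / (std + eps) per group."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# One prompt, 8 sampled completions, every BoxMatch reward is 0.0 (as in the log).
rewards = torch.zeros(1, 8)
print(group_advantages(rewards))
# All zeros -> the per-token policy-gradient loss is 0 and grad_norm is 0.
```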

Environment

Component | Version
-- | --
MS-Swift | Current
vLLM | 0.11.0
Python | 3.12
GPU/Hardware | H100 80GB
CUDA Version | 12.8
PyTorch Version | 2.8.0

Additional context

[Image attachment from the original report]

This issue raises a few questions:

Is the combination of Qwen-2.5-omni, GRPO, and simultaneous audio and vision inputs officially supported in the current version of MS-Swift? The observed behavior suggests there might be an incompatibility.

Could this issue also affect the upcoming Qwen3-omni model, or is it specific to Qwen-2.5-omni?

Any insights or potential workarounds would be greatly appreciated.
