
Bug: Training Qwen-2.5-omni with GRPO and multimodal (audio+video) input results in 0 loss and empty output #6069

@cheliu-computation

Description


Describe the bug

When fine-tuning the Qwen-2.5-omni model using the GRPO algorithm on a dataset with both audio and visual inputs from the same video source, the training metrics show no signs of learning. The loss, grad_norm, and reward all remain at 0.0 from the very first step.

Consequently, when the model is prompted to generate output during or after this training run, it produces an empty string. This behavior points to a fundamental issue in how the data is processed, how gradients are calculated, or how the reward is computed for this specific multimodal setup.
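
One quick way to narrow this down is to call the reward function directly on a known-good completion, outside the trainer, and confirm it returns non-zero values. The sketch below is hypothetical: `box_match_reward` and the `solution` column are stand-ins for the custom BoxMatch reward used in this run, and the signature (a list of completion strings plus dataset columns passed as keyword arguments) is an assumption, not taken from the MS-Swift docs.

```python
# Hypothetical standalone sanity check for a custom BoxMatch-style reward.
# The function name, signature, and "solution" column are assumptions; adapt
# them to however the reward is actually registered in your setup.

def box_match_reward(completions, solution, **kwargs):
    """Toy stand-in: reward 1.0 if the ground-truth box string appears verbatim."""
    return [1.0 if sol in comp else 0.0 for comp, sol in zip(completions, solution)]

# A completion that should obviously score > 0 against its ground truth.
completions = ["The object is at <box>[12, 34, 56, 78]</box>."]
solution = ["<box>[12, 34, 56, 78]</box>"]

print(box_match_reward(completions, solution=solution))
# Expected [1.0]; if the real reward still returns 0.0 on a case like this,
# the problem is in reward parsing rather than in the training loop itself.
```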

To Reproduce

Steps to reproduce the behavior:

  1. Set up a training environment using the MS-Swift framework with the versions specified below.

  2. Configure the training script to use qwen-2.5-omni as the base model.

  3. Set the fine-tuning algorithm to sft_type: 'grpo'.

  4. Use a custom dataset where each sample contains paired video frames and audio waveforms from a single video (a sketch of one such sample follows this list).

  5. Launch the fine-tuning process.

  6. Observe the training logs, which will display the metrics as shown below.

  7. Attempt to run inference with the saved checkpoints, which will result in an empty string output.
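
For reference, each training sample follows roughly the layout below. This is a minimal sketch assuming MS-Swift's multimodal JSONL format, where `<video>`/`<audio>` tags in the message content map positionally to the `videos`/`audios` path lists; the paths, question, and answer are illustrative only.

```python
# Minimal sketch of one training sample in an assumed MS-Swift multimodal
# JSONL layout. Paths, question, and answer text are placeholders.
import json

sample = {
    "messages": [
        {"role": "user",
         "content": "<video><audio>Where is the speaker located in the scene?"},
        {"role": "assistant",
         "content": "<box>[120, 80, 340, 260]</box>"},
    ],
    "videos": ["clips/clip_0001.mp4"],  # frames extracted from the source video
    "audios": ["clips/clip_0001.wav"],  # waveform from the same video
}

with open("train.jsonl", "w") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```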

Expected behavior

The model should learn from the multimodal inputs. Specifically:

  • The loss should be a non-zero value and should decrease as training progresses.

  • The reward metric for GRPO should be a meaningful, non-zero value.

  • The grad_norm should be non-zero, indicating that gradients are flowing.

  • The model should be able to generate relevant, non-empty text when prompted.

Training Logs

The following log is consistently produced at each step:

```
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 4e-08, 'reward': 0.0, 'reward_std': 0.0, 'frac_reward_zero_std': 1.0, 'rewards/BoxMatch/mean': 0.0, 'rewards/BoxMatch/std': 0.0, 'completions/mean_length': 1699.9375, 'completions/min_length': 2.0, 'completions/max_length': 4097.0, 'completions/clipped_ratio': 0.375, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0, 'clip_ratio/high_max': 0.0, 'clip_ratio/region_mean': 0.0, 'epoch': 0.0, 'global_step/max_steps': '1/2219', 'percentage': '0.05%', 'elapsed_time': '1m 50s', 'remaining_time': '2d 19h 56m 55s', 'memory(GiB)': 24.14, 'train_speed(iter/s)': 0.009067}
```
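
The `frac_reward_zero_std: 1.0` entry is the telling symptom: every sampled group has identical (here, all-zero) rewards, so the group-normalized advantages are zero and the GRPO policy-gradient term vanishes, which matches loss and grad_norm staying at 0.0. A minimal sketch of that arithmetic, assuming the usual group-relative advantage formula:

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Group-relative advantages as in GRPO: (r - mean) / (std + eps) per group."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# One prompt, 8 sampled completions, every BoxMatch reward is 0.0 (as in the log).
rewards = torch.zeros(1, 8)
print(group_advantages(rewards))
# All zeros -> the per-token policy-gradient loss is 0 and grad_norm is 0.
```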

Environment

Component | Version
-- | --
MS-Swift | Current
vLLM | 0.11.0
Python | 3.12
GPU/Hardware | H100 80GB
CUDA Version | 12.8
PyTorch Version | 2.8.0

Additional context

[Image attachment from the original report]

This issue raises a few questions:

Is the combination of Qwen-2.5-omni, GRPO, and simultaneous audio and vision inputs officially supported in the current version of MS-Swift? The observed behavior suggests there might be an incompatibility.

Could this issue also affect the upcoming Qwen3-omni model, or is it specific to Qwen-2.5-omni?

Any insights or potential workarounds would be greatly appreciated.
