Describe the bug
When fine-tuning the Qwen-2.5-omni model using the GRPO algorithm on a dataset with both audio and visual inputs from the same video source, the training metrics show no signs of learning. The loss, grad_norm, and reward all remain at 0.0 from the very first step.
Consequently, when the model is prompted during or after this training run, it produces an empty string. This suggests a fundamental issue in how the data is processed, how gradients are calculated, or how the reward is computed for this specific multimodal setup.
To Reproduce
Steps to reproduce the behavior:
1. Set up a training environment using the MS-Swift framework with the versions specified below.
2. Configure the training script to use `qwen-2.5-omni` as the base model.
3. Set the fine-tuning algorithm to `sft_type: 'grpo'`.
4. Use a custom dataset where each sample contains paired video frames and audio waveforms from a single video (a hedged sample record is sketched after these steps).
5. Launch the fine-tuning process.
6. Observe the training logs, which display the metrics shown below.
7. Attempt to run inference with the saved checkpoints, which results in an empty string output.
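Step 4 is where this setup differs most from the documented single-modality examples, so a concrete record may help. The snippet below is only a sketch of what one JSONL sample looks like, assuming MS-Swift's usual custom multimodal layout (a `messages` conversation with `<video>`/`<audio>` tags matched against top-level `videos`/`audios` path lists); the file paths, prompt text, and bounding-box ground truth are placeholders, not taken from the actual dataset.

```python
import json

# Hypothetical single training record, assuming MS-Swift's standard multimodal
# JSONL layout: one <video> and one <audio> tag in the user turn, each paired
# with a path in the top-level "videos" / "audios" lists. Both paths point at
# the same source clip, matching the paired-modality setup described above.
sample = {
    "messages": [
        {
            "role": "user",
            "content": "<video><audio>Locate the speaker and return the bounding box.",
        },
        {
            "role": "assistant",
            "content": "[120, 45, 380, 420]",
        },
    ],
    "videos": ["clips/clip_0001.mp4"],
    "audios": ["clips/clip_0001.wav"],
}

with open("train.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```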
Expected behavior
The expected behavior is that the model should learn from the multimodal inputs. Specifically:
- The `loss` should be a non-zero value and should decrease as training progresses.
- The `reward` metric for GRPO should be a meaningful, non-zero value.
- The `grad_norm` should be non-zero, indicating that gradients are flowing.
- The model should be able to generate relevant, non-empty text when prompted.
Training Logs
The following log is consistently produced at each step:
```
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 4e-08, 'reward': 0.0, 'reward_std': 0.0, 'frac_reward_zero_std': 1.0, 'rewards/BoxMatch/mean': 0.0, 'rewards/BoxMatch/std': 0.0, 'completions/mean_length': 1699.9375, 'completions/min_length': 2.0, 'completions/max_length': 4097.0, 'completions/clipped_ratio': 0.375, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0, 'clip_ratio/high_max': 0.0, 'clip_ratio/region_mean': 0.0, 'epoch': 0.0, 'global_step/max_steps': '1/2219', 'percentage': '0.05%', 'elapsed_time': '1m 50s', 'remaining_time': '2d 19h 56m 55s', 'memory(GiB)': 24.14, 'train_speed(iter/s)': 0.009067}
```
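Reading the log, `frac_reward_zero_std` is 1.0 and the reward mean/std are both 0.0: every completion in each sampled group gets the same zero reward, so the group-normalized GRPO advantages are all zero, which makes the loss and grad_norm exactly 0.0 rather than merely small. One way I can think of to narrow this down is to call the reward function directly on a completion that should obviously score non-zero. The sketch below is hypothetical: `box_match_reward` only stands in for the custom BoxMatch reward used here (its real signature and parsing rules may differ), and the box-parsing/IoU logic is an assumption about what a box-matching reward typically does.

```python
import re


def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def box_match_reward(completion: str, solution: str) -> float:
    """Hypothetical stand-in for the custom BoxMatch reward: parse the first
    [x1, y1, x2, y2] box from each string and return their IoU."""
    pattern = r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]"
    pred, ref = re.search(pattern, completion), re.search(pattern, solution)
    if pred is None or ref is None:
        return 0.0  # unparsable completions silently collapse the reward to 0
    return iou([int(v) for v in pred.groups()], [int(v) for v in ref.groups()])


# A completion that should obviously score > 0; if the real reward also
# returns 0.0 here, the problem is in the reward parsing / ground-truth path.
print(box_match_reward("The speaker is at [120, 45, 380, 420].", "[120, 45, 380, 420]"))
```

If the real reward also returns 0.0 on an obviously correct completion, the issue lies in reward parsing or in how the ground truth is passed to it; if it returns a sensible value offline, the zero rewards point instead at the rollout completions themselves (e.g. degenerate or truncated generations on this audio + video input path).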
Environment
Additional context
This issue raises a few questions:
Is the combination of Qwen-2.5-omni + GRPO + simultaneous audio + vision inputs officially supported in the current version of MS-Swift? The behavior suggests there might be an incompatibility.
Could this issue also affect the upcoming Qwen3-omni model, or is it specific to Qwen-2.5-omni?
Any insights or potential workarounds would be greatly appreciated.