feat: support audio input for minicpm-o-4_5 #9147
Code Review
This pull request introduces the MiniCPMO4_5Template class, which extends MiniCPMV4_5Template to support audio and video processing. Key additions include audio feature extraction using Whisper, interleaved image and audio placeholder handling for video inputs, and specialized encoding and data collation logic for multimodal data. The review feedback identifies that temporal_ids are missing from both the _encode output and the _data_collator gathering process, which are necessary for the model's vision processing.
| 'loss_scale': loss_scale, | ||
| 'image_bound': image_bound, | ||
| 'pixel_values': image_inputs['pixel_values'], | ||
| 'tgt_sizes': image_inputs['tgt_sizes'], |
The encoded dictionary is missing the temporal_ids key, which is present in the parent MiniCPMV4_5Template._encode implementation. This key is required for correct video and image slice processing in the model.
| 'tgt_sizes': image_inputs['tgt_sizes'], | |
| 'tgt_sizes': image_inputs['tgt_sizes'], | |
| 'temporal_ids': image_inputs.get('temporal_ids'), |
| for k in ['pixel_values', 'image_bound', 'tgt_sizes']: | ||
| res[k] = self.gather_list(batch, k) |
The _data_collator should also gather temporal_ids from the batch, as they are part of the vision data for this model architecture. This ensures consistency with the parent MiniCPMV4_5Template implementation.
| for k in ['pixel_values', 'image_bound', 'tgt_sizes']: | |
| res[k] = self.gather_list(batch, k) | |
| # Vision data | |
| for k in ['pixel_values', 'image_bound', 'tgt_sizes', 'temporal_ids']: | |
| res[k] = self.gather_list(batch, k) |
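The effect of adding temporal_ids to the gathered keys can be illustrated with a minimal, self-contained sketch. The gather_list function below is a hypothetical stand-in written for this example, not the template's actual method; it assumes each sample stores its vision data as a list (or None when absent), and the batch values are placeholder strings rather than real tensors.

```python
from typing import Any, Dict, List

# Keys gathered for vision data, including the suggested 'temporal_ids'.
VISION_KEYS = ['pixel_values', 'image_bound', 'tgt_sizes', 'temporal_ids']


def gather_list(batch: List[Dict[str, Any]], key: str) -> List[Any]:
    """Hypothetical stand-in: concatenate per-sample lists for `key`,
    skipping samples where the key is missing or None."""
    gathered: List[Any] = []
    for sample in batch:
        value = sample.get(key)
        if value is not None:
            gathered.extend(value)
    return gathered


def collate_vision(batch: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Collect every vision key across the batch into flat lists."""
    return {key: gather_list(batch, key) for key in VISION_KEYS}


# Placeholder samples: the second one has no temporal_ids, as would happen
# if the encoder used image_inputs.get('temporal_ids') and got None.
batch = [
    {'pixel_values': ['p0'], 'image_bound': ['b0'], 'tgt_sizes': ['t0'], 'temporal_ids': [[0, 1]]},
    {'pixel_values': ['p1'], 'image_bound': ['b1'], 'tgt_sizes': ['t1'], 'temporal_ids': None},
]
res = collate_vision(batch)
```

Under these assumptions, samples without temporal_ids are simply skipped, so the key is always present in the collated result even when it is empty.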
Thanks for the PR. Please take a look at Gemini's review suggestions first and see if any code changes are needed.
| inputs.audios[index] = load_audio(inputs.audios[index], sampling_rate=self.SAMPLING_RATE) | ||
| return ['<|audio_start|><|audio_end|>'] | ||
| elif media_type == 'video': | ||
| from minicpmo.utils import get_video_frame_audio_segments |
It is installed directly with pip install minicpmo-utils; the documentation has been updated accordingly.
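Since the video branch imports from minicpmo.utils lazily, one option is to wrap the import so that a missing package produces an actionable message. The helper below is a sketch only: require_module is a hypothetical name introduced here, not part of the PR, and the commented usage line assumes the module path shown in the diff above.

```python
import importlib
from types import ModuleType


def require_module(module_name: str, pip_name: str) -> ModuleType:
    """Import `module_name`, or raise an ImportError that names the pip package."""
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        raise ImportError(
            f'{module_name} is required for this feature. '
            f'Install it with: pip install {pip_name}') from e


# Hypothetical usage inside the video branch (requires the package to be installed):
# utils = require_module('minicpmo.utils', 'minicpmo-utils')
# get_video_frame_audio_segments = utils.get_video_frame_audio_segments
```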
Thanks! LGTM

PR type
PR information
Support audio input for MiniCPM-O-4_5. The implementation was written with the help of Qoder, which read modeling_minicpmo.py and processing_minicpmo.py and implemented the MiniCPMO4_5Template.
Experiment results
I verified that the implementation is consistent with the officially provided script. The script is as follows:
script