https://medium.com/@prdeepak.babu/the-rise-of-multimodal-large-speech-language-models-4fc5ea34d04f

Qwen-Audio is a foundational audio model capable of handling diverse audio types (human speech, natural sounds, music, and songs) and audio tasks (ASR, acoustic scene classification, etc.). The model is trained on over 30 diverse audio tasks, such as audio classification, speech recognition, and emotion recognition, and the authors demonstrate that Qwen-Audio beats SoTA models across varied tasks, indicating strong zero-shot performance. They further show that word-level timestamp prediction (the SRWT task) improves grounding and grounding-based QA tasks beyond speech signals, as well as ASR. Architecturally, Qwen-Audio pairs Whisper-large-v2 (about 640M parameters) as the audio encoder with the decoder-only Qwen-7B LLM as its core language component. The authors also modify the prompt tags to hierarchically organize tasks and datasets, preserving the gains from knowledge sharing while avoiding interference across tasks. Finally, they train Qwen-Audio-Chat by supervised instruction fine-tuning of Qwen-Audio on 30+ tasks and datasets; the chat model supports multi-turn dialogues.
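
To make the hierarchical prompt-tag idea concrete, here is a minimal Python sketch of how a multi-task training target might be assembled from shared tags. The tag names and the `build_target` helper are illustrative assumptions in the spirit of Whisper-style tags, not the exact format used in the paper.

```python
# Minimal sketch of hierarchical prompt tags for multi-task audio training.
# Tag names below are illustrative assumptions, not the exact tags from the
# Qwen-Audio paper: the idea is that shared tags (language, task, timestamps)
# let related tasks and datasets reinforce each other, while dataset-specific
# text stays separate to limit interference.

def build_target(language: str, task: str, use_timestamps: bool, text: str) -> str:
    tags = [
        "<|startoftranscript|>",                                  # common prefix for all audio tasks
        f"<|{language}|>",                                        # language tag shared across datasets
        f"<|{task}|>",                                            # task tag, e.g. transcribe / caption
        "<|timestamps|>" if use_timestamps else "<|notimestamps|>",
    ]
    return "".join(tags) + text + "<|endoftext|>"

# Example: an English ASR sample with word-level timestamps enabled.
print(build_target("en", "transcribe", True, "hello world"))
```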

Key contributions from the paper include:
[1] Introduces Qwen-Audio, a versatile multi-task audio-language model, alongside its extension Qwen-Audio-Chat for multi-turn dialogues, both of which are open-sourced to benefit the audio-text multimodal community.
[2] Develops a multi-task training framework that handles textual label variations across datasets, allowing knowledge sharing while reducing interference, with Qwen-Audio excelling in over 30 tasks.
[3] Demonstrates the importance of the SRWT (speech recognition with word-level timestamps) task for audio-language pre-training, showing improvements in grounding tasks, question answering, and ASR performance (see the sketch after this list).
[4] Shows that Qwen-Audio outperforms similar models on benchmark tasks without any task-specific fine-tuning, setting new records on the Aishell-1, CochlScene, ClothoAQA, and VocalSound datasets.
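
As a rough illustration of the SRWT idea referenced in [3], the sketch below formats a word-aligned transcript so each word is bracketed by its start and end times. The `<|t|>` token format and the `format_srwt` helper are hypothetical, intended only to show how word-level timestamps give the model something to ground on; the paper's actual token format may differ.

```python
# Hypothetical sketch of an SRWT-style (speech recognition with word-level
# timestamps) target string: each word is bracketed by its start and end
# times, which is what lets the model ground answers to regions of the audio.
# The token format here is an assumption for illustration only.

def format_srwt(words):
    """words: list of (word, start_sec, end_sec) tuples."""
    pieces = []
    for word, start, end in words:
        pieces.append(f"<|{start:.2f}|>{word}<|{end:.2f}|>")
    return "".join(pieces)

aligned = [("hello", 0.32, 0.61), ("world", 0.70, 1.05)]
print(format_srwt(aligned))
# -> <|0.32|>hello<|0.61|><|0.70|>world<|1.05|>
```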