prdeepakbabu/Qwen-AudioFM

Experimenting with the capabilities of the Audio FM series from Alibaba on 30+ audio tasks, using textual prompts in the form of instructions.

https://medium.com/@prdeepak.babu/the-rise-of-multimodal-large-speech-language-models-4fc5ea34d04f

Qwen-Audio is a foundational audio model capable of handling diverse audio types (human speech, natural sounds, music, and songs) and audio tasks (ASR, acoustic scene classification, etc.). The model is trained on more than 30 diverse audio tasks, such as audio classification, speech recognition, and emotion recognition. The authors show that Qwen-Audio beats SoTA models on varied tasks, indicating strong zero-shot performance, and further demonstrate that timestamp prediction improves grounding and grounding-based QA tasks beyond speech signals, as well as ASR. Qwen-Audio is built from whisper-large-v2 (640M parameters) as the audio encoder and the decoder-only Qwen-7B LLM as the core component. The authors also modify the prompt tags to hierarchically organize tasks and datasets, encouraging knowledge sharing while avoiding interference among tasks. Finally, they train Qwen-Audio-Chat by supervised instruction fine-tuning of Qwen-Audio on 30+ tasks and datasets; the chat model supports multi-turn dialogue.
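
As a quick way to try these instruction-style prompts, below is a minimal inference sketch. It assumes the Qwen/Qwen-Audio-Chat checkpoint published on Hugging Face and the chat interface it ships via trust_remote_code; the audio path is a placeholder for your own file.

```python
# Minimal sketch: instruction-prompted inference with Qwen-Audio-Chat.
# Assumes the Hugging Face checkpoint Qwen/Qwen-Audio-Chat and its
# trust_remote_code chat() interface; sample.wav is a placeholder path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(1234)  # for reproducible sampling

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen-Audio-Chat", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-Audio-Chat",
    device_map="auto",
    trust_remote_code=True,
).eval()

# Build a multimodal query: one audio clip plus a textual instruction.
query = tokenizer.from_list_format([
    {"audio": "sample.wav"},  # placeholder audio file
    {"text": "Transcribe the speech in this clip."},
])

# First turn (history=None); pass the returned history back in for
# follow-up turns to get multi-turn dialogue.
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```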

Key contributions from the paper include:
[1] Introduces Qwen-Audio, a versatile multi-task audio-language model, alongside its extension Qwen-Audio-Chat for multi-turn dialogue; both are open-sourced to benefit the audio-text multimodal community.
[2] Develops a multi-task training framework that handles textual label variations across datasets, allowing knowledge sharing while reducing interference, with Qwen-Audio excelling in over 30 tasks (see the illustrative sketch after this list).
[3] Demonstrates the importance of the SRWT (speech recognition with word-level timestamps) task for audio-language pre-training, showing improvements in grounding tasks, question answering, and ASR performance.
[4] Qwen-Audio outperforms similar models on benchmark tasks without task-specific fine-tuning, setting new records on the Aishell1, CochlScene, ClothoAQA, and VocalSound datasets.
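
For intuition on the hierarchical tag scheme, the sketch below assembles a Whisper-style conditioning prefix in the spirit of the paper's multi-task format. The exact tag strings are illustrative reconstructions from the paper's description, not verbatim tokens from the released tokenizer.

```python
# Illustrative only: a Whisper-style hierarchical task prefix in the spirit
# of Qwen-Audio's multi-task format. Tag strings are reconstructions from
# the paper's description, not verbatim tokens from the released model.
def build_task_prefix(task: str, audio_lang: str, text_lang: str,
                      timestamps: bool) -> str:
    start = ("<|startoftranscripts|>" if task in ("transcribe", "translate")
             else "<|startofanalysis|>")
    ts = "<|timestamps|>" if timestamps else "<|notimestamps|>"
    # Ordering: transcription tag -> audio language -> task -> text language
    # -> timestamp flag. Related tasks share prefix tokens (knowledge
    # sharing) while differing tags keep them apart (less interference).
    return f"{start}<|{audio_lang}|><|{task}|><|{text_lang}|>{ts}"

# An ASR example with word-level timestamps (the SRWT setting):
print(build_task_prefix("transcribe", "en", "en", timestamps=True))
# -> <|startoftranscripts|><|en|><|transcribe|><|en|><|timestamps|>
```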
