R1-AQA --- Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering

Introduction

R1-AQA is an audio question answering (AQA) model based on Qwen2-Audio-7B-Instruct, optimized through reinforcement learning (RL) using the group relative policy optimization (GRPO) algorithm. This implementation achieves state-of-the-art performance on the MMAU Test-mini benchmark with only 38k post-training samples.

Our main findings are as follows:

The GRPO algorithm can be directly and effectively applied to the audio modality, even to Qwen2-Audio-7B-Instruct with only 8.2B parameters.
With only 38k post-training samples, reinforcement learning outperforms supervised fine-tuning, indicating that RL-based approaches can be effective without large datasets.
The explicit reasoning process has not shown significant benefits for AQA tasks, and how to efficiently leverage deep thinking or step-by-step reasoning remains an open question for further research.
Large audio language models (LALMs) still lag far behind humans in auditory-language reasoning, suggesting that RL-based approaches warrant further exploration.
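
For readers new to GRPO, the central idea is that several responses are sampled for each question and every response's reward is normalized against the other responses in the same group, so no separate value model is required. The snippet below is a minimal sketch of that normalization only, not this repository's training code. The function name, the epsilon term, and the use of the population standard deviation are illustrative assumptions; implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses sampled for a single question.

    Assumption: `rewards` holds one scalar reward per sampled response,
    for example 1.0 for a correct choice and 0.0 otherwise.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    # eps keeps the division defined when every response gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question, two of them correct
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct responses receive a positive advantage and incorrect ones a negative advantage. When all responses in a group earn the same reward, every advantage is zero and that group contributes no learning signal.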

Table: Accuracies (%) on MMAU Test-mini benchmark

Model	Method	Sound	Music	Speech	Average
\	Human*	86.31	78.22	82.17	82.23
Gemini Pro 2.0 Flash	Direct Inference*	56.46	58.68	51.65	55.60
Audio Flamingo 2	Direct Inference*	61.56	73.95	30.93	55.48
GPT4o + Strong Cap.	Direct Inference*	57.35	49.70	64.86	57.30
Llama-3-8B-Instruct + Strong Cap.	Direct Inference*	50.75	48.93	55.25	52.10
Gemini Pro v1.5	Direct Inference*	56.75	49.40	58.55	54.90
Qwen2-Audio-7B-Instruct	Direct Inference*	54.95	50.98	42.04	49.20
GPT4o + Weak Cap.	Direct Inference*	39.33	41.90	58.25	45.70
Llama-3-8B-Instruct + Weak Cap.	Direct Inference*	34.23	38.02	54.05	42.10
SALMONN	Direct Inference*	41.00	34.80	25.50	33.70
Qwen2-Audio-7B-Instruct	CoTA [1]	60.06	64.30	60.70	61.71
Qwen2-Audio-7B-Instruct	Zero-Shot-CoT [2]	61.86	56.29	55.26	57.80
Qwen2-Audio-7B-Instruct	GRPO (Ours)	69.37	66.77	57.36	64.50

Notes:

* The data are sourced from the MMAU official website: https://sakshi113.github.io/mmau_homepage/
[1] Xie, Zhifei, et al. "Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models." arXiv preprint arXiv:2503.02318 (2025).
[2] Ma, Ziyang, et al. "Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model." arXiv preprint arXiv:2501.07246 (2025).

Huggingface:
🤗 R1-AQA Models: mispeech/r1-aqa

arXiv:
📝 Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering

R1-AQA Team:
Gang Li* · Jizhong Liu* · Heinrich Dinkel · Yadong Niu · Junbo Zhang

* Equal contribution.

Updates

2025-03-17: We release the R1-AQA repo.

Training

Data Preparation

We use the AVQA training subset (train_qa.json) and convert the data to the R1-AQA format, in which each line of the text file is a JSON object with specific keys.

{
    # The data presented below originate from the original AVQA dataset.
    "id": 183,
    "video_name": "-HG3Omg_89c_000030",
    "video_id": 341,
    "question_text": "What happened in the video?",
    "multi_choice": [  
        "motorboat",  
        "Yacht consignment",  
        "Sailboat set sail",  
        "Consignment car"  
    ],
    "answer": 1,
    "question_relation": "View",
    "question_type": "Happening", 
    # We add the following data.
    "dataset_name": "AVQA",
    "audio_path": "Path to wav dir/-HG3Omg_89c_30.wav"
}
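
The conversion described above could be sketched as follows. This is not the script used by the authors. It assumes that train_qa.json holds a single JSON array of records, and that the wav file name is the clip name with the zero padding removed from its trailing number, which is inferred from the single example shown above. The function names `to_r1_aqa` and `convert_file` are hypothetical.

```python
import json
import os


def to_r1_aqa(record, wav_dir):
    """Return a copy of one AVQA record with the two extra R1-AQA keys."""
    # Assumption: "-HG3Omg_89c_000030" maps to "-HG3Omg_89c_30.wav",
    # i.e. the trailing number loses its zero padding.
    stem, _, start = record["video_name"].rpartition("_")
    wav_name = f"{stem}_{int(start)}.wav"
    return {
        **record,
        "dataset_name": "AVQA",
        "audio_path": os.path.join(wav_dir, wav_name),
    }


def convert_file(src_json, dst_path, wav_dir):
    """Write one JSON object per line to dst_path."""
    # Assumption: the source file is one JSON array of records.
    with open(src_json, encoding="utf-8") as f:
        records = json.load(f)
    with open(dst_path, "w", encoding="utf-8") as out:
        for record in records:
            out.write(json.dumps(to_r1_aqa(record, wav_dir)) + "\n")
```

If your extracted audio files follow a different naming scheme, only the two lines that build `wav_name` need to change.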

GRPO

sh run_grpo.sh

Wandb

NOTE:

Replace the DATA_FILE variable in the run_grpo.sh with your dataset path.
If you already have the Qwen2-Audio-7B-Instruct model, please modify the MODEL_NP variable in run_grpo.sh to your local model path.
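
Before launching training, it may help to sanity-check the file that DATA_FILE points to. The sketch below is an optional helper and is not part of this repository. The set of required keys is a guess based on the example record above, and it assumes that "answer" is a 0-based index into "multi_choice".

```python
import json
import os

# Assumption: these are the keys training relies on; adjust as needed.
REQUIRED_KEYS = {"question_text", "multi_choice", "answer", "audio_path"}


def find_problems(lines, check_audio=True):
    """Return (line_number, message) pairs for records that look wrong."""
    problems = []
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((n, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((n, "line is not a JSON object"))
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            problems.append((n, f"missing keys: {sorted(missing)}"))
            continue
        # Assumption: "answer" is a 0-based index into "multi_choice".
        if not 0 <= record["answer"] < len(record["multi_choice"]):
            problems.append((n, "answer index out of range"))
        if check_audio and not os.path.isfile(record["audio_path"]):
            problems.append((n, "audio file not found"))
    return problems
```

A typical use is to open the data file and pass the file object directly, for example `find_problems(open(path, encoding="utf-8"))`, then print whatever the function returns.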

Testing

MMAU-mini

To evaluate on the MMAU Test-mini dataset, follow these steps:

Download Data
- Testing on MMAU Test-mini requires the following files from the MMAU repository: mmau-test-mini.json, evaluation.py, and test-mini-audios.tar.gz. They can be obtained as follows:

mkdir -p data && cd data

git clone https://github.com/Sakshi113/MMAU.git

cd MMAU

# TODO: download test-mini-audios.tar.gz into this directory (data/MMAU)

# Uncompress wav files
tar -xzvf test-mini-audios.tar.gz

cd ../../

Format Data

# Prepare the data format file we need
python src/utils/prepare_mmau.py \
    --input_file data/MMAU/mmau-test-mini.json \
    --wav_dir data/MMAU/test-mini-audios \
    --out_file data/MMAU/mmau-mini.data

Evaluation

# Test MMAU-mini at training steps [100, 200, 300, 400, 500].
# You can modify the script to test other steps or change other parameters.
sh test_mmau.sh
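
The official evaluation.py from the MMAU repository is the authoritative scorer. The sketch below only illustrates how per-category accuracies such as Sound, Music, and Speech relate to an overall figure. The input shape of (task, is_correct) pairs is a hypothetical simplification, and computing the overall figure over all questions pooled together is an assumption rather than a description of the official script.

```python
from collections import defaultdict


def accuracy_by_task(results):
    """Map (task, is_correct) pairs to {task: accuracy in percent}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for task, is_correct in results:
        totals[task] += 1
        hits[task] += bool(is_correct)
    table = {task: 100.0 * hits[task] / totals[task] for task in totals}
    # Overall accuracy over all questions pooled together (assumption).
    table["average"] = 100.0 * sum(hits.values()) / sum(totals.values())
    return table


scores = accuracy_by_task([("sound", True), ("sound", False), ("music", True)])
```

When the categories contain nearly the same number of questions, the pooled figure and the plain mean of the per-category accuracies are close to each other.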

Acknowledgement

We have referred to the implementation of R1-V for the GRPO-based training.

We sincerely thank AVQA and MMAU for providing the datasets.

Citation

@misc{li2025reinforcementlearningoutperformssupervised,
      title={Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering}, 
      author={Gang Li and Jizhong Liu and Heinrich Dinkel and Yadong Niu and Junbo Zhang and Jian Luan},
      year={2025},
      eprint={2503.11197},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2503.11197; https://github.com/xiaomi-research/r1-aqa},
}
