VoiceAgentBench is a large-scale benchmark for evaluating end-to-end speech-based agents on realistic, tool-driven tasks.
It is introduced in: “VoiceAgentBench: Are Voice Assistants Ready for Agentic Tasks?” https://arxiv.org/abs/2510.07978
Unlike ASR or intent-only benchmarks, VoiceAgentBench evaluates whether a voice system can:
- Understand spoken requests
- Select the correct tool(s)
- Generate structured arguments
- Execute multi-step workflows (sequential + parallel)
- Handle multi-turn spoken dialogs
- Correctly refuse unsafe requests
This repository contains the evaluation framework. The benchmark data is hosted separately.
Download the dataset from Hugging Face: https://huggingface.co/datasets/krutrim-ai-labs/VoiceAgentBench
After downloading, you should have:
- JSON files (examples)
- `.wav` audio files
Each example contains a path field pointing to the corresponding audio file.
Paths can be:
- absolute, or
- relative to your current working directory
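The path rule above can be checked before running anything expensive. The following is a minimal sketch, assuming each JSON file holds a list of example objects (an assumption; inspect the downloaded files to confirm). `resolve_audio_path` and `missing_audio` are hypothetical helper names, not part of the repository.

```python
import json
from pathlib import Path


def resolve_audio_path(path_field: str) -> Path:
    """Return an absolute path; relative paths resolve against the CWD."""
    p = Path(path_field)
    return p if p.is_absolute() else Path.cwd() / p


def missing_audio(json_file: str) -> list[str]:
    """List the `path` values in a dataset file whose audio file is not found."""
    # Assumption: the file contains a JSON list of example objects.
    with open(json_file, encoding="utf-8") as f:
        examples = json.load(f)
    return [
        ex["path"]
        for ex in examples
        if "path" in ex and not resolve_audio_path(ex["path"]).is_file()
    ]
```

An empty list from `missing_audio(...)` suggests the working directory is set correctly for the relative paths in that file.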
| Subset | Description |
|---|---|
| `single_tool` | Single tool call with argument filling |
| `single_tool_retrieval` | Select correct tool from a list + fill arguments |
| `parallel_tool` | Multiple independent tool calls |
| `seqdep_tool` | Sequential dependent tool calls |
| `multi_turn` | Multi-turn dialogs before tool call |
| `safety` | Unsafe/adversarial queries requiring refusal |
Each example may contain:
- `id`: example ID
- `query` / `user_request`: text form of the spoken request
- `functions`: tool schemas (or a tool-name list for safety)
- `expected_tool_call`: reference tool invocation(s)
- `path`: relative path to the `.wav` file
- `duration`: audio length in seconds
- `instruction`: system prompt template
- `chat_history` (multi_turn only): list of dialog turns
  - user turns include `path` and `duration`
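Because multi-turn examples carry audio inside `chat_history` as well, it can help to gather every audio reference for one example in a single place. This is a sketch only: `audio_paths` is a hypothetical helper, and it assumes each dialog turn is a JSON object, which the field list does not state explicitly.

```python
def audio_paths(example: dict) -> list[str]:
    """Collect every audio path referenced by one example."""
    paths = []
    if "path" in example:
        paths.append(example["path"])
    # Assumption: each chat_history turn is a dict; only user turns carry `path`.
    for turn in example.get("chat_history", []):
        if isinstance(turn, dict) and "path" in turn:
            paths.append(turn["path"])
    return paths
```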
Evaluation is performed using an LLM-as-a-judge to assess:
- Parameter filling correctness (whether tool arguments match the reference)
- Refusal behavior for unsafe or adversarial spoken inputs
The judge compares the model’s predicted tool call(s) with the reference expected_tool_call, allowing for minor formatting differences while checking semantic correctness.
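To illustrate what "minor formatting differences" can look like, here is a small deterministic comparison. It is not the repository's LLM judge and cannot assess semantic equivalence; it is only a sketch of an exact-match baseline. It assumes a tool call is a dict with hypothetical `name` and `arguments` keys; the real format of `expected_tool_call` may differ.

```python
def normalize(value):
    """Canonicalize a value so superficial formatting differences vanish."""
    if isinstance(value, dict):
        return {str(k).strip(): normalize(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [normalize(v) for v in value]
    if isinstance(value, str):
        # Collapse runs of whitespace and ignore case.
        return " ".join(value.split()).lower()
    return value


def same_tool_call(predicted: dict, reference: dict) -> bool:
    """True when the two calls are equal after normalization."""
    # Dict equality already ignores key order.
    return normalize(predicted) == normalize(reference)
```

A baseline like this misses paraphrased but equivalent arguments, which is the kind of case an LLM judge is meant to handle.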
To enable evaluation, set your OpenAI API key:

    export OPENAI_API_KEY=your_key_here

Repository layout:

    voice_agent_bench/
    ├── inference.py   # run models on dataset
    └── evaluate.py    # score model outputs

This repo does not contain the dataset itself.

    python voice_agent_bench/inference.py \
        --data /path/to/single_tool_english.json \
        --evaluator single \
        --model qwen_omni \
        --device cuda

This produces:

    single_tool_english_qwen_omni_responses.json

Note: run commands from the repo root so voice_agent_bench/ is on the Python path.

    python voice_agent_bench/evaluate.py \
        --src_file /path/to/single_tool_english_qwen_omni_responses.json \
        --evaluator single

This writes:

    *_results.json

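When running several subsets or models, the two commands can be assembled programmatically. The sketch below only builds argument lists from the flags shown in the commands above; the function names are hypothetical. The responses file name follows the pattern of the single example given here, and the directory it is written to is not stated, so only the name is derived.

```python
from pathlib import Path


def inference_cmd(data: str, evaluator: str, model: str, device: str = "cuda") -> list[str]:
    """Argument list for the inference step."""
    return [
        "python", "voice_agent_bench/inference.py",
        "--data", data,
        "--evaluator", evaluator,
        "--model", model,
        "--device", device,
    ]


def responses_name(data: str, model: str) -> str:
    """Responses file name, assuming the <data stem>_<model>_responses.json pattern."""
    return f"{Path(data).stem}_{model}_responses.json"


def evaluate_cmd(src_file: str, evaluator: str) -> list[str]:
    """Argument list for the scoring step."""
    return [
        "python", "voice_agent_bench/evaluate.py",
        "--src_file", src_file,
        "--evaluator", evaluator,
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)` from the repo root.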
Use with `voice_agent_bench/evaluate.py --evaluator`:
- `single` → single tool call (also for `single_tool_retrieval`)
- `multiple` → multiple independent calls (`parallel_tool`)
- `dependent` → chained tool calls (`seqdep_tool`)
- `multiturn` → multi-turn dialog
- `safety` → refusal & safety behavior
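The subset-to-evaluator correspondence can be kept as a lookup table so the right `--evaluator` value is chosen automatically. This is a sketch: `evaluator_for` is a hypothetical helper, and it assumes data files are named `<subset>_<language>.json`, a pattern inferred from the single file name `single_tool_english.json`.

```python
from pathlib import Path

SUBSET_TO_EVALUATOR = {
    "single_tool": "single",
    "single_tool_retrieval": "single",
    "parallel_tool": "multiple",
    "seqdep_tool": "dependent",
    "multi_turn": "multiturn",
    "safety": "safety",
}


def evaluator_for(filename: str) -> str:
    """Pick the evaluator from a data file name."""
    name = Path(filename).name
    # Try longer subset names first so `single_tool_retrieval` is not
    # mistaken for `single_tool`.
    for subset in sorted(SUBSET_TO_EVALUATOR, key=len, reverse=True):
        if name.startswith(subset + "_") or name.startswith(subset + "."):
            return SUBSET_TO_EVALUATOR[subset]
    raise ValueError(f"cannot infer subset from {filename!r}")
```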
Use with `voice_agent_bench/inference.py --model`:
- `qwen_omni`
- `kimi_audio`
- `whisper_gemma`
- `whisper_qwen`
- `whisper_llama`
To integrate a new model into the evaluation framework:
- Create a new class in `voice_agent_bench/models/` that subclasses `VoiceAssistant`.
- Implement:
  - `process_input(self, input)` to build the model-ready inputs from one dataset item (e.g., format the instruction prompt, attach tool schemas, load/transcribe audio).
  - `generate_response(self, inputs, evaluator)` to run the model and return `(generated_text, parsed_response)`.
- Use the shared parser so output matches the expected tool-call format:
  - Call `self.parse_response(generated_text, evaluator)` to get `parsed_response`.
- Register the model in `voice_agent_bench/models/__init__.py` under `model_mapping`.
Minimal skeleton:

    from .base import VoiceAssistant


    class MyModel(VoiceAssistant):
        def process_input(self, input):
            # build model inputs from one dataset item
            # (placeholder: pass the item through unchanged)
            inputs = input
            return inputs

        def generate_response(self, inputs, evaluator):
            # run the model and decode to plain text
            generated_text = "..."
            parsed_response = self.parse_response(generated_text, evaluator)
            return generated_text, parsed_response

Note: `self.parse_response(...)` routes to `voice_agent_bench/utils/parser.py`, which is shared across all models for consistent parsing.
Checklist:
- file in `voice_agent_bench/models/`
- class name exported in `models/__init__.py`
- `--model` name matches `model_mapping` key
- `generate_response()` returns `(generated_text, parsed_response)`
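The contract in the checklist can be exercised without any real model. The sketch below is self-contained and uses a stand-in `VoiceAssistant` class, because the repository's base class is not shown here. `EchoModel`, `run_one`, and the placeholder `parse_response` body are hypothetical, and treating `model_mapping` as a dict from name to class is an assumption.

```python
class VoiceAssistant:
    """Stand-in for the repository's base class (assumed interface only)."""

    def parse_response(self, generated_text, evaluator):
        # Placeholder: the real method routes to the shared parser.
        return {"raw": generated_text, "evaluator": evaluator}


class EchoModel(VoiceAssistant):
    """Toy model that returns the text query unchanged."""

    def process_input(self, input):
        return {"prompt": input.get("query", "")}

    def generate_response(self, inputs, evaluator):
        generated_text = inputs["prompt"]
        parsed_response = self.parse_response(generated_text, evaluator)
        return generated_text, parsed_response


# Mirrors the registration step: the key is the value passed to --model.
model_mapping = {"echo": EchoModel}


def run_one(model_name, item, evaluator):
    """Look up a model by name and run it on one dataset item."""
    model = model_mapping[model_name]()
    return model.generate_response(model.process_input(item), evaluator)
```

Calling `run_one("echo", {"query": "hi"}, "single")` returns a `(generated_text, parsed_response)` pair, the shape the checklist asks for.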
This code repository and the model weights are licensed under the Krutrim Community License Agreement Version 1.0.
If you use VoiceAgentBench, please cite:

    @article{jain2025voiceagentbench,
      title={VoiceAgentBench: Are Voice Assistants ready for agentic tasks?},
      author={Dhruv Jain and Harshit Shukla and Gautam Rajeev and Ashish Kulkarni and Chandra Khatri and Shubham Agarwal},
      year={2025},
      eprint={2510.07978},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2510.07978},
    }