VoiceAgentBench

VoiceAgentBench is a large-scale benchmark for evaluating end-to-end speech-based agents on realistic, tool-driven tasks.

It is introduced in the paper “VoiceAgentBench: Are Voice Assistants Ready for Agentic Tasks?” (https://arxiv.org/abs/2510.07978).

Unlike ASR or intent-only benchmarks, VoiceAgentBench evaluates whether a voice system can:

Understand spoken requests
Select the correct tool(s)
Generate structured arguments
Execute multi-step workflows (sequential + parallel)
Handle multi-turn spoken dialogs
Correctly refuse unsafe requests

This repository contains the evaluation framework. The benchmark data is hosted separately.

Benchmark Data

Download the dataset from Hugging Face:

https://huggingface.co/datasets/krutrim-ai-labs/VoiceAgentBench

After downloading, you should have:

JSON files (examples)
.wav audio files

Each example contains a path field pointing to the corresponding audio file. Paths can be:

absolute, or
relative to your current working directory
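The path rule above can be sketched as a small loader. This is a minimal sketch, not code from the repository: `load_examples` and `resolve_audio_path` are hypothetical helper names, and the sketch assumes each subset file holds a JSON list of example dicts.

```python
import json
from pathlib import Path


def load_examples(json_path):
    """Load one subset file (assumed here to be a JSON list of example dicts)."""
    with open(json_path, "r", encoding="utf-8") as f:
        return json.load(f)


def resolve_audio_path(path_field, base_dir=None):
    """Return an absolute Path for an example's `path` field.

    Absolute paths are returned unchanged. Relative paths are resolved
    against `base_dir`, which defaults to the current working directory.
    """
    p = Path(path_field)
    if p.is_absolute():
        return p
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    return (base / p).resolve()
```

Passing `base_dir` explicitly is only a convenience of this sketch; leaving it unset reproduces the working-directory behaviour described above.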

Benchmark Subsets

| Subset | Description |
|---|---|
| `single_tool` | Single tool call with argument filling |
| `single_tool_retrieval` | Select correct tool from a list + fill arguments |
| `parallel_tool` | Multiple independent tool calls |
| `seqdep_tool` | Sequential dependent tool calls |
| `multi_turn` | Multi-turn dialogs before tool call |
| `safety` | Unsafe/adversarial queries requiring refusal |

Data Format

Each example may contain:

id: example ID
query / user_request: text form of spoken request
functions: tool schemas (or tool-name list for safety)
expected_tool_call: reference tool invocation(s)
path: path to the .wav file (absolute, or relative to your current working directory)
duration: audio length in seconds
instruction: system prompt template
chat_history (multi_turn only): list of dialog turns
- user turns include path and duration
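As an illustration of how these fields fit together, the sketch below gathers every audio file an example refers to, including the per-turn audio of multi_turn examples. The helper names are hypothetical, and the sketch relies only on the field names listed above. It assumes `chat_history` turns are dicts and makes no assumption about their other keys.

```python
def collect_audio_paths(example):
    """Return every audio path referenced by one example."""
    paths = []
    # The top-level path points at the example's own spoken request.
    if "path" in example:
        paths.append(example["path"])
    # multi_turn only: user turns inside chat_history carry their own path.
    for turn in example.get("chat_history", []):
        if isinstance(turn, dict) and "path" in turn:
            paths.append(turn["path"])
    return paths


def total_duration(example):
    """Sum the duration (in seconds) of all audio attached to one example."""
    seconds = float(example.get("duration", 0.0))
    for turn in example.get("chat_history", []):
        if isinstance(turn, dict):
            seconds += float(turn.get("duration", 0.0))
    return seconds
```

Because every field is optional ("may contain"), both helpers use `.get` and key checks rather than direct indexing.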

Scoring

Evaluation is performed using an LLM-as-a-judge to assess:

Parameter filling correctness (whether tool arguments match the reference)
Refusal behavior for unsafe or adversarial spoken inputs

The judge compares the model’s predicted tool call(s) with the reference expected_tool_call, allowing for minor formatting differences while checking semantic correctness.
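To make "minor formatting differences" concrete, here is a deterministic sketch that treats two tool calls as equal when they differ only in whitespace, letter case, key order, or numeric formatting. It is not the repository's judge, which is an LLM and also checks semantic correctness. The `name`/`arguments` layout of a tool call and both function names are assumptions made for this illustration.

```python
import json
import re

_NUMBER = re.compile(r"-?\d+(\.\d+)?")


def normalize_value(value):
    """Canonicalize a value so that formatting-only differences vanish."""
    if isinstance(value, str):
        text = " ".join(value.split()).lower()
        # Treat numeric strings such as "3" or "3.0" like the number 3.
        if _NUMBER.fullmatch(text):
            return normalize_value(float(text))
        return text
    if isinstance(value, bool):
        # Checked before int/float because bool is a subclass of int.
        return value
    if isinstance(value, (int, float)):
        number = float(value)
        return int(number) if number.is_integer() else number
    if isinstance(value, dict):
        return {str(k): normalize_value(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [normalize_value(v) for v in value]
    return value


def calls_match(predicted, reference):
    """Compare two lists of tool calls in order, after normalization."""
    def key(call):
        return json.dumps(normalize_value(call), sort_keys=True)

    return [key(c) for c in predicted] == [key(c) for c in reference]
```

A check like this can serve as a cheap sanity pass before judging, but it cannot recognize paraphrased argument values, which is the case the LLM judge is there to handle.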

To enable evaluation, set your OpenAI API key:

export OPENAI_API_KEY=your_key_here
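If you script the evaluation step, a small pre-flight check can fail early when the key is missing. This is a hypothetical helper, not part of the repository; only the variable name `OPENAI_API_KEY` comes from the instructions above.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise if it is unset or blank."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running evaluation.")
    return key
```

The optional `env` argument exists only so the check can be exercised with a plain dict instead of the real process environment.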

Repository Structure

voice_agent_bench/
  ├── inference.py   # run models on dataset
  └── evaluate.py    # score model outputs

This repo does not contain the dataset itself.

🚀 Quickstart

1. Run inference

python voice_agent_bench/inference.py \
  --data /path/to/single_tool_english.json \
  --evaluator single \
  --model qwen_omni \
  --device cuda

This produces:

single_tool_english_qwen_omni_responses.json

Note: run commands from the repo root so voice_agent_bench/ is on the Python path.
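The output name shown above appears to follow the pattern `<data file stem>_<model>_responses.json`. The sketch below reproduces that pattern so a script can locate the file for the evaluation step. The pattern is inferred from the single example above rather than confirmed from the code, and `responses_filename` is a hypothetical helper. It returns only the file name, since the output directory is not stated here.

```python
from pathlib import Path


def responses_filename(data_path, model):
    """Build the responses file name inferred from the example above."""
    return f"{Path(data_path).stem}_{model}_responses.json"


print(responses_filename("/path/to/single_tool_english.json", "qwen_omni"))
# prints: single_tool_english_qwen_omni_responses.json
```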

2. Run evaluation

python voice_agent_bench/evaluate.py \
  --src_file /path/to/single_tool_english_qwen_omni_responses.json \
  --evaluator single

This writes:

*_results.json

Supported Evaluators

Use with voice_agent_bench/evaluate.py --evaluator:

single → single tool call (also for single_tool_retrieval)
multiple → multiple independent calls (parallel_tool)
dependent → chained tool calls (seqdep_tool)
multiturn → multi-turn dialog
safety → refusal & safety behavior
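The subset-to-evaluator pairing above can be captured in a lookup table that builds both commands from the Quickstart. This is a minimal sketch with hypothetical helper names. It assumes `inference.py` accepts the same `--evaluator` values as `evaluate.py`, which the Quickstart suggests for `single` but which is not verified here for the others.

```python
import shlex

# Subset name -> evaluator name, as listed above.
SUBSET_TO_EVALUATOR = {
    "single_tool": "single",
    "single_tool_retrieval": "single",
    "parallel_tool": "multiple",
    "seqdep_tool": "dependent",
    "multi_turn": "multiturn",
    "safety": "safety",
}


def inference_command(data_path, subset, model, device="cuda"):
    """Argument list for the inference step, using the flags from the Quickstart."""
    return [
        "python", "voice_agent_bench/inference.py",
        "--data", str(data_path),
        "--evaluator", SUBSET_TO_EVALUATOR[subset],
        "--model", model,
        "--device", device,
    ]


def evaluation_command(src_file, subset):
    """Argument list for the evaluation step, using the flags from the Quickstart."""
    return [
        "python", "voice_agent_bench/evaluate.py",
        "--src_file", str(src_file),
        "--evaluator", SUBSET_TO_EVALUATOR[subset],
    ]


if __name__ == "__main__":
    cmd = inference_command("/path/to/parallel_tool_english.json", "parallel_tool", "qwen_omni")
    print(shlex.join(cmd))
```

The lists can be passed directly to `subprocess.run(cmd, check=True)` from the repo root. The data file name in the usage lines is illustrative only.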

Supported Models

Use with voice_agent_bench/inference.py --model:

qwen_omni
kimi_audio
whisper_gemma
whisper_qwen
whisper_llama

Add a New Model

To integrate a new model into the evaluation framework:

1. Create a new class in voice_agent_bench/models/ that subclasses VoiceAssistant.
2. Implement:
   - process_input(self, input) to build the model-ready inputs from one dataset item (e.g., format the instruction prompt, attach tool schemas, load/transcribe audio).
   - generate_response(self, inputs, evaluator) to run the model and return (generated_text, parsed_response).
3. Use the shared parser so output matches the expected tool-call format:
   - Call self.parse_response(generated_text, evaluator) to get parsed_response.
4. Register the model in voice_agent_bench/models/__init__.py under model_mapping.

Minimal skeleton:

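from .base import VoiceAssistant


class MyModel(VoiceAssistant):
    def process_input(self, input):
        # Build model-ready inputs from one dataset item (prompt, tool schemas, audio).
        inputs = input  # placeholder: replace with real preprocessing
        return inputs

    def generate_response(self, inputs, evaluator):
        # Run the model and decode its output to plain text.
        generated_text = "..."  # placeholder for the decoded model output
        # Parse with the shared parser so the output matches the expected tool-call format.
        parsed_response = self.parse_response(generated_text, evaluator)
        return generated_text, parsed_response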

Note: self.parse_response(...) routes to voice_agent_bench/utils/parser.py, which is shared across all models for consistent parsing.

Checklist:

file in voice_agent_bench/models/
class name exported in models/__init__.py
--model name matches model_mapping key
generate_response() returns (generated_text, parsed_response)
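The checklist can be exercised end to end with a self-contained sketch. Everything here is a stand-in so the example runs on its own: the `VoiceAssistant` stub replaces the repository's base class, `EchoModel` and `run_one` are hypothetical, and `model_mapping` is assumed to map a `--model` name to a class (the real mapping may be shaped differently).

```python
class VoiceAssistant:
    """Stand-in for the repository's base class, only so this sketch runs alone."""

    def parse_response(self, generated_text, evaluator):
        # Placeholder: the real method delegates to the shared parser.
        return {"evaluator": evaluator, "raw": generated_text}


class EchoModel(VoiceAssistant):
    """Hypothetical model that returns a fixed string instead of running inference."""

    def process_input(self, input):
        # Keep only the fields this toy model needs.
        return {"instruction": input.get("instruction", ""), "audio_path": input.get("path")}

    def generate_response(self, inputs, evaluator):
        generated_text = "[]"  # a real model would decode its output here
        parsed_response = self.parse_response(generated_text, evaluator)
        return generated_text, parsed_response


# Assumed shape of the registry: --model name -> model class.
model_mapping = {"echo_model": EchoModel}


def run_one(model_name, item, evaluator):
    """Look up a model by its --model name and run it on one dataset item."""
    model = model_mapping[model_name]()
    inputs = model.process_input(item)
    return model.generate_response(inputs, evaluator)
```

Calling `run_one("echo_model", {"path": "a.wav"}, "single")` returns a `(generated_text, parsed_response)` pair, which is the contract the last checklist item asks for.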

License

This code repository and the model weights are licensed under the Krutrim Community License Agreement Version 1.0.

Citation

If you use VoiceAgentBench, please cite:

@article{jain2025voiceagentbench,
  title={VoiceAgentBench: Are Voice Assistants ready for agentic tasks?},
  author={Dhruv Jain and Harshit Shukla and Gautam Rajeev and Ashish Kulkarni and Chandra Khatri and Shubham Agarwal},
  year={2025},
  eprint={2510.07978},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2510.07978},
}
