maximus-21/VoiceAgentBench

 
 


VoiceAgentBench

VoiceAgentBench is a large-scale benchmark for evaluating end-to-end speech-based agents on realistic, tool-driven tasks.

It is introduced in: “VoiceAgentBench: Are Voice Assistants Ready for Agentic Tasks?” https://arxiv.org/abs/2510.07978

Unlike ASR or intent-only benchmarks, VoiceAgentBench evaluates whether a voice system can:

  • Understand spoken requests
  • Select the correct tool(s)
  • Generate structured arguments
  • Execute multi-step workflows (sequential + parallel)
  • Handle multi-turn spoken dialogs
  • Correctly refuse unsafe requests

This repository contains the evaluation framework. The benchmark data is hosted separately.


Benchmark Data

Download the dataset from Hugging Face:

Hugging Face: https://huggingface.co/datasets/krutrim-ai-labs/VoiceAgentBench

After downloading, you should have:

  • JSON files (examples)
  • .wav audio files

Each example contains a path field pointing to the corresponding audio file. Paths can be:

  • absolute, or
  • relative to your current working directory
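The two path forms above can be handled with one small helper. This is a minimal sketch, not code from the repository: the function name `resolve_audio_path` and its optional `base_dir` parameter are hypothetical, added so a relative path can be anchored somewhere other than the current working directory.

```python
from pathlib import Path
from typing import Optional


def resolve_audio_path(path_field: str, base_dir: Optional[str] = None) -> Path:
    """Turn an example's `path` field into an absolute filesystem path.

    `base_dir` is a hypothetical convenience; when omitted, relative paths
    are resolved against the current working directory, as the README states.
    """
    p = Path(path_field)
    if p.is_absolute():
        # Absolute paths are used as-is.
        return p
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    return (base / p).resolve()
```

Checking `resolve_audio_path(item["path"]).exists()` for every example before a long inference run is a cheap way to catch a wrong working directory early.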

Benchmark Subsets

Subset                  Description
single_tool             Single tool call with argument filling
single_tool_retrieval   Select correct tool from a list + fill arguments
parallel_tool           Multiple independent tool calls
seqdep_tool             Sequential dependent tool calls
multi_turn              Multi-turn dialogs before tool call
safety                  Unsafe/adversarial queries requiring refusal

Data Format

Each example may contain:

  • id: example ID

  • query / user_request: text form of spoken request

  • functions: tool schemas (or tool-name list for safety)

  • expected_tool_call: reference tool invocation(s)

  • path: path to the .wav file (absolute, or relative to your current working directory)

  • duration: audio length in seconds

  • instruction: system prompt template

  • chat_history (multi_turn only): list of dialog turns

    • user turns include path and duration
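To make the field list concrete, here is an illustrative record and a helper that gathers every audio path an example refers to. The field names come from the list above, but all values are made-up placeholders, and the shape of a `chat_history` turn (a dict carrying `path` and `duration`) is an assumption rather than something confirmed by the dataset.

```python
# Illustrative record: keys follow the README, values are invented placeholders.
example = {
    "id": "example-0001",
    "query": "What is the weather in Pune tomorrow?",
    "functions": [{"name": "get_weather"}],
    "expected_tool_call": [{"name": "get_weather", "arguments": {"city": "Pune"}}],
    "path": "audio/example-0001.wav",
    "duration": 3.2,
    "instruction": "You are a voice assistant with access to tools.",
}


def audio_paths(item):
    """Collect the top-level `path` plus any `path` found on chat_history turns."""
    paths = []
    if "path" in item:
        paths.append(item["path"])
    # Assumption: each user turn is a dict that may carry its own `path`.
    for turn in item.get("chat_history", []):
        if isinstance(turn, dict) and "path" in turn:
            paths.append(turn["path"])
    return paths
```

Because the README says each example "may contain" these fields, the helper reads them defensively instead of assuming they are always present.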

Scoring

Evaluation is performed using an LLM-as-a-judge to assess:

  • Parameter filling correctness (whether tool arguments match the reference)
  • Refusal behavior for unsafe or adversarial spoken inputs

The judge compares the model’s predicted tool call(s) with the reference expected_tool_call, allowing for minor formatting differences while checking semantic correctness.
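As a rough illustration of this comparison, the sketch below assembles a grading prompt from a predicted and a reference tool call. The function `build_judge_prompt` and the prompt wording are hypothetical; the actual prompt used by evaluate.py is not shown in this README and may differ.

```python
import json


def build_judge_prompt(predicted, reference):
    """Assemble a grading prompt; the wording is illustrative, not the repo's own."""
    return "\n".join([
        "You are grading a tool call produced by a voice assistant.",
        "Ignore minor formatting differences (key order, whitespace) and",
        "judge whether the prediction is semantically equivalent to the reference.",
        "",
        "Reference tool call(s):",
        # sort_keys makes key order irrelevant in the rendered text
        json.dumps(reference, indent=2, sort_keys=True),
        "",
        "Predicted tool call(s):",
        json.dumps(predicted, indent=2, sort_keys=True),
        "",
        'Answer with exactly one word: "correct" or "incorrect".',
    ])
```

Serializing both sides with sorted keys removes one class of purely cosmetic mismatch before the judge ever sees the text.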

To enable evaluation, set your OpenAI API key:

export OPENAI_API_KEY=your_key_here
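A script that wraps the evaluation step can fail fast when the key is missing. This is a small sketch; the helper name `require_openai_key` is hypothetical and is not part of the repository.

```python
import os


def require_openai_key(env=None):
    """Return the OpenAI API key, or raise a clear error if it is not set."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running evaluate.py"
        )
    return key
```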

Repository Structure

voice_agent_bench/
  ├── inference.py   # run models on dataset
  └── evaluate.py    # score model outputs

This repo does not contain the dataset itself.


🚀 Quickstart

1. Run inference

python voice_agent_bench/inference.py \
  --data /path/to/single_tool_english.json \
  --evaluator single \
  --model qwen_omni \
  --device cuda

This produces:

single_tool_english_qwen_omni_responses.json

Note: run commands from the repo root so voice_agent_bench/ is on the Python path.
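The output filename in the example above appears to combine the data file's stem with the model name. The sketch below encodes that pattern; it is inferred from this single example and is not a documented guarantee, and the function name `expected_responses_name` is hypothetical. The directory the file is written to is not stated here, so only the filename is derived.

```python
from pathlib import Path


def expected_responses_name(data_path, model):
    """Guess the responses filename as <data stem>_<model>_responses.json."""
    return f"{Path(data_path).stem}_{model}_responses.json"
```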


2. Run evaluation

python voice_agent_bench/evaluate.py \
  --src_file /path/to/single_tool_english_qwen_omni_responses.json \
  --evaluator single

This writes:

*_results.json
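For batch runs, the two commands can be assembled programmatically. This is a sketch under the assumption that both scripts accept exactly the flags shown in the Quickstart; the helper names `inference_cmd` and `evaluate_cmd` are hypothetical.

```python
import sys


def inference_cmd(data, evaluator, model, device="cuda"):
    """Build the argv list for the inference step, using the Quickstart flags."""
    return [
        sys.executable, "voice_agent_bench/inference.py",
        "--data", data,
        "--evaluator", evaluator,
        "--model", model,
        "--device", device,
    ]


def evaluate_cmd(src_file, evaluator):
    """Build the argv list for the evaluation step, using the Quickstart flags."""
    return [
        sys.executable, "voice_agent_bench/evaluate.py",
        "--src_file", src_file,
        "--evaluator", evaluator,
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True, cwd=repo_root)`, where `repo_root` is the checkout directory, so the scripts run from the repo root as the note above requires.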

Supported Evaluators

Use with voice_agent_bench/evaluate.py --evaluator:

  • single → single tool call (also for single_tool_retrieval)
  • multiple → multiple independent calls (parallel_tool)
  • dependent → chained tool calls (seqdep_tool)
  • multiturn → multi-turn dialog
  • safety → refusal & safety behavior
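The subset-to-evaluator pairing above can be written as a lookup table. The mapping itself follows the list; the helper `evaluator_for_file`, which guesses the subset from a filename prefix, is hypothetical and assumes data files are named like `single_tool_english.json`.

```python
from pathlib import Path

# Subset name -> value for --evaluator, as listed in this README.
SUBSET_TO_EVALUATOR = {
    "single_tool": "single",
    "single_tool_retrieval": "single",
    "parallel_tool": "multiple",
    "seqdep_tool": "dependent",
    "multi_turn": "multiturn",
    "safety": "safety",
}


def evaluator_for_file(filename):
    """Pick an evaluator from a data filename that starts with a subset name."""
    stem = Path(filename).stem
    # Try longer subset names first so single_tool_retrieval is not
    # mistaken for single_tool.
    for subset in sorted(SUBSET_TO_EVALUATOR, key=len, reverse=True):
        if stem.startswith(subset):
            return SUBSET_TO_EVALUATOR[subset]
    raise ValueError(f"cannot infer subset from filename: {filename}")
```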

Supported Models

Use with voice_agent_bench/inference.py --model:

  • qwen_omni
  • kimi_audio
  • whisper_gemma
  • whisper_qwen
  • whisper_llama

Add a New Model

To integrate a new model into the evaluation framework:

  1. Create a new class in voice_agent_bench/models/ that subclasses VoiceAssistant.
  2. Implement:
    • process_input(self, input) to build the model-ready inputs from one dataset item (e.g., format the instruction prompt, attach tool schemas, load/transcribe audio).
    • generate_response(self, inputs, evaluator) to run the model and return (generated_text, parsed_response).
  3. Use the shared parser so output matches the expected tool-call format:
    • Call self.parse_response(generated_text, evaluator) to get parsed_response.
  4. Register the model in voice_agent_bench/models/__init__.py under model_mapping.

Minimal skeleton:

from .base import VoiceAssistant

class MyModel(VoiceAssistant):
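    def process_input(self, input):
        # build model inputs from one dataset item (prompt, tool schemas, audio)
        inputs = input  # placeholder: replace with model-specific preprocessing
        return inputs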

    def generate_response(self, inputs, evaluator):
        # run the model and decode to plain text
        generated_text = "..."
        parsed_response = self.parse_response(generated_text, evaluator)
        return generated_text, parsed_response

Note: self.parse_response(...) routes to voice_agent_bench/utils/parser.py, which is shared across all models for consistent parsing.

Checklist:

  • file in voice_agent_bench/models/
  • class name exported in models/__init__.py
  • --model name matches model_mapping key
  • generate_response() returns (generated_text, parsed_response)
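The registration step and the checklist can be sketched end to end. This example is self-contained, so it uses a stand-in `VoiceAssistant` with a placeholder `parse_response`; in the repository the base class comes from `voice_agent_bench/models/base.py` and parsing goes through the shared parser. It also assumes `model_mapping` is a dict from the `--model` name to the model class, which this README implies but does not show. The key `my_model` and the helper `load_model` are hypothetical.

```python
class VoiceAssistant:
    """Stand-in for the repository's base class, for illustration only."""

    def parse_response(self, generated_text, evaluator):
        # Placeholder: the real method routes to the shared parser.
        return {"raw": generated_text, "evaluator": evaluator}


class MyModel(VoiceAssistant):
    def process_input(self, input):
        # Build model-ready inputs from one dataset item.
        return {"prompt": input.get("instruction", ""), "audio": input.get("path")}

    def generate_response(self, inputs, evaluator):
        generated_text = "..."  # replace with real model decoding
        parsed_response = self.parse_response(generated_text, evaluator)
        return generated_text, parsed_response


# Assumed shape: --model name -> class.
model_mapping = {"my_model": MyModel}


def load_model(name):
    """Instantiate a registered model, with a readable error for unknown names."""
    if name not in model_mapping:
        raise ValueError(f"unknown model {name!r}; known: {sorted(model_mapping)}")
    return model_mapping[name]()
```

A quick smoke test is to call `load_model("my_model").generate_response({}, "single")` and confirm it returns a two-element tuple, which is the last item on the checklist.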

License

This code repository and the model weights are licensed under the Krutrim Community License Agreement Version 1.0.


Citation

If you use VoiceAgentBench, please cite:

@article{jain2025voiceagentbench,
        title={VoiceAgentBench: Are Voice Assistants ready for agentic tasks?}, 
        author={Dhruv Jain and Harshit Shukla and Gautam Rajeev and Ashish Kulkarni and Chandra Khatri and Shubham Agarwal},
        year={2025},
        eprint={2510.07978},
        archivePrefix={arXiv},
        primaryClass={cs.AI},
        url={https://arxiv.org/abs/2510.07978}, 
}
