CostBench

This is the official repository for the paper "CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents".

🎯 Project Overview

CostBench is a comprehensive benchmark for evaluating multi-turn cost-optimal planning and adaptation capabilities of large language models (LLMs) in tool-using scenarios.

The benchmark systematically assesses how LLM agents navigate complex tool-calling environments by testing their ability to:

📋 Cost-Optimal Planning: Plan cost-optimal multi-step tool invocation sequences in static environments
🔄 Dynamic Adaptation: Adapt their strategies in dynamic environments when tool costs, availability, or preferences change during execution
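
To make the planning objective concrete, here is a minimal sketch (not CostBench's own implementation). It assumes each tool maps one input type to one output type at a fixed cost, and it finds the cheapest invocation sequence with Dijkstra's algorithm. The tool names, types, and costs are hypothetical.

```python
import heapq

# Hypothetical tools: (name, input_type, output_type, cost).
TOOLS = [
    ("find_city", "query", "city", 4.0),
    ("find_hotel", "city", "hotel", 6.0),
    ("find_city_and_hotel", "query", "hotel", 11.0),  # composite shortcut
    ("book_hotel", "hotel", "booking", 3.0),
]


def cheapest_plan(tools, start, goal):
    """Return (total_cost, [tool names]) for the cheapest path, or None."""
    frontier = [(0.0, start, [])]  # priority queue ordered by accumulated cost
    best = {start: 0.0}            # cheapest known cost to reach each type
    while frontier:
        cost, state, plan = heapq.heappop(frontier)
        if state == goal:
            return cost, plan
        if cost > best.get(state, float("inf")):
            continue  # stale queue entry: a cheaper route was already found
        for name, src, dst, tool_cost in tools:
            if src != state:
                continue
            new_cost = cost + tool_cost
            if new_cost < best.get(dst, float("inf")):
                best[dst] = new_cost
                heapq.heappush(frontier, (new_cost, dst, plan + [name]))
    return None


print(cheapest_plan(TOOLS, "query", "booking"))
```

In a dynamic setting, the same function could be called again from the agent's current state with the updated tool list whenever a cost changes or a tool is disabled.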

✨ Core Features

Hierarchical Tool System: Supports atomic and composite tools, each with clear input/output types and costs
Flexible Cost Assignment: Supports configurable cost ranges for atomic tools; composite tool costs are computed from their components plus Gaussian noise, so cost distributions can be customized for each evaluation scenario
Dynamic Blocking: Supports multiple blocking modes (cost changes, preference changes, tool disabling, etc.) to test model adaptation capabilities
Adjustable Difficulty: Supports different task-sequence levels (set through --refinement_level) to control task complexity
Reproducible Random System: Features a seed-controlled pseudo-random system that ensures reproducibility across runs while preventing data leakage through deterministic randomization
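
The cost assignment and seeding ideas above can be sketched as follows. This is an illustration under assumptions, not the repository's code: it assumes atomic costs are drawn uniformly from the configured range, that a composite cost is the sum of its components plus additive zero-mean Gaussian noise, and that negative results are clamped to zero. The function and argument names are hypothetical.

```python
import random


def assign_costs(atomic_names, composites, min_cost, max_cost, noise_std, seed):
    """Assign a cost to every tool, reproducibly for a given seed.

    composites maps a composite tool name to the list of atomic tools it
    is built from (a hypothetical representation).
    """
    rng = random.Random(seed)  # private RNG: the same seed gives the same costs
    costs = {name: rng.uniform(min_cost, max_cost) for name in atomic_names}
    for name, parts in composites.items():
        base = sum(costs[part] for part in parts)
        # Assumed form: component sum plus noise, never below zero.
        costs[name] = max(0.0, base + rng.gauss(0.0, noise_std))
    return costs


first = assign_costs(["a", "b"], {"ab": ["a", "b"]}, 10, 20, 1.0, seed=42)
second = assign_costs(["a", "b"], {"ab": ["a", "b"]}, 10, 20, 1.0, seed=42)
print(first == second)
```

Using a dedicated `random.Random(seed)` instance rather than the module-level generator keeps the result independent of any other random calls made elsewhere in the program.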

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/JiayuJeff/CostBench.git
cd CostBench

# Install the package in editable mode
pip install -e .

# Or install dependencies only
pip install -r requirements.txt

Search Database Setup

python env/domains/travel/generate_search_space.py \
    --generate_test \
    --test_config_path env/domains/travel/test_travel_config.yaml \
    --num-times 5000 \
    --num-time-combinations 5000 \
    --seed 42

This script generates the search database used by the search tools in the environment. Note: 42 is the random seed used in our experiments; you can choose a different seed as needed.

Environment Configuration

Local Model Deployment

Replace <your_base_url> in the model.endpoints field of env/config/travel_config.yaml with the base URL (including the port) of your local deployment.

API Calls

Replace <your_api_key_env> and <your_base_url> in the endpoints.base_url field of env/config/travel_config.yaml with the name of the environment variable that holds your API key and with your API base URL.

Then, configure your API key in env/.env:

# Example configuration
OPENAI_API_KEY=your_api_key_here
OPENAI_BASE_URL=https://api.openai.com/v1
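
For readers unfamiliar with this pattern, the sketch below shows how a KEY=VALUE file and an environment variable name can be combined to look up a key. It is an assumption-laden illustration: how CostBench actually loads env/.env is not described here, and `parse_env_lines` and `resolve_api_key` are hypothetical helpers written with the standard library only.

```python
import os


def parse_env_lines(lines):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def resolve_api_key(env_values, key_env_name):
    """Prefer the process environment, then fall back to the file values."""
    return os.environ.get(key_env_name) or env_values.get(key_env_name)


example = [
    "# Example configuration",
    "OPENAI_API_KEY=your_api_key_here",
    "OPENAI_BASE_URL=https://api.openai.com/v1",
]
print(parse_env_lines(example)["OPENAI_BASE_URL"])
```

With a real file, the same parser can be fed an open file handle, for example `parse_env_lines(open("env/.env", encoding="utf-8"))`.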

Review env/config/travel_config.yaml and adjust it if needed.

Running Examples

Main Results

Basic run command (Table 4 results):

python env/run.py \
    --refinement_level 2 \
    --ban_longest_tool \
    --model_name gpt-5 \
    --num_threads 10 \
    --output_dir outputs/

Run with dynamic blocking (replace cost_change with another block mode as needed):

python env/run.py \
    --refinement_level 2 \
    --ban_longest_tool \
    --model_name gpt-5 \
    --num_threads 10 \
    --use_blocker \
    --block_num 1 \
    --block_mode cost_change \
    --output_dir outputs/
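
To sweep over every blocking mode, the command above can be generated programmatically. The sketch below only builds and prints the commands; it uses the flags shown in this README, while the per-mode output subdirectory is an assumption made for illustration.

```python
import shlex

# Block modes as listed under "Blocking-Related Parameters".
BLOCK_MODES = ["preference_change", "cost_change", "steplen_change", "ban_tool"]


def build_command(block_mode, model_name="gpt-5", output_dir="outputs/"):
    """Return the argv list for one run with dynamic blocking enabled."""
    return [
        "python", "env/run.py",
        "--refinement_level", "2",
        "--ban_longest_tool",
        "--model_name", model_name,
        "--num_threads", "10",
        "--use_blocker",
        "--block_num", "1",
        "--block_mode", block_mode,
        "--output_dir", output_dir,
    ]


for mode in BLOCK_MODES:
    # Assumption: one output subdirectory per mode keeps results separate.
    print(shlex.join(build_command(mode, output_dir=f"outputs/{mode}/")))
```

Each printed line can be pasted into a shell, or the argv list can be passed to `subprocess.run` from the repository root.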

📖 Advanced Usage

For advanced configuration, please modify the configuration file to customize tool parameters, blocking behavior, model endpoints, and other settings.

Tool-Related Parameters

--tool_creation_seed: Random seed for tool generation; it also determines the cost-change batches for each query
--refinement_level: Tool refinement level, controlling task complexity (defaults to maximum depth). For advanced usage, task_sequence = refinement_level + 3.
--max_tool_steps: Maximum number of tool-calling steps
--min_atomic_cost / --max_atomic_cost: Cost range for atomic tools
--noise_std: Noise scaling factor for composite tools
--ban_longest_tool: Whether to disable tools that complete the task in one step

Blocking-Related Parameters

--use_blocker: Enable dynamic blocking functionality
--block_mode: Blocking mode (preference_change, cost_change, steplen_change, ban_tool)
--block_num: Number of dynamic blocking events per query

Query-Related Parameters

--query_path: Path to query file (JSON format)
--start_index / --end_index: Index range of queries to process (-1 means process all remaining queries)
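
The index-range semantics can be illustrated with a short sketch. It assumes that --end_index is exclusive and that only --end_index accepts -1; neither detail is stated above, so treat `select_queries` as a hypothetical helper rather than the repository's behavior.

```python
def select_queries(queries, start_index=0, end_index=-1):
    """Return the slice of queries to process.

    end_index == -1 means "through the last query"; any other value is
    treated as an exclusive upper bound (an assumption).
    """
    stop = len(queries) if end_index == -1 else end_index
    return queries[start_index:stop]


queries = [f"query_{i}" for i in range(10)]
print(select_queries(queries, start_index=7))               # the last three queries
print(select_queries(queries, start_index=2, end_index=5))  # queries 2, 3, and 4
```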

Model-Related Parameters

--model_name: Model name (must be defined in configuration file)
--temperature: Sampling temperature
--max_tokens: Maximum number of generated tokens

Runtime Parameters

--num_threads: Number of concurrent threads
--output_dir: Output directory for results
--require_goal_state: Still under construction. Setting it to true may cause unexpected behavior.

Simulation-Related Parameters

--use_stimulation: Enable random strategy simulation
--stimulation_num: Number of simulation runs per query
--greedy: Use a greedy selection strategy in simulation. If set to False, a random policy is used.
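
The difference between the two simulation policies can be sketched as follows. This is an illustration only: it assumes "greedy" means picking the cheapest tool applicable to the current state and "random" means picking uniformly among applicable tools. The tools and the `simulate` function are hypothetical.

```python
import random

# Hypothetical tools: (name, input_type, output_type, cost).
TOOLS = [
    ("find_city", "query", "city", 4.0),
    ("find_hotel", "city", "hotel", 6.0),
    ("find_city_and_hotel", "query", "hotel", 11.0),
    ("book_hotel", "hotel", "booking", 3.0),
]


def simulate(tools, start, goal, greedy, rng, max_steps=10):
    """Roll out one episode; return (reached_goal, total_cost, plan)."""
    state, total, plan = start, 0.0, []
    for _ in range(max_steps):
        if state == goal:
            break
        options = [tool for tool in tools if tool[1] == state]
        if not options:
            break  # dead end: no tool accepts the current type
        if greedy:
            choice = min(options, key=lambda tool: tool[3])  # cheapest next step
        else:
            choice = rng.choice(options)  # uniform random policy
        plan.append(choice[0])
        total += choice[3]
        state = choice[2]
    return state == goal, total, plan


rng = random.Random(0)
print(simulate(TOOLS, "query", "booking", greedy=True, rng=rng))
print(simulate(TOOLS, "query", "booking", greedy=False, rng=rng))
```

A greedy rollout is locally cheapest at each step but is not guaranteed to be globally cost-optimal, which is one reason such baselines are useful for comparison.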

You can specify a custom configuration file path through the COSTBENCH_TRAVEL_CONFIG environment variable.
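
A common way to implement this kind of override is sketched below. The fallback path is an assumption based on the env/config/travel_config.yaml file mentioned earlier, and `resolve_config_path` is a hypothetical helper; the repository's actual resolution logic may differ.

```python
import os
from pathlib import Path

# Assumed default, taken from the configuration file named in this README.
DEFAULT_CONFIG = Path("env/config/travel_config.yaml")


def resolve_config_path(environ=os.environ):
    """Return the override from COSTBENCH_TRAVEL_CONFIG, else the default."""
    override = environ.get("COSTBENCH_TRAVEL_CONFIG")
    return Path(override) if override else DEFAULT_CONFIG


print(resolve_config_path({}))
print(resolve_config_path({"COSTBENCH_TRAVEL_CONFIG": "configs/my_travel.yaml"}))
```

Passing the environment mapping as an argument keeps the function easy to test without modifying the real process environment.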

🔧 Extending to New Domains or Larger Task Sequences

Still under construction.

🤝 Contributing

Contributions are welcome! Please feel free to submit Issues and Pull Requests.

📚 Citing this work

@article{liu2025costbench,
  title={CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents},
  author={Liu, Jiayu and Qian, Cheng and Su, Zhaochen and Zong, Qing and Huang, Shijue and He, Bingxiang and Fung, Yi R},
  journal={arXiv preprint arXiv:2511.02734},
  year={2025}
}
