
CostBench

arXiv Hugging Face Daily Papers Hugging Face Dataset

This is the official repository for the paper "CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents".

🎯 Project Overview

CostBench is a comprehensive benchmark for evaluating multi-turn cost-optimal planning and adaptation capabilities of large language models (LLMs) in tool-using scenarios.

The benchmark systematically assesses how LLM agents navigate complex tool-calling environments by testing their ability to:

  • 📋 Cost-Optimal Planning: Plan cost-optimal multi-step tool invocation sequences in static environments
  • 🔄 Dynamic Adaptation: Dynamically adapt their strategies when tool costs, availability, or preferences change during execution in dynamic environments

✨ Core Features

  • Hierarchical Tool System: Supports atomic and composite tools, each with clear input/output types and costs
  • Flexible Cost Assignment: Supports configurable cost ranges for atomic tools and composite tools with component-based cost calculation plus Gaussian noise, enabling customizable cost distributions for evaluation scenarios
  • Dynamic Blocking: Supports multiple blocking modes (cost changes, preference changes, tool disabling, etc.) to test model adaptation capabilities
  • Adjustable Difficulty: Supports task sequences of different lengths to control task complexity
  • Reproducible Random System: Features a seed-controlled pseudo-random system that ensures reproducibility across runs while preventing data leakage through deterministic randomization
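The reproducibility guarantee above can be illustrated with a minimal sketch (not the actual CostBench implementation): drawing atomic tool costs from a seed-controlled local RNG, so the same seed always yields the same cost table.

```python
import random

def sample_atomic_costs(num_tools, min_cost, max_cost, seed):
    """Draw one cost per atomic tool from a seeded local RNG,
    so identical seeds reproduce identical cost tables."""
    rng = random.Random(seed)  # local RNG: no global state leaks between runs
    return [round(rng.uniform(min_cost, max_cost), 2) for _ in range(num_tools)]

# Same seed -> same deterministic cost table across runs
assert sample_atomic_costs(5, 1.0, 10.0, seed=42) == sample_atomic_costs(5, 1.0, 10.0, seed=42)
# Different seed -> a different (but still deterministic) table
assert sample_atomic_costs(5, 1.0, 10.0, seed=42) != sample_atomic_costs(5, 1.0, 10.0, seed=7)
```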

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/JiayuJeff/CostBench.git
cd CostBench

# Install the package in editable mode
pip install -e .

# Or install dependencies only
pip install -r requirements.txt

Search Database Setup

python env/domains/travel/generate_search_space.py \
    --generate_test \
    --test_config_path env/domains/travel/test_travel_config.yaml \
    --num-times 5000 \
    --num-time-combinations 5000 \
    --seed 42

This script generates the search database used by the search tools in our environment. Note: 42 is the random seed used in our experiments; you can choose other seeds as needed.

Environment Configuration

  1. Local Model Deployment

    Replace <your_base_url> in the model.endpoints field of env/config/travel_config.yaml with the base URL (host and port) of your locally deployed model server.

  2. API Calls

    Replace <your_api_key_env> and <your_base_url> in the endpoints.base_url field of env/config/travel_config.yaml with the name of your API key environment variable and your API base URL.

    Then, configure your API key in env/.env:

# Example configuration
OPENAI_API_KEY=your_api_key_here
OPENAI_BASE_URL=https://api.openai.com/v1
  3. Check and modify the configuration in env/config/travel_config.yaml (if needed).
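A minimal sketch of how the credentials in env/.env might be read at runtime (the variable names mirror the example above; the exact loading mechanism inside CostBench may differ):

```python
import os

def load_endpoint_config():
    """Read the API key and base URL from environment variables,
    matching the example env/.env configuration."""
    api_key = os.environ.get("OPENAI_API_KEY")
    base_url = os.environ.get("OPENAI_BASE_URL", "https://api.openai.com/v1")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to env/.env")
    return {"api_key": api_key, "base_url": base_url}

os.environ.setdefault("OPENAI_API_KEY", "sk-demo")  # placeholder value for illustration only
cfg = load_endpoint_config()
assert cfg["base_url"].startswith("http")
```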

Running Examples

Main Results

Basic run command (Table 4 results):

python env/run.py \
    --refinement_level 2 \
    --ban_longest_tool \
    --model_name gpt-5 \
    --num_threads 10 \
    --output_dir outputs/

Run with dynamic blocking (cost_change can be replaced with any other blocking mode):

python env/run.py \
    --refinement_level 2 \
    --ban_longest_tool \
    --model_name gpt-5 \
    --num_threads 10 \
    --use_blocker \
    --block_num 1 \
    --block_mode cost_change \
    --output_dir outputs/

📖 Advanced Usage

For advanced configuration, please modify the configuration file to customize tool parameters, blocking behavior, model endpoints, and other settings.

Tool-Related Parameters

  • --tool_creation_seed: Random seed for tool generation, controlling tool creation and the cost-change batches for each query
  • --refinement_level: Tool refinement level, controlling task complexity (defaults to the maximum depth). For advanced usage, the task sequence length equals refinement_level + 3.
  • --max_tool_steps: Maximum number of tool-calling steps
  • --min_atomic_cost / --max_atomic_cost: Cost range for atomic tools
  • --noise_std: Noise scaling factor for composite tools
  • --ban_longest_tool: Whether to disable tools that complete the task in one step
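The cost-related flags above can be read together: atomic tool costs fall in [--min_atomic_cost, --max_atomic_cost], and composite tool costs are computed from their components plus Gaussian noise scaled by --noise_std. A hypothetical sketch of that calculation (the exact formula in CostBench may differ):

```python
import random

def composite_cost(component_costs, noise_std, rng):
    """Illustrative composite-tool cost: sum of component costs plus
    Gaussian noise whose scale grows with noise_std (not the exact formula)."""
    base = sum(component_costs)
    noise = rng.gauss(0.0, noise_std * base)  # noise_std acts as a scaling factor
    return max(base + noise, 0.0)  # costs stay non-negative

rng = random.Random(0)
cost = composite_cost([2.0, 3.5, 1.5], noise_std=0.1, rng=rng)
assert cost > 0
# With noise_std=0 the composite cost is exactly the component sum
assert composite_cost([1.0, 1.0], noise_std=0.0, rng=rng) == 2.0
```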

Blocking-Related Parameters

  • --use_blocker: Enable dynamic blocking functionality
  • --block_mode: Blocking mode (preference_change, cost_change, steplen_change, ban_tool)
  • --block_num: Number of dynamic blocking events per query

Query-Related Parameters

  • --query_path: Path to query file (JSON format)
  • --start_index / --end_index: Index range of queries to process (-1 means process all remaining queries)
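The index-range semantics above can be sketched as a small helper (hypothetical, for illustration; the real argument handling lives in env/run.py):

```python
def select_queries(queries, start_index, end_index):
    """Slice the query list; end_index == -1 means process
    all remaining queries from start_index onward."""
    if end_index == -1:
        return queries[start_index:]
    return queries[start_index:end_index]

assert select_queries(list(range(10)), 2, -1) == list(range(2, 10))
assert select_queries(list(range(10)), 0, 3) == [0, 1, 2]
```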

Model-Related Parameters

  • --model_name: Model name (must be defined in configuration file)
  • --temperature: Sampling temperature
  • --max_tokens: Maximum number of generated tokens

Runtime Parameters

  • --num_threads: Number of concurrent threads
  • --output_dir: Output directory for results
  • --require_goal_state: Still under construction; unexpected behavior may occur if set to true.

Simulation-Related Parameters

  • --use_stimulation: Enable random strategy simulation
  • --stimulation_num: Number of simulation runs per query
  • --greedy: Use a greedy selection strategy in simulation; falls back to a random policy if set to False.
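To illustrate the difference between --greedy and the random policy, here is a toy single-step selection rule (hypothetical; the real simulator evaluates full multi-step plans):

```python
import random

def pick_tool(tools, greedy, rng):
    """tools: mapping of tool name -> cost. Greedy picks the cheapest
    tool; otherwise pick uniformly at random (sorted for determinism)."""
    if greedy:
        return min(tools, key=tools.get)
    return rng.choice(sorted(tools))

tools = {"fly": 9.0, "train": 4.0, "bus": 2.5}
assert pick_tool(tools, greedy=True, rng=random.Random(0)) == "bus"
assert pick_tool(tools, greedy=False, rng=random.Random(0)) in tools
```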

You can specify a custom configuration file path through the COSTBENCH_TRAVEL_CONFIG environment variable.

🔧 Extending to New Domains or Larger Task Sequences

Still under construction.

🤝 Contributing

Contributions are welcome! Please feel free to submit Issues and Pull Requests.

📚 Citing this work

@article{liu2025costbench,
  title={CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents},
  author={Liu, Jiayu and Qian, Cheng and Su, Zhaochen and Zong, Qing and Huang, Shijue and He, Bingxiang and Fung, Yi R},
  journal={arXiv preprint arXiv:2511.02734},
  year={2025}
}
