BAGEN

Are LLM agents budget-aware?

BAGEN studies whether agents can estimate token, time, money, and storage costs while they are still completing a task.

We provide rollout logging, budget-estimation benchmarks, experiment trajectories, and SFT/RL training code for budget-aware agents.

BAGEN evaluates whether agents can estimate budget cost from partial rollouts, in both token-only and multi-resource settings.

News

2026.05.27. We are excited to release BAGEN!

About

BAGEN targets progressive budget estimation in long-horizon agent rollouts. At each prefix of an interaction, a model is asked to estimate whether the agent can still finish within the remaining resource budget and, when feasible, how much budget is needed.
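The prefix-level setup described above can be sketched as follows. This is a minimal illustration, not the BAGEN implementation: the names `PrefixLabel` and `label_prefixes`, and the assumption that each turn has a single known token cost, are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PrefixLabel:
    prefix_len: int                  # number of turns already observed
    used: int                        # tokens spent inside the prefix
    remaining_needed: Optional[int]  # tokens still needed, or None if infeasible


def label_prefixes(turn_tokens: List[int], budget: int, success: bool) -> List[PrefixLabel]:
    """Build one ground-truth label per prefix of a finished rollout.

    turn_tokens[i] is the (assumed known) token cost of turn i.
    A prefix is feasible when the rollout succeeded and its total
    cost fits inside the overall budget.
    """
    total = sum(turn_tokens)
    feasible = success and total <= budget
    labels = []
    used = 0
    for k, cost in enumerate(turn_tokens):
        # Prefix of length k: turns 0..k-1 have been observed.
        labels.append(PrefixLabel(k, used, total - used if feasible else None))
        used += cost
    return labels
```

For a rollout with turn costs `[100, 250, 150]` and a budget of 600 tokens, the three prefixes would need 500, 400, and 150 more tokens respectively; with a budget of 400 every prefix would be labeled infeasible.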

The current public code focuses on four settings:

Sokoban: token-budget estimation over interactive puzzle rollouts.
Search-R1: token-budget estimation for search-agent trajectories.
SWE-bench-style coding: token-budget estimation over coding-agent logs.
Warehouse: multi-resource estimation over time, storage occupancy, and cumulative cost.
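For the multi-resource setting, one simple way to think about feasibility is a per-resource check. The sketch below assumes each resource is treated as an independent limit and uses made-up units; the helper names `within_budget` and `feasible` are hypothetical, and the actual Warehouse evaluation may combine resources differently.

```python
from typing import Dict


def within_budget(needed: Dict[str, float], remaining: Dict[str, float]) -> Dict[str, bool]:
    """Per-resource check: is the estimated need covered by what is left?"""
    return {name: needed[name] <= remaining[name] for name in remaining}


def feasible(needed: Dict[str, float], remaining: Dict[str, float]) -> bool:
    """A prefix is jointly feasible only if every resource fits."""
    return all(within_budget(needed, remaining).values())


# Illustrative values only; units and magnitudes are invented.
needed = {"time": 40.0, "storage": 0.7, "cost": 1200.0}
remaining = {"time": 60.0, "storage": 0.5, "cost": 2000.0}
```

Here the storage requirement exceeds what remains, so the prefix is jointly infeasible even though time and cost still fit.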

BAGEN builds on the RAGEN/verl codebase. The Python package directory is still named ragen so that existing import paths, Hydra configs, wrappers, and training code keep working.

Dataset

The public BAGEN dataset contains the artifacts used to build and evaluate the budget-estimation benchmark. It is intended for reproducing the reported offline evaluation results, inspecting agent rollouts, and preparing SFT/GRPO training data for budget-aware agents.

The hosted dataset is available at:

Hugging Face: https://huggingface.co/datasets/MLL-Lab/BAGEN

The dataset is organized into two main directories:

origin/: original rollout artifacts from Sokoban, Search-R1, SWE-bench-style coding, and anonymized Warehouse-style tasks. These files are the source trajectories, dialogues, and logs used to construct prefix budget-estimation prompts.
estimation/: derived offline budget-estimation files, including prompt/target pairs, evaluator outputs, model predictions, and aggregate records used for benchmark scoring or downstream budget-RL data preparation.

On the project machine, the current staging copy is:

/u/ylin30/database/origin
/u/ylin30/database/estimation

The Hugging Face repository also includes manifest.jsonl, a file index with paths, sizes, and direct download URLs for the uploaded artifacts. Local environment/training datasets used by the codebase live under data/, but large benchmark artifacts should remain outside Git and be downloaded from the dataset repository when needed.
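A JSON Lines manifest can be read one record per line. The sketch below is illustrative: the field names `path`, `size`, and `url`, the sample records, and the helpers `read_manifest` and `select` are assumptions, so check the real manifest.jsonl for its actual keys before relying on them.

```python
import json
from typing import Dict, Iterable, List


def read_manifest(lines: Iterable[str]) -> List[Dict]:
    """Parse JSON Lines text: one JSON object per non-empty line."""
    entries = []
    for line in lines:
        line = line.strip()
        if line:
            entries.append(json.loads(line))
    return entries


def select(entries: List[Dict], prefix: str) -> List[Dict]:
    """Keep entries whose (assumed) 'path' field starts with prefix."""
    return [e for e in entries if e["path"].startswith(prefix)]


# Made-up records for illustration; real paths and URLs will differ.
sample = [
    '{"path": "origin/a.json", "size": 1024, "url": "https://example.invalid/a"}',
    '{"path": "estimation/b.json", "size": 2048, "url": "https://example.invalid/b"}',
    '{"path": "estimation/c.json", "size": 512, "url": "https://example.invalid/c"}',
]

entries = read_manifest(sample)
estimation = select(entries, "estimation/")
total_bytes = sum(e["size"] for e in estimation)
```

Filtering by directory prefix before downloading makes it possible to fetch only origin/ or only estimation/ artifacts.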

Method

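The benchmark is organized as a two-pass pipeline (rollout collection, then offline estimation), followed by a training stage that consumes the estimation data: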

Original rollout collection. Run a task model in an environment and save both rollout artifacts and dialogue JSON logs.
Offline budget estimation. Replay rollout prefixes and ask an evaluator model to output a remaining-budget interval or impossible.
Budget-RL training. Convert estimation data into SFT/GRPO datasets and train a budget estimator with local-model rollout support.
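To make the evaluator's output format concrete, the sketch below checks a single prediction against ground truth. It assumes a prediction is either a `(low, high)` interval or the string `"impossible"`, and that a hit means the interval contains the true remaining cost; the function name `is_correct` is hypothetical and the metrics BAGEN actually reports may be defined differently.

```python
from typing import Optional, Tuple, Union

# A prediction is a closed interval (low, high) or the literal "impossible".
Prediction = Union[Tuple[int, int], str]


def is_correct(pred: Prediction, true_remaining: Optional[int]) -> bool:
    """Return True when the prediction agrees with the ground truth.

    true_remaining is None when the rollout cannot finish within budget.
    """
    if pred == "impossible":
        return true_remaining is None
    low, high = pred
    return true_remaining is not None and low <= true_remaining <= high
```

Under this rule, a very wide interval is always correct on feasible prefixes, so a practical metric would likely also account for interval width.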

Benchmark Summary

Budget-estimation results across external and internal benchmarks.

Getting Started

git clone --recurse-submodules <repo-url>
cd BAGEN
conda create -n bagen python=3.12 -y
conda activate bagen
bash scripts/setup_bagen.sh
export PYTHONPATH="$PWD:$PWD/verl"

For Search-R1 retrieval experiments, download or build the search index:

python scripts/download_search_index.py

API Keys

API-based evaluation uses the provider selected by MODEL_NAME. Export only the key required for the model you run:

export OPENAI_API_KEY=...
export ANTHROPIC_API_KEY=...
export OPENROUTER_API_KEY=...
export GEMINI_API_KEY=...
export TOGETHER_API_KEY=...
export DEEPSEEK_API_KEY=...

Set DRY_RUN=1 to build prompts and validate inputs without calling an API.
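Before launching an evaluation, it can help to confirm that either DRY_RUN=1 is set or at least one provider key is exported. The sketch below is a hypothetical preflight helper, not part of the BAGEN scripts; it only uses the environment variable names listed above and does not attempt to map MODEL_NAME to a provider.

```python
import os
from typing import Dict, Mapping

KEY_VARS = [
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "OPENROUTER_API_KEY",
    "GEMINI_API_KEY",
    "TOGETHER_API_KEY",
    "DEEPSEEK_API_KEY",
]


def preflight(env: Mapping[str, str]) -> Dict[str, object]:
    """Report which provider keys are set and whether a run can proceed."""
    dry_run = env.get("DRY_RUN") == "1"
    keys_set = [name for name in KEY_VARS if env.get(name)]
    # A dry run needs no key; a real run needs at least one.
    return {"dry_run": dry_run, "keys_set": keys_set, "ok": dry_run or bool(keys_set)}


if __name__ == "__main__":
    print(preflight(os.environ))
```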

Run Budget Estimation

Sokoban

INPUT_JSON="$PWD/results/estimation/sokoban-origin-gpt5.2-instant-128-main/sokoban_api_eval_estimation_eval_estimation_dialogues.json" \
MODEL_NAME=qwen/qwen3-235b-a22b-2507 \
MAX_CONTEXT_WINDOW_TOKENS=2500 \
bash scripts/evaluation-scripts/eval/sokoban.sh

Search-R1

INPUT_JSON="$PWD/results/estimation/searchr1-origin-gpt5.2-instant-128-main/search_r1_api_eval_estimation_eval_estimation_dialogues.json" \
MODEL_NAME=qwen/qwen3-235b-a22b-2507 \
MAX_CONTEXT_WINDOW_TOKENS=3500 \
bash scripts/evaluation-scripts/eval/searchr1.sh

SWE-bench-style coding

INPUT_SOURCE=/path/to/swebench-origin-rollouts \
MODEL_NAME=Claude-Opus-4.7-low-thinking \
bash scripts/evaluation-scripts/eval/swebench.sh

Warehouse

INPUT_SOURCE=/path/to/warehouse_rollouts.json \
MODEL_NAME=qwen/qwen3-235b-a22b-2507 \
BUDGET_PRESET=half-reachable \
bash scripts/evaluation-scripts/eval/warehouse.sh

Each eval script writes:

OUTPUT_JSON: predictions, ground truth, API usage, and aggregate metrics
TEMP_JSON: prompt/target pairs for inspection

Smoke-test an eval path without API calls:

DRY_RUN=1 MAX_SAMPLES=5 INPUT_JSON=/path/to/dialogues.json \
bash scripts/evaluation-scripts/eval/sokoban.sh

Budget-RL Training

The SFT/GRPO pipeline for training a budget estimator lives under scripts/budget-rl.

DRY_RUN=1 bash scripts/budget-rl/run_budget_rl_pipeline.sh prepare,sft,rl

For a real run, remove DRY_RUN=1 and set the model, data, GPU, and checkpoint variables:

TASK=sokoban \
ROLLOUT_MODEL=Qwen/Qwen3-8B \
LEARNER_MODEL=Qwen/Qwen2.5-7B-Instruct \
NUM_TRAJECTORIES=128 \
NGPUS=8 \
bash scripts/budget-rl/run_budget_rl_pipeline.sh all

Public Release Notes

Do not commit local experiment outputs or private manuscript drafts. The release expects these to stay outside Git:

results/, logs/, wandb/, outputs/, model_saving/
data/, search_data/, downloaded search indices, and raw Warehouse data
local PDFs such as Budget_NeurIPS_2026*.pdf
API keys, .env files, and machine-specific absolute paths

The Warehouse data used by the paper should be released only in anonymized form. Do not add raw enterprise records to this repository.
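The exclusion list above can be turned into a quick check over candidate file paths. This is a minimal sketch with a hypothetical helper `flag_paths`; it covers only the directory names and file patterns named in this section and is not a substitute for the repository's .gitignore.

```python
from fnmatch import fnmatch
from typing import Iterable, List

# Top-level directories that should stay outside Git.
EXCLUDED_DIRS = ("results", "logs", "wandb", "outputs", "model_saving", "data", "search_data")
# File-name patterns that should never be committed.
EXCLUDED_PATTERNS = ("Budget_NeurIPS_2026*.pdf", ".env", "*.env")


def flag_paths(paths: Iterable[str]) -> List[str]:
    """Return the paths that fall under an excluded directory or pattern."""
    flagged = []
    for path in paths:
        parts = path.strip("/").split("/")
        name = parts[-1]
        if parts[0] in EXCLUDED_DIRS or any(fnmatch(name, pat) for pat in EXCLUDED_PATTERNS):
            flagged.append(path)
    return flagged
```

Feeding it the list of staged paths before a commit would surface any experiment outputs or key files that slipped in.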

Citation

If you find this work useful, please cite:

@misc{lin2026bagen,
  title={BAGEN: Are LLM Agents Budget-Aware?},
  author={Yuxiang Lin and Zihan Wang and Mengyang Liu and Yuxuan Shan and Longju Bai and Junyao Zhang and Xing Jin and Boshan Chen and Jinyan Su and Xingyao Wang and Jiaxin Pei and Manling Li},
  year={2026},
  note={Preprint},
}

License

MIT
