GitHub - microsoft/STATE-Bench: Benchmark AI Agents on Enterprise Workflows

Main Track · Agent Learning Track

STATE-Bench evaluates AI agents on realistic, multi-step enterprise workflows across three domains: travel, customer support, and shopping assistant.

Each task gives the agent a task-local sandbox database, domain-specific tools, and a simulated user. To pass a task, the agent must reason over multiple steps: gather the right information with the domain tools, apply the correct policy, update the database to the correct final state when an update is needed, and follow the required procedure in conversation.
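The pass condition above combines a final-state requirement with a procedural one. The sketch below illustrates that conjunction only. It is not STATE-Bench's grading code: `TaskOutcome`, `task_passed`, the step names, and the dictionary-shaped database are all hypothetical, and the real benchmark relies on its own judges.

```python
from dataclasses import dataclass, field


@dataclass
class TaskOutcome:
    """Hypothetical record of one agent run (not a STATE-Bench type)."""
    final_db: dict                                   # sandbox database after the run
    steps_followed: set = field(default_factory=set)  # procedure steps observed in conversation


def task_passed(outcome: TaskOutcome, expected_db: dict, required_steps: set) -> bool:
    """Pass only if the final state matches AND every required step was followed."""
    state_ok = outcome.final_db == expected_db
    procedure_ok = required_steps <= outcome.steps_followed  # subset check
    return state_ok and procedure_ok


# Assumed example: a cancellation that must be confirmed with the user first.
expected = {"booking_42": {"status": "cancelled", "refund": 120.0}}
required = {"verify_identity", "confirm_before_cancel"}

good = TaskOutcome(final_db=dict(expected), steps_followed={"verify_identity", "confirm_before_cancel"})
skipped = TaskOutcome(final_db=dict(expected), steps_followed={"verify_identity"})

print(task_passed(good, expected, required))
print(task_passed(skipped, expected, required))
```

In this sketch a run that reaches the right database state but skips a required step still fails, which mirrors the idea that the outcome and the procedure both count.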

Overview

STATE-Bench includes 450 challenging enterprise tasks across three domains.

Domain	Tasks	Description
Travel	150	Flight, hotel, and car rental bookings; cancellations, updates, fee and policy reasoning, cross-product trip planning
Customer Support	150	Returns, refunds, exchanges, warranty claims, cancellations, shipping issues, and order changes
Shopping Assistant	150	Product search, cart updates, applying promos, loyalty redemption, shipping options, and compatibility checks

Choose Your Benchmark Track

Start with the track that matches what you want to evaluate. Each track guide links to the setup and reference docs only when you need them.

Goal	Start here
Evaluate an agent or model directly on the provided enterprise benchmark tasks	Main Track
Evaluate agentic memory, skills, or prompt optimization	Agent Learning Track

The Main Track is the default benchmark path. The Agent Learning Track uses the same simulator, domain tools, judges, and metrics, but adds train trajectories and a retrieval hook for reusable learnings such as memories, skills, or prompt optimizations.

Sample task trajectory from the Travel domain.

Metrics

STATE-Bench reports four headline metrics:

Metric	What it measures
Task Completion pass@1	Average task completion rate across five runs per task.
Task Completion pass^5	Percentage of tasks completed successfully on all five runs.
UX Score	LLM-judged conversation quality on a 1-5 scale.
Cost Per Task	Average agent cost from user-reported token usage and pricing.
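As a rough illustration of how the completion and cost metrics in the table could be computed from raw run results, here is a minimal sketch. It assumes each task has a list of five pass/fail outcomes and that cost is derived from token counts and per-million-token prices. The function names, the data layout, and the prices are assumptions for this example, not the benchmark's actual scoring code.

```python
from statistics import mean


def pass_at_1(results: dict) -> float:
    """Mean per-task completion rate, averaged over all tasks."""
    return mean(mean(1.0 if ok else 0.0 for ok in runs) for runs in results.values())


def pass_hat_k(results: dict) -> float:
    """Fraction of tasks that succeeded on every one of their runs."""
    return mean(1.0 if all(runs) else 0.0 for runs in results.values())


def run_cost(input_tokens: int, output_tokens: int,
             usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost of one run from token usage and (assumed) per-million-token prices."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


# Hypothetical outcomes: task id -> pass/fail for each of five runs.
results = {
    "travel_001": [True, True, True, True, True],
    "support_001": [True, False, True, True, False],
    "shopping_001": [False, False, False, False, False],
}

print(f"pass@1 = {pass_at_1(results):.3f}")
print(f"pass^5 = {pass_hat_k(results):.3f}")
print(f"cost   = ${run_cost(1000, 500, 2.0, 8.0):.4f}")
```

The two completion metrics diverge on tasks that succeed only some of the time: `support_001` contributes 0.6 to pass@1 but nothing to pass^5, so pass^5 works as a consistency measure.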

License

STATE-Bench is released under the MIT License. See LICENSE.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

Disclosures

The datasets provided in this benchmark were synthetically generated using large language models. The benchmark is intended for research purposes, and users should exercise caution and consider the limitations of synthetic data when interpreting results.

