AgentRx 🩺

Diagnosing AI Agent Failures from Execution Trajectories

[Paper] [Dataset]

AI agents often fail in ways that are difficult to localize: executions are probabilistic, long-horizon, multi-agent, and mediated by noisy tool outputs. AgentRx is an automated, domain-agnostic diagnostic framework that pinpoints the critical failure step in a failed agent trajectory. It synthesizes constraints (invariants), evaluates them step-by-step, and produces an auditable validation log of constraint violations with associated evidence. An LLM-based judge uses this log to localize the critical step and classify the failure into a grounded 10-category taxonomy.

AgentRx improves step localization and failure attribution over existing baselines across three domains: structured API workflows (Tau-bench), incident management (Flash), and open-ended web/file tasks (Magentic-One).

Raw logs ──▶ Trajectory IR ──▶ Invariants ──▶ Checker ──▶ Judge ──▶ Reports

Quick Start

pip install -r requirements.txt

# Run the full pipeline end-to-end
python run.py trajectory.json

# Specify domain explicitly
python run.py trajectory.json --domain tau

All outputs are saved to runs/<run_name>/.


Step-by-Step Usage

You can run each stage individually and inspect the results between stages:

# 1. Normalize raw logs into Trajectory IR
python run.py trajectory.json --stage ir --run-name my_run

# 2. Generate static invariants
python run.py trajectory.json --stage static --run-dir runs/my_run

# 3. Generate dynamic (per-step) invariants
python run.py trajectory.json --stage dynamic --run-dir runs/my_run

# 4. Check all invariants against the trajectory
python run.py trajectory.json --stage check --run-dir runs/my_run

# 5. Run LLM judge for root-cause classification
python run.py trajectory.json --stage judge --run-dir runs/my_run

# 6. Generate report plots
python run.py trajectory.json --stage report --run-dir runs/my_run
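The six stages above can also be chained programmatically. A minimal sketch (assuming only the `run.py` flags shown above; the stage names and flag handling mirror the commands in this section):

```python
import subprocess

# Pipeline stages in execution order, as exposed by run.py --stage.
STAGES = ["ir", "static", "dynamic", "check", "judge", "report"]

def stage_command(trajectory, stage, run_dir="runs/my_run"):
    """Build the run.py invocation for a single pipeline stage."""
    cmd = ["python", "run.py", trajectory, "--stage", stage]
    if stage == "ir":
        # The first stage names the run; later stages reuse its directory.
        cmd += ["--run-name", run_dir.split("/")[-1]]
    else:
        cmd += ["--run-dir", run_dir]
    return cmd

def run_all(trajectory, run_dir="runs/my_run"):
    """Run every stage in order, stopping on the first failure."""
    for stage in STAGES:
        subprocess.run(stage_command(trajectory, stage, run_dir), check=True)
```

This keeps each stage inspectable between invocations while avoiding six hand-typed commands.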

Pipeline Stages

| # | Stage | Output |
|---|-------|--------|
| 1 | IR: Normalize raw logs into canonical Trajectory IR | trajectory_ir.json |
| 2 | Static: Generate policy/tool/structure invariants | static_invariants.json |
| 3 | Dynamic: Generate per-step context-aware invariants | dynamic_invariants/ |
| 4 | Check: Evaluate invariants, record violations | checker_results/ |
| 5 | Judge: LLM classifies root-cause failure (10-category taxonomy) | judge_output/ |
| 6 | Report: Failure frequency plots | plots/ |
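The intermediate artifacts are plain JSON, so they can be inspected between stages. As a hypothetical example (the real checker schema may differ; the `violated` field is an assumption for illustration), a small helper that tallies violations recorded under checker_results/:

```python
import json
from pathlib import Path

def count_violations(run_dir):
    """Tally violated invariants across all checker result files.

    Assumes (hypothetically) each file holds a list of records with a
    boolean 'violated' field; adapt to the actual checker output schema.
    """
    total = 0
    for path in Path(run_dir, "checker_results").glob("*.json"):
        records = json.loads(path.read_text())
        total += sum(1 for r in records if r.get("violated"))
    return total
```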

Directory Structure

agentverify/
├── run.py                       # CLI entry point
├── src/
│   ├── ir/                      # Trajectory IR normalization
│   ├── invariants/              # Invariant generation & checking
│   ├── judge/                   # LLM-as-a-Judge evaluation
│   ├── llm_clients/             # Azure OpenAI & TRAPI clients
│   ├── pipeline/                # Config (globals.py), utilities
│   └── reports/                 # Analysis & visualization
├── data/                        # Domain policies, tool schemas, ground truth
├── trajectories/                # Sample trajectories (tau, magentic, test)
└── runs/                        # Pipeline outputs (one folder per run)

Supported Domains

| Domain | Flag | Description |
|--------|------|-------------|
| tau | --domain tau | Tau-bench retail customer service |
| magentic | --domain magentic | Magentic-One multi-agent |
| flash | --domain flash | Flash/orchestrator incident traces |
| (auto) | (default) | Auto-detected; unknown formats use LLM-based IR fallback |
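When no --domain flag is given, the pipeline auto-detects the format. One plausible heuristic, purely illustrative (the field names `source`, `env`, `messages`, `incident_id` are assumptions, not the real detector's logic), keys off structure in the raw log:

```python
def detect_domain(log: dict) -> str:
    """Illustrative domain detection; the real detector may differ.

    Returns 'tau', 'magentic', 'flash', or 'unknown' (which would
    route to the LLM-based IR fallback).
    """
    if "tau_bench" in log.get("source", "") or "retail" in log.get("env", ""):
        return "tau"
    if any(m.get("agent") == "Orchestrator" for m in log.get("messages", [])):
        return "magentic"
    if "incident_id" in log:
        return "flash"
    return "unknown"
```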

Configuration

LLM settings are loaded from environment variables (via .env or shell):

# TRAPI (default endpoint)
AGENT_VERIFY_TRAPI_INSTANCE=          # e.g., "my-instance/my-pool"
AGENT_VERIFY_TRAPI_DEPLOYMENT_NAME=   # e.g., "my-deployment-name"

# Azure OpenAI (use --endpoint azure)
AGENT_VERIFY_ENDPOINT=                # e.g., "https://my-resource.openai.azure.com/"
AGENT_VERIFY_DEPLOYMENT=              # e.g., "gpt-5"

Both endpoints use Azure AD token-based auth (az login or Managed Identity).
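A sketch of how these variables might be validated at startup (the variable names come from the table above; the fail-fast selection logic is an assumption, not the project's actual config loader):

```python
import os

def resolve_endpoint(endpoint: str = "trapi") -> dict:
    """Collect the env-var config for the chosen endpoint.

    Raises if a required variable is unset, so misconfiguration
    surfaces before any LLM call is made.
    """
    keys = {
        "trapi": ["AGENT_VERIFY_TRAPI_INSTANCE",
                  "AGENT_VERIFY_TRAPI_DEPLOYMENT_NAME"],
        "azure": ["AGENT_VERIFY_ENDPOINT",
                  "AGENT_VERIFY_DEPLOYMENT"],
    }[endpoint]
    cfg = {k: os.environ.get(k) for k in keys}
    missing = [k for k, v in cfg.items() if not v]
    if missing:
        raise RuntimeError(f"Missing environment variables: {missing}")
    return cfg
```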


Failure Taxonomy

| # | Category | Description |
|---|----------|-------------|
| 1 | Instruction/Plan Adherence Failure | Skips steps or adds unnecessary actions |
| 2 | Invention of New Information | Fabricates or omits ungrounded facts |
| 3 | Invalid Invocation | Malformed tool call (wrong args/types/schema) |
| 4 | Misinterpretation of Tool Output | Incorrect reasoning about tool results |
| 5 | Intent-Plan Misalignment | Pursues wrong objective |
| 6 | Underspecified User Intent | Missing information to proceed |
| 7 | Intent Not Supported | Action can't be performed with available tools |
| 8 | Guardrails Triggered | Blocked by safety/RAI/access policies |
| 9 | System Failure | Infra errors (timeouts, unreachable endpoints) |
| 10 | Inconclusive | Insufficient evidence to classify |
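When post-processing judge output, the taxonomy is convenient to carry as a simple mapping (category ids and names from the table above; the fallback-to-Inconclusive behavior is an assumption for illustration):

```python
# Category ids and names from the 10-category failure taxonomy.
FAILURE_TAXONOMY = {
    1: "Instruction/Plan Adherence Failure",
    2: "Invention of New Information",
    3: "Invalid Invocation",
    4: "Misinterpretation of Tool Output",
    5: "Intent-Plan Misalignment",
    6: "Underspecified User Intent",
    7: "Intent Not Supported",
    8: "Guardrails Triggered",
    9: "System Failure",
    10: "Inconclusive",
}

def label(category_id: int) -> str:
    """Map a judge-assigned category id to its name.

    Unknown ids fall back to 'Inconclusive' (an illustrative choice).
    """
    return FAILURE_TAXONOMY.get(category_id, FAILURE_TAXONOMY[10])
```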

Running Individual Modules

Each module can also be run standalone from src/:

Static Invariant Generator – generate policy/tool invariants:

python src/invariants/static_invariant_generator.py --input-path trajectory.json --domain tau --endpoint trapi

Dynamic Invariant Generator – generate per-step context-aware invariants:

python src/invariants/dynamic_invariant_generator.py --input-path trajectory.json --domain tau --mode stepbystep

Checker – evaluate invariants against a trajectory:

python src/invariants/checker.py --input-path trajectory.json --static-invariants static_inv.json --dynamic-invariants-dir dyn_inv/

Judge – run LLM-as-a-Judge classification:

python src/judge/judge.py --domain tau --log_file trajectory.json --endpoint trapi --mode combined

Third-Party Code

This project uses the following third-party open source packages (installed via requirements.txt):

  • openai – OpenAI Python client (MIT License)
  • azure-identity / azure-core – Azure SDK authentication (MIT License)
  • matplotlib – Plotting and visualization (PSF-based License)
  • tiktoken – Token counting (MIT License)
  • httpx – HTTP client (BSD License)

See requirements.txt for the full list of dependencies.


Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

Please see CONTRIBUTING.md for detailed guidelines.


Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.


License

Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT license.


Citation

If you use AgentRx, please cite:

@article{barke2026agentrx,
  title={AgentRx: Diagnosing AI Agent Failures from Execution Trajectories},
  author={Barke, Shraddha and Goyal, Arnav and Khare, Alind and Singh, Avaljot and Nath, Suman and Bansal, Chetan},
  journal={arXiv preprint arXiv:2602.02475},
  year={2026}
}
