A curated guide to operating AI agents in production: observability, evaluation, tracing, guardrails, deployment, security, governance, and incident response.
AgentOps is the production operating layer for AI agents. It is broader than agent monitoring: it covers how teams ship, observe, evaluate, debug, secure, control, and improve autonomous or semi-autonomous AI systems after they leave a demo environment.
This list is intentionally focused on engineering resources for production agents. It favours tools, papers, patterns, and references that help teams answer operational questions:
- What did the agent do, and why?
- Did it complete the task correctly, safely, and within policy?
- Can we replay, debug, and evaluate failures?
- Can humans approve, interrupt, or override high-impact actions?
- Can we manage cost, latency, secrets, identity, and permissions?
- Can we respond to agent incidents with evidence instead of guesswork?
- Scope
- Conceptual Map
- AgentOps vs DevOps vs MLOps
- Observability and Tracing
- Evaluation and Testing
- Replay and Debugging
- Guardrails and Runtime Controls
- Security, Identity, and Access Control
- Human Approval and Workflow Control
- Deployment and Runtime Infrastructure
- Cloud AgentOps Platforms
- Cost, Latency, and Reliability
- Incident Response and Governance
- Multi-Agent Operations
- Standards and Protocols
- Research and References
- Contributing
Included:
- Production agent observability, tracing, evaluation, testing, and replay.
- Runtime guardrails, policy checks, approvals, and control planes.
- Agent security, identity, permissions, sandboxing, and secret handling.
- Deployment, reliability, cost, latency, and incident response practices.
- Multi-agent coordination and operational failure modes.
Not included by default:
- Generic LLM chat apps without an operational angle.
- Prompt collections with no evaluation or production discipline.
- Broad AI tool directories that do not focus on running agents.
- Vendor marketing pages without useful engineering substance.
AgentOps spans the full lifecycle of an agent system:
- Design: define tasks, tools, permissions, policies, and success criteria.
- Build: instrument traces, model calls, tool calls, memory, and state transitions.
- Test: run unit tests, scenario tests, adversarial tests, and regression evals.
- Deploy: manage environments, secrets, rollout, rate limits, and fallbacks.
- Operate: monitor correctness, safety, cost, latency, drift, and incidents.
- Improve: replay failures, evaluate fixes, update policies, and govern change.
AgentOps overlaps with DevOps and MLOps, but it is not the same discipline.
DevOps focuses on shipping and operating software systems: infrastructure, CI/CD, deployment, monitoring, reliability, and incident response.
MLOps focuses on the machine learning lifecycle: datasets, training pipelines, model registries, model deployment, drift, performance monitoring, and retraining.
AgentOps focuses on the operational behaviour of AI agents after they are given goals, tools, memory, context, permissions, workflows, and the ability to take actions.
| Discipline | Primary concern | Typical operational objects | Key questions |
|---|---|---|---|
| DevOps | Software delivery and infrastructure reliability | Services, containers, APIs, databases, networks, deployments | Is the system available, scalable, secure, and deployable? |
| MLOps | Model and data lifecycle management | Datasets, features, models, training jobs, model endpoints, drift metrics | Is the model trained, evaluated, deployed, monitored, and updated correctly? |
| AgentOps | Agent behaviour in production | Agent runs, tool calls, traces, memory, plans, policies, approvals, permissions, outcomes | Did the agent act correctly, safely, within policy, and with enough evidence to debug or govern it? |
AgentOps extends operational practice into areas that traditional DevOps and MLOps do not fully cover:
- agent trajectories and step-by-step execution traces
- tool-use monitoring and permission boundaries
- memory, context, and retrieval behaviour
- runtime policy checks and guardrails
- human approval for high-impact actions
- replay and debugging of agent failures
- evaluation of task completion, behaviour, and safety
- auditability of autonomous or semi-autonomous decisions
A useful shorthand:
- DevOps keeps the software running.
- MLOps keeps the model lifecycle controlled.
- AgentOps keeps agent behaviour observable, evaluable, constrained, and governable.
- OpenTelemetry - Vendor-neutral observability framework for traces, metrics, and logs.
- OpenLLMetry - OpenTelemetry-based instrumentation for LLM applications.
- Langfuse - Open-source LLM engineering platform with tracing, prompt management, evals, and metrics.
- Arize Phoenix - Open-source observability and evaluation for LLM applications.
- LangSmith - Platform for tracing, debugging, evaluating, and monitoring LangChain and agent applications.
- Weights & Biases Weave - Tracking and evaluation for LLM applications.
- Helicone - Open-source observability platform for LLM usage, cost, latency, and requests.
- AgentOps - Session replay, analytics, and observability for AI agents.
- OpenAI Evals - Framework and registry for evaluating language model behaviour.
- DeepEval - Evaluation framework for LLM applications with regression testing support.
- Ragas - Evaluation framework for retrieval-augmented generation and LLM pipelines.
- promptfoo - CLI and framework for testing prompts, models, and LLM applications.
- Giskard - Testing and risk scanning for AI systems.
- Inspect AI - Framework for large language model evaluations.
- Braintrust - Evaluation, logging, and prompt iteration platform for AI products.
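These frameworks differ in features, but most share the same core loop: run the agent on a fixed set of cases, score the outputs, and track the pass rate over time so regressions surface before release. A minimal stdlib-only sketch of that loop, where the agent function, case data, and scoring rule are illustrative stand-ins rather than any framework's API:

```python
# Minimal regression-eval loop: run an agent on fixed cases and score outputs.
# The agent function and the cases below are illustrative stand-ins.

def agent(task: str) -> str:
    # Stand-in for a real agent invocation.
    return task.upper()

CASES = [
    {"input": "refund order 123", "expected_substring": "REFUND"},
    {"input": "close ticket 456", "expected_substring": "TICKET"},
]

def run_evals(cases):
    results = []
    for case in cases:
        output = agent(case["input"])
        passed = case["expected_substring"] in output
        results.append({"input": case["input"], "output": output, "passed": passed})
    return results

results = run_evals(CASES)
pass_rate = sum(r["passed"] for r in results) / len(results)
```

Real frameworks add dataset versioning, LLM-as-judge scorers, and CI integration on top of this shape, but the record-per-case structure stays the same.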
- LangSmith Tracing - Trace inspection, dataset creation, and regression workflows for agent runs.
- Langfuse Tracing - Traces for LLM calls, tool calls, chains, and agent sessions.
- Phoenix Tracing - OpenTelemetry-based tracing for LLM application debugging.
- Weave Tracing - Tracing and interactive debugging for model and agent workflows.
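Whichever backend is used, agent traces reduce to a common shape: nested spans with timing and attributes for each model call, tool call, and state transition. A stdlib-only sketch of that record shape, with field names that are illustrative rather than OpenTelemetry's actual schema:

```python
# Illustrative span record for an agent run: nested, timed, attributed.
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    name: str                                      # e.g. "agent_run", "tool_call"
    attributes: dict = field(default_factory=dict)
    start: float = field(default_factory=time.monotonic)
    end: Optional[float] = None
    children: list = field(default_factory=list)

    def finish(self):
        self.end = time.monotonic()

# Record one agent step as a parent span with a nested tool-call span.
run = Span("agent_run", {"task": "summarise ticket"})
tool = Span("tool_call", {"tool.name": "search", "tool.args": "ticket 42"})
tool.finish()
run.children.append(tool)
run.finish()
```

The value of the tracing platforms above is everything around this record: collection, storage, search, visualisation, and linking spans to evals and datasets.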
- Guardrails AI - Validation and guardrail framework for LLM inputs and outputs.
- NVIDIA NeMo Guardrails - Toolkit for programmable guardrails around LLM applications.
- Llama Guard - Meta's safety classification model family for policy enforcement.
- Rebuff - Prompt injection detection and mitigation framework.
- Lakera Guard - Runtime protection for LLM applications against prompt injection and unsafe content.
- OpenAI Moderation - Content safety models and moderation patterns.
- OWASP Top 10 for LLM Applications - Security risks for LLM and agent systems.
- OWASP Agentic Security Initiative - Security work focused on agentic AI systems.
- NIST AI Risk Management Framework - Risk management framework for AI systems.
- PyRIT - Microsoft framework for red teaming generative AI systems.
- garak - LLM vulnerability scanner and red-teaming tool.
- Invariant - Testing and guardrails for agent behaviour and tool use.
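The validation frameworks above differ widely in scope, but the core runtime pattern is the same: a cheap check sits between the model output and any side effect. A stdlib-only sketch with illustrative validator rules (real guardrail products use trained classifiers and much richer policies than these heuristics):

```python
# Output guardrail sketch: run cheap checks before output reaches a user or
# tool. The rules here are illustrative heuristics, not a production policy.
import re

def validate_output(text: str) -> list:
    violations = []
    if len(text) > 2000:
        violations.append("too_long")
    if re.search(r"\b\d{13,16}\b", text):  # crude card-number heuristic
        violations.append("possible_card_number")
    return violations

safe = validate_output("Your order has shipped.")
flagged = validate_output("Card 4111111111111111 was charged.")
```

The operational point is the placement, not the rules: every path from model output to side effect should pass through a validator that can block, redact, or escalate.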
Operational topics to cover in production reviews:
- Tool permission boundaries and least privilege.
- Secret isolation and credential rotation.
- User, agent, service, and tool identity.
- Sandboxing for code execution, browser use, and filesystem access.
- Audit logs for privileged or irreversible actions.
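The first item above, tool permission boundaries with least privilege, can be made concrete as an explicit allowlist per agent role, checked before every tool call. A minimal sketch with hypothetical role and tool names:

```python
# Least-privilege tool access: each agent role gets an explicit allowlist,
# and every tool call is checked against it before execution.
# Role and tool names below are hypothetical.

TOOL_ALLOWLIST = {
    "support_agent": {"search_docs", "read_ticket"},
    "billing_agent": {"read_invoice", "issue_refund"},
}

def check_tool_access(role: str, tool: str) -> bool:
    # Deny by default: unknown roles and unlisted tools are rejected.
    return tool in TOOL_ALLOWLIST.get(role, set())
```

Deny-by-default matters: a missing entry should fail closed, and every denied call is worth logging as a potential misconfiguration or an agent probing beyond its boundary.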
- Temporal - Durable execution platform for long-running workflows, retries, and human-in-the-loop steps.
- Inngest - Durable functions and event-driven workflows for reliable background execution.
- Hatchet - Distributed task queue and workflow engine.
- HumanLayer - Human approval workflows for AI agents and tool calls.
Useful approval patterns:
- Require approval for external side effects such as sending email, spending money, merging code, or changing infrastructure.
- Store the proposed action, context, risk level, approver, and final decision.
- Make approvals replayable and auditable, not transient chat messages.
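The second and third patterns above amount to persisting a structured approval record rather than an ephemeral chat message. A stdlib-only sketch of one possible record shape, with illustrative field names:

```python
# Durable approval record: store the proposed action, context, risk level,
# approver, and decision as structured data. Field names are illustrative.
import json
from dataclasses import dataclass, asdict

@dataclass
class ApprovalRecord:
    action: str       # proposed side effect, e.g. "send_email"
    context: str      # why the agent proposed it
    risk_level: str   # e.g. "low", "medium", "high"
    approver: str     # human who decided
    decision: str     # "approved" or "rejected"

def store(record: ApprovalRecord) -> str:
    # Serialise to JSON so the decision is replayable and auditable.
    return json.dumps(asdict(record))

raw = store(ApprovalRecord(
    action="send_email",
    context="follow-up on ticket 42",
    risk_level="medium",
    approver="alice",
    decision="approved",
))
```

In production this record would be written to durable storage with a timestamp and linked to the run's trace, so an auditor can reconstruct who approved what and why.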
- LiteLLM - LLM gateway for model routing, budgets, retries, keys, and provider abstraction.
- Portkey - AI gateway for observability, caching, routing, guardrails, and reliability.
- Ray Serve - Scalable model and application serving for Python workloads.
- BentoML - Framework for building and deploying AI applications.
- Modal - Serverless infrastructure for AI and data workloads.
- Fly.io - Application runtime useful for globally deployed agent services.
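Gateways like LiteLLM and Portkey implement provider fallback internally; the underlying pattern is simple enough to sketch in stdlib Python. The provider functions below are stubs standing in for real API clients:

```python
# Provider fallback: try each model endpoint in order and return the first
# success. The providers below are stubs, not real client calls.

def primary(prompt: str) -> str:
    raise RuntimeError("provider unavailable")  # simulate an outage

def secondary(prompt: str) -> str:
    return f"answer to: {prompt}"

def complete_with_fallback(prompt: str, providers) -> str:
    last_error = None
    for provider in providers:
        try:
            return provider(prompt)
        except Exception as err:
            last_error = err  # record the failure and try the next provider
    raise RuntimeError("all providers failed") from last_error

result = complete_with_fallback("hello", [primary, secondary])
```

Gateways add the parts worth buying rather than building: retries with backoff, budgets, key management, and per-provider telemetry so fallback rate itself becomes a monitored signal.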
Major cloud platforms are beginning to expose AgentOps capabilities through managed agent runtimes, observability tools, tracing, evaluations, guardrails, identity controls, and governance features.
This section tracks cloud-native services that help teams build, deploy, monitor, evaluate, secure, and govern AI agents in production.
- Microsoft Foundry Agent Service - Managed platform for building, deploying, and scaling AI agents across prompt agents, workflow agents, and hosted agents.
- Microsoft Foundry Control Plane - Centralised management and observability for agent inventory, agent health, and lifecycle operations.
- Microsoft Foundry Playgrounds - Agent development environment with tracing and evaluation data for agent responses.
- Azure AI Agent Design Patterns - Architecture guidance for multi-agent orchestration patterns.
- Azure AI Agent Adoption Process - Organisational guidance for building agents consistently and securely.
Operational capabilities to track:
- agent inventory and lifecycle management
- managed agent hosting
- tracing and evaluation
- multi-agent orchestration patterns
- security and governance controls
- organisational adoption processes
- Gemini Enterprise Agent Platform - Unified platform for building, deploying, governing, and optimising enterprise-grade AI agents.
- Scale your agents - Production guidance for deploying, managing, tracing, logging, monitoring, and scaling agents.
- Agent Platform Runtime - Managed runtime services for deploying, managing, and scaling AI agents in production.
- Agent Development Kit - Open-source framework for building and orchestrating agents.
- Agent identity and IAM - Identity and access management patterns for deployed agents.
Operational capabilities to track:
- managed serverless agent runtime
- tracing, logging, monitoring, and alerts
- IAM-based agent identity
- session and memory management
- secure connectivity through Agent Gateway
- production scaling controls
- Amazon Bedrock AgentCore - Managed service for deploying and operating AI agents securely at scale across frameworks and models.
- Amazon Bedrock AgentCore Observability - Production observability for tracing, debugging, monitoring, and investigating agent performance.
- Amazon Bedrock AgentCore Identity - Identity and credential management for agent applications and automated workloads.
- Amazon Bedrock AgentCore Memory - Managed memory for agent applications that need session context, user preferences, and longer-running continuity.
- Amazon Bedrock Agents - Managed agent capability for orchestrating foundation models, knowledge bases, APIs, and user interactions.
- Amazon Bedrock Agent Traces - Step-by-step traces for understanding agent orchestration and behaviour.
- Amazon Bedrock Observability - Observability guidance for tracking performance, resources, and operational behaviour.
- Monitor Amazon Bedrock Agents with CloudWatch - Runtime metrics for monitoring agent invocations and performance.
- Amazon Bedrock Guardrails - Configurable safeguards for generative AI applications and agent workflows.
- Amazon Bedrock Security, Guardrails, and Observability - Security and compliance guidance for Bedrock-based systems.
Operational capabilities to track:
- agent orchestration and API action execution
- traces for agent reasoning and tool use
- CloudWatch metrics and logs
- CloudTrail auditability
- guardrails and policy enforcement
- knowledge base and runtime monitoring
When evaluating cloud AgentOps capabilities, compare:
| Capability | What to inspect |
|---|---|
| Managed runtime | Can agents be hosted, scaled, isolated, and versioned in production? |
| Tracing | Can teams inspect agent steps, tool calls, retrieval, memory, and reasoning paths? |
| Evaluation | Can outputs, trajectories, and task outcomes be evaluated continuously? |
| Identity | Can agents have distinct identities, permissions, and credential boundaries? |
| Guardrails | Can runtime policy, safety, and action constraints be enforced? |
| Monitoring | Can cost, latency, errors, usage, token consumption, and reliability be tracked? |
| Governance | Can teams audit lifecycle, approvals, access, incidents, and compliance evidence? |
| Portability | Can agents use external frameworks, APIs, tools, and model providers without deep lock-in? |
- OpenCost - Open-source cost monitoring for Kubernetes infrastructure.
- Grafana - Dashboards and alerting for metrics, logs, and traces.
- Prometheus - Metrics and alerting toolkit.
- Sentry - Application error monitoring and performance tracing.
- Vercel AI Gateway - Gateway for model routing, observability, and usage controls.
Agent-specific signals worth tracking:
- Task success rate and policy violation rate.
- Tool call count, tool error rate, and tool latency.
- Model fallback rate and retry rate.
- Cost per task, cost per successful task, and cost per user.
- Human escalation rate and approval rejection rate.
- Context length, memory growth, and retrieval quality.
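Several of these signals fall out of simple aggregation over per-run records. A sketch computing task success rate and cost per successful task from hypothetical run records (the record fields are illustrative, not a real telemetry schema):

```python
# Aggregate agent-level signals from per-run records.
# The records and field names below are illustrative assumptions.

runs = [
    {"success": True,  "cost_usd": 0.04, "tool_errors": 0},
    {"success": True,  "cost_usd": 0.06, "tool_errors": 1},
    {"success": False, "cost_usd": 0.10, "tool_errors": 3},
]

task_success_rate = sum(r["success"] for r in runs) / len(runs)

# Cost per successful task charges failed runs against the successes,
# which is why it is a stricter signal than plain cost per task.
successful = [r for r in runs if r["success"]]
cost_per_successful_task = sum(r["cost_usd"] for r in runs) / len(successful)
```

The point of cost per successful task is that failed runs still cost money; dividing total spend by successes only makes reliability regressions show up in the cost dashboard.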
- PagerDuty Incident Response - Practical incident response concepts and lifecycle.
- Google SRE Book - Foundational reliability practices.
- NIST AI RMF Playbook - Practical guide for applying the NIST AI Risk Management Framework.
- Partnership on AI: AI Incident Database - Database of AI-related incidents and harms.
Agent incident checklist:
- Preserve traces, prompts, tool inputs, tool outputs, retrieved context, and approval records.
- Identify whether the failure came from model behaviour, retrieval, tool execution, policy, permissions, or orchestration.
- Reproduce the run with the same inputs where possible.
- Add regression evals before changing prompts, tools, or policies.
- Record user impact, safety impact, cost impact, and data exposure.
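The first checklist item can be automated at incident time by snapshotting the run's artifacts into a single tamper-evident record. A minimal sketch, where the artifact fields are illustrative:

```python
# Freeze a run's artifacts as canonical JSON plus a content hash, so the
# evidence can later be verified as unmodified. Fields are illustrative.
import hashlib
import json

def preserve_evidence(run: dict) -> dict:
    payload = json.dumps(run, sort_keys=True)  # canonical ordering
    return {
        "payload": payload,
        "sha256": hashlib.sha256(payload.encode()).hexdigest(),
    }

evidence = preserve_evidence({
    "trace_id": "run-42",
    "prompt": "close stale tickets",
    "tool_calls": [{"tool": "close_ticket", "input": "123"}],
    "retrieved_context": [],
    "approvals": [],
})
```

Writing this record to append-only storage at detection time, before anyone starts changing prompts or policies, is what makes the later root-cause analysis evidence-based rather than reconstructed from memory.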
- AutoGen - Framework for building multi-agent AI applications.
- CrewAI - Framework for orchestrating role-based AI agents.
- LangGraph - Framework for stateful, controllable agent workflows.
- Semantic Kernel - SDK for building agents and AI orchestration into applications.
- OpenAI Swarm - Educational framework for lightweight multi-agent orchestration.
Operational concerns for multi-agent systems:
- Shared state ownership and conflict resolution.
- Message visibility, routing, and provenance.
- Tool access per agent role.
- Runaway loops, deadlocks, and duplicated work.
- Evaluation at both individual-agent and system levels.
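Runaway loops in particular are cheap to guard against with a hard step budget on the coordination loop. A sketch under stated assumptions: the agent-step function is a stub that converges, which real agent systems may not do without exactly this kind of budget:

```python
# Guard against runaway multi-agent loops with a hard step budget.
# agent_step is a stub standing in for one round of agent coordination.

def agent_step(state: dict) -> dict:
    state["steps"] = state.get("steps", 0) + 1
    state["done"] = state["steps"] >= 3  # this stub converges; real agents may not
    return state

def run_with_budget(state: dict, max_steps: int = 10) -> dict:
    for _ in range(max_steps):
        state = agent_step(state)
        if state.get("done"):
            return state
    # Budget exhausted: stop and surface to a human instead of spinning.
    raise RuntimeError(f"step budget of {max_steps} exhausted")

final = run_with_budget({})
```

Budget exhaustion should be treated as an escalation signal, not silently retried: a loop that hits its ceiling is evidence of a deadlock, duplicated work, or a task the system cannot complete.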
- Model Context Protocol - Protocol for connecting AI applications to tools and data sources.
- OpenAPI - Standard for describing HTTP APIs exposed to agents.
- AsyncAPI - Standard for event-driven API definitions.
- CloudEvents - Specification for event data interoperability.
- OpenTelemetry Semantic Conventions - Shared conventions for telemetry data.
- ReAct: Synergizing Reasoning and Acting in Language Models - Introduced reasoning plus acting patterns used by many agents.
- Toolformer - Research on language models learning to use external tools.
- Voyager - Example of lifelong learning and skill acquisition in an embodied agent setting.
- SWE-bench - Benchmark for evaluating agents on real software engineering issues.
- AgentBench - Benchmark for evaluating LLMs as agents across environments.
- AI Incident Database - Public incident database useful for governance and risk analysis.
Thrilled to have you here. Contributions of every size are welcome: a quick typo fix, a fresh resource, a doc polish, or a sweeping overhaul all help this list grow.
📝 Read the contributing guide · 🐛 good first issues
MIT. See LICENSE.
