A curated guide to operating AI agents in production: observability, evaluation, tracing, guardrails, deployment, security, governance, and incident response.
AgentOps is the production operating layer for AI agents. It is broader than agent monitoring: it covers how teams ship, observe, evaluate, debug, secure, control, and improve autonomous or semi-autonomous AI systems after they leave a demo environment.
This list is intentionally focused on engineering resources for production agents. It favours tools, papers, patterns, and references that help teams answer operational questions:
- What did the agent do, and why?
- Did it complete the task correctly, safely, and within policy?
- Can we replay, debug, and evaluate failures?
- Can humans approve, interrupt, or override high-impact actions?
- Can we manage cost, latency, secrets, identity, and permissions?
- Can we respond to agent incidents with evidence instead of guesswork?
- Scope
- Conceptual Map
- AgentOps vs DevOps vs MLOps
- Observability and Tracing
- Evaluation and Testing
- Replay and Debugging
- Guardrails and Runtime Controls
- Security, Identity, and Access Control
- Human Approval and Workflow Control
- Deployment and Runtime Infrastructure
- Cloud AgentOps Platforms
- Cost, Latency, and Reliability
- Incident Response and Governance
- Multi-Agent Operations
- Standards and Protocols
- Research and References
- Contributing
Included:
- Production agent observability, tracing, evaluation, testing, and replay.
- Runtime guardrails, policy checks, approvals, and control planes.
- Agent security, identity, permissions, sandboxing, and secret handling.
- Deployment, reliability, cost, latency, and incident response practices.
- Multi-agent coordination and operational failure modes.
Not included by default:
- Generic LLM chat apps without an operational angle.
- Prompt collections with no evaluation or production discipline.
- Broad AI tool directories that do not focus on running agents.
- Vendor marketing pages without useful engineering substance.
AgentOps spans the full lifecycle of an agent system:
- Design: define tasks, tools, permissions, policies, and success criteria.
- Build: instrument traces, model calls, tool calls, memory, and state transitions.
- Test: run unit tests, scenario tests, adversarial tests, and regression evals.
- Deploy: manage environments, secrets, rollout, rate limits, and fallbacks.
- Operate: monitor correctness, safety, cost, latency, drift, and incidents.
- Improve: replay failures, evaluate fixes, update policies, and govern change.
AgentOps overlaps with DevOps and MLOps, but it is not the same discipline.
DevOps focuses on shipping and operating software systems: infrastructure, CI/CD, deployment, monitoring, reliability, and incident response.
MLOps focuses on the machine learning lifecycle: datasets, training pipelines, model registries, model deployment, drift, performance monitoring, and retraining.
AgentOps focuses on the operational behaviour of AI agents after they are given goals, tools, memory, context, permissions, workflows, and the ability to take actions.
| Discipline | Primary concern | Typical operational objects | Key questions |
|---|---|---|---|
| DevOps | Software delivery and infrastructure reliability | Services, containers, APIs, databases, networks, deployments | Is the system available, scalable, secure, and deployable? |
| MLOps | Model and data lifecycle management | Datasets, features, models, training jobs, model endpoints, drift metrics | Is the model trained, evaluated, deployed, monitored, and updated correctly? |
| AgentOps | Agent behaviour in production | Agent runs, tool calls, traces, memory, plans, policies, approvals, permissions, outcomes | Did the agent act correctly, safely, within policy, and with enough evidence to debug or govern it? |
AgentOps extends operational practice into areas that traditional DevOps and MLOps do not fully cover:
- agent trajectories and step-by-step execution traces
- tool-use monitoring and permission boundaries
- memory, context, and retrieval behaviour
- runtime policy checks and guardrails
- human approval for high-impact actions
- replay and debugging of agent failures
- evaluation of task completion, behaviour, and safety
- auditability of autonomous or semi-autonomous decisions
A useful shorthand:
- DevOps keeps the software running.
- MLOps keeps the model lifecycle controlled.
- AgentOps keeps agent behaviour observable, evaluable, constrained, and governable.
- OpenTelemetry - Vendor-neutral observability framework for traces, metrics, and logs.
- OpenLLMetry - OpenTelemetry-based instrumentation for LLM applications.
- Langfuse - Open-source LLM engineering platform with tracing, prompt management, evals, and metrics.
- Arize Phoenix - Open-source observability and evaluation for LLM applications.
- LangSmith - Platform for tracing, debugging, evaluating, and monitoring LangChain and agent applications.
- Weights & Biases Weave - Tracking and evaluation for LLM applications.
- Helicone - Open-source observability platform for LLM usage, cost, latency, and requests.
- AgentOps - Session replay, analytics, and observability for AI agents.
- OpenAI Evals - Framework and registry for evaluating language model behaviour.
- DeepEval - Evaluation framework for LLM applications with regression testing support.
- Ragas - Evaluation framework for retrieval-augmented generation and LLM pipelines.
- promptfoo - CLI and framework for testing prompts, models, and LLM applications.
- Giskard - Testing and risk scanning for AI systems.
- Inspect AI - Framework for large language model evaluations.
- Braintrust - Evaluation, logging, and prompt iteration platform for AI products.
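These frameworks differ in features, but most share the same core loop: run the agent on a fixed set of cases, score the outputs, and track the pass rate over time so regressions surface before release. A minimal stdlib-only sketch of that loop, where the agent function, case data, and scoring rule are illustrative stand-ins rather than any framework's API:

```python
# Minimal regression-eval loop: run an agent on fixed cases and score outputs.
# The agent function and the cases below are illustrative stand-ins.

def agent(task: str) -> str:
    # Stand-in for a real agent invocation.
    return task.upper()

CASES = [
    {"input": "refund order 123", "expected_substring": "REFUND"},
    {"input": "close ticket 456", "expected_substring": "TICKET"},
]

def run_evals(cases):
    results = []
    for case in cases:
        output = agent(case["input"])
        passed = case["expected_substring"] in output
        results.append({"input": case["input"], "output": output, "passed": passed})
    return results

results = run_evals(CASES)
pass_rate = sum(r["passed"] for r in results) / len(results)
```

Real frameworks add dataset versioning, LLM-as-judge scorers, and CI integration on top of this shape, but the record-per-case structure stays the same.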
- LangSmith Tracing - Trace inspection, dataset creation, and regression workflows for agent runs.
- Langfuse Tracing - Traces for LLM calls, tool calls, chains, and agent sessions.
- Phoenix Tracing - OpenTelemetry-based tracing for LLM application debugging.
- Weave Tracing - Tracing and interactive debugging for model and agent workflows.
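Whichever backend is used, agent traces reduce to a common shape: nested spans with timing and attributes for each model call, tool call, and state transition. A stdlib-only sketch of that record shape, with field names that are illustrative rather than OpenTelemetry's actual schema:

```python
# Illustrative span record for an agent run: nested, timed, attributed.
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    name: str                                      # e.g. "agent_run", "tool_call"
    attributes: dict = field(default_factory=dict)
    start: float = field(default_factory=time.monotonic)
    end: Optional[float] = None
    children: list = field(default_factory=list)

    def finish(self):
        self.end = time.monotonic()

# Record one agent step as a parent span with a nested tool-call span.
run = Span("agent_run", {"task": "summarise ticket"})
tool = Span("tool_call", {"tool.name": "search", "tool.args": "ticket 42"})
tool.finish()
run.children.append(tool)
run.finish()
```

The value of the tracing platforms above is everything around this record: collection, storage, search, visualisation, and linking spans to evals and datasets.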
- Guardrails AI - Validation and guardrail framework for LLM inputs and outputs.
- NVIDIA NeMo Guardrails - Toolkit for programmable guardrails around LLM applications.
- Llama Guard - Meta's safety classification model family for policy enforcement.
- Rebuff - Prompt injection detection and mitigation framework.
- Lakera Guard - Runtime protection for LLM applications against prompt injection and unsafe content.
- OpenAI Moderation - Content safety models and moderation patterns.
- OWASP Top 10 for LLM Applications - Security risks for LLM and agent systems.
- OWASP Agentic Security Initiative - Security work focused on agentic AI systems.
- NIST AI Risk Management Framework - Risk management framework for AI systems.
- PyRIT - Microsoft framework for red teaming generative AI systems.
- garak - LLM vulnerability scanner and red-teaming tool.
- Invariant - Testing and guardrails for agent behaviour and tool use.
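The validation frameworks above differ widely in scope, but the core runtime pattern is the same: a cheap check sits between the model output and any side effect. A stdlib-only sketch with illustrative validator rules (real guardrail products use trained classifiers and much richer policies than these heuristics):

```python
# Output guardrail sketch: run cheap checks before output reaches a user or
# tool. The rules here are illustrative heuristics, not a production policy.
import re

def validate_output(text: str) -> list:
    violations = []
    if len(text) > 2000:
        violations.append("too_long")
    if re.search(r"\b\d{13,16}\b", text):  # crude card-number heuristic
        violations.append("possible_card_number")
    return violations

safe = validate_output("Your order has shipped.")
flagged = validate_output("Card 4111111111111111 was charged.")
```

The operational point is the placement, not the rules: every path from model output to side effect should pass through a validator that can block, redact, or escalate.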
Operational topics to cover in production reviews:
- Tool permission boundaries and least privilege.
- Secret isolation and credential rotation.
- User, agent, service, and tool identity.
- Sandboxing for code execution, browser use, and filesystem access.
- Audit logs for privileged or irreversible actions.
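The first item above, tool permission boundaries with least privilege, can be made concrete as an explicit allowlist per agent role, checked before every tool call. A minimal sketch with hypothetical role and tool names:

```python
# Least-privilege tool access: each agent role gets an explicit allowlist,
# and every tool call is checked against it before execution.
# Role and tool names below are hypothetical.

TOOL_ALLOWLIST = {
    "support_agent": {"search_docs", "read_ticket"},
    "billing_agent": {"read_invoice", "issue_refund"},
}

def check_tool_access(role: str, tool: str) -> bool:
    # Deny by default: unknown roles and unlisted tools are rejected.
    return tool in TOOL_ALLOWLIST.get(role, set())
```

Deny-by-default matters: a missing entry should fail closed, and every denied call is worth logging as a potential misconfiguration or an agent probing beyond its boundary.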
- Temporal - Durable execution platform for long-running workflows, retries, and human-in-the-loop steps.
- Inngest - Durable functions and event-driven workflows for reliable background execution.
- Hatchet - Distributed task queue and workflow engine.
- HumanLayer - Human approval workflows for AI agents and tool calls.
Useful approval patterns:
- Require approval for external side effects such as sending email, spending money, merging code, or changing infrastructure.
- Store the proposed action, context, risk level, approver, and final decision.
- Make approvals replayable and auditable, not transient chat messages.
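The second and third patterns above amount to persisting a structured approval record rather than an ephemeral chat message. A stdlib-only sketch of one possible record shape, with illustrative field names:

```python
# Durable approval record: store the proposed action, context, risk level,
# approver, and decision as structured data. Field names are illustrative.
import json
from dataclasses import dataclass, asdict

@dataclass
class ApprovalRecord:
    action: str       # proposed side effect, e.g. "send_email"
    context: str      # why the agent proposed it
    risk_level: str   # e.g. "low", "medium", "high"
    approver: str     # human who decided
    decision: str     # "approved" or "rejected"

def store(record: ApprovalRecord) -> str:
    # Serialise to JSON so the decision is replayable and auditable.
    return json.dumps(asdict(record))

raw = store(ApprovalRecord(
    action="send_email",
    context="follow-up on ticket 42",
    risk_level="medium",
    approver="alice",
    decision="approved",
))
```

In production this record would be written to durable storage with a timestamp and linked to the run's trace, so an auditor can reconstruct who approved what and why.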
- LiteLLM - LLM gateway for model routing, budgets, retries, keys, and provider abstraction.
- Portkey - AI gateway for observability, caching, routing, guardrails, and reliability.
- Ray Serve - Scalable model and application serving for Python workloads.
- BentoML - Framework for building and deploying AI applications.
- Modal - Serverless infrastructure for AI and data workloads.
- Fly.io - Application runtime useful for globally deployed agent services.
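Gateways like LiteLLM and Portkey implement provider fallback internally; the underlying pattern is simple enough to sketch in stdlib Python. The provider functions below are stubs standing in for real API clients:

```python
# Provider fallback: try each model endpoint in order and return the first
# success. The providers below are stubs, not real client calls.

def primary(prompt: str) -> str:
    raise RuntimeError("provider unavailable")  # simulate an outage

def secondary(prompt: str) -> str:
    return f"answer to: {prompt}"

def complete_with_fallback(prompt: str, providers) -> str:
    last_error = None
    for provider in providers:
        try:
            return provider(prompt)
        except Exception as err:
            last_error = err  # record the failure and try the next provider
    raise RuntimeError("all providers failed") from last_error

result = complete_with_fallback("hello", [primary, secondary])
```

Gateways add the parts worth buying rather than building: retries with backoff, budgets, key management, and per-provider telemetry so fallback rate itself becomes a monitored signal.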
Major cloud platforms are beginning to expose AgentOps capabilities through managed agent runtimes, observability tools, tracing, evaluations, guardrails, identity controls, and governance features.
This section tracks cloud-native services that help teams build, deploy, monitor, evaluate, secure, and govern AI agents in production.
- Microsoft Foundry Agent Service - Managed platform for building, deploying, and scaling AI agents across prompt agents, workflow agents, and hosted agents.
- Microsoft Foundry Control Plane - Centralised management and observability for agent inventory, agent health, and lifecycle operations.
- Microsoft Foundry Playgrounds - Agent development environment with tracing and evaluation data for agent responses.
- Azure AI Agent Design Patterns - Architecture guidance for multi-agent orchestration patterns.
- Azure AI Agent Adoption Process - Organisational guidance for building agents consistently and securely.
Operational capabilities to track:
- agent inventory and lifecycle management
- managed agent hosting
- tracing and evaluation
- multi-agent orchestration patterns
- security and governance controls
- organisational adoption processes
- Gemini Enterprise Agent Platform - Unified platform for building, deploying, governing, and optimising enterprise-grade AI agents.
- Scale your agents - Production guidance for deploying, managing, tracing, logging, monitoring, and scaling agents.
- Agent Platform Runtime - Managed runtime services for deploying, managing, and scaling AI agents in production.
- Agent Development Kit - Open-source framework for building and orchestrating agents.
- Agent identity and IAM - Identity and access management patterns for deployed agents.
Operational capabilities to track:
- managed serverless agent runtime
- tracing, logging, monitoring, and alerts
- IAM-based agent identity
- session and memory management
- secure connectivity through Agent Gateway
- production scaling controls
- Amazon Bedrock AgentCore - Managed service for deploying and operating AI agents securely at scale across frameworks and models.
- Amazon Bedrock AgentCore Observability - Production observability for tracing, debugging, monitoring, and investigating agent performance.
- Amazon Bedrock AgentCore Identity - Identity and credential management for agent applications and automated workloads.
- Amazon Bedrock AgentCore Memory - Managed memory for agent applications that need session context, user preferences, and longer-running continuity.
- Amazon Bedrock Agents - Managed agent capability for orchestrating foundation models, knowledge bases, APIs, and user interactions.
- Amazon Bedrock Agent Traces - Step-by-step traces for understanding agent orchestration and behaviour.
- Amazon Bedrock Observability - Observability guidance for tracking performance, resources, and operational behaviour.
- Monitor Amazon Bedrock Agents with CloudWatch - Runtime metrics for monitoring agent invocations and performance.
- Amazon Bedrock Guardrails - Configurable safeguards for generative AI applications and agent workflows.
- Amazon Bedrock Security, Guardrails, and Observability - Security and compliance guidance for Bedrock-based systems.
Operational capabilities to track:
- agent orchestration and API action execution
- traces for agent reasoning and tool use
- CloudWatch metrics and logs
- CloudTrail auditability
- guardrails and policy enforcement
- knowledge base and runtime monitoring
When evaluating cloud AgentOps capabilities, compare:
| Capability | What to inspect |
|---|---|
| Managed runtime | Can agents be hosted, scaled, isolated, and versioned in production? |
| Tracing | Can teams inspect agent steps, tool calls, retrieval, memory, and reasoning paths? |
| Evaluation | Can outputs, trajectories, and task outcomes be evaluated continuously? |
| Identity | Can agents have distinct identities, permissions, and credential boundaries? |
| Guardrails | Can runtime policy, safety, and action constraints be enforced? |
| Monitoring | Can cost, latency, errors, usage, token consumption, and reliability be tracked? |
| Governance | Can teams audit lifecycle, approvals, access, incidents, and compliance evidence? |
| Portability | Can agents use external frameworks, APIs, tools, and model providers without deep lock-in? |
- OpenCost - Open-source cost monitoring for Kubernetes infrastructure.
- Grafana - Dashboards and alerting for metrics, logs, and traces.
- Prometheus - Metrics and alerting toolkit.
- Sentry - Application error monitoring and performance tracing.
- Vercel AI Gateway - Gateway for model routing, observability, and usage controls.
Agent-specific signals worth tracking:
- Task success rate and policy violation rate.
- Tool call count, tool error rate, and tool latency.
- Model fallback rate and retry rate.
- Cost per task, cost per successful task, and cost per user.
- Human escalation rate and approval rejection rate.
- Context length, memory growth, and retrieval quality.
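Several of these signals fall out of simple aggregation over per-run records. A sketch computing task success rate and cost per successful task from hypothetical run records (the record fields are illustrative, not a real telemetry schema):

```python
# Aggregate agent-level signals from per-run records.
# The records and field names below are illustrative assumptions.

runs = [
    {"success": True,  "cost_usd": 0.04, "tool_errors": 0},
    {"success": True,  "cost_usd": 0.06, "tool_errors": 1},
    {"success": False, "cost_usd": 0.10, "tool_errors": 3},
]

task_success_rate = sum(r["success"] for r in runs) / len(runs)

# Cost per successful task charges failed runs against the successes,
# which is why it is a stricter signal than plain cost per task.
successful = [r for r in runs if r["success"]]
cost_per_successful_task = sum(r["cost_usd"] for r in runs) / len(successful)
```

The point of cost per successful task is that failed runs still cost money; dividing total spend by successes only makes reliability regressions show up in the cost dashboard.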
- PagerDuty Incident Response - Practical incident response concepts and lifecycle.
- Google SRE Book - Foundational reliability practices.
- NIST AI RMF Playbook - Practical guide for applying the NIST AI Risk Management Framework.
- Partnership on AI: AI Incident Database - Database of AI-related incidents and harms.
Agent incident checklist:
- Preserve traces, prompts, tool inputs, tool outputs, retrieved context, and approval records.
- Identify whether the failure came from model behaviour, retrieval, tool execution, policy, permissions, or orchestration.
- Reproduce the run with the same inputs where possible.
- Add regression evals before changing prompts, tools, or policies.
- Record user impact, safety impact, cost impact, and data exposure.
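The first checklist item can be automated at incident time by snapshotting the run's artifacts into a single tamper-evident record. A minimal sketch, where the artifact fields are illustrative:

```python
# Freeze a run's artifacts as canonical JSON plus a content hash, so the
# evidence can later be verified as unmodified. Fields are illustrative.
import hashlib
import json

def preserve_evidence(run: dict) -> dict:
    payload = json.dumps(run, sort_keys=True)  # canonical ordering
    return {
        "payload": payload,
        "sha256": hashlib.sha256(payload.encode()).hexdigest(),
    }

evidence = preserve_evidence({
    "trace_id": "run-42",
    "prompt": "close stale tickets",
    "tool_calls": [{"tool": "close_ticket", "input": "123"}],
    "retrieved_context": [],
    "approvals": [],
})
```

Writing this record to append-only storage at detection time, before anyone starts changing prompts or policies, is what makes the later root-cause analysis evidence-based rather than reconstructed from memory.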
- AutoGen - Framework for building multi-agent AI applications.
- CrewAI - Framework for orchestrating role-based AI agents.
- LangGraph - Framework for stateful, controllable agent workflows.
- Semantic Kernel - SDK for building agents and AI orchestration into applications.
- OpenAI Swarm - Educational framework for lightweight multi-agent orchestration.
Operational concerns for multi-agent systems:
- Shared state ownership and conflict resolution.
- Message visibility, routing, and provenance.
- Tool access per agent role.
- Runaway loops, deadlocks, and duplicated work.
- Evaluation at both individual-agent and system levels.
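Runaway loops in particular are cheap to guard against with a hard step budget on the coordination loop. A sketch under stated assumptions: the agent-step function is a stub that converges, which real agent systems may not do without exactly this kind of budget:

```python
# Guard against runaway multi-agent loops with a hard step budget.
# agent_step is a stub standing in for one round of agent coordination.

def agent_step(state: dict) -> dict:
    state["steps"] = state.get("steps", 0) + 1
    state["done"] = state["steps"] >= 3  # this stub converges; real agents may not
    return state

def run_with_budget(state: dict, max_steps: int = 10) -> dict:
    for _ in range(max_steps):
        state = agent_step(state)
        if state.get("done"):
            return state
    # Budget exhausted: stop and surface to a human instead of spinning.
    raise RuntimeError(f"step budget of {max_steps} exhausted")

final = run_with_budget({})
```

Budget exhaustion should be treated as an escalation signal, not silently retried: a loop that hits its ceiling is evidence of a deadlock, duplicated work, or a task the system cannot complete.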
- Model Context Protocol - Protocol for connecting AI applications to tools and data sources.
- OpenAPI - Standard for describing HTTP APIs exposed to agents.
- AsyncAPI - Standard for event-driven API definitions.
- CloudEvents - Specification for event data interoperability.
- OpenTelemetry Semantic Conventions - Shared conventions for telemetry data.
- ReAct: Synergizing Reasoning and Acting in Language Models - Introduced reasoning plus acting patterns used by many agents.
- Toolformer - Research on language models learning to use external tools.
- Voyager - Example of lifelong learning and skill acquisition in an embodied agent setting.
- SWE-bench - Benchmark for evaluating agents on real software engineering issues.
- AgentBench - Benchmark for evaluating LLMs as agents across environments.
- AI Incident Database - Public incident database useful for governance and risk analysis.
Thrilled to have you here. Contributions of every size are welcome: a quick typo fix, a fresh resource, a doc polish, or a sweeping overhaul all help this list grow.
📝 Read the contributing guide · 🐛 good first issues
MIT. See LICENSE.
