Awesome AgentOps

A curated guide to operating AI agents in production: observability, evaluation, tracing, guardrails, deployment, security, governance, and incident response.

Badges: Awesome · License: MIT · PRs Welcome · Links Checked · Focus: AgentOps · Status: v0.1

AgentOps is the production operating layer for AI agents. It is broader than agent monitoring: it covers how teams ship, observe, evaluate, debug, secure, control, and improve autonomous or semi-autonomous AI systems after they leave a demo environment.

This list is intentionally focused on engineering resources for production agents. It favors tools, papers, patterns, and references that help teams answer operational questions:

  • What did the agent do, and why?
  • Did it complete the task correctly, safely, and within policy?
  • Can we replay, debug, and evaluate failures?
  • Can humans approve, interrupt, or override high-impact actions?
  • Can we manage cost, latency, secrets, identity, and permissions?
  • Can we respond to agent incidents with evidence instead of guesswork?

Contents

  • Scope
  • Conceptual Map
  • AgentOps vs DevOps vs MLOps
  • Observability and Tracing
  • Evaluation and Testing
  • Replay and Debugging
  • Guardrails and Runtime Controls
  • Security, Identity, and Access Control
  • Human Approval and Workflow Control
  • Deployment and Runtime Infrastructure
  • Cloud AgentOps Platforms
  • Cost, Latency, and Reliability
  • Incident Response and Governance
  • Multi-Agent Operations
  • Standards and Protocols
  • Research and References
  • Contributing
  • License

Scope

Included:

  • Production agent observability, tracing, evaluation, testing, and replay.
  • Runtime guardrails, policy checks, approvals, and control planes.
  • Agent security, identity, permissions, sandboxing, and secret handling.
  • Deployment, reliability, cost, latency, and incident response practices.
  • Multi-agent coordination and operational failure modes.

Not included by default:

  • Generic LLM chat apps without an operational angle.
  • Prompt collections with no evaluation or production discipline.
  • Broad AI tool directories that do not focus on running agents.
  • Vendor marketing pages without useful engineering substance.

Conceptual Map

AgentOps spans the full lifecycle of an agent system:

  1. Design: define tasks, tools, permissions, policies, and success criteria.
  2. Build: instrument traces, model calls, tool calls, memory, and state transitions.
  3. Test: run unit tests, scenario tests, adversarial tests, and regression evals.
  4. Deploy: manage environments, secrets, rollout, rate limits, and fallbacks.
  5. Operate: monitor correctness, safety, cost, latency, drift, and incidents.
  6. Improve: replay failures, evaluate fixes, update policies, and govern change.

AgentOps vs DevOps vs MLOps

AgentOps overlaps with DevOps and MLOps, but it is not the same discipline.

DevOps focuses on shipping and operating software systems: infrastructure, CI/CD, deployment, monitoring, reliability, and incident response.

MLOps focuses on the machine learning lifecycle: datasets, training pipelines, model registries, model deployment, drift, performance monitoring, and retraining.

AgentOps focuses on the operational behaviour of AI agents after they are given goals, tools, memory, context, permissions, workflows, and the ability to take actions.

| Discipline | Primary concern | Typical operational objects | Key questions |
| --- | --- | --- | --- |
| DevOps | Software delivery and infrastructure reliability | Services, containers, APIs, databases, networks, deployments | Is the system available, scalable, secure, and deployable? |
| MLOps | Model and data lifecycle management | Datasets, features, models, training jobs, model endpoints, drift metrics | Is the model trained, evaluated, deployed, monitored, and updated correctly? |
| AgentOps | Agent behaviour in production | Agent runs, tool calls, traces, memory, plans, policies, approvals, permissions, outcomes | Did the agent act correctly, safely, within policy, and with enough evidence to debug or govern it? |

AgentOps extends operational practice into areas that traditional DevOps and MLOps do not fully cover:

  • agent trajectories and step-by-step execution traces
  • tool-use monitoring and permission boundaries
  • memory, context, and retrieval behaviour
  • runtime policy checks and guardrails
  • human approval for high-impact actions
  • replay and debugging of agent failures
  • evaluation of task completion, behaviour, and safety
  • auditability of autonomous or semi-autonomous decisions

A useful shorthand:

DevOps keeps the software running.
MLOps keeps the model lifecycle controlled.
AgentOps keeps agent behaviour observable, evaluable, constrained, and governable.

Observability and Tracing

  • OpenTelemetry - Vendor-neutral observability framework for traces, metrics, and logs.
  • OpenLLMetry - OpenTelemetry-based instrumentation for LLM applications.
  • Langfuse - Open-source LLM engineering platform with tracing, prompt management, evals, and metrics.
  • Arize Phoenix - Open-source observability and evaluation for LLM applications.
  • LangSmith - Platform for tracing, debugging, evaluating, and monitoring LangChain and agent applications.
  • Weights & Biases Weave - Tracking and evaluation for LLM applications.
  • Helicone - Open-source observability platform for LLM usage, cost, latency, and requests.
  • AgentOps - Session replay, analytics, and observability for AI agents.
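
Most of the platforms above can ingest OpenTelemetry data, so a vendor-neutral starting point is to wrap each agent step in a span. A minimal sketch, assuming a tracer provider and exporter are configured elsewhere; `execute_tool` and the attribute names are illustrative conventions, not a required schema:

```python
# Minimal sketch: wrap one agent tool call in an OpenTelemetry span so it
# shows up in whichever tracing backend you export to. Requires a configured
# TracerProvider; without one this runs as a no-op.
from opentelemetry import trace

tracer = trace.get_tracer("my-agent")  # instrumentation scope name is arbitrary

def call_tool(run_id: str, tool_name: str, arguments: dict) -> dict:
    with tracer.start_as_current_span("agent.tool_call") as span:
        span.set_attribute("agent.run_id", run_id)
        span.set_attribute("tool.name", tool_name)
        span.set_attribute("tool.arguments", str(arguments))
        try:
            result = execute_tool(tool_name, arguments)  # hypothetical tool dispatcher
            span.set_attribute("tool.status", "ok")
            return result
        except Exception as exc:
            span.record_exception(exc)
            span.set_attribute("tool.status", "error")
            raise
```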

Evaluation and Testing

  • OpenAI Evals - Framework and registry for evaluating language model behaviour.
  • DeepEval - Evaluation framework for LLM applications with regression testing support.
  • Ragas - Evaluation framework for retrieval-augmented generation and LLM pipelines.
  • promptfoo - CLI and framework for testing prompts, models, and LLM applications.
  • Giskard - Testing and risk scanning for AI systems.
  • Inspect AI - Framework for large language model evaluations.
  • Braintrust - Evaluation, logging, and prompt iteration platform for AI products.
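
Whichever framework you pick, the underlying pattern is the same: a fixed scenario set, a scoring function, and a threshold that gates changes. A hand-rolled sketch of that shape; `run_agent`, the scenario format, and the scorer are hypothetical placeholders, not any framework's API:

```python
# Minimal regression-eval sketch: run fixed scenarios against the agent and
# fail if the pass rate drops below a threshold. Real frameworks add datasets,
# tracing, and richer scorers on top of this shape.
SCENARIOS = [
    {"input": "Refund order #1234", "must_call_tool": "issue_refund"},
    {"input": "What is our return policy?", "must_call_tool": None},
]

def score(scenario: dict, result: dict) -> bool:
    # Pass if the agent called (or correctly avoided) the expected tool.
    called = {call["name"] for call in result.get("tool_calls", [])}
    expected = scenario["must_call_tool"]
    return expected in called if expected else not called

def run_evals(run_agent, threshold: float = 0.9) -> None:
    results = [score(s, run_agent(s["input"])) for s in SCENARIOS]
    pass_rate = sum(results) / len(results)
    print(f"pass rate: {pass_rate:.0%}")
    assert pass_rate >= threshold, "regression: pass rate below threshold"
```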

Replay and Debugging

  • LangSmith Tracing - Trace inspection, dataset creation, and regression workflows for agent runs.
  • Langfuse Tracing - Traces for LLM calls, tool calls, chains, and agent sessions.
  • Phoenix Tracing - OpenTelemetry-based tracing for LLM application debugging.
  • Weave Tracing - Tracing and interactive debugging for model and agent workflows.

Guardrails and Runtime Controls

  • Guardrails AI - Validation and guardrail framework for LLM inputs and outputs.
  • NVIDIA NeMo Guardrails - Toolkit for programmable guardrails around LLM applications.
  • Llama Guard - Meta's safety classification model family for policy enforcement.
  • Rebuff - Prompt injection detection and mitigation framework.
  • Lakera Guard - Runtime protection for LLM applications against prompt injection and unsafe content.
  • OpenAI Moderation - Content safety models and moderation patterns.
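
The APIs above differ, but the runtime pattern is consistent: check inputs and proposed actions against policy before they execute, then block or escalate on violation. A framework-agnostic sketch; the `ToolCall` shape and the rules are illustrative:

```python
# Runtime policy-check sketch: every proposed tool call passes through a
# guardrail before execution. Real deployments combine classifiers (for
# example prompt-injection detection) with explicit policy rules like these.
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    arguments: dict

BLOCKED_TOOLS = {"delete_database"}   # illustrative deny list
SPEND_LIMIT_USD = 50.0                # illustrative per-action limit

def check_policy(call: ToolCall) -> tuple[bool, str]:
    if call.name in BLOCKED_TOOLS:
        return False, f"tool '{call.name}' is not allowed at runtime"
    if call.name == "make_payment" and call.arguments.get("amount_usd", 0) > SPEND_LIMIT_USD:
        return False, "payment exceeds the per-action spend limit"
    return True, "ok"

def guarded_execute(call: ToolCall, execute) -> dict:
    allowed, reason = check_policy(call)
    if not allowed:
        # Surface the violation for review instead of executing the action.
        return {"status": "blocked", "reason": reason}
    return execute(call)
```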

Security, Identity, and Access Control

Operational topics to cover in production reviews:

  • Tool permission boundaries and least privilege (see the sketch after this list).
  • Secret isolation and credential rotation.
  • User, agent, service, and tool identity.
  • Sandboxing for code execution, browser use, and filesystem access.
  • Audit logs for privileged or irreversible actions.
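
For the permission-boundary item, one concrete starting point is an explicit tool allowlist per agent identity, enforced at the dispatch layer rather than in prompts. A sketch with hypothetical identities and tool names:

```python
# Least-privilege sketch: each agent identity gets an explicit tool allowlist,
# enforced where tools are dispatched. In practice these identities map to real
# service accounts or workload identities with their own credentials.
TOOL_ALLOWLIST = {
    "support-agent": {"search_docs", "create_ticket"},
    "billing-agent": {"search_docs", "issue_refund"},
}

def authorize(agent_id: str, tool_name: str) -> None:
    allowed = TOOL_ALLOWLIST.get(agent_id, set())
    if tool_name not in allowed:
        raise PermissionError(f"{agent_id} is not permitted to call {tool_name}")
```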

Human Approval and Workflow Control

  • Temporal - Durable execution platform for long-running workflows, retries, and human-in-the-loop steps.
  • Inngest - Durable functions and event-driven workflows for reliable background execution.
  • Hatchet - Distributed task queue and workflow engine.
  • HumanLayer - Human approval workflows for AI agents and tool calls.

Useful approval patterns:

  • Require approval for external side effects such as sending email, spending money, merging code, or changing infrastructure.
  • Store the proposed action, context, risk level, approver, and final decision (sketched below).
  • Make approvals replayable and auditable, not transient chat messages.
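
A sketch of the storage and auditability patterns above: each high-impact action becomes a durable approval record, and execution waits for a recorded human decision. `save_record` and `load_record` are hypothetical placeholders for whatever store or workflow engine you actually use (a database row, or a Temporal or Inngest workflow step):

```python
# Approval-gate sketch: high-impact actions become durable, auditable records
# instead of executing immediately.
import uuid
from datetime import datetime, timezone

def propose_action(agent_id: str, action: dict, risk: str) -> str:
    record = {
        "id": str(uuid.uuid4()),
        "agent_id": agent_id,
        "proposed_action": action,
        "risk_level": risk,
        "status": "pending",
        "created_at": datetime.now(timezone.utc).isoformat(),
        "approver": None,
        "decision": None,
    }
    save_record(record)              # hypothetical persistence call
    return record["id"]

def resolve(record_id: str, approver: str, approved: bool) -> None:
    record = load_record(record_id)  # hypothetical lookup
    decision = "approved" if approved else "rejected"
    record.update(status=decision, approver=approver, decision=decision)
    save_record(record)
```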

Deployment and Runtime Infrastructure

  • LiteLLM - LLM gateway for model routing, budgets, retries, keys, and provider abstraction.
  • Portkey - AI gateway for observability, caching, routing, guardrails, and reliability.
  • Ray Serve - Scalable model and application serving for Python workloads.
  • BentoML - Framework for building and deploying AI applications.
  • Modal - Serverless infrastructure for AI and data workloads.
  • Fly.io - Application runtime useful for globally deployed agent services.
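
Gateways such as LiteLLM and Portkey implement routing, retries, and fallbacks for you; if you roll your own, the core shape is an ordered provider list with backoff. A provider-agnostic sketch; `call_model` and the model names are placeholders, not any gateway's API:

```python
# Fallback-routing sketch: try providers in order, retrying transient errors
# with exponential backoff, and fail loudly once every option is exhausted.
import time

PROVIDERS = ["primary-model", "cheaper-fallback", "last-resort"]  # illustrative

def complete_with_fallback(call_model, messages, max_retries: int = 2):
    last_error = None
    for model in PROVIDERS:
        for attempt in range(max_retries):
            try:
                return call_model(model=model, messages=messages)
            except TimeoutError as exc:          # treat timeouts as transient
                last_error = exc
                time.sleep(2 ** attempt)          # simple exponential backoff
    raise RuntimeError("all providers failed") from last_error
```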

Cloud AgentOps Platforms

Major cloud platforms are beginning to expose AgentOps capabilities through managed agent runtimes, observability tools, tracing, evaluations, guardrails, identity controls, and governance features.

This section tracks cloud-native services that help teams build, deploy, monitor, evaluate, secure, and govern AI agents in production.

Microsoft Azure and Microsoft Foundry

Operational capabilities to track:

  • agent inventory and lifecycle management
  • managed agent hosting
  • tracing and evaluation
  • multi-agent orchestration patterns
  • security and governance controls
  • organisational adoption processes

Google Cloud and Gemini Enterprise Agent Platform

Operational capabilities to track:

  • managed serverless agent runtime
  • tracing, logging, monitoring, and alerts
  • IAM-based agent identity
  • session and memory management
  • secure connectivity through Agent Gateway
  • production scaling controls

AWS, Amazon Bedrock and Amazon Bedrock AgentCore

Operational capabilities to track:

  • agent orchestration and API action execution
  • traces for agent reasoning and tool use
  • CloudWatch metrics and logs
  • CloudTrail auditability
  • guardrails and policy enforcement
  • knowledge base and runtime monitoring

What to compare across cloud platforms

When evaluating cloud AgentOps capabilities, compare:

| Capability | What to inspect |
| --- | --- |
| Managed runtime | Can agents be hosted, scaled, isolated, and versioned in production? |
| Tracing | Can teams inspect agent steps, tool calls, retrieval, memory, and reasoning paths? |
| Evaluation | Can outputs, trajectories, and task outcomes be evaluated continuously? |
| Identity | Can agents have distinct identities, permissions, and credential boundaries? |
| Guardrails | Can runtime policy, safety, and action constraints be enforced? |
| Monitoring | Can cost, latency, errors, usage, token consumption, and reliability be tracked? |
| Governance | Can teams audit lifecycle, approvals, access, incidents, and compliance evidence? |
| Portability | Can agents use external frameworks, APIs, tools, and model providers without deep lock-in? |

Cost, Latency, and Reliability

  • OpenCost - Open-source cost monitoring for Kubernetes infrastructure.
  • Grafana - Dashboards and alerting for metrics, logs, and traces.
  • Prometheus - Metrics and alerting toolkit.
  • Sentry - Application error monitoring and performance tracing.
  • Vercel AI Gateway - Gateway for model routing, observability, and usage controls.

Agent-specific signals worth tracking:

  • Task success rate and policy violation rate.
  • Tool call count, tool error rate, and tool latency.
  • Model fallback rate and retry rate.
  • Cost per task, cost per successful task, and cost per user.
  • Human escalation rate and approval rejection rate.
  • Context length, memory growth, and retrieval quality.
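
Most of these signals can be exported as ordinary metrics and scraped by Prometheus or visualised in Grafana. A sketch using the `prometheus_client` library; the metric names and labels are illustrative conventions rather than a standard:

```python
# Agent-signal metrics sketch: expose task outcomes, tool errors, cost, and
# latency as Prometheus metrics. Keep the naming scheme stable so dashboards
# and alerts survive refactors.
from prometheus_client import Counter, Histogram, start_http_server

TASKS = Counter("agent_tasks_total", "Agent task outcomes", ["outcome"])        # success | failure | escalated
TOOL_ERRORS = Counter("agent_tool_errors_total", "Tool call errors", ["tool"])
TASK_COST = Histogram("agent_task_cost_usd", "Model plus tool cost per task")
TASK_LATENCY = Histogram("agent_task_latency_seconds", "End-to-end task latency")

def record_task(outcome: str, cost_usd: float, latency_s: float) -> None:
    TASKS.labels(outcome=outcome).inc()
    TASK_COST.observe(cost_usd)
    TASK_LATENCY.observe(latency_s)

if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrapes metrics from :9100/metrics
```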

Incident Response and Governance

Agent incident checklist:

  • Preserve traces, prompts, tool inputs, tool outputs, retrieved context, and approval records (see the sketch after this checklist).
  • Identify whether the failure came from model behaviour, retrieval, tool execution, policy, permissions, or orchestration.
  • Reproduce the run with the same inputs where possible.
  • Add regression evals before changing prompts, tools, or policies.
  • Record user impact, safety impact, cost impact, and data exposure.
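
For the first checklist item, it helps to snapshot everything about the failing run into a single write-once bundle as soon as the incident is opened, so the investigation does not depend on log retention windows. A sketch; the fetch and upload helpers are hypothetical placeholders for your tracing backend and object store:

```python
# Incident evidence-bundle sketch: copy the run's traces, prompts, tool I/O,
# retrieved context, and approval records into one immutable artifact.
import json
from datetime import datetime, timezone

def capture_evidence(run_id: str, incident_id: str) -> str:
    bundle = {
        "incident_id": incident_id,
        "run_id": run_id,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "trace": fetch_trace(run_id),                  # hypothetical tracing-backend call
        "prompts": fetch_prompts(run_id),              # hypothetical
        "tool_io": fetch_tool_calls(run_id),           # hypothetical
        "retrieved_context": fetch_retrievals(run_id), # hypothetical
        "approvals": fetch_approvals(run_id),          # hypothetical
    }
    path = f"incidents/{incident_id}/{run_id}.json"
    upload_immutable(path, json.dumps(bundle, indent=2))  # hypothetical write-once store
    return path
```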

Multi-Agent Operations

  • AutoGen - Framework for building multi-agent AI applications.
  • CrewAI - Framework for orchestrating role-based AI agents.
  • LangGraph - Framework for stateful, controllable agent workflows.
  • Semantic Kernel - SDK for building agents and AI orchestration into applications.
  • OpenAI Swarm - Educational framework for lightweight multi-agent orchestration.

Operational concerns for multi-agent systems:

  • Shared state ownership and conflict resolution.
  • Message visibility, routing, and provenance.
  • Tool access per agent role.
  • Runaway loops, deadlocks, and duplicated work (a budget guard is sketched after this list).
  • Evaluation at both individual-agent and system levels.
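
One cheap defence against runaway loops and duplicated work is a hard budget on steps and hand-offs per run, enforced by the orchestrator rather than by any individual agent. A framework-agnostic sketch with illustrative limits:

```python
# Runaway-loop guard sketch: the orchestrator counts steps and agent hand-offs
# per run and aborts when a hard budget is exceeded, rather than trusting any
# single agent to terminate on its own.
class RunBudget:
    def __init__(self, max_steps: int = 50, max_handoffs: int = 10):
        self.max_steps = max_steps
        self.max_handoffs = max_handoffs
        self.steps = 0
        self.handoffs = 0

    def on_step(self) -> None:
        self.steps += 1
        if self.steps > self.max_steps:
            raise RuntimeError(f"run exceeded {self.max_steps} steps; aborting")

    def on_handoff(self, from_agent: str, to_agent: str) -> None:
        self.handoffs += 1
        if self.handoffs > self.max_handoffs:
            raise RuntimeError(
                f"too many hand-offs ({from_agent} -> {to_agent}); possible loop"
            )
```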

Standards and Protocols

Research and References

Contributing

We love contributors

Thrilled to have you here. Whether it's a quick typo fix, a fresh resource, a doc polish, or a sweeping overhaul, every contribution helps this list grow. Jump in and join the community; PRs of every size are welcome.

📝 Read the contributing guide · 🐛 good first issues

License

MIT. See LICENSE.
