Observability for multi-agent systems. Track heartbeats, trace cross-agent actions, detect cascade failures, and replay what went wrong.
Built for teams running fleets of AI agents (CrewAI, AutoGen, LangGraph, PocketFlow, custom) who need to understand why Agent B failed after Agent A timed out.
No install needed. Run this and see a full cascade failure traced across 5 agents:
```sh
npx @nicofains1/agentwatch demo
```

Output:

```text
AgentWatch Fleet Dashboard
============================================================
Agents: 5 total | 3 healthy | 1 degraded | 1 error | 0 offline

Cascade Failure (4 steps, root cause: scheduler/dispatch-batch)
============================================================
[ROOT] scheduler/dispatch-batch   [ok]      15ms
   |
[ 1 ]  fetcher/call-api           [error]   30000ms
       TIMEOUT after 30000ms
   |
[ 2 ]  processor/transform        [error]   120ms
       Error: input is null - expected array from fetcher
   |
[FAIL] notifier/send-alert        [error]   8ms
       Error: no processed data to report
```
```sh
npm install @nicofains1/agentwatch
```

```ts
import { AgentWatch } from '@nicofains1/agentwatch';

const aw = new AgentWatch(); // creates agentwatch.db

// 1. Report heartbeats from your agents
aw.report('agent-a', 'healthy');
aw.report('agent-b', 'healthy');

// 2. Trace actions across agents
const traceId = aw.createTraceId();
const e1 = aw.trace(traceId, 'agent-a', 'fetch-data',
  'url=https://api.example.com', 'rows=150');
const e2 = aw.trace(traceId, 'agent-b', 'process',
  JSON.stringify({ rows: 150 }), 'Error: out of memory', {
    parentEventId: e1.id,
    status: 'error',
    durationMs: 4200,
  });

// 3. Find the root cause
const chain = aw.correlate(e2.id);
console.log(chain?.root_cause);
// -> { agent: 'agent-a', action: 'fetch-data', ... }

// 4. Fleet dashboard
console.log(aw.dashboardText());
```

- **Heartbeat registration** - Track agent health status over time. Detect stale or offline agents based on configurable thresholds.
- **Cross-agent tracing** - Link actions across agents with trace IDs and parent event references. When agent-c fails because agent-b sent bad data that it got from agent-a, the trace shows the full chain.
- **Cascade failure detection** - Walk backward from any failure to find the root cause across your agent fleet. `correlate(failureEventId)` returns the full chain from root cause to final failure.
- **Alert de-duplication** - The same alert type from the same agent within a time window is collapsed into one alert with an incrementing count. Severity auto-escalates: info (1x) -> warning (3x) -> critical (10x).
- **Fleet dashboard** - One-line summary of your entire fleet: which agents are healthy, degraded, erroring, or offline, plus uptime percentages and active alert counts per agent.
- **Forensic replay** - Given a trace ID, replay all cascade chains to understand the full failure sequence.
- **OpenTelemetry export** - Export traces as OTEL spans with GenAI semantic conventions. Plug into Jaeger, Grafana, or any OTEL-compatible backend.
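One way to picture the walk-back that `correlate()` performs is this small standalone sketch (the `TraceEvent` shape and `walkToRoot` helper here are illustrative, not AgentWatch's actual internals):

```typescript
interface TraceEvent {
  id: number;
  agent: string;
  action: string;
  status: 'ok' | 'error';
  parentEventId?: number;
}

// Follow parent links from a failure back to the root event.
function walkToRoot(events: Map<number, TraceEvent>, failureId: number): TraceEvent[] {
  const chain: TraceEvent[] = [];
  let current = events.get(failureId);
  while (current) {
    chain.unshift(current); // prepend so the chain reads root -> failure
    current = current.parentEventId !== undefined
      ? events.get(current.parentEventId)
      : undefined;
  }
  return chain;
}

// Example mirroring the demo: scheduler -> fetcher -> processor -> notifier
const events = new Map<number, TraceEvent>([
  [1, { id: 1, agent: 'scheduler', action: 'dispatch-batch', status: 'ok' }],
  [2, { id: 2, agent: 'fetcher', action: 'call-api', status: 'error', parentEventId: 1 }],
  [3, { id: 3, agent: 'processor', action: 'transform', status: 'error', parentEventId: 2 }],
  [4, { id: 4, agent: 'notifier', action: 'send-alert', status: 'error', parentEventId: 3 }],
]);

const chain = walkToRoot(events, 4);
console.log(chain[0].agent); // -> 'scheduler' (the root cause)
```

The same walk underlies `replay()`, which simply runs it for every failure in a trace.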
AgentWatch works as an MCP server, so any MCP-compatible editor (Claude Code, Cursor, etc.) can use it as a tool. Add it to your MCP config:
```json
{
  "mcpServers": {
    "agentwatch": {
      "command": "npx",
      "args": ["@nicofains1/agentwatch", "mcp"],
      "env": {
        "AGENTWATCH_DB": "/path/to/agentwatch.db"
      }
    }
  }
}
```

This exposes 13 tools: `agentwatch_dashboard`, `agentwatch_report_heartbeat`, `agentwatch_trace`, `agentwatch_cascade`, `agentwatch_replay`, `agentwatch_get_alerts`, `agentwatch_get_failures`, `agentwatch_get_trace`, `agentwatch_fleet_health`, `agentwatch_create_trace_id`, `agentwatch_alert`, `agentwatch_resolve_alert`, and `agentwatch_dashboard_text`.
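Under the hood, an MCP client calls these tools over JSON-RPC 2.0 on stdio. A `tools/call` request for the dashboard tool would look roughly like this (the shape follows the MCP specification; the `id` value is arbitrary):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "agentwatch_dashboard",
    "arguments": {}
  }
}
```

Your editor's MCP client handles this plumbing for you; you only supply the config above.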
```sh
npx @nicofains1/agentwatch demo                # See it in action with sample data
npx @nicofains1/agentwatch dashboard           # Fleet health overview
npx @nicofains1/agentwatch cascade <event-id>  # Trace cascade from a failure
npx @nicofains1/agentwatch failures [agent]    # List recent failures
npx @nicofains1/agentwatch alerts [agent]      # List active alerts
npx @nicofains1/agentwatch replay <trace-id>   # Replay all cascades in a trace
npx @nicofains1/agentwatch mcp                 # Start MCP server (stdio)
```

Set `AGENTWATCH_DB` to point to your database file (default: `agentwatch.db`).
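For example, to point every command at one shared database file (the path below is illustrative):

```sh
# All commands in this shell now read and write the same fleet database
export AGENTWATCH_DB=/var/lib/agentwatch/agentwatch.db
npx @nicofains1/agentwatch dashboard
```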
```ts
const aw = new AgentWatch({
  db_path: 'agentwatch.db',      // SQLite file path
  alert_window_minutes: 30,      // De-dup window for alerts
  heartbeat_stale_minutes: 30,   // When to mark agents as offline
});
```

```ts
aw.report(agent, status, context?)    // status: 'healthy' | 'degraded' | 'error' | 'offline'
aw.getLatestHeartbeat(agent)          // -> Heartbeat | undefined
aw.getFleetHealth()                   // -> AgentHealth[]
```

```ts
aw.createTraceId()                    // -> string (UUID)
aw.trace(traceId, agent, action, input, output, {
  parentEventId?: number,             // link to parent event
  status?: 'ok' | 'error',            // default: 'ok'
  durationMs?: number,                // execution time
})                                    // -> TraceEvent
aw.getTraceEvents(traceId)            // -> TraceEvent[]
aw.getRecentFailures(agent?, limit?)  // -> TraceEvent[]
```

```ts
aw.correlate(failureEventId)          // -> CascadeChain | null (walk back to root cause)
aw.replay(traceId)                    // -> CascadeChain[] (all cascades in a trace)
```

```ts
aw.alert(agent, alertType, message)   // auto-deduplicates within window
aw.resolveAlert(alertId)
aw.activeAlerts(agent?)               // -> Alert[]
```

```ts
aw.dashboard()                        // -> DashboardOutput (structured)
aw.dashboardText()                    // -> string (formatted for terminal)
```

```ts
// Requires optional peer deps: @opentelemetry/api, @opentelemetry/sdk-trace-base
await aw.exportTraceToOtel(traceId, { serviceName: 'my-agents' });
await aw.exportRecentToOtel(1);       // last 1 hour
```

Uses SQLite via better-sqlite3. The database file is created automatically on first use. WAL mode is enabled for concurrent reads.
Tables: `heartbeats`, `trace_events`, and `alerts`, each indexed for the common query paths.
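The de-duplication behind `aw.alert()` can be sketched standalone. The thresholds mirror the documented escalation (info at 1x, warning at 3x, critical at 10x); the `raise` and `severityFor` helpers and the in-memory map are illustrative, and window expiry is omitted for brevity:

```typescript
type Severity = 'info' | 'warning' | 'critical';

interface Alert {
  agent: string;
  alertType: string;
  message: string;
  count: number;
  severity: Severity;
}

// Documented escalation: info (1x) -> warning (3x) -> critical (10x)
function severityFor(count: number): Severity {
  if (count >= 10) return 'critical';
  if (count >= 3) return 'warning';
  return 'info';
}

// Collapse repeats of the same (agent, alertType) into one alert
// with an incrementing count and auto-escalating severity.
function raise(active: Map<string, Alert>, agent: string,
               alertType: string, message: string): Alert {
  const key = `${agent}:${alertType}`;
  const count = (active.get(key)?.count ?? 0) + 1;
  const alert: Alert = { agent, alertType, message, count, severity: severityFor(count) };
  active.set(key, alert);
  return alert;
}

const active = new Map<string, Alert>();
for (let i = 0; i < 3; i++) raise(active, 'fetcher', 'timeout', 'API timed out');
console.log(active.get('fetcher:timeout')?.severity); // -> 'warning'
```

Folding repeats into one row keeps the `alerts` table small even when a flapping agent fires the same alert hundreds of times.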
MIT