@nicofains1/agentwatch

Observability for multi-agent systems. Track heartbeats, trace cross-agent actions, detect cascade failures, and replay what went wrong.

Built for teams running fleets of AI agents (CrewAI, AutoGen, LangGraph, PocketFlow, custom) who need to understand why Agent B failed after Agent A timed out.

Try it in 30 seconds

No install needed. Run this to see a full cascade failure traced across a fleet of 5 agents:

npx @nicofains1/agentwatch demo

Output:

AgentWatch Fleet Dashboard
============================================================
Agents: 5 total | 3 healthy | 1 degraded | 1 error | 0 offline

Cascade Failure (4 steps, root cause: scheduler/dispatch-batch)
============================================================
[ROOT] scheduler/dispatch-batch [ok] 15ms
       |
[  1 ] fetcher/call-api [error] 30000ms
       TIMEOUT after 30000ms
       |
[  2 ] processor/transform [error] 120ms
       Error: input is null - expected array from fetcher
       |
[FAIL] notifier/send-alert [error] 8ms
       Error: no processed data to report

Install

npm install @nicofains1/agentwatch

Quick Start

import { AgentWatch } from '@nicofains1/agentwatch';

const aw = new AgentWatch(); // creates agentwatch.db

// 1. Report heartbeats from your agents
aw.report('agent-a', 'healthy');
aw.report('agent-b', 'healthy');

// 2. Trace actions across agents
const traceId = aw.createTraceId();

const e1 = aw.trace(traceId, 'agent-a', 'fetch-data',
  'url=https://api.example.com', 'rows=150');

const e2 = aw.trace(traceId, 'agent-b', 'process',
  JSON.stringify({ rows: 150 }), 'Error: out of memory', {
    parentEventId: e1.id,
    status: 'error',
    durationMs: 4200,
  });

// 3. Find the root cause
const chain = aw.correlate(e2.id);
console.log(chain?.root_cause);
// -> { agent: 'agent-a', action: 'fetch-data', ... }

// 4. Fleet dashboard
console.log(aw.dashboardText());

Features

Heartbeat registration - Track agent health status over time. Detect stale or offline agents based on configurable thresholds.

Cross-agent tracing - Link actions across agents with trace IDs and parent event references. When agent-c fails because agent-b sent bad data that it got from agent-a, the trace shows the full chain.

Cascade failure detection - Walk backward from any failure to find the root cause across your agent fleet. correlate(failureEventId) returns the full chain from root cause to final failure.

Alert de-duplication - Alerts of the same type from the same agent within a time window are collapsed into one alert with an incrementing count. Severity escalates automatically as the count grows: info (1x) -> warning (3x) -> critical (10x).

Fleet dashboard - At-a-glance summary of your entire fleet: which agents are healthy, degraded, erroring, or offline, with uptime percentages and active alert counts per agent.

Forensic replay - Given a trace ID, replay all cascade chains to understand the full failure sequence.

OpenTelemetry export - Export traces as OTEL spans with GenAI semantic conventions. Plug into Jaeger, Grafana, or any OTEL-compatible backend.

MCP Server

AgentWatch works as an MCP server, so any MCP-compatible editor (Claude Code, Cursor, etc.) can use it as a tool. Add it to your MCP config:

{
  "mcpServers": {
    "agentwatch": {
      "command": "npx",
      "args": ["@nicofains1/agentwatch", "mcp"],
      "env": {
        "AGENTWATCH_DB": "/path/to/agentwatch.db"
      }
    }
  }
}

This exposes 13 tools: agentwatch_dashboard, agentwatch_report_heartbeat, agentwatch_trace, agentwatch_cascade, agentwatch_replay, agentwatch_get_alerts, agentwatch_get_failures, agentwatch_get_trace, agentwatch_fleet_health, agentwatch_create_trace_id, agentwatch_alert, agentwatch_resolve_alert, and agentwatch_dashboard_text.

CLI

npx @nicofains1/agentwatch demo                   # See it in action with sample data
npx @nicofains1/agentwatch dashboard              # Fleet health overview
npx @nicofains1/agentwatch cascade <event-id>     # Trace cascade from a failure
npx @nicofains1/agentwatch failures [agent]       # List recent failures
npx @nicofains1/agentwatch alerts [agent]         # List active alerts
npx @nicofains1/agentwatch replay <trace-id>      # Replay all cascades in a trace
npx @nicofains1/agentwatch mcp                    # Start MCP server (stdio)

Set AGENTWATCH_DB to point to your database file (default: agentwatch.db).

API

`new AgentWatch(config?)`

const aw = new AgentWatch({
  db_path: 'agentwatch.db',       // SQLite file path
  alert_window_minutes: 30,        // De-dup window for alerts
  heartbeat_stale_minutes: 30,     // When to mark agents as offline
});

Heartbeats

aw.report(agent, status, context?)     // status: 'healthy' | 'degraded' | 'error' | 'offline'
aw.getLatestHeartbeat(agent)           // -> Heartbeat | undefined
aw.getFleetHealth()                    // -> AgentHealth[]
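
The stale-heartbeat check can be pictured with a small standalone sketch. It does not use AgentWatch itself: the `HeartbeatRecord` shape and the `effectiveStatus` helper are hypothetical names for illustration, and the sketch assumes that `heartbeat_stale_minutes` means "minutes since the last report before an agent counts as offline".

```typescript
// Hypothetical record shape; the library's real Heartbeat type may differ.
type AgentStatus = 'healthy' | 'degraded' | 'error' | 'offline';

interface HeartbeatRecord {
  agent: string;
  status: AgentStatus;
  timestampMs: number; // epoch milliseconds of the last report
}

// Returns the reported status unless the heartbeat is older than the
// stale threshold, in which case the agent is treated as offline.
function effectiveStatus(
  hb: HeartbeatRecord,
  nowMs: number,
  staleMinutes: number = 30,
): AgentStatus {
  const ageMs = nowMs - hb.timestampMs;
  return ageMs > staleMinutes * 60 * 1000 ? 'offline' : hb.status;
}

const now = Date.UTC(2025, 0, 1, 12, 0, 0);
const fresh: HeartbeatRecord = { agent: 'agent-a', status: 'healthy', timestampMs: now - 5 * 60 * 1000 };
const stale: HeartbeatRecord = { agent: 'agent-b', status: 'healthy', timestampMs: now - 45 * 60 * 1000 };

console.log(effectiveStatus(fresh, now)); // healthy
console.log(effectiveStatus(stale, now)); // offline
```

Passing `nowMs` explicitly keeps the check deterministic, which makes threshold behavior easy to unit test.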

Tracing

aw.createTraceId()                                // -> string (UUID)
aw.trace(traceId, agent, action, input, output, {
  parentEventId?: number,                         // link to parent event
  status?: 'ok' | 'error',                        // default: 'ok'
  durationMs?: number,                            // execution time
})                                                // -> TraceEvent
aw.getTraceEvents(traceId)                        // -> TraceEvent[]
aw.getRecentFailures(agent?, limit?)              // -> TraceEvent[]

Cascade Detection

aw.correlate(failureEventId)    // -> CascadeChain | null (walk back to root cause)
aw.replay(traceId)              // -> CascadeChain[] (all cascades in a trace)
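
To make the "walk back to root cause" idea concrete, here is a minimal standalone sketch of following parent references from a failing event up to an event with no parent. It is not the library's implementation: `EventNode` and `walkToRoot` are hypothetical names, and the sample data simply mirrors the demo output shown earlier.

```typescript
// Hypothetical event shape for illustration only.
interface EventNode {
  id: number;
  agent: string;
  action: string;
  status: 'ok' | 'error';
  parentEventId?: number;
}

// Follow parentEventId links from the failing event until an event has no
// (known) parent, then return the chain ordered from root to final failure.
function walkToRoot(events: Map<number, EventNode>, failureId: number): EventNode[] | null {
  let current = events.get(failureId);
  if (!current) return null;
  const chain: EventNode[] = [];
  const seen = new Set<number>(); // guards against accidental cycles
  while (current && !seen.has(current.id)) {
    seen.add(current.id);
    chain.push(current);
    const parentId: number | undefined = current.parentEventId;
    current = parentId !== undefined ? events.get(parentId) : undefined;
  }
  return chain.reverse();
}

const sample: EventNode[] = [
  { id: 1, agent: 'scheduler', action: 'dispatch-batch', status: 'ok' },
  { id: 2, agent: 'fetcher', action: 'call-api', status: 'error', parentEventId: 1 },
  { id: 3, agent: 'processor', action: 'transform', status: 'error', parentEventId: 2 },
  { id: 4, agent: 'notifier', action: 'send-alert', status: 'error', parentEventId: 3 },
];
const byId = new Map<number, EventNode>(sample.map((e) => [e.id, e] as [number, EventNode]));

const chain = walkToRoot(byId, 4);
console.log(chain?.map((e) => `${e.agent}/${e.action}`).join(' -> '));
// scheduler/dispatch-batch -> fetcher/call-api -> processor/transform -> notifier/send-alert
```

In this sketch the root is simply the topmost ancestor, which is why an event with status `ok` can still be reported as the root cause of a chain.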

Alerts

aw.alert(agent, alertType, message)    // auto-deduplicates within window
aw.resolveAlert(alertId)
aw.activeAlerts(agent?)                // -> Alert[]
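
The de-duplication and escalation rules can be sketched in isolation as follows. This is an illustrative model, not AgentWatch's code: `AlertDeduper`, `AlertRecord`, and `severityFor` are hypothetical names, the window is assumed to be measured from the first occurrence, and the thresholds are read as "3 or more is warning, 10 or more is critical".

```typescript
type Severity = 'info' | 'warning' | 'critical';

// Hypothetical alert shape for illustration only.
interface AlertRecord {
  agent: string;
  alertType: string;
  message: string;
  count: number;
  severity: Severity;
  firstSeenMs: number;
}

// Assumed reading of "info (1x) -> warning (3x) -> critical (10x)".
function severityFor(count: number): Severity {
  if (count >= 10) return 'critical';
  if (count >= 3) return 'warning';
  return 'info';
}

class AlertDeduper {
  private readonly windowMs: number;
  private readonly active = new Map<string, AlertRecord>();

  constructor(windowMinutes: number = 30) {
    this.windowMs = windowMinutes * 60 * 1000;
  }

  // Collapse repeats of (agent, alertType) inside the window into one record.
  raise(agent: string, alertType: string, message: string, nowMs: number): AlertRecord {
    const key = `${agent}:${alertType}`;
    const existing = this.active.get(key);
    if (existing && nowMs - existing.firstSeenMs <= this.windowMs) {
      existing.count += 1;
      existing.message = message;
      existing.severity = severityFor(existing.count);
      return existing;
    }
    const created: AlertRecord = { agent, alertType, message, count: 1, severity: 'info', firstSeenMs: nowMs };
    this.active.set(key, created);
    return created;
  }
}

const minute = 60 * 1000;
const deduper = new AlertDeduper(30);
deduper.raise('agent-b', 'oom', 'out of memory', 0);
deduper.raise('agent-b', 'oom', 'out of memory', 1 * minute);
const third = deduper.raise('agent-b', 'oom', 'out of memory', 2 * minute);
console.log(third.count, third.severity); // 3 warning

const later = deduper.raise('agent-b', 'oom', 'out of memory', 31 * minute);
console.log(later.count, later.severity); // 1 info
```

Keying on agent plus alert type means a different alert type from the same agent starts its own count.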

Dashboard

aw.dashboard()      // -> DashboardOutput (structured)
aw.dashboardText()  // -> string (formatted for terminal)
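
As a rough picture of how a fleet summary line like the one in the demo output could be derived from per-agent statuses, consider the sketch below. `summarizeFleet` and the agent names are hypothetical; the real `DashboardOutput` structure is not shown in this README.

```typescript
type AgentStatus = 'healthy' | 'degraded' | 'error' | 'offline';

// Count agents per status and format a one-line summary.
function summarizeFleet(statuses: Record<string, AgentStatus>): string {
  const counts: Record<AgentStatus, number> = { healthy: 0, degraded: 0, error: 0, offline: 0 };
  for (const name of Object.keys(statuses)) {
    counts[statuses[name]] += 1;
  }
  const total = Object.keys(statuses).length;
  return (
    `Agents: ${total} total | ${counts.healthy} healthy | ${counts.degraded} degraded | ` +
    `${counts.error} error | ${counts.offline} offline`
  );
}

const fleet: Record<string, AgentStatus> = {
  'agent-a': 'healthy',
  'agent-b': 'healthy',
  'agent-c': 'healthy',
  'agent-d': 'degraded',
  'agent-e': 'error',
};

console.log(summarizeFleet(fleet));
// Agents: 5 total | 3 healthy | 1 degraded | 1 error | 0 offline
```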

OpenTelemetry Export

// Requires optional peer deps: @opentelemetry/api, @opentelemetry/sdk-trace-base
await aw.exportTraceToOtel(traceId, { serviceName: 'my-agents' });
await aw.exportRecentToOtel(1); // last 1 hour

Storage

AgentWatch stores its data in SQLite via better-sqlite3. The database file is created automatically on first use. WAL mode is enabled so that reads can proceed while a write is in progress.

Tables: heartbeats, trace_events, and alerts, each with indexes.

License

MIT
