An end-to-end Datadog observability stack for an LLM agent (git clone to fully instrumented in 15 minutes), wired up the way I'd actually want to run one in production.
Lives at github.com/jonathan-major/DDFS.
A small but representative stack — FastAPI service, LangGraph triage agent, Postgres with pgvector, Redis, Celery worker, static HTML chat UI — instrumented with Datadog APM, Logs, and LLM Observability. The point is to see what "AI-era observability" actually looks like when you take it seriously: APM traces, log/trace correlation, LLM Observability spans, five custom metrics, a seven-tile dashboard, two monitors, and an SLO — all defined as JSON in the repo and pushed via a small Python script.
Most "LLM observability" content stops at one of two failure modes. Either it's a screenshot tour of a vendor dashboard with no working code behind it, or it's a thousand-line code dump with no story about what you'd actually look at when something goes wrong. I wanted to see the whole loop end to end:
- A non-trivial agent (multi-node LangGraph with two different Claude models, RAG retrieval, a self-eval step, and a human-escalation path)
- Real APM auto-instrumentation across the boring middle of the stack (FastAPI / psycopg / redis / Celery)
- Custom application metrics flowing via DogStatsD with bounded cardinality (5 nodes × 4 intent values + service/env constants — no user IDs, no question content in tag values)
- Dashboards / monitors / SLOs as code, not click-ops
This repo is the result. It tells two stories at once:
- AI-era observability. A LangGraph customer-support triage agent over a synthetic feature-flag product corpus ("Bramble," 29 chunks). Every node is wrapped with Datadog LLM Observability decorators (@workflow/@task/@llm/@retrieval/@tool). The trace tree shows prompt, response, model, token counts, cost in dollars, and self-evaluation scores from the confidence-check node, all in one navigable view.
- The boring-but-essential rest of the stack. APM auto-instrumentation across FastAPI, Celery, Postgres, Redis, and the Anthropic SDK. Structured JSON logs with dd.trace_id correlation injected by ddtrace. Five custom DogStatsD metrics (ddfs.agent.requests, ddfs.agent.cost_usd, ddfs.agent.confidence_score, ddfs.agent.escalations, and a ddfs.agent.node.duration_ms distribution with per-node p95 percentiles). Infrastructure metrics via Datadog Agent container labels.
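The bounded-cardinality point is easiest to see as a tag builder that only ever emits values from fixed allowlists. What follows is a minimal sketch, not code from the repo: the helper names (`bounded_tags`, `max_tag_combinations`), the `env:dev` value, and the exact node names are assumptions pieced together from this README.

```python
# Hypothetical helper, not taken from the DDFS repo.
# Node and intent names are assumed from the README text.
ALLOWED_NODES = frozenset(
    {"classify_intent", "retrieve_docs", "draft_answer", "score_confidence", "dispatch"}
)
ALLOWED_INTENTS = frozenset({"question", "bug_report", "feature_request", "unknown"})
CONSTANT_TAGS = ("service:ddfs-day-one-api", "env:dev")  # env value is an assumption


def bounded_tags(node: str, intent: str) -> list[str]:
    """Return DogStatsD tags whose values come only from fixed allowlists."""
    if node not in ALLOWED_NODES:
        # Node names are defined in code, so an unknown one is a programming error.
        raise ValueError(f"unknown node: {node!r}")
    # Intent comes from a model, so anything unexpected collapses to "unknown"
    # instead of creating a new tag value.
    safe_intent = intent if intent in ALLOWED_INTENTS else "unknown"
    return [f"node:{node}", f"intent:{safe_intent}", *CONSTANT_TAGS]


def max_tag_combinations() -> int:
    """Upper bound on distinct (node, intent) tag pairs this builder can emit."""
    return len(ALLOWED_NODES) * len(ALLOWED_INTENTS)
```

Because free-form strings never reach a tag value, the number of distinct series per metric stays at 5 × 4 no matter what users type.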
git clone https://github.com/jonathan-major/DDFS
cd DDFS
cp .env.example .env # add DD_API_KEY, DD_APP_KEY, ANTHROPIC_API_KEY
make up # docker compose up -d (full stack + Datadog agent)
make seed-docs # embed the 29-chunk Bramble corpus into pgvector
make demo # fire 20 synthetic questions through the agent
make dd-apply # push dashboards/monitors/SLO to your org via API
Open the DDFS Day One — Agent Overview dashboard in your Datadog org and you should see:
- Requests / min: live traffic from the demo run
- Median cost / conversation: around $0.006-0.008 with Sonnet+Haiku
- Escalation rate: fraction of conversations the LLM-as-judge sent to a human
- Cost per conversation over time: a timeseries you can correlate with make demo runs
- p95 latency by LangGraph node: five curves; the draft_answer span (the graph's Sonnet draft node) runs ~3-4× slower than the Haiku spans (classify_intent, score_confidence). That one chart tells the "right model for the job" story without a paragraph.
- Confidence score distribution: average and p10 of the self-evaluation score; the p10 line drops sharply on out-of-corpus questions
- Intents handled (top list): question, bug_report, feature_request, unknown, tagged by the classifier
Then drop into APM → ddfs-day-one-api → recent trace. One flame graph shows the entire conversation: FastAPI request → LangGraph workflow → five task spans → nested llm spans for the Haiku and Sonnet calls → retrieval span for pgvector → tool spans for the Redis enqueue and Postgres write. Click any log line; one click later you're at the trace that emitted it.
The first interesting surprise comes unprompted: Watchdog Insights flags draft_answer as a 6.3× p95 latency outlier (~5s vs 800ms baseline) without anyone configuring an anomaly rule. That's the kind of thing you don't get for free from a roll-your-own stack.
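The log-to-trace jump described above depends on every JSON log line carrying the active trace and span IDs. Here is a rough sketch of a formatter that does this, assuming ddtrace's log injection has already attached `dd.trace_id` and `dd.span_id` attributes to each record. The class name `JsonTraceFormatter` is hypothetical, and the repo's real logging setup may look different.

```python
import json
import logging


class JsonTraceFormatter(logging.Formatter):
    """Render each record as one JSON object with Datadog correlation IDs."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Assumption: log injection set these attributes on the record.
            # Fall back to "0" when no trace is active or injection is off.
            "dd.trace_id": str(getattr(record, "dd.trace_id", "0")),
            "dd.span_id": str(getattr(record, "dd.span_id", "0")),
        }
        return json.dumps(payload)


def build_logger(name: str = "ddfs") -> logging.Logger:
    """Attach the JSON formatter to a stream handler (illustrative wiring only)."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonTraceFormatter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

With the IDs present as top-level JSON fields, the log pipeline can link each line to its trace without any parsing rules.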
                  [ static HTML chat UI ]
                             │
                             ▼
                    [ FastAPI service ]
                             │
             ┌───────────────┼────────────────┐
             ▼               ▼                ▼
    [ LangGraph agent ] [ Postgres +     [ Redis +
      classify_intent     pgvector ]       Celery escalation queue ]
      retrieve_docs                           │
      draft                                   ▼
      score_confidence                   [ Celery worker:
      dispatch                             index_doc,
             │                             drain_escalations ]
             ▼
    [ Anthropic Claude ]
      Haiku  → classify_intent, score_confidence
      Sonnet → draft (the user-facing generation)
Every Python service runs under ddtrace-run, so FastAPI / psycopg / redis / Celery / Anthropic SDK are auto-instrumented without any code changes. The LangGraph nodes are wrapped with the ddtrace.llmobs decorators so spans land in LLM Observability with the right semantic shape. Both Sonnet and Haiku are used deliberately — Haiku for cheap classification and self-evaluation, Sonnet for the user-facing draft — and the per-node latency widget makes the tradeoff visible.
DDFS/
├── README.md
├── apps/
│ ├── api/ FastAPI + LangGraph + Celery
│ │ ├── main.py ASGI entry
│ │ ├── agent/
│ │ │ ├── graph.py StateGraph definition
│ │ │ ├── nodes.py 5 nodes: classify / retrieve / draft / score / dispatch
│ │ │ ├── tools.py pgvector retriever, Redis escalation, Postgres record
│ │ │ ├── state.py TypedDict state
│ │ │ └── instrumentation.py DD LLM Obs + DogStatsD wrapper
│ │ ├── routes/agent.py POST /agent/ask
│ │ ├── db/pgvector_setup.py Schema + 29-chunk Bramble corpus
│ │ ├── tasks/index_docs.py Celery worker
│ │ ├── requirements.txt
│ │ └── Dockerfile
│ └── web/ static HTML chat UI served by nginx
├── monitoring/
│ ├── dashboards/agent-overview.json 7-tile dashboard
│ ├── monitors/ escalation-rate + LLM-cost-anomaly
│ └── slos/agent-availability.json 99.5% success-rate SLO
├── scripts/
│ └── apply_monitoring.py pushes monitoring/*.json to DD org via REST API
├── demo/questions.txt 8 synthetic customer questions
├── docs/
│ ├── DAY-ONE.md minute-by-minute narrative
│ └── LIVE-DATADOG-TOUR.md guided tour of the live Datadog views
├── docker-compose.yml
├── Makefile
└── .env.example
Every LangGraph node uses a @task decorator from agent/instrumentation.py. The decorator does two things in one wrap: it opens a ddtrace.llmobs task span (so the node shows up as a child of the top-level @workflow in the LLM Observability trace tree), and it times the function body so per-node p95 latency lands as a DogStatsD distribution metric tagged by node name.
@task(name="classify_intent")
def classify_intent(state):
    ...
    metric_increment("ddfs.agent.requests", tags=[f"intent:{intent}"])
    return {...}
The same pattern: @llm on every Claude call, @retrieval on the pgvector query, @tool on Redis and Postgres writes. The agent code stays readable: none of the instrumentation requires touching the LangGraph state shape or the model invocation logic.
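To make the "two things in one wrap" idea concrete, here is a minimal sketch of a decorator that times a node and records the duration tagged by node name. It is not the repo's instrumentation.py: `timed_task`, `emit_distribution`, and the in-memory `EMITTED` list are hypothetical stand-ins so the example runs without a Datadog Agent, and the LLM Observability span is indicated only by a comment.

```python
import functools
import time
from typing import Any, Callable

# Stand-in sink so the sketch runs offline. A real setup would presumably
# forward these values to a DogStatsD client as a distribution metric.
EMITTED: list[tuple[str, float, list[str]]] = []


def emit_distribution(metric: str, value: float, tags: list[str]) -> None:
    EMITTED.append((metric, value, tags))


def timed_task(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Time the wrapped node and record its duration in ms, tagged by node name."""

    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            try:
                # A real implementation would also open an LLM Observability
                # task span around this call; that part is omitted here.
                return fn(*args, **kwargs)
            finally:
                # Runs on success and on exceptions, so failed nodes are timed too.
                elapsed_ms = (time.perf_counter() - start) * 1000.0
                emit_distribution(
                    "ddfs.agent.node.duration_ms", elapsed_ms, [f"node:{name}"]
                )

        return wrapper

    return decorator


@timed_task(name="classify_intent")
def classify_intent(state: dict) -> dict:
    # Placeholder body; the real node calls a model to classify the question.
    return {"intent": "question"}
```

Putting the timing in a `finally` block is a deliberate choice: a node that raises still contributes a latency sample, which keeps the p95 honest during incidents.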
monitoring/*.json is the source of truth for the dashboards, monitors, and SLO. scripts/apply_monitoring.py reads each file, looks up the resource by name in your Datadog org, and either POSTs a new one or PUTs an update. The script is idempotent — re-running it never duplicates resources — and small enough to read in one sitting.
make dd-apply # pushes monitoring/*.json via the REST API
For a team that wants state tracking and an audit trail, this same JSON drops cleanly into a Terraform datadog_dashboard_json / datadog_monitor / datadog_service_level_objective resource; the script is the lightweight starting point, not the ceiling.
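The look-up-then-POST-or-PUT behaviour described for scripts/apply_monitoring.py can be sketched as a pure decision function. This is an illustration under assumptions rather than the script itself: `plan_upsert` and its parameters are hypothetical, and the HTTP calls are left out so the logic can be read and tested on its own.

```python
from typing import Optional


def plan_upsert(
    desired: dict,
    existing: list[dict],
    base_path: str,
    key: str = "name",
) -> tuple[str, str]:
    """Decide whether a resource should be created or updated.

    `existing` is the list already fetched from the org. Resources are matched
    on `key` ("name" for monitors and SLOs, "title" for dashboards).
    """
    wanted = desired[key]
    match: Optional[dict] = next(
        (resource for resource in existing if resource.get(key) == wanted), None
    )
    if match is None:
        # Nothing with this name yet: create it.
        return ("POST", base_path)
    # Already present: update in place, so re-running never duplicates.
    return ("PUT", f"{base_path}/{match['id']}")
```

For example, `plan_upsert({"name": "escalation-rate"}, [], "/api/v1/monitor")` yields a POST to the collection path, while the same call against a list that already contains that name yields a PUT to the existing resource's path.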
- Not a production app. The agent is small and the corpus is fake. The point is the instrumentation pattern, not the agent logic.
- Not a Datadog tutorial. The Datadog docs already do that. This repo is a working stack you can extend.
Datadog's Free tier covers up to 5 infrastructure hosts with core collection and visualization only — APM, Logs, and LLM Observability are paid features. The 14-day Pro trial is enough to run this stack end-to-end. The dashboards, monitors, and SLO defined as JSON here persist regardless of the trial state.
Datadog does not publish LLM Observability pricing on its public pricing page (the SKU is listed under "AI Observability" but without rates) — check directly with Datadog before pointing this at a production workload.