Observability

Authoritative rule: agents/rules/observability.md · Logging companion: agents/rules/logging.md

TL;DR

NoETL is a distributed runtime: gateway → server → NATS → worker → tools → server (event log). When something misbehaves in production, the only way to find it is the trace. This rule constrains what gets traced and measured, and how IDs flow through the system.

Four principles

Principle 1 — Observability is part of the implementation

Every new feature ships with three artefacts in the same change set:

- A span covering the operation (tracing::info_span!("command.claim", ...)).
- At least one metric capturing throughput / latency / failure rate (noetl_*_total, noetl_*_duration_seconds, noetl_*_lag).
- execution_id correlation as a span attribute / metric exemplar.
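
As a rough illustration of shipping all three artefacts together, the sketch below wraps an operation in one context manager that opens a span, counts the call, and records its duration. It uses only the Python standard library; `observed`, `COUNTERS`, `DURATIONS`, and `SPANS` are hypothetical in-memory stand-ins for a real tracer and metrics registry, and the `_failed_total` suffix is an assumption, not a documented NoETL metric name.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory stand-ins for a real metrics registry and tracer.
COUNTERS = defaultdict(int)     # metric name -> count
DURATIONS = defaultdict(list)   # metric name -> observed durations in seconds
SPANS = []                      # finished spans, stored as plain dicts


@contextmanager
def observed(operation, execution_id):
    """Open a span for `operation`; record a counter and a duration on exit."""
    prefix = "noetl_" + operation.replace(".", "_")
    # execution_id lives on the span, never in a metric label.
    span = {"name": operation, "execution_id": execution_id, "status": "ok"}
    start = time.perf_counter()
    try:
        yield span
    except Exception:
        span["status"] = "error"
        COUNTERS[prefix + "_failed_total"] += 1
        raise
    finally:
        DURATIONS[prefix + "_duration_seconds"].append(time.perf_counter() - start)
        COUNTERS[prefix + "_total"] += 1
        SPANS.append(span)


# Usage: the span, the counter, and the duration all come from one wrapper.
with observed("command.claim", execution_id=123) as span:
    span["worker"] = "worker-0"  # extra span attributes are plain dict entries here
```

Putting the three artefacts behind one wrapper means a new code path cannot pick up the span and skip the metric, which is the failure mode this principle is meant to prevent.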

"We'll add observability later" = production debugging becomes guesswork.

Principle 2 — Metrics over logs

Floods of INFO logs are not observability. Counters / histograms / gauges scale; logs don't.

| Want to know | Use | Not |
| --- | --- | --- |
| Throughput | counter `noetl_*_total` | grep INFO |
| Slow tail | histogram `noetl_*_duration_seconds` | grep slow lines |
| Backlog | gauge `noetl_*_lag` | watch debug logs |
| Per-execution failure | structured event + WARN/ERROR with `execution_id` | spam INFO |
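
To make the scaling argument concrete, here is a minimal sketch of a bucketed histogram in plain Python. The class, its method names, and the bucket bounds are illustrative assumptions rather than NoETL's metrics API. Memory stays fixed at one integer per bucket regardless of traffic, and the slow tail is read from the bucket counts instead of from log lines.

```python
import bisect


class Histogram:
    """Fixed-size latency histogram; a value lands in the first bucket whose bound is >= it."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # final slot is the overflow (+Inf) bucket
        self.total_seconds = 0.0

    def observe(self, seconds):
        # bisect_left returns the index of the first bound >= seconds.
        self.counts[bisect.bisect_left(self.bounds, seconds)] += 1
        self.total_seconds += seconds

    def fraction_slower_than(self, bound):
        """Share of observations strictly above `bound`, which must be a configured bound."""
        idx = self.bounds.index(bound)
        observed = sum(self.counts)
        return sum(self.counts[idx + 1:]) / observed if observed else 0.0


# Usage: four observations, two of them above the 0.5 s bound.
latency = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.2, 0.7, 3.0):
    latency.observe(seconds)
slow_share = latency.fraction_slower_than(0.5)
```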

Principle 3 — Snowflake IDs come from the application

Identifiers that need to exist before the row hits the database — execution_id, command_id, event_id — are generated by the application using a snowflake algorithm. The database's gen_snowflake() stays as a fallback for out-of-band writes.

Reasons:

- Spans need the id at span-open time, not after the round-trip.
- Retries are idempotent only if the id is stable.
- Cross-component coordination needs the id before publish.
- Test fixtures need deterministic ids.
- Sharded / multi-cluster deployments can't share a DB sequence.
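
The reasons above can be illustrated with a small application-side generator. This is a sketch under stated assumptions: the bit layout (41-bit millisecond timestamp, 10-bit node id, 12-bit sequence) follows the commonly used Twitter-style scheme, and the custom epoch and the `SnowflakeGenerator` name are hypothetical. This page does not specify the layout NoETL or gen_snowflake() actually uses.

```python
import threading
import time

# Assumed layout: [41-bit ms since epoch][10-bit node id][12-bit sequence].
EPOCH_MS = 1_704_067_200_000  # 2024-01-01T00:00:00Z, a hypothetical custom epoch
NODE_BITS = 10
SEQ_BITS = 12
MAX_SEQ = (1 << SEQ_BITS) - 1


class SnowflakeGenerator:
    def __init__(self, node_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= node_id < (1 << NODE_BITS):
            raise ValueError("node_id out of range")
        self._node_id = node_id
        self._clock = clock          # injectable so tests can pin the time
        self._lock = threading.Lock()
        self._last_ms = -1
        self._seq = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # Clock went backwards: reuse the last timestamp to stay monotonic.
                now = self._last_ms
            if now == self._last_ms:
                self._seq = (self._seq + 1) & MAX_SEQ
                if self._seq == 0:
                    # Sequence exhausted in this millisecond: advance logical time by 1 ms.
                    now = self._last_ms + 1
            else:
                self._seq = 0
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << (NODE_BITS + SEQ_BITS))
                | (self._node_id << SEQ_BITS)
                | self._seq
            )


# Usage: a pinned clock yields deterministic ids for test fixtures.
gen = SnowflakeGenerator(node_id=3, clock=lambda: EPOCH_MS + 1)
first_id = gen.next_id()
second_id = gen.next_id()
```

Because the id exists before any database write, a caller can generate it once, attach it to the span, and reuse the same value on every retry.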

Principle 4 — `execution_id` is the trace key everywhere

- HTTP requests: execution_id in path or query param, not just in body.
- NATS messages: execution_id as a header / message attribute.
- Tracing spans: execution_id as a span field.
- Metrics: execution_id is NOT a label (cardinality bomb); it IS a span attribute for exemplar correlation.
- Logs: every WARN/ERROR carries execution_id as a structured field.

Boundary: execution_id rides every wire format. A recipient that gets a message without one should log a WARN, generate a synthetic id, and continue processing.
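
A minimal sketch of that boundary behaviour, using only the standard `logging` module, might look like the following. The header key `execution_id`, the logger name, and the `resolve_execution_id` helper are assumptions made for illustration; the rule only states that the id travels as a header or message attribute.

```python
import logging

logger = logging.getLogger("noetl.boundary")  # hypothetical logger name

HEADER = "execution_id"  # assumed header key


def resolve_execution_id(headers, make_synthetic_id):
    """Return (execution_id, was_synthetic) for an incoming message's headers."""
    raw = headers.get(HEADER)
    if raw:
        return str(raw), False
    # Missing id: warn with the id as a structured field, then carry on.
    synthetic = str(make_synthetic_id())
    logger.warning(
        "message arrived without execution_id; generated a synthetic id",
        extra={"execution_id": synthetic, "synthetic": True},
    )
    return synthetic, True


# Usage: one message carries the id, the other does not.
present = resolve_execution_id({"execution_id": "42"}, make_synthetic_id=lambda: 0)
missing = resolve_execution_id({}, make_synthetic_id=lambda: 99)
```

Returning the `was_synthetic` flag lets the caller mark the span, so traces built on a fallback id can be told apart from ones that carry the original.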

Per-component metric inventory

| Component | Endpoint | Metrics |
| --- | --- | --- |
| Rust worker | `/metrics` on each pod | `noetl_worker_commands_*`, `noetl_worker_tool_dispatch_*`, `noetl_worker_nats_consumer_*_pending`, `noetl_worker_result_store_put_*`, `noetl_worker_concurrent_dispatches` |
| Python server | `/metrics` | `noetl_server_*` (HTTP latency per route, event-log INSERT rate, command publish rate) |
| Projector | `/metrics` | `noetl_projector_inserted_total`, `noetl_projector_batch_size`, `noetl_projector_lag_seconds` |
| NATS supercluster | `:8222/jsz` | Stream + consumer lag (KEDA scaler reads these) |

