Observability

Kadyapam edited this page Jun 2, 2026 · 1 revision

Authoritative rule: agents/rules/observability.md · Logging companion: agents/rules/logging.md

TL;DR

NoETL is a distributed runtime: gateway → server → NATS → worker → tools → server (event log). When something misbehaves in production, the trace is the only way to find it. This rule constrains what gets traced and measured, and how IDs flow through the system.

Four principles

Principle 1 — Observability is part of the implementation

Every new feature ships with three artefacts in the same change set:

  1. A span covering the operation (tracing::info_span!("command.claim", ...)).
  2. At least one metric capturing throughput / latency / failure rate (noetl_*_total, noetl_*_duration_seconds, noetl_*_lag).
  3. execution_id correlation as a span attribute / metric exemplar.

"We'll add observability later" means production debugging becomes guesswork.
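The three artefacts can be sketched together in a few lines. This is a minimal illustration using only the Python standard library, not NoETL's actual instrumentation: the `observed` helper, the in-process `COUNTERS` / `DURATIONS` / `SPANS` registries, and the `noetl_demo_*` metric names are all hypothetical stand-ins for a real tracing and metrics client.

```python
import time
from collections import Counter, defaultdict
from contextlib import contextmanager

# Hypothetical in-process registries standing in for a real metrics/tracing client.
COUNTERS = Counter()
DURATIONS = defaultdict(list)  # metric name -> observed seconds
SPANS = []                     # finished spans, newest last


@contextmanager
def observed(operation, execution_id):
    """Open a span for `operation`; on exit record count, latency and outcome."""
    span = {"name": operation, "execution_id": execution_id, "status": "ok"}
    started = time.perf_counter()
    try:
        yield span
    except Exception:
        span["status"] = "error"
        raise
    finally:
        elapsed = time.perf_counter() - started
        span["duration_seconds"] = elapsed
        SPANS.append(span)
        # execution_id stays on the span; the metric is keyed only by status.
        COUNTERS[("noetl_demo_commands_total", span["status"])] += 1
        DURATIONS["noetl_demo_command_duration_seconds"].append(elapsed)


def claim_command(execution_id):
    """A feature that ships with its span, its metrics and its correlation id."""
    with observed("command.claim", execution_id) as span:
        span["claimed"] = True
        return span
```

Wrapping the operation in a single helper means the span, the counter and the latency observation cannot drift apart: a feature either gets all three or none.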

Principle 2 — Metrics over logs

Floods of INFO logs are not observability. Counters / histograms / gauges scale; logs don't.

| Want to know | Use | Not |
| --- | --- | --- |
| Throughput | counter noetl_*_total | grep INFO |
| Slow tail | histogram noetl_*_duration_seconds | grep slow lines |
| Backlog | gauge noetl_*_lag | watch debug logs |
| Per-execution failure | structured event + WARN/ERROR with execution_id | spam INFO |
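To make "metrics scale, logs don't" concrete, here is a rough sketch of a fixed-bucket histogram. It is an illustration only: the `Histogram` class and its bucket bounds are hypothetical and written with the standard library, whereas a real deployment would use a metrics client. The point is that storage stays at one counter per bucket regardless of how many events are observed.

```python
from bisect import bisect_left


class Histogram:
    """Fixed-bucket histogram: memory is constant however many events arrive."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the +Inf bucket
        self.total = 0
        self.sum = 0.0

    def observe(self, seconds):
        # bisect_left picks the first bucket whose upper bound is >= seconds.
        self.counts[bisect_left(self.bounds, seconds)] += 1
        self.total += 1
        self.sum += seconds

    def quantile_upper_bound(self, q):
        """Smallest bucket bound covering at least fraction q of observations."""
        target = q * self.total
        running = 0
        for bound, count in zip(self.bounds + [float("inf")], self.counts):
            running += count
            if running >= target:
                return bound
        return float("inf")
```

Answering "how slow is the tail?" then becomes a lookup over a handful of counters instead of a scan over every log line ever written.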

Principle 3 — Snowflake IDs come from the application

Identifiers that need to exist before the row hits the database — execution_id, command_id, event_id — are generated by the application using a snowflake algorithm. The database's gen_snowflake() stays as a fallback for out-of-band writes.

Reasons:

  • Spans need the id at span-open time, not after the round-trip.
  • Retries are idempotent only if the id is stable.
  • Cross-component coordination needs the id before publish.
  • Test fixtures need deterministic ids.
  • Sharded / multi-cluster deployments can't share a DB sequence.
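A minimal sketch of application-side generation follows. The bit layout (timestamp, 10-bit node, 12-bit sequence), the custom epoch, and the `SnowflakeGenerator` name are assumptions borrowed from the common snowflake scheme; the source does not state NoETL's actual layout or how it matches the database's gen_snowflake(). The injectable `clock` is what makes ids deterministic in test fixtures.

```python
import threading
import time

EPOCH_MS = 1_704_067_200_000  # 2024-01-01T00:00:00Z, an arbitrary custom epoch (assumption)
NODE_BITS = 10                # assumed layout: timestamp | node | sequence
SEQ_BITS = 12


class SnowflakeGenerator:
    """Generates unique, increasing 64-bit style ids without a database round-trip."""

    def __init__(self, node_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= node_id < (1 << NODE_BITS):
            raise ValueError("node_id out of range")
        self._node_id = node_id
        self._clock = clock
        self._lock = threading.Lock()
        self._last_ms = -1
        self._seq = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # Clock went backwards: reuse the last timestamp to stay monotonic.
                now = self._last_ms
            if now == self._last_ms:
                self._seq = (self._seq + 1) & ((1 << SEQ_BITS) - 1)
                if self._seq == 0:
                    # Sequence exhausted for this millisecond: borrow the next one.
                    now += 1
            else:
                self._seq = 0
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << (NODE_BITS + SEQ_BITS))
                | (self._node_id << SEQ_BITS)
                | self._seq
            )


def timestamp_ms(snowflake_id):
    """Recover the millisecond timestamp embedded in an id."""
    return (snowflake_id >> (NODE_BITS + SEQ_BITS)) + EPOCH_MS
```

Because the id exists before any write, the caller can open the span, attach the id to the outbound message, and retry the insert with the same value.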

Principle 4 — execution_id is the trace key everywhere

  • HTTP requests: execution_id in path or query param, not just in body.
  • NATS messages: execution_id as a header / message attribute.
  • Tracing spans: execution_id as a span field.
  • Metrics: execution_id is NOT a label (cardinality bomb); it IS a span attribute for exemplar correlation.
  • Logs: every WARN/ERROR carries execution_id as a structured field.

Boundary: execution_id rides every wire format. Recipients missing it should WARN, generate a synthetic id, and continue.
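The boundary behaviour can be sketched as a pair of helpers, one for each side of the wire. Everything named here is hypothetical: the `Noetl-Execution-Id` header name, the helper functions, and the choice of a random 63-bit integer as the synthetic id are illustrative assumptions, not NoETL's actual wire contract.

```python
import logging
import secrets

log = logging.getLogger("noetl.demo")
HEADER = "Noetl-Execution-Id"  # hypothetical header name


def outbound_headers(execution_id, extra=None):
    """Attach execution_id to the headers of an outgoing request or message."""
    headers = dict(extra or {})
    headers[HEADER] = str(execution_id)
    return headers


def execution_id_from(headers):
    """Return (execution_id, is_synthetic) for an inbound message's headers."""
    raw = headers.get(HEADER)
    if raw is not None and raw.isdigit():
        return int(raw), False
    # Missing or malformed: warn, generate a synthetic id, and continue.
    synthetic = secrets.randbits(63)
    log.warning(
        "missing execution_id; generated synthetic id",
        extra={"execution_id": synthetic, "synthetic": True},
    )
    return synthetic, True
```

Returning the `is_synthetic` flag alongside the id lets downstream spans mark the trace as incomplete rather than silently pretending the correlation is real.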

Per-component metric inventory

| Component | Endpoint | Metrics |
| --- | --- | --- |
| Rust worker | /metrics on each pod | noetl_worker_commands_*, noetl_worker_tool_dispatch_*, noetl_worker_nats_consumer_*_pending, noetl_worker_result_store_put_*, noetl_worker_concurrent_dispatches |
| Python server | /metrics | noetl_server_* (HTTP latency per route, event-log INSERT rate, command publish rate) |
| Projector | /metrics | noetl_projector_inserted_total, noetl_projector_batch_size, noetl_projector_lag_seconds |
| NATS supercluster | :8222/jsz | Stream + consumer lag (KEDA scaler reads these) |
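As a quick way to spot-check one of these endpoints, the sketch below parses unlabelled samples out of a Prometheus-style text exposition body. It assumes the /metrics endpoints serve that text format; the `parse_metrics` helper is hypothetical, and any sample values fed to it are invented for illustration. Labelled series are deliberately skipped to keep the parser short.

```python
def parse_metrics(text):
    """Parse unlabelled samples from Prometheus text exposition into a dict."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines, HELP and TYPE comments
        name, _, rest = line.partition(" ")
        fields = rest.split()
        if "{" in name or not fields:
            continue  # labelled or malformed lines are skipped in this sketch
        samples[name] = float(fields[0])
    return samples


def projector_is_lagging(text, threshold_seconds):
    """True when the projector lag gauge exceeds the given threshold."""
    lag = parse_metrics(text).get("noetl_projector_lag_seconds", 0.0)
    return lag > threshold_seconds
```

For anything beyond a spot check, a real Prometheus client or query is the better tool; this only shows that the inventory above is readable with a few lines of code.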
