Skip to content

chore(logging): reduce noise, route only audit to OTel#913

Merged
alexluong merged 8 commits into
mainfrom
chore/logging-demotions
May 28, 2026
Merged

chore(logging): reduce noise, route only audit to OTel#913
alexluong merged 8 commits into
mainfrom
chore/logging-demotions

Conversation

@alexluong
Copy link
Copy Markdown
Collaborator

@alexluong alexluong commented May 26, 2026

Tighten Outpost logging. Three changes, each independently revertable.

Reduce noise. Per-event/per-request handler lines in deliverymq, publishmq, logmq, and log retention demoted info→debug. API request logs only emit at 400+ (4xx info, 5xx error, 2xx/3xx debug). Webhook 5xx/timeouts no longer log as Error — they're expected operational outcomes already captured by the audit and ClickHouse log.

Adopt wide events for audit. Replace the multi-line audit pattern (processing event, delivery task enqueued, retry scheduled, scheduled retry canceled, delivery attempt completed) with one rich event per unit of work: event.received and delivery.attempted. Each carries the full outcome (status, timings, retry decision, IDs) so consumers don't have to join across lines or worry about ordering. References:

Split sinks. Only Audit() lines flow to the OTel logs SDK; regular Info/Debug/Warn/Error stay on stdout. Operator-facing diagnostics and debug noise no longer leak to the customer-visible OTel sink. Trace correlation in stdout logs is preserved manually via trace_id/span_id fields.

Example logs — event delivery scenarios

Happy path — event published, matched 1 destination, delivered successfully

INFO  "event.received"      { event_id, tenant_id, topic, matched_destination_ids: [d1], enqueued_destination_ids: [d1], matched_destination_count: 1, enqueued_destination_count: 1, duplicate: false, event_received_at, duration_ms: 7 }
DEBUG "processing delivery task" { event_id, tenant_id, destination_id: d1, attempt: 1 }
INFO  "delivery.attempted"  { attempt_id, event_id, tenant_id, topic, destination_id: d1, destination_type: "webhook", attempt_status: "success", attempt_code: "200", attempt_number: 1, attempt_max: 11, manual: false, eligible_for_retry: true, attempt_started_at, attempt_duration_ms: 142, retry_id, retry_scheduled: false, retry_canceled: false }
DEBUG "logmq handler"       { message_id }
DEBUG "added to batch"      { message_id, event_id, attempt_id, tenant_id }
DEBUG "processing batch"    { message_count: 1 }
INFO  "batch persisted"     { count: 1, insert_duration_ms: 8 }

Webhook returns 5xx, retry eligible

INFO  "event.received"      { ... }
DEBUG "processing delivery task" { ..., attempt: 1 }
INFO  "delivery.attempted"  { ..., attempt_status: "failed", attempt_code: "500", attempt_duration_ms: 1247, retry_scheduled: true, retry_backoff_ms: 30000, retry_canceled: false }
# logmq sequence as above

Webhook 5xx, retry budget exhausted

DEBUG "processing delivery task" { ..., attempt: 11 }
INFO  "delivery.attempted"  { ..., attempt_status: "failed", attempt_number: 11, attempt_max: 11, retry_scheduled: false, retry_canceled: false }

Manual retry, success

DEBUG "processing delivery task" { ..., manual: true }
INFO  "delivery.attempted"  { ..., attempt_status: "success", manual: true, retry_canceled: true }

Destination deleted / not found / disabled

DEBUG "processing delivery task" { ... }
DEBUG "destination deleted"      { event_id, tenant_id, destination_id }
# message ack'd, no delivery.attempted, no logmq entry

Idempotency hit (duplicate delivery task)

DEBUG "processing delivery task" { ... }
DEBUG "delivery task skipped (idempotent)" { ..., idempotency_key }
# no delivery.attempted, no logmq entry

No matching destinations

INFO  "event.received"      { ..., matched_destination_count: 0, enqueued_destination_count: 0 }
# nothing enqueued, no downstream activity

System failures (operator-actionable)

# publishmq, MatchEvent fails (DB/Redis down)
ERROR "failed to match event destinations" { error, event_id, tenant_id }
INFO  "event.received"      { ..., match_failed: true }

# publishmq, delivery MQ broken
ERROR "failed to enqueue delivery task" { error, event_id, tenant_id, destination_id }
INFO  "event.received"      { ..., enqueue_failed: true, enqueued_destination_count: 0 }

# deliverymq, scheduler unreachable
ERROR "failed to schedule retry"       { ... }
INFO  "delivery.attempted"  { ..., retry_schedule_failed: true }
ERROR "consumer handler error"         { name, error }

# deliverymq, logmq broken
ERROR "failed to publish attempt log"  { ... }
ERROR "consumer handler error"         { ... }

# logmq, ClickHouse insert fails
ERROR "failed to insert log entries"   { error, entry_count, insert_duration_ms }

@alexluong alexluong force-pushed the chore/logging-demotions branch from 93c7910 to 2c4ca90 Compare May 26, 2026 15:58
Comment thread internal/apirouter/logger_middleware.go
Comment thread internal/logging/logger.go Outdated
@alexluong alexluong force-pushed the chore/logging-demotions branch 2 times, most recently from bae7bd9 to 8512d86 Compare May 27, 2026 08:48
alexluong added 2 commits May 27, 2026 15:50
Reserve info for process lifecycle and meaningful aggregate outcomes.
Per-event/per-request handler lines in deliverymq, publishmq, logmq, and
logretention move to debug. Keep "batch persisted" at info.
Successful 2xx/3xx traffic is expected and high-volume on the hot path;
emitting an info line per request scales with RPS for no operational signal
that metrics/traces don't already provide. 4xx stays at info (client errors
worth surfacing), 5xx stays at error.

Note: outpost does not currently persist request metrics, so 4xx visibility
in production lives only in this log line. If metrics are added later, the
4xx case can be dropped to debug too.
@alexluong alexluong force-pushed the chore/logging-demotions branch from 8512d86 to ed53b67 Compare May 27, 2026 08:50
alexluong added 5 commits May 27, 2026 19:50
Webhook 5xx, timeouts, and connection refusals from destination publish
calls are expected operational outcomes — the retry scheduler handles
them, the audit log captures the outcome, and the per-attempt record is
written to ClickHouse via logmq. Logging them as Error (and again as
"consumer handler error" via the consumer wrapper) inflates the error
stream with no actionable signal.

- Drop the "failed to publish event" Error line in doHandle.
- In handleError, return nil for AttemptError wrapping
  ErrDestinationPublishAttempt so the consumer doesn't log it as an
  unexpected handler error. Ack/nack semantics are unchanged.
- Rename the audit message "event delivered" to "delivery attempt
  completed" since it fires for both success and failure outcomes.
Measure the wall time spent in publisher.PublishEvent and add it to the
"delivery attempt completed" audit line so operators and customers can
see per-attempt latency without joining against ClickHouse.
Replace the multi-line audit pattern in deliverymq and publishmq with a
single wide-event audit per unit of work. Consumers no longer have to
join across lines (or worry about ordering) to reconstruct what
happened: the full outcome is in one event.

deliverymq emits "delivery.attempted" once per attempt with attempt
result, timing (attempt_started_at, attempt_duration_ms), and retry
decision (retry_scheduled, retry_backoff_ms, retry_canceled, plus
retry_schedule_failed / retry_cancel_failed when relevant). Replaces
"delivery attempt completed", "retry scheduled", and "scheduled retry
canceled".

publishmq emits "event.received" once per Handle call with matched and
enqueued destination lists, duplicate flag, received_at, and duration_ms.
Replaces "processing event" and per-destination "delivery task enqueued".

System-failure ERROR lines (failed to schedule retry, failed to cancel
scheduled retry, failed to enqueue delivery task, failed to match event
destinations) remain — those are operator-actionable diagnostics that
the wide event also flags via boolean fields.
Add fields requested for observability use cases:

delivery.attempted:
- topic, attempt_code, retry_id, attempt_max, eligible_for_retry
- rename attempt -> attempt_number to disambiguate from attempt_id

event.received:
- match_failed flag for MatchEvent failures
- rename received_at -> event_received_at for *_at field consistency
Split the logger so regular Info/Debug/Warn/Error calls go through a
plain *zap.Logger (stdout only, no OTel export) and Audit() lines go
through a separate otelzap.Logger (stdout + OTel logs SDK). This keeps
operator diagnostics and debug noise out of the customer-visible OTel
sink while preserving the wide-event audit stream for downstream
consumers.

Trace correlation in stdout logs is preserved manually via a small
traceFields helper that pulls trace_id / span_id from the active span
into zap fields — otelzap did this automatically; with the OTel sink
gated to Audit() we attach them ourselves.

The AuditLog config option / env var is removed; audit logging is now
always enabled (audit lines were already a structural part of the
event lifecycle, and the wide-event refactor made them the customer-
facing source of truth).
@alexluong alexluong marked this pull request as ready for review May 27, 2026 15:43
Per team discussion: the event lifecycle wide events are
operator-relevant diagnostics, not customer-facing audit records.
Drop them from the OTel audit sink so we don't push per-event /
per-attempt records to the customer-visible stream. Stdout output
and field shape unchanged.
@alexluong alexluong force-pushed the chore/logging-demotions branch from 99ea523 to d9a61ca Compare May 27, 2026 18:46
@alexluong alexluong changed the title chore(logging): rethink log levels and infra-value exposure chore(logging): reduce noise, adopt wide events, gate OTel to audit May 27, 2026
@alexluong alexluong changed the title chore(logging): reduce noise, adopt wide events, gate OTel to audit chore(logging): reduce noise, route only audit to OTel May 27, 2026
@alexluong alexluong merged commit 05d984b into main May 28, 2026
2 checks passed
@alexluong alexluong deleted the chore/logging-demotions branch May 28, 2026 16:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants