chore(logging): reduce noise, route only audit to OTel by alexluong · Pull Request #913 · hookdeck/outpost

alexluong · 2026-05-26T15:19:23Z

Tighten Outpost logging. Three changes, each independently revertable.

Reduce noise. Per-event/per-request handler lines in deliverymq, publishmq, logmq, and log retention demoted info→debug. API request logs only emit at 400+ (4xx info, 5xx error, 2xx/3xx debug). Webhook 5xx/timeouts no longer log as Error — they're expected operational outcomes already captured by the audit and ClickHouse log.

Adopt wide events for audit. Replace the multi-line audit pattern (processing event, delivery task enqueued, retry scheduled, scheduled retry canceled, delivery attempt completed) with one rich event per unit of work: event.received and delivery.attempted. Each carries the full outcome (status, timings, retry decision, IDs) so consumers don't have to join across lines or worry about ordering. References:

https://jeremymorrell.dev/blog/a-practitioners-guide-to-wide-events/
https://stripe.com/blog/canonical-log-lines\
Question: should we add an audit log for every API request?

Split sinks. Only Audit() lines flow to the OTel logs SDK; regular Info/Debug/Warn/Error stay on stdout. Operator-facing diagnostics and debug noise no longer leak to the customer-visible OTel sink. Trace correlation in stdout logs is preserved manually via trace_id/span_id fields.

Example logs — event delivery scenarios

Happy path — event published, matched 1 destination, delivered successfully

INFO  "event.received"      { event_id, tenant_id, topic, matched_destination_ids: [d1], enqueued_destination_ids: [d1], matched_destination_count: 1, enqueued_destination_count: 1, duplicate: false, event_received_at, duration_ms: 7 }
DEBUG "processing delivery task" { event_id, tenant_id, destination_id: d1, attempt: 1 }
INFO  "delivery.attempted"  { attempt_id, event_id, tenant_id, topic, destination_id: d1, destination_type: "webhook", attempt_status: "success", attempt_code: "200", attempt_number: 1, attempt_max: 11, manual: false, eligible_for_retry: true, attempt_started_at, attempt_duration_ms: 142, retry_id, retry_scheduled: false, retry_canceled: false }
DEBUG "logmq handler"       { message_id }
DEBUG "added to batch"      { message_id, event_id, attempt_id, tenant_id }
DEBUG "processing batch"    { message_count: 1 }
INFO  "batch persisted"     { count: 1, insert_duration_ms: 8 }

Webhook returns 5xx, retry eligible

INFO  "event.received"      { ... }
DEBUG "processing delivery task" { ..., attempt: 1 }
INFO  "delivery.attempted"  { ..., attempt_status: "failed", attempt_code: "500", attempt_duration_ms: 1247, retry_scheduled: true, retry_backoff_ms: 30000, retry_canceled: false }
# logmq sequence as above

Webhook 5xx, retry budget exhausted

DEBUG "processing delivery task" { ..., attempt: 11 }
INFO  "delivery.attempted"  { ..., attempt_status: "failed", attempt_number: 11, attempt_max: 11, retry_scheduled: false, retry_canceled: false }

Manual retry, success

DEBUG "processing delivery task" { ..., manual: true }
INFO  "delivery.attempted"  { ..., attempt_status: "success", manual: true, retry_canceled: true }

Destination deleted / not found / disabled

DEBUG "processing delivery task" { ... }
DEBUG "destination deleted"      { event_id, tenant_id, destination_id }
# message ack'd, no delivery.attempted, no logmq entry

Idempotency hit (duplicate delivery task)

DEBUG "processing delivery task" { ... }
DEBUG "delivery task skipped (idempotent)" { ..., idempotency_key }
# no delivery.attempted, no logmq entry

No matching destinations

INFO  "event.received"      { ..., matched_destination_count: 0, enqueued_destination_count: 0 }
# nothing enqueued, no downstream activity

System failures (operator-actionable)

# publishmq, MatchEvent fails (DB/Redis down)
ERROR "failed to match event destinations" { error, event_id, tenant_id }
INFO  "event.received"      { ..., match_failed: true }

# publishmq, delivery MQ broken
ERROR "failed to enqueue delivery task" { error, event_id, tenant_id, destination_id }
INFO  "event.received"      { ..., enqueue_failed: true, enqueued_destination_count: 0 }

# deliverymq, scheduler unreachable
ERROR "failed to schedule retry"       { ... }
INFO  "delivery.attempted"  { ..., retry_schedule_failed: true }
ERROR "consumer handler error"         { name, error }

# deliverymq, logmq broken
ERROR "failed to publish attempt log"  { ... }
ERROR "consumer handler error"         { ... }

# logmq, ClickHouse insert fails
ERROR "failed to insert log entries"   { error, entry_count, insert_duration_ms }

Reserve info for process lifecycle and meaningful aggregate outcomes. Per-event/per-request handler lines in deliverymq, publishmq, logmq, and logretention move to debug. Keep "batch persisted" at info.

Successful 2xx/3xx traffic is expected and high-volume on the hot path; emitting an info line per request scales with RPS for no operational signal that metrics/traces don't already provide. 4xx stays at info (client errors worth surfacing), 5xx stays at error. Note: outpost does not currently persist request metrics, so 4xx visibility in production lives only in this log line. If metrics are added later, the 4xx case can be dropped to debug too.

Webhook 5xx, timeouts, and connection refusals from destination publish calls are expected operational outcomes — the retry scheduler handles them, the audit log captures the outcome, and the per-attempt record is written to ClickHouse via logmq. Logging them as Error (and again as "consumer handler error" via the consumer wrapper) inflates the error stream with no actionable signal. - Drop the "failed to publish event" Error line in doHandle. - In handleError, return nil for AttemptError wrapping ErrDestinationPublishAttempt so the consumer doesn't log it as an unexpected handler error. Ack/nack semantics are unchanged. - Rename the audit message "event delivered" to "delivery attempt completed" since it fires for both success and failure outcomes.

Measure the wall time spent in publisher.PublishEvent and add it to the "delivery attempt completed" audit line so operators and customers can see per-attempt latency without joining against ClickHouse.

Replace the multi-line audit pattern in deliverymq and publishmq with a single wide-event audit per unit of work. Consumers no longer have to join across lines (or worry about ordering) to reconstruct what happened: the full outcome is in one event. deliverymq emits "delivery.attempted" once per attempt with attempt result, timing (attempt_started_at, attempt_duration_ms), and retry decision (retry_scheduled, retry_backoff_ms, retry_canceled, plus retry_schedule_failed / retry_cancel_failed when relevant). Replaces "delivery attempt completed", "retry scheduled", and "scheduled retry canceled". publishmq emits "event.received" once per Handle call with matched and enqueued destination lists, duplicate flag, received_at, and duration_ms. Replaces "processing event" and per-destination "delivery task enqueued". System-failure ERROR lines (failed to schedule retry, failed to cancel scheduled retry, failed to enqueue delivery task, failed to match event destinations) remain — those are operator-actionable diagnostics that the wide event also flags via boolean fields.

Add fields requested for observability use cases: delivery.attempted: - topic, attempt_code, retry_id, attempt_max, eligible_for_retry - rename attempt -> attempt_number to disambiguate from attempt_id event.received: - match_failed flag for MatchEvent failures - rename received_at -> event_received_at for *_at field consistency

Split the logger so regular Info/Debug/Warn/Error calls go through a plain *zap.Logger (stdout only, no OTel export) and Audit() lines go through a separate otelzap.Logger (stdout + OTel logs SDK). This keeps operator diagnostics and debug noise out of the customer-visible OTel sink while preserving the wide-event audit stream for downstream consumers. Trace correlation in stdout logs is preserved manually via a small traceFields helper that pulls trace_id / span_id from the active span into zap fields — otelzap did this automatically; with the OTel sink gated to Audit() we attach them ourselves. The AuditLog config option / env var is removed; audit logging is now always enabled (audit lines were already a structural part of the event lifecycle, and the wide-event refactor made them the customer- facing source of truth).

Per team discussion: the event lifecycle wide events are operator-relevant diagnostics, not customer-facing audit records. Drop them from the OTel audit sink so we don't push per-event / per-attempt records to the customer-visible stream. Stdout output and field shape unchanged.

alexluong force-pushed the chore/logging-demotions branch from 93c7910 to 2c4ca90 Compare May 26, 2026 15:58

alexluong commented May 26, 2026

View reviewed changes

Comment thread internal/apirouter/logger_middleware.go

Comment thread internal/logging/logger.go Outdated

alexluong force-pushed the chore/logging-demotions branch 2 times, most recently from bae7bd9 to 8512d86 Compare May 27, 2026 08:48

alexluong added 2 commits May 27, 2026 15:50

chore(logging): demote most info logs to debug

ada72bf

Reserve info for process lifecycle and meaningful aggregate outcomes. Per-event/per-request handler lines in deliverymq, publishmq, logmq, and logretention move to debug. Keep "batch persisted" at info.

alexluong force-pushed the chore/logging-demotions branch from 8512d86 to ed53b67 Compare May 27, 2026 08:50

alexluong added 5 commits May 27, 2026 19:50

chore(logging): include attempt_duration_ms in delivery attempt audit

2fe20d8

Measure the wall time spent in publisher.PublishEvent and add it to the "delivery attempt completed" audit line so operators and customers can see per-attempt latency without joining against ClickHouse.

alexluong marked this pull request as ready for review May 27, 2026 15:43

alexluong force-pushed the chore/logging-demotions branch from 99ea523 to d9a61ca Compare May 27, 2026 18:46

alexluong changed the title ~~chore(logging): rethink log levels and infra-value exposure~~ chore(logging): reduce noise, adopt wide events, gate OTel to audit May 27, 2026

alexluong changed the title ~~chore(logging): reduce noise, adopt wide events, gate OTel to audit~~ chore(logging): reduce noise, route only audit to OTel May 27, 2026

alexbouchardd approved these changes May 28, 2026

View reviewed changes

alexluong merged commit 05d984b into main May 28, 2026
2 checks passed

alexluong deleted the chore/logging-demotions branch May 28, 2026 16:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore(logging): reduce noise, route only audit to OTel#913

chore(logging): reduce noise, route only audit to OTel#913
alexluong merged 8 commits into
mainfrom
chore/logging-demotions

alexluong commented May 26, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

alexluong commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

alexluong commented May 26, 2026 •

edited

Loading