chore(logging): reduce noise, route only audit to OTel #913
Merged
alexluong
commented
May 26, 2026
Reserve info for process lifecycle and meaningful aggregate outcomes. Per-event/per-request handler lines in deliverymq, publishmq, logmq, and logretention move to debug. Keep "batch persisted" at info.
Successful 2xx/3xx traffic is expected and high-volume on the hot path; emitting an info line per request scales with RPS for no operational signal that metrics/traces don't already provide. 4xx stays at info (client errors worth surfacing), 5xx stays at error. Note: outpost does not currently persist request metrics, so 4xx visibility in production lives only in this log line. If metrics are added later, the 4xx case can be dropped to debug too.
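A minimal sketch of the status-to-level rule described above. The function name `levelForStatus` and the string level names are hypothetical; the actual middleware presumably calls the logger directly.

```go
package main

import "fmt"

// levelForStatus maps an HTTP response status to a log level:
// 5xx -> error, 4xx -> info, everything else (2xx/3xx) -> debug.
// Hypothetical helper, not taken from the Outpost codebase.
func levelForStatus(status int) string {
	switch {
	case status >= 500:
		return "error"
	case status >= 400:
		return "info"
	default:
		return "debug"
	}
}

func main() {
	for _, s := range []int{200, 302, 404, 503} {
		fmt.Printf("%d -> %s\n", s, levelForStatus(s))
	}
}
```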
Webhook 5xx, timeouts, and connection refusals from destination publish calls are expected operational outcomes: the retry scheduler handles them, the audit log captures the outcome, and the per-attempt record is written to ClickHouse via logmq. Logging them as Error (and again as "consumer handler error" via the consumer wrapper) inflates the error stream with no actionable signal.

- Drop the "failed to publish event" Error line in doHandle.
- In handleError, return nil for AttemptError wrapping ErrDestinationPublishAttempt so the consumer doesn't log it as an unexpected handler error. Ack/nack semantics are unchanged.
- Rename the audit message "event delivered" to "delivery attempt completed" since it fires for both success and failure outcomes.
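The handleError change can be sketched roughly as follows. Only the names `handleError`, `AttemptError`, and `ErrDestinationPublishAttempt` come from the commit message; their definitions here, and `errSystemFailure`, are assumptions for illustration, and ack/nack handling is not modelled.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical definitions; only the names mirror the commit message.
var ErrDestinationPublishAttempt = errors.New("destination publish attempt failed")

// errSystemFailure is an invented stand-in for any unexpected error.
var errSystemFailure = errors.New("system failure")

type AttemptError struct{ err error }

func (e *AttemptError) Error() string { return e.err.Error() }
func (e *AttemptError) Unwrap() error { return e.err }

// handleError returns nil for expected delivery failures so a consumer
// wrapper does not log them; every other error is passed through unchanged.
func handleError(err error) error {
	var attemptErr *AttemptError
	if errors.As(err, &attemptErr) && errors.Is(attemptErr, ErrDestinationPublishAttempt) {
		return nil
	}
	return err
}

func main() {
	expected := &AttemptError{err: fmt.Errorf("webhook returned 503: %w", ErrDestinationPublishAttempt)}
	fmt.Println(handleError(expected) == nil)         // expected failure is swallowed
	fmt.Println(handleError(errSystemFailure) == nil) // unexpected failure is returned
}
```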
Measure the wall time spent in publisher.PublishEvent and add it to the "delivery attempt completed" audit line so operators and customers can see per-attempt latency without joining against ClickHouse.
Replace the multi-line audit pattern in deliverymq and publishmq with a single wide-event audit per unit of work. Consumers no longer have to join across lines (or worry about ordering) to reconstruct what happened: the full outcome is in one event.

deliverymq emits "delivery.attempted" once per attempt with attempt result, timing (attempt_started_at, attempt_duration_ms), and retry decision (retry_scheduled, retry_backoff_ms, retry_canceled, plus retry_schedule_failed / retry_cancel_failed when relevant). Replaces "delivery attempt completed", "retry scheduled", and "scheduled retry canceled".

publishmq emits "event.received" once per Handle call with matched and enqueued destination lists, duplicate flag, received_at, and duration_ms. Replaces "processing event" and per-destination "delivery task enqueued".

System-failure ERROR lines (failed to schedule retry, failed to cancel scheduled retry, failed to enqueue delivery task, failed to match event destinations) remain; those are operator-actionable diagnostics that the wide event also flags via boolean fields.
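To make the shape concrete, here is a sketch of "delivery.attempted" as a single struct serialized to one JSON line. The field names follow the commit message; the Go types, the `msg` key, the `omitempty` choices, and the sample values are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// DeliveryAttempted is a hypothetical struct for the "delivery.attempted"
// wide event: timing and retry decision travel together in one record.
type DeliveryAttempted struct {
	Message             string    `json:"msg"`
	AttemptStartedAt    time.Time `json:"attempt_started_at"`
	AttemptDurationMs   int64     `json:"attempt_duration_ms"`
	RetryScheduled      bool      `json:"retry_scheduled"`
	RetryBackoffMs      int64     `json:"retry_backoff_ms,omitempty"`
	RetryCanceled       bool      `json:"retry_canceled"`
	RetryScheduleFailed bool      `json:"retry_schedule_failed,omitempty"`
	RetryCancelFailed   bool      `json:"retry_cancel_failed,omitempty"`
}

// JSON renders the event as one line; it returns "" if marshaling fails.
func (e DeliveryAttempted) JSON() string {
	b, err := json.Marshal(e)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	e := DeliveryAttempted{
		Message:           "delivery.attempted",
		AttemptStartedAt:  time.Now().UTC(),
		AttemptDurationMs: 42,
		RetryScheduled:    true,
		RetryBackoffMs:    30000,
	}
	fmt.Println(e.JSON())
}
```

Because every field lives on one record, a consumer can filter on `retry_scheduled` or `retry_schedule_failed` without correlating separate lines.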
Add fields requested for observability use cases:

delivery.attempted:
- topic, attempt_code, retry_id, attempt_max, eligible_for_retry
- rename attempt -> attempt_number to disambiguate from attempt_id

event.received:
- match_failed flag for MatchEvent failures
- rename received_at -> event_received_at for *_at field consistency
Split the logger so regular Info/Debug/Warn/Error calls go through a plain *zap.Logger (stdout only, no OTel export) and Audit() lines go through a separate otelzap.Logger (stdout + OTel logs SDK). This keeps operator diagnostics and debug noise out of the customer-visible OTel sink while preserving the wide-event audit stream for downstream consumers.

Trace correlation in stdout logs is preserved manually via a small traceFields helper that pulls trace_id / span_id from the active span into zap fields. otelzap did this automatically; with the OTel sink gated to Audit() we attach them ourselves.

The AuditLog config option / env var is removed; audit logging is now always enabled (audit lines were already a structural part of the event lifecycle, and the wide-event refactor made them the customer-facing source of truth).
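The routing idea can be sketched without third-party dependencies. This example deliberately swaps in the standard library's `log/slog` for zap/otelzap, and a counting writer stands in for the OTel logs SDK; the `Logger` type, `NewLogger`, and `memSink` are all hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"log/slog"
	"os"
)

// memSink counts writes; here it stands in for the OTel logs SDK export path.
type memSink struct{ writes int }

func (s *memSink) Write(p []byte) (int, error) { s.writes++; return len(p), nil }

// Logger sends regular lines to stdout only and Audit lines to stdout plus
// the export sink.
type Logger struct {
	plain *slog.Logger
	audit *slog.Logger
}

func NewLogger(stdout, exportSink io.Writer) *Logger {
	return &Logger{
		plain: slog.New(slog.NewJSONHandler(stdout, nil)),
		audit: slog.New(slog.NewJSONHandler(io.MultiWriter(stdout, exportSink), nil)),
	}
}

func (l *Logger) Info(msg string, args ...any)  { l.plain.Info(msg, args...) }
func (l *Logger) Audit(msg string, args ...any) { l.audit.Info(msg, args...) }

func main() {
	exported := &memSink{}
	log := NewLogger(os.Stdout, exported)
	log.Info("consumer started")                  // stdout only
	log.Audit("example audit line", "id", "x_1") // stdout and export sink
	fmt.Println("lines exported:", exported.writes)
}
```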
Per team discussion: the event lifecycle wide events are operator-relevant diagnostics, not customer-facing audit records. Drop them from the OTel audit sink so we don't push per-event / per-attempt records to the customer-visible stream. Stdout output and field shape unchanged.
alexbouchardd
approved these changes
May 28, 2026
Tighten Outpost logging. Three changes, each independently revertable.
Reduce noise. Per-event/per-request handler lines in deliverymq, publishmq, logmq, and log retention demoted info→debug. API request logs only emit at 400+ (4xx info, 5xx error, 2xx/3xx debug). Webhook 5xx/timeouts no longer log as Error — they're expected operational outcomes already captured by the audit and ClickHouse log.
Adopt wide events for audit. Replace the multi-line audit pattern (`processing event`, `delivery task enqueued`, `retry scheduled`, `scheduled retry canceled`, `delivery attempt completed`) with one rich event per unit of work: `event.received` and `delivery.attempted`. Each carries the full outcome (status, timings, retry decision, IDs) so consumers don't have to join across lines or worry about ordering. References:

Split sinks. Only `Audit()` lines flow to the OTel logs SDK; regular Info/Debug/Warn/Error stay on stdout. Operator-facing diagnostics and debug noise no longer leak to the customer-visible OTel sink. Trace correlation in stdout logs is preserved manually via trace_id/span_id fields.

Example logs: event delivery scenarios
Happy path — event published, matched 1 destination, delivered successfully
Webhook returns 5xx, retry eligible
Webhook 5xx, retry budget exhausted
Manual retry, success
Destination deleted / not found / disabled
Idempotency hit (duplicate delivery task)
No matching destinations
System failures (operator-actionable)