DLQ retry reports false success and silently drops failed events (no real queue-failure tracking) #208

@scotwells

Summary

The activity-processor's dead-letter retry path reports success without actually re-processing events, so policy-evaluation failures are silently lost instead of tracked. Activities for these events are never generated, but every metric and log says the retry "succeeded." We need to (1) make queue-failure tracking truthful and actionable, and (2) fix the underlying policy failures it's currently hiding.

Impact (product)

Audit events created via generateName (e.g. Connector, Project, Domain creates) fail policy summary evaluation on ingest and never produce an activity. The user-visible activity timeline is silently missing these create events. This matches the "activities silently not being generated" impact already noted in the DLQ runbook — it is actually happening in production today, not hypothetically.

What's actually happening

  1. Ingest evaluates the CEL summary template. For generateName creates, objectRef.name is absent (the assigned name lives in responseObject.metadata.name), so an unguarded link(audit.objectRef.name, audit.objectRef) errors with no such key: name. The event is published to the ACTIVITY_DEAD_LETTER queue.
  2. The periodic DLQ retry worker republishes the raw bytes back onto the audit stream and reports success as soon as the NATS publish ACKs; it never re-runs CEL. As a result, the "Successfully retried DLQ event" log line, totalFailed: 0, and the dlq_retry_attempts_total{result="succeeded"} metric only mean "the republish was accepted," not "evaluation passed."
  3. The republished event re-enters the same ingest path, fails identically, and is re-queued as a fresh dead-letter entry (retryCount resets to 0 each time). After a few bounces the queue's dedup window suppresses further entries and the event is dropped.
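
The loop in steps 1 to 3 can be sketched as a small simulation. This is a hypothetical model, not the processor's code: evaluate_summary, ingest, retry_dlq, and simulate are illustrative names, plain deques stand in for the NATS streams, and the dedup window is not modeled. It only shows how the "succeeded" counter climbs while no activity is ever produced.

```python
from collections import deque


def evaluate_summary(audit: dict) -> str:
    """Stand-in for the unguarded template: fails when objectRef.name is absent."""
    return "created " + audit["objectRef"]["name"]  # KeyError for generateName creates


def ingest(audit: dict, dead_letter: deque, activities: list) -> None:
    """Evaluate the event; on failure, enqueue a fresh dead-letter entry."""
    try:
        activities.append(evaluate_summary(audit))
    except KeyError:
        # A new entry is created each time, so retryCount starts at 0 again.
        dead_letter.append({"event": audit, "retryCount": 0})


def retry_dlq(dead_letter: deque, audit_stream: deque, metrics: dict) -> None:
    """Republish every dead-letter entry and count it as a success on publish."""
    while dead_letter:
        entry = dead_letter.popleft()
        audit_stream.append(entry["event"])  # republish the raw event
        metrics["succeeded"] += 1  # counted before any evaluation happens


def simulate(bounces: int):
    """Run the ingest/retry cycle a fixed number of times for one bad event."""
    event = {
        "auditID": "a-1",  # hypothetical values
        "objectRef": {"resource": "projects"},  # no "name": generateName create
        "responseObject": {"metadata": {"name": "proj-x7k2p"}},
    }
    audit_stream, dead_letter = deque([event]), deque()
    activities, metrics = [], {"succeeded": 0}
    for _ in range(bounces):
        while audit_stream:
            ingest(audit_stream.popleft(), dead_letter, activities)
        retry_dlq(dead_letter, audit_stream, metrics)
    return activities, metrics


activities, metrics = simulate(3)
print(activities, metrics)  # [] {'succeeded': 3}
```

Three bounces yield three reported successes and zero activities, which is the mismatch described above.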

Evidence (prod, last 7 days)

  • 450 evaluation failures across only 125 distinct auditIDs — every event re-failed 2–8 times, and none ever succeeded.
  • ACTIVITY_DEAD_LETTER stream depth reads 0, not because events were resolved, but because they were acked and then dropped by dedup.
  • Two failing templates observed: *-connector / *-project rule 0 (objectRef.name) and *-domain rule 4 (requestObject.spec.domainName).

Asks

Processor side (the core of this issue): make queue-failure tracking honest.

  • The retry path should distinguish "republished to NATS" from "successfully processed." A re-queued event that fails evaluation again must count as a failure, not a success.
  • Track and surface persistent / repeatedly-failing dead-letter events (e.g. a real failure counter, a terminal "give up" state, and a queue-depth/age signal) so alerting can fire on genuine loss instead of self-clearing on arrival volume.
  • Consider a poison-pill path so an event that has failed N times is parked/visible rather than silently dropped via dedup.
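
One possible shape for this tracking, as a minimal sketch only. Outcome, FailureTracker, and max_failures are hypothetical names, not existing processor types. The assumption is that failures are counted per auditID, so the count survives a re-queue even though the dead-letter entry's retryCount resets, and that an event reaching the limit is parked rather than left to dedup.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    REPUBLISHED = "republished"  # publish ACKed only; nothing has been evaluated yet
    PROCESSED = "processed"      # evaluation passed after the retry
    FAILED = "failed"            # evaluation failed again; still eligible for retry
    PARKED = "parked"            # terminal state: gave up after max_failures


@dataclass
class FailureTracker:
    max_failures: int = 3
    failures: dict = field(default_factory=dict)  # auditID -> failure count
    parked: set = field(default_factory=set)      # auditIDs in the terminal state

    def record(self, audit_id: str, evaluation_ok: bool) -> Outcome:
        """Record the result of re-evaluating a retried event."""
        if audit_id in self.parked:
            return Outcome.PARKED
        if evaluation_ok:
            self.failures.pop(audit_id, None)  # clear history on real success
            return Outcome.PROCESSED
        count = self.failures.get(audit_id, 0) + 1
        self.failures[audit_id] = count
        if count >= self.max_failures:
            self.parked.add(audit_id)  # keep it visible instead of dropping it
            return Outcome.PARKED
        return Outcome.FAILED


tracker = FailureTracker(max_failures=3)
for _ in range(3):
    outcome = tracker.record("a-1", evaluation_ok=False)
print(outcome, len(tracker.parked))  # Outcome.PARKED 1
```

In this sketch a publish ACK would be reported as REPUBLISHED, and the "succeeded" metric would be incremented only on PROCESSED; the size of parked could back a queue-depth style alert.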

Policy side (stops the current bleeding):

  • Guard nested audit-field access with has() and prefer the populated name, e.g. has(audit.responseObject.metadata.name) ? link(audit.responseObject.metadata.name, audit.objectRef) : (has(audit.objectRef.name) ? link(audit.objectRef.name, audit.objectRef) : ''). Apply the same guard to the domain requestObject.spec.domainName case.
  • Optional defense-in-depth: backfill objectRef.name from responseObject.metadata.name when building the CEL audit vars, so generateName creates always carry a usable name.
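
The defense-in-depth backfill could look roughly like the following. This is an illustrative sketch: backfill_object_ref_name is a hypothetical helper, and it assumes the audit event is available as a plain nested dict before the CEL variables are built.

```python
import copy


def backfill_object_ref_name(audit: dict) -> dict:
    """Return a copy of the event with objectRef.name filled in when it is absent.

    The name is taken from responseObject.metadata.name, which is where the
    assigned name lives for generateName creates. The input is not mutated.
    """
    patched = copy.deepcopy(audit)
    ref = patched.setdefault("objectRef", {})
    if not ref.get("name"):
        response = patched.get("responseObject") or {}
        metadata = response.get("metadata") or {}
        name = metadata.get("name")
        if name:
            ref["name"] = name
    return patched


event = {
    "objectRef": {"resource": "projects"},  # hypothetical generateName create
    "responseObject": {"metadata": {"name": "proj-x7k2p"}},
}
print(backfill_object_ref_name(event)["objectRef"]["name"])  # proj-x7k2p
```

An event with neither name is returned without objectRef.name, so templates still need the has() guard.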

The shipped example policies and docs/runbooks/dlq/policy-dlq-errors.md already use the has() guard pattern; the production policies do not.

Labels: bug
