DLQ retry reports false success and silently drops failed events (no real queue-failure tracking) #208

## Summary

The activity-processor's dead-letter retry path reports success without actually re-processing events, so policy-evaluation failures are silently lost instead of tracked. Activities for these events are never generated, but every metric and log says the retry "succeeded." We need to (1) make queue-failure tracking truthful and actionable, and (2) fix the underlying policy failures it's currently hiding.

## Impact (product)

Audit events for resources created via `generateName` (e.g. Connector, Project, Domain creates) fail policy summary evaluation on ingest and **never produce an activity**. The user-visible activity timeline is silently missing these create events. This matches the "activities silently not being generated" impact already noted in the DLQ runbook, and it is happening in production today, not hypothetically.

## What's actually happening

1. Ingest evaluates the CEL summary template. For `generateName` creates, `objectRef.name` is absent (the assigned name lives in `responseObject.metadata.name`), so an unguarded `link(audit.objectRef.name, audit.objectRef)` errors with `no such key: name`. The event is published to the `ACTIVITY_DEAD_LETTER` queue.
2. The periodic DLQ retry worker **republishes the raw bytes back onto the audit stream and reports success as soon as the NATS publish ACKs; it never re-runs CEL.** So the `Successfully retried DLQ event` log line, `totalFailed: 0`, and the `dlq_retry_attempts_total{result="succeeded"}` metric mean only "the republish was accepted," not "evaluation passed."
3. The republished event re-enters the same ingest path, fails identically, and is re-queued as a fresh dead-letter entry (`retryCount` resets to 0 each time, so no retry limit is ever reached). After a few bounces the queue's dedup window suppresses further entries and the event is dropped.
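
To make the loop above concrete, here is a minimal, self-contained simulation. It is not the processor's code: `Pipeline`, `evaluate_summary`, and `DEDUP_LIMIT` are hypothetical names, and the dedup window is modelled as a simple bounce count rather than the time-based window the real queue presumably uses. It only illustrates how the "succeeded" counter can climb while no activity is ever produced.

```python
from dataclasses import dataclass, field

# Hypothetical: number of dead-letter entries accepted per auditID before
# "dedup" suppresses further ones. The real window is assumed to be time-based.
DEDUP_LIMIT = 3


def evaluate_summary(event: dict) -> str:
    """Stand-in for the unguarded template: reads objectRef.name directly."""
    # A KeyError here plays the role of CEL's "no such key: name".
    return event["objectRef"]["name"]


@dataclass
class Pipeline:
    dead_letter: list = field(default_factory=list)
    bounces: dict = field(default_factory=dict)  # auditID -> times dead-lettered
    retry_succeeded: int = 0  # what the retry metric counts today
    activities: int = 0       # what the user actually sees
    dropped: int = 0          # events lost to dedup

    def ingest(self, event: dict) -> None:
        try:
            evaluate_summary(event)
            self.activities += 1
        except KeyError:
            n = self.bounces.get(event["auditID"], 0) + 1
            self.bounces[event["auditID"]] = n
            if n > DEDUP_LIMIT:
                self.dropped += 1  # suppressed: the event vanishes here
            else:
                self.dead_letter.append(event)

    def retry_once(self) -> None:
        pending, self.dead_letter = self.dead_letter, []
        for event in pending:
            self.retry_succeeded += 1  # counted on publish ACK, before evaluation
            self.ingest(event)         # same path, same failure


# A generateName create: the assigned name is only in the response object.
event = {
    "auditID": "a1",
    "objectRef": {"resource": "connectors"},
    "responseObject": {"metadata": {"name": "connector-x7k2p"}},
}

pipeline = Pipeline()
pipeline.ingest(event)
while pipeline.dead_letter:
    pipeline.retry_once()
```

In this model the run ends with three "successful" retries, zero activities, one dropped event, and an empty dead-letter queue, which is the same shape as the production evidence below.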

## Evidence (prod, last 7 days)

- 450 evaluation failures across only **125 distinct `auditID`s**: each event failed 2–8 times, and **none ever succeeded**.
- `ACTIVITY_DEAD_LETTER` stream depth reads 0 not because events were resolved, but because they were acked and then dropped by dedup.
- Two failing templates observed: `*-connector` / `*-project` rule 0 (`objectRef.name`) and `*-domain` rule 4 (`requestObject.spec.domainName`).

## Asks

**Processor side (the core of this issue): make queue-failure tracking honest.**
- The retry path should distinguish "republished to NATS" from "successfully processed." A re-queued event that fails evaluation again must count as a failure, not a success.
- Track and surface persistent / repeatedly-failing dead-letter events (e.g. a real failure counter, a terminal "give up" state, and a queue-depth/age signal) so alerting can fire on genuine loss instead of self-clearing on arrival volume.
- Consider a poison-pill path so an event that has failed N times is parked/visible rather than silently dropped via dedup.
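
One possible shape for the processor-side asks, sketched under assumptions: `RetryTracker`, `Outcome`, and `MAX_ATTEMPTS` are hypothetical and not part of the existing code. The idea is to key the failure counter on `auditID` so it survives re-queues, record "republished" separately from the evaluation result, and park an event once it has failed N times.

```python
from collections import Counter
from enum import Enum

MAX_ATTEMPTS = 5  # hypothetical give-up threshold


class Outcome(Enum):
    REPUBLISHED = "republished"  # NATS accepted the publish; says nothing about CEL
    PROCESSED = "processed"      # evaluation passed and an activity was generated
    FAILED = "failed"            # evaluation failed again; will be retried
    PARKED = "parked"            # terminal: failed max_attempts times, kept visible


class RetryTracker:
    def __init__(self, max_attempts: int = MAX_ATTEMPTS):
        self.max_attempts = max_attempts
        self.failures = Counter()  # auditID -> evaluation failures across re-queues
        self.metrics = Counter()   # Outcome -> count, one metric label per outcome
        self.parked = {}           # auditID -> last error, for inspection/alerting

    def record_republish(self, audit_id: str) -> None:
        # Republish is tracked on its own and is never reported as success.
        self.metrics[Outcome.REPUBLISHED] += 1

    def record_result(self, audit_id: str, error=None) -> Outcome:
        if error is None:
            self.failures.pop(audit_id, None)
            outcome = Outcome.PROCESSED
        else:
            self.failures[audit_id] += 1
            if self.failures[audit_id] >= self.max_attempts:
                self.parked[audit_id] = error
                outcome = Outcome.PARKED
            else:
                outcome = Outcome.FAILED
        self.metrics[outcome] += 1
        return outcome


# Usage: the same event fails evaluation on every retry.
tracker = RetryTracker(max_attempts=3)
for _ in range(3):
    tracker.record_republish("a1")
    last = tracker.record_result("a1", error="no such key: name")
```

With this split, an alert can fire on a non-empty `parked` set or on a growing `FAILED` count, rather than on a "succeeded" metric that only reflects publish ACKs.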

**Policy side (stops the current bleeding):**
- Guard nested audit-field access with `has()` and prefer the populated name, e.g. `has(audit.responseObject.metadata.name) ? link(audit.responseObject.metadata.name, audit.objectRef) : (has(audit.objectRef.name) ? link(audit.objectRef.name, audit.objectRef) : '')`. Same guarding for the domain `requestObject.spec.domainName` case.
- Optional defense-in-depth: backfill `objectRef.name` from `responseObject.metadata.name` when building the CEL audit vars, so `generateName` creates always carry a usable name.
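
The optional backfill could look roughly like the following. This is an illustrative sketch over a plain dictionary, not the processor's actual types; `backfill_object_ref_name` is a hypothetical helper. It fills `objectRef.name` only when it is absent or empty, and leaves the input untouched.

```python
import copy


def backfill_object_ref_name(audit: dict) -> dict:
    """Return a copy of the audit vars with objectRef.name filled in if missing."""
    out = copy.deepcopy(audit)
    object_ref = out.get("objectRef") or {}
    out["objectRef"] = object_ref
    if object_ref.get("name"):
        return out  # name already present: leave it alone

    # For generateName creates the assigned name is only in the response.
    response = out.get("responseObject") or {}
    metadata = response.get("metadata") or {}
    assigned = metadata.get("name")
    if assigned:
        object_ref["name"] = assigned
    return out


# Usage: a generateName create with no objectRef.name.
event = {
    "auditID": "a1",
    "objectRef": {"resource": "connectors"},
    "responseObject": {"metadata": {"name": "connector-x7k2p"}},
}
patched = backfill_object_ref_name(event)
```

Even with a backfill in place, the `has()` guards in the policies remain worthwhile, since failed requests may carry no response object at all.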

The shipped example policies and `docs/runbooks/dlq/policy-dlq-errors.md` already use the `has()` guard pattern; the production policies do not.

