Skip to content

fix: make DLQ retry outcome-aware so failures stop reporting false success#209

Merged
scotwells merged 1 commit into
mainfrom
fix/dlq-retry-failure-tracking
Jun 25, 2026
Merged

fix: make DLQ retry outcome-aware so failures stop reporting false success#209
scotwells merged 1 commit into
mainfrom
fix/dlq-retry-failure-tracking

Conversation

@scotwells

@scotwells scotwells commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Problem

The dead-letter retry path reported success without ever re-processing events. It republished the raw bytes and counted the NATS-publish ACK as succeeded — it never re-ran CEL. So persistent policy-evaluation failures (e.g. generateName creates for Connector / Project / Domain) looked resolved in every metric and log, while the events bounced back into ingest, failed identically, reset their retry count, and were eventually suppressed by the dedup window. The user-visible activity timeline silently dropped those create events, and alerting self-cleared on arrival volume instead of firing on real loss.

What changed

  • Retry worker is now outcome-aware. Before treating a retry as resolved, it re-runs audit policy evaluation in place using the existing EvaluateCompiledAuditRules path. Only events that actually pass evaluation get republished to generate an activity.
  • Re-failures are tracked, not hidden. An event that fails re-evaluation is counted as a genuine failure and its retry count is advanced via updateAndRepublishMetadata (so it accumulates instead of resetting to 0), rather than being republished into the dedup window where it would be silently dropped.
  • Honest metrics. The misleading dlq_retry_attempts_total{result="succeeded"} is split: evaluated-and-resolved = succeeded, put-back-without-observing = republished, and a new activity_processor_dlq_retry_failed_total{api_group,kind,policy_name,error_type} surfaces genuine persistent loss. The existing events_high_retry_total gauge now fires correctly because retry counts advance.
  • Alert metric-name fix. ActivityGenerationStalled and ActivityProcessorHighErrorRate referenced activity_processor_audit_events_{received,errored}_total, which the code never registers — the alerts matched nothing. Repointed to activity_processor_events_{received,errored}_total{source="audit_log"}.
  • Shared classifyEvaluationError helper across ingest and retry so error-type classification stays consistent.

Deferred / follow-up

  • Production ActivityPolicy has() guards (the policy-side fix that stops the current bleeding) — those *-connector / *-project / *-domain policies are deployed as CRs outside this repo, so they aren't changed here.
  • Terminal poison-pill / durable parked stream for events over AlertThreshold (needs new NATS stream infra that can't be safely validated in this change).
  • Event-keyed KV ledger and full retry-latency histogram suite.

Testing

  • go build ./..., go vet ./... clean.
  • go test ./... passes (full suite), including internal/activityprocessor and internal/processor.
  • Added unit tests for the new accounting: classifyEvaluationError table tests and reEvaluateDeadLetter outcome cases (unmarshal failure, no-policy-resolves).

Refs #208

The dead-letter retry worker republished raw bytes and counted the
NATS-publish ACK as success without ever re-running CEL. Persistent
policy-evaluation failures looked resolved, the events bounced back
into ingest, failed identically, and were eventually dedup-dropped, so
activities for those audit events were silently never generated.

Key changes:
- Re-run audit policy evaluation in place (via the existing
  EvaluateCompiledAuditRules path) before treating a retry as resolved.
  On eval failure the event is counted as a real failure and its retry
  count is advanced through updateAndRepublishMetadata instead of being
  republished into the dedup window; only resolved events are republished.
- Split the misleading attempts metric: re-evaluated successes count as
  "succeeded", events with no evaluator count as "republished", and the
  new activity_processor_dlq_retry_failed_total{api_group,kind,policy_name,
  error_type} counter surfaces genuine persistent loss.
- Fix the activity-processor alerts that referenced the non-existent
  activity_processor_audit_events_{received,errored}_total metrics; point
  them at activity_processor_events_{received,errored}_total{source="audit_log"}.
- Reuse a shared classifyEvaluationError helper across the ingest and
  retry paths.

Deferred (noted in PR): event-keyed KV ledger, terminal poison-pill
parked stream, and the production-side ActivityPolicy has() guards
(those policies live outside this repo).

Claude-Session: https://claude.ai/code/session_01KnYuL5Pf1R5ysZoxxNkKiu
@scotwells scotwells marked this pull request as ready for review June 25, 2026 14:51
@scotwells scotwells merged commit 4c427b5 into main Jun 25, 2026
6 checks passed
@scotwells scotwells deleted the fix/dlq-retry-failure-tracking branch June 25, 2026 15:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants