fix: make DLQ retry outcome-aware so failures stop reporting false success #209
Merged
The dead-letter retry worker republished raw bytes and counted the
NATS-publish ACK as success without ever re-running CEL. Persistent
policy-evaluation failures looked resolved, the events bounced back
into ingest, failed identically, and were eventually dedup-dropped, so
activities for those audit events were silently never generated.
Key changes:
- Re-run audit policy evaluation in place (via the existing
EvaluateCompiledAuditRules path) before treating a retry as resolved.
On eval failure the event is counted as a real failure and its retry
count is advanced through updateAndRepublishMetadata instead of being
republished into the dedup window; only resolved events are republished.
- Split the misleading attempts metric: re-evaluated successes count as
"succeeded", events with no evaluator count as "republished", and the
new activity_processor_dlq_retry_failed_total{api_group,kind,policy_name,
error_type} counter surfaces genuine persistent loss.
- Fix the activity-processor alerts that referenced the non-existent
activity_processor_audit_events_{received,errored}_total metrics; point
them at activity_processor_events_{received,errored}_total{source="audit_log"}.
- Reuse a shared classifyEvaluationError helper across the ingest and
retry paths.
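
The retry decision described in the key changes above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `Evaluator`, `DeadLetter`, `retryOne`, and `errStillFailing` are hypothetical names, and the real path goes through `EvaluateCompiledAuditRules` and `updateAndRepublishMetadata`, whose signatures are not shown here.

```go
package main

import (
	"errors"
	"fmt"
)

// Outcome mirrors the three results named in the change description.
type Outcome string

const (
	Succeeded   Outcome = "succeeded"   // re-evaluated and resolved
	Republished Outcome = "republished" // put back without observing the result
	Failed      Outcome = "failed"      // re-evaluation failed again
)

// Evaluator is a hypothetical stand-in for the real evaluation path;
// its signature is an assumption made for this sketch.
type Evaluator func(raw []byte) error

// DeadLetter is a simplified, hypothetical view of a dead-lettered event.
type DeadLetter struct {
	Raw        []byte
	RetryCount int
}

// errStillFailing is a hypothetical error used to simulate a persistent failure.
var errStillFailing = errors.New("policy evaluation still failing")

// retryOne decides what happens to one dead-lettered event. It returns the
// outcome and whether the event should be republished into ingest.
func retryOne(dl *DeadLetter, eval Evaluator) (Outcome, bool) {
	if eval == nil {
		// No evaluator available: republish, but do not claim success.
		return Republished, true
	}
	if err := eval(dl.Raw); err != nil {
		// Still failing: advance the retry count and keep it out of ingest.
		dl.RetryCount++
		return Failed, false
	}
	// Evaluation passed: safe to republish so an activity is generated.
	return Succeeded, true
}

func main() {
	failing := func([]byte) error { return errStillFailing }
	passing := func([]byte) error { return nil }

	dl := &DeadLetter{Raw: []byte(`{}`)}

	outcome, republish := retryOne(dl, failing)
	fmt.Println(outcome, republish, dl.RetryCount)

	outcome, republish = retryOne(dl, passing)
	fmt.Println(outcome, republish, dl.RetryCount)

	outcome, republish = retryOne(dl, nil)
	fmt.Println(outcome, republish, dl.RetryCount)
}
```

The point of returning the outcome separately from the republish flag is that the metric label and the republish decision no longer depend on the publish ACK, which is what produced the false "succeeded" counts.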
Deferred (noted in PR): event-keyed KV ledger, terminal poison-pill
parked stream, and the production-side ActivityPolicy has() guards
(those policies live outside this repo).
Claude-Session: https://claude.ai/code/session_01KnYuL5Pf1R5ysZoxxNkKiu
kevwilliams
approved these changes
Jun 25, 2026
mattdjenkinson
approved these changes
Jun 25, 2026
This was referenced Jun 25, 2026
Problem
The dead-letter retry path reported success without ever re-processing events. It republished the raw bytes and counted the NATS-publish ACK as `succeeded`; it never re-ran CEL. So persistent policy-evaluation failures (e.g. `generateName` creates for Connector / Project / Domain) looked resolved in every metric and log, while the events bounced back into ingest, failed identically, reset their retry count, and were eventually suppressed by the dedup window. The user-visible activity timeline silently dropped those create events, and alerting self-cleared on arrival volume instead of firing on real loss.

What changed
- Retry now re-runs audit policy evaluation in place via the existing `EvaluateCompiledAuditRules` path. Only events that actually pass evaluation get republished to generate an activity.
- On evaluation failure, the event is counted as a real failure and its retry count is advanced through `updateAndRepublishMetadata` (so it accumulates instead of resetting to 0), rather than being republished into the dedup window where it would be silently dropped.
- The misleading `dlq_retry_attempts_total{result="succeeded"}` is split: evaluated-and-resolved = `succeeded`, put-back-without-observing = `republished`, and a new `activity_processor_dlq_retry_failed_total{api_group,kind,policy_name,error_type}` surfaces genuine persistent loss. The existing `events_high_retry_total` gauge now fires correctly because retry counts advance.
- `ActivityGenerationStalled` and `ActivityProcessorHighErrorRate` referenced `activity_processor_audit_events_{received,errored}_total`, which the code never registers, so the alerts matched nothing. Repointed to `activity_processor_events_{received,errored}_total{source="audit_log"}`.
- Reuse a shared `classifyEvaluationError` helper across ingest and retry so error-type classification stays consistent.

Deferred / follow-up
- Production-side ActivityPolicy `has()` guards (the policy-side fix that stops the current bleeding): those `*-connector` / `*-project` / `*-domain` policies are deployed as CRs outside this repo, so they aren't changed here.
- Event-keyed KV ledger and terminal poison-pill parked stream […] `AlertThreshold` (needs new NATS stream infra that can't be safely validated in this change).

Testing
- `go build ./...`, `go vet ./...` clean.
- `go test ./...` passes (full suite), including `internal/activityprocessor` and `internal/processor`.
- `classifyEvaluationError` table tests and `reEvaluateDeadLetter` outcome cases (unmarshal failure, no-policy-resolves).

Refs #208
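
For readers unfamiliar with the pattern, a table-driven check of a shared error classifier might look something like the sketch below. It is illustrative only: `classifyErrorSketch`, the sentinel errors, and the label values are all hypothetical, and the actual categories returned by `classifyEvaluationError` are not listed in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors; the real error values are not shown in the PR.
var (
	errCompile   = errors.New("cel compile error")
	errEvaluate  = errors.New("cel evaluation error")
	errUnrelated = errors.New("connection reset")
)

// classifyErrorSketch maps an error to a small, fixed set of label values so
// that both the ingest and retry paths report the same error_type strings.
// The label names here are assumptions made for illustration.
func classifyErrorSketch(err error) string {
	switch {
	case err == nil:
		return "none"
	case errors.Is(err, errCompile):
		return "compile"
	case errors.Is(err, errEvaluate):
		return "evaluate"
	default:
		return "unknown"
	}
}

func main() {
	cases := []struct {
		name string
		err  error
		want string
	}{
		{"nil", nil, "none"},
		{"compile", errCompile, "compile"},
		{"wrapped evaluate", fmt.Errorf("policy p1: %w", errEvaluate), "evaluate"},
		{"unrelated", errUnrelated, "unknown"},
	}
	for _, c := range cases {
		got := classifyErrorSketch(c.err)
		fmt.Printf("%-18s got=%s ok=%t\n", c.name, got, got == c.want)
	}
}
```

Matching with `errors.Is` rather than string comparison means wrapped errors classify the same way on both paths, and the fixed `default` branch keeps the label set bounded.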