Summary
When a cron job's execution result fails to deliver to a messaging platform (Slack/Feishu), the result is permanently lost because there is no retry mechanism. This includes transient failures (network errors, rate limits) that would likely succeed on retry.
Current Behavior
delivery.go calls the platform API once. On failure, it logs log.Error("result permanently lost, no retry mechanism") and discards the result. A successfully executed job is indistinguishable from one that never ran.
Proposed Solution
Add an in-memory retry queue to the Delivery struct with:
- Max 3 attempts with exponential backoff (30s → 1m → 2m)
- Retriable errors: 429 (rate limit), timeout, network errors
- Permanent failures: 404 (channel deleted), 403 (token revoked) — logged and discarded immediately
- Bounded queue: max 100 pending deliveries; oldest discarded on overflow
- Metrics: cron_delivery_retry_total{status="success|exhausted|permanent"}
- Graceful shutdown: remaining items logged as permanently lost
Non-Goals
- Persistent retry queue (in-memory only; gateway restart discards pending)
- Dead letter queue
- Cross-gateway delivery
Spec
Full design: docs/specs/Cron-Delivery-Retry-Spec.md
Related
log.Error with "result permanently lost")