feat(cron): add delivery retry mechanism for transient failures #577

@hrygo

Summary

Cron job execution results that fail to deliver to messaging platforms (Slack/Feishu) are permanently lost because there is no retry mechanism. This includes transient failures (network errors, rate limits) that would likely succeed on retry.

Current Behavior

delivery.go calls the platform API once. On failure, it logs log.Error("result permanently lost, no retry mechanism") and discards the result. To the recipient, a job that executed successfully but failed to deliver is indistinguishable from one that never ran.

Proposed Solution

Add an in-memory retry queue to the Delivery struct with:

  • Max 3 attempts with exponential backoff (30s → 1m → 2m)
  • Retriable errors: 429 (rate limit), timeout, network errors
  • Permanent failures: 404 (channel deleted), 403 (token revoked) — logged and discarded immediately
  • Bounded queue: max 100 pending deliveries; oldest discarded on overflow
  • Metrics: cron_delivery_retry_total{status="success|exhausted|permanent"}
  • Graceful shutdown: remaining items logged as permanently lost
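
The retry policy above can be sketched in Go roughly as follows. This is an illustration under stated assumptions, not the design from the spec: StatusError, Classify, NextDelay, Pending, and Queue are hypothetical names that do not come from delivery.go. The sketch reads "max 3 attempts" as three retries after the initial call, because three delays are listed. It also treats statuses the issue does not mention as permanent, which is a guess.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"sync"
	"time"
)

// Class is the retry decision for a failed delivery.
type Class int

const (
	Retriable Class = iota // 429, timeouts, network errors
	Permanent              // 403, 404: log and discard immediately
)

// StatusError is a hypothetical error carrying the platform's HTTP status code.
type StatusError struct{ Code int }

func (e *StatusError) Error() string { return fmt.Sprintf("platform returned HTTP %d", e.Code) }

// Classify maps a delivery error to a retry decision.
func Classify(err error) Class {
	var se *StatusError
	if errors.As(err, &se) {
		if se.Code == 429 {
			return Retriable
		}
		// 403 and 404 are permanent per the issue; treating every other
		// status as permanent too is an assumption of this sketch.
		return Permanent
	}
	var ne net.Error
	if errors.Is(err, context.DeadlineExceeded) || errors.As(err, &ne) {
		return Retriable
	}
	return Permanent // unknown errors: assumed not worth retrying
}

const (
	baseDelay  = 30 * time.Second
	maxRetries = 3 // assumption: three retries after the initial call
)

// NextDelay returns the wait before retry n (1-based): 30s, 1m, 2m.
// ok is false once the retries are exhausted.
func NextDelay(n int) (delay time.Duration, ok bool) {
	if n < 1 || n > maxRetries {
		return 0, false
	}
	return baseDelay << uint(n-1), true
}

// Pending is a hypothetical queued delivery.
type Pending struct {
	ID    string
	Retry int
}

// Queue is a bounded FIFO that evicts the oldest entry on overflow.
type Queue struct {
	mu    sync.Mutex
	limit int
	items []Pending
}

func NewQueue(limit int) *Queue { return &Queue{limit: limit} }

// Push appends p. If the queue is full, it drops and returns the oldest
// entry so the caller can log it as lost.
func (q *Queue) Push(p Pending) (dropped Pending, overflow bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.items) > 0 && len(q.items) >= q.limit {
		dropped, overflow = q.items[0], true
		q.items = append(q.items[:0], q.items[1:]...)
	}
	q.items = append(q.items, p)
	return dropped, overflow
}

func main() {
	fmt.Println("429 retriable:", Classify(&StatusError{Code: 429}) == Retriable)
	for n := 1; n <= maxRetries+1; n++ {
		d, ok := NextDelay(n)
		fmt.Println("retry", n, "delay", d, "ok", ok)
	}
	q := NewQueue(100) // bound taken from the proposal
	q.Push(Pending{ID: "job-1", Retry: 1})
}
```

A worker would call Classify on each failure, discard Permanent results immediately, and re-enqueue Retriable ones with the delay from NextDelay until ok is false. Metrics and graceful shutdown are left out of the sketch.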

Non-Goals

  • Persistent retry queue (in-memory only; gateway restart discards pending)
  • Dead letter queue
  • Cross-gateway delivery

Spec

Full design: docs/specs/Cron-Delivery-Retry-Spec.md

Metadata

Assignees

No one assigned

    Labels

      • P2 (High: affects many users, daily occurrences)
      • area/cron (Scope: AI-native cronjob scheduler)
      • enhancement (Feature: new capabilities or improvements)
      • reliability (Domain: availability, error handling, recoverability)

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
