fix(cron): auto-disable cron jobs whose target channel is unreachable by LIU9293 · Pull Request #211 · odefun/ode

LIU9293 · 2026-05-31T19:07:07Z

What

When a cron job's send hits a permanent channel-access error (channel_not_found, not_in_channel, is_archived, token_revoked, Discord Missing Access/Unknown Channel, Lark chat_not_found, …), the scheduler now disables the job and records the reason in last_error. The cron row stops firing until a human re-enables it.

Why

Sentry ODE-DEAMON-7 — "IM slack send failed: channel_not_found" — captured 5 events over ~5 weeks, all from a single cron job whose bot had been removed from channel C0ATGCJ0YK0. Without intervention the job would keep firing every cron tick forever, each failure capturing a fresh Sentry event with the same fingerprint. This converts that infinite drip into a one-shot auto-disable + log.

This is a behaviour fix, not a Sentry-noise suppression. The Sentry event already correctly indicates "the bot may have been removed" (see existing isBenignDeliveryFailure test); the right product response is to stop firing the job, which is what this PR does.

Design notes

Cross-platform detection lives in packages/shared/delivery/permanent-error.ts alongside the existing isRateLimitError. It is intentionally narrow: transient failures (5xx, ECONNRESET, 429, socket hangups, Slack message_not_found) explicitly stay retryable.
Two call sites in runCronJob:
1. Success-path send: agent turn completed, but final delivery failed → if permanent, disable.
2. Failure-notification send: agent turn already failed → if the notify error is permanent, disable. Critically the check is on the notify error, not the original turn error, so a transient agent timeout doesn't disable a job whose channel is fine.
Disable bookkeeping (patchCronJob({ enabled: false }) + markCronJobFailed + agent_result failAgentResult) is wrapped in best-effort try/catch so a SQLite hiccup never shadows the original delivery error.

Tests

9 new unit tests in permanent-error.test.ts cover Slack message strings, Slack SDK data.error shape, Discord numeric codes (50001, 10003), Lark messages, and explicit negatives (rate limits, ECONNRESET, 5xx, message_not_found).
Full suite still green: bun test → 414 pass, 1 skip, 0 fail.

Out of scope

Auto re-enable when access is restored. The right re-enable signal is a human /invite event, which is a separate observability problem; for now ode cron enable <id> is the recovery path.
Auto-disabling tasks (ode task) — tasks are single-shot so the noise pattern is bounded; cron is the recurring case that needs this guard.

When a cron job's send hits a permanent channel-access error (channel_not_found, not_in_channel, is_archived, token_revoked, etc.), the scheduler now disables the job and records the reason. Previously the same broken job would re-fire on every cron tick and capture an identical Sentry event for the lifetime of the daemon — see Sentry ODE-DEAMON-7 (5 events over ~5 weeks from one stuck job). - Add @/shared/delivery/permanent-error with cross-platform detection (Slack error strings, Discord numeric codes, Lark chat_not_found) and intentionally narrow classification: transient failures stay retryable. - Wire it into both the success-path send and the failure-notification send in runCronJob. The check on the notify path uses the *notify* error, not the original turn error, so a transient agent timeout doesn't disable a job when the channel is fine. - Add unit tests covering Slack/Discord/Lark error shapes and ensuring rate-limits / 5xx / ECONNRESET / message_not_found stay retryable.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 566300571b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-31T19:09:17Z

+    }
+  }
+  try {
+    patchCronJob(job.id, { enabled: false });


Disable stale-channel cron rows without revalidating config

When a cron row points at a channel that has since been removed from the Ode config, this patchCronJob call re-enters updateCronJob(), which calls getChannelSnapshot() and throws before enabled is flipped. In that stale-config/unreachable-channel case the new auto-disable path only logs the failure, leaves the row enabled, and the scheduler continues firing and producing the same permanent channel_not_found/missing-access error on every tick; the same pattern is duplicated in the failure-notification path below.

Useful? React with 👍 / 👎.

Good catch — fixed in 46fe172. Added disableCronJob(id) direct-SQL helper that flips enabled = 0 without re-validating channel config via getChannelSnapshot, so the stale-channel case (which is exactly the scenario this PR targets) now actually disables the row instead of silently swallowing the error. Both call sites switched. New test asserts patchCronJob throws while disableCronJob succeeds when the channel was removed from ode.json.

Addresses Codex P2 review on PR #211. The previous patch routed auto-disable through patchCronJob -> updateCronJob -> getChannelSnapshot, which throws "Channel not found in configured workspaces" when the cron row's destination channel has since been removed from ode.json. In that exact case (stale local config, bot kicked, workspace re-onboarded) the auto-disable was silently swallowed and the row stayed enabled — the scheduler would keep firing and capturing the same channel_not_found Sentry event every tick, defeating the point of the PR. Add a direct-SQL disableCronJob(id) helper that flips enabled=0 without re-validating channel config, and use it from both call sites (success-path send and failure-notification send). Tests: new disableCronJob suite covers - happy path (enabled true -> false, returns true) - idempotent re-disable (returns false) - missing row (returns false) - stale-channel case where patchCronJob throws but disableCronJob still succeeds (the regression Codex flagged) bun test: 418 pass, 1 skip, 0 fail.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 46fe1725ca

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-01T19:10:47Z

+    const code = (err as { code?: unknown }).code;
+    if (typeof code === "number" && PERMANENT_DISCORD_CODES.has(code)) {
+      return true;


Handle wrapped Discord channel failures

For Discord cron sends, this numeric-code check usually never sees the DiscordAPIError: sendChannelMessage calls resolveTextChannel, which catches channels.fetch errors and then rethrows a generic Error("Discord channel ... is not text-based or inaccessible") (packages/ims/discord/client.ts:143-157). In the deleted/private-channel cases this change is meant to auto-disable, the error arriving here has no code, so the job stays enabled and keeps retrying instead of being disabled.

Useful? React with 👍 / 👎.

chatgpt-codex-connector Bot reviewed May 31, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 1, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(cron): auto-disable cron jobs whose target channel is unreachable#211

fix(cron): auto-disable cron jobs whose target channel is unreachable#211
LIU9293 wants to merge 2 commits into
mainfrom
fix/cron-auto-disable-channel-not-found-04526a90

LIU9293 commented May 31, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 31, 2026

Uh oh!

LIU9293 Jun 1, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

LIU9293 commented May 31, 2026

What

Why

Design notes

Tests

Out of scope

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 31, 2026

Choose a reason for hiding this comment

Uh oh!

LIU9293 Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant