Improving reliability of scheduled GitHub Actions runs (cron is best-effort) #656

don-petry (Maintainer) · 2026-06-13T22:34:38Z

Problem
Our scheduled (cron) workflows are not firing reliably. GitHub's schedule event is explicitly best-effort, not a guarantee, and we're now hitting that limitation in production.
Observed in petry-projects/.github (2026-06-13):
After merging #445 (compliance-retrigger 0 5 * * * → hourly 0 * * * *), the hourly cron produced zero scheduled runs in the ~2 hours we watched. We had to fire it manually via workflow_dispatch.
Historically the "05:00 UTC" daily run actually executed at 08:19 / 08:55 / 09:10 UTC on consecutive days — 3–4 hours late.
This matters because the compliance re-trigger sweep is what drains stale compliance-audit findings across the fleet. If it silently doesn't run, backlogs sit untouched, and a sweep that never ran is indistinguishable from a healthy one.
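The lateness quoted above can be checked with a few lines of arithmetic. This is only a sketch: the helper name `minutes_late` is made up for illustration, and it assumes the scheduled and actual times fall on the same UTC day.

```python
from datetime import datetime

def minutes_late(scheduled: str, actual: str) -> int:
    """Minutes between a scheduled HH:MM and the actual HH:MM start (same UTC day assumed)."""
    fmt = "%H:%M"
    delta = datetime.strptime(actual, fmt) - datetime.strptime(scheduled, fmt)
    return int(delta.total_seconds() // 60)

# Start times observed for the nominal 05:00 UTC daily run (from this post).
observed = ["08:19", "08:55", "09:10"]
delays = [minutes_late("05:00", t) for t in observed]
print(delays)  # [199, 235, 250]
```

That is 3h19m, 3h55m, and 4h10m late, which matches the "3–4 hours" figure.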
Why this happens (per GitHub docs)
"The schedule event can be delayed during periods of high loads of GitHub Actions workflow runs. High load times include the start of every hour. If the load is sufficiently high enough, some queued jobs may be dropped."
"To decrease the chance of delay, schedule your workflow to run at a different time of the hour."
Other relevant caveats:
Shortest interval is every 5 minutes.
Scheduled workflows run only on the default branch's latest commit (we satisfy this).
In public repos, scheduled workflows are auto-disabled after 60 days of no repo activity.
The community reports routine 29–60 minute delays, and 0 * * * * (top of the hour) is the single most congested slot.
Candidate solutions — ranked by cost & complexity

Move crons off the top of the hour ⭐ (lowest cost, do first)
One-line change: 0 * * * * → e.g. 17 * * * * or 23 * * * *. Directly targets the number-one documented delay cause. Zero infra, no secrets, and it applies to every scheduled workflow we own. Recommended as an immediate, standalone win.
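As a rough sketch of how that rewrite could be scripted across many workflow files, the helper below swaps only the minute field of a five-field cron expression. The name `move_off_peak` is hypothetical, and it deliberately touches only expressions whose minute field is exactly `0`.

```python
def move_off_peak(cron: str, minute: int) -> str:
    """Replace a top-of-the-hour minute field with an off-peak minute.

    Expressions whose minute field is not exactly "0" are returned unchanged.
    """
    fields = cron.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 cron fields, got {len(fields)}: {cron!r}")
    if not 1 <= minute <= 59:
        raise ValueError("off-peak minute must be between 1 and 59")
    if fields[0] != "0":
        return cron
    fields[0] = str(minute)
    return " ".join(fields)

print(move_off_peak("0 * * * *", 17))  # 17 * * * *
print(move_off_peak("0 5 * * *", 23))  # 23 5 * * *
```

Leaving already-offset expressions alone keeps the change safe to re-run.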
Lean into idempotency + slightly higher frequency
compliance-retrigger.sh is already idempotent and throttled (≤1 engagement per repo per run, skips active repos). That means extra ticks are harmless and missed ticks self-heal on the next tick. Running every ~15–20 min at an odd offset (e.g. 8,28,48 * * * *) makes any single dropped tick a non-event. Each run is ~15s and cheap. Low cost; the design already supports it.
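To put a number on "a non-event", here is a small sketch that computes the longest wait between consecutive ticks in a repeating hourly pattern, with and without a dropped minute. The function name is hypothetical, and the model is simplified: it treats a dropped minute as missing in every hour, which gives the same worst-case gap as a single drop.

```python
def worst_gap_minutes(minutes: list[int], dropped: frozenset[int] = frozenset()) -> int:
    """Longest wait between consecutive surviving ticks in an hourly-repeating schedule."""
    fired = sorted(m for m in minutes if m not in dropped)
    if not fired:
        raise ValueError("no ticks left after dropping")
    # Pair each tick with the next one, wrapping from the last tick into the next hour.
    pairs = zip(fired, fired[1:] + fired[:1])
    return max((b - a) % 60 or 60 for a, b in pairs)

print(worst_gap_minutes([8, 28, 48]))                   # 20
print(worst_gap_minutes([8, 28, 48], frozenset({28})))  # 40
```

So with 8,28,48 * * * * a single dropped tick stretches the gap from 20 to 40 minutes, still shorter than one interval of the current hourly schedule.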
External heartbeat → workflow_dispatch (most reliable backstop)
A free external scheduler (e.g. cron-job.org, or any always-on box) calls the REST workflow_dispatch endpoint using a fine-grained PAT. It fires independently of GitHub's scheduler, so it avoids the schedule-event congestion above. Low monetary cost; moderate complexity (a secret/token lives off-platform and must be rotated).
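For reference, a minimal sketch of the call the external scheduler would make, using the "create a workflow dispatch event" REST endpoint. The workflow file name `compliance-retrigger.yml`, the `main` ref, and the `GH_DISPATCH_TOKEN` variable are assumptions for illustration; the helper only builds the request so it can be inspected without network access.

```python
import json
import os
import urllib.request

def build_dispatch_request(owner: str, repo: str, workflow_file: str,
                           ref: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the workflow_dispatch REST endpoint."""
    url = (f"https://api.github.com/repos/{owner}/{repo}"
           f"/actions/workflows/{workflow_file}/dispatches")
    body = json.dumps({"ref": ref}).encode("utf-8")
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
        "X-GitHub-Api-Version": "2022-11-28",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

# Hypothetical values: the workflow file name and ref are assumptions.
req = build_dispatch_request("petry-projects", ".github", "compliance-retrigger.yml",
                             "main", os.environ.get("GH_DISPATCH_TOKEN", "placeholder"))
print(req.get_method(), req.full_url)
# To actually fire it: urllib.request.urlopen(req)  (a 2xx status indicates success)
```

The target workflow needs an `on: workflow_dispatch` trigger, and the token needs write access to Actions on that repository.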
Event-chaining instead of cron (workflow_run)
Trigger the sweep from the completion of an already-frequent, activity-driven workflow (e.g. CI on main) via workflow_run. Piggybacks on events that fire from real activity rather than the unreliable scheduler. No new infra; complexity is in picking a workflow that runs often enough.
Watchdog / self-heal workflow
A small workflow that asks: did the expected run happen in the last N minutes? If not, re-dispatch. Caveat: a scheduled watchdog inherits the same unreliability, so it only helps if it is itself off-peak (option 1) or event-driven (option 4). Medium complexity; some value as defense-in-depth.
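The watchdog's decision is simple enough to sketch on its own. The function below is a hypothetical illustration: it assumes the caller has already fetched the start times of recent runs (for example from the workflow runs REST API) and only answers the "did it run in the last N minutes?" question.

```python
from datetime import datetime, timedelta, timezone

def needs_redispatch(run_started_at: list[datetime], now: datetime,
                     max_age_minutes: int) -> bool:
    """True when no run started within the last `max_age_minutes` before `now`."""
    cutoff = now - timedelta(minutes=max_age_minutes)
    return not any(cutoff <= started <= now for started in run_started_at)

# Illustrative timestamps only, not measured data.
now = datetime(2026, 6, 13, 22, 0, tzinfo=timezone.utc)
stale = [now - timedelta(minutes=90)]
fresh = [now - timedelta(minutes=10)]
print(needs_redispatch(stale, now, 60))  # True
print(needs_redispatch(fresh, now, 60))  # False
```

An empty run list also returns True, which covers the case where the schedule never fired at all.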
Out of scope (not low-cost)
Self-hosted runners give exact-time control but add infra, security surface, and maintenance — explicitly not low cost/complexity, noted only for completeness.
Recommendation
Adopt option 1 (off-peak minute) fleet-wide now as the standard for all schedule: triggers (cheap, immediate, codify in standards/ci-standards.md), and add option 3 (external heartbeat) or option 4 (event-chaining) as a backstop for the workflows we actually depend on (compliance-retrigger first). Option 2 is essentially free given the sweep's existing idempotent and throttled design, and it pairs well with option 1.
Open questions
Do we standardize a single off-peak minute (e.g. :17) across the org, or stagger per workflow to avoid self-inflicted bursts?
For the backstop, do we prefer keeping everything in-platform (option 4) or accept an external dependency for stronger guarantees (option 3)?
Should ci-standards.md outright ban 0 * * * * for new workflows?
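If we go the stagger-per-workflow route, one possible approach (a sketch, not a decided standard) is to derive each workflow's minute deterministically from its name, so nobody has to hand-assign slots. The function name and the default `avoid` set are assumptions; only minute 0 is excluded by default because that is the slot the docs call out.

```python
import hashlib

def off_peak_minute(workflow_name: str, avoid=(0,)) -> int:
    """Pick a stable minute in 0..59, skipping the minutes listed in `avoid`."""
    allowed = [m for m in range(60) if m not in avoid]
    if not allowed:
        raise ValueError("every minute is excluded")
    digest = hashlib.sha256(workflow_name.encode("utf-8")).digest()
    # sha256 is stable across runs and machines, unlike the built-in hash().
    return allowed[int.from_bytes(digest[:4], "big") % len(allowed)]

minute = off_peak_minute("compliance-retrigger")
print(f"{minute} * * * *")
```

Two workflows can still hash to the same minute, so this spreads load rather than guaranteeing unique slots.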
Sources
GitHub Docs — Events that trigger workflows (schedule)
community/discussions #156282 — Unexpected delay in scheduled workflows
community/discussions #147369 — cron not running at the specified time
Predicting GitHub Cron Delays — lowlysre

don-petry (Maintainer, Author) · 2026-06-14T20:52:26Z

Let's start with options one and two. Let's implement this offset as part of our organizational standard for workflows and adjust all existing workflows across the org.
There should also be some follow-up analysis to report on the success rate of scheduled workflow runs. Let's incorporate that into the fleet health daily check.
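One way the success-rate metric could be defined, as a sketch only: count an expected tick as a hit if some distinct run started within a tolerance window after it. The function name, the 30-minute tolerance, and the timestamps are all illustrative assumptions, not real run data.

```python
from datetime import datetime, timedelta

def schedule_success_rate(expected: list[datetime], actual: list[datetime],
                          tolerance: timedelta) -> float:
    """Fraction of expected ticks matched by a distinct run within `tolerance` after the tick."""
    if not expected:
        return 1.0
    remaining = sorted(actual)
    hits = 0
    for tick in sorted(expected):
        match = next((r for r in remaining if tick <= r <= tick + tolerance), None)
        if match is not None:
            hits += 1
            remaining.remove(match)  # each run can satisfy only one tick
    return hits / len(expected)

# Illustrative day: three expected ticks at :17, two observed runs.
day = datetime(2026, 6, 14)
expected = [day.replace(hour=h, minute=17) for h in (0, 1, 2)]
actual = [day.replace(hour=0, minute=20), day.replace(hour=2, minute=45)]
rate = schedule_success_rate(expected, actual, timedelta(minutes=30))
print(f"{rate:.0%}")  # 67%
```

The tolerance doubles as the definition of "too late to count", so it should be set per workflow.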


github-actions[bot] (Bot) · 2026-06-14T21:00:42Z

📋 Initiative planned by the BMAD Scrum Master (Bob).

Epic #722 — Reliable scheduled Actions: off-peak cron offsets, scheduling standard, and schedule-reliability reporting

4 stories created (inert — labelled initiative, NOT initiative:auto):

#723 (M): [Phase 1] Move all scheduled workflows off the top of the hour
#724 (M): [Phase 1] Codify the off-peak scheduling standard + CI compliance check
#725 (L): [Phase 2] Add scheduled-run reliability analysis to the fleet health daily check
#726 (M): [Phase 3] Roll the off-peak scheduling standard out org-wide

Open questions for review:

Standardize a single org-wide off-peak minute (e.g. :17) versus stagger per workflow? Story 1 defaults to staggered distinct minutes to avoid a self-inflicted burst; confirm the preference before the org-wide rollout (Story 4), since a single shared minute would re-cluster load.
Should the schedule-lint in Story 2 hard-fail CI or warn-only on a reintroduced 0 * * * * cron (idea open question 3)? Defaulting to a check that reports/flags; the fail-vs-warn severity is a policy call.
Backstop options 3 (external heartbeat -> workflow_dispatch) and 4 (event-chaining via workflow_run) were NOT selected for this initiative (owner chose options 1 and 2). They are deferred to a future idea if off-peak offsetting plus idempotency proves insufficient for the workflows we depend on.
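For the schedule-lint question above, the core check could look something like the sketch below. It is a deliberately simple regex scan rather than a YAML parser: it assumes each cron expression is a quoted string on a single `cron:` line, and it only flags minute fields that list a literal 0 (step and range forms such as */15 are not evaluated). The function name is hypothetical.

```python
import re

# Matches lines such as:  - cron: '0 * * * *'   or   - cron: "17 * * * *"
CRON_LINE = re.compile(r"""cron:\s*['"]([^'"]+)['"]""")

def top_of_hour_crons(workflow_text: str) -> list[str]:
    """Return cron expressions whose minute field includes a literal 0."""
    flagged = []
    for expr in CRON_LINE.findall(workflow_text):
        fields = expr.split()
        if fields and any(part in ("0", "00") for part in fields[0].split(",")):
            flagged.append(expr)
    return flagged

sample = """
on:
  schedule:
    - cron: '0 * * * *'
    - cron: "17 * * * *"
"""
print(top_of_hour_crons(sample))  # ['0 * * * *']
```

Whether a non-empty result fails CI or just emits a warning is the policy call left open above.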

Review the epic and its sub-issue DAG, adjust as needed, then add initiative:auto to epic #722 to hand it to initiative-driver for auto-implementation.
