Replies: 2 comments
Let's start with options one and two. Let's implement this offset as part of our organizational standard for workflows and adjust all existing workflows across the org.
📋 Initiative planned by the BMAD Scrum Master (Bob).
Epic #722: Reliable scheduled Actions: off-peak cron offsets, scheduling standard, and schedule-reliability reporting
4 stories created (inert, labelled
Open questions for review:
Review the epic and its sub-issue DAG, adjust as needed, then add
Problem
Our scheduled (cron) workflows are not firing reliably. GitHub's schedule event is explicitly best-effort, not a guarantee, and we're now hitting that limitation in production.
Observed in petry-projects/.github (2026-06-13):
After merging #445 (compliance-retrigger 0 5 * * * → hourly 0 * * * *), the hourly cron produced zero scheduled runs in the ~2 hours we watched. We had to fire it manually via workflow_dispatch.
Historically, the "05:00 UTC" daily run actually executed at 08:19, 08:55, and 09:10 UTC on consecutive days, which is between 3 h 19 min and 4 h 10 min late.
This matters because the compliance re-trigger sweep is what drains stale compliance-audit findings across the fleet; if it silently doesn't run, backlogs sit untouched and "successful but never ran" is indistinguishable from healthy.
Why this happens (per GitHub docs)
"The schedule event can be delayed during periods of high loads of GitHub Actions workflow runs. High load times include the start of every hour. If the load is sufficiently high enough, some queued jobs may be dropped."
"To decrease the chance of delay, schedule your workflow to run at a different time of the hour."
Other relevant caveats:
Shortest interval is every 5 minutes.
Scheduled workflows run only on the default branch's latest commit (we satisfy this).
In public repos, scheduled workflows are auto-disabled after 60 days of no repo activity.
The community reports routine 29–60 minute delays, and 0 * * * * (top of the hour) is the single most congested slot.
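The top-of-hour caveat can be checked mechanically. The following is a minimal sketch, not an existing tool: `expand_minutes` and `hits_top_of_hour` are hypothetical helper names, and the parser only looks at the minute field of a standard 5-field cron expression (`*`, lists, ranges, and `/` steps).

```python
from __future__ import annotations


def expand_minutes(cron: str) -> set[int]:
    """Expand the minute field of a 5-field cron expression into concrete minutes."""
    fields = cron.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 cron fields, got {len(fields)}: {cron!r}")

    minutes: set[int] = set()
    for part in fields[0].split(","):
        base, _, step = part.partition("/")
        stride = int(step) if step else 1
        if base == "*":
            lo, hi = 0, 59
        elif "-" in base:
            lo_text, hi_text = base.split("-", 1)
            lo, hi = int(lo_text), int(hi_text)
        else:
            lo = int(base)
            # "20/15" means "from minute 20 to 59, every 15 minutes".
            hi = 59 if step else lo
        minutes.update(range(lo, hi + 1, stride))
    return minutes


def hits_top_of_hour(cron: str) -> bool:
    """True if the schedule fires at minute 0, the most congested slot."""
    return 0 in expand_minutes(cron)


if __name__ == "__main__":
    for expr in ("0 * * * *", "17 * * * *", "8,28,48 * * * *"):
        label = "top of hour" if hits_top_of_hour(expr) else "off-peak"
        print(f"{expr!r}: {label}")
```

A check like this could run over every `schedule:` entry in the org to list the workflows that still sit on minute 0.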
Candidate solutions — ranked by cost & complexity
1. Off-peak minute. One-line change: 0 * * * * → e.g. 17 * * * * / 23 * * * *. Directly targets the #1 documented delay cause. Zero infra, no secrets, applies to every scheduled workflow we own. Recommended as an immediate, standalone win.
2. More frequent runs at an odd offset. compliance-retrigger.sh is already idempotent and throttled (≤1 engagement per repo per run, skips active repos). That means extra ticks are harmless and missed ticks self-heal on the next tick. Running every ~15–20 min at an odd offset (e.g. 8,28,48 * * * *) makes any single dropped tick a non-event. Each run is ~15s and cheap. Low cost; the design already supports it.
3. External heartbeat. A free external scheduler (e.g. cron-job.org, or any always-on box) calls the REST workflow_dispatch endpoint using a fine-grained PAT. It fires independently of GitHub's queue, so it's immune to the congestion above. Low monetary cost; moderate complexity (a secret/token lives off-platform and must be rotated).
4. Event-chaining. Trigger the sweep from the completion of an already-frequent, activity-driven workflow (e.g. CI on main) via workflow_run. Piggybacks on events that fire from real activity rather than the unreliable scheduler. No new infra; complexity is in picking a workflow that runs often enough.
5. Watchdog. A small workflow that asks "did the expected run happen in the last N minutes? if not, re-dispatch." Caveat: a scheduled watchdog inherits the same unreliability, so it only helps if it's itself off-peak (#1) or event-driven (#4). Medium complexity; some value as defense-in-depth.
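The external-heartbeat and watchdog ideas above share two pieces: a staleness check ("did the expected run happen in the last N minutes?") and a call to the REST workflow_dispatch endpoint. A rough sketch of both follows. The function names, the workflow file name `compliance-retrigger.yml`, and the `main` ref are assumptions for illustration; the request is only built here, not sent.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone
from typing import Optional


def needs_redispatch(
    last_run_at: Optional[datetime], now: datetime, max_age: timedelta
) -> bool:
    """True if no run is known, or the latest run is older than max_age."""
    if last_run_at is None:
        return True
    return now - last_run_at > max_age


def build_dispatch_request(
    owner: str, repo: str, workflow_file: str, token: str, ref: str = "main"
) -> urllib.request.Request:
    """Build (but do not send) a POST to the workflow_dispatch REST endpoint."""
    url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    body = json.dumps({"ref": ref}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Assumed: the last observed run was two hours ago; hourly cadence plus slack.
    last_seen = now - timedelta(hours=2)
    if needs_redispatch(last_seen, now, max_age=timedelta(minutes=90)):
        request = build_dispatch_request(
            "petry-projects", ".github", "compliance-retrigger.yml", token="<PAT>"
        )
        # Sending is left out on purpose: urllib.request.urlopen(request)
        print(request.get_method(), request.full_url)
```

Because the sweep is idempotent and throttled, an occasional duplicate dispatch from a check like this should be harmless; the token handling is the part that needs care.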
Out of scope (not low-cost)
Self-hosted runners give exact-time control but add infra, security surface, and maintenance — explicitly not low cost/complexity, noted only for completeness.
Recommendation
Adopt #1 (off-peak minute) fleet-wide now as the standard for all schedule: triggers (cheap, immediate, codify in standards/ci-standards.md), and add #3 (external heartbeat) or #4 (event-chaining) as a backstop for the workflows we actually depend on (compliance-retrigger first). #2 is essentially free given the sweep's existing idempotent and throttled design, and pairs well with #1.
Open questions
Do we standardize a single off-peak minute (e.g. :17) across the org, or stagger per workflow to avoid self-inflicted bursts?
For the backstop, do we prefer keeping everything in-platform (#4) or accept an external dependency for stronger guarantees (#3)?
Should ci-standards.md outright ban 0 * * * * for new workflows?
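To make the "stagger per workflow" option concrete, one possible approach is to derive each workflow's minute deterministically from its name. This is only a sketch of that idea: `staggered_minute` and `hourly_cron` are hypothetical names, and the 5 to 55 window is an assumed definition of "off-peak", not something the GitHub docs specify.

```python
import hashlib


def staggered_minute(workflow_name: str, low: int = 5, high: int = 55) -> int:
    """Map a workflow name to a stable minute in [low, high].

    The same name always yields the same minute, and minute 0 is never
    produced as long as low > 0.
    """
    digest = hashlib.sha256(workflow_name.encode("utf-8")).digest()
    span = high - low + 1
    return low + int.from_bytes(digest[:4], "big") % span


def hourly_cron(workflow_name: str) -> str:
    """Hourly cron expression at the workflow's staggered minute."""
    return f"{staggered_minute(workflow_name)} * * * *"


if __name__ == "__main__":
    for name in ("compliance-retrigger", "compliance-audit"):
        print(name, "->", hourly_cron(name))
```

The trade-off: hashing spreads load without a central registry, but two workflows can still land on the same minute, whereas a single fixed minute such as :17 is simpler to document and audit.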
Sources
GitHub Docs — Events that trigger workflows (schedule)
community/discussions #156282 — Unexpected delay in scheduled workflows
community/discussions #147369 — cron not running at the specified time
Predicting GitHub Cron Delays — lowlysre
Now let me re-establish the watch — the container restart killed the 22:20 timer. Checking the current time and whether the 22:00 run fired.