feat(cron): per-schedule pause / resume#320
Conversation
Adds paused_at + paused_by columns to awa.cron_jobs so individual cron
schedules can be paused without deleting them. The evaluator skips
paused rows up front and the atomic_enqueue CTE re-checks
paused_at IS NULL inside the same UPDATE, so a pause asserted between
the leader's read and CAS still takes effect. last_enqueued_at is left
untouched while paused, so the schedule's existing missed_fire_policy
decides catch-up behaviour on resume. Manual trigger_cron_job bypasses
pause; pause stops automatic fires only.
HTTP: POST /api/cron/{name}/pause (optional paused_by body) and
/api/cron/{name}/resume. /api/cron list response carries paused_at and
paused_by.
UI: per-row Pause / Resume controls next to Trigger now; "paused" badge
on the row; "queue paused" badge inline next to the target queue when
the queue itself is paused (uses /api/queues). Expanded panel calls out
both states and explains that manual triggers still work and that fires
enqueued into a paused queue dispatch on queue resume.
TLA: AwaCron gains paused, Pause, and Resume. AtomicEnqueue requires
~paused; CASFail extended to model the paused-row UPDATE-zero-rows case
so a stale snapshot does not deadlock. New PausedBlocksEnqueue temporal
property asserts no step from a paused state increments jobCount.
FairSpec adds weak fairness on Resume so liveness still holds.
Coverage: 9 new model/DB tests, 5 new UI API tests, e2e pause/resume
round-trip, ADR-007 / architecture.md / ui-design.md / test-plan.md /
CHANGELOG / READMEs updated.
|
Warning Review limit reached
More reviews will be available in 53 minutes and 23 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (21)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
TLC found a Pause/Resume storm that starves AtomicEnqueue:
AcquireLeader → ReadCronState → CASFail (paused)
→ Resume → Pause → ...
The leader's snapshot is dropped on CASFail and Pause re-fires before
ReadCronState can refresh it, so (hasSnapshot ∧ ~paused) is never
simultaneously enabled. Weak fairness on Resume does not help — the
fairness antecedent ("continuously enabled" for WF, "infinitely often
enabled" for SF) is satisfied by the loop, but AtomicEnqueue's
preconditions are not.
Restrict StableNext to omit Pause (but keep Resume + WF, so any initial
paused state still clears). Pause / Resume remain in the full Next used
by the safety spec, so PausedBlocksEnqueue is still verified across
storms. Matches the existing convention of omitting Crash / Recover
from liveness — progress claims hold only when the environment is not
adversarial.
Verified locally via correctness/run-tlc.sh:
- AwaCron.cfg (safety): 9248 states, 1658 distinct, depth 17, OK
- AwaCronLiveness.cfg (liveness): 1180 states, 360 distinct, depth 14, OK
ADR-007 model section updated to explain the restriction.
Summary
paused_at+paused_bytoawa.cron_jobs(mirrors thequeue_metashape). The evaluator skips paused rows andatomic_enqueue's CTE re-checkspaused_at IS NULLinside the same UPDATE — a pause asserted between the leader's read and the CAS still takes effect.last_enqueued_atis left untouched while paused, so the schedule's existingmissed_fire_policydecides catch-up behaviour on resume. Manualtrigger_cron_jobbypasses pause (pause stops automatic fires only).POST /api/cron/{name}/pause(optional{paused_by}body) and/resume. List response carriespaused_at/paused_by.AwaCrongainspaused,Pause,Resume.AtomicEnqueuerequires~paused;CASFailextended to model the paused-row UPDATE-zero-rows case. NewPausedBlocksEnqueuetemporal property asserts no step from a paused state incrementsjobCount. LivenessStableNextincludesResume(with WF) but omitsPause.TLA+ model checking
Run locally via
./correctness/run-tlc.shagainstcorrectness/Dockerfile(TLA 1.8.0, Eclipse Temurin 21).AwaCron.cfg(safety) — 9,248 states, 1,658 distinct, depth 17.TypeOK,NoDuplicateFire,SnapshotNeverAheadOfDB,SnapshotRequiresAlive,LeaderAlive, and the newPausedBlocksEnqueuetemporal property all hold across the fullNext(includingPause/Resume/Crash/Recover).AwaCronLiveness.cfg— initially failed. TLC found a Pause/Resume storm that starvesAtomicEnqueue:The leader's snapshot is dropped on
CASFailandPausere-fires beforeReadCronStatecan refresh it, so(hasSnapshot ∧ ~paused)is never simultaneously enabled. Weak fairness onResumedidn't help — the fairness antecedent was satisfied by the loop butAtomicEnqueue's preconditions weren't.Fix in
de3fb84:StableNext(the liveness Next) keepsResume(with WF, so any initial paused state still clears) but omitsPause.Pause/Resumeremain in the full safetyNext, soPausedBlocksEnqueueis still verified against storms. This matches the existing convention of omittingCrash/Recoverfrom liveness — progress claims hold only when the environment is not adversarial.After the fix: liveness passes — 1,180 states, 360 distinct, depth 14. Both
CoalescedLatestFireEventuallyEnqueuedandCatchUpFiresEventuallyEnqueuedhold.Test plan
cargo fmt --all --checkSQLX_OFFLINE=true cargo clippy --all-targets --all-features -- -D warningsSQLX_OFFLINE=true cargo build --workspacecargo test -p awa --test cron_test— 21 passing (9 new pause/resume tests + existing)cargo test -p awa-ui --test cron_api_test— 6 passing (5 new pause endpoint tests + existing)cargo test -p awa --test migration_test— 37 passing (v026 migration applies cleanly on existing DB)npm run lint(tsc --noEmit) inawa-ui/frontend/— cleancorrectness/races/AwaCron.tla— safety + liveness pass (see above)