
feat(cron): per-schedule pause / resume#320

Merged: hardbyte merged 2 commits into main from feature/cron-pause on Jun 5, 2026


@hardbyte hardbyte commented Jun 5, 2026

Summary

  • Adds paused_at + paused_by to awa.cron_jobs (mirrors the queue_meta shape). The evaluator skips paused rows and atomic_enqueue's CTE re-checks paused_at IS NULL inside the same UPDATE — a pause asserted between the leader's read and the CAS still takes effect.
  • last_enqueued_at is left untouched while paused, so the schedule's existing missed_fire_policy decides catch-up behaviour on resume. Manual trigger_cron_job bypasses pause (pause stops automatic fires only).
  • HTTP: POST /api/cron/{name}/pause (optional {paused_by} body) and POST /api/cron/{name}/resume. The /api/cron list response carries paused_at / paused_by.
  • UI: per-row Pause / Resume controls; "paused" badge; "queue paused" badge inline next to the target queue when the queue is itself paused. Expanded panel calls out both states.
  • TLA+: AwaCron gains paused, Pause, Resume. AtomicEnqueue requires ~paused; CASFail extended to model the paused-row UPDATE-zero-rows case. New PausedBlocksEnqueue temporal property asserts no step from a paused state increments jobCount. Liveness StableNext includes Resume (with WF) but omits Pause.
  • Cron pause and queue pause are intentionally independent — one stops scheduling, the other stops dispatch. The UI surfaces both so an operator looking at a quiet schedule can see which side is stopped. ADR-007 documents the matrix and the in-flight-job semantics (pause does not affect already-enqueued jobs).
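
The read-then-CAS behaviour described in the first bullet can be sketched with an in-memory stand-in for a cron_jobs row. This is only an illustration under assumptions: `CronRow`, `Snapshot`, and every function name below are hypothetical, the real check lives in the atomic_enqueue CTE in SQL, and whether a second pause overwrites paused_at is not stated in this PR.

```rust
use std::time::SystemTime;

/// Hypothetical in-memory stand-in for one row of awa.cron_jobs.
#[derive(Debug, Clone, PartialEq)]
struct CronRow {
    last_enqueued_at: Option<SystemTime>,
    paused_at: Option<SystemTime>,
    paused_by: Option<String>,
}

/// What the leader remembers between its read and its compare-and-swap.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Snapshot {
    last_enqueued_at: Option<SystemTime>,
}

/// Evaluator read: paused rows are skipped up front.
fn read_for_evaluation(row: &CronRow) -> Option<Snapshot> {
    if row.paused_at.is_some() {
        return None;
    }
    Some(Snapshot { last_enqueued_at: row.last_enqueued_at })
}

/// Models the guarded UPDATE: it matches only if the row is unchanged since
/// the snapshot AND is still unpaused. Returns true when a job would be enqueued.
fn atomic_enqueue(row: &mut CronRow, snap: Snapshot, fire_at: SystemTime) -> bool {
    let unchanged = row.last_enqueued_at == snap.last_enqueued_at;
    let unpaused = row.paused_at.is_none();
    if unchanged && unpaused {
        row.last_enqueued_at = Some(fire_at);
        true
    } else {
        // Zero rows updated: nothing is enqueued and last_enqueued_at is untouched.
        false
    }
}

/// Assumption: pausing simply records who paused and when.
fn pause(row: &mut CronRow, now: SystemTime, by: Option<&str>) {
    row.paused_at = Some(now);
    row.paused_by = by.map(str::to_owned);
}

fn resume(row: &mut CronRow) {
    row.paused_at = None;
    row.paused_by = None;
}
```

The point of the second check is the interleaving read, pause, CAS: the snapshot was taken while the row was unpaused, yet the enqueue still fails because the guard is evaluated again at write time.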

TLA+ model checking

Run locally via ./correctness/run-tlc.sh against correctness/Dockerfile (TLA 1.8.0, Eclipse Temurin 21).

AwaCron.cfg (safety) — 9,248 states, 1,658 distinct, depth 17. TypeOK, NoDuplicateFire, SnapshotNeverAheadOfDB, SnapshotRequiresAlive, LeaderAlive, and the new PausedBlocksEnqueue temporal property all hold across the full Next (including Pause/Resume/Crash/Recover).

AwaCronLiveness.cfg — initially failed. TLC found a Pause/Resume storm that starves AtomicEnqueue:

AcquireLeader → ReadCronState → CASFail(paused) → Resume → Pause → ...

The leader's snapshot is dropped on CASFail and Pause re-fires before ReadCronState can refresh it, so hasSnapshot ∧ ~paused never holds and AtomicEnqueue is never enabled. Weak fairness on Resume didn't help: the fairness antecedent was satisfied by the loop, but AtomicEnqueue's preconditions weren't.

Fix in de3fb84: StableNext (the liveness Next) keeps Resume (with WF, so any initial paused state still clears) but omits Pause. Pause/Resume remain in the full safety Next, so PausedBlocksEnqueue is still verified against storms. This matches the existing convention of omitting Crash/Recover from liveness — progress claims hold only when the environment is not adversarial.

After the fix: liveness passes — 1,180 states, 360 distinct, depth 14. Both CoalescedLatestFireEventuallyEnqueued and CatchUpFiresEventuallyEnqueued hold.
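
The storm can be replayed on a toy state machine to see why AtomicEnqueue starves. This is a rough sketch, not the AwaCron spec: the `State` fields, the `Action` guards, and the omission of leadership, crashes, and the stale-snapshot CASFail case are all simplifying assumptions made for illustration.

```rust
/// Toy state: only the two variables that matter for the storm, plus a counter.
#[derive(Debug, Clone, Copy, PartialEq)]
struct State {
    has_snapshot: bool,
    paused: bool,
    job_count: u32,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Action {
    ReadCronState,
    CasFail,
    AtomicEnqueue,
    Pause,
    Resume,
}

/// Returns the successor state, or None when the action is not enabled.
/// The guards are assumptions chosen to reproduce the reported trace.
fn step(s: State, a: Action) -> Option<State> {
    match a {
        Action::ReadCronState if !s.has_snapshot => Some(State { has_snapshot: true, ..s }),
        // Paused-row case only: the UPDATE matches zero rows and the snapshot is dropped.
        Action::CasFail if s.has_snapshot && s.paused => Some(State { has_snapshot: false, ..s }),
        Action::AtomicEnqueue if s.has_snapshot && !s.paused => {
            Some(State { has_snapshot: false, job_count: s.job_count + 1, ..s })
        }
        Action::Pause if !s.paused => Some(State { paused: true, ..s }),
        Action::Resume if s.paused => Some(State { paused: false, ..s }),
        _ => None,
    }
}

fn enabled(s: State, a: Action) -> bool {
    step(s, a).is_some()
}

/// Replays `actions` from `start`. Returns the final state and whether
/// AtomicEnqueue was enabled in any visited state. Panics on a disabled action.
fn replay(start: State, actions: &[Action]) -> (State, bool) {
    let mut s = start;
    let mut ever_enabled = enabled(s, Action::AtomicEnqueue);
    for &a in actions {
        s = step(s, a).expect("action not enabled in this state");
        ever_enabled |= enabled(s, Action::AtomicEnqueue);
    }
    (s, ever_enabled)
}
```

Starting paused with no snapshot, the cycle ReadCronState, CasFail, Resume, Pause returns to its start state without ever enabling AtomicEnqueue, so it can repeat forever. Without Pause in the action set, Resume, ReadCronState, AtomicEnqueue goes through and the counter advances.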

Test plan

  • cargo fmt --all --check
  • SQLX_OFFLINE=true cargo clippy --all-targets --all-features -- -D warnings
  • SQLX_OFFLINE=true cargo build --workspace
  • cargo test -p awa --test cron_test — 21 passing (9 new pause/resume tests + existing)
  • cargo test -p awa-ui --test cron_api_test — 6 passing (5 new pause endpoint tests + existing)
  • cargo test -p awa --test migration_test — 37 passing (v026 migration applies cleanly on existing DB)
  • npm run lint (tsc --noEmit) in awa-ui/frontend/ — clean
  • TLC model check correctness/races/AwaCron.tla — safety + liveness pass (see above)
  • Manual smoke in browser: pause a schedule, confirm badge + Resume button appear, confirm no fires; resume, confirm fires resume; pause a target queue, confirm "queue paused" badge appears on the cron page.

Adds paused_at + paused_by columns to awa.cron_jobs so individual cron
schedules can be paused without deleting them. The evaluator skips
paused rows up front and the atomic_enqueue CTE re-checks
paused_at IS NULL inside the same UPDATE, so a pause asserted between
the leader's read and CAS still takes effect. last_enqueued_at is left
untouched while paused, so the schedule's existing missed_fire_policy
decides catch-up behaviour on resume. Manual trigger_cron_job bypasses
pause; pause stops automatic fires only.
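
Because last_enqueued_at does not move while a schedule is paused, the fires
missed during the pause are still visible on resume, and the policy decides
what to do with them. A minimal sketch, assuming a fixed-interval schedule in
seconds and two hypothetical policy variants (the real missed_fire_policy
values and the cron-expression evaluation are not shown in this PR):

```rust
/// Hypothetical policy variants, named for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum MissedFirePolicy {
    /// Enqueue only the most recent missed fire.
    CoalesceLatest,
    /// Enqueue every missed fire.
    CatchUpAll,
}

/// Fire times in (last_enqueued_at, now] for a fixed-interval schedule,
/// filtered by the policy. All times are seconds on an arbitrary clock.
fn due_fires(last_enqueued_at: u64, now: u64, interval: u64, policy: MissedFirePolicy) -> Vec<u64> {
    assert!(interval > 0, "interval must be positive");
    let missed: Vec<u64> = (1u64..)
        .map(|k| last_enqueued_at + k * interval)
        .take_while(|&t| t <= now)
        .collect();
    match policy {
        MissedFirePolicy::CatchUpAll => missed,
        MissedFirePolicy::CoalesceLatest => missed.last().copied().into_iter().collect(),
    }
}
```

With a 60-second interval, a last enqueue at 0, and a resume at 200, the
catch-up variant yields three fires and the coalescing variant yields only the
latest one. Had pause advanced last_enqueued_at, both would yield nothing.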

HTTP: POST /api/cron/{name}/pause (optional paused_by body) and
/api/cron/{name}/resume. /api/cron list response carries paused_at and
paused_by.
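
The shape of the two new routes can be sketched as a small matcher. This is
not the handler in awa-ui/src/handlers/cron.rs; `CronAction` and `match_route`
are hypothetical names, and the optional paused_by body is left out.

```rust
/// Hypothetical parsed form of the two new endpoints.
#[derive(Debug, PartialEq)]
enum CronAction<'a> {
    Pause { name: &'a str },
    Resume { name: &'a str },
}

/// Matches POST /api/cron/{name}/pause and POST /api/cron/{name}/resume.
/// Anything else, including other methods, yields None.
fn match_route<'a>(method: &str, path: &'a str) -> Option<CronAction<'a>> {
    if method != "POST" {
        return None;
    }
    let rest = path.strip_prefix("/api/cron/")?;
    let (name, verb) = rest.rsplit_once('/')?;
    // Assumption: schedule names are a single non-empty path segment.
    if name.is_empty() || name.contains('/') {
        return None;
    }
    match verb {
        "pause" => Some(CronAction::Pause { name }),
        "resume" => Some(CronAction::Resume { name }),
        _ => None,
    }
}
```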

UI: per-row Pause / Resume controls next to Trigger now; "paused" badge
on the row; "queue paused" badge inline next to the target queue when
the queue itself is paused (uses /api/queues). Expanded panel calls out
both states and explains that manual triggers still work and that fires
enqueued into a paused queue dispatch on queue resume.
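
The two pause states combine as a small matrix. The sketch below encodes the
behaviour described in this PR (cron pause stops automatic fires, queue pause
holds dispatch, manual triggers bypass cron pause); the type and function names
are hypothetical and ADR-007 remains the authoritative statement.

```rust
/// What happens to a fire, given the two independent pause flags.
#[derive(Debug, PartialEq)]
enum Outcome {
    /// Job is enqueued and workers pick it up.
    EnqueuedAndDispatched,
    /// Job is enqueued but waits until the queue is resumed.
    EnqueuedHeldInQueue,
    /// No job is created.
    NotEnqueued,
}

/// An automatic (scheduled) fire respects the cron pause.
fn automatic_fire(cron_paused: bool, queue_paused: bool) -> Outcome {
    match (cron_paused, queue_paused) {
        (true, _) => Outcome::NotEnqueued,
        (false, true) => Outcome::EnqueuedHeldInQueue,
        (false, false) => Outcome::EnqueuedAndDispatched,
    }
}

/// A manual trigger ignores the cron pause but still lands in the queue.
fn manual_trigger(_cron_paused: bool, queue_paused: bool) -> Outcome {
    if queue_paused {
        Outcome::EnqueuedHeldInQueue
    } else {
        Outcome::EnqueuedAndDispatched
    }
}
```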

TLA: AwaCron gains paused, Pause, and Resume. AtomicEnqueue requires
~paused; CASFail extended to model the paused-row UPDATE-zero-rows case
so a stale snapshot does not deadlock. New PausedBlocksEnqueue temporal
property asserts no step from a paused state increments jobCount.
FairSpec adds weak fairness on Resume so liveness still holds.
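
PausedBlocksEnqueue is a property of consecutive state pairs, so it can also be
checked against a recorded trace outside TLC. A minimal sketch, assuming each
observation carries only `paused` and `job_count` (hypothetical field names
mirroring the spec variables):

```rust
/// One observed state of the model: just the two variables the property mentions.
#[derive(Debug, Clone, Copy)]
struct Obs {
    paused: bool,
    job_count: u32,
}

/// Returns the index i of the first step trace[i] -> trace[i + 1] that starts
/// in a paused state and increments job_count, or None if no step does.
fn first_violation(trace: &[Obs]) -> Option<usize> {
    trace
        .windows(2)
        .position(|w| w[0].paused && w[1].job_count > w[0].job_count)
}
```

A step that resumes and enqueues at once would count as a violation here, since
the check looks only at the pre-state, which is how the property is worded above.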

Coverage: 9 new model/DB tests, 5 new UI API tests, e2e pause/resume
round-trip, ADR-007 / architecture.md / ui-design.md / test-plan.md /
CHANGELOG / READMEs updated.

coderabbitai Bot commented Jun 5, 2026


Warning

Review limit reached

@hardbyte, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 53 minutes and 23 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a62fa3e4-9691-4935-a388-cf6cd3ed7d82

📥 Commits

Reviewing files that changed from the base of the PR and between cc9ba62 and de3fb84.

📒 Files selected for processing (21)
  • CHANGELOG.md
  • README.md
  • awa-model/README.md
  • awa-model/migrations/v026_cron_jobs_pause.sql
  • awa-model/src/cron.rs
  • awa-model/src/migrations.rs
  • awa-ui/README.md
  • awa-ui/frontend/e2e/cron.spec.ts
  • awa-ui/frontend/src/lib/api.ts
  • awa-ui/frontend/src/routes/cron.tsx
  • awa-ui/src/handlers/cron.rs
  • awa-ui/src/lib.rs
  • awa-ui/tests/cron_api_test.rs
  • awa-worker/src/maintenance.rs
  • awa/tests/cron_test.rs
  • correctness/races/AwaCron.cfg
  • correctness/races/AwaCron.tla
  • docs/adr/007-periodic-cron-jobs.md
  • docs/architecture.md
  • docs/test-plan.md
  • docs/ui-design.md

TLC found a Pause/Resume storm that starves AtomicEnqueue:

    AcquireLeader → ReadCronState → CASFail (paused)
                  → Resume → Pause → ...

The leader's snapshot is dropped on CASFail and Pause re-fires before
ReadCronState can refresh it, so hasSnapshot ∧ ~paused never holds and
AtomicEnqueue is never enabled. Weak fairness on Resume does not help:
the fairness antecedent ("continuously enabled" for WF, "infinitely
often enabled" for SF) is satisfied by the loop, but AtomicEnqueue's
preconditions are not.

Restrict StableNext to omit Pause (but keep Resume + WF, so any initial
paused state still clears). Pause / Resume remain in the full Next used
by the safety spec, so PausedBlocksEnqueue is still verified across
storms. Matches the existing convention of omitting Crash / Recover
from liveness — progress claims hold only when the environment is not
adversarial.

Verified locally via correctness/run-tlc.sh:
  - AwaCron.cfg (safety): 9248 states, 1658 distinct, depth 17, OK
  - AwaCronLiveness.cfg (liveness): 1180 states, 360 distinct, depth 14, OK

ADR-007 model section updated to explain the restriction.
@hardbyte hardbyte merged commit 879dead into main Jun 5, 2026
13 checks passed