Skip to content

Reduce claimer heartbeat churn#213

Merged
hardbyte merged 4 commits into
mainfrom
codex/coordination-architecture
May 2, 2026
Merged

Reduce claimer heartbeat churn#213
hardbyte merged 4 commits into
mainfrom
codex/coordination-architecture

Conversation

@hardbyte
Copy link
Copy Markdown
Owner

@hardbyte hardbyte commented May 2, 2026

Summary

  • reduce queue-storage bounded-claimer write churn by carrying the acquired lease freshness internally and skipping mark_queue_claimer_active while the lease is still safely fresh
  • keep the public QueueClaimerLease shape unchanged; freshness timestamps stay in an internal row type
  • add an integration regression test that proves a fresh owned claimer can claim without rewriting queue_claimer_leases, while a half-stale idle heartbeat refreshes before another process can steal it
  • fold the reviewed architecture deck into docs/architecture.md, correcting table names and recovery claims against the current implementation and TLA docs

Architecture review notes

Corrections made while folding in the diagrams:

  • done_entries is the terminal table name in queue storage; the old terminal_entries wording was removed
  • leases is the materialized live-attempt table; active_leases is only a logical/model term in older prose
  • heartbeat refresh is every-worker, but stale-heartbeat and deadline rescue scans are maintenance-leader tasks
  • panic'd workers do not receive an in-memory cancellation signal; recovery is storage-side via rescue and stale-writer rejection, while still-live local handlers are signalled when the rescuing process has them registered
  • claim serialization is queue_claim_heads ... FOR UPDATE SKIP LOCKED; ring CAS belongs to rotate/prune safety, not the claim serializer
  • TLA model mapping remains accurate for the lifecycle/receipt/lease claims used in the diagrams; queue-claimer lease freshness itself is below the current model abstraction

Benchmarks

Production-durability runs on the portable benchmark harness, PRODUCER_BATCH_MAX=128 PRODUCER_BATCH_MS=25.

Variant Shape Completion median Enqueue median Backlog median Subscriber p99 median Run
main reference 1 replica x 64 workers, 5k/s 4,817/s 4,949/s 573 724 ms custom-20260502T035801Z-ab7c8a
this PR 1 replica x 64 workers, 5k/s 4,988/s 4,976/s 125 50 ms custom-20260502T053743Z-cd752d
this PR 2 replicas x 32 workers, 5k/s aggregate 4,974/s 4,988/s 143 301 ms custom-20260502T053958Z-1235c0

A SQL-only throttle variant was tested and rejected before this PR because it regressed badly (custom-20260502T053416Z-2853e7: 4,634/s, p99 2.07s). Statement fusion was also kept out of this PR: it improved the 5k producer-capped case but regressed overload tail latency.

Verification

  • cargo test -p awa --test queue_storage_runtime_test test_queue_storage_claimer_heartbeat_skips_fresh_lease -- --nocapture
  • cargo test -p awa --test queue_storage_runtime_test test_queue_storage_bounded_claimers -- --nocapture
  • cargo clippy -p awa-model --all-targets -- -D warnings
  • cargo fmt --check
  • ./correctness/run-tlc.sh storage/AwaSegmentedStorage.tla
  • grepped benchmark result dirs for stale-completion / flush-failure warnings: no hits

Summary by CodeRabbit

  • Bug Fixes

    • Optimized queue claimer logic to avoid unnecessary lease refreshes when leases are still fresh.
  • Documentation

    • Expanded architecture docs with detailed job lifecycle, dispatcher, crash recovery, maintenance, and callback orchestration guidance.
  • Tests

    • Added a queue claimer heartbeat test to validate lease refresh behavior.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 2, 2026

Warning

Rate limit exceeded

@hardbyte has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 36 minutes and 2 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 18e14cfe-718d-480a-99a8-e0d1029630fe

📥 Commits

Reviewing files that changed from the base of the PR and between fd38608 and 411d12d.

📒 Files selected for processing (2)
  • awa-model/src/queue_storage.rs
  • awa/tests/chaos_suite_test.rs
📝 Walkthrough

Walkthrough

Introduced an internal QueueClaimerLeaseRow with lease timing fields and a needs_refresh helper. Refactored claimer acquisition to return richer lease data and made heartbeat refresh conditional: mark_queue_claimer_active is now called only when claims exist and the lease is considered stale. Added a test and updated architecture docs.

Changes

Conditional Claimer Lease Refresh

Layer / File(s) Summary
Data Shape
awa-model/src/queue_storage.rs
Added internal QueueClaimerLeaseRow containing last_claimed_at and expires_at. Added needs_refresh(now, lease_ttl, idle_threshold) helper implementing refresh heuristics.
Core Acquisition
awa-model/src/queue_storage.rs
Introduced internal acquire_queue_claimer_row that returns QueueClaimerLeaseRow. Updated public acquire_queue_claimer to delegate and map to QueueClaimerLease.
DB Query Wiring
awa-model/src/queue_storage.rs
Updated SELECT/UPDATE/INSERT flows to RETURNING claimer_slot, lease_epoch, last_claimed_at, and expires_at to surface timing data to callers.
Control Flow
awa-model/src/queue_storage.rs
Changed claim_runtime_batch_with_aging_for_instance to call mark_queue_claimer_active only when claimed is non-empty and lease.needs_refresh(…) is true; removed prior unconditional refresh on non-empty claims.
Tests / Validation
awa/tests/queue_storage_runtime_test.rs
Updated chrono imports to use chrono::{DateTime, Utc};. Added test_queue_storage_claimer_heartbeat_skips_fresh_lease verifying heartbeat skip on fresh leases and refresh on stale leases.
Documentation
docs/architecture.md
Large rewrite to L1–L7 architecture structure; terminology and lifecycle/claim/lease sections updated (done_entries, leases), plus new sections for lifecycle, dispatcher, crash recovery, maintenance leader, and callbacks.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰
A claimant hummed by the data door,
I peeked the lease and sighed no more.
Fresh heartbeats sleep, stale ones wake,
A gentle hop—for fewer shakes.
Queue dances lightly, paws on floor.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 77.78% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Reduce claimer heartbeat churn' accurately summarizes the primary code change—reducing write churn by skipping unnecessary heartbeat updates for fresh leases.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch codex/coordination-architecture

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
Review rate limit: 0/1 reviews remaining, refill in 36 minutes and 2 seconds.

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@awa/tests/queue_storage_runtime_test.rs`:
- Around line 4144-4152: The test is backdating the claimer lease by 300ms which
is less than the idle_threshold (500ms), so update the SQL bind that sets
last_claimed_at (the Utc::now() - chrono::Duration::milliseconds(300)
expression) to use a duration greater than idle_threshold (e.g., >500ms or use
idle_threshold + some margin) so the lease becomes stale and the refresh branch
is exercised when running the assertions against queue_claimer_leases for
lease.claimer_slot and queue.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f7a3d68d-dab0-4f24-a853-02b41bf2616b

📥 Commits

Reviewing files that changed from the base of the PR and between fcb0eef and 69fdc52.

📒 Files selected for processing (3)
  • awa-model/src/queue_storage.rs
  • awa/tests/queue_storage_runtime_test.rs
  • docs/architecture.md

Comment thread awa/tests/queue_storage_runtime_test.rs
@hardbyte hardbyte merged commit 5c79fcc into main May 2, 2026
13 checks passed
@hardbyte hardbyte deleted the codex/coordination-architecture branch May 2, 2026 09:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant