v016: drop queue_lanes.available_count; derive from heads#251
Conversation
Removes the redundant available_count cache on queue_lanes that the 2026-05-10 investigation traced as the dominant bloat source under pinned xmin (~40k dead tuples in a 4 minute window, ratio 2,500x over live row count, driving the idle_in_tx scenario to 69% of clean throughput in the local repro). available_count is mathematically equal to queue_enqueue_heads.next_seq - queue_claim_heads.claim_seq for the same (queue, priority). The two head tables already track every enqueue and every claim; the cache was a third counter that had to be UPDATEd on every claim, enqueue, and completion batch, tripling the UPDATE rate on queue_lanes versus its siblings and making it the worst MVCC pressure source on the hot path. Hot path (queue_claimer_signal) now derives the count cheaply from `sum(GREATEST(next_seq - claim_seq, 0))` over the head tables — two PK reads per lane, O(few rows). Tolerates a transient over-count after admin DELETE of non-head lanes: the dispatcher's gap-recovery branch in claim_ready_runtime absorbs the drift on the next claim attempt. Admin API (queue_counts_exact) scans `ready_entries WHERE lane_seq >= claim_seq` for the exact count — what tests assert, what the UI reads. Not on any hot path. Code changes: - v016 migration: ALTER TABLE DROP COLUMN; CREATE OR REPLACE the insert_job_compat / delete_job_compat functions byte-identically to v013 minus their available_count UPDATEs. delete_job_compat gains a head-advance UPDATE for the head-lane delete case so the cheap hot-path approximation doesn't have to wait for gap-recovery. - queue_storage.rs: drop the column from prepare_schema's CREATE TABLE; remove the post-install backfill; remove the queue_lanes UPDATE from claim_ready_runtime's PL/pgSQL (keeping the gap-recovery branch); delete the adjust_lane_counts / adjust_lane_counts_batch helpers (all callers passed pruned_completed_delta=0 so the family was dead code after available_count was dropped); rewrite queue_claimer_signal to use the head-difference formula; rewrite queue_counts_exact to scan ready_entries; cancel_job_tx gains the same head-advance for the head-lane case as delete_job_compat. - migrations.rs: CURRENT_VERSION 15 -> 16; new V16_UP constant. Tests: - queue_storage_runtime_test::test_available_count_matches_ready_entries_scan reframed to two contracts: the hot-path approximation must never under-count vs scan, and the admin API must exactly equal scan. The previous "scan == cache" assertion no longer applies — there is no cache. - test_queue_storage_queue_counts_reads_legacy_lane_rollups_and_backfills_them: remove available_count from the seed INSERT. - 4 sites in queue_storage_copy_test.rs and 2 in awa-python tests: rewrite the inline SQL to use the head-difference formula. - 3 sites (test_dlq.py x2, test_cli.py x1): delete the lane_fix CTE that updated available_count. - awa-ui seed.ts: drop the column from the INSERT. All 57 queue_storage_runtime tests pass on pg18. Workspace builds clean with -D warnings. Bench A/B in flight to measure the idle_in_tx ratio recovery.
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (13)
📝 WalkthroughWalkthroughMigration v016 drops the ChangesQueue Availability Refactoring: Drop Lane Cache, Derive from Cursors
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
CI surfaced two issues: 1. v016 migration didn't record itself in awa.schema_version, so migrations::run kept thinking version 16 was pending and the idempotency test failed (current_version returned 15, not 16). Every prior migration ends with an `INSERT INTO awa.schema_version (version, description) ... ON CONFLICT DO NOTHING`; v016 was missing it. Added. 2. test_dlq.py x2 and test_cli.py x1: when I removed the `lane_fix` CTE that UPDATEd queue_lanes.available_count, I left the trailing comma after the previous `released AS (...)` block, producing `WITH ... released AS (...), INSERT INTO ...` — syntax error. Replaced `),` with `)` at the three sites. Local verification: test_c13_migration_idempotent passes against fresh pg18. Pushing for CI to confirm the Python suite recovers.
Strip "v016:" prefixes, date stamps, and "after the X investigation" phrasing from comments added with the migration. Promote `queue_claimer_signal` and `queue_counts_exact` to /// doc comments describing the head-difference approximation vs the ready_entries scan, when each is used, and the drift behaviour under admin deletes. Rewrite the drift-detection test's docstring around the head-table invariants (enqueue bumps next_seq, claim/cancel bump claim_seq, gap-recovery closes mid-ring gaps) so the test reads as a coverage table for the lifecycle rather than a counter-maintenance audit. No behaviour changes.
- ADR-019 §lane_state: replace the stale paragraph that described the cache as the current implementation; document the two-grade availability scheme (head-difference for the hot path, ready_entries scan for the admin API) and capture the cache as a historical iteration superseded by v016. - ADR-008: rewrite the "lane counters remain online" caveat to reflect that only the next_seq cursor bump is on the enqueue hot path now; cross-link to ADR-019. - CHANGELOG: add an Unreleased Performance entry for v016. Bench-dated ADRs under docs/adr/bench/ are left as-is — they are historical artifacts of specific benchmark runs.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e257e539e7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| EXECUTE format( | ||
| 'ALTER TABLE %I.queue_lanes DROP COLUMN IF EXISTS available_count', | ||
| v_schema |
There was a problem hiding this comment.
Avoid dropping the column in a rolling migration
When this migration is applied while any v15 workers are still running, their enqueue/claim/delete paths and the old compat functions still issue UPDATE queue_lanes SET available_count = ..., so this ALTER TABLE ... DROP COLUMN makes those workers fail until every process is upgraded and functions are replaced. This also contradicts the migration policy in awa-model/src/migrations.rs that incremental migrations must be additive-only to avoid breaking running workers; either keep the column through a compatibility window or make this a documented stop-the-world/major upgrade.
Useful? React with 👍 / 👎.
| SELECT COALESCE(sum(available_count), 0)::bigint | ||
| FROM {schema}.queue_lanes | ||
| WHERE queue = ANY($1) | ||
| SELECT COALESCE(sum(GREATEST(qe.next_seq - qc.claim_seq, 0)), 0)::bigint |
There was a problem hiding this comment.
Close deleted tail gaps before using the head-derived signal
For queues where admins cancel/delete all unclaimed rows at or beyond the next claim head after an earlier head row remains, this derived signal can stay positive forever: after the remaining head is claimed, claim_seq points at a deleted gap while next_seq is still higher, but claim_ready_runtime first joins through a LATERAL ready_entries candidate and returns before its gap-recovery branch when no row exists. Previously available_count was decremented by those deletes; now the dispatcher will keep waking/claiming an empty queue until a future enqueue happens to advance the cursor.
Useful? React with 👍 / 👎.
Summary
Removes the redundant `available_count` cache from `queue_lanes`. The 2026-05-10 investigation traced this column as the dominant bloat source under pinned xmin — every claim, enqueue, and completion batch was UPDATEing it, tripling `queue_lanes`'s mutation rate vs its sibling counter tables. The value is mathematically identical to `queue_enqueue_heads.next_seq − queue_claim_heads.claim_seq`, which the two head tables already maintain; storing a third counter just to denormalise a difference was pure cost.
Results — noise-controlled A/B
Three iterations of two cells (idle_in_tx; W=64 steady), interleaving prefix and v016 so any time-correlated host noise hits both alike. Local docker pg18.
queue_lanes dead-tuple peak during idle_in_tx — deterministic
The bloat source is gone. Live row count is ~16 and stays that way.
Throughput — v016 stable, prefix unstable
Prefix hits a "0 jobs/s" stall in 4 of 6 cells. v016 never does. In the one cell where prefix didn't stall (iter 3 W=64), the two are equivalent (~1,420). The bench harness sampling is the same on both sides — the stalls are real production-style stalls in the prefix code, made visible by macOS docker's higher MVCC pressure.
The v016 numbers themselves vary across iterations (740–1,000 clean, 651–789 idle, 1,505–1,657 W=64), which is normal docker-on-Mac noise. The point is v016 doesn't stall.
Code changes
Hot path (`queue_storage.rs`):
Migration (`v016_drop_queue_lanes_available_count.sql`):
Tests (no production code path uses the dropped column):
Trade
The cheap hot-path approximation (`next_seq − claim_seq`) over-counts by 1 each time an admin operation deletes an unclaimed non-head lane, until the next claim attempt on that lane hits the gap-recovery branch in `claim_ready_runtime` and catches `claim_seq` up. The dispatcher's wake-up signal is tolerant of brief over-counts — an over-eager wake-up is harmless. Admin tools that need exactness route through `queue_counts_exact`, which scans `ready_entries`.
Test plan
Summary by CodeRabbit
Performance
Tests
Documentation