v016: drop queue_lanes.available_count; derive from heads by hardbyte · Pull Request #251 · hardbyte/awa

hardbyte · 2026-05-10T22:49:33Z

Summary

Removes the redundant `available_count` cache from `queue_lanes`. The 2026-05-10 investigation traced this column as the dominant bloat source under pinned xmin — every claim, enqueue, and completion batch was UPDATEing it, tripling `queue_lanes`'s mutation rate vs its sibling counter tables. The value is mathematically identical to `queue_enqueue_heads.next_seq − queue_claim_heads.claim_seq`, which the two head tables already maintain; storing a third counter just to denormalise a difference was pure cost.
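As a minimal sketch of the identity above, assuming one cursor pair per (queue, priority) lane: the `LaneHeads` struct and both function names are hypothetical and do not appear in the awa codebase. The clamp at zero mirrors the `GREATEST(next_seq - claim_seq, 0)` expression quoted in the commit message further down.

```rust
/// Cursor pair for one (queue, priority) lane. The field names follow the
/// columns named in this PR; the struct itself is a hypothetical illustration.
#[derive(Debug, Clone, Copy)]
pub struct LaneHeads {
    /// Models queue_enqueue_heads.next_seq: the sequence the next enqueue takes.
    pub next_seq: i64,
    /// Models queue_claim_heads.claim_seq: the sequence the next claim starts from.
    pub claim_seq: i64,
}

/// Head difference for a single lane, clamped at zero like SQL GREATEST(.., 0).
pub fn derived_available(heads: LaneHeads) -> i64 {
    (heads.next_seq - heads.claim_seq).max(0)
}

/// Sum over lanes, mirroring sum(GREATEST(next_seq - claim_seq, 0)).
pub fn derived_available_total(lanes: &[LaneHeads]) -> i64 {
    lanes.iter().copied().map(derived_available).sum()
}
```

Nothing here is written on enqueue or claim beyond the two cursors themselves, which is the point of dropping the third counter.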

Results — noise-controlled A/B

Three iterations of two cells (idle_in_tx; W=64 steady), interleaving prefix and v016 so any time-correlated host noise hits both alike. Local docker pg18.

queue_lanes dead-tuple peak during idle_in_tx — deterministic

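| iter | prefix (dead tuples) | v016 (dead tuples) |
|------|----------------------|--------------------|
| 1    | 30,926               | 0                  |
| 2    | 35,956               | 0                  |
| 3    | 35,442               | 0                  |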

The bloat source is gone. Live row count is ~16 and stays that way.

Throughput — v016 stable, prefix unstable

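| iter | shape       | prefix clean (jobs/s) | v016 clean (jobs/s) | prefix idle (jobs/s) | v016 idle (jobs/s) | v016 idle/clean |
|------|-------------|-----------------------|---------------------|----------------------|--------------------|-----------------|
| 1    | idle_in_tx  | 592                   | 740                 | 524                  | 707                | 96 %            |
| 2    | idle_in_tx  | 0                     | 1,000               | 0                    | 651                | 65 %            |
| 3    | idle_in_tx  | 0                     | 914                 | 574                  | 789                | 86 %            |
| 1    | W=64 steady | 0                     | 1,657               | n/a                  | n/a                | n/a             |
| 2    | W=64 steady | 0                     | 1,505               | n/a                  | n/a                | n/a             |
| 3    | W=64 steady | 1,424                 | 1,422               | n/a                  | n/a                | n/a             |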

Prefix hits a "0 jobs/s" stall in 4 of 6 cells. v016 never does. In the one W=64 cell where prefix didn't stall (iter 3), the two are equivalent (~1,420). The bench harness sampling is the same on both sides, so the stalls are real production-style stalls in the prefix code, made visible by macOS docker's higher MVCC pressure.

The v016 numbers themselves vary across iterations (740–1,000 clean, 651–789 idle, 1,422–1,657 W=64), which is normal docker-on-Mac noise. The point is v016 doesn't stall.

Code changes

Hot path (`queue_storage.rs`):

Drop `available_count` column from `queue_lanes` CREATE TABLE.
Remove the column's backfill from `prepare_schema`.
Remove the `UPDATE queue_lanes SET available_count = …` from `claim_ready_runtime` PL/pgSQL (the gap-recovery branch stays).
Delete `adjust_lane_counts` and `adjust_lane_counts_batch` — every historical caller passed `pruned_completed_delta = 0`, so the entire helper collapsed to a no-op once `available_count` was dropped.
Rewrite `queue_claimer_signal` to derive the count from the head-table difference (two PK reads per lane).
Rewrite `queue_counts_exact` (admin-grade API) to scan `ready_entries WHERE lane_seq >= claim_seq` for the exact count.
`cancel_job_tx` and the v016 `delete_job_compat` each gain a small head-advance UPDATE for the head-lane delete case, so the cheap hot-path approximation doesn't have to wait for gap-recovery.

Migration (`v016_drop_queue_lanes_available_count.sql`):

`ALTER TABLE queue_lanes DROP COLUMN available_count`.
`CREATE OR REPLACE FUNCTION` for `insert_job_compat` and `delete_job_compat`, byte-identical to v013's signatures, bodies minus their `available_count` UPDATEs.
`migrations.rs` bumps `CURRENT_VERSION 15 → 16`.

Tests (no production code path uses the dropped column):

`test_available_count_matches_ready_entries_scan` reframed against two contracts: hot-path approximation never under-counts vs scan; admin API exactly equals scan.
7 inline-SQL sites in Rust + Python tests + the awa-ui seed: rewrite to head-difference or just drop the column.
3 lane_fix CTE sites in Python tests: deleted (no column to update).

Trade

The cheap hot-path approximation (`next_seq − claim_seq`) over-counts by 1 each time an admin operation deletes an unclaimed non-head lane, until the next claim attempt on that lane hits the gap-recovery branch in `claim_ready_runtime` and catches `claim_seq` up. The dispatcher's wake-up signal is tolerant of brief over-counts — an over-eager wake-up is harmless. Admin tools that need exactness route through `queue_counts_exact`, which scans `ready_entries`.
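The trade described above can be sketched with a small in-memory model. This is an illustration under stated assumptions, not the awa implementation: the `Lane` type and its methods are hypothetical, the `ready` set stands in for `ready_entries`, and gap recovery is modelled as a loop that runs at the start of each claim attempt.

```rust
use std::collections::BTreeSet;

/// Hypothetical in-memory model of a single lane (not awa code).
#[derive(Default)]
pub struct Lane {
    next_seq: i64,        // models queue_enqueue_heads.next_seq
    claim_seq: i64,       // models queue_claim_heads.claim_seq
    ready: BTreeSet<i64>, // models ready_entries.lane_seq for this lane
}

impl Lane {
    pub fn enqueue(&mut self) -> i64 {
        let seq = self.next_seq;
        self.ready.insert(seq);
        self.next_seq += 1;
        seq
    }

    /// Admin delete of an unclaimed entry. Only a delete at the claim head
    /// advances claim_seq; a non-head delete leaves a gap behind.
    pub fn admin_delete(&mut self, seq: i64) {
        if self.ready.remove(&seq) && seq == self.claim_seq {
            self.claim_seq += 1;
        }
    }

    /// Claim the next entry, first skipping any gap left by earlier deletes
    /// (the modelled gap-recovery step).
    pub fn claim(&mut self) -> Option<i64> {
        while self.claim_seq < self.next_seq && !self.ready.contains(&self.claim_seq) {
            self.claim_seq += 1;
        }
        if self.ready.remove(&self.claim_seq) {
            let seq = self.claim_seq;
            self.claim_seq += 1;
            Some(seq)
        } else {
            None
        }
    }

    /// Cheap hot-path approximation: head difference clamped at zero.
    pub fn approx(&self) -> i64 {
        (self.next_seq - self.claim_seq).max(0)
    }

    /// Exact count: entries still present with lane_seq >= claim_seq.
    pub fn exact(&self) -> i64 {
        self.ready.range(self.claim_seq..).count() as i64
    }
}
```

In this model, enqueueing three entries and deleting the middle one leaves `approx()` one above `exact()`; the gap closes on the claim attempt that reaches it. The approximation never falls below the exact count, which matches the contract the reframed drift test asserts.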

Test plan

`cargo build --workspace`
`cargo clippy --workspace --tests -- -D warnings` clean
`cargo fmt --all`
`cargo test -p awa --test queue_storage_runtime_test` — 57/57 pass on pg18 local
Bloat A/B: `queue_lanes` idle dead-tuple peak 30k+ → 0, deterministic over 3 iterations
Throughput A/B: v016 stable; prefix stalls in 4 of 6 cells under macOS docker pressure
Confirmation on the NixOS sweep box (more stable hardware) — optional given the bloat metric is deterministic and v016 was strictly stabler in every cell

Summary by CodeRabbit

Performance
- Optimized queue availability calculation by eliminating a cached counter and deriving availability from sequence cursors. Reduces database dead-tuple overhead in high-load workloads without affecting the public API.
Tests
- Updated queue availability assertions across test suites to reflect the new calculation method.
Documentation
- Updated architectural decision records documenting queue availability computation approach.

Removes the redundant available_count cache on queue_lanes that the 2026-05-10 investigation traced as the dominant bloat source under pinned xmin (~40k dead tuples in a 4 minute window, ratio 2,500x over live row count, driving the idle_in_tx scenario to 69% of clean throughput in the local repro).

available_count is mathematically equal to queue_enqueue_heads.next_seq - queue_claim_heads.claim_seq for the same (queue, priority). The two head tables already track every enqueue and every claim; the cache was a third counter that had to be UPDATEd on every claim, enqueue, and completion batch, tripling the UPDATE rate on queue_lanes versus its siblings and making it the worst MVCC pressure source on the hot path.

Hot path (queue_claimer_signal) now derives the count cheaply from `sum(GREATEST(next_seq - claim_seq, 0))` over the head tables: two PK reads per lane, O(few rows). Tolerates a transient over-count after admin DELETE of non-head lanes: the dispatcher's gap-recovery branch in claim_ready_runtime absorbs the drift on the next claim attempt.

Admin API (queue_counts_exact) scans `ready_entries WHERE lane_seq >= claim_seq` for the exact count, which is what tests assert and what the UI reads. Not on any hot path.

Code changes:

- v016 migration: ALTER TABLE DROP COLUMN; CREATE OR REPLACE the insert_job_compat / delete_job_compat functions byte-identically to v013 minus their available_count UPDATEs. delete_job_compat gains a head-advance UPDATE for the head-lane delete case so the cheap hot-path approximation doesn't have to wait for gap-recovery.
- queue_storage.rs: drop the column from prepare_schema's CREATE TABLE; remove the post-install backfill; remove the queue_lanes UPDATE from claim_ready_runtime's PL/pgSQL (keeping the gap-recovery branch); delete the adjust_lane_counts / adjust_lane_counts_batch helpers (all callers passed pruned_completed_delta=0 so the family was dead code after available_count was dropped); rewrite queue_claimer_signal to use the head-difference formula; rewrite queue_counts_exact to scan ready_entries; cancel_job_tx gains the same head-advance for the head-lane case as delete_job_compat.
- migrations.rs: CURRENT_VERSION 15 -> 16; new V16_UP constant.

Tests:

- queue_storage_runtime_test::test_available_count_matches_ready_entries_scan reframed to two contracts: the hot-path approximation must never under-count vs scan, and the admin API must exactly equal scan. The previous "scan == cache" assertion no longer applies because there is no cache.
- test_queue_storage_queue_counts_reads_legacy_lane_rollups_and_backfills_them: remove available_count from the seed INSERT.
- 4 sites in queue_storage_copy_test.rs and 2 in awa-python tests: rewrite the inline SQL to use the head-difference formula.
- 3 sites (test_dlq.py x2, test_cli.py x1): delete the lane_fix CTE that updated available_count.
- awa-ui seed.ts: drop the column from the INSERT.

All 57 queue_storage_runtime tests pass on pg18. Workspace builds clean with -D warnings. Bench A/B in flight to measure the idle_in_tx ratio recovery.

coderabbitai · 2026-05-10T22:49:39Z

Caution

Review failed

The pull request is closed.

📥 Commits

Reviewing files that changed from the base of the PR and between 79ad31b and e257e53.

📒 Files selected for processing (13)

CHANGELOG.md
awa-model/migrations/v016_drop_queue_lanes_available_count.sql
awa-model/src/migrations.rs
awa-model/src/queue_storage.rs
awa-model/tests/queue_storage_copy_test.rs
awa-python/tests/test_awa.py
awa-python/tests/test_cli.py
awa-python/tests/test_dlq.py
awa-python/tests/test_sync.py
awa-ui/frontend/e2e/seed.ts
awa/tests/queue_storage_runtime_test.rs
docs/adr/008-copy-batch-ingestion.md
docs/adr/019-queue-storage-redesign.md

Walkthrough

Migration v016 drops the queue_lanes.available_count cache column and refactors insert_job_compat and delete_job_compat to advance head cursors instead. Queue storage derives dispatcher availability from queue_enqueue_heads.next_seq - queue_claim_heads.claim_seq, removes lane-count maintenance, and adds gap-recovery logic in the claim path. All tests and seeds are updated to use the cursor-derivation formula.

Changes

Queue Availability Refactoring: Drop Lane Cache, Derive from Cursors

Layer / File(s)	Summary
Migration Registry and Version `awa-model/src/migrations.rs`	`CURRENT_VERSION` bumped to 16; new v016 migration entry and `V16_UP` constant added.
Schema Migration and Function Replacements `awa-model/migrations/v016_drop_queue_lanes_available_count.sql`	Drops `available_count` column from `queue_lanes`; replaces `insert_job_compat` to advance `queue_enqueue_heads.next_seq` and `delete_job_compat` to update `queue_claim_heads.claim_seq` only at claim head; records schema version 16.
Availability Derivation in Dispatcher `awa-model/src/queue_storage.rs`	`AvailableSignal` and `queue_claimer_signal` now sum `GREATEST(next_seq - claim_seq, 0)` across head tables; `queue_counts_exact` scans `ready_entries` joined to `queue_claim_heads` for exact count.
Claim and Enqueue Head Management `awa-model/src/queue_storage.rs`	`claim_ready_runtime` adds gap-recovery branch to advance `claim_seq` when no rows claimed but head gap exists; `cancel_job_tx` advances `claim_seq` only when cancelling lane exactly at claim head.
Removal of Lane-Count Maintenance `awa-model/src/queue_storage.rs`	Removes `adjust_lane_counts` and `adjust_lane_counts_batch` helper functions; deletes lane-count adjustment calls from `insert_ready_rows_tx`, `insert_ready_rows_copy_tx`, `insert_existing_ready_rows_tx`, `retry_job_tx`, and `age_waiting_priorities`.
Schema Installation and Inline Documentation `awa-model/src/queue_storage.rs`	`prepare_schema` updates comment on dropped `queue_count_snapshots` table; `desired_queue_claimer_target` and `queue_claimer_signal` doc comments revised for cursor-derived availability and bounded drift.
Rust Integration Tests `awa-model/tests/queue_storage_copy_test.rs`, `awa/tests/queue_storage_runtime_test.rs`	COPY/batch/density tests derive `available_count` via `GREATEST(qe.next_seq - qc.claim_seq, 0)` join; drift test computes `derived_approx` from head tables and asserts it never under-counts the legacy `ready_entries` scan.
Python Test Suite `awa-python/tests/test_awa.py`, `test_cli.py`, `test_dlq.py`, `test_sync.py`	Enqueue/sync tests compute availability from cursor difference; DLQ helpers remove `lane_fix` CTE that updated `queue_lanes.available_count` during test setup.
E2E Seed Data `awa-ui/frontend/e2e/seed.ts`	`queue_lanes` `INSERT` omits `available_count` column; rows specify only `queue`, `priority`, `next_seq`, and `claim_seq`.
Changelog and ADR Updates `CHANGELOG.md`, `docs/adr/008-copy-batch-ingestion.md`, `docs/adr/019-queue-storage-redesign.md`	Changelog notes v016 performance improvement; ADR-008 clarifies remaining hot-lane write is `next_seq` bump; ADR-019 defines cursor-based availability and `ready_entries` scan for exact counts.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

hardbyte/awa#181: Introduces the insert_job_compat scaffold and early cursor-based sequencing that this PR builds upon to drop lane counters entirely.
hardbyte/awa#206: Also modifies queue-storage enqueue paths and lane-counting logic, affecting the same state transitions and cursor advancement patterns.
hardbyte/awa#248: Concurrently updates ADR-019 and queue-availability narrative to reflect shift from cached counter to head-table-derived availability.

Poem

🐰 A counter once cached on each lane,
Dead tuples piled up like a train.
But heads lead the way—
Next seq minus claim, hooray!
We've dropped the cache, sanity's gained. ✨

CI surfaced two issues:

1. v016 migration didn't record itself in awa.schema_version, so migrations::run kept thinking version 16 was pending and the idempotency test failed (current_version returned 15, not 16). Every prior migration ends with an `INSERT INTO awa.schema_version (version, description) ... ON CONFLICT DO NOTHING`; v016 was missing it. Added.
2. test_dlq.py x2 and test_cli.py x1: when I removed the `lane_fix` CTE that UPDATEd queue_lanes.available_count, I left the trailing comma after the previous `released AS (...)` block, producing `WITH ... released AS (...), INSERT INTO ...`, which is a syntax error. Replaced `),` with `)` at the three sites.

Local verification: test_c13_migration_idempotent passes against fresh pg18. Pushing for CI to confirm the Python suite recovers.
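The first CI failure can be reproduced in miniature. The sketch below is a hypothetical model, not the code in `migrations.rs`: `SchemaVersions`, `Migration`, and `run` are invented names, and the runner is assumed to apply every migration whose version is above the highest recorded one.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the awa.schema_version table.
#[derive(Default)]
pub struct SchemaVersions {
    recorded: BTreeSet<u32>,
}

impl SchemaVersions {
    /// Highest recorded version, or 0 when nothing has been applied yet.
    pub fn current_version(&self) -> u32 {
        self.recorded.iter().next_back().copied().unwrap_or(0)
    }
}

/// A migration reduced to its version and whether its SQL ends by
/// inserting its own row into the version table.
pub struct Migration {
    pub version: u32,
    pub records_itself: bool,
}

/// Apply every migration above the recorded version and return the
/// versions that actually ran.
pub fn run(db: &mut SchemaVersions, migrations: &[Migration]) -> Vec<u32> {
    let mut applied = Vec::new();
    for m in migrations {
        if m.version > db.current_version() {
            applied.push(m.version);
            if m.records_itself {
                // Models INSERT INTO awa.schema_version ... ON CONFLICT DO NOTHING.
                db.recorded.insert(m.version);
            }
        }
    }
    applied
}
```

With `records_itself: false`, a second `run` applies version 16 again and `current_version` stays at 15, which is the symptom the idempotency test caught.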

Strip "v016:" prefixes, date stamps, and "after the X investigation" phrasing from comments added with the migration. Promote `queue_claimer_signal` and `queue_counts_exact` to /// doc comments describing the head-difference approximation vs the ready_entries scan, when each is used, and the drift behaviour under admin deletes. Rewrite the drift-detection test's docstring around the head-table invariants (enqueue bumps next_seq, claim/cancel bump claim_seq, gap-recovery closes mid-ring gaps) so the test reads as a coverage table for the lifecycle rather than a counter-maintenance audit. No behaviour changes.

- ADR-019 §lane_state: replace the stale paragraph that described the cache as the current implementation; document the two-grade availability scheme (head-difference for the hot path, ready_entries scan for the admin API) and capture the cache as a historical iteration superseded by v016.
- ADR-008: rewrite the "lane counters remain online" caveat to reflect that only the next_seq cursor bump is on the enqueue hot path now; cross-link to ADR-019.
- CHANGELOG: add an Unreleased Performance entry for v016.

Bench-dated ADRs under docs/adr/bench/ are left as-is; they are historical artifacts of specific benchmark runs.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e257e539e7

chatgpt-codex-connector · 2026-05-11T01:25:40Z

+        EXECUTE format(
+            'ALTER TABLE %I.queue_lanes DROP COLUMN IF EXISTS available_count',
+            v_schema


Avoid dropping the column in a rolling migration

When this migration is applied while any v15 workers are still running, their enqueue/claim/delete paths and the old compat functions still issue UPDATE queue_lanes SET available_count = ..., so this ALTER TABLE ... DROP COLUMN makes those workers fail until every process is upgraded and functions are replaced. This also contradicts the migration policy in awa-model/src/migrations.rs that incremental migrations must be additive-only to avoid breaking running workers; either keep the column through a compatibility window or make this a documented stop-the-world/major upgrade.

chatgpt-codex-connector · 2026-05-11T01:25:40Z

-            SELECT COALESCE(sum(available_count), 0)::bigint
-            FROM {schema}.queue_lanes
-            WHERE queue = ANY($1)
+            SELECT COALESCE(sum(GREATEST(qe.next_seq - qc.claim_seq, 0)), 0)::bigint


Close deleted tail gaps before using the head-derived signal

For queues where admins cancel/delete all unclaimed rows at or beyond the next claim head after an earlier head row remains, this derived signal can stay positive forever: after the remaining head is claimed, claim_seq points at a deleted gap while next_seq is still higher, but claim_ready_runtime first joins through a LATERAL ready_entries candidate and returns before its gap-recovery branch when no row exists. Previously available_count was decremented by those deletes; now the dispatcher will keep waking/claiming an empty queue until a future enqueue happens to advance the cursor.
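The scenario in this review comment can be sketched as a small model. Everything here is hypothetical: `LaneState` and the three functions are illustrative names, the early-return claim follows the reviewer's description rather than the actual PL/pgSQL, and the tail-recovery variant is one possible remedy under the assumption of no concurrent enqueue, not the fix that was merged.

```rust
/// Hypothetical lane state; field names follow the PR, the struct is not awa code.
pub struct LaneState {
    pub next_seq: i64,
    pub claim_seq: i64,
    /// lane_seq values still present in ready_entries for this lane.
    pub ready: Vec<i64>,
}

/// Head-derived dispatcher signal: GREATEST(next_seq - claim_seq, 0).
pub fn signal(lane: &LaneState) -> i64 {
    (lane.next_seq - lane.claim_seq).max(0)
}

/// Claim attempt as the review describes it: look for a candidate row first
/// and return early, before any gap recovery, when none exists.
pub fn claim_early_return(lane: &mut LaneState) -> Option<i64> {
    let candidate = lane
        .ready
        .iter()
        .copied()
        .filter(|seq| *seq >= lane.claim_seq)
        .min()?;
    lane.ready.retain(|seq| *seq != candidate);
    lane.claim_seq = candidate + 1;
    Some(candidate)
}

/// Assumed remedy: when no candidate exists, every sequence in
/// [claim_seq, next_seq) is gone, so move claim_seq up to next_seq.
pub fn claim_with_tail_recovery(lane: &mut LaneState) -> Option<i64> {
    let claimed = claim_early_return(lane);
    if claimed.is_none() {
        lane.claim_seq = lane.claim_seq.max(lane.next_seq);
    }
    claimed
}
```

Starting from `next_seq = 3`, `claim_seq = 0`, and only sequence 0 left in `ready` (sequences 1 and 2 deleted by an admin), the early-return claim leaves `signal` at 2 after the head is claimed, while the tail-recovery variant brings it back to 0 on the next empty attempt.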

hardbyte added 3 commits May 11, 2026 12:02

hardbyte marked this pull request as ready for review May 11, 2026 01:22

hardbyte merged commit b46a5c2 into main May 11, 2026
12 of 13 checks passed

hardbyte deleted the feat/drop-queue-lanes-available-count branch May 11, 2026 01:23

chatgpt-codex-connector Bot reviewed May 11, 2026

This was referenced May 12, 2026

0.6 release readiness: align TLA+ models, docs, and queue-storage replacement gates #197

Open

Validate MVCC / dead-tuple behaviour under sustained load #169

Open

coderabbitai Bot mentioned this pull request May 17, 2026

Fix queue-storage shard joins and nightly chaos races #261

Merged

hardbyte merged 4 commits into main from feat/drop-queue-lanes-available-count
