v016: drop queue_lanes.available_count; derive from heads #251

Merged
hardbyte merged 4 commits into main from feat/drop-queue-lanes-available-count on May 11, 2026

@hardbyte hardbyte commented May 10, 2026

Summary

Removes the redundant `available_count` cache from `queue_lanes`. The 2026-05-10 investigation traced this column as the dominant bloat source under pinned xmin — every claim, enqueue, and completion batch was UPDATEing it, tripling `queue_lanes`'s mutation rate vs its sibling counter tables. The value is mathematically identical to `queue_enqueue_heads.next_seq − queue_claim_heads.claim_seq`, which the two head tables already maintain; storing a third counter just to denormalise a difference was pure cost.
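As a rough illustration of the redundancy described above, here is a toy Python model (not the project's code). It assumes one UPDATE per table per operation and a hypothetical `Lane` class; it checks that the cached value always equals the head difference and counts how often each table is written.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Lane:
    """Toy model of one (queue, priority) lane; names mirror the PR, logic is assumed."""
    next_seq: int = 0         # queue_enqueue_heads.next_seq
    claim_seq: int = 0        # queue_claim_heads.claim_seq
    available_count: int = 0  # the cached column that v016 drops
    updates: Counter = field(default_factory=Counter)  # UPDATEs issued per table

    def enqueue(self, n: int = 1) -> None:
        self.next_seq += n
        self.updates["queue_enqueue_heads"] += 1
        self.available_count += n          # pre-v016 cache maintenance
        self.updates["queue_lanes"] += 1

    def claim(self, n: int = 1) -> None:
        n = min(n, self.next_seq - self.claim_seq)
        self.claim_seq += n
        self.updates["queue_claim_heads"] += 1
        self.available_count -= n          # pre-v016 cache maintenance
        self.updates["queue_lanes"] += 1

    def complete_batch(self) -> None:
        # Completion batches also touched queue_lanes without moving either head.
        self.updates["queue_lanes"] += 1

    def derived_available(self) -> int:
        return self.next_seq - self.claim_seq


lane = Lane()
for _ in range(100):
    lane.enqueue()
    lane.claim()
    lane.complete_batch()
    # The cache never carries information the two heads do not already have.
    assert lane.available_count == lane.derived_available()

print(dict(lane.updates))
# {'queue_enqueue_heads': 100, 'queue_lanes': 300, 'queue_claim_heads': 100}
```

Under these assumptions `queue_lanes` takes three writes for every one on each head table, which is the tripling the summary refers to.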

Results — noise-controlled A/B

Three iterations of two cells (idle_in_tx; W=64 steady), interleaving prefix (the pre-fix baseline) and v016 so any time-correlated host noise hits both alike. Local docker pg18.

queue_lanes dead-tuple peak during idle_in_tx — deterministic

| iter | prefix | v016 |
|---|---|---|
| 1 | 30,926 | 0 |
| 2 | 35,956 | 0 |
| 3 | 35,442 | 0 |

The bloat source is gone. Live row count is ~16 and stays that way.
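A deliberately simplified sketch of why the dead-tuple count tracks the UPDATE rate while xmin is pinned. It assumes each UPDATE leaves exactly one dead row version and that vacuum can reclaim them only when no old snapshot is held; `dead_tuples_after` and the 36,000 figure are illustrative, not measured.

```python
def dead_tuples_after(updates: int, xmin_pinned: bool, vacuum_every: int = 1000) -> int:
    """Count dead row versions left after `updates` UPDATE statements.

    Simplified model: every UPDATE leaves one dead version behind. Vacuum runs
    every `vacuum_every` updates, but reclaims nothing while an old snapshot
    (pinned xmin) could still see those versions.
    """
    dead = 0
    for i in range(1, updates + 1):
        dead += 1
        if i % vacuum_every == 0 and not xmin_pinned:
            dead = 0
    return dead


print(dead_tuples_after(36_000, xmin_pinned=True))   # 36000: nothing can be reclaimed
print(dead_tuples_after(36_000, xmin_pinned=False))  # 0: vacuum keeps up
print(dead_tuples_after(0, xmin_pinned=True))        # 0: no UPDATEs, nothing to bloat
```

The last call is the v016 case in this model: with no per-operation UPDATE on `queue_lanes`, a pinned xmin has nothing to hold back on that table.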

Throughput — v016 stable, prefix unstable

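| iter | shape | prefix clean | v016 clean | prefix idle | v016 idle | v016 idle/clean |
|---|---|---|---|---|---|---|
| 1 | idle_in_tx | 592 | 740 | 524 | 707 | 96 % |
| 2 | idle_in_tx | 0 | 1,000 | 0 | 651 | 65 % |
| 3 | idle_in_tx | 0 | 914 | 574 | 789 | 86 % |
| 1 | W=64 steady | 0 | 1,657 | | | |
| 2 | W=64 steady | 0 | 1,505 | | | |
| 3 | W=64 steady | 1,424 | 1,422 | | | |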

Prefix hits a "0 jobs/s" stall in 4 of 6 cells. v016 never does. In the one W=64 cell where prefix didn't stall (iter 3), the two are equivalent (~1,420). The bench harness sampling is the same on both sides, so the stalls are real production-style stalls in the prefix code, made visible by macOS docker's higher MVCC pressure.

The v016 numbers themselves vary across iterations (740–1,000 clean, 651–789 idle, 1,422–1,657 W=64), which is normal docker-on-Mac noise. The point is v016 doesn't stall.
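The derived figures in this section can be recomputed from the raw throughput numbers. The snippet below is only a worked check; the dictionary layout and the rule "a cell stalls if any of its samples is 0 jobs/s" are my reading of the table, not part of the bench harness.

```python
# jobs/s copied from the throughput table: (clean, idle) per idle_in_tx iteration
v016_idle_in_tx = {1: (740, 707), 2: (1000, 651), 3: (914, 789)}
prefix_idle_in_tx = {1: (592, 524), 2: (0, 0), 3: (0, 574)}
prefix_w64 = {1: (0,), 2: (0,), 3: (1424,)}
v016_w64 = {1: (1657,), 2: (1505,), 3: (1422,)}


def idle_over_clean(clean: int, idle: int) -> int:
    """Idle throughput as a whole-number percentage of clean throughput."""
    return round(100 * idle / clean)


def stalled_cells(*shapes: dict) -> int:
    """Count cells in which at least one sample is 0 jobs/s."""
    return sum(1 for shape in shapes for samples in shape.values() if 0 in samples)


ratios = {it: idle_over_clean(clean, idle) for it, (clean, idle) in v016_idle_in_tx.items()}
print(ratios)                                         # {1: 96, 2: 65, 3: 86}
print(stalled_cells(prefix_idle_in_tx, prefix_w64))   # 4 (of 6 cells)
print(stalled_cells(v016_idle_in_tx, v016_w64))       # 0
```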

Code changes

Hot path (`queue_storage.rs`):

  • Drop `available_count` column from `queue_lanes` CREATE TABLE.
  • Remove the column's backfill from `prepare_schema`.
  • Remove the `UPDATE queue_lanes SET available_count = …` from `claim_ready_runtime` PL/pgSQL (the gap-recovery branch stays).
  • Delete `adjust_lane_counts` and `adjust_lane_counts_batch` — every historical caller passed `pruned_completed_delta = 0`, so the entire helper collapsed to a no-op once `available_count` was dropped.
  • Rewrite `queue_claimer_signal` to derive the count from the head-table difference (two PK reads per lane).
  • Rewrite `queue_counts_exact` (admin-grade API) to scan `ready_entries WHERE lane_seq >= claim_seq` for the exact count.
  • `cancel_job_tx` and the v016 `delete_job_compat` each gain a small head-advance UPDATE for the head-lane delete case, so the cheap hot-path approximation doesn't have to wait for gap-recovery.
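For readers who want the shape of the rewritten `queue_claimer_signal` without the SQL, here is a hedged Python equivalent of `sum(GREATEST(next_seq - claim_seq, 0))`. The dictionaries stand in for the two head tables, and treating a missing claim head as 0 is an assumption of this sketch.

```python
from typing import Dict, Iterable, Tuple

LaneKey = Tuple[str, int]  # (queue, priority)


def claimer_signal(
    enqueue_heads: Dict[LaneKey, int],
    claim_heads: Dict[LaneKey, int],
    queues: Iterable[str],
) -> int:
    """Sum the clamped head difference over every lane of the requested queues."""
    wanted = set(queues)
    total = 0
    for (queue, priority), next_seq in enqueue_heads.items():
        if queue not in wanted:
            continue
        # Assumption: a lane with no claim head yet has claimed nothing.
        claim_seq = claim_heads.get((queue, priority), 0)
        # GREATEST(..., 0): a lane can contribute nothing, never a negative amount.
        total += max(next_seq - claim_seq, 0)
    return total


enqueue_heads = {("email", 1): 10, ("email", 2): 4, ("reports", 1): 7}
# ("reports", 1) is deliberately inconsistent to show the clamp.
claim_heads = {("email", 1): 7, ("email", 2): 4, ("reports", 1): 9}

print(claimer_signal(enqueue_heads, claim_heads, ["email"]))             # 3
print(claimer_signal(enqueue_heads, claim_heads, ["email", "reports"]))  # 3
```

Each lane costs one lookup in each head table, which matches the "two PK reads per lane" description.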

Migration (`v016_drop_queue_lanes_available_count.sql`):

  • `ALTER TABLE queue_lanes DROP COLUMN available_count`.
  • `CREATE OR REPLACE FUNCTION` for `insert_job_compat` and `delete_job_compat`: signatures byte-identical to v013's, bodies unchanged apart from dropping their `available_count` UPDATEs (plus the head-advance UPDATE in `delete_job_compat` described above).
  • `migrations.rs` bumps `CURRENT_VERSION 15 → 16`.

Tests (no production code path uses the dropped column):

  • `test_available_count_matches_ready_entries_scan` reframed against two contracts: hot-path approximation never under-counts vs scan; admin API exactly equals scan.
  • 7 inline-SQL sites in Rust + Python tests + the awa-ui seed: rewrite to head-difference or just drop the column.
  • 3 lane_fix CTE sites in Python tests: deleted (no column to update).

Trade

The cheap hot-path approximation (`next_seq − claim_seq`) over-counts by 1 each time an admin operation deletes an unclaimed non-head lane, until the next claim attempt on that lane hits the gap-recovery branch in `claim_ready_runtime` and catches `claim_seq` up. The dispatcher's wake-up signal is tolerant of brief over-counts — an over-eager wake-up is harmless. Admin tools that need exactness route through `queue_counts_exact`, which scans `ready_entries`.
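The trade can be traced on a single lane. The model below is hypothetical: `LaneModel`, its method names, and the exact form of gap recovery (jump to the lowest surviving `lane_seq` at or beyond `claim_seq`) are assumptions made for illustration. It also exercises the two test contracts: the approximation never under-counts the scan, and the exact count equals the scan.

```python
class LaneModel:
    """Hypothetical single-lane model: two head cursors plus the ready set."""

    def __init__(self) -> None:
        self.next_seq = 0   # enqueue head: next lane_seq to hand out
        self.claim_seq = 0  # claim head: lowest lane_seq not yet claimed
        self.ready = set()  # lane_seq values still present in ready_entries

    def enqueue(self) -> int:
        seq = self.next_seq
        self.ready.add(seq)
        self.next_seq += 1
        return seq

    def admin_delete(self, seq: int) -> None:
        self.ready.discard(seq)
        if seq == self.claim_seq:
            self.claim_seq += 1  # head-lane case: advance the head immediately

    def claim(self):
        candidates = [s for s in self.ready if s >= self.claim_seq]
        if not candidates:
            return None
        seq = min(candidates)    # stepping over a deleted gap catches claim_seq up
        self.ready.remove(seq)
        self.claim_seq = seq + 1
        return seq

    def approx(self) -> int:     # cheap hot-path signal
        return max(self.next_seq - self.claim_seq, 0)

    def exact(self) -> int:      # admin-grade scan of ready_entries
        return sum(1 for s in self.ready if s >= self.claim_seq)


lane = LaneModel()
for _ in range(3):
    lane.enqueue()                       # lane_seq 0, 1, 2

lane.admin_delete(1)                     # non-head delete
print(lane.approx(), lane.exact())       # 3 2  (over-count by 1)
lane.claim()                             # claims 0
print(lane.approx(), lane.exact())       # 2 1  (still over by 1)
lane.claim()                             # gap at 1 is skipped, claims 2
print(lane.approx(), lane.exact())       # 0 0  (drift absorbed)
```

At every step `approx() >= exact()`, so in this model the dispatcher can be woken spuriously but is never left asleep with claimable work.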

Test plan

  • `cargo build --workspace`
  • `cargo clippy --workspace --tests -- -D warnings` clean
  • `cargo fmt --all`
  • `cargo test -p awa --test queue_storage_runtime_test` — 57/57 pass on pg18 local
  • Bloat A/B: `queue_lanes` idle dead-tuple peak 30k+ → 0, deterministic over 3 iterations
  • Throughput A/B: v016 stable; prefix stalls in 4 of 6 cells under macOS docker pressure
  • Confirmation on the NixOS sweep box (more stable hardware) — optional given the bloat metric is deterministic and v016 was strictly stabler in every cell

Summary by CodeRabbit

  • Performance

    • Optimized queue availability calculation by eliminating a cached counter and deriving availability from sequence cursors. Reduces database dead-tuple overhead in high-load workloads without affecting the public API.
  • Tests

    • Updated queue availability assertions across test suites to reflect the new calculation method.
  • Documentation

    • Updated architectural decision records documenting queue availability computation approach.

Removes the redundant available_count cache on queue_lanes that
the 2026-05-10 investigation traced as the dominant bloat source
under pinned xmin (~40k dead tuples in a 4 minute window, ratio
2,500x over live row count, driving the idle_in_tx scenario to
69% of clean throughput in the local repro).

available_count is mathematically equal to
queue_enqueue_heads.next_seq - queue_claim_heads.claim_seq for the
same (queue, priority). The two head tables already track every
enqueue and every claim; the cache was a third counter that had
to be UPDATEd on every claim, enqueue, and completion batch,
tripling the UPDATE rate on queue_lanes versus its siblings and
making it the worst MVCC pressure source on the hot path.

Hot path (queue_claimer_signal) now derives the count cheaply
from `sum(GREATEST(next_seq - claim_seq, 0))` over the head
tables — two PK reads per lane, O(few rows). Tolerates a transient
over-count after admin DELETE of non-head lanes: the dispatcher's
gap-recovery branch in claim_ready_runtime absorbs the drift on
the next claim attempt.

Admin API (queue_counts_exact) scans `ready_entries WHERE
lane_seq >= claim_seq` for the exact count — what tests assert,
what the UI reads. Not on any hot path.

Code changes:
- v016 migration: ALTER TABLE DROP COLUMN; CREATE OR REPLACE the
  insert_job_compat / delete_job_compat functions byte-identically
  to v013 minus their available_count UPDATEs. delete_job_compat
  gains a head-advance UPDATE for the head-lane delete case so the
  cheap hot-path approximation doesn't have to wait for
  gap-recovery.
- queue_storage.rs: drop the column from prepare_schema's CREATE
  TABLE; remove the post-install backfill; remove the queue_lanes
  UPDATE from claim_ready_runtime's PL/pgSQL (keeping the
  gap-recovery branch); delete the adjust_lane_counts /
  adjust_lane_counts_batch helpers (all callers passed
  pruned_completed_delta=0 so the family was dead code after
  available_count was dropped); rewrite queue_claimer_signal to
  use the head-difference formula; rewrite queue_counts_exact to
  scan ready_entries; cancel_job_tx gains the same head-advance
  for the head-lane case as delete_job_compat.
- migrations.rs: CURRENT_VERSION 15 -> 16; new V16_UP constant.

Tests:
- queue_storage_runtime_test::test_available_count_matches_ready_entries_scan
  reframed to two contracts: the hot-path approximation must never
  under-count vs scan, and the admin API must exactly equal scan.
  The previous "scan == cache" assertion no longer applies — there
  is no cache.
- test_queue_storage_queue_counts_reads_legacy_lane_rollups_and_backfills_them:
  remove available_count from the seed INSERT.
- 4 sites in queue_storage_copy_test.rs and 2 in awa-python tests:
  rewrite the inline SQL to use the head-difference formula.
- 3 sites (test_dlq.py x2, test_cli.py x1): delete the lane_fix
  CTE that updated available_count.
- awa-ui seed.ts: drop the column from the INSERT.

All 57 queue_storage_runtime tests pass on pg18. Workspace builds
clean with -D warnings. Bench A/B in flight to measure the
idle_in_tx ratio recovery.

coderabbitai Bot commented May 10, 2026

Caution

Review failed

The pull request is closed.

📥 Commits

Reviewing files that changed from the base of the PR and between 79ad31b and e257e53.

📒 Files selected for processing (13)
  • CHANGELOG.md
  • awa-model/migrations/v016_drop_queue_lanes_available_count.sql
  • awa-model/src/migrations.rs
  • awa-model/src/queue_storage.rs
  • awa-model/tests/queue_storage_copy_test.rs
  • awa-python/tests/test_awa.py
  • awa-python/tests/test_cli.py
  • awa-python/tests/test_dlq.py
  • awa-python/tests/test_sync.py
  • awa-ui/frontend/e2e/seed.ts
  • awa/tests/queue_storage_runtime_test.rs
  • docs/adr/008-copy-batch-ingestion.md
  • docs/adr/019-queue-storage-redesign.md

Walkthrough

Migration v016 drops the queue_lanes.available_count cache column and refactors insert_job_compat and delete_job_compat to advance head cursors instead. Queue storage derives dispatcher availability from queue_enqueue_heads.next_seq - queue_claim_heads.claim_seq, removes lane-count maintenance, and adds gap-recovery logic in the claim path. All tests and seeds are updated to use the cursor-derivation formula.

Changes

Queue Availability Refactoring: Drop Lane Cache, Derive from Cursors

| Layer | File(s) | Summary |
|---|---|---|
| Migration Registry and Version | awa-model/src/migrations.rs | CURRENT_VERSION bumped to 16; new v016 migration entry and V16_UP constant added. |
| Schema Migration and Function Replacements | awa-model/migrations/v016_drop_queue_lanes_available_count.sql | Drops available_count column from queue_lanes; replaces insert_job_compat to advance queue_enqueue_heads.next_seq and delete_job_compat to update queue_claim_heads.claim_seq only at claim head; records schema version 16. |
| Availability Derivation in Dispatcher | awa-model/src/queue_storage.rs | AvailableSignal and queue_claimer_signal now sum GREATEST(next_seq - claim_seq, 0) across head tables; queue_counts_exact scans ready_entries joined to queue_claim_heads for exact count. |
| Claim and Enqueue Head Management | awa-model/src/queue_storage.rs | claim_ready_runtime adds gap-recovery branch to advance claim_seq when no rows claimed but head gap exists; cancel_job_tx advances claim_seq only when cancelling lane exactly at claim head. |
| Removal of Lane-Count Maintenance | awa-model/src/queue_storage.rs | Removes adjust_lane_counts and adjust_lane_counts_batch helper functions; deletes lane-count adjustment calls from insert_ready_rows_tx, insert_ready_rows_copy_tx, insert_existing_ready_rows_tx, retry_job_tx, and age_waiting_priorities. |
| Schema Installation and Inline Documentation | awa-model/src/queue_storage.rs | prepare_schema updates comment on dropped queue_count_snapshots table; desired_queue_claimer_target and queue_claimer_signal doc comments revised for cursor-derived availability and bounded drift. |
| Rust Integration Tests | awa-model/tests/queue_storage_copy_test.rs, awa/tests/queue_storage_runtime_test.rs | COPY/batch/density tests derive available_count via GREATEST(qe.next_seq - qc.claim_seq, 0) join; drift test computes derived_approx from head tables and asserts it never under-counts the legacy ready_entries scan. |
| Python Test Suite | awa-python/tests/test_awa.py, test_cli.py, test_dlq.py, test_sync.py | Enqueue/sync tests compute availability from cursor difference; DLQ helpers remove lane_fix CTE that updated queue_lanes.available_count during test setup. |
| E2E Seed Data | awa-ui/frontend/e2e/seed.ts | queue_lanes INSERT omits available_count column; rows specify only queue, priority, next_seq, and claim_seq. |
| Changelog and ADR Updates | CHANGELOG.md, docs/adr/008-copy-batch-ingestion.md, docs/adr/019-queue-storage-redesign.md | Changelog notes v016 performance improvement; ADR-008 clarifies remaining hot-lane write is next_seq bump; ADR-019 defines cursor-based availability and ready_entries scan for exact counts. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • hardbyte/awa#181: Introduces the insert_job_compat scaffold and early cursor-based sequencing that this PR builds upon to drop lane counters entirely.
  • hardbyte/awa#206: Also modifies queue-storage enqueue paths and lane-counting logic, affecting the same state transitions and cursor advancement patterns.
  • hardbyte/awa#248: Concurrently updates ADR-019 and queue-availability narrative to reflect shift from cached counter to head-table-derived availability.

Poem

🐰 A counter once cached on each lane,
Dead tuples piled up like a train.
But heads lead the way—
Next seq minus claim, hooray!
We've dropped the cache, sanity's gained.

hardbyte added 3 commits May 11, 2026 12:02
CI surfaced two issues:

1. v016 migration didn't record itself in awa.schema_version, so
   migrations::run kept thinking version 16 was pending and the
   idempotency test failed (current_version returned 15, not 16).
   Every prior migration ends with an `INSERT INTO awa.schema_version
   (version, description) ... ON CONFLICT DO NOTHING`; v016 was
   missing it. Added.

2. test_dlq.py x2 and test_cli.py x1: when I removed the `lane_fix`
   CTE that UPDATEd queue_lanes.available_count, I left the trailing
   comma after the previous `released AS (...)` block, producing
   `WITH ... released AS (...), INSERT INTO ...` — syntax error.
   Replaced `),` with `)` at the three sites.
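The first issue can be reproduced in miniature. This sketch assumes a runner that applies every migration numbered above the recorded version; `run`, `current_version`, and the dictionary-based "database" are stand-ins, not the real `migrations::run`.

```python
def fresh_db() -> dict:
    """A stand-in database already migrated to version 15."""
    return {
        "schema_version": {15},
        "queue_lanes_columns": {"queue", "priority", "available_count"},
    }


def current_version(db: dict) -> int:
    return max(db["schema_version"])


def run(db: dict, migrations: dict) -> list:
    """Apply, in order, every migration above the recorded version."""
    applied = []
    for version, up in sorted(migrations.items()):
        if version > current_version(db):
            up(db)
            applied.append(version)
    return applied


def v016_without_record(db: dict) -> None:
    db["queue_lanes_columns"].discard("available_count")  # DROP COLUMN IF EXISTS


def v016_with_record(db: dict) -> None:
    v016_without_record(db)
    db["schema_version"].add(16)  # INSERT ... ON CONFLICT DO NOTHING


broken, fixed = fresh_db(), fresh_db()
print(run(broken, {16: v016_without_record}), run(broken, {16: v016_without_record}))
# [16] [16]  (still pending on the second run; current_version stays 15)
print(run(fixed, {16: v016_with_record}), run(fixed, {16: v016_with_record}))
# [16] []    (idempotent; current_version is 16)
```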

Local verification: test_c13_migration_idempotent passes against
fresh pg18. Pushing for CI to confirm the Python suite recovers.
Strip "v016:" prefixes, date stamps, and "after the X investigation"
phrasing from comments added with the migration. Promote
`queue_claimer_signal` and `queue_counts_exact` to /// doc comments
describing the head-difference approximation vs the ready_entries
scan, when each is used, and the drift behaviour under admin deletes.

Rewrite the drift-detection test's docstring around the head-table
invariants (enqueue bumps next_seq, claim/cancel bump claim_seq,
gap-recovery closes mid-ring gaps) so the test reads as a coverage
table for the lifecycle rather than a counter-maintenance audit.

No behaviour changes.
- ADR-019 §lane_state: replace the stale paragraph that described
  the cache as the current implementation; document the two-grade
  availability scheme (head-difference for the hot path, ready_entries
  scan for the admin API) and capture the cache as a historical
  iteration superseded by v016.
- ADR-008: rewrite the "lane counters remain online" caveat to
  reflect that only the next_seq cursor bump is on the enqueue hot
  path now; cross-link to ADR-019.
- CHANGELOG: add an Unreleased Performance entry for v016.

Bench-dated ADRs under docs/adr/bench/ are left as-is — they are
historical artifacts of specific benchmark runs.
@hardbyte hardbyte marked this pull request as ready for review May 11, 2026 01:22
@hardbyte hardbyte merged commit b46a5c2 into main May 11, 2026
12 of 13 checks passed
@hardbyte hardbyte deleted the feat/drop-queue-lanes-available-count branch May 11, 2026 01:23

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e257e539e7

Comment on lines +41 to +43
EXECUTE format(
'ALTER TABLE %I.queue_lanes DROP COLUMN IF EXISTS available_count',
v_schema

P1: Avoid dropping the column in a rolling migration

When this migration is applied while any v15 workers are still running, their enqueue/claim/delete paths and the old compat functions still issue UPDATE queue_lanes SET available_count = ..., so this ALTER TABLE ... DROP COLUMN makes those workers fail until every process is upgraded and functions are replaced. This also contradicts the migration policy in awa-model/src/migrations.rs that incremental migrations must be additive-only to avoid breaking running workers; either keep the column through a compatibility window or make this a documented stop-the-world/major upgrade.
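The failure mode the reviewer describes is easy to demonstrate in isolation. The sketch below uses Python's built-in SQLite as a stand-in for Postgres, emulates `DROP COLUMN` by rebuilding the table (so it runs on older SQLite builds), and uses a made-up `V15_CLAIM_UPDATE` statement shaped like the `UPDATE queue_lanes SET available_count = ...` the comment mentions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE queue_lanes (queue TEXT, priority INTEGER, available_count INTEGER)"
)
conn.execute("INSERT INTO queue_lanes VALUES ('default', 2, 5)")

# Hypothetical statement an un-upgraded v15 worker would still issue.
V15_CLAIM_UPDATE = (
    "UPDATE queue_lanes SET available_count = available_count - 1 WHERE queue = ?"
)


def v15_worker_claim(connection: sqlite3.Connection, queue: str) -> str:
    """Run the old worker's UPDATE and report whether it still works."""
    try:
        connection.execute(V15_CLAIM_UPDATE, (queue,))
        return "ok"
    except sqlite3.OperationalError as exc:
        return f"failed: {exc}"


before = v15_worker_claim(conn, "default")

# Stand-in for ALTER TABLE ... DROP COLUMN: rebuild the table without the column.
conn.executescript(
    """
    CREATE TABLE queue_lanes_new AS SELECT queue, priority FROM queue_lanes;
    DROP TABLE queue_lanes;
    ALTER TABLE queue_lanes_new RENAME TO queue_lanes;
    """
)

after = v15_worker_claim(conn, "default")
print(before)  # ok
print(after)   # failed, with an error naming the missing column
```

An additive-only migration would leave the old statement working until every worker is upgraded, which is the compatibility window the comment asks for.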

SELECT COALESCE(sum(available_count), 0)::bigint
FROM {schema}.queue_lanes
WHERE queue = ANY($1)
SELECT COALESCE(sum(GREATEST(qe.next_seq - qc.claim_seq, 0)), 0)::bigint

P2: Close deleted tail gaps before using the head-derived signal

For queues where admins cancel/delete all unclaimed rows at or beyond the next claim head after an earlier head row remains, this derived signal can stay positive forever: after the remaining head is claimed, claim_seq points at a deleted gap while next_seq is still higher, but claim_ready_runtime first joins through a LATERAL ready_entries candidate and returns before its gap-recovery branch when no row exists. Previously available_count was decremented by those deletes; now the dispatcher will keep waking/claiming an empty queue until a future enqueue happens to advance the cursor.
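The scenario in this comment can be walked through with a few lines of Python. This is a model of the behaviour as the review describes it (a claim attempt that returns early when no ready row exists), not a reading of the actual `claim_ready_runtime` body; `claim_attempt` and `signal` are hypothetical helpers.

```python
def claim_attempt(ready: set, claim_seq: int) -> int:
    """Return the new claim_seq after one claim attempt (hypothetical model)."""
    candidates = [s for s in ready if s >= claim_seq]
    if not candidates:
        # No ready row to join against: return before any gap recovery runs.
        return claim_seq
    seq = min(candidates)
    ready.remove(seq)
    return seq + 1


def signal(next_seq: int, claim_seq: int) -> int:
    """Head-derived availability for one lane."""
    return max(next_seq - claim_seq, 0)


ready, next_seq, claim_seq = {0, 1, 2}, 3, 0
ready -= {1, 2}                              # admin deletes the tail; head row 0 remains
claim_seq = claim_attempt(ready, claim_seq)  # claims row 0, claim_seq becomes 1

stuck = []
for _ in range(3):                           # the dispatcher keeps waking up
    claim_seq = claim_attempt(ready, claim_seq)
    stuck.append(signal(next_seq, claim_seq))
print(stuck)                                 # [2, 2, 2]: positive signal, empty queue

ready.add(next_seq)                          # a later enqueue gets lane_seq 3
next_seq += 1
claim_seq = claim_attempt(ready, claim_seq)  # claims row 3, claim_seq jumps to 4
print(signal(next_seq, claim_seq))           # 0
```

In this model the over-count is bounded by the number of deleted tail rows, but it only clears when a new enqueue gives the claim path a row to step to.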
