feat(queue_storage): read live terminal counts + queue-leading index + rename to QueueCounts.terminal (#290) by hardbyte · Pull Request #306 · hardbyte/awa

hardbyte · 2026-06-02T07:38:57Z

Summary

Successor to #305 (auto-closed when its base branch — the now-squashed #304 — was deleted). Carries the same two commits, rebased onto current main:

Read switch + rename: queue_counts_exact now reads live terminal counts from queue_terminal_live_counts instead of scanning done_entries. QueueCounts.completed renamed to QueueCounts.terminal — honest name (the count includes failed and cancelled rows, not just completed).
Queue-leading index: CREATE INDEX (queue, priority) ON queue_terminal_live_counts. Without it, the new read path scans every queue's counter rows in multi-queue schemas, undercutting the #290 perf goal.

After this lands, the headline regressor from #169's long_horizon evidence — the O(done_entries) scan inside queue_counts_exact — is gone, and the read path is an indexed range scan over a small denormalised counter.

What ships

Read switch in `queue_counts_exact`

The live_terminal CTE used to be:

SELECT count(*)::bigint AS completed
FROM {schema}.done_entries
WHERE queue = ANY($1)

…O(done_entries-for-queue). Now reads:

SELECT COALESCE(SUM(live_terminal_count), 0)::bigint AS terminal
FROM {schema}.queue_terminal_live_counts
WHERE queue = ANY($1)

…indexed range scan + heap fetch for the matching counter rows (at most queue_slot_count × priorities × enqueue_shards rows per queue).
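The equivalence the read switch relies on can be sketched in plain Rust. This is a minimal in-memory model, not the crate's code: `DoneEntry`, `CounterKey`, `scan_count`, `record_terminal`, and `counter_sum` are hypothetical names, and the model assumes every terminal write also bumps its counter row.

```rust
use std::collections::HashMap;

/// One terminal row, standing in for a row of `done_entries` (hypothetical shape).
#[derive(Debug)]
struct DoneEntry {
    queue: String,
    ready_slot: i32,
    priority: i16,
    enqueue_shard: i16,
}

/// Counter key, mirroring the `(ready_slot, queue, priority, enqueue_shard)` grouping.
type CounterKey = (i32, String, i16, i16);

/// Old read path: walk every terminal row and count those in the requested queues.
fn scan_count(done_entries: &[DoneEntry], queues: &[&str]) -> i64 {
    done_entries
        .iter()
        .filter(|e| queues.iter().any(|q| *q == e.queue.as_str()))
        .count() as i64
}

/// Write path: bump the counter row for the entry's group (the UPSERT analogue).
fn record_terminal(counters: &mut HashMap<CounterKey, i64>, e: &DoneEntry) {
    *counters
        .entry((e.ready_slot, e.queue.clone(), e.priority, e.enqueue_shard))
        .or_insert(0) += 1;
}

/// New read path: sum the counter rows for the requested queues.
/// No matching rows sums to 0, like `COALESCE(SUM(...), 0)`.
fn counter_sum(counters: &HashMap<CounterKey, i64>, queues: &[&str]) -> i64 {
    counters
        .iter()
        .filter(|((_, queue, _, _), _)| queues.iter().any(|q| *q == queue.as_str()))
        .map(|(_, n)| *n)
        .sum()
}
```

As long as every terminal write goes through `record_terminal`, `counter_sum` and `scan_count` agree for any queue list, while the sum only touches one row per group rather than one row per terminal job.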

Queue-leading index

CREATE INDEX IF NOT EXISTS idx_..._queue ON queue_terminal_live_counts (queue, priority). The PK leads with ready_slot (right shape for the row-level UPSERT, wrong shape for the read aggregation), so the new index is what makes the perf claim load-bearing.

Notes on the index choice:

(queue, priority) keeps it narrow while supporting future per-priority drill-down.
No INCLUDE (live_terminal_count): that would block HOT updates on the column the increment/decrement path mutates on every write, re-introducing the v016 bloat shape #290 set out to avoid. Heap fetch per matching row is fine.

Rename `QueueCounts.completed` → `QueueCounts.terminal`

The historical name was a misnomer: queue_counts_exact has always returned count(*) FROM done_entries, which includes failed and cancelled rows. The field doc now spells out the semantics. queue_counts_fast got the same rename + a clarifying note.

Internal-only — the struct doesn't derive Serialize, so no JSON wire shape changes. Test/bench callers updated; the bench output's local QueueStorageSnapshot.completed field is preserved so downstream bench-dashboard scripts that consume the JSON don't break; only the RHS <queue_counts>.completed accesses are renamed.

Refs

Refs #290 (perf(queue_storage): live terminal counter for queue_counts_exact). With this PR + #304 (feat(queue_storage): queue_terminal_live_counts + full lifecycle wiring (#290 PR A), now on main), the #290 implementation is correctness-complete except for the bench + docs PRs.

Test plan

cargo fmt --all
cargo clippy --workspace --all-targets --all-features -- -D warnings
cargo build --workspace --all-targets
cargo test -p awa --test queue_storage_runtime_test test_queue_terminal_live_counts against Postgres: all 6 #290 invariant tests still pass
cargo test -p awa --test queue_storage_runtime_test queue_counts — 3 pre-existing queue_counts tests pass, proving the counter-backed read returns the same values as the old scan
CI for full matrix (running on this push)

Summary by CodeRabbit

Improvements
- Queue metrics now accurately track all terminal job states (completed, failed, and cancelled) instead of completed-only tracking
- Added automatic counter validation during software upgrades to ensure terminal counts remain accurate without requiring service interruption
- Implemented fallback verification when terminal counters are being rebuilt to maintain data consistency

…s.terminal (#290)

Switches `queue_counts_exact` to read live terminal counts from the denormalised `queue_terminal_live_counts` table instead of scanning `done_entries`, and renames `QueueCounts.completed` → `QueueCounts.terminal` to honestly reflect that the count includes failed and cancelled terminals, not just completed ones.

Stacks on the prior PR (which wired the counter on every insert / delete / prune-fold path). With those write-side guarantees in place this PR is the last load-bearing piece of #290: the `count(*) FROM done_entries WHERE queue = ANY(...)` scan is gone.

### Read switch in `queue_counts_exact`

The `live_terminal` CTE used to be:

SELECT count(*)::bigint AS completed
FROM {schema}.done_entries
WHERE queue = ANY($1)

…which is O(done_entries-for-queue) and the headline regressor in the long_horizon evidence on #169. It now reads:

SELECT COALESCE(SUM(live_terminal_count), 0)::bigint AS terminal
FROM {schema}.queue_terminal_live_counts
WHERE queue = ANY($1)

…which is O(num counter rows for queue): at most `queue_slot_count * priorities * enqueue_shards` rows per queue.

The `pruned_terminal + live_terminal` sum at the outer SELECT is unchanged; the rollup column already accumulates per-slot folds from prune, and the live counter holds everything not yet pruned.

### `QueueCounts.completed` → `QueueCounts.terminal`

The historical name was a misnomer: `queue_counts_exact` has always returned `count(*) FROM done_entries`, which includes `failed` and `cancelled` rows alongside `completed`. The field doc spells out the semantic. `queue_counts_fast` got the same rename + a clarifying note that the historical name was wrong.

The change is internal: the struct doesn't derive `Serialize`, so no JSON wire shape changes. Only the Rust API renames. All callers (test files only; the prod admin API uses `admin::QueueOverview`, not `QueueCounts`) are updated.

The bench output's local `QueueStorageSnapshot.completed` field is preserved so downstream bench-dashboard scripts that consume the JSON don't break; only the RHS `<queue_counts>.completed` accesses are renamed.

### Tests

- Existing `test_queue_storage_queue_counts_*` tests still pass, which proves the counter-backed read returns the same values as the old scan.
- The 4 #290 invariant tests from the prior PR still pass, but they now also implicitly verify the read path: any drift between the counter and `done_entries` would show as a divergence between `queue_counts_exact.terminal` and the test's direct `count(*) FROM done_entries` reference value.

Refs #290.

…#290)

The PK on `queue_terminal_live_counts` leads with `ready_slot`, which is the right shape for the row-level UPSERT path (one row per `(ready_slot, queue, priority, enqueue_shard)` group) but the wrong shape for the new aggregating read path:

SELECT COALESCE(SUM(live_terminal_count), 0)::bigint
FROM {schema}.queue_terminal_live_counts
WHERE queue = ANY($1)

Without a queue-leading index, that scan walks every row in the table. In a multi-queue schema with M queues × S slots × P priorities × H shards, that's M·S·P·H rows to find the ~S·P·H rows for the requested queue. Undercuts the #290 perf goal.

Add `CREATE INDEX IF NOT EXISTS idx_..._queue ON ...(queue, priority)` in `QueueStorage::prepare_schema()` next to the table creation.

Index choice notes:

- `(queue, priority)` keeps it narrow (the two columns the read path filters on) and supports future per-priority drill-down.
- No `INCLUDE (live_terminal_count)`: that would disqualify HOT updates on the very column the increment/decrement path mutates, re-introducing the v016 bloat shape #290 is trying to avoid. Heap fetch per matching counter row is fine; the index range is small.

All 6 #290 invariant tests still green.
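The M·S·P·H versus S·P·H argument can be illustrated with two ordered maps standing in for the two key orders. This is a rough in-memory analogy under stated assumptions, not how Postgres executes the query: `PkKey`, `QueueKey`, `build`, `sum_by_full_walk`, and `sum_by_range` are hypothetical names, and the trailing columns in `QueueKey` exist only to keep keys unique in the model (the real index is on `(queue, priority)`).

```rust
use std::collections::BTreeMap;

// (ready_slot, queue, priority, enqueue_shard): the primary-key column order.
type PkKey = (u32, String, u16, u16);
// (queue, priority, ready_slot, enqueue_shard): queue-leading order. The two
// trailing columns are only here to keep keys unique in this in-memory model.
type QueueKey = (String, u16, u32, u16);

/// Builds the same counter rows (each with count 1) under both key orders.
fn build(
    queues: &[&str],
    slots: u32,
    priorities: u16,
    shards: u16,
) -> (BTreeMap<PkKey, i64>, BTreeMap<QueueKey, i64>) {
    let mut by_pk = BTreeMap::new();
    let mut by_queue = BTreeMap::new();
    for q in queues {
        for s in 0..slots {
            for p in 0..priorities {
                for h in 0..shards {
                    by_pk.insert((s, q.to_string(), p, h), 1);
                    by_queue.insert((q.to_string(), p, s, h), 1);
                }
            }
        }
    }
    (by_pk, by_queue)
}

/// Slot-leading order cannot narrow by queue, so every row is visited.
/// Returns (sum for the queue, rows visited).
fn sum_by_full_walk(by_pk: &BTreeMap<PkKey, i64>, queue: &str) -> (i64, usize) {
    let mut sum = 0;
    let mut visited = 0;
    for ((_, q, _, _), n) in by_pk {
        visited += 1;
        if q.as_str() == queue {
            sum += *n;
        }
    }
    (sum, visited)
}

/// Queue-leading order supports a range scan over just that queue's rows.
/// Returns (sum for the queue, rows visited).
fn sum_by_range(by_queue: &BTreeMap<QueueKey, i64>, queue: &str) -> (i64, usize) {
    let lo = (queue.to_string(), u16::MIN, u32::MIN, u16::MIN);
    let hi = (queue.to_string(), u16::MAX, u32::MAX, u16::MAX);
    let mut sum = 0;
    let mut visited = 0;
    for (_, n) in by_queue.range(lo..=hi) {
        visited += 1;
        sum += *n;
    }
    (sum, visited)
}
```

With 4 queues, 3 slots, 2 priorities, and 2 shards, the full walk visits all 48 rows to sum one queue, while the range scan visits only that queue's 12 rows.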

coderabbitai · 2026-06-02T07:39:04Z

📝 Walkthrough

Walkthrough

This PR implements a trust-gated terminal counter denormalization strategy: the QueueCounts struct field is renamed from completed to terminal, a new trust marker (terminal_counter_trusted_at) gates exact counting between fast denormalized counter reads and conservative full scans during rolling upgrades, and comprehensive tests validate the marker lifecycle.

Changes

Terminal Counter Denormalization with Trust-Gated Reading

Layer / File(s)	Summary
Terminal counts contract and schema foundation `awa-model/src/queue_storage.rs`	`QueueCounts` struct field renamed to `terminal`; `queue_ring_state` gains `terminal_counter_trusted_at` nullable column; new index added on `queue_terminal_live_counts`; auto-initialization sets trust marker on fresh installs.
Trust-gated exact counting path `awa-model/src/queue_storage.rs`	`queue_counts_exact` conditionally chooses counter-fed reads (from `queue_terminal_live_counts` when trusted) or conservative full scans (from `done_entries` when untrusted); SQL and Rust mappings updated for `terminal` field.
Fast counting and terminal field wiring `awa-model/src/queue_storage.rs`	`queue_counts_fast` documentation and implementation updated to use `terminal` semantics; local variable and struct field wiring switched from `completed` to `terminal`.
Trust marker lifecycle `awa-model/src/queue_storage.rs`	New public `terminal_counter_trusted` method checks marker state; `rebuild_terminal_counters` flips marker to `now()` after rebuild, enabling counter-fed exact reads post-rebuild.
Comprehensive test coverage for trust-gated behavior `awa/tests/benchmark_test.rs`, `awa/tests/queue_storage_benchmark_test.rs`, `awa/tests/queue_storage_runtime_test.rs`	New integration test validates trust marker gates exact counting across untrusted/rebuild scenarios; all runtime and benchmark assertions updated to use `terminal` field; completion tracking in benchmarks switched to terminal counts.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

hardbyte/awa#290: This PR directly implements the live terminal denormalization proposal, adding the trust marker gating and updating queue_counts_exact to conditionally read from the denormalized counter during rolling upgrades.

Possibly related PRs

hardbyte/awa#289: This PR renamed the fast counting method and introduced queue_counts_fast with denormalized terminal counter reads; the current PR builds on that foundation by adding trust-gated path selection in exact counts.
hardbyte/awa#304: Both PRs integrate queue_terminal_live_counts into terminal counting logic and update rebuild_terminal_counters, with this PR adding the trust marker gating layer and exact read-path selection.

Poem

🐰 From completed to terminal we hop,
Trust markers guard each query stop,
Fresh installs bloom with counters true,
While rolling upgrades scan anew—
A denormalized dream, hopping fast and clean. 🌱✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main changes: switching to read live terminal counts from the new index and renaming QueueCounts.completed to QueueCounts.terminal.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 034e2b055d


chatgpt-codex-connector · 2026-06-02T07:42:46Z

+                SELECT COALESCE(SUM(live_terminal_count), 0)::bigint AS terminal
+                FROM {schema}.queue_terminal_live_counts
                WHERE queue = ANY($1)


Keep exact counts correct when counters have drifted

When upgrading from a pre-#290 fleet (or after any counter drift), rebuild_terminal_counters explicitly documents that done_entries may contain terminal rows that queue_terminal_live_counts does not include (awa-model/src/queue_storage.rs:11564-11569). This change makes queue_counts_exact/queue_counts read live terminals only from the counter, so the admin “exact” count under-reports completed/failed/cancelled jobs until an operator manually rebuilds the table; the previous count(*) FROM done_entries path stayed correct in that scenario. Please either make the read switch self-heal/backfill before relying on the counter, or retain the scan/fallback when the counter may be untrusted.


Addresses reviewer P1 on #306: `queue_counts_exact` was reading live terminals from `queue_terminal_live_counts` unconditionally, but the codebase documents that the counter can drift during a pre-#290 rolling upgrade and should be rebuilt before relying on it. The "exact" naming was therefore only exact-after-rebuild.

Fix: a trust marker. When the marker is NULL the read path falls back to scanning `done_entries` (slower but correct); when set, the counter-fed path is used.

The marker lives as a column on the existing `queue_ring_state` singleton row rather than its own table: `prepare_schema` already creates ~22 per-schema tables, and this is one boolean that doesn't warrant a 23rd.

Update lifecycle:

- ADD COLUMN IF NOT EXISTS on every `prepare_schema` (idempotent; no-op for already-migrated schemas).
- Fresh installs auto-mark via `UPDATE queue_ring_state SET terminal_counter_trusted_at = now() WHERE singleton AND trusted_at IS NULL AND NOT EXISTS (SELECT 1 FROM done_entries LIMIT 1)`, which is vacuously trustable when there's nothing to drift.
- `rebuild_terminal_counters` flips the marker to `now()` after the TRUNCATE + re-aggregation, in the same transaction.
- Existing installs upgrading from a pre-#290 binary land with trusted_at NULL, and the read path scans `done_entries` until the operator runs `awa storage rebuild-terminal-counters`.

`terminal_counter_trusted(pool)` is one PK fetch on the singleton row, a negligible cost per `queue_counts_exact` call. The read query builds its `live_terminal` CTE as either the counter-sum or the done_entries-scan depending on the trust check; outer plan shape is identical between the two paths.

Update lifecycle for the column is once-at-install and once-per-rebuild, so adding it to `queue_ring_state` doesn't disturb the existing HOT-update tuning for rotation writes (`current_slot`, `generation`).

New test `test_queue_terminal_counter_trust_marker_gates_read_path`: asserts the fresh-install auto-mark, simulates rolling-upgrade drift by inserting an orphan done_entries row + clearing the marker, asserts the read returns 6 (scan) not 5 (counter), then rebuilds and asserts the marker flips back and reads return the counter sum. All 7 #290 invariant + trust-marker tests pass locally.
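The trust-gated path selection described in this commit can be sketched without a database. The snippet is illustrative only: `live_terminal_cte` and `SchemaModel` are hypothetical names, `trusted_at` stands in for `terminal_counter_trusted_at` (with `None` playing the role of NULL), and the model assumes a rebuild simply re-aggregates the counter from `done_entries`.

```rust
/// Picks the body of the `live_terminal` CTE based on the trust marker.
/// `None` models a NULL marker: fall back to the slower but correct scan.
fn live_terminal_cte(schema: &str, trusted_at: Option<u64>) -> String {
    match trusted_at {
        Some(_) => format!(
            "SELECT COALESCE(SUM(live_terminal_count), 0)::bigint AS terminal \
             FROM {schema}.queue_terminal_live_counts WHERE queue = ANY($1)"
        ),
        None => format!(
            "SELECT count(*)::bigint AS terminal \
             FROM {schema}.done_entries WHERE queue = ANY($1)"
        ),
    }
}

/// Toy model of one schema: the true terminal row count, the denormalised
/// counter (which may have drifted), and the trust marker.
struct SchemaModel {
    done_entries: i64,
    counter: i64,
    trusted_at: Option<u64>,
}

impl SchemaModel {
    /// Exact live-terminal read: counter when trusted, scan otherwise.
    fn exact_terminal(&self) -> i64 {
        if self.trusted_at.is_some() {
            self.counter
        } else {
            self.done_entries
        }
    }

    /// Rebuild: re-aggregate the counter and set the marker in one step,
    /// mirroring the "same transaction" note in the commit message.
    fn rebuild(&mut self, now: u64) {
        self.counter = self.done_entries;
        self.trusted_at = Some(now);
    }
}
```

Starting from 6 terminal rows, a drifted counter of 5, and a cleared marker, `exact_terminal` returns 6 via the scan path; after `rebuild` the marker is set and the counter-fed read also returns 6, which matches the scenario the new test describes.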

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@awa-model/src/queue_storage.rs`:
- Around line 3501-3529: The UPDATE that auto-sets terminal_counter_trusted_at
is currently gated only on done_entries being empty, which can mistakenly trust
upgraded-but-not-fresh schemas; change the logic in prepare_schema so that the
UPDATE on {schema}.queue_ring_state (the terminal_counter_trusted_at row) runs
only when this prepare_schema invocation actually created the schema (or an
explicit "fresh install" marker set in this transaction), not merely when
done_entries is empty; implement this by recording a boolean (e.g.
schema_created_in_prepare) when you create the schema in this prepare_schema
flow and include that condition in the WHERE (or create a transient marker row
checked in the WHERE), ensuring the query executed with install_tx only
auto-trusts when that marker/flag is present.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: db375cbe-70da-48d4-ac4c-33efe08ddeab

📥 Commits

Reviewing files that changed from the base of the PR and between 118cfc0 and fbd57af.

📒 Files selected for processing (4)

awa-model/src/queue_storage.rs
awa/tests/benchmark_test.rs
awa/tests/queue_storage_benchmark_test.rs
awa/tests/queue_storage_runtime_test.rs

coderabbitai · 2026-06-02T08:17:28Z

+            // #290: auto-mark trusted for fresh installs. If
+            // `done_entries` is empty at this point AND the trust
+            // marker is still NULL, the counter is vacuously correct
+            // (nothing to drift) and the operator shouldn't have to
+            // run the rebuild CLI just to enable the perf path. On an
+            // existing install upgrading from a pre-#290 fleet, old
+            // binaries that wrote to done_entries before the new
+            // runtime booted would have left non-zero rows here, so
+            // we leave trusted_at NULL and the operator must
+            // explicitly rebuild after the rolling upgrade completes.
+            //
+            // The advisory lock held by prepare_schema gives us a
+            // tight window — concurrent writes from other schema
+            // preps are serialised on the lock, and any in-flight
+            // Rust writers are already counter-maintaining by
+            // definition (they built against this codebase).
+            sqlx::query(&format!(
+                r#"
+            UPDATE {schema}.queue_ring_state
+            SET terminal_counter_trusted_at = now()
+            WHERE singleton = TRUE
+              AND terminal_counter_trusted_at IS NULL
+              AND NOT EXISTS (SELECT 1 FROM {schema}.done_entries LIMIT 1)
+            "#
+            ))
+            .execute(install_tx.as_mut())
+            .await
+            .map_err(map_sqlx_error)?;
+


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't auto-trust upgraded schemas just because done_entries is empty.

Line 3501 uses current emptiness of done_entries as the freshness signal, but an existing pre-#290 schema can also be empty when the first new node boots. If old binaries are still live after this update, they can append to done_entries without touching queue_terminal_live_counts, and queue_counts_exact() will immediately switch to the counter path and undercount terminal rows. Gate this auto-marking on “schema created in this prepare_schema run” (or another explicit fresh-install signal), not on emptiness alone.
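The difference between the two gates can be made concrete with a small decision model. This is a sketch under assumptions, not the fix that was merged: `PrepareState` and both functions are hypothetical, and `schema_created_in_prepare` is the example flag name from the review prompt.

```rust
/// What `prepare_schema` can observe when deciding whether to auto-trust.
#[derive(Clone, Copy)]
struct PrepareState {
    /// True only if this invocation created the schema (hypothetical flag).
    schema_created_in_prepare: bool,
    /// True if `done_entries` currently has no rows.
    done_entries_empty: bool,
    /// True if `terminal_counter_trusted_at` is still NULL.
    marker_is_null: bool,
}

/// Gate as written in the diff: emptiness of `done_entries` alone.
fn auto_trust_on_emptiness(s: PrepareState) -> bool {
    s.marker_is_null && s.done_entries_empty
}

/// Gate the review asks for: additionally require an explicit fresh-install signal.
fn auto_trust_on_fresh_install(s: PrepareState) -> bool {
    s.marker_is_null && s.done_entries_empty && s.schema_created_in_prepare
}
```

For a pre-#290 schema that happens to be empty when the first new node boots (`schema_created_in_prepare = false`, `done_entries_empty = true`), the emptiness gate trusts the counter while the stricter gate leaves the marker NULL, so old binaries that are still writing cannot cause an undercount.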

🧰 Tools

🪛 OpenGrep (1.22.0)

[ERROR] 3517-3525: SQL query built via format!() passed to a database method. Use parameterized queries with bind parameters instead.

(coderabbit.sql-injection.rust-format-query)


hardbyte added 2 commits June 2, 2026 19:37

hardbyte mentioned this pull request Jun 2, 2026

feat(queue_storage): read live terminal counts + rename to QueueCounts.terminal (#290) #305

Closed

6 tasks

chatgpt-codex-connector Bot reviewed Jun 2, 2026


hardbyte mentioned this pull request Jun 2, 2026

0.6 release readiness: align TLA+ models, docs, and queue-storage replacement gates #197

Open

68 tasks

coderabbitai Bot reviewed Jun 2, 2026


hardbyte merged commit 667a09b into main Jun 2, 2026
24 of 25 checks passed

hardbyte deleted the brian/issue-290-read-switch-and-rename branch June 2, 2026 08:27

hardbyte mentioned this pull request Jun 2, 2026

test(chaos): test_mixed_rust_and_python_workers_share_same_queue is race-based and flaky #309

Closed

coderabbitai Bot mentioned this pull request Jun 2, 2026

refactor(migrations): move default queue-storage substrate into v023 SQL helper (#308 PR 1) #310

Merged

7 tasks

hardbyte merged 3 commits into main from brian/issue-290-read-switch-and-rename
