Skip to content

feat(queue_storage): read live terminal counts + queue-leading index + rename to QueueCounts.terminal (#290)#306

Merged
hardbyte merged 3 commits into
mainfrom
brian/issue-290-read-switch-and-rename
Jun 2, 2026
Merged

feat(queue_storage): read live terminal counts + queue-leading index + rename to QueueCounts.terminal (#290)#306
hardbyte merged 3 commits into
mainfrom
brian/issue-290-read-switch-and-rename

Conversation

@hardbyte
Copy link
Copy Markdown
Owner

@hardbyte hardbyte commented Jun 2, 2026

Summary

Successor to #305 (auto-closed when its base branch — the now-squashed #304 — was deleted). Carries the same two commits, rebased onto current main:

  1. Read switch + rename: queue_counts_exact now reads live terminal counts from queue_terminal_live_counts instead of scanning done_entries. QueueCounts.completed renamed to QueueCounts.terminal — honest name (the count includes failed and cancelled rows, not just completed).
  2. Queue-leading index: CREATE INDEX (queue, priority) ON queue_terminal_live_counts. Without it, the new read path scans every queue's counter rows in multi-queue schemas — undercutting the perf(queue_storage): live terminal counter for queue_counts_exact #290 perf goal.

After this lands, the headline regressor from #169's long_horizon evidence — the O(done_entries) scan inside queue_counts_exact — is gone, and the read path is an indexed range scan over a small denormalised counter.

What ships

Read switch in queue_counts_exact

The live_terminal CTE used to be:

SELECT count(*)::bigint AS completed
FROM {schema}.done_entries
WHERE queue = ANY($1)

…O(done_entries-for-queue). Now reads:

SELECT COALESCE(SUM(live_terminal_count), 0)::bigint AS terminal
FROM {schema}.queue_terminal_live_counts
WHERE queue = ANY($1)

…indexed range scan + heap fetch for the matching counter rows (at most queue_slot_count × priorities × enqueue_shards rows per queue).

Queue-leading index

CREATE INDEX IF NOT EXISTS idx_..._queue ON queue_terminal_live_counts (queue, priority). The PK leads with ready_slot (right shape for the row-level UPSERT, wrong shape for the read aggregation), so the new index is what makes the perf claim load-bearing.

Notes on the index choice:

  • (queue, priority) keeps it narrow while supporting future per-priority drill-down.
  • No INCLUDE (live_terminal_count) — that would block HOT updates on the column the increment/decrement path mutates on every write, re-introducing the v016 bloat shape perf(queue_storage): live terminal counter for queue_counts_exact #290 set out to avoid. Heap fetch per matching row is fine.

Rename QueueCounts.completedQueueCounts.terminal

The historical name was a misnomer: queue_counts_exact has always returned count(*) FROM done_entries, which includes failed and cancelled rows. Field doc spells out the semantic. queue_counts_fast got the same rename + a clarifying note.

Internal-only — the struct doesn't derive Serialize, so no JSON wire shape changes. Test/bench callers updated; the bench output's local QueueStorageSnapshot.completed field is preserved so downstream bench-dashboard scripts that consume the JSON don't break; only the RHS <queue_counts>.completed accesses are renamed.

Refs

Test plan

  • cargo fmt --all
  • cargo clippy --workspace --all-targets --all-features -- -D warnings
  • cargo build --workspace --all-targets
  • cargo test -p awa --test queue_storage_runtime_test test_queue_terminal_live_counts against Postgres — all 6 perf(queue_storage): live terminal counter for queue_counts_exact #290 invariant tests still pass
  • cargo test -p awa --test queue_storage_runtime_test queue_counts — 3 pre-existing queue_counts tests pass, proving the counter-backed read returns the same values as the old scan
  • CI for full matrix (running on this push)

Summary by CodeRabbit

  • Improvements
    • Queue metrics now accurately track all terminal job states (completed, failed, and cancelled) instead of completed-only tracking
    • Added automatic counter validation during software upgrades to ensure terminal counts remain accurate without requiring service interruption
    • Implemented fallback verification when terminal counters are being rebuilt to maintain data consistency

hardbyte added 2 commits June 2, 2026 19:37
…s.terminal (#290)

Switches `queue_counts_exact` to read live terminal counts from the
denormalised `queue_terminal_live_counts` table instead of scanning
`done_entries`, and renames `QueueCounts.completed` → `QueueCounts.terminal`
to honestly reflect that the count includes failed and cancelled
terminals, not just completed ones.

Stacks on the prior PR (which wired the counter on every insert /
delete / prune-fold path). With those write-side guarantees in place
this PR is the last load-bearing piece of #290: the
`count(*) FROM done_entries WHERE queue = ANY(...)` scan is gone.

### Read switch in `queue_counts_exact`

The `live_terminal` CTE used to be:

    SELECT count(*)::bigint AS completed
    FROM {schema}.done_entries
    WHERE queue = ANY($1)

…which is O(done_entries-for-queue) and the headline regressor in
the long_horizon evidence on #169. It now reads:

    SELECT COALESCE(SUM(live_terminal_count), 0)::bigint AS terminal
    FROM {schema}.queue_terminal_live_counts
    WHERE queue = ANY($1)

…which is O(num counter rows for queue) — at most
`queue_slot_count * priorities * enqueue_shards` rows per queue.

The `pruned_terminal + live_terminal` sum at the outer SELECT is
unchanged; the rollup column already accumulates per-slot folds from
prune, and the live counter holds everything not yet pruned.

### `QueueCounts.completed` → `QueueCounts.terminal`

The historical name was a misnomer: `queue_counts_exact` has always
returned `count(*) FROM done_entries`, which includes `failed` and
`cancelled` rows alongside `completed`. The field doc spells out the
semantic. `queue_counts_fast` got the same rename + a clarifying
note that the historical name was wrong.

The change is internal — the struct doesn't derive `Serialize`, so
no JSON wire shape changes. Only the Rust API renames. All callers
(test files only — the prod admin API uses `admin::QueueOverview`,
not `QueueCounts`) are updated. The bench output's local
`QueueStorageSnapshot.completed` field is preserved so downstream
bench-dashboard scripts that consume the JSON don't break; only the
RHS `<queue_counts>.completed` accesses are renamed.

### Tests

- Existing `test_queue_storage_queue_counts_*` tests still pass —
  proves the counter-backed read returns the same values as the old
  scan.
- The 4 #290 invariant tests from the prior PR still pass, but they
  now also implicitly verify the read path: any drift between the
  counter and `done_entries` would show as a divergence between
  `queue_counts_exact.terminal` and the test's direct
  `count(*) FROM done_entries` reference value.

Refs #290.
…#290)

The PK on `queue_terminal_live_counts` leads with `ready_slot`, which is
the right shape for the row-level UPSERT path (one row per
`(ready_slot, queue, priority, enqueue_shard)` group) but the wrong
shape for the new aggregating read path:

    SELECT COALESCE(SUM(live_terminal_count), 0)::bigint
    FROM {schema}.queue_terminal_live_counts
    WHERE queue = ANY($1)

Without a queue-leading index, that scan walks every row in the table
— in a multi-queue schema with M queues × S slots × P priorities ×
H shards, that's M·S·P·H rows to find the ~S·P·H rows for the
requested queue. Undercuts the #290 perf goal.

Add `CREATE INDEX IF NOT EXISTS idx_..._queue ON ...(queue, priority)`
in `QueueStorage::prepare_schema()` next to the table creation. Index
choice notes:

- `(queue, priority)` keeps it narrow (the two columns the read path
  filters on) and supports future per-priority drill-down.
- No `INCLUDE (live_terminal_count)` — that would disqualify HOT
  updates on the very column the increment/decrement path mutates,
  re-introducing the v016 bloat shape #290 is trying to avoid. Heap
  fetch per matching counter row is fine; the index range is small.

All 6 #290 invariant tests still green.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 2, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR implements a trust-gated terminal counter denormalization strategy: the QueueCounts struct field is renamed from completed to terminal, a new trust marker (terminal_counter_trusted_at) gates exact counting between fast denormalized counter reads and conservative full scans during rolling upgrades, and comprehensive tests validate the marker lifecycle.

Changes

Terminal Counter Denormalization with Trust-Gated Reading

Layer / File(s) Summary
Terminal counts contract and schema foundation
awa-model/src/queue_storage.rs
QueueCounts struct field renamed to terminal; queue_ring_state gains terminal_counter_trusted_at nullable column; new index added on queue_terminal_live_counts; auto-initialization sets trust marker on fresh installs.
Trust-gated exact counting path
awa-model/src/queue_storage.rs
queue_counts_exact conditionally chooses counter-fed reads (from queue_terminal_live_counts when trusted) or conservative full scans (from done_entries when untrusted); SQL and Rust mappings updated for terminal field.
Fast counting and terminal field wiring
awa-model/src/queue_storage.rs
queue_counts_fast documentation and implementation updated to use terminal semantics; local variable and struct field wiring switched from completed to terminal.
Trust marker lifecycle
awa-model/src/queue_storage.rs
New public terminal_counter_trusted method checks marker state; rebuild_terminal_counters flips marker to now() after rebuild, enabling counter-fed exact reads post-rebuild.
Comprehensive test coverage for trust-gated behavior
awa/tests/benchmark_test.rs, awa/tests/queue_storage_benchmark_test.rs, awa/tests/queue_storage_runtime_test.rs
New integration test validates trust marker gates exact counting across untrusted/rebuild scenarios; all runtime and benchmark assertions updated to use terminal field; completion tracking in benchmarks switched to terminal counts.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

  • hardbyte/awa#290: This PR directly implements the live terminal denormalization proposal, adding the trust marker gating and updating queue_counts_exact to conditionally read from the denormalized counter during rolling upgrades.

Possibly related PRs

  • hardbyte/awa#289: This PR renamed the fast counting method and introduced queue_counts_fast with denormalized terminal counter reads; the current PR builds on that foundation by adding trust-gated path selection in exact counts.
  • hardbyte/awa#304: Both PRs integrate queue_terminal_live_counts into terminal counting logic and update rebuild_terminal_counters, with this PR adding the trust marker gating layer and exact read-path selection.

Poem

🐰 From completed to terminal we hop,
Trust markers guard each query stop,
Fresh installs bloom with counters true,
While rolling upgrades scan anew—
A denormalized dream, hopping fast and clean. 🌱✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main changes: switching to read live terminal counts from the new index and renaming QueueCounts.completed to QueueCounts.terminal.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch brian/issue-290-read-switch-and-rename

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 034e2b055d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread awa-model/src/queue_storage.rs Outdated
Comment on lines 7163 to 7165
SELECT COALESCE(SUM(live_terminal_count), 0)::bigint AS terminal
FROM {schema}.queue_terminal_live_counts
WHERE queue = ANY($1)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep exact counts correct when counters have drifted

When upgrading from a pre-#290 fleet (or after any counter drift), rebuild_terminal_counters explicitly documents that done_entries may contain terminal rows that queue_terminal_live_counts does not include (awa-model/src/queue_storage.rs:11564-11569). This change makes queue_counts_exact/queue_counts read live terminals only from the counter, so the admin “exact” count under-reports completed/failed/cancelled jobs until an operator manually rebuilds the table; the previous count(*) FROM done_entries path stayed correct in that scenario. Please either make the read switch self-heal/backfill before relying on the counter, or retain the scan/fallback when the counter may be untrusted.

Useful? React with 👍 / 👎.

Addresses reviewer P1 on #306: `queue_counts_exact` was reading live
terminals from `queue_terminal_live_counts` unconditionally, but the
codebase documents that the counter can drift during a pre-#290
rolling upgrade and should be rebuilt before relying on it. The
"exact" naming was therefore only exact-after-rebuild.

Fix: a trust marker. When the marker is NULL the read path falls
back to scanning `done_entries` (slower but correct); when set, the
counter-fed path is used.

The marker lives as a column on the existing `queue_ring_state`
singleton row rather than its own table — `prepare_schema` already
creates ~22 per-schema tables, and this is one boolean that doesn't
warrant a 23rd. Update lifecycle:

- ADD COLUMN IF NOT EXISTS on every `prepare_schema` (idempotent;
  no-op for already-migrated schemas).
- Fresh installs auto-mark via
  `UPDATE queue_ring_state SET terminal_counter_trusted_at = now()
   WHERE singleton AND trusted_at IS NULL
     AND NOT EXISTS (SELECT 1 FROM done_entries LIMIT 1)` —
  vacuously trustable when there's nothing to drift.
- `rebuild_terminal_counters` flips the marker to `now()` after the
  TRUNCATE + re-aggregation, in the same transaction.
- Existing installs upgrading from a pre-#290 binary land with
  trusted_at NULL, and the read path scans `done_entries` until the
  operator runs `awa storage rebuild-terminal-counters`.

`terminal_counter_trusted(pool)` is one PK fetch on the singleton
row — negligible cost per `queue_counts_exact` call. The read query
builds its `live_terminal` CTE as either the counter-sum or the
done_entries-scan depending on the trust check; outer plan shape is
identical between the two paths.

Update lifecycle for the column is once-at-install and
once-per-rebuild, so adding it to `queue_ring_state` doesn't disturb
the existing HOT-update tuning for rotation writes
(`current_slot`, `generation`).

New test `test_queue_terminal_counter_trust_marker_gates_read_path`:
asserts the fresh-install auto-mark, simulates rolling-upgrade
drift by inserting an orphan done_entries row + clearing the
marker, asserts the read returns 6 (scan) not 5 (counter), then
rebuilds and asserts the marker flips back and reads return the
counter sum.

All 7 #290 invariant + trust-marker tests pass locally.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@awa-model/src/queue_storage.rs`:
- Around line 3501-3529: The UPDATE that auto-sets terminal_counter_trusted_at
is currently gated only on done_entries being empty, which can mistakenly trust
upgraded-but-not-fresh schemas; change the logic in prepare_schema so that the
UPDATE on {schema}.queue_ring_state (the terminal_counter_trusted_at row) runs
only when this prepare_schema invocation actually created the schema (or an
explicit "fresh install" marker set in this transaction), not merely when
done_entries is empty; implement this by recording a boolean (e.g.
schema_created_in_prepare) when you create the schema in this prepare_schema
flow and include that condition in the WHERE (or create a transient marker row
checked in the WHERE), ensuring the query executed with install_tx only
auto-trusts when that marker/flag is present.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: db375cbe-70da-48d4-ac4c-33efe08ddeab

📥 Commits

Reviewing files that changed from the base of the PR and between 118cfc0 and fbd57af.

📒 Files selected for processing (4)
  • awa-model/src/queue_storage.rs
  • awa/tests/benchmark_test.rs
  • awa/tests/queue_storage_benchmark_test.rs
  • awa/tests/queue_storage_runtime_test.rs

Comment on lines +3501 to +3529
// #290: auto-mark trusted for fresh installs. If
// `done_entries` is empty at this point AND the trust
// marker is still NULL, the counter is vacuously correct
// (nothing to drift) and the operator shouldn't have to
// run the rebuild CLI just to enable the perf path. On an
// existing install upgrading from a pre-#290 fleet, old
// binaries that wrote to done_entries before the new
// runtime booted would have left non-zero rows here, so
// we leave trusted_at NULL and the operator must
// explicitly rebuild after the rolling upgrade completes.
//
// The advisory lock held by prepare_schema gives us a
// tight window — concurrent writes from other schema
// preps are serialised on the lock, and any in-flight
// Rust writers are already counter-maintaining by
// definition (they built against this codebase).
sqlx::query(&format!(
r#"
UPDATE {schema}.queue_ring_state
SET terminal_counter_trusted_at = now()
WHERE singleton = TRUE
AND terminal_counter_trusted_at IS NULL
AND NOT EXISTS (SELECT 1 FROM {schema}.done_entries LIMIT 1)
"#
))
.execute(install_tx.as_mut())
.await
.map_err(map_sqlx_error)?;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't auto-trust upgraded schemas just because done_entries is empty.

Line 3501 uses current emptiness of done_entries as the freshness signal, but an existing pre-#290 schema can also be empty when the first new node boots. If old binaries are still live after this update, they can append to done_entries without touching queue_terminal_live_counts, and queue_counts_exact() will immediately switch to the counter path and undercount terminal rows. Gate this auto-marking on “schema created in this prepare_schema run” (or another explicit fresh-install signal), not on emptiness alone.

🧰 Tools
🪛 OpenGrep (1.22.0)

[ERROR] 3517-3525: SQL query built via format!() passed to a database method. Use parameterized queries with bind parameters instead.

(coderabbit.sql-injection.rust-format-query)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@awa-model/src/queue_storage.rs` around lines 3501 - 3529, The UPDATE that
auto-sets terminal_counter_trusted_at is currently gated only on done_entries
being empty, which can mistakenly trust upgraded-but-not-fresh schemas; change
the logic in prepare_schema so that the UPDATE on {schema}.queue_ring_state (the
terminal_counter_trusted_at row) runs only when this prepare_schema invocation
actually created the schema (or an explicit "fresh install" marker set in this
transaction), not merely when done_entries is empty; implement this by recording
a boolean (e.g. schema_created_in_prepare) when you create the schema in this
prepare_schema flow and include that condition in the WHERE (or create a
transient marker row checked in the WHERE), ensuring the query executed with
install_tx only auto-trusts when that marker/flag is present.

@hardbyte hardbyte merged commit 667a09b into main Jun 2, 2026
24 of 25 checks passed
@hardbyte hardbyte deleted the brian/issue-290-read-switch-and-rename branch June 2, 2026 08:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant