feat(queue_storage): read live terminal counts + queue-leading index + rename to QueueCounts.terminal (#290)#306
Conversation
…s.terminal (#290) Switches `queue_counts_exact` to read live terminal counts from the denormalised `queue_terminal_live_counts` table instead of scanning `done_entries`, and renames `QueueCounts.completed` → `QueueCounts.terminal` to honestly reflect that the count includes failed and cancelled terminals, not just completed ones. Stacks on the prior PR (which wired the counter on every insert / delete / prune-fold path). With those write-side guarantees in place this PR is the last load-bearing piece of #290: the `count(*) FROM done_entries WHERE queue = ANY(...)` scan is gone. ### Read switch in `queue_counts_exact` The `live_terminal` CTE used to be: SELECT count(*)::bigint AS completed FROM {schema}.done_entries WHERE queue = ANY($1) …which is O(done_entries-for-queue) and the headline regressor in the long_horizon evidence on #169. It now reads: SELECT COALESCE(SUM(live_terminal_count), 0)::bigint AS terminal FROM {schema}.queue_terminal_live_counts WHERE queue = ANY($1) …which is O(num counter rows for queue) — at most `queue_slot_count * priorities * enqueue_shards` rows per queue. The `pruned_terminal + live_terminal` sum at the outer SELECT is unchanged; the rollup column already accumulates per-slot folds from prune, and the live counter holds everything not yet pruned. ### `QueueCounts.completed` → `QueueCounts.terminal` The historical name was a misnomer: `queue_counts_exact` has always returned `count(*) FROM done_entries`, which includes `failed` and `cancelled` rows alongside `completed`. The field doc spells out the semantic. `queue_counts_fast` got the same rename + a clarifying note that the historical name was wrong. The change is internal — the struct doesn't derive `Serialize`, so no JSON wire shape changes. Only the Rust API renames. All callers (test files only — the prod admin API uses `admin::QueueOverview`, not `QueueCounts`) are updated. The bench output's local `QueueStorageSnapshot.completed` field is preserved so downstream bench-dashboard scripts that consume the JSON don't break; only the RHS `<queue_counts>.completed` accesses are renamed. ### Tests - Existing `test_queue_storage_queue_counts_*` tests still pass — proves the counter-backed read returns the same values as the old scan. - The 4 #290 invariant tests from the prior PR still pass, but they now also implicitly verify the read path: any drift between the counter and `done_entries` would show as a divergence between `queue_counts_exact.terminal` and the test's direct `count(*) FROM done_entries` reference value. Refs #290.
…#290) The PK on `queue_terminal_live_counts` leads with `ready_slot`, which is the right shape for the row-level UPSERT path (one row per `(ready_slot, queue, priority, enqueue_shard)` group) but the wrong shape for the new aggregating read path: SELECT COALESCE(SUM(live_terminal_count), 0)::bigint FROM {schema}.queue_terminal_live_counts WHERE queue = ANY($1) Without a queue-leading index, that scan walks every row in the table — in a multi-queue schema with M queues × S slots × P priorities × H shards, that's M·S·P·H rows to find the ~S·P·H rows for the requested queue. Undercuts the #290 perf goal. Add `CREATE INDEX IF NOT EXISTS idx_..._queue ON ...(queue, priority)` in `QueueStorage::prepare_schema()` next to the table creation. Index choice notes: - `(queue, priority)` keeps it narrow (the two columns the read path filters on) and supports future per-priority drill-down. - No `INCLUDE (live_terminal_count)` — that would disqualify HOT updates on the very column the increment/decrement path mutates, re-introducing the v016 bloat shape #290 is trying to avoid. Heap fetch per matching counter row is fine; the index range is small. All 6 #290 invariant tests still green.
📝 WalkthroughWalkthroughThis PR implements a trust-gated terminal counter denormalization strategy: the ChangesTerminal Counter Denormalization with Trust-Gated Reading
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 034e2b055d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| SELECT COALESCE(SUM(live_terminal_count), 0)::bigint AS terminal | ||
| FROM {schema}.queue_terminal_live_counts | ||
| WHERE queue = ANY($1) |
There was a problem hiding this comment.
Keep exact counts correct when counters have drifted
When upgrading from a pre-#290 fleet (or after any counter drift), rebuild_terminal_counters explicitly documents that done_entries may contain terminal rows that queue_terminal_live_counts does not include (awa-model/src/queue_storage.rs:11564-11569). This change makes queue_counts_exact/queue_counts read live terminals only from the counter, so the admin “exact” count under-reports completed/failed/cancelled jobs until an operator manually rebuilds the table; the previous count(*) FROM done_entries path stayed correct in that scenario. Please either make the read switch self-heal/backfill before relying on the counter, or retain the scan/fallback when the counter may be untrusted.
Useful? React with 👍 / 👎.
Addresses reviewer P1 on #306: `queue_counts_exact` was reading live terminals from `queue_terminal_live_counts` unconditionally, but the codebase documents that the counter can drift during a pre-#290 rolling upgrade and should be rebuilt before relying on it. The "exact" naming was therefore only exact-after-rebuild. Fix: a trust marker. When the marker is NULL the read path falls back to scanning `done_entries` (slower but correct); when set, the counter-fed path is used. The marker lives as a column on the existing `queue_ring_state` singleton row rather than its own table — `prepare_schema` already creates ~22 per-schema tables, and this is one boolean that doesn't warrant a 23rd. Update lifecycle: - ADD COLUMN IF NOT EXISTS on every `prepare_schema` (idempotent; no-op for already-migrated schemas). - Fresh installs auto-mark via `UPDATE queue_ring_state SET terminal_counter_trusted_at = now() WHERE singleton AND trusted_at IS NULL AND NOT EXISTS (SELECT 1 FROM done_entries LIMIT 1)` — vacuously trustable when there's nothing to drift. - `rebuild_terminal_counters` flips the marker to `now()` after the TRUNCATE + re-aggregation, in the same transaction. - Existing installs upgrading from a pre-#290 binary land with trusted_at NULL, and the read path scans `done_entries` until the operator runs `awa storage rebuild-terminal-counters`. `terminal_counter_trusted(pool)` is one PK fetch on the singleton row — negligible cost per `queue_counts_exact` call. The read query builds its `live_terminal` CTE as either the counter-sum or the done_entries-scan depending on the trust check; outer plan shape is identical between the two paths. Update lifecycle for the column is once-at-install and once-per-rebuild, so adding it to `queue_ring_state` doesn't disturb the existing HOT-update tuning for rotation writes (`current_slot`, `generation`). New test `test_queue_terminal_counter_trust_marker_gates_read_path`: asserts the fresh-install auto-mark, simulates rolling-upgrade drift by inserting an orphan done_entries row + clearing the marker, asserts the read returns 6 (scan) not 5 (counter), then rebuilds and asserts the marker flips back and reads return the counter sum. All 7 #290 invariant + trust-marker tests pass locally.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@awa-model/src/queue_storage.rs`:
- Around line 3501-3529: The UPDATE that auto-sets terminal_counter_trusted_at
is currently gated only on done_entries being empty, which can mistakenly trust
upgraded-but-not-fresh schemas; change the logic in prepare_schema so that the
UPDATE on {schema}.queue_ring_state (the terminal_counter_trusted_at row) runs
only when this prepare_schema invocation actually created the schema (or an
explicit "fresh install" marker set in this transaction), not merely when
done_entries is empty; implement this by recording a boolean (e.g.
schema_created_in_prepare) when you create the schema in this prepare_schema
flow and include that condition in the WHERE (or create a transient marker row
checked in the WHERE), ensuring the query executed with install_tx only
auto-trusts when that marker/flag is present.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: db375cbe-70da-48d4-ac4c-33efe08ddeab
📒 Files selected for processing (4)
awa-model/src/queue_storage.rsawa/tests/benchmark_test.rsawa/tests/queue_storage_benchmark_test.rsawa/tests/queue_storage_runtime_test.rs
| // #290: auto-mark trusted for fresh installs. If | ||
| // `done_entries` is empty at this point AND the trust | ||
| // marker is still NULL, the counter is vacuously correct | ||
| // (nothing to drift) and the operator shouldn't have to | ||
| // run the rebuild CLI just to enable the perf path. On an | ||
| // existing install upgrading from a pre-#290 fleet, old | ||
| // binaries that wrote to done_entries before the new | ||
| // runtime booted would have left non-zero rows here, so | ||
| // we leave trusted_at NULL and the operator must | ||
| // explicitly rebuild after the rolling upgrade completes. | ||
| // | ||
| // The advisory lock held by prepare_schema gives us a | ||
| // tight window — concurrent writes from other schema | ||
| // preps are serialised on the lock, and any in-flight | ||
| // Rust writers are already counter-maintaining by | ||
| // definition (they built against this codebase). | ||
| sqlx::query(&format!( | ||
| r#" | ||
| UPDATE {schema}.queue_ring_state | ||
| SET terminal_counter_trusted_at = now() | ||
| WHERE singleton = TRUE | ||
| AND terminal_counter_trusted_at IS NULL | ||
| AND NOT EXISTS (SELECT 1 FROM {schema}.done_entries LIMIT 1) | ||
| "# | ||
| )) | ||
| .execute(install_tx.as_mut()) | ||
| .await | ||
| .map_err(map_sqlx_error)?; | ||
|
|
There was a problem hiding this comment.
Don't auto-trust upgraded schemas just because done_entries is empty.
Line 3501 uses current emptiness of done_entries as the freshness signal, but an existing pre-#290 schema can also be empty when the first new node boots. If old binaries are still live after this update, they can append to done_entries without touching queue_terminal_live_counts, and queue_counts_exact() will immediately switch to the counter path and undercount terminal rows. Gate this auto-marking on “schema created in this prepare_schema run” (or another explicit fresh-install signal), not on emptiness alone.
🧰 Tools
🪛 OpenGrep (1.22.0)
[ERROR] 3517-3525: SQL query built via format!() passed to a database method. Use parameterized queries with bind parameters instead.
(coderabbit.sql-injection.rust-format-query)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@awa-model/src/queue_storage.rs` around lines 3501 - 3529, The UPDATE that
auto-sets terminal_counter_trusted_at is currently gated only on done_entries
being empty, which can mistakenly trust upgraded-but-not-fresh schemas; change
the logic in prepare_schema so that the UPDATE on {schema}.queue_ring_state (the
terminal_counter_trusted_at row) runs only when this prepare_schema invocation
actually created the schema (or an explicit "fresh install" marker set in this
transaction), not merely when done_entries is empty; implement this by recording
a boolean (e.g. schema_created_in_prepare) when you create the schema in this
prepare_schema flow and include that condition in the WHERE (or create a
transient marker row checked in the WHERE), ensuring the query executed with
install_tx only auto-trusts when that marker/flag is present.
Summary
Successor to #305 (auto-closed when its base branch — the now-squashed #304 — was deleted). Carries the same two commits, rebased onto current main:
queue_counts_exactnow reads live terminal counts fromqueue_terminal_live_countsinstead of scanningdone_entries.QueueCounts.completedrenamed toQueueCounts.terminal— honest name (the count includes failed and cancelled rows, not just completed).CREATE INDEX (queue, priority) ON queue_terminal_live_counts. Without it, the new read path scans every queue's counter rows in multi-queue schemas — undercutting the perf(queue_storage): live terminal counter for queue_counts_exact #290 perf goal.After this lands, the headline regressor from #169's
long_horizonevidence — theO(done_entries)scan insidequeue_counts_exact— is gone, and the read path is an indexed range scan over a small denormalised counter.What ships
Read switch in
queue_counts_exactThe
live_terminalCTE used to be:…O(done_entries-for-queue). Now reads:
…indexed range scan + heap fetch for the matching counter rows (at most
queue_slot_count × priorities × enqueue_shardsrows per queue).Queue-leading index
CREATE INDEX IF NOT EXISTS idx_..._queue ON queue_terminal_live_counts (queue, priority). The PK leads withready_slot(right shape for the row-level UPSERT, wrong shape for the read aggregation), so the new index is what makes the perf claim load-bearing.Notes on the index choice:
(queue, priority)keeps it narrow while supporting future per-priority drill-down.INCLUDE (live_terminal_count)— that would block HOT updates on the column the increment/decrement path mutates on every write, re-introducing the v016 bloat shape perf(queue_storage): live terminal counter for queue_counts_exact #290 set out to avoid. Heap fetch per matching row is fine.Rename
QueueCounts.completed→QueueCounts.terminalThe historical name was a misnomer:
queue_counts_exacthas always returnedcount(*) FROM done_entries, which includesfailedandcancelledrows. Field doc spells out the semantic.queue_counts_fastgot the same rename + a clarifying note.Internal-only — the struct doesn't derive
Serialize, so no JSON wire shape changes. Test/bench callers updated; the bench output's localQueueStorageSnapshot.completedfield is preserved so downstream bench-dashboard scripts that consume the JSON don't break; only the RHS<queue_counts>.completedaccesses are renamed.Refs
Test plan
cargo fmt --allcargo clippy --workspace --all-targets --all-features -- -D warningscargo build --workspace --all-targetscargo test -p awa --test queue_storage_runtime_test test_queue_terminal_live_countsagainst Postgres — all 6 perf(queue_storage): live terminal counter for queue_counts_exact #290 invariant tests still passcargo test -p awa --test queue_storage_runtime_test queue_counts— 3 pre-existing queue_counts tests pass, proving the counter-backed read returns the same values as the old scanSummary by CodeRabbit