diff --git a/correctness/storage/AwaDeferredMaterialize.cfg b/correctness/storage/AwaDeferredMaterialize.cfg new file mode 100644 index 00000000..cf61f023 --- /dev/null +++ b/correctness/storage/AwaDeferredMaterialize.cfg @@ -0,0 +1,15 @@ +CONSTANTS + Jobs = {"j1", "j2"} + ReadySegments = {"s0", "s1", "s2"} + MaxRunLease = 2 + +SPECIFICATION Spec + +INVARIANTS + TypeOK + ExactlyOneOpen + ClosureSourceNotPruned + ClaimedSourceNotPruned + ReadySourceNotPruned + ReaderProjectionConsistent + NoLostJob diff --git a/correctness/storage/AwaDeferredMaterialize.tla b/correctness/storage/AwaDeferredMaterialize.tla new file mode 100644 index 00000000..b8c1dc6f --- /dev/null +++ b/correctness/storage/AwaDeferredMaterialize.tla @@ -0,0 +1,292 @@ +---- MODULE AwaDeferredMaterialize ---- +EXTENDS TLC, Naturals, FiniteSets + +\* ADR-024 was REJECTED 2026-05-11 after benchmarking showed the +\* deferred design was 34% slower than the synchronous path on the +\* sweep workload shape. This spec is preserved as a record of the +\* design space — it captures the safety + liveness contract any +\* future revisit would have to satisfy. The Rust implementation it +\* described has been removed; do not assume it is in the codebase. +\* +\* Focused contract spec for deferred done_entries materialisation. +\* +\* Today (pre-change): completing a job atomically writes both the +\* `lease_claim_closures` row (the durability-of-completion record) +\* and the `done_entries` row (the terminal index used by the UI / DLQ +\* / admin views). Two heavy writes per completion, one of them a +\* JSONB duplicate of `ready_entries` data the database already +\* holds. +\* +\* Proposed change: the hot path writes only the closure. A periodic +\* materialiser scans closures missing a corresponding done_entries +\* row, JOINs `lease_claims` (for partition routing keys) and +\* `ready_entries` (for kind / args / payload / unique fields, all of +\* which are still in the database after claim — the receipt-ring +\* model never DELETEs them per-row), and bulk-INSERTs the missing +\* terminal rows. Read-side views UNION (closures with outcome but +\* no done_entries yet) so callers never observe a job as +\* still-running once the closure exists. +\* +\* The new safety hazard the spec is here to catch: a `ready_entries` +\* partition TRUNCATE that fires before the materialiser has +\* processed every closure pointing into that partition would +\* orphan the source data and leave done_entries permanently +\* incomplete. The rotation guard must block partition TRUNCATE on +\* "all closures referencing this partition have been materialised". +\* +\* What the spec models: +\* - per-job state machine: idle -> ready -> claimed -> closed -> materialised +\* - the partition pointer the closure carries (the receipt's +\* ready_slot, captured at claim time) +\* - ready-partition lifecycle: open -> sealed -> pruned +\* - the rotation guard on Prune +\* - the materialiser action +\* - a crash that loses worker state mid-flight (a closed job is +\* still closed; the materialiser doesn't care about workers) +\* +\* What it does NOT model (out of scope): +\* - row-level Postgres locks — see feedback in +\* `~/.claude/projects/.../memory/feedback_tla_abstraction.md`; +\* this spec speaks at the table-row abstraction +\* - retries, DLQ, attempt_state — orthogonal to the materialiser +\* contract +\* - heartbeat / deadline rescue — those produce closures with +\* outcome=failed, which the materialiser materialises the same way +\* - performance — this is a correctness model, not a load model + +CONSTANTS + Jobs, \* set of job IDs + ReadySegments, \* set of partition IDs for ready_entries / done_entries + MaxRunLease \* bound on attempts per job (TLC needs finite state) + +ASSUME Cardinality(Jobs) >= 1 +ASSUME Cardinality(ReadySegments) >= 2 \* need at least one to seal + one open +ASSUME MaxRunLease \in Nat /\ MaxRunLease >= 1 + +VARIABLES + state, \* per-job: idle | ready | claimed | closed | materialised + seg, \* per-job: which ready partition holds (or held) its data + runLease, \* per-job: monotonically increasing attempt counter + segStatus, \* per-partition: open | sealed | pruned + currentSeg \* the current open partition (single-writer cursor) + +vars == <> + +\* ---- Domain helpers ---- + +JobStates == {"idle", "ready", "claimed", "closed", "materialised"} +SegStatuses == {"open", "sealed", "pruned"} + +\* A closure exists iff state is closed or materialised. +HasClosure(j) == state[j] \in {"closed", "materialised"} + +\* The closure's source data is still readable iff the partition that +\* held it has not yet been pruned. +SourceReadable(j) == segStatus[seg[j]] \in {"open", "sealed"} + +\* The reader-side projection: a job appears completed (terminal) once +\* the closure exists, regardless of whether done_entries has been +\* written yet. This is the UNION view the implementation must serve. +ReaderTerminal(j) == HasClosure(j) + +\* ---- Initial state ---- + +Init == + /\ state = [j \in Jobs |-> "idle"] + /\ seg = [j \in Jobs |-> CHOOSE s \in ReadySegments : TRUE] \* arbitrary; only meaningful once state != idle + /\ runLease = [j \in Jobs |-> 0] + /\ segStatus = [s \in ReadySegments |-> + IF s = (CHOOSE x \in ReadySegments : TRUE) + THEN "open" ELSE "sealed"] + /\ currentSeg = (CHOOSE s \in ReadySegments : TRUE) + \* Re-establish the open invariant: pick the same seg as currentSeg. + +\* Re-do Init properly: pick currentSeg deterministically and align +\* segStatus to it so exactly one partition is open. +InitProper == + \E s0 \in ReadySegments : + /\ state = [j \in Jobs |-> "idle"] + /\ seg = [j \in Jobs |-> s0] + /\ runLease = [j \in Jobs |-> 0] + /\ segStatus = [s \in ReadySegments |-> IF s = s0 THEN "open" ELSE "sealed"] + /\ currentSeg = s0 + +\* ---- Actions ---- + +\* Enqueue: write a ready_entries row in the current open partition. +\* The job's run_lease advances on enqueue (matches Rust: each enqueue +\* attempt increments the lease). +Enqueue(j) == + /\ state[j] = "idle" + /\ runLease[j] < MaxRunLease + /\ state' = [state EXCEPT ![j] = "ready"] + /\ seg' = [seg EXCEPT ![j] = currentSeg] + /\ runLease' = [runLease EXCEPT ![j] = runLease[j] + 1] + /\ UNCHANGED <> + +\* Claim: write a lease_claims receipt. The receipt captures the +\* partition pointer; ready_entries row stays in place (receipt-ring +\* model). Cannot claim from a pruned partition by construction — +\* ready_entries data has to still be readable for the worker to +\* assemble the JobRow. +Claim(j) == + /\ state[j] = "ready" + /\ SourceReadable(j) + /\ state' = [state EXCEPT ![j] = "claimed"] + /\ UNCHANGED <> + +\* Complete: write the closure. Hot path stops here in the new +\* design — no done_entries write at this step. +Complete(j) == + /\ state[j] = "claimed" + /\ state' = [state EXCEPT ![j] = "closed"] + /\ UNCHANGED <> + +\* Materialise: the periodic maint pass picks up a closure that has +\* not yet been materialised, JOINs ready_entries (still readable — +\* checked in the precondition), and writes the done_entries row. +\* Idempotent: re-firing the action with the same job is a no-op +\* because state is already 'materialised'. +Materialise(j) == + /\ state[j] = "closed" + /\ SourceReadable(j) \* the rotation guard's job is to keep this true for any closed j + /\ state' = [state EXCEPT ![j] = "materialised"] + /\ UNCHANGED <> + +\* RotateOpen: seal the current open partition and open a new one. +\* Models queue_ring_state.current_slot advancing. +RotateOpen == + \E next \in ReadySegments : + /\ next /= currentSeg + /\ segStatus[next] /= "open" \* not currently open; can be sealed or pruned + \* Re-opening a pruned partition is fine; it gets re-initialised at TRUNCATE time. + /\ segStatus' = [segStatus EXCEPT + ![currentSeg] = "sealed", + ![next] = "open"] + /\ currentSeg' = next + /\ UNCHANGED <> + +\* Prune: TRUNCATE a sealed partition. The rotation guard requires +\* every job whose data lives in this partition to have already +\* reached `materialised` — otherwise we'd orphan the source data +\* needed by a future Materialise. +Prune(s) == + /\ segStatus[s] = "sealed" + /\ \A j \in Jobs : + seg[j] = s => state[j] \in {"idle", "materialised"} + /\ segStatus' = [segStatus EXCEPT ![s] = "pruned"] + /\ UNCHANGED <> + +\* Crash: a worker disappears between Claim and Complete (or before). +\* Models the worker-process-died case. The job's claim row stays in +\* lease_claims; eventually the deadline-rescue path writes a closure +\* with outcome=failed (modeled here by transitioning closed). For +\* simplicity we don't distinguish outcome — the materialiser handles +\* both completed and failed closures the same way. +Crash(j) == + /\ state[j] = "claimed" + /\ SourceReadable(j) \* rescue path needs to read ready_entries to materialise + /\ state' = [state EXCEPT ![j] = "closed"] + /\ UNCHANGED <> + +Next == + \/ \E j \in Jobs : Enqueue(j) + \/ \E j \in Jobs : Claim(j) + \/ \E j \in Jobs : Complete(j) + \/ \E j \in Jobs : Materialise(j) + \/ \E j \in Jobs : Crash(j) + \/ RotateOpen + \/ \E s \in ReadySegments : Prune(s) + +Spec == InitProper /\ [][Next]_vars + +\* Fairness: the materialiser gets to run on every closed job that has +\* readable source data, and the rotation eventually advances. Without +\* fairness on Materialise, Prune(s) is permanently disabled the +\* moment any job lands at state=closed in s, which is correct safety +\* but blocks liveness. +Fairness == + /\ \A j \in Jobs : WF_vars(Materialise(j)) + /\ \A s \in ReadySegments : WF_vars(Prune(s)) + +LiveSpec == Spec /\ Fairness + +\* ---- Type / shape invariants ---- + +TypeOK == + /\ state \in [Jobs -> JobStates] + /\ seg \in [Jobs -> ReadySegments] + /\ runLease \in [Jobs -> 0..MaxRunLease] + /\ segStatus \in [ReadySegments -> SegStatuses] + /\ currentSeg \in ReadySegments + +ExactlyOneOpen == + /\ segStatus[currentSeg] = "open" + /\ \A s \in ReadySegments : (s /= currentSeg) => segStatus[s] /= "open" + +\* ---- Safety contract ---- + +\* The headline invariant the rotation guard is here to enforce: +\* whenever a job has a closure that has not yet been materialised, +\* the partition containing its source data is NOT pruned. If this +\* ever fails, the materialiser would have nothing to read and +\* done_entries would be permanently incomplete for that job. +ClosureSourceNotPruned == + \A j \in Jobs : + (state[j] = "closed") => (segStatus[seg[j]] /= "pruned") + +\* No claimed job's source can be pruned either — the worker is +\* still relying on it to assemble the JobRow + complete the job. +ClaimedSourceNotPruned == + \A j \in Jobs : + (state[j] = "claimed") => (segStatus[seg[j]] /= "pruned") + +\* No ready job's source can be pruned — that would silently delete +\* enqueued work. +ReadySourceNotPruned == + \A j \in Jobs : + (state[j] = "ready") => (segStatus[seg[j]] /= "pruned") + +\* The reader-side terminal projection is consistent: a job is shown +\* terminal iff its closure exists, regardless of materialisation. +\* Stated as: never the case that the reader sees terminal but no +\* closure has been written. (Trivially holds by definition of +\* HasClosure / ReaderTerminal; the invariant is a tripwire if the +\* projection ever drifts.) +ReaderProjectionConsistent == + \A j \in Jobs : ReaderTerminal(j) <=> HasClosure(j) + +\* Materialisation is monotonic: once a job is materialised it stays +\* materialised. Re-running maint never undoes a row. +MaterialiseMonotonic == + \A j \in Jobs : + (state[j] = "materialised") => + \* state can only transition out of materialised on a fresh + \* enqueue (run_lease advances). Express that as: if + \* materialised, the only post-state allowed is materialised + \* OR a higher run_lease. Encoded across two adjacent + \* states using the [Next]_vars formula -- TLC checks step + \* relations through TLA+'s built-in temporal operators. + TRUE \* placeholder: see MaterialiseMonotonicTemporal below + +\* Temporal version (checked as a property, not an invariant): +\* []((state[j] = "materialised") => [](state[j] = "materialised" +\* \/ state[j] = "ready")) +\* expressed using the implementation's promise that state monotonically +\* advances within a single run_lease. + +NoLostJob == + \A j \in Jobs : + \* If a job left idle, it is in some valid in-flight or terminal + \* state — never silently disappears. + (state[j] /= "idle") => + state[j] \in {"ready", "claimed", "closed", "materialised"} + +\* ---- Liveness ---- + +\* Under fairness, every closed job is eventually materialised. +EventuallyMaterialised == + \A j \in Jobs : + (state[j] = "closed") ~> (state[j] = "materialised") + +============================================================================ diff --git a/correctness/storage/AwaDeferredMaterializeLiveness.cfg b/correctness/storage/AwaDeferredMaterializeLiveness.cfg new file mode 100644 index 00000000..71590a20 --- /dev/null +++ b/correctness/storage/AwaDeferredMaterializeLiveness.cfg @@ -0,0 +1,14 @@ +CONSTANTS + Jobs = {"j1"} + ReadySegments = {"s0", "s1"} + MaxRunLease = 1 + +SPECIFICATION LiveSpec + +INVARIANTS + TypeOK + ExactlyOneOpen + ClosureSourceNotPruned + +PROPERTIES + EventuallyMaterialised diff --git a/docs/adr/024-deferred-done-entries.md b/docs/adr/024-deferred-done-entries.md new file mode 100644 index 00000000..c0d315da --- /dev/null +++ b/docs/adr/024-deferred-done-entries.md @@ -0,0 +1,306 @@ +# ADR-024: Deferred `done_entries` Materialisation + +## Status + +**Rejected (2026-05-11)** — investigated, prototyped, benchmarked, and +backed out. Synchronous `done_entries` write on the completion hot +path stays the only first-class implementation. The TLA+ spec under +`correctness/storage/AwaDeferredMaterialize.tla` is preserved as a +record of the design space; the runtime branching, materialiser +task, and rotation guard added by this ADR have been removed. + +The motivation, design, and what we learned are kept below for +reference. + +## Why rejected + +A/B benchmark on the same hardware as the 2026-05-09 sweep shape +(producer-rate=50000, depth-target=4000, W=64, 30 s warmup + +60 s clean): + +| mode | completion (jobs/s) | enqueue | depth | e2e p99 | +|---|---:|---:|---:|---:| +| sync (this branch) | **2,803** | 3,222 | 4,140 | 1,396 ms | +| deferred (final design) | 1,839 | 2,023 | 3,895 | 1,954 ms | + +Deferred shipped **34 % lower** completion-rate and **40 % higher** +p99. The first-attempt design (closure prune gated on materialiser) +was even worse — multi-second stalls — and prompted a redesign to +inline the materialise pass in `prune_oldest_claims` itself. That +redesign removed the stalls but did not recover throughput parity +with sync, let alone deliver lift. + +Why it didn't pan out, in retrospect: + +- **The `done_entries` INSERT is not the bottleneck on this + workload.** Sync mode at 2.8k jobs/s on macOS-docker / pg18 leaves + plenty of headroom; the JSONB write isn't where the time goes. +- **The deferred design adds three coupled mechanisms** (periodic + materialiser timer, inline catch-up in `prune_oldest_claims`, + rotation guard via `prune_oldest`'s existing pending check). Each + mechanism is correct in isolation; together they introduce + partition-rotation contention that exceeds the saved hot-path WAL. +- **The read-side migration was still pending** (`v016` UNION branch + in `awa.jobs_compat()`). The ADR could not flip the default + without it, and the bench numbers said it was not worth shipping + even if we did the read-side work. + +The simpler architecture wins: keep the synchronous receipt-ring +complete path. The throughput gap to pgque (14k vs 40k in the +sweep) is structural — pgque trades feature surface and per-job +durability for throughput, awa keeps both. The right awa +improvements live elsewhere — bloat-resistance under `idle_in_tx` +and `active_readers` (75 % / 71 % drops in the sweep) are +awa-specific and closeable; that's where the next investigation +should look. + +## Context + +The 2026-05-09 cross-system benchmark (postgresql-job-queue-benchmarking +sweep #26) measured awa at 14.2 k jobs/s (1×256w) against pgque's 40 k. +Investigation against the upstream pgque source showed pgque pays ~1 +row write per completed job (one `last_tick` UPDATE per consumed +batch); awa pays ~3 row writes per completed job — `lease_claim_closures` +INSERT (5 cols), `attempt_state` DELETE, and `done_entries` INSERT (19 +cols including `args` and `payload` JSONB). + +The dominant WAL hit on the awa hot path is `done_entries`: each row +duplicates the `args` and `payload` JSONB the database already has in +`ready_entries`. The receipt-ring model from ADR-023 keeps +`ready_entries` rows alive after claim — they are reclaimed by partition +TRUNCATE rotation, not row-level DELETE — so the source data is still +queryable when completion fires. + +This means `done_entries` can be reconstructed from +`lease_claim_closures` ⨝ `lease_claims` ⨝ `ready_entries` *after* the +fact. Doing so on a periodic maint pass instead of synchronously on +the completion hot path moves the JSONB write off the latency path +and amortises it across larger batches. + +## Decision + +The completion hot path stops writing `done_entries` directly. A new +**materialiser** background task scans `lease_claim_closures` for rows +that don't yet have a corresponding `done_entries` row, joins +`lease_claims` and `ready_entries` to assemble the terminal record, +and bulk-INSERTs into `done_entries`. + +A read-side **deferred view** (`awa.done_entries_view` or callsite +UNION) returns the union of physically-materialised rows and +not-yet-materialised closures so external callers (admin, CLI, +`jobs_view`) never observe a job as still-running once its closure +exists. + +A **rotation guard** prevents a `ready_entries` partition from being +TRUNCATEd while any `lease_claim_closures` row still references it +without a corresponding `done_entries` row. Without the guard, a +partition rotation between closure-write and materialise-read would +orphan the source data permanently. + +## Architecture + +### Hot-path complete (changed) + +`complete_runtime_batch` receipt path drops the `insert_done_rows_tx` +call. The CTE remains the same: + +```sql +WITH completed (claim_slot, job_id, run_lease) AS (...), +locked_claims AS (... FOR UPDATE OF claims), +inserted_closures AS ( + INSERT INTO {schema}.lease_claim_closures (...) + SELECT ... FROM locked_claims + ON CONFLICT DO NOTHING + RETURNING job_id, run_lease +), +deleted_attempts AS ( + DELETE FROM {schema}.attempt_state + USING inserted_closures + WHERE ... +) +SELECT job_id, run_lease FROM inserted_closures; +``` + +No `done_entries` write. WAL footprint per completion: closure row (5 +cols, no JSONB) + attempt-row delete + minimal index touches. + +### Materialiser (new) + +A new periodic task in `awa-worker/src/materializer.rs` runs on a +configurable cadence (`AWA_MATERIALIZE_INTERVAL_MS`, default 100 ms) +and a single-leader lease (cooperates with maintenance leader +election; doesn't need its own). + +Each pass: + +1. Reads up to `AWA_MATERIALIZE_BATCH` (default 4096) closure rows + that have no matching `done_entries` row, oldest first. +2. JOINs the source data: + - `lease_claim_closures` for `(claim_slot, job_id, run_lease, + outcome, closed_at)` + - `lease_claims` for `(ready_slot, ready_generation, queue, + priority, attempt, max_attempts, lane_seq)` + - `ready_entries` for `(kind, args, run_at, attempted_at, + created_at, unique_key, unique_states, payload)` +3. Bulk-INSERTs into `done_entries` with `ON CONFLICT + (ready_slot, queue, priority, lane_seq) DO NOTHING` (idempotent). + +The pass is a single SQL statement — one round-trip per batch. + +### Rotation guard (already in place) + +Implementation note: a fresh review of `prune_oldest` in +`awa-model/src/queue_storage.rs` showed the existing pre-TRUNCATE +check already serves as the rotation guard the deferred path needs. +Today's check counts ready_entries rows in the candidate partition +that have no corresponding done_entries row: + +```sql +SELECT count(*)::bigint +FROM {ready_child} AS ready +LEFT JOIN {done_child} AS done + ON done.ready_generation = ready.ready_generation + AND done.queue = ready.queue + AND done.priority = ready.priority + AND done.lane_seq = ready.lane_seq +WHERE done.lane_seq IS NULL +``` + +Under the synchronous path this counted "still-running or never-started" +work. Under the deferred path the same count *also* catches +"completed-but-not-yet-materialised" — the closure has been written but +done_entries lags. The prune returns +`SkippedActive { reason: QueuePendingReady, count: N }` and the +maintenance loop retries on the next vacuum tick. + +Once the materialiser drains the backlog (default cadence 100 ms), the +LEFT JOIN matches, the count drops to zero, and the next +`rotate_queue_storage_queue` tick prunes successfully. No new SQL +needed; the `materialise_done_entries` task is the only addition. + +The dedicated helper `unmaterialised_closures_for_ready_partition` +mirrors this check from the closure side; it's exposed for ops +visibility but not on the prune hot path. + +### Read-side projection + +`awa.jobs_view` and admin queries that today read `done_entries` +directly switch to a UNION view that surfaces both: + +- Physical `done_entries` rows (the common case after maint runs) +- Synthesised rows from `lease_claim_closures` + `lease_claims` + + `ready_entries` for closures not yet materialised + +The synthesised projection cannot be cached — it joins three live +tables on every read — but it's bounded by the materialiser-batch +backlog, which under default cadence stays under a few thousand rows. + +### Crash safety + +`lease_claim_closures` is the source of truth: a job is *completed* +the moment a closure row is written. Crash recovery requires no +extra mechanism — the next materialiser pass picks up unmaterialised +closures. + +If the materialiser itself crashes mid-batch, its work is idempotent: +the next pass sees the same backlog and re-runs `INSERT ... ON +CONFLICT DO NOTHING`. No state needs to be carried across passes. + +## Consequences + +**Throughput.** Hot-path WAL drops by the entire `done_entries` +INSERT. JSONB writes (the heaviest component, since `args` and +`payload` are duplicated) move off the completion latency path. +Expected lift on the bench shape: 20–40 % at 1×128 w / 1×256 w. Will +be confirmed by the 4-cell A/B in +`benchmarks/portable/long_horizon` once implementation lands. + +**Latency: completion p99 improves.** No JSONB write on the +completion path. + +**Latency: read-after-complete sees a small lag.** A worker that +completes a job and immediately queries the admin view to read it +back may see "synthesised" projection rather than physical +`done_entries`. The synthesised projection is correct (returns the +same row); the only operational difference is plan time vs. an indexed +read. Materialiser cadence (100 ms) bounds this to sub-second. + +**Bloat profile unchanged.** `done_entries` is partitioned and +TRUNCATE-rotated by ADR-019; that doesn't change. The maint pass +just feeds it on a slightly slower clock. + +**New failure mode: materialiser falls behind.** If the materialiser +backlog grows unboundedly (e.g. a stuck rotation), reads through the +synthesised view get slower; eventually the rotation guard would +block partition rotation entirely. The `materialise_lag` metric +(closures - done_entries) is exposed in `awa.runtime_storage_health` +for ops monitoring. + +**Schema impact.** Pre-0.6, no compatibility hops. Existing +clusters running an alpha will need a one-shot backfill (run the +materialiser inline once on first start of the new binary, against +all closures whose done_entries row is missing). Encoded as a +data-migration step in the install path. + +## Still to land before flipping the default + +The branch as committed enables the deferred path under the +`QueueStorageConfig.deferred_done_entries` flag (default false). The +hot-path skip, materialiser, and rotation guard are correct and +tested. The remaining production-readiness gate is **read-side +projection updates** — under `deferred=true`, callers that read +`done_entries` directly may briefly miss a completed job during the +materialiser cadence window. + +Affected read paths (each must UNION in a synthesised projection +from `lease_claim_closures` ⨝ `lease_claims` ⨝ `ready_entries` +WHERE `done_entries.lane_seq IS NULL`): + +1. `awa.jobs_compat()` in v012 — the user-facing `awa.jobs` view's + backing function. Currently UNIONs ready_entries + lease rows + + done_entries + dlq_entries; needs a fifth branch. +2. `cli get-job ` — admin lookup. Routes through `jobs_compat` + today, picks up the fix automatically once the view is updated. +3. `awa-ui` runtime API endpoints — same. + +A dedicated migration `v016_deferred_done_entries.sql` rewrites +`awa.jobs_compat()` to add the synthesised branch. Until that +migration ships, flipping `deferred_done_entries=true` is safe for +bench/throughput experiments (the bench worker doesn't query +`jobs_view`) but unsafe for production deployments that have +operators or job state machinery polling `awa.jobs`. + +## Validation + +- TLA+ spec at `correctness/storage/AwaDeferredMaterialize.tla`. + Safety + liveness exhaustive on (Jobs={j1,j2}, 3 segments, max + run_lease 2): 1710 distinct states, all invariants hold. +- Unit tests in `awa/tests/queue_storage_runtime_test.rs`: + - hot path skips done_entries write + - materialiser reconstructs the same row the synchronous path + would have written + - rotation guard blocks Prune when closure backlog references the + partition; unblocks after materialiser drains + - read-side view returns the same job state during the transient + closure-but-no-done window +- A/B on `postgresql-job-queue-benchmarking` long_horizon at W=64 / + 128 / 256 with `AWA_DEFER_DONE_ENTRIES=1`. Recorded to + `docs/adr/bench/024-deferred-done-entries-2026-05-XX/`. + +## Alternatives considered + +- **Thin done_entries (Design B in the plan).** Drop only the + JSONB columns from `done_entries`, keep the row write + synchronous, JOIN `ready_entries` on read. Smaller change, no + rotation guard. Rejected because the hot-path write is still on + the latency path and the read side has to JOIN per row anyway. + Deferred materialisation is a strict superset of the win. +- **No `done_entries` at all (Design C extreme).** Reconstruct the + terminal projection on every read from `closures` ⨝ + `lease_claims` ⨝ `ready_entries`. Rejected because audit / + long-tail reads break once `ready_entries` partitions rotate out. + Deferred materialisation gives us a stable terminal index for + reads past the rotation horizon. +- **Different per-attempt scratch table.** Stash `args/payload` in a + dedicated `closure_payload` table, write per-completion. Same WAL + cost as the status quo; no win. diff --git a/docs/adr/README.md b/docs/adr/README.md index 61c2f11f..61a0f233 100644 --- a/docs/adr/README.md +++ b/docs/adr/README.md @@ -36,6 +36,7 @@ in-place as historical context. | 021 | [Sequential callbacks and callback heartbeats](021-enhanced-external-wait.md) | Accepted | `wait_for_callback()` + `resume_external()` for multi-step orchestration; `heartbeat_callback` for long-running externals | Callback state moved to `active_leases` per ADR-019 | | 022 | [Descriptor catalog](022-descriptor-catalog.md) | Accepted | `queue_descriptors` / `job_kind_descriptors` tables, BLAKE3-hashed, code-declared, off the hot path | Off the queue-storage hot path | | 023 | [Receipt plane ring partitioning](023-receipt-plane-ring-partitioning.md) | Accepted | Partitioned `lease_claims` / `lease_claim_closures` ring replaces `open_receipt_claims`; receipts default on in 0.6 | Refines ADR-019 receipt plane | +| 024 | [Deferred done_entries materialisation](024-deferred-done-entries.md) | **Rejected** | Move `done_entries` write off the completion hot path to a maintenance materialiser pass | Investigated 2026-05-10/11; benchmarked 34 % slower than synchronous receipt-ring complete. Kept as design-space record alongside `correctness/storage/AwaDeferredMaterialize.tla` | ## Validation artifacts