taquba-v0.8.0

Latest

Latest

micllam released this 15 Jun 12:39

· 7 commits to master since this release

73ca489

Added

Queue::claim_batch claims up to max_jobs pending jobs in one
transaction, sharing one claim-lock hold and one commit across the
batch. Jobs are returned in claim order and share one lease.
Queue::claim is now a batch of one.
Queue::wait_for_jobs_on blocks until a job becomes claimable on
one queue. Unlike Queue::wait_for_jobs, the wakeup is queue-scoped
and delivered to one waiter per inserted job.
Queue::ack_with acknowledges a job and applies a set of effects in
the same transaction: follow-up enqueues (AckEffects::enqueues,
honouring run_at, dedup_key, priority, and id_override per
request) and caller KV writes and deletes. Either the ack and every
effect land together or nothing does; when the claim is gone the
call fails with ClaimLost and applies nothing, so a chained job
exists only if the settlement that created it won. Queue::ack is
now ack_with with empty effects.
Error::ClaimLost: returned by ack, ack_with, nack,
dead_letter, and renew_lease when the record's claim is no
longer present (the lease expired and the reaper requeued the job,
or the record is a stale copy from before a lease renewal rotated
the claimed key). These cases previously returned the catch-all
Error::InvalidState, which remains for genuine misuse (a record
missing lease_expires_at, requeue_dead_job on a non-dead
record).
Worker::process_with_effects: workers can return AckEffects
from processing, which run_worker and run_worker_concurrent
apply atomically with the job's acknowledgement via
Queue::ack_with. process and process_with_effects default to
each other; implement exactly one. Existing Worker
implementations are unaffected.
Queue::close persists each queue's claim-scan state (scan bound
and emptiness marker) under a new cursor: key prefix; the next
open restores the in-memory state from it and deletes the record. The
first claim after a clean restart resumes at the recorded bound
instead of re-scanning the tombstone band left by previously claimed
jobs, whose cost grows with the band and the store's latency. After
a crash the record is absent and the first claim falls back to the
front prefix scan as before.

Changed

run_worker no longer exits when settling a job fails. Settlement
failures (including ClaimLost when a job outlives its lease and
the reaper requeues it) are logged and the loop continues, matching
run_worker_concurrent; the redelivered attempt settles the job.
Claim-path errors still terminate both loops. Both loops log a lost
claim distinctly from other settlement failures.
run_worker_concurrent claims jobs in batches sized to its free
capacity via Queue::claim_batch, costing one claim transaction
per batch instead of per job under a backlog. Jobs are still
processed concurrently and acked individually.
Queue::claim_with_wait and the run_worker / run_worker_concurrent
loops wait on a queue-scoped wakeup that wakes one waiter per
inserted job, instead of the process-wide notification that woke
every waiting worker on every insert. A pool of idle workers no
longer contends on the claim path when a single job arrives, and a
worker claiming a job passes one wakeup on so a backlog keeps waking
further workers. Queue::claim_with_wait now also keeps waiting out
its full max_wait after losing a claim race instead of returning
None early.
Queue::claim commits without awaiting WAL durability. Claims
serialise per queue through the claim lock, which excluded them from
WAL group commit: the lock holder awaited its flush before the next
claim could start, making the flush round trip the queue's claim
throughput ceiling.
Losing an unflushed claim in a crash leaves the job pending, so it
is redelivered immediately on recovery instead of after its lease
expires; at-least-once delivery is unaffected, and a settled job's
claim is always durable because later durable commits flush
preceding WAL entries.
The scheduler promotes due jobs without awaiting WAL durability,
for the same reasons and with the same crash behaviour as the
reaper change below: a lost promotion leaves the scheduled key in
place with its run_at in the past, and the next tick re-promotes
it. A backlog of due jobs (a retry-backoff wave, or scheduled jobs
accumulated during downtime) no longer promotes at one job per
flush interval.
The reaper requeues and dead-letters expired claims without awaiting
WAL durability. Each expired claim is processed in its own
transaction, and awaiting the flush serialised the sweep at one job
per flush interval (about ten per second at the default 100 ms
flush). A commit lost in a crash leaves the expired claim in place
for the next sweep, which re-processes it without consuming an
attempt, and later durable commits flush preceding WAL entries, so
a settled job's requeue is durable by ordering.
The done and dead-letter retention sweeps delete expired records
without awaiting WAL durability, for the same reasons as the reaper
and scheduler changes above: a delete lost in a crash leaves the
record in place for the next sweep, whose existence re-check keeps
the rerun idempotent. With this, no background sweep awaits the
flush; only caller-driven operations do. A retention backlog no
longer delays the lease reaping that shares its tick.
Queue::claim tracks per-queue emptiness and a scan bound in
process memory. Polling an empty queue answers without a storage
scan or the claim lock, and the pending tombstone band is never
re-walked from the front while the process stays up; a full prefix
scan now happens only on cold start or process restart.
Queue stats counter merges are excluded from transaction conflict
detection. The merges are commutative, so concurrent job-state
transitions on the same queue no longer abort and retry each other
over the shared stats keys.

Fixed

A pending: insert landing behind the claim cursor while a claim
was in flight could have its cursor invalidation overwritten by
that claim's cursor update, hiding the job from cursor scans until
the queue next drained. The scan bound now moves back to include
such inserts, and a claim drops its bound advance when the bound
moved while it ran.
A pending: key could be hidden from claims indefinitely when its
insert committed while a claim was in flight and the key sorted at
or below the keys that claim advanced the scan bound past. Job ids
are generated before the enqueue transaction commits, so commit
order can invert key order under concurrent producers, and a
requeue (reaper or nack) restores a job at its original key. The
next claim then recorded emptiness at a valid epoch and the queue
answered None while live jobs were pending. Bound advances now
clamp to the smallest key recorded since the bound was observed,
including when no bound exists yet (the first claim after a
process restart) and when the key equals the claimed one (the
claimed job requeued after its lease expired within the claim).
Duplicate EnqueueOptions::id_override values are now rejected
transactionally with Error::DuplicateJobId instead of overwriting
jobindex:{id} and leaving older queue-state records behind.
Queue::ack, Queue::nack, Queue::dead_letter, and
Queue::renew_lease now check that the expected claimed: record
still exists before settling a job. A worker finishing after its
lease was reaped now gets Error::ClaimLost instead of being
able to ack, retry, dead-letter, renew, or corrupt stats from a
stale JobRecord.
Queue::nack and Queue::renew_lease now retry on transaction
conflict like Queue::ack and Queue::dead_letter already did.
A reaper committing the expired-lease delete concurrently with a
late settlement is now retried (and resolves to Error::ClaimLost
on the next attempt) instead of surfacing a raw SlateDB transaction
error to the caller.
Queue::requeue_dead_job now checks that the dead-letter record
still exists before reviving it. Requeueing a stale record after
dead-letter retention swept it now returns Error::JobNotFound
instead of recreating the job and corrupting queue stats.

Assets 2