Releases: micllam/taquba
Releases · micllam/taquba
taquba-workflow-v0.6.0
Changed
- Terminal marker filenames lead with an inverted timestamp, and the
memo-retention sweep lists only expired markers (via the
object-storelist_with_offsetcontract) instead of every retained
marker on every tick, so a sweep's listing cost is proportional to
the expired set.MemoStore::list_expired_terminal_markersis the
new sweeper building block;list_terminal_markersremains for
inspection. Markers written by earlier versions are not recognised
by the sweeper: when upgrading a store that ran with
memo_retentionenabled, clear the<memo_prefix>/terminals/
prefix out-of-band. - Step transitions settle atomically. The next step's enqueue (for
Continue/ContinueAfter) and the terminal run-record delete
(for terminal outcomes) now join the current step's acknowledgement
transaction via Taquba'sack_with, halving the durable commits
per transition and removing the crash window between enqueuing the
next step and acking the current one: a step's successor exists if
and only if the step's settlement committed. The terminal hook now
fires before the settlement commits rather than after the run-record
delete; hooks remain at-least-once as before.
Fixed
WorkflowRuntime::submitno longer serialises every submission on a
process-wide lock held across queue I/O. The duplicate-check lock is
now per run id, so concurrent submissions of distinct runs proceed in
parallel and share WAL group commits. Previously a batch of
submissions completed at one run per flush interval regardless of
submission concurrency (about ten runs per second at SlateDB's
default 100 ms flush); same-run-id submissions keep their existing
duplicate and input-mismatch semantics.
Added
WorkflowRuntimeBuilder::step_output_replay: opt-in
content-addressed replay of runner-returned step outcomes, keyed by
(run_id, step_number, SHA-256(step payload)). When enabled, the
runtime persists everyStepOutcomethe runner returns (including
FailandCancel) before applying it; if the same step is delivered
again after a crash before ack, the stored outcome is replayed without
invoking the runner again. Step errors are not recorded, so retries
still invoke the runner. A replayedContinueAfterreduces its delay
by the time already elapsed since the outcome was stored, preserving
the original schedule.Memo::content_getandMemo::content_putderive per-step memo keys
from a MessagePack serialization of caller-supplied input hashed with
SHA-256.
taquba-webhooks-v0.3.0
Changed
- Raised the minimum
taqubarequirement to 0.8.
taquba-v0.8.0
Added
Queue::claim_batchclaims up tomax_jobspending jobs in one
transaction, sharing one claim-lock hold and one commit across the
batch. Jobs are returned in claim order and share one lease.
Queue::claimis now a batch of one.Queue::wait_for_jobs_onblocks until a job becomes claimable on
one queue. UnlikeQueue::wait_for_jobs, the wakeup is queue-scoped
and delivered to one waiter per inserted job.Queue::ack_withacknowledges a job and applies a set of effects in
the same transaction: follow-up enqueues (AckEffects::enqueues,
honouringrun_at,dedup_key,priority, andid_overrideper
request) and caller KV writes and deletes. Either the ack and every
effect land together or nothing does; when the claim is gone the
call fails withClaimLostand applies nothing, so a chained job
exists only if the settlement that created it won.Queue::ackis
nowack_withwith empty effects.Error::ClaimLost: returned byack,ack_with,nack,
dead_letter, andrenew_leasewhen the record's claim is no
longer present (the lease expired and the reaper requeued the job,
or the record is a stale copy from before a lease renewal rotated
the claimed key). These cases previously returned the catch-all
Error::InvalidState, which remains for genuine misuse (a record
missinglease_expires_at,requeue_dead_jobon a non-dead
record).Worker::process_with_effects: workers can returnAckEffects
from processing, whichrun_workerandrun_worker_concurrent
apply atomically with the job's acknowledgement via
Queue::ack_with.processandprocess_with_effectsdefault to
each other; implement exactly one. ExistingWorker
implementations are unaffected.Queue::closepersists each queue's claim-scan state (scan bound
and emptiness marker) under a newcursor:key prefix; the next
open restores the in-memory state from it and deletes the record. The
first claim after a clean restart resumes at the recorded bound
instead of re-scanning the tombstone band left by previously claimed
jobs, whose cost grows with the band and the store's latency. After
a crash the record is absent and the first claim falls back to the
front prefix scan as before.
Changed
run_workerno longer exits when settling a job fails. Settlement
failures (includingClaimLostwhen a job outlives its lease and
the reaper requeues it) are logged and the loop continues, matching
run_worker_concurrent; the redelivered attempt settles the job.
Claim-path errors still terminate both loops. Both loops log a lost
claim distinctly from other settlement failures.run_worker_concurrentclaims jobs in batches sized to its free
capacity viaQueue::claim_batch, costing one claim transaction
per batch instead of per job under a backlog. Jobs are still
processed concurrently and acked individually.Queue::claim_with_waitand therun_worker/run_worker_concurrent
loops wait on a queue-scoped wakeup that wakes one waiter per
inserted job, instead of the process-wide notification that woke
every waiting worker on every insert. A pool of idle workers no
longer contends on the claim path when a single job arrives, and a
worker claiming a job passes one wakeup on so a backlog keeps waking
further workers.Queue::claim_with_waitnow also keeps waiting out
its fullmax_waitafter losing a claim race instead of returning
Noneearly.Queue::claimcommits without awaiting WAL durability. Claims
serialise per queue through the claim lock, which excluded them from
WAL group commit: the lock holder awaited its flush before the next
claim could start, making the flush round trip the queue's claim
throughput ceiling.
Losing an unflushed claim in a crash leaves the job pending, so it
is redelivered immediately on recovery instead of after its lease
expires; at-least-once delivery is unaffected, and a settled job's
claim is always durable because later durable commits flush
preceding WAL entries.- The scheduler promotes due jobs without awaiting WAL durability,
for the same reasons and with the same crash behaviour as the
reaper change below: a lost promotion leaves the scheduled key in
place with itsrun_atin the past, and the next tick re-promotes
it. A backlog of due jobs (a retry-backoff wave, or scheduled jobs
accumulated during downtime) no longer promotes at one job per
flush interval. - The reaper requeues and dead-letters expired claims without awaiting
WAL durability. Each expired claim is processed in its own
transaction, and awaiting the flush serialised the sweep at one job
per flush interval (about ten per second at the default 100 ms
flush). A commit lost in a crash leaves the expired claim in place
for the next sweep, which re-processes it without consuming an
attempt, and later durable commits flush preceding WAL entries, so
a settled job's requeue is durable by ordering. - The done and dead-letter retention sweeps delete expired records
without awaiting WAL durability, for the same reasons as the reaper
and scheduler changes above: a delete lost in a crash leaves the
record in place for the next sweep, whose existence re-check keeps
the rerun idempotent. With this, no background sweep awaits the
flush; only caller-driven operations do. A retention backlog no
longer delays the lease reaping that shares its tick. Queue::claimtracks per-queue emptiness and a scan bound in
process memory. Polling an empty queue answers without a storage
scan or the claim lock, and the pending tombstone band is never
re-walked from the front while the process stays up; a full prefix
scan now happens only on cold start or process restart.- Queue stats counter merges are excluded from transaction conflict
detection. The merges are commutative, so concurrent job-state
transitions on the same queue no longer abort and retry each other
over the shared stats keys.
Fixed
- A
pending:insert landing behind the claim cursor while a claim
was in flight could have its cursor invalidation overwritten by
that claim's cursor update, hiding the job from cursor scans until
the queue next drained. The scan bound now moves back to include
such inserts, and a claim drops its bound advance when the bound
moved while it ran. - A
pending:key could be hidden from claims indefinitely when its
insert committed while a claim was in flight and the key sorted at
or below the keys that claim advanced the scan bound past. Job ids
are generated before the enqueue transaction commits, so commit
order can invert key order under concurrent producers, and a
requeue (reaper or nack) restores a job at its original key. The
next claim then recorded emptiness at a valid epoch and the queue
answeredNonewhile live jobs were pending. Bound advances now
clamp to the smallest key recorded since the bound was observed,
including when no bound exists yet (the first claim after a
process restart) and when the key equals the claimed one (the
claimed job requeued after its lease expired within the claim). - Duplicate
EnqueueOptions::id_overridevalues are now rejected
transactionally withError::DuplicateJobIdinstead of overwriting
jobindex:{id}and leaving older queue-state records behind. Queue::ack,Queue::nack,Queue::dead_letter, and
Queue::renew_leasenow check that the expectedclaimed:record
still exists before settling a job. A worker finishing after its
lease was reaped now getsError::ClaimLostinstead of being
able to ack, retry, dead-letter, renew, or corrupt stats from a
staleJobRecord.Queue::nackandQueue::renew_leasenow retry on transaction
conflict likeQueue::ackandQueue::dead_letteralready did.
A reaper committing the expired-lease delete concurrently with a
late settlement is now retried (and resolves toError::ClaimLost
on the next attempt) instead of surfacing a raw SlateDB transaction
error to the caller.Queue::requeue_dead_jobnow checks that the dead-letter record
still exists before reviving it. Requeueing a stale record after
dead-letter retention swept it now returnsError::JobNotFound
instead of recreating the job and corrupting queue stats.
taquba-jobs-v0.4.0
Changed
- Terminal marker filenames lead with an inverted timestamp, and the
result-retention sweep lists only expired markers (via the
object-storelist_with_offsetcontract) instead of every retained
marker on every tick, so a sweep's listing cost is proportional to
the expired set. Markers written by earlier versions are not
recognised by the sweeper: when upgrading a store that ran with
result_retentionenabled, clear the<result_prefix>/terminals/
prefix out-of-band.
taquba-cron-v0.4.0
Changed
- Raised the minimum
taqubarequirement to 0.8.
taquba-bulk-v0.2.0
Changed
- Batch submission runs with bounded concurrency instead of one
awaited submit at a time. Each submission blocks on a durable
enqueue commit and concurrent commits share WAL flushes, so serial
submission capped at one item per flush interval (one item per
100ms at the SlateDB default). Enqueue order across in-flight
submissions is not defined; batch items are independent.
Added
BulkCtx::memoized_by_contentand
BulkCtx::memoized_by_content_with_cached_costfor memoized steps
whose keys should be derived from serialized input content rather
than caller-supplied strings.BulkCtx::memoized_with_cached_costfor memoized steps whose cost counters
should be recorded both on fresh compute and on memo hits.
taquba-bulk-v0.1.0
Initial release. Per-batch orchestrator that runs one pipeline over many
inputs in a single process on top of taquba-workflow.
Added
Pipeline: the per-item contract (typedInput/Output, anError
that converts into aStepError, and an asyncrun). Each input item
becomes onetaquba-workflowrun whose single step invokesrun; the
pipeline's own logical steps live insiderunasBulkCtx::memoized
calls.BulkCtx<T>: per-item execution context. Carries the typedinput,
run_id, and submitterheaders; exposesmemoized(durable per-step
result caching so an at-least-once retry replays cached results instead of
repeating a paid call),record_cost, andcancel_token.CostReport: generic named-metric accumulator (token counts, paid-API
units, compute-seconds, dollars). Interior-mutable while a step runs and
serializable for the per-item envelope and the batch rollup.Bulk/BulkBuilder: the runner. Submits N runs, drives the worker pool,
streams output as items complete, and aggregates progress and cost.
Builder options:output,key_fn,headers,max_concurrent,
poll_interval,queue_name,memo_prefix,fail_threshold.run
executes to completion;run_with_shutdowndrains in-flight items on a
shutdown signal (e.g. spot preemption).ProgressSnapshot: point-in-time counts, rate, estimated time remaining,
and cost rollup, returned byBulk::progress.BulkReport: final counts, elapsed time, cost rollup, and
failed_run_ids(re-submitting those ids resumes from cached memo state).OutputSinkwithJsonlSink(one JSON record per line) andNullSink
(discards records, for side-effecting pipelines);read_jsonlfor
line-delimited JSON input.Error/Result: crate error type, including
Error::FailureThresholdExceededwhen the share of failed items crosses
the configured threshold.- Re-exports
StepErrorandStepErrorKindfromtaquba-workflowfor the
Pipeline::Errortype.
taquba-jobs-v0.3.0
Added
JobRunnerBuilder::result_retention(Duration): opt-in retention
window for persisted result blobs. When set, the runner writes a
terminal marker every time a job reaches a terminal state and an
in-process sweeper deletes that job's result blobretentionafter
termination. When unset (default), result blobs are retained
indefinitely (the previous behaviour). Once a blob is swept,
JobHandle::fetch_resultfor that job returnsOk(None)and an
idempotent re-submission of the same payload falls through to
re-running the job rather than short-circuiting; size the window
so it covers the longest gap callers need between submission and
idempotent re-submit.JobRunnerBuilder::clock(Arc<dyn Clock>): override the time source
the runner reads its timestamps from (terminal-marker timestamps
and the retention sweep cutoff). Defaults to the queue's clock
(Queue::clock), so passing aMockClockto
Queue::open_with_optionsis enough for tests; this override is
for the rarer case where the runner needs a different clock than
the queue.
Changed
- Idempotent submissions now short-circuit to a prior submission's
persisted outcome. Previously,Job::idempotency_keyonly deduped
against jobs that were still pending or scheduled: a re-submission
after the original acked would create a new job (re-paying for the
work). The dedup record now carries the assignedjob_id(written
atomically with the enqueue via the new
EnqueueOptions::id_override) so a re-submission with a matching
payload returns a handle pointing at the cached result blob (with
newly_submitted = false). - The result-store prefix now reserves a sibling
terminals/segment
for retention markers (<prefix>/terminals/<terminal_at_ms:020>_<job_id>).
Existing result blobs (<prefix>/<job_id>) are unaffected: ULID
job ids cannot collide with the literalterminalssegment. Markers
are only written whenresult_retentionis configured. - Breaking (on-disk):
JobSubmissionRecord(the durable per-idem-key
dedup record) gained ajob_idfield. Records written by earlier
versions oftaquba-jobswill fail to deserialize and need to be
cleared; the simplest path is to delete the queue's user KV prefix
(usr:jobs/dedup/...) when upgrading.
taquba-workflow-v0.5.0
Added
Memo: per-step durable key-value store for memoizing within-step
side effects, backed by object storage. Bound to a specific
(run_id, step_number);get(key)/put(key, value)take only
the user key. Strictly per-step; the durable channel between steps
isStepOutcome::Continue's payload, not memo.MemoStore: the backing storeMemoviews are derived from
(Arc<dyn ObjectStore>+ path prefix). Used internally by the
runtime builder; users construct one directly mainly in tests.Step::memo: every step receives aMemoscoped to its own
(run_id, step_number). Runners use it to cache results of
expensive within-step side effects (LLM calls, paid APIs) so
at-least-once retries don't re-pay for work the prior attempt
already did.WorkflowRuntimeBuilder::memo_prefix: configures the object-store
prefixStep::memoentries live under. Defaults to"workflow-memo";
set a distinct prefix when multiple runtimes share one store.Error::Store(taquba::object_store::Error): surfaced from memo
read/write failures. Classified as transient byis_permanent.WorkflowRuntimeBuilder::memo_retention(Duration): opts the runtime
into writing a terminal marker viaMemoStore::write_terminal_marker
on every terminal state (Succeeded, Failed, Cancelled). Markers
outlive the run record and provide the input a memo-retention sweep
consumes to decide when a run's memo entries are eligible for
deletion. Without this setter no marker is written and memo entries
are retained indefinitely (appropriate for short-lived runs or
external cleanup).- Memo-retention sweeper: when
memo_retentionis set,
WorkflowRuntime::runspawns a background task that periodically
scans terminal markers and, for each marker older than the
configured window, deletes the run's memo entries and then the
marker itself. The first sweep fires on startup so a fresh process
catches markers left behind by an earlier one. The sweeper shuts
down with the caller-supplied shutdown future. WorkflowRuntimenow reads every timestamp it writes
(DurableRunRecord::submitted_at_ms, theContinueAfterrun_at,
and the terminal-marker timestamp) through ataquba::Clock. By
default the runtime shares the clock itsQueuewas opened with
(viaQueue::clock), so passing aMockClocktoOpenOptions
virtualises time for the queue and the workflow runtime together.WorkflowRuntimeBuilder::clock(Arc<dyn Clock>)overrides the
defaulted-from-queue clock when a test or specialised setup needs a
separate time source.
Changed
-
Breaking:
WorkflowRuntime::buildernow takes an additional
requiredobject_store: Arc<dyn ObjectStore>argument between the
queue and the runner. The store backsStep::memoand need not be
the same store the queue was opened with, though sharing one (just
cloning theArc) is the common case. Existing call sites must add
the store argument:// Before: let runtime = WorkflowRuntime::builder(queue, runner, hook).build(); // After: let runtime = WorkflowRuntime::builder(queue, store, runner, hook).build();
taquba-v0.7.0
Added
EnqueueOptions::id_overridelets callers supply the job id instead
of receiving a generated ULID. Useful when the id must be known before
the enqueue returns. Ids are validated at the API boundary (1-128 bytes
of[A-Za-z0-9_-]) and bad inputs return the new
Error::InvalidId { id, reason }variant. Callers should prefer
ULID-shaped ids when FIFO-within-priority claim order matters:
pending/scheduledkeys end with the id, so claim order follows
id sort.Queue::clock()accessor returns theArc<dyn Clock>the queue
was opened with (or the defaultSystemClock). Lets downstream
crates share the queue's time source for their own timestamp work
so virtualising time withMockClockadvances the whole stack
in lockstep.OpenOptions::flush_interval: Option<Duration>exposes SlateDB's
WAL flush interval.Nonekeeps SlateDB's own default (100ms).
Every taquba state transition (enqueue,claim,ack,nack,
dead_letter) blocks ontxn.commit()which waits for the next
flush tick, so this value is the lower bound on per-operation
latency.
Changed
- Breaking on-disk layout: the
done:keyspace is reordered from
done:{queue}:{id}todone:{completed_at:020}:{queue}:{id},
mirroring the existing time-first layout ofclaimed:and
scheduled:. The retention sweep can now early-exit on the first
unexpired record instead of walking the full prefix. Public API is
unchanged; in-flight runs from prior versions must be drained
before upgrading because the old keys will not be observed by the
reaper. Queue::claim(and thereforeclaim_next/claim_with_wait)
serialises same-queue claim attempts through an in-process
tokio::sync::Mutex. Same-queue attempts no longer rely on
SlateDB's transaction-conflict retry to resolve which worker
takes the head ofpending:. The lock is per-queue, so different
queues' claims still run in parallel. Per-claim wall-clock latency
under high single-queue concurrency drops from seconds to roughly
one commit interval. Public API unchanged.Queue::claimnow maintains an in-memory per-queue cursor that
records the most recently claimedpending:key, and starts the
next claim's scan from immediately after it. This skips the
tombstone band left by previously claimed (and deleted)pending:
entries that the SlateDB iterator would otherwise walk. The
cursor is invalidated whenever apending:key is written at or
before it (nack-requeue, dead-job requeue, reaper-requeue,
scheduler promotion, and any enqueue at a lower-numbered
priority); when this happens the next claim falls back to a full
prefix scan. The cursor is not persisted: on process restart the
first claim falls back to the prefix scan and re-warms naturally.
Public API unchanged.- Bumped minimum
slatedbversion from 0.13 to 0.13.1.
Fixed
enqueue_with's non-dedup path (write_new) now retries on
transaction conflict, matching the dedup path (write_unique),
enqueue_with_kv,ack,dead_letter, and every other write path
in the crate. Previously a conflict during a non-dedup enqueue would
surface asError::Storageto the caller; under normal contention
this would have manifested as spurious enqueue failures that a retry
could resolve.