Releases: micllam/taquba

taquba-workflow-v0.6.0

15 Jun 12:46
73ca489

Changed

  • Terminal marker filenames lead with an inverted timestamp, and the
    memo-retention sweep lists only expired markers (via the
    object-store list_with_offset contract) instead of every retained
    marker on every tick, so a sweep's listing cost is proportional to
    the expired set. MemoStore::list_expired_terminal_markers is the
    new sweeper building block; list_terminal_markers remains for
    inspection. Markers written by earlier versions are not recognised
    by the sweeper: when upgrading a store that ran with
    memo_retention enabled, clear the <memo_prefix>/terminals/
    prefix out-of-band.
  • Step transitions settle atomically. The next step's enqueue (for
    Continue / ContinueAfter) and the terminal run-record delete
    (for terminal outcomes) now join the current step's acknowledgement
    transaction via Taquba's ack_with, halving the durable commits
    per transition and removing the crash window between enqueuing the
    next step and acking the current one: a step's successor exists if
    and only if the step's settlement committed. The terminal hook now
    fires before the settlement commits rather than after the run-record
    delete; hooks remain at-least-once as before.
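
The inverted-timestamp layout in the first item can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the crate's code: the `marker_name` format, the `u64::MAX - ts` inversion, and the `BTreeSet` standing in for the object store are all hypothetical. Only the idea comes from the notes: expired markers form one contiguous key range, so a listing that starts at an offset never touches unexpired ones.

```rust
use std::collections::BTreeSet;
use std::ops::Bound::{Excluded, Unbounded};

/// Hypothetical marker name: inverted, zero-padded timestamp first, then the run id.
/// Older markers get larger inverted values, so they sort last.
fn marker_name(terminal_at_ms: u64, run_id: &str) -> String {
    format!("{:020}_{}", u64::MAX - terminal_at_ms, run_id)
}

/// Lists markers whose terminal time is at or before `cutoff_ms`.
/// The exclusive lower bound mimics an offset-style listing: only names that
/// sort strictly after the offset are returned.
fn list_expired(markers: &BTreeSet<String>, cutoff_ms: u64) -> Vec<String> {
    let offset = format!("{:020}", u64::MAX - cutoff_ms);
    markers
        .range::<String, _>((Excluded(offset), Unbounded))
        .cloned()
        .collect()
}
```

With markers at 1 000 ms, 2 000 ms, and 3 000 ms and a cutoff of 2 000 ms, the listing in this model returns only the first two; the newest marker sorts before the offset and is never visited.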

Fixed

  • WorkflowRuntime::submit no longer serialises every submission on a
    process-wide lock held across queue I/O. The duplicate-check lock is
    now per run id, so concurrent submissions of distinct runs proceed in
    parallel and share WAL group commits. Previously a batch of
    submissions completed at one run per flush interval regardless of
    submission concurrency (about ten runs per second at SlateDB's
    default 100 ms flush); same-run-id submissions keep their existing
    duplicate and input-mismatch semantics.
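
The move from one process-wide lock to a lock per run id can be sketched with the standard library. `RunLocks` and `lock_for` are hypothetical names, the real runtime is asynchronous, and this sketch never evicts entries; it only shows why submissions of distinct runs stop waiting on each other.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical registry handing out one lock per run id.
#[derive(Default)]
struct RunLocks {
    locks: Mutex<HashMap<String, Arc<Mutex<()>>>>,
}

impl RunLocks {
    /// Returns the lock for `run_id`, creating it on first use.
    /// The registry mutex is held only for the map lookup, never across I/O;
    /// the caller then locks the returned per-run mutex for the duplicate check.
    fn lock_for(&self, run_id: &str) -> Arc<Mutex<()>> {
        let mut map = self.locks.lock().unwrap();
        map.entry(run_id.to_owned()).or_default().clone()
    }
}
```

Two callers asking for the same run id receive the same mutex and serialise; callers with different run ids receive different mutexes and proceed in parallel.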

Added

  • WorkflowRuntimeBuilder::step_output_replay: opt-in
    content-addressed replay of runner-returned step outcomes, keyed by
    (run_id, step_number, SHA-256(step payload)). When enabled, the
    runtime persists every StepOutcome the runner returns (including
    Fail and Cancel) before applying it; if the same step is delivered
    again after a crash before ack, the stored outcome is replayed without
    invoking the runner again. Step errors are not recorded, so retries
    still invoke the runner. A replayed ContinueAfter reduces its delay
    by the time already elapsed since the outcome was stored, preserving
    the original schedule.
  • Memo::content_get and Memo::content_put derive per-step memo keys
    from a MessagePack serialization of caller-supplied input hashed with
    SHA-256.
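
The delay adjustment for a replayed ContinueAfter is simple arithmetic. A minimal sketch, assuming the remaining delay is clamped at zero once the original delay has fully elapsed (the notes do not state the clamping behaviour); `remaining_delay` is a hypothetical helper name.

```rust
use std::time::Duration;

/// Delay to apply when replaying a stored `ContinueAfter` outcome.
/// `elapsed_since_stored` is the time between storing the outcome and replaying it.
/// Clamping at zero is an assumption of this sketch.
fn remaining_delay(original: Duration, elapsed_since_stored: Duration) -> Duration {
    original.saturating_sub(elapsed_since_stored)
}
```

For example, a 60 s delay replayed 45 s after the outcome was stored leaves 15 s, so the next step still becomes due at the originally intended time.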

taquba-webhooks-v0.3.0

15 Jun 12:43
73ca489

Changed

  • Raised the minimum taquba requirement to 0.8.

taquba-v0.8.0

15 Jun 12:39
73ca489

Added

  • Queue::claim_batch claims up to max_jobs pending jobs in one
    transaction, sharing one claim-lock hold and one commit across the
    batch. Jobs are returned in claim order and share one lease.
    Queue::claim is now a batch of one.
  • Queue::wait_for_jobs_on blocks until a job becomes claimable on
    one queue. Unlike Queue::wait_for_jobs, the wakeup is queue-scoped
    and delivered to one waiter per inserted job.
  • Queue::ack_with acknowledges a job and applies a set of effects in
    the same transaction: follow-up enqueues (AckEffects::enqueues,
    honouring run_at, dedup_key, priority, and id_override per
    request) and caller KV writes and deletes. Either the ack and every
    effect land together or nothing does; when the claim is gone the
    call fails with ClaimLost and applies nothing, so a chained job
    exists only if the settlement that created it won. Queue::ack is
    now ack_with with empty effects.
  • Error::ClaimLost: returned by ack, ack_with, nack,
    dead_letter, and renew_lease when the record's claim is no
    longer present (the lease expired and the reaper requeued the job,
    or the record is a stale copy from before a lease renewal rotated
    the claimed key). These cases previously returned the catch-all
    Error::InvalidState, which remains for genuine misuse (a record
    missing lease_expires_at, requeue_dead_job on a non-dead
    record).
  • Worker::process_with_effects: workers can return AckEffects
    from processing, which run_worker and run_worker_concurrent
    apply atomically with the job's acknowledgement via
    Queue::ack_with. process and process_with_effects default to
    each other; implement exactly one. Existing Worker
    implementations are unaffected.
  • Queue::close persists each queue's claim-scan state (scan bound
    and emptiness marker) under a new cursor: key prefix; the next
    open restores the in-memory state from it and deletes the record. The
    first claim after a clean restart resumes at the recorded bound
    instead of re-scanning the tombstone band left by previously claimed
    jobs, whose cost grows with the band and the store's latency. After
    a crash the record is absent and the first claim falls back to the
    front prefix scan as before.
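
The "default to each other; implement exactly one" arrangement for Worker::process and Worker::process_with_effects is a general Rust trait pattern. The sketch below is synchronous and uses hypothetical stand-in types (`Effects`, string jobs, string errors); the crate's real signatures are not shown in these notes and will differ.

```rust
/// Hypothetical stand-in for the effects a worker may return.
#[derive(Debug, Default, PartialEq)]
struct Effects {
    enqueues: Vec<String>,
}

trait Worker {
    /// Plain processing. By default delegates and drops the effects (an assumption).
    fn process(&self, job: &str) -> Result<(), String> {
        self.process_with_effects(job).map(|_| ())
    }

    /// Processing that returns effects. By default delegates with empty effects.
    fn process_with_effects(&self, job: &str) -> Result<Effects, String> {
        self.process(job).map(|()| Effects::default())
    }
}

/// An existing-style worker: overrides only `process`.
struct Legacy;
impl Worker for Legacy {
    fn process(&self, _job: &str) -> Result<(), String> {
        Ok(())
    }
}

/// A chaining worker: overrides only `process_with_effects`.
struct Chaining;
impl Worker for Chaining {
    fn process_with_effects(&self, job: &str) -> Result<Effects, String> {
        Ok(Effects { enqueues: vec![format!("{job}-next")] })
    }
}
```

A loop that always calls `process_with_effects` handles both kinds of worker. Overriding neither method compiles but recurses forever, which is why exactly one must be implemented.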

Changed

  • run_worker no longer exits when settling a job fails. Settlement
    failures (including ClaimLost when a job outlives its lease and
    the reaper requeues it) are logged and the loop continues, matching
    run_worker_concurrent; the redelivered attempt settles the job.
    Claim-path errors still terminate both loops. Both loops log a lost
    claim distinctly from other settlement failures.
  • run_worker_concurrent claims jobs in batches sized to its free
    capacity via Queue::claim_batch, costing one claim transaction
    per batch instead of per job under a backlog. Jobs are still
    processed concurrently and acked individually.
  • Queue::claim_with_wait and the run_worker / run_worker_concurrent
    loops wait on a queue-scoped wakeup that wakes one waiter per
    inserted job, instead of the process-wide notification that woke
    every waiting worker on every insert. A pool of idle workers no
    longer contends on the claim path when a single job arrives, and a
    worker claiming a job passes one wakeup on so a backlog keeps waking
    further workers. Queue::claim_with_wait now also keeps waiting out
    its full max_wait after losing a claim race instead of returning
    None early.
  • Queue::claim commits without awaiting WAL durability. Claims
    serialise per queue through the claim lock, which excluded them from
    WAL group commit: the lock holder awaited its flush before the next
    claim could start, making the flush round trip the queue's claim
    throughput ceiling.
    Losing an unflushed claim in a crash leaves the job pending, so it
    is redelivered immediately on recovery instead of after its lease
    expires; at-least-once delivery is unaffected, and a settled job's
    claim is always durable because later durable commits flush
    preceding WAL entries.
  • The scheduler promotes due jobs without awaiting WAL durability,
    for the same reasons and with the same crash behaviour as the
    reaper change below: a lost promotion leaves the scheduled key in
    place with its run_at in the past, and the next tick re-promotes
    it. A backlog of due jobs (a retry-backoff wave, or scheduled jobs
    accumulated during downtime) no longer promotes at one job per
    flush interval.
  • The reaper requeues and dead-letters expired claims without awaiting
    WAL durability. Each expired claim is processed in its own
    transaction, and awaiting the flush serialised the sweep at one job
    per flush interval (about ten per second at the default 100 ms
    flush). A commit lost in a crash leaves the expired claim in place
    for the next sweep, which re-processes it without consuming an
    attempt, and later durable commits flush preceding WAL entries, so
    a settled job's requeue is durable by ordering.
  • The done and dead-letter retention sweeps delete expired records
    without awaiting WAL durability, for the same reasons as the reaper
    and scheduler changes above: a delete lost in a crash leaves the
    record in place for the next sweep, whose existence re-check keeps
    the rerun idempotent. With this, no background sweep awaits the
    flush; only caller-driven operations do. A retention backlog no
    longer delays the lease reaping that shares its tick.
  • Queue::claim tracks per-queue emptiness and a scan bound in
    process memory. Polling an empty queue answers without a storage
    scan or the claim lock, and the pending tombstone band is never
    re-walked from the front while the process stays up; a full prefix
    scan now happens only on cold start or process restart.
  • Queue stats counter merges are excluded from transaction conflict
    detection. The merges are commutative, so concurrent job-state
    transitions on the same queue no longer abort and retry each other
    over the shared stats keys.
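
The "one job per flush interval" ceiling that several of these items remove is back-of-envelope arithmetic. The helpers below are hypothetical and model only the serialised case, where each commit awaits its own WAL flush before the next one starts.

```rust
use std::time::Duration;

/// Commits per second when every commit awaits its own WAL flush.
fn serial_ceiling_per_sec(flush_interval: Duration) -> f64 {
    1.0 / flush_interval.as_secs_f64()
}

/// Time to work through `backlog` items at one item per flush interval.
fn serial_drain_time(backlog: u32, flush_interval: Duration) -> Duration {
    flush_interval * backlog
}
```

At a 100 ms flush interval the ceiling is about ten operations per second, so a backlog of 1 000 expired claims or due jobs would take roughly 100 s to drain when serialised this way.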

Fixed

  • A pending: insert landing behind the claim cursor while a claim
    was in flight could have its cursor invalidation overwritten by
    that claim's cursor update, hiding the job from cursor scans until
    the queue next drained. The scan bound now moves back to include
    such inserts, and a claim drops its bound advance when the bound
    moved while it ran.
  • A pending: key could be hidden from claims indefinitely when its
    insert committed while a claim was in flight and the key sorted at
    or below the keys that claim advanced the scan bound past. Job ids
    are generated before the enqueue transaction commits, so commit
    order can invert key order under concurrent producers, and a
    requeue (reaper or nack) restores a job at its original key. The
    next claim then recorded emptiness at a valid epoch and the queue
    answered None while live jobs were pending. Bound advances now
    clamp to the smallest key recorded since the bound was observed,
    including when no bound exists yet (the first claim after a
    process restart) and when the key equals the claimed one (the
    claimed job requeued after its lease expired within the claim).
  • Duplicate EnqueueOptions::id_override values are now rejected
    transactionally with Error::DuplicateJobId instead of overwriting
    jobindex:{id} and leaving older queue-state records behind.
  • Queue::ack, Queue::nack, Queue::dead_letter, and
    Queue::renew_lease now check that the expected claimed: record
    still exists before settling a job. A worker finishing after its
    lease was reaped now gets Error::ClaimLost instead of being
    able to ack, retry, dead-letter, renew, or corrupt stats from a
    stale JobRecord.
  • Queue::nack and Queue::renew_lease now retry on transaction
    conflict like Queue::ack and Queue::dead_letter already did.
    A reaper committing the expired-lease delete concurrently with a
    late settlement is now retried (and resolves to Error::ClaimLost
    on the next attempt) instead of surfacing a raw SlateDB transaction
    error to the caller.
  • Queue::requeue_dead_job now checks that the dead-letter record
    still exists before reviving it. Requeueing a stale record after
    dead-letter retention swept it now returns Error::JobNotFound
    instead of recreating the job and corrupting queue stats.
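
The bound-clamping rule in the second item can be modelled with integer keys. This is a simplified sketch, not the crate's logic: real pending: keys are strings, and treating the bound as an inclusive scan start is a modelling choice made here. `next_scan_start` is a hypothetical name.

```rust
/// Returns the inclusive key the next claim scan should start from.
///
/// `claimed` is the key that was just claimed. `smallest_seen` is the smallest
/// key written since the bound was last observed, if any. The advance past
/// `claimed` is clamped so that no such key ends up behind the scan start.
fn next_scan_start(claimed: u64, smallest_seen: Option<u64>) -> u64 {
    let advanced = claimed.saturating_add(1);
    match smallest_seen {
        Some(key) => advanced.min(key),
        None => advanced,
    }
}
```

With key 5 claimed, a concurrent insert at key 3 pulls the start back to 3, and a requeue of key 5 itself keeps the start at 5, so in both cases the job stays visible to the next claim. An insert at key 9 does not hold the bound back.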

taquba-jobs-v0.4.0

15 Jun 12:45
73ca489

Changed

  • Terminal marker filenames lead with an inverted timestamp, and the
    result-retention sweep lists only expired markers (via the
    object-store list_with_offset contract) instead of every retained
    marker on every tick, so a sweep's listing cost is proportional to
    the expired set. Markers written by earlier versions are not
    recognised by the sweeper: when upgrading a store that ran with
    result_retention enabled, clear the <result_prefix>/terminals/
    prefix out-of-band.

taquba-cron-v0.4.0

15 Jun 12:42
73ca489

Changed

  • Raised the minimum taquba requirement to 0.8.

taquba-bulk-v0.2.0

15 Jun 12:47
73ca489

Changed

  • Batch submission runs with bounded concurrency instead of one
    awaited submit at a time. Each submission blocks on a durable
    enqueue commit, and concurrent commits share WAL flushes, so serial
    submission was capped at one item per flush interval (one item per
    100 ms at the SlateDB default). Enqueue order across in-flight
    submissions is not defined; batch items are independent.
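
Bounded-concurrency submission can be sketched with scoped threads. This is a thread-based stand-in under stated assumptions: the crate itself is presumably asynchronous, and `run_bounded` is a hypothetical helper. At most `max_concurrent` calls to `submit` are in flight at once, and the order in which items are submitted is not defined.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Calls `submit` on every item using at most `max_concurrent` worker threads.
/// Results are returned in input order for convenience, although the calls
/// themselves may run in any order.
fn run_bounded<T, R, F>(items: &[T], max_concurrent: usize, submit: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    let next = AtomicUsize::new(0);
    let results: Mutex<Vec<(usize, R)>> = Mutex::new(Vec::new());
    thread::scope(|s| {
        for _ in 0..max_concurrent.max(1) {
            s.spawn(|| loop {
                // Each worker pulls the next unclaimed index until none remain.
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= items.len() {
                    break;
                }
                let r = submit(&items[i]);
                results.lock().unwrap().push((i, r));
            });
        }
    });
    let mut out = results.into_inner().unwrap();
    out.sort_by_key(|(i, _)| *i);
    out.into_iter().map(|(_, r)| r).collect()
}
```

Because each in-flight call can block on its own commit while the others proceed, several submissions can share one flush instead of each waiting for a separate one.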

Added

  • BulkCtx::memoized_by_content and
    BulkCtx::memoized_by_content_with_cached_cost for memoized steps
    whose keys should be derived from serialized input content rather
    than caller-supplied strings.
  • BulkCtx::memoized_with_cached_cost for memoized steps whose cost counters
    should be recorded both on fresh compute and on memo hits.

taquba-bulk-v0.1.0

30 May 12:23
bb3fd01

Initial release. Per-batch orchestrator that runs one pipeline over many
inputs in a single process on top of taquba-workflow.

Added

  • Pipeline: the per-item contract (typed Input / Output, an Error
    that converts into a StepError, and an async run). Each input item
    becomes one taquba-workflow run whose single step invokes run; the
    pipeline's own logical steps live inside run as BulkCtx::memoized
    calls.
  • BulkCtx<T>: per-item execution context. Carries the typed input,
    run_id, and submitter headers; exposes memoized (durable per-step
    result caching so an at-least-once retry replays cached results instead of
    repeating a paid call), record_cost, and cancel_token.
  • CostReport: generic named-metric accumulator (token counts, paid-API
    units, compute-seconds, dollars). Interior-mutable while a step runs and
    serializable for the per-item envelope and the batch rollup.
  • Bulk / BulkBuilder: the runner. Submits N runs, drives the worker pool,
    streams output as items complete, and aggregates progress and cost.
    Builder options: output, key_fn, headers, max_concurrent,
    poll_interval, queue_name, memo_prefix, fail_threshold. run
    executes to completion; run_with_shutdown drains in-flight items on a
    shutdown signal (e.g. spot preemption).
  • ProgressSnapshot: point-in-time counts, rate, estimated time remaining,
    and cost rollup, returned by Bulk::progress.
  • BulkReport: final counts, elapsed time, cost rollup, and
    failed_run_ids (re-submitting those ids resumes from cached memo state).
  • OutputSink with JsonlSink (one JSON record per line) and NullSink
    (discards records, for side-effecting pipelines); read_jsonl for
    line-delimited JSON input.
  • Error / Result: crate error type, including
    Error::FailureThresholdExceeded when the share of failed items crosses
    the configured threshold.
  • Re-exports StepError and StepErrorKind from taquba-workflow for the
    Pipeline::Error type.
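
The memoized behaviour described for BulkCtx (a retry replays the cached result instead of repeating a paid call) follows a get-or-compute shape. The sketch below uses a hypothetical in-memory `Memo` in place of the durable store, string values in place of typed ones, and assumes errors are not cached.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for a durable per-step memo.
#[derive(Default)]
struct Memo {
    entries: HashMap<String, String>,
}

impl Memo {
    /// Returns the cached value for `key`, or runs `compute`, stores its
    /// result, and returns it. Errors are not stored (an assumption), so a
    /// failed attempt is computed again on retry.
    fn memoized<E>(
        &mut self,
        key: &str,
        compute: impl FnOnce() -> Result<String, E>,
    ) -> Result<String, E> {
        if let Some(hit) = self.entries.get(key) {
            return Ok(hit.clone());
        }
        let value = compute()?;
        self.entries.insert(key.to_owned(), value.clone());
        Ok(value)
    }
}
```

Calling `memoized` twice with the same key runs the closure once; the second call, as on a redelivered attempt, returns the stored value without invoking the closure.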

taquba-jobs-v0.3.0

29 May 11:56
5db214a

Added

  • JobRunnerBuilder::result_retention(Duration): opt-in retention
    window for persisted result blobs. When set, the runner writes a
    terminal marker every time a job reaches a terminal state and an
    in-process sweeper deletes that job's result blob once the retention
    window has elapsed after termination. When unset (default), result
    blobs are retained
    indefinitely (the previous behaviour). Once a blob is swept,
    JobHandle::fetch_result for that job returns Ok(None) and an
    idempotent re-submission of the same payload falls through to
    re-running the job rather than short-circuiting; size the window
    so it covers the longest gap callers need between submission and
    idempotent re-submit.
  • JobRunnerBuilder::clock(Arc<dyn Clock>): override the time source
    the runner reads its timestamps from (terminal-marker timestamps
    and the retention sweep cutoff). Defaults to the queue's clock
    (Queue::clock), so passing a MockClock to
    Queue::open_with_options is enough for tests; this override is
    for the rarer case where the runner needs a different clock than
    the queue.

Changed

  • Idempotent submissions now short-circuit to a prior submission's
    persisted outcome. Previously, Job::idempotency_key only deduped
    against jobs that were still pending or scheduled: a re-submission
    after the original acked would create a new job (re-paying for the
    work). The dedup record now carries the assigned job_id (written
    atomically with the enqueue via the new
    EnqueueOptions::id_override) so a re-submission with a matching
    payload returns a handle pointing at the cached result blob (with
    newly_submitted = false).
  • The result-store prefix now reserves a sibling terminals/ segment
    for retention markers (<prefix>/terminals/<terminal_at_ms:020>_<job_id>).
    Existing result blobs (<prefix>/<job_id>) are unaffected: ULID
    job ids cannot collide with the literal terminals segment. Markers
    are only written when result_retention is configured.
  • Breaking (on-disk): JobSubmissionRecord (the durable per-idem-key
    dedup record) gained a job_id field. Records written by earlier
    versions of taquba-jobs will fail to deserialize and need to be
    cleared; the simplest path is to delete the queue's user KV prefix
    (usr:jobs/dedup/...) when upgrading.
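
The marker path layout in the second item maps directly onto a format string. `terminal_marker_path` is a hypothetical helper name; the path shape is the one given above for this release.

```rust
/// Builds `<prefix>/terminals/<terminal_at_ms:020>_<job_id>`.
/// Zero-padding to 20 digits makes lexicographic order match numeric order.
fn terminal_marker_path(prefix: &str, terminal_at_ms: u64, job_id: &str) -> String {
    format!("{prefix}/terminals/{terminal_at_ms:020}_{job_id}")
}
```

Without the padding, a marker at 10 ms would sort before one at 9 ms; with it, a listing returns markers oldest first.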

taquba-workflow-v0.5.0

28 May 12:57
c444f84

Added

  • Memo: per-step durable key-value store for memoizing within-step
    side effects, backed by object storage. Bound to a specific
    (run_id, step_number); get(key) / put(key, value) take only
    the user key. Strictly per-step; the durable channel between steps
    is StepOutcome::Continue's payload, not memo.
  • MemoStore: the backing store Memo views are derived from
    (Arc<dyn ObjectStore> + path prefix). Used internally by the
    runtime builder; users construct one directly mainly in tests.
  • Step::memo: every step receives a Memo scoped to its own
    (run_id, step_number). Runners use it to cache results of
    expensive within-step side effects (LLM calls, paid APIs) so
    at-least-once retries don't re-pay for work the prior attempt
    already did.
  • WorkflowRuntimeBuilder::memo_prefix: configures the object-store
    prefix Step::memo entries live under. Defaults to "workflow-memo";
    set a distinct prefix when multiple runtimes share one store.
  • Error::Store(taquba::object_store::Error): surfaced from memo
    read/write failures. Classified as transient by is_permanent.
  • WorkflowRuntimeBuilder::memo_retention(Duration): opts the runtime
    into writing a terminal marker via MemoStore::write_terminal_marker
    on every terminal state (Succeeded, Failed, Cancelled). Markers
    outlive the run record and provide the input a memo-retention sweep
    consumes to decide when a run's memo entries are eligible for
    deletion. Without this setter no marker is written and memo entries
    are retained indefinitely (appropriate for short-lived runs or
    external cleanup).
  • Memo-retention sweeper: when memo_retention is set,
    WorkflowRuntime::run spawns a background task that periodically
    scans terminal markers and, for each marker older than the
    configured window, deletes the run's memo entries and then the
    marker itself. The first sweep fires on startup so a fresh process
    catches markers left behind by an earlier one. The sweeper shuts
    down with the caller-supplied shutdown future.
  • WorkflowRuntime now reads every timestamp it writes
    (DurableRunRecord::submitted_at_ms, the ContinueAfter run_at,
    and the terminal-marker timestamp) through a taquba::Clock. By
    default the runtime shares the clock its Queue was opened with
    (via Queue::clock), so passing a MockClock to OpenOptions
    virtualises time for the queue and the workflow runtime together.
  • WorkflowRuntimeBuilder::clock(Arc<dyn Clock>) overrides the
    defaulted-from-queue clock when a test or specialised setup needs a
    separate time source.
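
One tick of the memo-retention sweeper can be modelled in memory. The `Store` type, its fields, and the strict "older than" comparison are assumptions of this sketch. The deletion order (memo entries, then the marker) is the one described above; the comment on why that order is useful is an inference, not a statement from the notes.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical in-memory stand-in for the memo store.
#[derive(Default)]
struct Store {
    /// Run id mapped to its terminal timestamp in milliseconds (the marker).
    markers: BTreeMap<String, u64>,
    /// Run id mapped to that run's memo entries.
    memos: HashMap<String, Vec<String>>,
}

impl Store {
    /// Sweeps every run whose marker is older than `retention_ms`
    /// and returns the swept run ids.
    fn sweep(&mut self, now_ms: u64, retention_ms: u64) -> Vec<String> {
        let expired: Vec<String> = self
            .markers
            .iter()
            .filter(|(_, at)| now_ms.saturating_sub(**at) > retention_ms)
            .map(|(id, _)| id.clone())
            .collect();
        for id in &expired {
            // Entries first, marker last: if the sweep stops in between,
            // the marker is still there for a later tick to find.
            self.memos.remove(id);
            self.markers.remove(id);
        }
        expired
    }
}
```

Passing `now_ms` in explicitly mirrors the clock items above: a test can advance time by changing one argument rather than sleeping.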

Changed

  • Breaking: WorkflowRuntime::builder now takes an additional
    required object_store: Arc<dyn ObjectStore> argument between the
    queue and the runner. The store backs Step::memo and need not be
    the same store the queue was opened with, though sharing one (just
    cloning the Arc) is the common case. Existing call sites must add
    the store argument:

    // Before:
    let runtime = WorkflowRuntime::builder(queue, runner, hook).build();
    // After:
    let runtime = WorkflowRuntime::builder(queue, store, runner, hook).build();

taquba-v0.7.0

28 May 12:56
908744f

Added

  • EnqueueOptions::id_override lets callers supply the job id instead
    of receiving a generated ULID. Useful when the id must be known before
    the enqueue returns. Ids are validated at the API boundary (1-128 bytes
    of [A-Za-z0-9_-]) and bad inputs return the new
    Error::InvalidId { id, reason } variant. Callers should prefer
    ULID-shaped ids when FIFO-within-priority claim order matters:
    pending/scheduled keys end with the id, so claim order follows
    id sort.
  • Queue::clock() accessor returns the Arc<dyn Clock> the queue
    was opened with (or the default SystemClock). Lets downstream
    crates share the queue's time source for their own timestamp work
    so virtualising time with MockClock advances the whole stack
    in lockstep.
  • OpenOptions::flush_interval: Option<Duration> exposes SlateDB's
    WAL flush interval. None keeps SlateDB's own default (100 ms).
    Every taquba state transition (enqueue, claim, ack, nack,
    dead_letter) blocks on txn.commit() which waits for the next
    flush tick, so this value is the lower bound on per-operation
    latency.
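
The id rule for id_override (1-128 bytes of [A-Za-z0-9_-]) is easy to check ahead of time. `validate_id` is a hypothetical helper that returns a plain string error; the crate reports failures through Error::InvalidId instead.

```rust
/// Checks the documented shape: 1 to 128 bytes drawn from `[A-Za-z0-9_-]`.
fn validate_id(id: &str) -> Result<(), String> {
    if id.is_empty() || id.len() > 128 {
        return Err(format!("length {} is outside 1..=128 bytes", id.len()));
    }
    let bad = id
        .bytes()
        .find(|b| !(b.is_ascii_alphanumeric() || *b == b'_' || *b == b'-'));
    match bad {
        Some(b) => Err(format!("byte 0x{b:02x} is not in [A-Za-z0-9_-]")),
        None => Ok(()),
    }
}
```

An id such as `01HZX3-job_7` passes, while an empty string, a 129-byte string, or an id containing `:` or a space is rejected.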

Changed

  • Breaking on-disk layout: the done: keyspace is reordered from
    done:{queue}:{id} to done:{completed_at:020}:{queue}:{id},
    mirroring the existing time-first layout of claimed: and
    scheduled:. The retention sweep can now early-exit on the first
    unexpired record instead of walking the full prefix. Public API is
    unchanged; in-flight runs from prior versions must be drained
    before upgrading because the old keys will not be observed by the
    reaper.
  • Queue::claim (and therefore claim_next / claim_with_wait)
    serialises same-queue claim attempts through an in-process
    tokio::sync::Mutex. Same-queue attempts no longer rely on
    SlateDB's transaction-conflict retry to resolve which worker
    takes the head of pending:. The lock is per-queue, so different
    queues' claims still run in parallel. Per-claim wall-clock latency
    under high single-queue concurrency drops from seconds to roughly
    one commit interval. Public API unchanged.
  • Queue::claim now maintains an in-memory per-queue cursor that
    records the most recently claimed pending: key, and starts the
    next claim's scan from immediately after it. This skips the
    tombstone band left by previously claimed (and deleted) pending:
    entries that the SlateDB iterator would otherwise walk. The
    cursor is invalidated whenever a pending: key is written at or
    before it (nack-requeue, dead-job requeue, reaper-requeue,
    scheduler promotion, and any enqueue at a lower-numbered
    priority); when this happens the next claim falls back to a full
    prefix scan. The cursor is not persisted: on process restart the
    first claim falls back to the prefix scan and re-warms naturally.
    Public API unchanged.
  • Bumped minimum slatedb version from 0.13 to 0.13.1.
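
The cursor rule in the third item (start after the last claimed key, drop the cursor when a key is written at or before it) can be sketched as a tiny state machine. `ClaimCursor` and its method names are hypothetical, and the key strings used with it are illustrative only.

```rust
/// Hypothetical in-memory claim cursor over `pending:` keys.
#[derive(Default)]
struct ClaimCursor {
    last_claimed: Option<String>,
}

impl ClaimCursor {
    /// Records a successful claim; the next scan starts just after `key`.
    fn on_claim(&mut self, key: &str) {
        self.last_claimed = Some(key.to_owned());
    }

    /// Drops the cursor when a `pending:` key is written at or before it.
    fn on_pending_write(&mut self, key: &str) {
        if matches!(self.last_claimed.as_deref(), Some(cursor) if key <= cursor) {
            self.last_claimed = None;
        }
    }

    /// Key to scan after, or `None` to fall back to a full prefix scan.
    fn scan_after(&self) -> Option<&str> {
        self.last_claimed.as_deref()
    }
}
```

A write that sorts after the cursor leaves it alone, since the next scan reaches that key anyway. A write at or before it would otherwise be skipped, so the cursor is dropped and the next claim rescans the prefix.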

Fixed

  • enqueue_with's non-dedup path (write_new) now retries on
    transaction conflict, matching the dedup path (write_unique),
    enqueue_with_kv, ack, dead_letter, and every other write path
    in the crate. Previously a conflict during a non-dedup enqueue would
    surface as Error::Storage to the caller; under normal contention
    this would have manifested as spurious enqueue failures that a retry
    could resolve.
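
Retrying on transaction conflict has a simple general shape. In the sketch below, `TxnError`, `retry_on_conflict`, and the attempt cap are all hypothetical; the notes do not say how many times the crate retries or how it classifies SlateDB errors.

```rust
/// Hypothetical error type separating retryable conflicts from other failures.
#[derive(Debug, PartialEq)]
enum TxnError {
    Conflict,
    Other(String),
}

/// Runs `attempt` until it succeeds, fails with a non-conflict error,
/// or has been tried `max_attempts` times.
fn retry_on_conflict<T>(
    max_attempts: u32,
    mut attempt: impl FnMut() -> Result<T, TxnError>,
) -> Result<T, TxnError> {
    let mut tries = 0;
    loop {
        tries += 1;
        match attempt() {
            // Only conflicts are retried, and only while attempts remain.
            Err(TxnError::Conflict) if tries < max_attempts => continue,
            other => return other,
        }
    }
}
```

Each retry must rebuild the transaction from fresh reads, which is why the closure is invoked again in full rather than only its commit being repeated.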