Release v0.7.9 · panchenkoai/rivet

Closes the §10 optimization backlog (OPT-1…OPT-7), proves Parquet type
fidelity end-to-end with four independent readers (DuckDB, ClickHouse,
pyarrow, BigQuery) with native logical types for UUID/JSON, and
consolidates the per-runner commit + post-finalize state-write paths
behind two shared seams (commit::record_part for the per-part write
ordering, RunStore for the cursor + progression tail) so the
ADR-0001 / ADR-0008 ordering invariants live in interfaces rather
than in per-runner conventions. Adds CI debug-build invariant gates
that catch the next M1-shape bug at finalize time. Six new ADRs
(0015–0020) document the architectural decisions made along the way
and the deferred work (nullability propagation, PG UUID-PK
auto-keyset).

Architecture — seam consolidation

refactor(pipeline) — pipeline::commit::record_part is the
single home for the per-part commit ordering: I1 finalize → dest.write
→ ADR-0012 M1 manifest add → counters → journal event → I7 file-log
warn-on-fail → fault hooks. Six runners (single, keyset,
chunked::run_chunked_sequential, chunked::run_chunked_parallel,
chunked::sequential_checkpoint, chunked::parallel_checkpoint)
now share one body instead of each hand-copying it. See ADR-0018.
fix(pipeline) — chunked::parallel_checkpoint previously
populated state.file_log per chunk but never appended to
summary.manifest_parts, so the cloud manifest (ADR-0012 M1) shipped
empty for every parallel>1 + chunk_checkpoint:true run. Migration
onto commit::record_part (with state=None in the drain to avoid
double-writing the per-chunk durable file_log) closes the gap.
refactor(pipeline) — pipeline::run_store::RunStore is the
builder facade for the post-finalize cursor + progression writes
(with_cursor is fatal-on-error per ADR-0001 I3; with_progression
is warn-on-fail per ADR-0008 PG2 / PG7). Four runners use it; the
ordering contract is now an interface property, not a convention.
See ADR-0018. Schema-drift stays in single.rs because it's a
policy state machine, not a persistence ordering.
refactor(tuning) — tuning::Governor extracts the OPT-2
adaptive-concurrency loop out of an inline thread::scope closure
in chunked::exec::run_chunked_parallel. The loop is now
unit-testable on a fake PressureSource in microseconds instead of
a 2-4 s live test. See ADR-0019.
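
To illustrate why an injectable pressure source makes the loop cheap to test, here is a minimal sketch. It is not Rivet's code: the `PressureSource` method name, the `Governor` fields, the threshold, and the halve/add-one policy are all assumptions made for the example; only the idea of stepping a governor against a scripted fake and clamping to [min_parallel, parallel] comes from the notes above.

```rust
/// Hypothetical trait: anything that can report a source-pressure reading.
trait PressureSource {
    fn sample(&mut self) -> u64;
}

/// Test double that replays a scripted, non-empty sequence of readings.
struct FakePressure {
    readings: Vec<u64>,
    next: usize,
}

impl PressureSource for FakePressure {
    fn sample(&mut self) -> u64 {
        // Repeat the last reading once the script is exhausted.
        let i = self.next.min(self.readings.len() - 1);
        self.next += 1;
        self.readings[i]
    }
}

/// Assumed policy: halve under pressure, otherwise add one worker,
/// always clamped to [min_parallel, max_parallel].
struct Governor {
    min_parallel: usize,
    max_parallel: usize,
    threshold: u64,
    current: usize,
}

impl Governor {
    fn new(min_parallel: usize, max_parallel: usize, threshold: u64) -> Self {
        Governor { min_parallel, max_parallel, threshold, current: max_parallel }
    }

    /// One decision step: sample, pick a target, clamp, remember it.
    fn step(&mut self, source: &mut dyn PressureSource) -> usize {
        let pressure = source.sample();
        let target = if pressure > self.threshold {
            self.current / 2
        } else {
            self.current + 1
        };
        self.current = target.clamp(self.min_parallel, self.max_parallel);
        self.current
    }
}

/// Drive a governor (min 1, max 8, threshold 100) over a scripted source.
fn run_script(readings: Vec<u64>, steps: usize) -> Vec<usize> {
    let mut source = FakePressure { readings, next: 0 };
    let mut governor = Governor::new(1, 8, 100);
    (0..steps).map(|_| governor.step(&mut source)).collect()
}

fn main() {
    // Four high readings, then the pressure eases.
    let trace = run_script(vec![500, 500, 500, 500, 10, 10], 6);
    println!("{:?}", trace); // [4, 2, 1, 1, 2, 3]
}
```

Because the fake never touches a database or a clock, the whole back-off and recovery trace runs without any waiting.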

Extraction & memory hardening (optimization backlog)

feat(pipeline) — adaptive concurrency governor (OPT-2): in chunked
mode with parallel > 1 and tuning.adaptive: true, a governor samples
source write-pressure on a dedicated monitoring connection and resizes the
live worker/connection count within [min_parallel, parallel] — backing off
under load, recovering when it eases. Decisions land in the run journal
(ParallelismAdjusted). Read-only credentials suffice.
fix(pipeline) — governor deadlock under chunk failure (OPT-2):
workers bumped the completed counter only on success, but the governor's
exit condition was keyed on it — so any failing chunk left the governor
spinning forever and thread::scope could never join. Workers now bump a
separate finished counter on every exit path (success or failure).
test(pipeline) — governor concurrent-write back-off (OPT-2):
deterministic live coverage of the closed-loop reaction to source pressure
— a background UPDATE/CHECKPOINT writer drives checkpoints_req past
the 80 ms sampler, and the governor's backed off log lines fire as
expected.
feat(pipeline) — MySQL keyset (seek) pagination (OPT-4): tables with
a UUID / string / composite (non-integer) PK now have a safe chunked shape via
chunk_by_key: (auto-resolved on MySQL when there's no single-int PK but a
usable unique key). Pages by an index-backed unique key
(WHERE key > last ORDER BY key LIMIT n) — bounded RSS and bounded
longest-query time, EXPLAIN-verified as an index range scan (never a
full-scan + filesort). A non-indexed chunk_by_key is refused.
feat(pipeline) — PG UUID-PK keyset via explicit chunk_by_key:
extract_last_cursor_value learned the FixedSizeBinary(16) arm so PG
uuid columns (ADR-0014 → arrow.uuid extension → native Parquet
LogicalType::Uuid) now serve as keyset cursors. Auto-resolution on PG
remains scoped to integer PKs (see ADR-0020 for the asymmetry rationale
vs MySQL); operators with UUID-PK tables can opt into keyset paging by
declaring chunk_by_key: <uuid_col>.
feat(sink) — per-value size ceiling (OPT-1): tuning.max_value_mb
(default 256 MB; 0 disables) aborts with RIVET_VALUE_TOO_LARGE when a
single text/JSON/blob cell would OOM the process — the average-based batch
cap can't bound a lone giant value.
test(types) — proptest MySQL value round-trip (OPT-3): 1000
randomized values per supported type prove Rivet's MySQL value-decoder
contract under the property-testing fuzzer.
test(pipeline) — subprocess crash coverage (OPT-6): every chunked
fault point (after_chunk_file, after_chunk_complete) now has crash-
recovery coverage on the parallel-export-processes engine. Brings the two
engines to per-fault symmetry.
feat(format) — stable Parquet created_by (OPT-5): pinned to a
release-stable string so per-part content_fingerprint survives across
builds and the manifest dedup token is reliable.
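
The keyset (seek) shape from OPT-4 can be sketched without a database. In the example below a `BTreeMap` stands in for the index-backed unique key, and `fetch_page` plays the role of `WHERE key > last ORDER BY key LIMIT n`. The function names and the in-memory table are hypothetical; this illustrates the paging pattern only and is not Rivet's implementation.

```rust
use std::collections::BTreeMap;
use std::ops::Bound::{Excluded, Unbounded};

/// Stand-in for `WHERE key > last ORDER BY key LIMIT n` on a unique index:
/// an ordered range scan that starts strictly after the last seen key.
fn fetch_page(
    table: &BTreeMap<String, u64>,
    last: Option<&str>,
    limit: usize,
) -> Vec<(String, u64)> {
    let lower = match last {
        Some(k) => Excluded(k),
        None => Unbounded,
    };
    table
        .range::<str, _>((lower, Unbounded))
        .take(limit)
        .map(|(k, v)| (k.clone(), *v))
        .collect()
}

/// Page through the whole table; returns the keys seen on each page.
fn export_all(table: &BTreeMap<String, u64>, limit: usize) -> Vec<Vec<String>> {
    let mut pages = Vec::new();
    let mut last: Option<String> = None;
    loop {
        let page = fetch_page(table, last.as_deref(), limit);
        if page.is_empty() {
            break; // nothing after the cursor: the export is done
        }
        // The cursor is the last key of the page, never an offset.
        last = page.last().map(|(k, _)| k.clone());
        pages.push(page.into_iter().map(|(k, _)| k).collect());
    }
    pages
}

/// Five rows keyed by non-integer (string) keys.
fn sample_table() -> BTreeMap<String, u64> {
    ["a", "b", "c", "d", "e"]
        .iter()
        .enumerate()
        .map(|(i, k)| (k.to_string(), i as u64))
        .collect()
}

fn main() {
    for page in export_all(&sample_table(), 2) {
        println!("{:?}", page);
    }
}
```

Each page holds at most `limit` rows and each lookup starts from the cursor, which is what keeps both memory and the longest single query bounded.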

Supply chain

docs(security) — documented release-checksum verification. Every release
already publishes SHA256SUMS.txt; README + SECURITY.md now show
sha256sum -c / shasum -a 256 -c (the docs previously said "rebuild from
source"). Signing/SBOM remain on the roadmap.

Types — native Parquet logical types + round-trip proof

feat(types) — UUID columns now emit native Parquet
LogicalType::Uuid as FixedSizeBinary(16) carrying the Arrow canonical
arrow.uuid extension type; JSON/JSONB carry arrow.json. Downstream
engines (DuckDB → native UUID, ClickHouse → Nullable(UUID),
BigQuery) load these without a cast. Enabled via the parquet
arrow_canonical_extension_types feature. See src/types/mapping.rs.
test(types) — four-reader validator matrix: every PG/MySQL type
round-trips through Parquet and is read back by DuckDB, ClickHouse,
pyarrow, and (live) BigQuery to pin field metadata + row-group stats
(tests/type_roundtrip/).
fix(types/mysql) — UNSIGNED BIGINT (UINT64) overflows INT64;
now mapped to Decimal128 so the full range survives.
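
The arithmetic behind the UNSIGNED BIGINT fix is easy to check. Arrow's Decimal128 is backed by a 128-bit signed integer with up to 38 decimal digits of precision, so every 64-bit unsigned value fits. The helper names below are hypothetical, and a scale of 0 is assumed; the precision and scale Rivet actually declares are not stated in these notes.

```rust
/// Widen an unsigned 64-bit value into the i128 that would back a
/// Decimal128 value (assuming scale 0). This conversion cannot fail.
fn widen_unsigned_bigint(v: u64) -> i128 {
    i128::from(v)
}

/// True when the value also fits a signed 64-bit integer.
fn fits_in_int64(v: u64) -> bool {
    i64::try_from(v).is_ok()
}

fn main() {
    let max = u64::MAX;
    println!("fits in INT64: {}", fits_in_int64(max));
    println!("widened value: {}", widen_unsigned_bigint(max));
    println!("decimal digits: {}", max.to_string().len());
}
```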

Bug fixes — validation surface

fix(pipeline/chunked) — the sequential checkpoint path ran
--validate on every chunk file but never recorded the result, so
mode: chunked + chunk_checkpoint: true + default parallel: 1 runs
stored validated = NULL in export_metrics and dropped the
validated: pass line from the run summary. It now sets the flag like
the other three export paths (regression test in
tests/live_chunked_recovery.rs).
fix(preflight/doctor) — rivet doctor drops a .rivet_doctor_probe
writability test object at the destination and never removes it. A
subsequent rivet run --validate flagged it as an untracked_object
and downgraded the run to validated: FAIL. The probe is now a
recognised Rivet sidecar (manifest::DOCTOR_PROBE_FILENAME) and skipped
by the manifest-aware --validate pass (regression test in
src/pipeline/validate_manifest.rs).

Preflight + UX

fix(preflight) — chunked / incremental exports on an indexed
cursor / chunk column no longer report a false DEGRADED verdict. A
catalog btree probe replaces the EXPLAIN-on-base-query heuristic, so
an indexed PK reads as ACCEPTABLE with an (indexed) scan-type
suffix.
polish(ux) — rivet init explains its mode choice inline
(# auto: ~N rows ≥ 100K threshold and chunk column 'id' is available)
and scales chunk_size to the row estimate; skipped incremental runs
print status: skipped (no new rows since cursor 'X'); the plaintext-
password and TLS warnings fire from doctor / check; the retry-safe
WARN is demoted to DEBUG for local destinations.

Docs + assets

docs — validated every command in the user-facing guides against
the binary; fixed drift (the file_log state table, real rivet doctor
output, the pilot walkthrough's missing decimal(10,2) override).
docs(gifs) — regenerated all instructional GIFs against current
behavior (card UI, (indexed) scan type, validated: pass).

Invariant audit — CI gates and paper trail

test(invariants) — RunSummary::check_post_run_invariants runs
as a cfg!(debug_assertions) gate at the top of
pipeline::finalize::finalize_manifest. Catches the next M1-shape
bug (runner bumps files_committed / bytes_written without going
through commit::record_part) the moment a debug-build test
finishes a run. Closes gaps #2 + #3 ("completed table must have
manifest entries"; "summary totals derivable from manifest") from
the release-checklist invariant audit.
test(invariants) — companion gates close gaps #1
(success && total_rows > 0 ⇒ files_committed > 0 — no rows
extracted-then-dropped on the floor) and #4 (live test
successful_run_writes_summary_artifacts_under_dot_rivet asserts
.rivet/runs/<run_id>/summary.{json,md} exist on disk after every
successful run, pinning ADR-0001 I8 at the on-disk layer).
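
A debug-build gate of this kind could look roughly like the sketch below. The field names (`files_committed`, `bytes_written`, `total_rows`, `manifest_parts`) are taken from these notes, but the struct layout, the `ManifestPart` type, the error messages, and the `finalize` function are assumptions made for illustration.

```rust
#[derive(Debug)]
struct ManifestPart {
    bytes: u64,
}

#[derive(Debug, Default)]
struct RunSummary {
    success: bool,
    total_rows: u64,
    files_committed: u64,
    bytes_written: u64,
    manifest_parts: Vec<ManifestPart>,
}

impl RunSummary {
    /// Returns the first violated invariant, if any.
    fn check_post_run_invariants(&self) -> Result<(), String> {
        // Gap #1: rows were extracted, so at least one file must exist.
        if self.success && self.total_rows > 0 && self.files_committed == 0 {
            return Err("rows extracted but no files committed".to_string());
        }
        // Gap #2: every committed file must have a manifest entry.
        if self.files_committed != self.manifest_parts.len() as u64 {
            return Err("files_committed does not match manifest entries".to_string());
        }
        // Gap #3: the byte total must be derivable from the manifest.
        let manifest_bytes: u64 = self.manifest_parts.iter().map(|p| p.bytes).sum();
        if self.bytes_written != manifest_bytes {
            return Err("bytes_written not derivable from manifest".to_string());
        }
        Ok(())
    }
}

/// Hypothetical finalize step: the gate only fires in debug builds.
fn finalize(summary: &RunSummary) {
    if cfg!(debug_assertions) {
        if let Err(msg) = summary.check_post_run_invariants() {
            panic!("post-run invariant violated: {}", msg);
        }
    }
    // Manifest finalization would follow here.
}

fn main() {
    let good = RunSummary {
        success: true,
        total_rows: 10,
        files_committed: 2,
        bytes_written: 300,
        manifest_parts: vec![ManifestPart { bytes: 100 }, ManifestPart { bytes: 200 }],
    };
    finalize(&good);
    println!("invariants hold: {:?}", good.check_post_run_invariants());
}
```

A runner that bumps the counters without recording manifest parts fails the second check, which is the M1-shaped bug described above.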

Cloud destinations — consolidate retry / runtime / read surface

refactor(destination) — CloudBackend trait + generic
CloudDestination<B> consolidate retry policy, blocking-operator
wrap, prefix join, and the ADR-0013 read surface
(write / list_prefix / read / head / move) across S3,
GCS, and Azure. Per-backend modules now only supply build_operator + a
label and a scheme. Net: -424 LoC duplication across the three cloud
backends. The local filesystem destination stays separate (no OpenDAL
runtime, partial-write semantics genuinely differ).
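
The consolidation pattern can be sketched as a small trait plus a generic wrapper that owns the shared behavior. Everything here is a simplified assumption: the trait methods, the `put` stand-in for the real operator, the fixed-attempt retry, and the `FlakyBackend` test double are hypothetical and only mirror the division of labor described above (backends supply a label and a scheme; the generic type supplies retry and prefix join).

```rust
/// Hypothetical per-backend surface: identity plus one raw write attempt.
trait CloudBackend {
    fn label(&self) -> &'static str;
    fn scheme(&self) -> &'static str;
    fn put(&mut self, path: &str, bytes: &[u8]) -> Result<(), String>;
}

/// Shared logic lives here once instead of once per backend.
struct CloudDestination<B: CloudBackend> {
    backend: B,
    prefix: String,
    max_attempts: u32,
}

impl<B: CloudBackend> CloudDestination<B> {
    /// Join prefix and object name with exactly one slash.
    fn join(&self, name: &str) -> String {
        format!(
            "{}/{}",
            self.prefix.trim_end_matches('/'),
            name.trim_start_matches('/')
        )
    }

    /// Write with a simple bounded retry; returns the object URI.
    fn write(&mut self, name: &str, bytes: &[u8]) -> Result<String, String> {
        let path = self.join(name);
        let mut last_err = String::from("no attempts made");
        for _ in 0..self.max_attempts {
            match self.backend.put(&path, bytes) {
                Ok(()) => return Ok(format!("{}://{}", self.backend.scheme(), path)),
                Err(e) => last_err = e,
            }
        }
        Err(format!("{}: {}", self.backend.label(), last_err))
    }
}

/// Test double that fails a fixed number of times before succeeding.
struct FlakyBackend {
    failures_left: u32,
    stored: Vec<String>,
}

impl CloudBackend for FlakyBackend {
    fn label(&self) -> &'static str {
        "fake-s3"
    }
    fn scheme(&self) -> &'static str {
        "s3"
    }
    fn put(&mut self, path: &str, _bytes: &[u8]) -> Result<(), String> {
        if self.failures_left > 0 {
            self.failures_left -= 1;
            return Err("transient".to_string());
        }
        self.stored.push(path.to_string());
        Ok(())
    }
}

fn main() {
    let mut dest = CloudDestination {
        backend: FlakyBackend { failures_left: 2, stored: Vec::new() },
        prefix: "bucket/exports/".to_string(),
        max_attempts: 3,
    };
    println!("{:?}", dest.write("/part-0.parquet", b"data"));
    println!("stored: {:?}", dest.backend.stored);
}
```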

CI / infra

ci — jlumbroso/free-disk-space@main runs before the heavy
build + test-profile rebuild in the e2e job (ci.yml) and the
nightly-live job, freeing ~30 GB by pruning .NET / Android SDK /
Haskell GHC / CodeQL / tool-cache (Rivet never touches them).
Recent nightly-live failures (Process completed with exit code 101 with
no test annotation, because cargo's stderr was garbled under ENOSPC)
prompted the change; df -h / snapshots taken before and after surface
any future regression directly in the run log.

Architecture decisions

docs(adr) — ADR-0015: Source introspection is a data-shape
seam, not a trait. Documents the dismissal of the recurrent "unify
introspect_pg_table_for_chunking + introspect_mysql_table_for_chunking
under a trait" suggestion — the two functions share a return type
but no implementation logic (different catalogs, dialects, quirks).
docs(adr) — ADR-0016: Nullability propagation deferred to v0.8
Phase A. Replaces the earlier "by design" dismissal of Gap #5 with
an honest deferred-decision record; names the four
SourceColumn::simple(…, true) hardcode sites, the per-query-shape
resolvability matrix, the operator workaround, and the revisit
trigger.
docs(adr) — ADR-0017: Per-runner durability ordering map.
Documents the asymmetric file_log timing (four runners inline
per part; chunked_parallel post-scope drain;
parallel_checkpoint split sync-worker + post-scope), names the
C3 live-test invariant that forced the split, and acknowledges
the per-chunk StateStore::open smell that the split kept.
docs(adr) — ADR-0018: Builder facades for runner-level
invariant ordering. Positive paper trail for commit::record_part +
RunStore: why builder over single-method / type-state, why facades and
not traits, what stays outside the facade (schema drift in single.rs,
metrics in job.rs).
docs(adr) — ADR-0019: Governor as extracted policy with
injectable PressureSource. Documents the testability win, why the
trait lives in tuning:: not source::, and the deadlock-class
regression cover.
docs(adr) — ADR-0020: PostgreSQL UUID-PK chunking asymmetry
vs MySQL. Two-layer gap: layer 1 (planner's PG-no-auto-keyset
default — deferred design choice; DECLARE CURSOR is RAM-bounded
but not wall-time-bounded) and layer 2 (sink runtime missing
FixedSizeBinary(16) arm — closed in this release).
docs(CLAUDE.md) — added "Verify before publishing agent-walk
claims" process rule: when an Agent(Explore, …) walk returns
claims with specific file paths / line numbers, the next action
is a Read / graph query on the named site before writing the
claim into a deliverable. Lesson from a real architecture-review
walk that produced six false claims, none of them verified.

Dependencies

Bumped mach2 0.4 → 0.6, tikv-jemallocator 0.6 → 0.7,
criterion 0.5 → 0.8 (dev), brotli 8.0.2 → 8.0.3.
