v0.9.7 — Performance Unleashed

Pre-release

@jamesgober jamesgober released this 12 May 20:29
· 9 commits to main since this release

Release Notes for v0.9.7 - Performance Unleashed

Version 0.9.7 - 2026-05-12

The performance milestone. Every read path that previously allocated or memcpy'd is now zero-copy, and the hot loops that grabbed locks per chunk now grab them once. as_slice works uniformly on all three mapping modes (ReadOnly, CopyOnWrite, and ReadWrite); it returns a new MappedSlice<'_> wrapper that derefs to &[u8] and, on RW, holds a read guard for its lifetime so concurrent resize() blocks until the slice is dropped. The chunks() / pages() iterators no longer allocate a Vec<u8> per item: they yield MappedSlice<'a> borrowed directly from the mapped region. touch_pages was rewritten as a tight ptr::read_volatile loop that takes the lock once. chunks_mut().for_each_mut(...) likewise holds the write guard once for the entire iteration, and its return type drops the nested Result<Result<(), E>> in favor of a flat Result<()>.

The headline numbers come from the H1 redesign: a 1 GiB scan at 4 KiB chunks goes from 262,144 heap allocations and 2x memory bandwidth (mmap → buffer → clone → caller) to zero allocations and 1x bandwidth (direct slice into the mapping). Audit items H1, H2, H4, and E4 all close in this release. H3 (lock-free RW reads via arc-swap) intentionally stays open and is rescoped to a 1.0 design conversation rather than a tactical fix; the current RwLock<MmapMut> is sound and bounded, and replacing it is a memory-model question, not a tuning question.
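The allocation count above is simple arithmetic, and the difference between the two iteration shapes can be modeled without the crate. The following is a minimal std-only sketch, not the crate's implementation: a plain byte slice stands in for the mapping, and the names chunk_count, scan_owned, and scan_borrowed are hypothetical.

```rust
/// Number of chunks needed to cover `len` bytes at `chunk_size` bytes each
/// (the final chunk may be short).
fn chunk_count(len: u64, chunk_size: u64) -> u64 {
    len.div_ceil(chunk_size)
}

/// Old shape (modeled): every chunk is copied into a fresh Vec,
/// so a scan costs one heap allocation and one extra copy per chunk.
/// Returns (byte sum, number of allocations).
fn scan_owned(data: &[u8], chunk_size: usize) -> (u64, usize) {
    let mut sum = 0u64;
    let mut allocations = 0usize;
    for chunk in data.chunks(chunk_size) {
        let owned: Vec<u8> = chunk.to_vec(); // allocation + memcpy
        allocations += 1;
        sum += owned.iter().map(|&b| u64::from(b)).sum::<u64>();
    }
    (sum, allocations)
}

/// New shape (modeled): every chunk is a borrowed sub-slice of the same
/// buffer, so the scan reads each byte exactly once and never allocates.
fn scan_borrowed(data: &[u8], chunk_size: usize) -> u64 {
    data.chunks(chunk_size)
        .map(|chunk| chunk.iter().map(|&b| u64::from(b)).sum::<u64>())
        .sum()
}
```

chunk_count(1 << 30, 4 << 10) evaluates to 262,144, which is where the per-scan allocation figure for a 1 GiB file at 4 KiB chunks comes from.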

Highlights

  • MappedSlice<'a> wrapper: the unifying read-side type. Implements Deref<Target = [u8]>, AsRef<[u8]>, Debug, and PartialEq against [u8] / &[u8] / [u8; N] / &[u8; N] so call sites work as if it were a byte slice. On RO and COW the wrapper is the Owned(&'a [u8]) variant (lock-free, the underlying mapping is immutable). On RW the wrapper is the Guarded { guard, range } variant that holds the RwLock read guard for its lifetime. Re-exported from the crate root.
  • as_slice works on RW (BREAKING): previously returned MmapIoError::InvalidMode on RW, forcing callers to read_into (which copies). Now returns Result<MappedSlice<'_>> uniformly across all three modes. Callers that previously caught InvalidMode should remove the branch.
  • Iterator zero-copy (BREAKING): ChunkIterator::Item and PageIterator::Item are now MappedSlice<'a> (was Result<Vec<u8>>). The iterator captures the mapping's base pointer and total length once at construction; each next() is pointer arithmetic plus a slice::from_raw_parts call, with no heap traffic. On RW mappings the iterator holds the read guard for its full lifetime, blocking concurrent resize() until iteration is done. The migration aids chunks_owned() and pages_owned() preserve the Vec<u8> ergonomics for the (rare) case where callers genuinely need owned buffers.
  • for_each_mut flattened (BREAKING): ChunkIteratorMut::for_each_mut(F) -> Result<()> where F: FnMut(u64, &mut [u8]) -> Result<()>. The old nested Result<Result<(), E>> is gone, and the write guard is acquired ONCE for the entire iteration instead of per-chunk. Callers that returned Ok::<(), std::io::Error>(()) should return Ok(()) and map foreign errors into MmapIoError::Io(...) before returning.
  • touch_pages / touch_pages_range rewritten (H2): previously called read_into(offset, &mut [0u8; 1]) per page, which acquired the lock, validated bounds, and memcpy'd one byte each time: 262,144 times for a 1 GiB file at 4 KiB pages. The new implementation acquires the lock ONCE, walks the mapping with std::ptr::read_volatile::<u8> wrapped in std::hint::black_box, and steps by page_size(). Expected speedup on multi-GiB files: 50-100x.
  • Workload-pattern benches: sequential_read (1 / 16 / 256 MiB, as_slice vs read_into), random_read (xorshift64 PRNG, no new dep), sequential_write under three flush policies including EveryMillis(10) now that C2 made it actually work, iterator_throughput comparing zero-copy chunks() to chunks_owned() to show the H1 win directly, touch_pages_large on 1 GiB, and atomic_contention across 1 / 2 / 4 / 8 threads.
  • CI workflow bench-regression.yml: runs the full bench suite on every push and PR, uploads target/criterion/ as an artifact for diff against the checked-in baseline. The hard >10% regression gate is rescoped to 0.9.10 alongside the rest of the pre-1.0 stabilization pass.
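To make the two MappedSlice variants described in the highlights concrete, here is a minimal std-only model. It is a sketch under stated assumptions, not the crate's code: SliceSketch, as_slice_ro, and as_slice_rw are hypothetical names, a Vec<u8> behind std::sync::RwLock stands in for the RW mapping, and bounds are checked by ordinary slice indexing (which panics) rather than by returning an error.

```rust
use std::ops::{Deref, Range};
use std::sync::{RwLock, RwLockReadGuard};

/// Hypothetical stand-in for the crate's `MappedSlice<'a>`.
enum SliceSketch<'a> {
    /// RO / COW model: a plain borrow, no lock involved.
    Owned(&'a [u8]),
    /// RW model: the read guard lives as long as the slice, so a writer
    /// (or a resize) cannot proceed until the slice is dropped.
    Guarded {
        guard: RwLockReadGuard<'a, Vec<u8>>,
        range: Range<usize>,
    },
}

impl Deref for SliceSketch<'_> {
    type Target = [u8];

    fn deref(&self) -> &[u8] {
        match self {
            SliceSketch::Owned(bytes) => *bytes,
            SliceSketch::Guarded { guard, range } => &guard[range.clone()],
        }
    }
}

/// Read-only model: borrow directly from the backing bytes.
fn as_slice_ro(mapping: &[u8], range: Range<usize>) -> SliceSketch<'_> {
    SliceSketch::Owned(&mapping[range])
}

/// Read-write model: take the read lock and carry the guard in the slice.
fn as_slice_rw(mapping: &RwLock<Vec<u8>>, range: Range<usize>) -> SliceSketch<'_> {
    let guard = mapping.read().expect("lock poisoned");
    SliceSketch::Guarded { guard, range }
}
```

Because both variants deref to [u8], call sites can index, iterate, and call .len() without caring which mode produced the slice. In this model the lock is released when the Guarded value goes out of scope.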

Breaking changes

Three API breaks land in this release. All have mechanical migration paths.

  • as_slice return type. Old: Result<&[u8]> (errors with InvalidMode on RW). New: Result<MappedSlice<'_>> for all three modes. MappedSlice derefs to [u8], so most call sites compile unchanged because indexing, iteration, and .len() go through Deref. Sites that bound the result as let s: &[u8] = mmap.as_slice(...)? need to change to let s = mmap.as_slice(...)?; (let the type be inferred) or let s: &[u8] = &*mmap.as_slice(...)? (deref explicitly). Sites that caught InvalidMode on the RW path should remove that branch.

  • Iterator items. Old: Iterator<Item = Result<Vec<u8>>>. New: Iterator<Item = MappedSlice<'a>>. Patterns like if let Some(Ok(chunk)) = iter.next() become if let Some(chunk) = iter.next(); for chunk in mmap.chunks(N) { let chunk = chunk?; ... } becomes for chunk in mmap.chunks(N) { ... }. Sites that genuinely need owned buffers (e.g., handing data to a thread that outlives the mapping borrow) should switch to chunks_owned() / pages_owned(), which preserve the old Result<Vec<u8>> shape.

  • ChunkIteratorMut::for_each_mut signature. Old: fn for_each_mut<F, E>(F) -> Result<Result<(), E>>. New: fn for_each_mut<F>(F) -> Result<()> where F: FnMut(u64, &mut [u8]) -> Result<()> (the crate's Result, not std::result::Result<(), E>). Migration: return Ok(()) instead of Ok::<(), io::Error>(()) and map foreign errors with .map_err(|e| MmapIoError::Io(...)) before returning. The double ?? unwrap pattern at the call site collapses to a single ?.
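The for_each_mut migration can be illustrated with a small std-only model. This is a sketch, not the crate's API: SketchError and SketchResult are hypothetical stand-ins for MmapIoError and the crate's Result, the Io variant is assumed to wrap std::io::Error, and a Vec<u8> behind std::sync::RwLock stands in for the mapping.

```rust
use std::io;
use std::sync::RwLock;

/// Hypothetical stand-in for the crate's error type; only an `Io`
/// variant is modeled, and its payload type is an assumption.
#[derive(Debug)]
enum SketchError {
    Io(io::Error),
}

type SketchResult<T> = std::result::Result<T, SketchError>;

/// Model of the flattened shape: the write guard is taken once, and the
/// closure returns the same Result type as the function itself.
/// Panics if `chunk_size` is 0 (a property of `chunks_mut`).
fn for_each_mut<F>(mapping: &RwLock<Vec<u8>>, chunk_size: usize, mut f: F) -> SketchResult<()>
where
    F: FnMut(u64, &mut [u8]) -> SketchResult<()>,
{
    let mut guard = mapping.write().expect("lock poisoned"); // acquired once
    let mut offset = 0u64;
    for chunk in guard.chunks_mut(chunk_size) {
        let len = chunk.len() as u64;
        f(offset, chunk)?; // the first error stops the iteration
        offset += len;
    }
    Ok(())
}

/// A "foreign" fallible step that reports std::io::Error.
fn checked_fill(chunk: &mut [u8], value: u8) -> io::Result<()> {
    if chunk.is_empty() {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "empty chunk"));
    }
    chunk.fill(value);
    Ok(())
}

/// Caller after migration: map the foreign error, then a single `?`
/// (or a direct return) is enough at the call site.
fn fill_all(mapping: &RwLock<Vec<u8>>, value: u8) -> SketchResult<()> {
    for_each_mut(mapping, 4, |_offset, chunk| {
        checked_fill(chunk, value).map_err(SketchError::Io)
    })
}
```

The point of the flat shape is that the closure and the outer call share one error type, so there is only one layer to unwrap.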

Performance

  • Iterator zero-copy (H1): 1 GiB scan at 4 KiB chunks. Before: 262,144 heap allocations + 2x memory bandwidth (read into iterator buffer, clone buffer to return ownership). After: 0 allocations, 1x bandwidth, pointer arithmetic only. The iterator_throughput bench compares chunks() to chunks_owned() directly, so the delta shows up in the bench output.
  • touch_pages tight loop (H2): 1 GiB file with 4 KiB pages = 262,144 page touches. Before: per-page read_into(offset, &mut buf[..1]) = lock acquisition + bounds check + memcpy of 1 byte, 262,144 times. After: one lock acquisition + ptr::read_volatile step loop. Expected ~50-100x speedup on the touch_pages_large bench.
  • for_each_mut single guard (E4 follow-on): the total time the write lock is held is unchanged (the iteration was always exclusive), but the per-chunk lock-acquire / lock-release overhead is gone. Tight RMW loops over many small chunks (e.g., zeroing a 1 GiB file at 4 KiB) no longer pay the uncontended parking_lot acquire/release cost on every iteration.
  • MappedSlice overhead: zero on RO and COW (the Owned(&[u8]) variant is a thin wrapper). On RW the wrapper holds an RwLockReadGuard whose destructor releases the lock. No allocation, no virtual dispatch. Deref lowers to a direct pointer access at the use site.
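The touch_pages rewrite (H2) follows a pattern that can be shown in a few lines. The following is a minimal sketch of that pattern, not the crate's implementation: touch_pages_sketch is a hypothetical name, a Vec<u8> behind std::sync::RwLock stands in for the mapping, and the page size is passed in rather than queried from the OS.

```rust
use std::hint::black_box;
use std::ptr;
use std::sync::RwLock;

/// Read one byte from the start of every page so the OS faults it in.
/// Returns the number of pages touched.
fn touch_pages_sketch(mapping: &RwLock<Vec<u8>>, page_size: usize) -> usize {
    assert!(page_size > 0, "page_size must be non-zero");

    // One lock acquisition for the whole walk, instead of one per page.
    let guard = mapping.read().expect("lock poisoned");
    let base = guard.as_ptr();
    let len = guard.len();

    let mut touched = 0;
    for offset in (0..len).step_by(page_size) {
        // SAFETY: `offset < len`, so `base.add(offset)` stays inside the
        // buffer, and the read guard keeps the buffer alive and unresized
        // for the duration of the loop.
        let byte = unsafe { ptr::read_volatile(base.add(offset)) };
        // Keep the optimizer from discarding the read as dead code.
        black_box(byte);
        touched += 1;
    }
    touched
}
```

The volatile read is what forces an actual memory access per page; the loop body has no bounds check beyond the range itself and no copy into a caller buffer.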

Tests

  • 15 property tests (256-1024 cases per property) continue to run; one obsolete property (as_slice_rw_invalid_mode) was rewritten as as_slice_rw_returns_mapped_slice to verify the new RW path. New iterator tests cover the zero-copy shape, the chunks_owned migration aid, and the for_each_mut single-guard behavior.
  • 101 tests total under --all-features (up from 99 in 0.9.6), 4 ignored (3 polling-watch tests gated on Windows mtime granularity, 1 hugepages fallback), 0 failed.
  • CI matrix combos green locally for --no-default-features (60 + doctests) and --no-default-features --features "cow locking advise" (75 + doctests). Banned-words scan zero hits.
  • MSRV unchanged at 1.75. cargo +1.75 build --all-features clean. The iterator's Send + Sync impls and the MappedSlice Deref / PartialEq stack work on 1.75 without GAT use or other recent features.

Notes

  • No new runtime dependencies. proptest (from 0.9.6) remains the only [dev-dependencies] addition for this milestone. The random-offset benches use a hand-rolled xorshift64 PRNG rather than pulling in rand.
  • MappedSlice and MappedSliceMut are now part of the stable-through-0.9.x public surface. Both are re-exported from the crate root.
  • as_slice_mut was already returning MappedSliceMut<'_> before this release. This release adds Deref<Target = [u8]> and DerefMut impls to it, so it can be used as a &mut [u8] directly, along with len() and is_empty() accessors.
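For readers curious what a hand-rolled xorshift64 for random-offset benches can look like, here is a minimal sketch. It is not the generator from the crate's bench code: the 13 / 7 / 17 shift triple is one common xorshift64 variant chosen for illustration, and XorShift64, next_u64, and next_offset are hypothetical names.

```rust
/// Small non-cryptographic PRNG, adequate for picking bench offsets.
struct XorShift64 {
    state: u64,
}

impl XorShift64 {
    /// The state must never be zero (zero maps to zero forever),
    /// so a zero seed is replaced with an arbitrary non-zero constant.
    fn new(seed: u64) -> Self {
        let state = if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed };
        Self { state }
    }

    /// One xorshift64 step using the 13 / 7 / 17 shift triple.
    fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }

    /// Pick an offset such that `offset + read_len <= file_len`.
    /// Plain modulo has a slight bias, which is acceptable for a bench.
    fn next_offset(&mut self, file_len: u64, read_len: u64) -> u64 {
        assert!(read_len <= file_len, "read must fit inside the file");
        self.next_u64() % (file_len - read_len + 1)
    }
}
```

A fixed seed makes the offset sequence reproducible across runs, which keeps bench comparisons between releases on the same access pattern.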

Deferred (with documented reason)

  • H3 (lock-free RW reads via arc-swap) stays open. The current RwLock<MmapMut> design is sound and bounded; replacing it with arc-swap or an UnsafeCell design is a memory-model question (do we accept torn reads from concurrent writers on the same mapping?), not a tuning question. Rescoped from "tactical 0.9.x fix" to a 1.0 design conversation.
  • Bench-regression hard gate: the new bench-regression.yml runs the suite and uploads artifacts on every push. The >10% regression threshold check is part of the 0.9.10 pre-1.0 stabilization milestone alongside cargo-semver-checks and cargo-fuzz.
  • docs/PERFORMANCE.md: now that the workload-pattern benches exist, the next step is running them on the maintainer's reference machine and publishing measured P50 / P99 numbers per workload. That doc is part of the same 0.9.10 sweep.
  • 0.9.8 is the async-polish release: cancellation-safety review per async method, possible read_into_async / read_slice_async additions, Tokio version decision.

Full Changelog: v0.9.6...v0.9.7