v0.9.7 — Performance Unleashed
Pre-release
Release Notes for v0.9.7 - Performance Unleashed
Version 0.9.7 - 2026-05-12
The performance milestone. Every read path that previously allocated or memcpy'd is now zero-copy, and the hot loops that took a lock per chunk now take it once. `as_slice` works uniformly on all three mapping modes (ReadOnly, CopyOnWrite, and ReadWrite); it returns a new `MappedSlice<'_>` wrapper that derefs to `&[u8]` and, on RW, holds a read guard for its lifetime, so a concurrent `resize()` blocks until the slice is dropped. The `chunks()` / `pages()` iterators no longer allocate a `Vec<u8>` per item: they yield `MappedSlice<'a>` borrowed directly from the mapped region. `touch_pages` was rewritten as a tight `ptr::read_volatile` loop that takes the lock once. `chunks_mut().for_each_mut(...)` likewise holds the write guard once for the entire iteration and drops its doubly nested `Result<Result<(), E>>` in favor of a flat `Result<()>`.
The headline numbers come from the H1 redesign: a 1 GiB scan at 4 KiB chunks goes from 262,144 heap allocations and 2x memory bandwidth (mmap → buffer → clone → caller) to zero allocations and 1x bandwidth (direct slice into the mapping). Audit items H1, H2, H4, and E4 all close in this release. H3 (lock-free RW reads via arc-swap) intentionally stays open and is rescoped to a 1.0 design conversation rather than a tactical fix; the current RwLock<MmapMut> is sound and bounded, and replacing it is a memory-model question, not a tuning question.
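The 262,144 figure is just the chunk count: 1 GiB / 4 KiB = 2^30 / 2^12 = 2^18. As a rough illustration, the sketch below contrasts the two iterator shapes on a plain in-memory buffer. It uses only the standard library; `chunk_count`, `scan_owned`, and `scan_borrowed` are hypothetical names for this example, not part of the crate.

```rust
const GIB: u64 = 1 << 30;
const KIB: u64 = 1 << 10;

/// Number of chunks needed to cover `len` bytes at `chunk` bytes each (rounding up).
fn chunk_count(len: u64, chunk: u64) -> u64 {
    (len + chunk - 1) / chunk
}

/// Old shape: every chunk is copied into a fresh Vec (one heap allocation per item).
fn scan_owned(data: &[u8], chunk: usize) -> u64 {
    data.chunks(chunk)
        .map(|c| c.to_vec())
        .map(|v| v.iter().map(|&b| b as u64).sum::<u64>())
        .sum()
}

/// New shape: every chunk is a borrowed sub-slice (no allocation, no copy).
fn scan_borrowed(data: &[u8], chunk: usize) -> u64 {
    data.chunks(chunk)
        .map(|c| c.iter().map(|&b| b as u64).sum::<u64>())
        .sum()
}

fn main() {
    // One allocation per chunk in the old shape: 2^30 / 2^12 = 2^18.
    assert_eq!(chunk_count(GIB, 4 * KIB), 262_144);

    // Both shapes see the same bytes; only the allocation behavior differs.
    let data = vec![1u8; 64 * 1024];
    assert_eq!(scan_owned(&data, 4096), scan_borrowed(&data, 4096));
    println!("chunks in 1 GiB at 4 KiB: {}", chunk_count(GIB, 4 * KIB));
}
```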
Highlights
- `MappedSlice<'a>` wrapper: the unifying read-side type. Implements `Deref<Target = [u8]>`, `AsRef<[u8]>`, `Debug`, and `PartialEq` against `[u8]` / `&[u8]` / `[u8; N]` / `&[u8; N]` so call sites work as if it were a byte slice. On RO and COW the wrapper is the `Owned(&'a [u8])` variant (lock-free, because the underlying mapping is immutable). On RW the wrapper is the `Guarded { guard, range }` variant that holds the `RwLock` read guard for its lifetime. Re-exported from the crate root.
- `as_slice` works on RW (BREAKING): previously returned `MmapIoError::InvalidMode` on RW, forcing callers to `read_into` (which copies). Now returns `Result<MappedSlice<'_>>` uniformly across all three modes. Callers that previously caught `InvalidMode` should remove the branch.
- Iterator zero-copy (BREAKING): `ChunkIterator::Item` and `PageIterator::Item` are now `MappedSlice<'a>` (was `Result<Vec<u8>>`). The iterator captures the mapping's base pointer and total length once at construction; each `next()` is pointer arithmetic plus `slice::from_raw_parts` with no heap traffic. The iterator holds the RW read guard for its full lifetime, blocking concurrent `resize()` until iteration is done. Migration aids `chunks_owned()` and `pages_owned()` preserve the `Vec<u8>` ergonomics for the (rare) case where callers genuinely need owned buffers.
- `for_each_mut` flattened (BREAKING): `ChunkIteratorMut::for_each_mut(F) -> Result<()>` where `F: FnMut(u64, &mut [u8]) -> Result<()>`. The old doubly nested `Result<Result<(), E>>` is gone, and the write guard is acquired ONCE for the entire iteration instead of per chunk. Callers that returned `Ok::<(), std::io::Error>(())` should return `Ok(())` and map foreign errors into `MmapIoError::Io(...)` before returning.
- `touch_pages` / `touch_pages_range` rewritten (H2): previously called `read_into(offset, &mut [0u8; 1])` per page, which acquired the lock, validated bounds, and memcpy'd one byte 262,144 times for a 1 GiB file. The new implementation acquires the lock ONCE, walks the mapping with `std::ptr::read_volatile::<u8>` wrapped in `std::hint::black_box`, and steps by `page_size()`. Expected speedup on multi-GiB files: 50-100x.
- Workload-pattern benches: `sequential_read` (1 / 16 / 256 MiB, `as_slice` vs `read_into`), `random_read` (xorshift64 PRNG, no new dep), `sequential_write` under three flush policies including `EveryMillis(10)` now that C2 made it actually work, `iterator_throughput` comparing zero-copy `chunks()` to `chunks_owned()` to show the H1 win directly, `touch_pages_large` on 1 GiB, and `atomic_contention` across 1 / 2 / 4 / 8 threads.
- CI workflow `bench-regression.yml`: runs the full bench suite on every push and PR and uploads `target/criterion/` as an artifact for diffing against the checked-in baseline. The hard >10% regression gate is rescoped to 0.9.10 alongside the rest of the pre-1.0 stabilization pass.
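To make the two-variant shape of the read-side wrapper concrete, here is a minimal sketch under stated assumptions: a `Vec<u8>` behind `std::sync::RwLock` stands in for the mapping, and `as_slice_rw` is a hypothetical helper. The variant names follow the notes above, but the crate's real definition, lock type, and field types may differ.

```rust
use std::ops::{Deref, Range};
use std::sync::{RwLock, RwLockReadGuard};

/// Stand-in for the read-side wrapper; `Vec<u8>` plays the role of the mapping.
enum MappedSlice<'a> {
    /// RO / COW: a plain borrow, no lock involved.
    Owned(&'a [u8]),
    /// RW: keeps the read guard alive for as long as the slice exists.
    Guarded {
        guard: RwLockReadGuard<'a, Vec<u8>>,
        range: Range<usize>,
    },
}

impl Deref for MappedSlice<'_> {
    type Target = [u8];

    fn deref(&self) -> &[u8] {
        match self {
            MappedSlice::Owned(s) => *s,
            MappedSlice::Guarded { guard, range } => &guard[range.clone()],
        }
    }
}

/// Hypothetical RW accessor: take the read lock once and hand it to the wrapper.
fn as_slice_rw(lock: &RwLock<Vec<u8>>, range: Range<usize>) -> MappedSlice<'_> {
    let guard = lock.read().expect("lock poisoned");
    MappedSlice::Guarded { guard, range }
}

fn main() {
    // Lock-free variant over an immutable buffer.
    let ro = [1u8, 2, 3, 4];
    let s = MappedSlice::Owned(&ro[1..3]);
    assert_eq!(&*s, &[2u8, 3][..]);

    // Guarded variant: slice methods and indexing go through Deref.
    let rw = RwLock::new(vec![10u8, 20, 30, 40]);
    let s = as_slice_rw(&rw, 1..3);
    assert_eq!(s.len(), 2);
    assert_eq!(s[0], 20);

    // A writer (standing in for resize()) cannot get in while `s` is alive.
    assert!(rw.try_write().is_err());
    drop(s);
    assert!(rw.try_write().is_ok());
}
```

Dropping the wrapper is what releases the lock in this sketch, which is why a long-lived slice on an RW mapping delays writers.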
Breaking changes
Three API breaks land in this release. All have mechanical migration paths.
- `as_slice` return type. Old: `Result<&[u8]>` (errors with `InvalidMode` on RW). New: `Result<MappedSlice<'_>>` for all three modes. `MappedSlice` derefs to `[u8]`, so most call sites compile unchanged because indexing, iteration, and `.len()` go through `Deref`. Sites that bound the result as `let s: &[u8] = mmap.as_slice(...)?` need to change to `let s = mmap.as_slice(...)?;` (let the type be inferred) or `let s: &[u8] = &*mmap.as_slice(...)?` (deref explicitly). Sites that caught `InvalidMode` on the RW path should remove that branch.
- Iterator items. Old: `Iterator<Item = Result<Vec<u8>>>`. New: `Iterator<Item = MappedSlice<'a>>`. Patterns like `if let Some(Ok(chunk)) = iter.next()` become `if let Some(chunk) = iter.next()`; `for chunk in mmap.chunks(N) { let chunk = chunk?; ... }` becomes `for chunk in mmap.chunks(N) { ... }`. Sites that genuinely need owned buffers (e.g., handing data to a thread that outlives the mapping borrow) should switch to `chunks_owned()` / `pages_owned()`, which preserve the old `Result<Vec<u8>>` shape.
- `ChunkIteratorMut::for_each_mut` signature. Old: `fn for_each_mut<F, E>(F) -> Result<Result<(), E>>`. New: `fn for_each_mut<F>(F) -> Result<()>` where `F: FnMut(u64, &mut [u8]) -> Result<()>` (the crate's `Result`, not `std::result::Result<(), E>`). Migration: return `Ok(())` instead of `Ok::<(), io::Error>(())` and map foreign errors with `.map_err(|e| MmapIoError::Io(...))` before returning. The double `??` unwrap pattern at the call site collapses to a single `?`.
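The `for_each_mut` migration can be sketched with stand-in types. Everything below is an assumption made for illustration: `MmapIoError` is reduced to a single `Io` variant, the free function `for_each_mut` mimics the new flat signature over a `Vec<u8>` behind `std::sync::RwLock`, the `u64` argument is assumed to be the chunk's byte offset, and `checked_fill` is a hypothetical foreign operation that returns `std::io::Error`.

```rust
use std::io;
use std::sync::RwLock;

/// Stand-in for the crate's error type, reduced to the one variant used here.
#[derive(Debug)]
enum MmapIoError {
    Io(io::Error),
}

/// Stand-in for the crate's `Result` alias.
type Result<T> = std::result::Result<T, MmapIoError>;

/// Mimics the new flat signature: one write-guard acquisition, then every chunk.
fn for_each_mut<F>(lock: &RwLock<Vec<u8>>, chunk: usize, mut f: F) -> Result<()>
where
    F: FnMut(u64, &mut [u8]) -> Result<()>,
{
    let mut guard = lock.write().expect("lock poisoned");
    let mut offset = 0u64;
    for c in guard.chunks_mut(chunk) {
        let len = c.len() as u64;
        f(offset, c)?;
        offset += len;
    }
    Ok(())
}

/// Hypothetical foreign operation that reports failures as std::io::Error.
fn checked_fill(chunk: &mut [u8], value: u8) -> io::Result<()> {
    if chunk.is_empty() {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "empty chunk"));
    }
    chunk.fill(value);
    Ok(())
}

fn main() -> Result<()> {
    let mapping = RwLock::new(vec![0u8; 10]);
    let mut offsets = Vec::new();

    // Map the foreign error into the crate error inside the closure,
    // so the call site needs a single `?` instead of `??`.
    for_each_mut(&mapping, 4, |offset, chunk| {
        offsets.push(offset);
        checked_fill(chunk, 0xAB).map_err(MmapIoError::Io)?;
        Ok(())
    })?;

    assert_eq!(offsets, vec![0, 4, 8]);
    assert!(mapping.read().unwrap().iter().all(|&b| b == 0xAB));
    Ok(())
}
```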
Performance
- Iterator zero-copy (H1): 1 GiB scan at 4 KiB chunks. Before: 262,144 heap allocations + 2x memory bandwidth (read into iterator buffer, clone buffer to return ownership). After: 0 allocations, 1x bandwidth, pointer arithmetic only. The `iterator_throughput` bench compares `chunks()` to `chunks_owned()` directly, so the delta shows up in the bench output.
- `touch_pages` tight loop (H2): 1 GiB file with 4 KiB pages = 262,144 page touches. Before: per-page `read_into(offset, &mut buf[..1])` = lock acquisition + bounds check + memcpy of 1 byte, 262,144 times. After: one lock acquisition + `ptr::read_volatile` step loop. Expected ~50-100x speedup on the `touch_pages_large` bench.
- `for_each_mut` single guard (E4 follow-on): the total time window during which the write lock is held is unchanged (the iteration was always exclusive), but the per-chunk lock-acquire / lock-release overhead is gone. Tight RMW loops over many small chunks (e.g., zeroing a 1 GiB file at 4 KiB) save the per-iteration uncontended parking_lot overhead.
- `MappedSlice` overhead: zero on RO and COW (the `Owned(&[u8])` variant is a thin wrapper). On RW the wrapper holds an `RwLockReadGuard` whose destructor releases the lock. No allocation, no virtual dispatch. `Deref` lowers to a direct pointer access at the use site.
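The shape of the single-lock page-touch loop can be sketched as follows. This is an illustration only: a `Vec<u8>` behind `std::sync::RwLock` stands in for the mapping, the page size is passed in rather than queried, and the function returns the number of pages touched purely so the example can be checked.

```rust
use std::hint::black_box;
use std::ptr;
use std::sync::RwLock;

/// Touch one byte per page under a single read-lock acquisition.
/// Returns the number of pages touched.
fn touch_pages(lock: &RwLock<Vec<u8>>, page_size: usize) -> usize {
    assert!(page_size > 0, "page_size must be non-zero");
    let guard = lock.read().expect("lock poisoned");
    let base = guard.as_ptr();
    let len = guard.len();

    let mut touched = 0;
    let mut offset = 0;
    while offset < len {
        // SAFETY: offset < len, so base.add(offset) points inside the live buffer,
        // and the read guard keeps that buffer alive and unmodified for the loop.
        let byte = unsafe { ptr::read_volatile(base.add(offset)) };
        // Keep the optimizer from discarding the read.
        black_box(byte);
        touched += 1;
        offset += page_size;
    }
    touched
}

fn main() {
    // Four full 4 KiB pages plus one trailing byte: five pages get touched.
    let mapping = RwLock::new(vec![0u8; 4096 * 4 + 1]);
    assert_eq!(touch_pages(&mapping, 4096), 5);
}
```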
Tests
- 15 property tests (256-1024 cases per property) continue to run; one obsolete property (`as_slice_rw_invalid_mode`) was rewritten as `as_slice_rw_returns_mapped_slice` to verify the new RW path. New iterator tests cover the zero-copy shape, the `chunks_owned` migration aid, and `for_each_mut` single-guard behavior.
- 101 tests total under `--all-features` (up from 99 in 0.9.6), 4 ignored (3 polling-watch tests gated on Windows mtime granularity, 1 hugepages fallback), 0 failed.
- CI matrix combos green locally for `--no-default-features` (60 + doctests) and `--no-default-features --features "cow locking advise"` (75 + doctests). Banned-words scan: zero hits.
- MSRV unchanged at 1.75. `cargo +1.75 build --all-features` is clean. The iterator's `Send + Sync` impls and the `MappedSlice` `Deref` / `PartialEq` stack work on 1.75 without GAT use or other recent features.
Notes
- No new runtime dependencies. `proptest` (from 0.9.6) remains the only `[dev-dependencies]` addition for this milestone. The random-offset benches use a hand-rolled xorshift64 PRNG rather than pulling in `rand`.
- `MappedSlice` and `MappedSliceMut` are now part of the stable-through-0.9.x public surface. Both are re-exported from the crate root.
- `as_slice_mut` already returned `MappedSliceMut<'_>` before this release. This release adds `Deref<Target = [u8]>` and `DerefMut` impls so it can be used as a `&mut [u8]` directly, plus `len()` and `is_empty()` accessors.
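A hand-rolled xorshift64 generator is only a few lines, which is why the benches can avoid `rand`. The sketch below is the textbook Marsaglia xorshift64 with the (13, 7, 17) shift triple; the crate's bench code may use different constants, seeding, or offset mapping. `XorShift64` and `next_offset` are hypothetical names.

```rust
/// Marsaglia xorshift64 with the (13, 7, 17) shift triple.
struct XorShift64 {
    state: u64,
}

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // Zero is a fixed point of xorshift, so substitute a non-zero seed.
        let state = if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed };
        Self { state }
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }

    /// Random offset such that `offset + read_len <= len`.
    /// The modulo introduces a slight bias, which is acceptable for a bench.
    fn next_offset(&mut self, len: u64, read_len: u64) -> u64 {
        assert!(read_len <= len, "read must fit inside the mapping");
        self.next_u64() % (len - read_len + 1)
    }
}

fn main() {
    let mut rng = XorShift64::new(1);
    let len = 1u64 << 20; // 1 MiB stand-in for the mapped length
    for _ in 0..1000 {
        let offset = rng.next_offset(len, 4096);
        assert!(offset + 4096 <= len);
    }
}
```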
Deferred (with documented reason)
- H3 (lock-free RW reads via arc-swap) stays open. The current `RwLock<MmapMut>` design is sound and bounded; replacing it with `arc-swap` or an `UnsafeCell` design is a memory-model question (do we accept torn reads from concurrent writers on the same mapping?), not a tuning question. Rescoped from "tactical 0.9.x fix" to a 1.0 design conversation.
- Bench-regression hard gate: the new `bench-regression.yml` runs the suite and uploads artifacts on every push. The >10% regression threshold check is part of the 0.9.10 pre-1.0 stabilization milestone alongside `cargo-semver-checks` and `cargo-fuzz`.
- `docs/PERFORMANCE.md`: now that the workload-pattern benches exist, the next step is running them on the maintainer's reference machine and publishing measured P50 / P99 numbers per workload. That doc is part of the same 0.9.10 sweep.
- 0.9.8 is the async-polish release: cancellation-safety review per async method, possible `read_into_async` / `read_slice_async` additions, Tokio version decision.
Full Changelog: v0.9.6...v0.9.7