v0.9.7 — Performance Unleashed #4
jamesgober announced in Announcements
Release Notes for v0.9.7 - Performance Unleashed
Version 0.9.7 - 2026-05-12
The performance milestone. Every read path that previously allocated or memcpy'd is now zero-copy, and the hot loops that grabbed locks per chunk now grab them once.
`as_slice` works uniformly on all three mapping modes (ReadOnly, CopyOnWrite, and ReadWrite); it returns a new `MappedSlice<'_>` wrapper that derefs to `&[u8]` and, on RW, holds a read guard for its lifetime so that a concurrent `resize()` blocks until the slice is dropped. The `chunks()` / `pages()` iterators no longer allocate a `Vec<u8>` per item: they yield `MappedSlice<'a>` borrowed directly from the mapped region. `touch_pages` was rewritten as a tight `ptr::read_volatile` loop that takes the lock once. `chunks_mut().for_each_mut(...)` similarly holds the write guard once for the entire iteration and drops its nested `Result<Result<(), E>>` in favor of a flat `Result<()>`.
The headline numbers come from the H1 redesign: a 1 GiB scan at 4 KiB chunks goes from 262,144 heap allocations and 2x memory bandwidth (mmap → buffer → clone → caller) to zero allocations and 1x bandwidth (direct slice into the mapping). Audit items H1, H2, H4, and E4 all close in this release. H3 (lock-free RW reads via `arc-swap`) intentionally stays open and is rescoped to a 1.0 design conversation rather than a tactical fix; the current `RwLock<MmapMut>` is sound and bounded, and replacing it is a memory-model question, not a tuning question.
Highlights
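As a rough illustration of the H1 headline numbers, the sketch below contrasts an owned-buffer scan with a borrowed scan over a plain byte slice. It uses only the standard library; `scan_owned`, `scan_borrowed`, and `chunk_count` are hypothetical helpers written for this example and are not part of the crate, and an in-memory slice stands in for the mapping.

```rust
/// Owned-style scan: every chunk is copied into a fresh `Vec<u8>`,
/// which costs one heap allocation and one extra copy per chunk.
/// Returns (number of allocations, byte sum).
fn scan_owned(data: &[u8], chunk_size: usize) -> (usize, u64) {
    let mut allocations = 0;
    let mut sum = 0u64;
    for chunk in data.chunks(chunk_size) {
        let owned: Vec<u8> = chunk.to_vec(); // allocation + copy
        allocations += 1;
        sum += owned.iter().map(|&b| u64::from(b)).sum::<u64>();
    }
    (allocations, sum)
}

/// Borrowed scan: every chunk is a sub-slice of `data`, so the loop
/// allocates nothing. Returns (number of chunks visited, byte sum).
fn scan_borrowed(data: &[u8], chunk_size: usize) -> (usize, u64) {
    let mut chunks = 0;
    let mut sum = 0u64;
    for chunk in data.chunks(chunk_size) {
        chunks += 1;
        sum += chunk.iter().map(|&b| u64::from(b)).sum::<u64>();
    }
    (chunks, sum)
}

/// Number of chunks needed to cover `total_len` bytes (ceiling division).
fn chunk_count(total_len: u64, chunk_size: u64) -> u64 {
    (total_len + chunk_size - 1) / chunk_size
}
```

Under these assumptions, `chunk_count(1 << 30, 4096)` gives the 262,144 figure quoted above: one allocation per 4 KiB chunk of a 1 GiB file in the owned shape, and none in the borrowed shape.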
- `MappedSlice<'a>` wrapper: the unifying read-side type. Implements `Deref<Target = [u8]>`, `AsRef<[u8]>`, `Debug`, and `PartialEq` against `[u8]` / `&[u8]` / `[u8; N]` / `&[u8; N]` so call sites work as if it were a byte slice. On RO and COW the wrapper is the `Owned(&'a [u8])` variant (lock-free, since the underlying mapping is immutable). On RW the wrapper is the `Guarded { guard, range }` variant that holds the `RwLock` read guard for its lifetime. Re-exported from the crate root.
- `as_slice` works on RW (BREAKING): previously returned `MmapIoError::InvalidMode` on RW, forcing callers to use `read_into` (which copies). Now returns `Result<MappedSlice<'_>>` uniformly across all three modes. Callers that previously caught `InvalidMode` should remove the branch.
- `ChunkIterator::Item` and `PageIterator::Item` are now `MappedSlice<'a>` (was `Result<Vec<u8>>`). The iterator captures the mapping's base pointer and total length once at construction; each `next()` is pointer arithmetic plus `slice::from_raw_parts`, with no heap traffic. The iterator holds the RW read guard for its full lifetime, blocking concurrent `resize()` until iteration is done. Migration aids `chunks_owned()` and `pages_owned()` preserve the `Vec<u8>` ergonomics for the (rare) case where callers genuinely need owned buffers.
- `for_each_mut` flattened (BREAKING): `ChunkIteratorMut::for_each_mut(F) -> Result<()>` where `F: FnMut(u64, &mut [u8]) -> Result<()>`. The old nested `Result<Result<(), E>>` is gone, and the write guard is acquired once for the entire iteration instead of once per chunk. Callers that returned `Ok::<(), std::io::Error>(())` should return `Ok(())` and map foreign errors into `MmapIoError::Io(...)` before returning.
- `touch_pages` / `touch_pages_range` rewritten (H2): previously called `read_into(offset, &mut [0u8; 1])` per page, which acquired the lock, validated bounds, and memcpy'd one byte, 262,144 times for a 1 GiB file. The new implementation acquires the lock once, walks the mapping with `std::ptr::read_volatile::<u8>` wrapped in `std::hint::black_box`, and steps by `page_size()`. Expected speedup on multi-GiB files: 50-100x.
- `sequential_read` (1 / 16 / 256 MiB, `as_slice` vs `read_into`), `random_read` (xorshift64 PRNG, no new dep), `sequential_write` under three flush policies including `EveryMillis(10)` now that C2 made it actually work, `iterator_throughput` comparing zero-copy `chunks()` to `chunks_owned()` to show the H1 win directly, `touch_pages_large` on 1 GiB, and `atomic_contention` across 1 / 2 / 4 / 8 threads.
- `bench-regression.yml`: runs the full bench suite on every push and PR, and uploads `target/criterion/` as an artifact for diffing against the checked-in baseline. The hard >10% regression gate is rescoped to 0.9.10 alongside the rest of the pre-1.0 stabilization pass.
Breaking changes
Three API breaks land in this release. All have mechanical migration paths.
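To illustrate why the `as_slice` migration is mostly mechanical, here is a minimal stand-in for a guard-holding slice wrapper. It is a sketch under stated assumptions, not the crate's code: `SliceSketch`, `ro_slice`, and `rw_slice` are hypothetical names, and a `Vec<u8>` behind `std::sync::RwLock` stands in for the real `RwLock<MmapMut>`.

```rust
use std::ops::{Deref, Range};
use std::sync::{RwLock, RwLockReadGuard};

/// Hypothetical stand-in for a wrapper like `MappedSlice<'a>`.
enum SliceSketch<'a> {
    /// Immutable backing storage: a plain borrow, no lock involved.
    Owned(&'a [u8]),
    /// Mutable backing storage: the read guard lives as long as the slice.
    Guarded {
        guard: RwLockReadGuard<'a, Vec<u8>>,
        range: Range<usize>,
    },
}

impl Deref for SliceSketch<'_> {
    type Target = [u8];

    fn deref(&self) -> &[u8] {
        match self {
            SliceSketch::Owned(bytes) => bytes,
            SliceSketch::Guarded { guard, range } => &guard[range.clone()],
        }
    }
}

/// Read-only path: no lock is taken.
fn ro_slice(bytes: &[u8], range: Range<usize>) -> SliceSketch<'_> {
    SliceSketch::Owned(&bytes[range])
}

/// Read-write path: take the read lock once and keep it inside the wrapper.
fn rw_slice(lock: &RwLock<Vec<u8>>, range: Range<usize>) -> SliceSketch<'_> {
    let guard = lock.read().expect("lock poisoned");
    assert!(range.end <= guard.len(), "range out of bounds");
    SliceSketch::Guarded { guard, range }
}
```

Because the wrapper derefs to `[u8]`, calls such as `.len()`, indexing, and iteration keep compiling against either variant, and `&*slice` recovers a plain `&[u8]` where an explicit type is required. While a `Guarded` value is alive, a writer on the same lock has to wait, which models how a concurrent `resize()` is held off.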
- `as_slice` return type. Old: `Result<&[u8]>` (errors with `InvalidMode` on RW). New: `Result<MappedSlice<'_>>` for all three modes. `MappedSlice` derefs to `[u8]`, so most call sites compile unchanged because indexing, iteration, and `.len()` go through `Deref`. Sites that bound the result as `let s: &[u8] = mmap.as_slice(...)?` need to change to `let s = mmap.as_slice(...)?;` (let the type be inferred) or `let s: &[u8] = &*mmap.as_slice(...)?` (deref explicitly). Sites that caught `InvalidMode` on the RW path should remove that branch.
- Iterator items. Old: `Iterator<Item = Result<Vec<u8>>>`. New: `Iterator<Item = MappedSlice<'a>>`. Patterns like `if let Some(Ok(chunk)) = iter.next()` become `if let Some(chunk) = iter.next()`; `for chunk in mmap.chunks(N) { let chunk = chunk?; ... }` becomes `for chunk in mmap.chunks(N) { ... }`. Sites that genuinely need owned buffers (e.g., handing data to a thread that outlives the mapping borrow) should switch to `chunks_owned()` / `pages_owned()`, which preserve the old `Result<Vec<u8>>` shape.
- `ChunkIteratorMut::for_each_mut` signature. Old: `fn for_each_mut<F, E>(F) -> Result<Result<(), E>>`. New: `fn for_each_mut<F>(F) -> Result<()>` where `F: FnMut(u64, &mut [u8]) -> Result<()>` (the crate's `Result`, not `std::result::Result<(), E>`). Migration: return `Ok(())` instead of `Ok::<(), io::Error>(())` and map foreign errors with `.map_err(|e| MmapIoError::Io(...))` before returning. The double `??` unwrap pattern at the call site collapses to a single `?`.
Performance
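As a rough picture of the `touch_pages` rewrite (H2), the sketch below reads one byte per page with `std::ptr::read_volatile` and passes it through `std::hint::black_box`. It is an assumption-laden stand-in rather than the crate's implementation: `touch_pages_sketch` is a hypothetical name, the page size is a parameter instead of a `page_size()` lookup, and holding the `&[u8]` borrow for the whole loop models taking the lock once.

```rust
use std::hint::black_box;
use std::ptr;

/// Reads one byte from the start of every page in `region` and returns
/// the number of pages touched. The single borrow of `region` plays the
/// role of a lock that is acquired once before the loop.
fn touch_pages_sketch(region: &[u8], page_size: usize) -> usize {
    assert!(page_size > 0, "page_size must be non-zero");
    let base = region.as_ptr();
    let mut touched = 0;
    for offset in (0..region.len()).step_by(page_size) {
        // SAFETY: `offset < region.len()`, so `base.add(offset)` points
        // inside the live borrow of `region`.
        let byte = unsafe { ptr::read_volatile(base.add(offset)) };
        // Keep the optimizer from treating the read as dead code.
        black_box(byte);
        touched += 1;
    }
    touched
}
```

The volatile read forces an actual load from each page, and `black_box` discourages the compiler from discarding the result, so the loop does the page-faulting work without any per-page locking, bounds validation, or copy.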
- `iterator_throughput` bench compares `chunks()` to `chunks_owned()` directly so the delta is on the bench output sheet.
- `touch_pages` tight loop (H2): 1 GiB file with 4 KiB pages = 262,144 page touches. Before: per-page `read_into(offset, &mut buf[..1])` = lock acquisition + bounds check + memcpy of 1 byte, 262,144 times. After: one lock acquisition + a `ptr::read_volatile` step loop. Expected ~50-100x speedup on the `touch_pages_large` bench.
- `for_each_mut` single guard (E4 follow-on): the total time the write lock is held is unchanged (the iteration was always exclusive), but the per-chunk lock-acquire / lock-release overhead is gone. Tight RMW loops over many small chunks (e.g., zeroing a 1 GiB file at 4 KiB) save the uncontended parking_lot overhead on every iteration.
- `MappedSlice` overhead: zero on RO and COW (the `Owned(&[u8])` variant is a thin wrapper). On RW the wrapper holds an `RwLockReadGuard` whose destructor releases the lock. No allocation, no virtual dispatch. `Deref` lowers to a direct pointer access at the use site.
Tests
- `as_slice_rw_invalid_mode`) was rewritten as `as_slice_rw_returns_mapped_slice` to verify the new RW path. Plus new iterator tests for the zero-copy shape, the `chunks_owned` migration aid, and `for_each_mut` single-guard behavior.
- `--all-features` (up from 99 in 0.9.6), 4 ignored (3 polling-watch tests gated on Windows mtime granularity, 1 hugepages fallback), 0 failed.
- `--no-default-features` (60 + doctests) and `--no-default-features --features "cow locking advise"` (75 + doctests). Banned-words scan: zero hits.
- `cargo +1.75 build --all-features` clean. The iterator's `Send + Sync` impls and the `MappedSlice` `Deref` / `PartialEq` stack work on 1.75 without GAT use or other recent features.
Notes
- `proptest` (from 0.9.6) remains the only `[dev-dependencies]` addition for this milestone. The random-offset benches use a hand-rolled xorshift64 PRNG rather than pulling in `rand`.
- `MappedSlice` and `MappedSliceMut` are now part of the stable-through-0.9.x public surface. Both are re-exported from the crate root.
- `as_slice_mut` was already returning `MappedSliceMut<'_>` before this release. This release adds `Deref<Target = [u8]>` and `DerefMut` impls to it so it can be used as a `&mut [u8]` directly, plus `len()` and `is_empty()` accessors.
Deferred (with documented reason)
- The current `RwLock<MmapMut>` design is sound and bounded; replacing it with `arc-swap` or an `UnsafeCell` design is a memory-model question (do we accept torn reads from concurrent writers on the same mapping?), not a tuning question. Rescoped from "tactical 0.9.x fix" to a 1.0 design conversation.
- `bench-regression.yml` runs the suite and uploads artifacts on every push. The >10% regression threshold check is part of the 0.9.10 pre-1.0 stabilization milestone alongside `cargo-semver-checks` and `cargo-fuzz`.
- `docs/PERFORMANCE.md`: now that the workload-pattern benches exist, the next step is running them on the maintainer's reference machine and publishing measured P50 / P99 numbers per workload. That doc is part of the same 0.9.10 sweep.
- `read_into_async` / `read_slice_async` additions, Tokio version decision.
Full Changelog: v0.9.6...v0.9.7
This discussion was created from the release v0.9.7 — Performance Unleashed.