A benchmark tool for answering one question: when does `io_uring` actually make file I/O faster — and when is it pure overhead?
The blog post that spawned this code is The Layer Below: io_uring on nazquadri.dev. Read it first if you want the full story — this repo is the code that produced the numbers in that post.
io-uring-bench implements four I/O strategies and benchmarks them against each other across two very different workload profiles:
Strategies:
- `sync` — the classic `pread()` loop, single thread.
- `threaded` — `pread()` distributed across N worker threads (default: number of CPUs).
- `tokio` — tokio's async file I/O (which internally uses a `spawn_blocking` thread pool).
- `iouring` — `io_uring` with registered buffers, registered file descriptors, `IOPOLL`, `COOP_TASKRUN`, and `SINGLE_ISSUER` — the full optimised path.
Profiles:
- `buffered` — sequential reads through the page cache. Reading cached data is just a `memcpy` from kernel pages to userspace, so the workload is dominated by memory bandwidth, not I/O. io_uring does not help here; threaded reads win because they run `memcpy` on multiple cores in parallel.
- `direct` — `O_DIRECT` random 4K reads across a 10GB working set, bypassing the page cache entirely. This is where io_uring actually wins, by keeping dozens or hundreds of I/O operations in flight on the NVMe hardware queues simultaneously.
The whole point of building this was to see both sides of the story. If you only run a microbenchmark on warm page cache you'll conclude io_uring is worse than pread(). That's what I concluded the first time, and I wasted a weekend reading papers about why before I realised I was benchmarking memory bandwidth, not disk I/O. The lesson is in the post.
100,000 random 4K reads across 10GB on ext4 (real NVMe, not tmpfs), IOPOLL active with nvme poll_queues=16, single thread except where noted:
| Strategy | IOPS | Throughput | vs sync |
|---|---|---|---|
| sync pread (1 thread) | 28,418 | 111 MiB/s | 1.0× |
| threaded pread (1 thread) | 28,720 | 112 MiB/s | 1.0× |
| io_uring qd=1 | 31,796 | 124 MiB/s | 1.1× |
| io_uring qd=4 | 124,320 | 486 MiB/s | 4.4× |
| io_uring qd=16 | 426,580 | 1,666 MiB/s | 15.0× |
| io_uring qd=32 | 752,937 | 2,941 MiB/s | 26.5× |
| io_uring qd=64 | 950,330 | 3,712 MiB/s | 33.4× |
| io_uring qd=128 | 974,495 | 3,807 MiB/s | 34.3× |
| threaded pread (24 cores) | 564,188 | 2,204 MiB/s | 19.9× |
A few things to notice:
io_uring at QD=128 hits ~974K IOPS from a single thread. That's ~34× faster than sync pread, and it beats 24-thread pread by 73% — so io_uring on one core beats pread on twenty-four.
At QD=1, io_uring is barely faster than sync (only ~12% edge, all of it from avoiding interrupts via IOPOLL). The real scaling comes from queue depth: QD=4 is already ~4× faster, QD=32 gives you ~26×, and you hit the NVMe ceiling somewhere between QD=64 and QD=128 where the gains flatten out.
The pread() loop is bottlenecked by one thing only: at any given moment there's exactly one read in flight on the device. The NVMe hardware sits idle between every syscall. io_uring fixes this by keeping the device queue full. That's the entire story.
On warm page cache, the same benchmark produces opposite conclusions:
| Workload | Strategy | Result |
|---|---|---|
| Buffered, 10 × 100MB files | threaded pread | ~46 GB/s (multi-core memcpy) |
| Buffered, same workload | sync pread | ~8 GB/s |
| Buffered, same workload | io_uring | ~7 GB/s (slower than sync) |
When the "I/O" is a memcpy from kernel page cache, io_uring's ring management costs more than the syscall it replaces. The device never gets involved, so there's nothing to overlap. Multi-threaded reads win because they saturate multiple memory channels. This is the part of the blog post where I put the files on tmpfs by accident, got confusing numbers for a weekend, and then realised what I'd done.
Moral: don't benchmark io_uring against warm page cache and conclude it's slow. You're measuring memory bandwidth.
IOPOLL gives you ~10-15% on top of the regular io_uring path by skipping the kernel's interrupt handler and polling the NVMe completion queue directly. It requires NVMe poll queues to be enabled, and on most Linux distributions they are not by default.
Check your current config:

```sh
cat /sys/module/nvme/parameters/poll_queues
```

If that prints `0`, add this to `/etc/modprobe.d/nvme.conf`:

```
options nvme poll_queues=16
```
Then regenerate the initramfs (the nvme module loads from the initramfs because root is typically on NVMe) and reboot:

```sh
sudo update-initramfs -u   # Debian/Ubuntu
# or
sudo dracut -f             # Fedora/RHEL
sudo reboot
```

After reboot, verify:

```sh
cat /sys/module/nvme/parameters/poll_queues
# 16
```

Without this, IOPOLL silently falls back to interrupt mode and your IOPS numbers drop by an order of magnitude. The benchmark still works; the results are just less interesting.
`O_DIRECT` on tmpfs does not do what you think: it measures RAM speed, not storage speed. Do not run the direct profile against /tmp if /tmp is tmpfs on your system — which it is on most modern desktop Linuxes. Point `DIRECT_DIR` at a path on a real block-device filesystem (ext4/xfs/btrfs on your NVMe).
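A quick way to see what's backing a path, assuming GNU coreutils (`/tmp` here is just an example target):

```shell
# Print the filesystem type behind a path; "tmpfs" means RAM, not a disk.
# (Note: for ext4, GNU stat reports the type as "ext2/ext3".)
stat -f -c %T /tmp
```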
```sh
cargo build --release
```

Rust 2024 edition. The io-uring crate needs Linux (kernel 5.1+ for basics, 5.19+ for best results, 6.0+ for zero-copy features). The other modes work on any Linux.
Single benchmark:
```sh
# Direct profile (the headline workload)
./target/release/io-uring-bench /path/to/nvme/test/files \
    --mode iouring --profile direct \
    --num-reads 100000 --queue-depth 128

# Sync baseline
./target/release/io-uring-bench /path/to/nvme/test/files \
    --mode sync --profile direct --num-reads 100000

# Buffered profile (the "why this is actually slower" workload)
./target/release/io-uring-bench /path/to/test/files \
    --mode iouring --profile buffered
```

Full suite (creates test data, runs all four sub-benchmarks, cleans up):
```sh
./bench.sh
```

Override paths if your /tmp is tmpfs (it probably is):

```sh
DIRECT_DIR=~/nvme-scratch/io-bench-direct ./bench.sh
```

The bench script will print a warning if it detects `DIRECT_DIR` is on tmpfs — you want ext4 or xfs on a real NVMe for the direct profile.
All logic lives in `src/main.rs` as a single file. It's intentionally flat: the point is to make each I/O strategy easy to read and compare side by side, not to demonstrate modular Rust. The bench functions for the four strategies are stacked one after another so you can diff them in your head.
The io_uring path exercises most of the optimisation surface:
- `IORING_SETUP_IOPOLL` — polled completions, no interrupt handling
- `IORING_SETUP_SINGLE_ISSUER` — single-thread optimisation
- `IORING_SETUP_COOP_TASKRUN` — cooperative task running
- `register_files` — the set of open file descriptors is registered once
- `register_buffers` — aligned read buffers are registered once; subsequent reads use `ReadFixed`
If IOPOLL ring creation fails (no poll queues), it falls back to a regular ring.
MIT — see LICENSE.
- The Layer Below: io_uring — the blog post this code was built for. Includes the full story, the security caveats, and why RocksDB uses it.
- ls-alpha — the other experiment from the same series, answering the same question for `stat()`-heavy workloads instead of file reads.