A benchmark tool for answering one question: when does `io_uring` actually make file I/O faster — and when is it pure overhead?
The blog post that spawned this code is The Layer Below: io_uring on nazquadri.dev. Read it first if you want the full story — this repo is the code that produced the numbers in that post.
io-uring-bench implements four I/O strategies and benchmarks them against each other across two very different workload profiles:
Strategies:
- `sync` — the classic `pread()` loop, single thread.
- `threaded` — `pread()` distributed across N worker threads (default: number of CPUs).
- `tokio` — tokio's async file I/O (which internally uses a `spawn_blocking` thread pool).
- `iouring` — `io_uring` with registered buffers, registered file descriptors, `IOPOLL`, `COOP_TASKRUN`, and `SINGLE_ISSUER` — the full optimised path.
Profiles:
- `buffered` — sequential reads through the page cache. Reading cached data is just a `memcpy` from kernel pages to userspace, so the workload is dominated by memory bandwidth, not I/O. io_uring does not help here; threaded reads win because they run `memcpy` on multiple cores in parallel.
- `direct` — `O_DIRECT` random 4K reads across a 10GB working set, bypassing the page cache entirely. This is where io_uring actually wins, by keeping dozens or hundreds of I/O operations in flight on the NVMe hardware queues simultaneously.
The whole point of building this was to see both sides of the story. If you only run a microbenchmark on warm page cache you'll conclude io_uring is worse than pread(). That's what I concluded the first time, and I wasted a weekend reading papers about why before I realised I was benchmarking memory bandwidth, not disk I/O. The lesson is in the post.
100,000 random 4K reads across 10GB on ext4 (real NVMe, not tmpfs), IOPOLL active with nvme poll_queues=16, single thread except where noted:
| Strategy | IOPS | Throughput | vs sync |
|---|---|---|---|
| sync pread (1 thread) | 28,418 | 111 MiB/s | 1.0× |
| threaded pread (1 thread) | 28,720 | 112 MiB/s | 1.0× |
| io_uring qd=1 | 31,796 | 124 MiB/s | 1.1× |
| io_uring qd=4 | 124,320 | 486 MiB/s | 4.4× |
| io_uring qd=16 | 426,580 | 1,666 MiB/s | 15.0× |
| io_uring qd=32 | 752,937 | 2,941 MiB/s | 26.5× |
| io_uring qd=64 | 950,330 | 3,712 MiB/s | 33.4× |
| io_uring qd=128 | 974,495 | 3,807 MiB/s | 34.3× |
| threaded pread (24 cores) | 564,188 | 2,204 MiB/s | 19.9× |
A few things to notice:
io_uring at QD=128 hits ~974K IOPS from a single thread. That's ~34× faster than sync pread, and it beats 24-thread pread by 73% — so io_uring on one core beats pread on twenty-four.
At QD=1, io_uring is barely faster than sync (only ~12% edge, all of it from avoiding interrupts via IOPOLL). The real scaling comes from queue depth: QD=4 is already ~4× faster, QD=32 gives you ~26×, and you hit the NVMe ceiling somewhere between QD=64 and QD=128 where the gains flatten out.
The pread() loop is bottlenecked by one thing only: at any given moment there's exactly one read in flight on the device. The NVMe hardware sits idle between every syscall. io_uring fixes this by keeping the device queue full. That's the entire story.
On warm page cache, the same benchmark produces opposite conclusions:
| Workload | Strategy | Result |
|---|---|---|
| Buffered, 10 × 100MB files | threaded pread | ~46 GB/s (multi-core memcpy) |
| Buffered, same workload | sync pread | ~8 GB/s |
| Buffered, same workload | io_uring | ~7 GB/s (slower than sync) |
When the "I/O" is a memcpy from kernel page cache, io_uring's ring management costs more than the syscall it replaces. The device never gets involved, so there's nothing to overlap. Multi-threaded reads win because they saturate multiple memory channels. This is the part of the blog post where I put the files on tmpfs by accident, got confusing numbers for a weekend, and then realised what I'd done.
Moral: don't benchmark io_uring against warm page cache and conclude it's slow. You're measuring memory bandwidth.
IOPOLL gives you ~10-15% on top of the regular io_uring path by skipping the kernel's interrupt handler and polling the NVMe completion queue directly. It requires NVMe poll queues to be enabled, and on most Linux distributions they are not by default.
Check your current config:

```sh
cat /sys/module/nvme/parameters/poll_queues
```

If that prints `0`, add this to `/etc/modprobe.d/nvme.conf`:

```
options nvme poll_queues=16
```
Then regenerate the initramfs (the nvme module loads from the initramfs because root is typically on NVMe) and reboot:

```sh
sudo update-initramfs -u   # Debian/Ubuntu
# or
sudo dracut -f             # Fedora/RHEL
sudo reboot
```

After reboot, verify:

```sh
cat /sys/module/nvme/parameters/poll_queues
# 16
```

Without this, IOPOLL silently falls back to interrupt mode and your IOPS numbers drop by an order of magnitude. The benchmark still works; the results are just less interesting.
`O_DIRECT` on tmpfs does not do what you think: it measures RAM speed, not storage speed. Do not run the direct profile against /tmp if /tmp is tmpfs on your system — which it is on most modern desktop Linuxes. Point `DIRECT_DIR` at a path on a real block-device filesystem (ext4/xfs/btrfs on your NVMe).
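A quick way to see what's backing a path, assuming GNU coreutils (`/tmp` here is just an example target):

```shell
# Print the filesystem type behind a path; "tmpfs" means RAM, not a disk.
# (Note: for ext4, GNU stat reports the type as "ext2/ext3".)
stat -f -c %T /tmp
```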
```sh
cargo build --release
```

Rust 2024 edition. The io-uring crate needs Linux (kernel 5.1+ for basics, 5.19+ for best results, 6.0+ for zero-copy features). The other modes work on any Linux.
Single benchmark:
```sh
# Direct profile (the headline workload)
./target/release/io-uring-bench /path/to/nvme/test/files \
    --mode iouring --profile direct \
    --num-reads 100000 --queue-depth 128

# Sync baseline
./target/release/io-uring-bench /path/to/nvme/test/files \
    --mode sync --profile direct --num-reads 100000

# Buffered profile (the "why this is actually slower" workload)
./target/release/io-uring-bench /path/to/test/files \
    --mode iouring --profile buffered
```

Full suite (creates test data, runs all four sub-benchmarks, cleans up):
```sh
./bench.sh
```

Override paths if your /tmp is tmpfs (it probably is):

```sh
DIRECT_DIR=~/nvme-scratch/io-bench-direct ./bench.sh
```

The bench script will print a warning if it detects `DIRECT_DIR` is on tmpfs — you want ext4 or xfs on a real NVMe for the direct profile.
All logic lives in `src/main.rs` as a single file. It's intentionally flat: the point is to make each I/O strategy easy to read and compare side by side, not to demonstrate modular Rust. The bench functions for the four strategies are stacked one after another so you can diff them in your head.
The io_uring path exercises most of the optimisation surface:
- `IORING_SETUP_IOPOLL` — polled completions, no interrupt handling
- `IORING_SETUP_SINGLE_ISSUER` — single-thread optimisation
- `IORING_SETUP_COOP_TASKRUN` — cooperative task running
- `register_files` — the set of open file descriptors is registered once
- `register_buffers` — aligned read buffers are registered once; subsequent reads use `ReadFixed`
If IOPOLL ring creation fails (no poll queues), it falls back to a regular ring.
MIT — see LICENSE.
- The Layer Below: io_uring — the blog post this code was built for. Includes the full story, the security caveats, and why RocksDB uses it.
- ls-alpha — the other experiment from the same series, answering the same question for `stat()`-heavy workloads instead of file reads.