io-uring-bench

A benchmark tool for answering one question:

When does io_uring actually make file I/O faster — and when is it pure overhead?

The blog post that spawned this code is The Layer Below: io_uring on nazquadri.dev. Read it first if you want the full story — this repo is the code that produced the numbers in that post.

What it does

io-uring-bench implements four I/O strategies and benchmarks them against each other across two very different workload profiles:

Strategies:

  • sync — the classic pread() loop, single thread.
  • threaded — pread() distributed across N worker threads (default: number of CPUs).
  • tokio — tokio's async file I/O (which internally uses a spawn_blocking thread pool).
  • iouring — io_uring with registered buffers, registered file descriptors, IOPOLL, COOP_TASKRUN, and SINGLE_ISSUER — the full optimised path.
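The sync baseline is the simplest to picture. Here is a minimal std-only sketch (not the repo's actual code; the function name and signature are illustrative) of a pread() loop, using read_at, which is pread(2) under the hood:

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt; // read_at == pread(2)

/// Issue one read of `len` bytes at each offset, strictly one at a time.
/// At any instant exactly one read is in flight -- the sync baseline.
fn sync_read_loop(file: &File, offsets: &[u64], len: usize) -> io::Result<u64> {
    let mut buf = vec![0u8; len];
    let mut total = 0u64;
    for &off in offsets {
        let n = file.read_at(&mut buf, off)?; // one syscall, one I/O in flight
        total += n as u64;
    }
    Ok(total)
}
```

Everything else in the strategy list is a different answer to the question "how do we get more than one of these in flight at once?"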

Profiles:

  • buffered — sequential reads through the page cache. Reading cached data is just a memcpy from kernel pages to userspace, so the workload is dominated by memory bandwidth, not I/O. io_uring does not help here. Threaded reads win because they run memcpy on multiple cores in parallel.
  • direct — O_DIRECT random 4K reads across a 10GB working set, bypassing the page cache entirely. This is where io_uring actually wins, by keeping dozens or hundreds of I/O operations in flight on the NVMe hardware queues simultaneously.

The whole point of building this was to see both sides of the story. If you only run a microbenchmark on warm page cache you'll conclude io_uring is worse than pread(). That's what I concluded the first time, and I wasted a weekend reading papers about why before I realised I was benchmarking memory bandwidth, not disk I/O. The lesson is in the post.

The results

100,000 random 4K reads across 10GB on ext4 (real NVMe, not tmpfs), IOPOLL active with nvme poll_queues=16, single thread except where noted:

| Strategy | IOPS | Throughput | vs sync |
|---|---:|---:|---:|
| sync pread (1 thread) | 28,418 | 111 MiB/s | 1.0× |
| threaded pread (1 thread) | 28,720 | 112 MiB/s | 1.0× |
| io_uring qd=1 | 31,796 | 124 MiB/s | 1.1× |
| io_uring qd=4 | 124,320 | 486 MiB/s | 4.4× |
| io_uring qd=16 | 426,580 | 1,666 MiB/s | 15.0× |
| io_uring qd=32 | 752,937 | 2,941 MiB/s | 26.5× |
| io_uring qd=64 | 950,330 | 3,712 MiB/s | 33.4× |
| io_uring qd=128 | 974,495 | 3,807 MiB/s | 34.3× |
| threaded pread (24 cores) | 564,188 | 2,204 MiB/s | 19.9× |

A few things to notice:

io_uring at QD=128 hits ~974K IOPS from a single thread. That's ~34× faster than sync pread, and it beats 24-thread pread by 73% — so io_uring on one core beats pread on twenty-four.

At QD=1, io_uring is barely faster than sync (only ~12% edge, all of it from avoiding interrupts via IOPOLL). The real scaling comes from queue depth: QD=4 is already ~4× faster, QD=32 gives you ~26×, and you hit the NVMe ceiling somewhere between QD=64 and QD=128 where the gains flatten out.

The pread() loop is bottlenecked by one thing only: at any given moment there's exactly one read in flight on the device. The NVMe hardware sits idle between every syscall. io_uring fixes this by keeping the device queue full. That's the entire story.
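You can sanity-check the scaling with Little's law: sustained IOPS ≈ queue depth / per-op latency. The numbers below come from the table above; the arithmetic is mine, not the repo's:

```rust
/// Back-of-envelope Little's law check against the measured table.
/// Returns (implied per-op latency in µs at QD=1, QD=32 scaling efficiency).
fn littles_law_check() -> (f64, f64) {
    let sync_iops = 28_418.0_f64;          // sync pread row of the table
    let latency_us = 1e6 / sync_iops;      // ≈ 35 µs per 4K random read
    let predicted_qd32 = 32.0 * sync_iops; // linear scaling would give ≈ 909K IOPS
    let measured_qd32 = 752_937.0;         // io_uring qd=32 row
    (latency_us, measured_qd32 / predicted_qd32)
}
```

QD=32 lands at roughly 83% of the linear prediction: per-op latency creeps up as the device queue fills, which is also why the gains flatten out between QD=64 and QD=128.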

The counter-intuitive result (buffered profile)

On warm page cache, the same benchmark produces opposite conclusions:

| Workload | Strategy | Result |
|---|---|---|
| Buffered, 10 × 100MB files | threaded pread | ~46 GB/s (multi-core memcpy) |
| Buffered, same workload | sync pread | ~8 GB/s |
| Buffered, same workload | io_uring | ~7 GB/s (slower than sync) |

When the "I/O" is a memcpy from kernel page cache, io_uring's ring management costs more than the syscall it replaces. The device never gets involved, so there's nothing to overlap. Multi-threaded reads win because they saturate multiple memory channels. This is the part of the blog post where I put the files on tmpfs by accident, got confusing numbers for a weekend, and then realised what I'd done.

Moral: don't benchmark io_uring against warm page cache and conclude it's slow. You're measuring memory bandwidth.
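The buffered-profile winner is nothing exotic: pread() fanned out over threads, so the memcpy from page cache runs on several cores at once. A std-only sketch (the chunking scheme here is illustrative, not necessarily the repo's exact division of work):

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;
use std::sync::Arc;
use std::thread;

/// Read `file_len` bytes in `chunk`-sized pread()s, split across `workers`
/// threads. On a warm page cache each read_at is a memcpy from kernel
/// pages, so throughput scales with cores and memory channels, not the disk.
fn threaded_read(file: Arc<File>, file_len: u64, chunk: u64, workers: u64) -> io::Result<u64> {
    let per_worker = (file_len + workers - 1) / workers;
    let handles: Vec<_> = (0..workers)
        .map(|w| {
            let file = Arc::clone(&file);
            thread::spawn(move || -> io::Result<u64> {
                let start = w * per_worker;
                let end = ((w + 1) * per_worker).min(file_len);
                let mut buf = vec![0u8; chunk as usize];
                let mut total = 0u64;
                let mut off = start;
                while off < end {
                    let want = (end - off).min(chunk) as usize;
                    let n = file.read_at(&mut buf[..want], off)?; // pread: no shared cursor, no locking
                    if n == 0 { break; }
                    total += n as u64;
                    off += n as u64;
                }
                Ok(total)
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```

Because pread() takes an explicit offset, the workers never touch a shared file cursor and need no synchronisation beyond joining at the end.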

Prerequisites

For full IOPOLL performance: NVMe poll queues

IOPOLL gives you ~10-15% on top of the regular io_uring path by skipping the kernel's interrupt handler and polling the NVMe completion queue directly. It requires NVMe poll queues to be enabled, and on most Linux distributions they are not by default.

Check your current config:

cat /sys/module/nvme/parameters/poll_queues

If that prints 0, add this to /etc/modprobe.d/nvme.conf:

options nvme poll_queues=16

Then regenerate initramfs (the nvme module loads from initramfs because root is typically on NVMe) and reboot:

sudo update-initramfs -u  # Debian/Ubuntu
# or
sudo dracut -f            # Fedora/RHEL
sudo reboot

After reboot, verify:

cat /sys/module/nvme/parameters/poll_queues
# 16

Without this, IOPOLL silently falls back to interrupt mode and your IOPS numbers drop by an order of magnitude. The benchmark still works, the results are just less interesting.

For the O_DIRECT profile: a real filesystem

O_DIRECT on tmpfs does not do what you think. It measures RAM speed, not storage speed. Do not run the direct profile against /tmp if /tmp is tmpfs on your system — which it is on most modern desktop Linuxes. Point DIRECT_DIR at a path on a real block-device filesystem (ext4/xfs/btrfs on your NVMe).
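A small guard you can run before the direct profile catches this mistake early. DIRECT_DIR is the repo's variable; the findmnt invocation is a common way to resolve the filesystem type of the containing mount, and this is a sketch, not part of bench.sh:

```shell
# Refuse to run the direct profile if DIRECT_DIR resolves to tmpfs.
DIRECT_DIR="${DIRECT_DIR:-$PWD/io-bench-direct}"
mkdir -p "$DIRECT_DIR"
fstype=$(findmnt -no FSTYPE --target "$DIRECT_DIR")
if [ "$fstype" = "tmpfs" ]; then
    echo "refusing: $DIRECT_DIR is on tmpfs (O_DIRECT there measures RAM)" >&2
    exit 1
fi
echo "ok: $DIRECT_DIR is on $fstype"
```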

Building

cargo build --release

Rust 2024 edition. The io-uring crate needs Linux (kernel 5.1+ for basics, 5.19+ for best results, 6.0+ for zero-copy features). Other modes work on any Linux.

Running

Single benchmark:

# Direct profile (the headline workload)
./target/release/io-uring-bench /path/to/nvme/test/files \
    --mode iouring --profile direct \
    --num-reads 100000 --queue-depth 128

# Sync baseline
./target/release/io-uring-bench /path/to/nvme/test/files \
    --mode sync --profile direct --num-reads 100000

# Buffered profile (the "why this is actually slower" workload)
./target/release/io-uring-bench /path/to/test/files \
    --mode iouring --profile buffered

Full suite (creates test data, runs all four sub-benchmarks, cleans up):

./bench.sh

Override paths if your /tmp is tmpfs (it probably is):

DIRECT_DIR=~/nvme-scratch/io-bench-direct ./bench.sh

The bench script will print a warning if it detects DIRECT_DIR is on tmpfs — you want ext4 or xfs on a real NVMe for the direct profile.

Architecture

All logic lives in a single file, src/main.rs. It's intentionally flat: the point is to make each I/O strategy easy to read and compare side-by-side, not to demonstrate modular Rust. The four bench functions for each profile are stacked one after another so you can diff them in your head.

The io_uring path exercises most of the optimisation surface:

  • IORING_SETUP_IOPOLL — polled completions, no interrupt handling
  • IORING_SETUP_SINGLE_ISSUER — single-thread optimisation
  • IORING_SETUP_COOP_TASKRUN — cooperative task run
  • register_files — the set of open file descriptors is registered once
  • register_buffers — aligned read buffers are registered once; subsequent reads use ReadFixed

If IOPOLL ring creation fails (no poll queues), it falls back to a regular ring.

License

MIT — see LICENSE.

Related

  • The Layer Below: io_uring — the blog post this code was built for. Includes the full story, the security caveats, and why RocksDB uses it.
  • ls-alpha — the other experiment from the same series, answering the same question for stat()-heavy workloads instead of file reads.
