Skip to content
/ diskorder Public

read files in physical order to reduce hard drive seeks

Notifications You must be signed in to change notification settings

nh2/diskorder

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 

Repository files navigation

diskorder - read files in physical order to reduce hard drive seeks

Reading files in physical on-disk order can significantly reduce hard drive seeks for spinning disks, and thus improve performance.

Usage

Given some files (as arguments or stdin), diskorder prints them sorted by physical order.

Assuming your current directory contains hard.txt and drives.txt:

$ ./diskorder.py hard.txt drives.txt
drives.txt
hard.txt

You can also pipe them in:

$ ls -1 | ./diskorder.py
diskorder.py
drives.txt
hard.txt

Use this with tar, rsync etc to get a significant reduction of seeks and thus speedup for small files or on disk-busy systems.

On an otherwise idle system, this pays off when your files are smaller than yourHDspeedPerSecond / yourHdSpeedLatency, so e.g. smaller than 160 MB/s * 8 ms = 1.2 MB for current generation hard drives.

Of course if your disk is not idle, it reduced seeks pay off even when your files are much smaller.

How it works

  1. Get list of all files involved
  2. Sort by physical address using the FIEMAP (file extent map) Linux ioctl
  3. Access files in that order

Performance can be optionally be improved further by combining this approach with readahead().

Example benchmark

On an Ubuntu 16.04 system:

$ du -sh /usr/lib
345M    /usr/lib

$ find /usr/lib | wc -l
8198

$ find /usr/lib -type f                  > files-in-find-order.txt
$ find /usr/lib -type f | ./diskorder.py > files-in-disk-order.txt

$ echo 3 > /proc/sys/vm/drop_caches # drop filesystem caches

$ time tar -c -T files-in-find-order.txt | cat > /dev/null
real    0m32.608s
user    0m0.088s
sys     0m1.048s

$ echo 3 > /proc/sys/vm/drop_caches # drop filesystem caches

$ time tar -c -T files-in-disk-order.txt | cat > /dev/null
ad7ac10cc07f3875dd493df33da3a4d5  -
real    0m15.621s
user    0m0.096s
sys     0m0.908s

Note we had to pipe tar | cat, otherwise tar is too smart and is much faster (cheating).

Summary:

tar default        32 seconds
tar in disk order  16 seconds

~2x speedup in this case.

Caveats

  • Only works on Linux.
  • Only tested with EXT4 so far.
  • Probably doens't work for file systems that don't have extent or don't support FIEMAP.

TODO

  • Add warning when not used on spinning disks using the BLKROTATIONAL ioctl.
  • Use O(n) integer sort for sorting physical addresses.
  • Query only the first extent (fm_extent_count = 1) for a small performance improvement of diskorder.py.
  • Write it in a faster language than Python. Python isn't a bottleneck as soon as the disk is involved (e.g. time ./diskorder.py < files-in-find-order.txt > /dev/null from the benchmark above runs in real 0.314s, user 0.276s, sys 0.036s which is negligible), but it's still slower than it could be.

Acknowledgements

Similar work

  • platter-walk - Rust library for HDD-aware directory traversal
  • coretuils rm -r and other coreutils programs sort files by inode order before operating on them
    • However, I found that rm -r is still quadratic
  • The authors of the HTree method that's used in ext4 write:

    [...] it could cause some performance regressions for workloads that used readdir() to perform some operation of all of the files in a large directory [...] This performance regression can be easily fixed by modifying applications to sort the directory entries returned by readdir() by inode number

About

read files in physical order to reduce hard drive seeks

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages