Skip to content

Enumerates very, very large directories quickly by directly using kernel syscalls. For POSIX and Windows. WARNING THIS IS OBSOLETE. USE BOOST.AFIO INSTEAD.

Notifications You must be signed in to change notification settings

ned14/FastDirectoryEnumerator

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 

Repository files navigation

FastDirectoryEnumerator
-=-=-=-=-=-=-=-=-=-=-=-
(C) 2013 Niall Douglas

WARNING: Use the code at https://github.com/BoostGSoC13/boost.afio/blob/master/include/boost/afio/detail/impl/afio_iocp.ipp#L614
in preference to this code. This code was used to prototype the AFIO implementation. This code has known bugs, the
AFIO code is thought to be bug free and is regularly tested by a CI.




Herein lies C++11 code which enumerates and stat()'s very, very large directories quickly
by directly using kernel syscalls. For POSIX and Windows.

Differences from struct stat:

* st_type contains the S_IF* flags part of the st_mode field only. st_mode is the same as stat.
* st_allocated is st_blksize multiplied by st_blocks.

On POSIX directly uses the getdents() syscall (http://man7.org/linux/man-pages/man2/getdents.2.html).
This syscall returns the leafname, st_ino and st_type fields only.

On Windows directly uses the NtQueryDirectoryFile() syscall (http://msdn.microsoft.com/en-us/library/windows/hardware/ff567047(v=vs.85).aspx).
This syscall returns the leafname and the following stat fields:

* st_ino (yes NTFS does provide a unique POSIX compliant inode number at the kernel level, it's 64 bits though)
* st_type
* st_atimespec
* st_mtimespec
* st_ctimespec (yes, NTFS also provides this at the kernel level)
* st_size
* st_allocated (this is st_blksize * st_blocks)
* st_birthtimespec

Therefore fields missing from enumeration are:

* st_nlink
* st_blocks
* st_blksize

Fetching these invoke a slow code path: Windows gives you no choice but to open a handle
to the file being queried, something which is very slow on NT. Expect a max of 30k slow
path queries per second as Windows can only open 30k HANDLEs per second.

There is something else you should be aware of: class directory_entry optimises same-directory
metadata lookups by caching per-thread a handle to the containing directory. Given that
metadata lookups tend to be from the same directory, and assuming that none of the slow
code path fields above have been requested, directory_entry can fetch metadata by doing
a globbed enumeration instead, yielding about 250k lookups per second max. As soon as you
change contained directory, or ask for any slow code path fields, expect 30k lookups per
second max.

Some quick benchmarks:
----------------------

Intel Core i7-3770K CPU @ 3.50Ghz:

Windows 8 x64, NTFS: approx 1.8m entries enumerated per second max (i.e. from RAM cache)
with common stat fields listed above. Makes little difference if SSD or conventional magnetic.
Note that setting namesonly=true is actually a bit slower, only 1.5m per second max. I assume
this is due to an less optimised code path.

Linux 3.2 x64, ext4: approx 2.5m entries enumerated per second max (i.e. from RAM cache)
with st_ino and st_type fields only on SSD. lstat() can do 1.1m entries per sec max to
fill in the remaining fields also on SSD.

About

Enumerates very, very large directories quickly by directly using kernel syscalls. For POSIX and Windows. WARNING THIS IS OBSOLETE. USE BOOST.AFIO INSTEAD.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages