dscan is a concurrent directory scanner for Python 3.12+. It wraps os.scandir in a thread pool with a work-stealing queue, exposing a filtering API that covers most of what you'd otherwise implement by hand on top of os.walk.
Two modes: scan_entries yields raw os.DirEntry objects with minimal overhead; scan yields dataclass models with pre-computed metadata.
On a local SSD, directory traversal is fast enough that threading adds more overhead than it saves. scan_entries still matches or edges out os.walk, but the real case for concurrency is network-attached storage.
On SMB shares, NFS mounts, or any high-latency filesystem, each scandir call blocks waiting for a server response. os.walk does this serially — one directory at a time. dscan keeps multiple directories in-flight simultaneously, so workers aren't sitting idle while the network responds. On deep trees with many subdirectories, this compounds significantly.
On Windows, the underlying FindNextFile API returns full file metadata — including size and timestamps — in the same call as the directory listing. This means DirEntry.stat() is effectively free; no additional syscalls are needed to populate a FileEntry model.
This makes `scan()`'s model mode significantly more efficient on Windows than on Linux or macOS, where `stat` requires a separate syscall per entry. The structured output you get from `scan()` comes at almost no extra cost over `scan_entries`.
Combined with the concurrency win on high-latency mounts, Windows users scanning SMB network shares or mapped corporate drives get the best of both worlds: concurrent traversal and rich metadata at near-zero overhead. This is the scenario where dscan provides the clearest, most measurable improvement over os.walk.
Recommended for:
- Corporate environments with large SMB file servers
- NAS devices accessed over Windows network shares
- Any mapped drive with deep directory trees
Tuning for high-latency mounts:
```python
# Increase workers to match network latency
for entry in scan("//fileserver/share", max_workers=32):
    print(entry.path)
```

Benchmark results:

| | entries | time |
|---|---|---|
| `os.walk` (no stat) | 4,046,505 | 33.30s |
| `os.walk` (+ stat) | 4,039,313 | 85.24s |
| `dscan.scan_entries` | 4,046,502 | 31.90s |
| `dscan.scan` (models) | 4,014,758 | 140.15s |
scan_entries is on par with bare os.walk. scan is slower because stat calls happen on the main thread serially — the workers parallelise scandir, not stat. Use scan when you want the structured output; use scan_entries when throughput matters.
Note: This benchmark was run on macOS, where `stat` requires a separate syscall per entry. On Windows, `scan()` performance is substantially better due to `FindNextFile` bundling metadata. See the Windows + SMB section above.
High-latency simulation:

```python
# rough simulation
import time, os

_real = os.scandir
os.scandir = lambda p: (time.sleep(0.005), _real(p))[1]
```

| | time |
|---|---|
| `os.walk` | ~linear with directory count |
| `dscan.scan_entries` | scales with `max_workers` |
At 5ms latency per directory, a tree with 10,000 directories takes ~50s serially. With 16 workers dscan brings that to ~4s. The deeper and wider the tree, the bigger the difference.
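The arithmetic behind those numbers, as a back-of-envelope model (idealised: it ignores scheduling overhead and assumes the tree is wide enough to keep every worker busy):

```python
def traversal_time(n_dirs, latency_s, workers=1):
    """Idealised traversal time: total blocked time divided across workers."""
    return n_dirs * latency_s / workers

print(traversal_time(10_000, 0.005))      # 50.0 seconds, serial
print(traversal_time(10_000, 0.005, 16))  # 3.125 seconds ideal; ~4s with overhead
```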
```shell
pip install dscanpy
```

Requires Python 3.12+. No other dependencies.
```python
from dscan import scan

for entry in scan("."):
    print(f"{entry.name} - {entry.path}")
```

```python
from dscan import scan_entries

for entry in scan_entries("~/Documents", max_depth=2):
    if entry.is_file():
        print(entry.name)
```

```python
# Only Python and Markdown files
for file in scan(".", extensions={".py", ".md"}):
    print(file.path)

# Skip compiled files
for file in scan(".", ignore_extensions={".bin", ".exe"}):
    print(file.path)
```

```python
# Only test files
for entry in scan(".", match="test_*"):
    print(entry.name)

# Skip hidden files and directories
for entry in scan(".", ignore_pattern=".*"):
    print(entry.name)
```

```python
# Immediate children only
for entry in scan(".", max_depth=0):
    print(entry.name)

# Only descend into src/ and lib/
for entry in scan(".", only_dirs=["src", "lib"]):
    print(entry.path)

# Skip specific directories
# (.git, .idea, .venv, __pycache__ are skipped by default)
for entry in scan(".", ignore_dirs=["node_modules", "dist"]):
    print(entry.path)

# Disable all default ignores
for entry in scan(".", ignore_dirs=[]):
    print(entry.path)
```

```python
def is_large_file(entry):
    return entry.is_file() and entry.stat().st_size > 1_000_000

for entry in scan(".", custom_filter=is_large_file):
    print(entry.name)
```

```python
# default is min(32, cpu_count * 2)
# increase on high-latency mounts
for entry in scan_entries("/mnt/nas", max_workers=32):
    print(entry.path)
```

`scan()` returns `FileEntry` or `DirectoryEntry` dataclasses.
FileEntry:

| field | description |
|---|---|
| `name` | filename without extension |
| `extension` | lowercase extension, no leading dot |
| `path` | full path |
| `dir_path` | containing directory |
| `size` | size in bytes |
| `created_at` | creation time (`datetime`) |
| `modified_at` | modification time (`datetime`) |
DirectoryEntry:

| field | description |
|---|---|
| `name` | directory name |
| `path` | full path |
| `parent_path` | parent directory |
| `created_at` | creation time (`datetime`) |
| `modified_at` | modification time (`datetime`) |
| | `os.walk` | `pathlib.rglob` | `dscan` |
|---|---|---|---|
| Concurrent traversal | No | No | Yes |
| Built-in models | No | No | Yes |
| Depth limit | Manual | No | Yes |
| Directory exclusions | Manual | No | Yes |
Planned:

- Move `stat` into workers — on Linux/macOS over NFS or high-latency mounts, `stat` is a separate network round-trip per entry, just like `scandir`. Running `stat` inside the worker threads would let latency overlap across concurrent workers, significantly improving `scan()` model performance on those platforms.
- `getattrlistbulk` support (macOS) — macOS exposes a syscall that returns full file attributes (including size and timestamps) for all entries in a single directory call, equivalent to what Windows gets from `FindNextFile`. Implementing this would bring `scan()` performance on local macOS disk in line with Windows, and close the current gap between `scan()` and `scan_entries()` shown in the benchmarks above.
MIT