perf(layout): drop Path.resolve() from data-loading hot path (11.9x) by mohitgargai · Pull Request #11 · lica-world/GDB

mohitgargai · 2026-04-24T14:14:56Z

Summary

The four _resolve_manifest_* helpers in src/gdb/tasks/layout.py used pathlib.Path.resolve() to canonicalize each manifest-relative path, which walks every path component with lstat to detect symlinks. The GDB dataset tree contains zero symlinks, so the entire realpath walk is pure overhead.

For layout-2 (2,044 samples × ~25 path probes/sample) that meant ~670,000 lstat syscalls and 18 minutes just to build the samples list. This dominated upload time in scripts/upload_to_hf.py and made gdb eval --dataset-root painful for anything in the PartialLayoutCompletion family.

Fix: replace .resolve() + pathlib.Path.is_file() with os.path.join + os.path.isfile. One stat per probe instead of ~14 lstats.

Measurements (full `layout-2.load_data()`, 2,044 samples)

	before	after	ratio
wall time	1080.6 s	90.8 s	11.9x
lstat	~670,000	15	~44,000x
stat	~190,000	45,377	4.2x

Other layout benchmarks were already sub-second (they use different manifest code paths); layout-2 was the pathological case.

Correctness

Byte-identical samples: snapshotted 100 samples from the unmodified code, compared to the fixed code via json.dumps(..., sort_keys=True). Match.
Stub eval on layout-2/3/8: gdb eval --stub-model --benchmarks layout-2 layout-3 layout-8 --n 2 runs with 0 failures; all metrics compute.
ruff check clean, pytest tests/ -q → 42/42 passing.

Why this is safe

find data/gdb-dataset -type l returns 0 — no symlinks to resolve.
Manifest paths don't contain .. / . segments (they come from a generator), so os.path.normpath isn't needed.
The root argument passed in is already absolute and resolved at load_data() entry (root = Path(data_dir).resolve()).

Test plan

ruff check src/ scripts/ tests/
pytest tests/ -q
Full layout-2.load_data() wall-clock timing + syscall count
Byte-identical sample diff (100 samples, pre/post)
Stub-model eval for layout-2, layout-3, layout-8
CI green

Made with Cursor

The four _resolve_manifest_* helpers used pathlib's `.resolve()` to canonicalize each manifest-relative path, which walks every path component with `lstat` to detect symlinks. The GDB dataset tree contains zero symlinks, so the entire realpath walk is pure overhead. For layout-2 (2,044 samples, 25 path probes/sample) that meant ~670,000 `lstat` syscalls and 18 minutes just to build the samples list. This dominated upload time in scripts/upload_to_hf.py and made `gdb eval` against --dataset-root painful for anything in the PartialLayoutCompletion family. Replace `.resolve()` + `pathlib.Path.is_file()` with `os.path.join` and `os.path.isfile`. One stat per probe instead of ~14 lstats. Measured on the full 2,044-sample layout-2 load_data(): before after ratio wall time 1080.6 s 90.8 s 11.9x lstat ~670,000 15 ~44,000x stat ~190,000 45,377 4.2x Output is byte-identical (diffed against a pre-change snapshot, 100 samples, json.dumps sort_keys). Other layout benchmarks were already sub-second (they use different manifest code paths); layout-2 was the pathological case. Made-with: Cursor

Follow-up on the perf fix: switching wholesale to ``os.path.*`` was inconsistent with the rest of this file. Only ``.resolve()`` was the problem (it realpaths every path component); ``is_file()`` / ``is_dir()`` are single-stat operations, no need to replace them. Also drops the ``sample_dir.is_dir()`` pre-check in ``_resolve_component_asset`` — ``is_file`` on a non-existent dir is still a single stat (ENOENT), same net cost, simpler code. Timing on full layout-2 load_data() is 99.6 s vs 90.8 s for the pure ``os.path`` version (11% slower, pathlib boxing per stat) — still an ~11x improvement over the 1080.6 s baseline, and the code reads like the rest of the module. Verified: byte-identical samples, 42/42 tests pass, ruff clean. Made-with: Cursor

mohitgargai requested a review from purvanshi as a code owner April 24, 2026 14:14

mohitgargai merged commit 7d0b055 into main Apr 24, 2026
12 checks passed

mohitgargai deleted the perf/layout-load-data-stat-storm branch April 24, 2026 16:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(layout): drop Path.resolve() from data-loading hot path (11.9x)#11

perf(layout): drop Path.resolve() from data-loading hot path (11.9x)#11
mohitgargai merged 2 commits intomainfrom
perf/layout-load-data-stat-storm

mohitgargai commented Apr 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mohitgargai commented Apr 24, 2026

Summary

Measurements (full layout-2.load_data(), 2,044 samples)

Correctness

Why this is safe

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Measurements (full `layout-2.load_data()`, 2,044 samples)