perf(layout): drop Path.resolve() from data-loading hot path (11.9x) #11
Merged
mohitgargai merged 2 commits into main on Apr 24, 2026
The four _resolve_manifest_* helpers used pathlib's `.resolve()` to
canonicalize each manifest-relative path, which walks every path
component with `lstat` to detect symlinks. The GDB dataset tree contains
zero symlinks, so the entire realpath walk is pure overhead.
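The "zero symlinks" precondition can be re-checked before relying on a non-canonicalizing join. The following is a minimal sketch, not code from this PR: `count_symlinks` and `has_dot_segments` are hypothetical helper names, and the second check is included on the assumption that skipping `.resolve()` also skips `.`/`..` normalization.

```python
import os


def count_symlinks(tree: str) -> int:
    """Count symlinks under `tree` without following them.

    Roughly what `find <tree> -type l | wc -l` reports (the root itself
    is not examined here).
    """
    count = 0
    for dirpath, dirnames, filenames in os.walk(tree, followlinks=False):
        # Symlinks to directories show up in dirnames, all others in filenames.
        for name in dirnames + filenames:
            if os.path.islink(os.path.join(dirpath, name)):
                count += 1
    return count


def has_dot_segments(rel_path: str) -> bool:
    """Return True if a manifest-relative path contains a `.` or `..` component."""
    parts = rel_path.replace("\\", "/").split("/")
    return any(part in (".", "..") for part in parts)
```

If `count_symlinks` returns 0 and no manifest entry has dot segments, a plain join under an already-resolved root yields the same path that `.resolve()` would.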
For layout-2 (2,044 samples, 25 path probes/sample) that meant ~670,000
`lstat` syscalls and 18 minutes just to build the samples list. This
dominated upload time in scripts/upload_to_hf.py and made `gdb eval`
against --dataset-root painful for anything in the PartialLayoutCompletion
family.
Replace `.resolve()` + `pathlib.Path.is_file()` with `os.path.join` and
`os.path.isfile`. One stat per probe instead of ~14 lstats.
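As an illustration of the change described above, here is a hedged before/after sketch of a single path probe. The function names and signatures are hypothetical; the real `_resolve_manifest_*` helpers are not shown in this PR text and may differ.

```python
import os
from pathlib import Path
from typing import Optional


def probe_with_resolve(root: Path, rel: str) -> Optional[Path]:
    """Old pattern (illustrative): canonicalize, then test for a regular file."""
    # .resolve() performs a realpath walk over the path components.
    candidate = (root / rel).resolve()
    return candidate if candidate.is_file() else None


def probe_with_join(root: str, rel: str) -> Optional[str]:
    """New pattern (illustrative): plain join, then one existence check."""
    # No canonicalization: assumes `root` is already absolute and resolved,
    # and that the tree has no symlinks or dot segments.
    candidate = os.path.join(root, rel)
    return candidate if os.path.isfile(candidate) else None
```

Under those assumptions both functions identify the same file; only the number of filesystem calls per probe differs.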
Measured on the full 2,044-sample layout-2 load_data():
             before      after     ratio
wall time    1080.6 s    90.8 s    11.9x
lstat        ~670,000    15        ~44,000x
stat         ~190,000    45,377    4.2x
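For reference, a wall-time comparison like the one above can be taken with a small harness. This is a sketch under stated assumptions: `timed` and `speedup` are hypothetical helpers, and the syscall counts in the table are not captured by it (those need an external tracer).

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call `fn` once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def speedup(before_s: float, after_s: float) -> float:
    """Ratio of the old wall time to the new wall time."""
    return before_s / after_s


# The reported figures: 1080.6 s before, 90.8 s after.
print(round(speedup(1080.6, 90.8), 1))  # 11.9
```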
Output is byte-identical (diffed against a pre-change snapshot, 100
samples, json.dumps sort_keys).
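A comparison of this kind could look like the sketch below. It is an assumption about the procedure, not the script actually used: `canonical_bytes` and `snapshots_match` are hypothetical names, and `default=str` is added here only so that non-JSON values such as `Path` objects serialize.

```python
import json
from typing import Any, List


def canonical_bytes(samples: List[Any]) -> bytes:
    """Serialize samples deterministically so dict key order cannot cause a diff."""
    # default=str is an assumption: it stringifies values json cannot encode.
    return json.dumps(samples, sort_keys=True, default=str).encode("utf-8")


def snapshots_match(before: List[Any], after: List[Any]) -> bool:
    """True when both sample lists serialize to identical bytes."""
    return canonical_bytes(before) == canonical_bytes(after)
```

Sorting keys matters because two loaders can build equal dicts in different insertion order; without `sort_keys=True` the serialized bytes could differ even though the samples are the same.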
Other layout benchmarks were already sub-second (they use different
manifest code paths); layout-2 was the pathological case.
Made-with: Cursor
Follow-up on the perf fix: switching wholesale to ``os.path.*`` was inconsistent with the rest of this file. Only ``.resolve()`` was the problem (it realpaths every path component); ``is_file()`` / ``is_dir()`` are single-stat operations, so there is no need to replace them.

Also drops the ``sample_dir.is_dir()`` pre-check in ``_resolve_component_asset``: ``is_file`` on a path inside a non-existent dir is still a single stat (ENOENT), so the net cost is the same and the code is simpler.

Timing on full layout-2 load_data() is 99.6 s vs 90.8 s for the pure ``os.path`` version (about 10% slower, pathlib boxing per stat). That is still an ~11x improvement over the 1080.6 s baseline, and the code reads like the rest of the module.

Verified: byte-identical samples, 42/42 tests pass, ruff clean.

Made-with: Cursor
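The shape of the follow-up can be sketched as below. The function name, parameters, and directory layout are hypothetical stand-ins; the real ``_resolve_component_asset`` signature is not shown in this PR text.

```python
from pathlib import Path
from typing import Optional


def resolve_component_asset_sketch(root: Path, sample_id: str, name: str) -> Optional[Path]:
    """Illustrative pathlib-only probe: no .resolve(), no is_dir() pre-check."""
    # The / operator is pure path arithmetic and touches no filesystem state.
    candidate = root / sample_id / name
    # is_file() returns False (rather than raising) when the file or its
    # parent directory does not exist, so a separate directory check adds nothing.
    return candidate if candidate.is_file() else None
```

This keeps the module on ``pathlib`` throughout while still avoiding the realpath walk.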
Summary

The four `_resolve_manifest_*` helpers in `src/gdb/tasks/layout.py` used `pathlib.Path.resolve()` to canonicalize each manifest-relative path, which walks every path component with `lstat` to detect symlinks. The GDB dataset tree contains zero symlinks, so the entire realpath walk is pure overhead.

For `layout-2` (2,044 samples × ~25 path probes/sample) that meant ~670,000 `lstat` syscalls and 18 minutes just to build the samples list. This dominated upload time in `scripts/upload_to_hf.py` and made `gdb eval --dataset-root` painful for anything in the `PartialLayoutCompletion` family.

Fix: replace `.resolve()` + `pathlib.Path.is_file()` with `os.path.join` + `os.path.isfile`. One stat per probe instead of ~14 lstats.

Measurements (full `layout-2.load_data()`, 2,044 samples)

| | before | after | ratio |
|---|---|---|---|
| wall time | 1080.6 s | 90.8 s | 11.9x |
| lstat | ~670,000 | 15 | ~44,000x |
| stat | ~190,000 | 45,377 | 4.2x |

Other layout benchmarks were already sub-second (they use different manifest code paths); `layout-2` was the pathological case.

Correctness

- Output diffed against a pre-change snapshot (100 samples) using `json.dumps(..., sort_keys=True)`. Match.
- `gdb eval --stub-model --benchmarks layout-2 layout-3 layout-8 --n 2` runs with 0 failures; all metrics compute.
- `ruff check` clean, `pytest tests/ -q` → 42/42 passing.

Why this is safe

- `find data/gdb-dataset -type l` returns 0: no symlinks to resolve.
- Manifest paths contain no `..`/`.` segments (they come from a generator), so `os.path.normpath` isn't needed.
- The `root` argument passed in is already absolute and resolved at `load_data()` entry (`root = Path(data_dir).resolve()`).

Test plan

- `ruff check src/ scripts/ tests/`
- `pytest tests/ -q`
- `layout-2.load_data()` wall-clock timing + syscall count

Made with Cursor