Skip to content

perf(layout): drop Path.resolve() from data-loading hot path (11.9x)#11

Merged
mohitgargai merged 2 commits intomainfrom
perf/layout-load-data-stat-storm
Apr 24, 2026
Merged

perf(layout): drop Path.resolve() from data-loading hot path (11.9x)#11
mohitgargai merged 2 commits intomainfrom
perf/layout-load-data-stat-storm

Conversation

@mohitgargai
Copy link
Copy Markdown
Contributor

Summary

The four _resolve_manifest_* helpers in src/gdb/tasks/layout.py used pathlib.Path.resolve() to canonicalize each manifest-relative path, which walks every path component with lstat to detect symlinks. The GDB dataset tree contains zero symlinks, so the entire realpath walk is pure overhead.

For layout-2 (2,044 samples × ~25 path probes/sample) that meant ~670,000 lstat syscalls and 18 minutes just to build the samples list. This dominated upload time in scripts/upload_to_hf.py and made gdb eval --dataset-root painful for anything in the PartialLayoutCompletion family.

Fix: replace .resolve() + pathlib.Path.is_file() with os.path.join + os.path.isfile. One stat per probe instead of ~14 lstats.

Measurements (full layout-2.load_data(), 2,044 samples)

before after ratio
wall time 1080.6 s 90.8 s 11.9x
lstat ~670,000 15 ~44,000x
stat ~190,000 45,377 4.2x

Other layout benchmarks were already sub-second (they use different manifest code paths); layout-2 was the pathological case.

Correctness

  • Byte-identical samples: snapshotted 100 samples from the unmodified code, compared to the fixed code via json.dumps(..., sort_keys=True). Match.
  • Stub eval on layout-2/3/8: gdb eval --stub-model --benchmarks layout-2 layout-3 layout-8 --n 2 runs with 0 failures; all metrics compute.
  • ruff check clean, pytest tests/ -q → 42/42 passing.

Why this is safe

  • find data/gdb-dataset -type l returns 0 — no symlinks to resolve.
  • Manifest paths don't contain .. / . segments (they come from a generator), so os.path.normpath isn't needed.
  • The root argument passed in is already absolute and resolved at load_data() entry (root = Path(data_dir).resolve()).

Test plan

  • ruff check src/ scripts/ tests/
  • pytest tests/ -q
  • Full layout-2.load_data() wall-clock timing + syscall count
  • Byte-identical sample diff (100 samples, pre/post)
  • Stub-model eval for layout-2, layout-3, layout-8
  • CI green

Made with Cursor

The four _resolve_manifest_* helpers used pathlib's `.resolve()` to
canonicalize each manifest-relative path, which walks every path
component with `lstat` to detect symlinks. The GDB dataset tree contains
zero symlinks, so the entire realpath walk is pure overhead.

For layout-2 (2,044 samples, 25 path probes/sample) that meant ~670,000
`lstat` syscalls and 18 minutes just to build the samples list. This
dominated upload time in scripts/upload_to_hf.py and made `gdb eval`
against --dataset-root painful for anything in the PartialLayoutCompletion
family.

Replace `.resolve()` + `pathlib.Path.is_file()` with `os.path.join` and
`os.path.isfile`. One stat per probe instead of ~14 lstats.

Measured on the full 2,044-sample layout-2 load_data():

                before      after     ratio
  wall time    1080.6 s    90.8 s    11.9x
  lstat        ~670,000    15        ~44,000x
  stat         ~190,000    45,377    4.2x

Output is byte-identical (diffed against a pre-change snapshot, 100
samples, json.dumps sort_keys).

Other layout benchmarks were already sub-second (they use different
manifest code paths); layout-2 was the pathological case.

Made-with: Cursor
@mohitgargai mohitgargai requested a review from purvanshi as a code owner April 24, 2026 14:14
Follow-up on the perf fix: switching wholesale to ``os.path.*`` was
inconsistent with the rest of this file. Only ``.resolve()`` was the
problem (it realpaths every path component); ``is_file()`` / ``is_dir()``
are single-stat operations, no need to replace them.

Also drops the ``sample_dir.is_dir()`` pre-check in
``_resolve_component_asset`` — ``is_file`` on a non-existent dir is still
a single stat (ENOENT), same net cost, simpler code.

Timing on full layout-2 load_data() is 99.6 s vs 90.8 s for the pure
``os.path`` version (11% slower, pathlib boxing per stat) — still an
~11x improvement over the 1080.6 s baseline, and the code reads like the
rest of the module.

Verified: byte-identical samples, 42/42 tests pass, ruff clean.
Made-with: Cursor
@mohitgargai mohitgargai merged commit 7d0b055 into main Apr 24, 2026
12 checks passed
@mohitgargai mohitgargai deleted the perf/layout-load-data-stat-storm branch April 24, 2026 16:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant