Skip to content

perf(discovery): prune excluded subtrees during traversal#870

Open
Roozi489 wants to merge 2 commits intomicrosoft:mainfrom
Roozi489:perf/discovery-prune
Open

perf(discovery): prune excluded subtrees during traversal#870
Roozi489 wants to merge 2 commits intomicrosoft:mainfrom
Roozi489:perf/discovery-prune

Conversation

@Roozi489
Copy link
Copy Markdown

@Roozi489 Roozi489 commented Apr 23, 2026

APM compilation was too slow on large workspaces with many generated files (like Unreal Engine project, where it was more than 30 mins). These changes take into account excluded directories when scanning folders which in my case led (along with #871) to an at least 10x speedup for apm compile.

Replace glob.glob(recursive=True) in find_primitive_files() with os.walk and early directory pruning so compilation.exclude patterns prevent traversal into large subtrees. Adds DEFAULT_SKIP_DIRS constant. 125 new unit tests.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to improve primitive-file discovery performance by replacing glob.glob(recursive=True) with an os.walk traversal that can prune excluded directories early, reducing time spent in large excluded subtrees.

Changes:

  • Reworked find_primitive_files() to use os.walk() and prune directories using DEFAULT_SKIP_DIRS + exclude_patterns.
  • Added a shared DEFAULT_SKIP_DIRS constant for unconditional directory skipping during traversals.
  • Added a new unit test module covering the new walk-based discovery helpers and expected behaviors.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

File Description
src/apm_cli/primitives/discovery.py Switches primitive discovery from glob.glob to os.walk, adds _glob_match + early directory pruning hooks, and threads exclude_patterns into local discovery.
src/apm_cli/constants.py Introduces DEFAULT_SKIP_DIRS shared skip set for traversal pruning.
tests/unit/primitives/test_discovery_walk.py Adds tests for the new walk-based discovery behavior and related helpers (also asserts behavior for ContextOptimizer).
CHANGELOG.md Adds an Unreleased entry describing the discovery perf change and new shared constant.

Comment thread src/apm_cli/primitives/discovery.py Outdated
Comment on lines +453 to +460
# Match files against the requested patterns
for file_name in files:
file_path = current / file_name
rel_str = str(file_path.relative_to(base_path)).replace(os.sep, '/')
for pattern in patterns:
if _glob_match(rel_str, pattern):
all_files.append(file_path)
break
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exclude_patterns are only used to prune directories, but are no longer applied to matched files. This changes behavior from the previous implementation (callers used to should_exclude() each candidate) and means file-level excludes like **/*.instructions.md will no longer work unless they exclude an entire directory. Consider filtering matched files with should_exclude(file_path, base_path, exclude_patterns) before adding/returning them, while keeping directory pruning for perf.

Copilot uses AI. Check for mistakes.
Comment thread src/apm_cli/primitives/discovery.py Outdated
Comment on lines +396 to +414
"""Match a forward-slash relative path against a glob pattern with ** support.

fnmatch treats * as matching everything (including /), so ** works for
one-or-more path segments. However, ** is also supposed to match *zero*
segments (e.g. ``**/*.md`` should match ``readme.md`` at the root).
We handle that by also trying the pattern with the leading ``**/``
stripped (the "zero-segments" case).
"""
if '**/' not in pattern:
return fnmatch.fnmatch(rel_path, pattern)

prefix, suffix = pattern.split('**/', 1)
# Zero-segment case: ** matches nothing, so collapse it away
if fnmatch.fnmatch(rel_path, prefix + suffix):
return True
# One-or-more segment case: fnmatch * already crosses /
return fnmatch.fnmatch(rel_path, pattern)


Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_glob_match() uses fnmatch.fnmatch() on the full path string, but fnmatch allows * to match path separators. That makes patterns like **/.apm/instructions/*.instructions.md also match deeper paths such as .apm/instructions/subdir/x.instructions.md, which differs from glob semantics and could cause over-inclusion. Consider switching to a segment-aware matcher (e.g., compare path parts, or reuse the utils.exclude recursive matcher logic) so * does not cross directory boundaries.

Suggested change
"""Match a forward-slash relative path against a glob pattern with ** support.
fnmatch treats * as matching everything (including /), so ** works for
one-or-more path segments. However, ** is also supposed to match *zero*
segments (e.g. ``**/*.md`` should match ``readme.md`` at the root).
We handle that by also trying the pattern with the leading ``**/``
stripped (the "zero-segments" case).
"""
if '**/' not in pattern:
return fnmatch.fnmatch(rel_path, pattern)
prefix, suffix = pattern.split('**/', 1)
# Zero-segment case: ** matches nothing, so collapse it away
if fnmatch.fnmatch(rel_path, prefix + suffix):
return True
# One-or-more segment case: fnmatch * already crosses /
return fnmatch.fnmatch(rel_path, pattern)
"""Match a forward-slash relative path against a glob pattern.
This matcher is segment-aware: ``*`` and ``?`` only match within a single
path segment, while ``**`` matches zero or more complete segments. This
preserves standard glob semantics for patterns such as
``**/.apm/instructions/*.instructions.md`` so that the final ``*`` does
not match through nested directories.
Args:
rel_path (str): Relative path using forward slashes.
pattern (str): Glob pattern using forward slashes.
Returns:
bool: True when the path matches the pattern, otherwise False.
"""
path_parts: List[str] = [part for part in rel_path.split("/") if part]
pattern_parts: List[str] = [part for part in pattern.split("/") if part]
memo: Dict[tuple[int, int], bool] = {}
def _match(path_index: int, pattern_index: int) -> bool:
"""Recursively match path and pattern segments."""
key = (path_index, pattern_index)
if key in memo:
return memo[key]
if pattern_index == len(pattern_parts):
result = path_index == len(path_parts)
memo[key] = result
return result
current_pattern = pattern_parts[pattern_index]
if current_pattern == "**":
result = _match(path_index, pattern_index + 1)
if not result and path_index < len(path_parts):
result = _match(path_index + 1, pattern_index)
memo[key] = result
return result
if path_index >= len(path_parts):
memo[key] = False
return False
result = (
fnmatch.fnmatchcase(path_parts[path_index], current_pattern)
and _match(path_index + 1, pattern_index + 1)
)
memo[key] = result
return result
return _match(0, 0)

Copilot uses AI. Check for mistakes.
Comment thread src/apm_cli/primitives/discovery.py Outdated
Comment on lines +453 to +456
# Match files against the requested patterns
for file_name in files:
file_path = current / file_name
rel_str = str(file_path.relative_to(base_path)).replace(os.sep, '/')
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

os.walk() does not guarantee ordering for the files list. Since the previous glob.glob() path was deterministically sorted, this can introduce nondeterministic discovery order across platforms/filesystems. Consider iterating for file_name in sorted(files): (and/or sorting the final valid_files) to keep results stable.

Copilot uses AI. Check for mistakes.
Comment thread src/apm_cli/constants.py Outdated
Comment on lines +35 to +53
# Directory names unconditionally skipped during file-tree traversal.
# These never contain APM primitives or user source files and can be
# very large (e.g. node_modules, .git objects). Used by discovery,
# compilation, and content hashing to avoid expensive walks.
# NOTE: .apm is intentionally absent -- it is where primitives live.
# ---------------------------------------------------------------------------
DEFAULT_SKIP_DIRS: frozenset = frozenset({
".git",
"node_modules",
"__pycache__",
".pytest_cache",
".venv",
"venv",
".tox",
"build",
"dist",
".mypy_cache",
"apm_modules",
})
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says DEFAULT_SKIP_DIRS is used by "discovery, compilation, and content hashing", but in this PR it is only referenced from primitives/discovery.py (no other DEFAULT_SKIP_DIRS usages found). Either update the compilation/content-hashing walkers to use this shared constant, or adjust the comment to avoid promising behavior that isn't implemented.

Copilot uses AI. Check for mistakes.
Comment on lines +241 to +261
def test_cached_glob_respects_exclude_patterns(self):
"""_cached_glob should not return files under excluded directories."""
from apm_cli.compilation.context_optimizer import ContextOptimizer

# Create files in included and excluded dirs
(self.base / "src").mkdir()
(self.base / "src" / "app.py").touch()
(self.base / "vendor" / "lib").mkdir(parents=True)
(self.base / "vendor" / "lib" / "dep.py").touch()

optimizer = ContextOptimizer(
base_dir=str(self.base),
exclude_patterns=["vendor"],
)

matches = optimizer._cached_glob("**/*.py")
match_strs = [m.replace("\\", "/") for m in matches]

self.assertTrue(any("src/app.py" in m for m in match_strs))
self.assertFalse(any("vendor" in m for m in match_strs))

Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests assume ContextOptimizer._cached_glob() filters via a pre-built file list and respects exclude_patterns, but the current implementation uses glob.glob(..., recursive=True) and does not filter out excluded directories. As written, this test will fail unless ContextOptimizer is updated accordingly (or the test is rewritten to match the actual behavior/entry point being changed).

Copilot uses AI. Check for mistakes.
Comment on lines +272 to +309
def test_directory_files_cache_skips_default_dirs(self):
"""_directory_files_cache must not include files from DEFAULT_SKIP_DIRS."""
from apm_cli.compilation.context_optimizer import ContextOptimizer

(self.base / "src").mkdir()
(self.base / "src" / "ok.py").touch()
(self.base / "node_modules" / "pkg").mkdir(parents=True)
(self.base / "node_modules" / "pkg" / "bad.js").touch()
(self.base / "__pycache__").mkdir()
(self.base / "__pycache__" / "mod.pyc").touch()

optimizer = ContextOptimizer(base_dir=str(self.base))
optimizer._analyze_project_structure()
all_files = [str(f) for files in optimizer._directory_files_cache.values() for f in files]

self.assertTrue(any("ok.py" in s for s in all_files))
self.assertFalse(any("node_modules" in s for s in all_files))
self.assertFalse(any("__pycache__" in s for s in all_files))

def test_directory_files_cache_skips_custom_excludes(self):
"""_directory_files_cache must also respect user-supplied exclude_patterns."""
from apm_cli.compilation.context_optimizer import ContextOptimizer

(self.base / "src").mkdir()
(self.base / "src" / "ok.py").touch()
(self.base / "Binaries" / "Win64").mkdir(parents=True)
(self.base / "Binaries" / "Win64" / "huge.dll").touch()

optimizer = ContextOptimizer(
base_dir=str(self.base),
exclude_patterns=["Binaries"],
)
optimizer._analyze_project_structure()
all_files = [str(f) for files in optimizer._directory_files_cache.values() for f in files]

self.assertTrue(any("ok.py" in s for s in all_files))
self.assertFalse(any("Binaries" in s for s in all_files))

Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ContextOptimizer does not define _directory_files_cache anywhere (searching the codebase finds no such attribute), so these tests will raise AttributeError. Either update ContextOptimizer to expose/maintain the cache being asserted here, or change the test to assert against the actual data structure it uses (e.g., _file_list_cache / _directory_cache, depending on the intended contract).

Copilot uses AI. Check for mistakes.
Comment thread CHANGELOG.md Outdated

### Changed

- `find_primitive_files()` now uses `os.walk` with early directory pruning instead of `glob.glob(recursive=True)`, so `compilation.exclude` patterns prevent traversal into expensive subtrees (e.g. large game-engine asset trees). Adds shared `DEFAULT_SKIP_DIRS` constant used by both discovery and compilation.
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changelog entry does not follow the repo's Keep-a-Changelog rule of one line per PR ending with a PR number (e.g., (#123)), and it also claims DEFAULT_SKIP_DIRS is used by compilation even though this PR introduces no DEFAULT_SKIP_DIRS usage outside primitives/discovery.py. Please add the PR reference and align the wording with what actually changed (or update compilation to use the shared constant).

Copilot generated this review using guidance from repository custom instructions.
@Roozi489
Copy link
Copy Markdown
Author

@microsoft-github-policy-service agree

Replace glob.glob(recursive=True) in find_primitive_files() with an
os.walk loop that prunes directories before descending into them.

This allows compilation.exclude patterns (and the new DEFAULT_SKIP_DIRS
constant) to short-circuit traversal into large subtrees -- e.g. a UE5
game repo with 265 K files no longer hangs because node_modules / build
/ apm_modules are skipped at the first encountered node.

Changes:
- constants.py: add DEFAULT_SKIP_DIRS frozenset
- discovery.py: os.walk + early prune; thread exclude_patterns into
  traversal; add segment-aware _glob_match() helper that prevents *
  from crossing path separators (preserves real glob semantics);
  apply file-level exclude per file; sort os.walk file lists for
  deterministic discovery
- test_discovery_walk.py: new unit tests covering prune behaviour,
  exclude_patterns (directory + file level), symlink rejection, and
  segment-aware ** / * pattern matching

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment thread src/apm_cli/primitives/discovery.py Outdated
Comment thread src/apm_cli/constants.py
Comment on lines +41 to +53
DEFAULT_SKIP_DIRS: frozenset = frozenset({
".git",
"node_modules",
"__pycache__",
".pytest_cache",
".venv",
"venv",
".tox",
"build",
"dist",
".mypy_cache",
"apm_modules",
})
Copy link

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DEFAULT_SKIP_DIRS is annotated as an unparameterized frozenset, which loses type information for static checking. Consider annotating it as frozenset[str] (and optionally typing.Final) so callers get correct types and IDE help.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot apply changes based on this feedback

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants