perf(discovery): prune excluded subtrees during traversal by Roozi489 · Pull Request #870 · microsoft/apm

Roozi489 · 2026-04-23T09:30:34Z

APM compilation was too slow on large workspaces with many generated files (like Unreal Engine project, where it was more than 30 mins). These changes take into account excluded directories when scanning folders which in my case led (along with #871) to an at least 10x speedup for apm compile.

Replace glob.glob(recursive=True) in find_primitive_files() with os.walk and early directory pruning so compilation.exclude patterns prevent traversal into large subtrees. Adds DEFAULT_SKIP_DIRS constant. 125 new unit tests.

Copilot

Pull request overview

This PR aims to improve primitive-file discovery performance by replacing glob.glob(recursive=True) with an os.walk traversal that can prune excluded directories early, reducing time spent in large excluded subtrees.

Changes:

Reworked find_primitive_files() to use os.walk() and prune directories using DEFAULT_SKIP_DIRS + exclude_patterns.
Added a shared DEFAULT_SKIP_DIRS constant for unconditional directory skipping during traversals.
Added a new unit test module covering the new walk-based discovery helpers and expected behaviors.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

File	Description
`src/apm_cli/primitives/discovery.py`	Switches primitive discovery from `glob.glob` to `os.walk`, adds `_glob_match` + early directory pruning hooks, and threads `exclude_patterns` into local discovery.
`src/apm_cli/constants.py`	Introduces `DEFAULT_SKIP_DIRS` shared skip set for traversal pruning.
`tests/unit/primitives/test_discovery_walk.py`	Adds tests for the new walk-based discovery behavior and related helpers (also asserts behavior for `ContextOptimizer`).
`CHANGELOG.md`	Adds an Unreleased entry describing the discovery perf change and new shared constant.

Copilot · 2026-04-23T09:36:55Z

+        # Match files against the requested patterns
+        for file_name in files:
+            file_path = current / file_name
+            rel_str = str(file_path.relative_to(base_path)).replace(os.sep, '/')
+            for pattern in patterns:
+                if _glob_match(rel_str, pattern):
+                    all_files.append(file_path)
+                    break


exclude_patterns are only used to prune directories, but are no longer applied to matched files. This changes behavior from the previous implementation (callers used to should_exclude() each candidate) and means file-level excludes like **/*.instructions.md will no longer work unless they exclude an entire directory. Consider filtering matched files with should_exclude(file_path, base_path, exclude_patterns) before adding/returning them, while keeping directory pruning for perf.

Copilot · 2026-04-23T09:36:55Z

+    """Match a forward-slash relative path against a glob pattern with ** support.
+
+    fnmatch treats * as matching everything (including /), so ** works for
+    one-or-more path segments.  However, ** is also supposed to match *zero*
+    segments (e.g. ``**/*.md`` should match ``readme.md`` at the root).
+    We handle that by also trying the pattern with the leading ``**/``
+    stripped (the "zero-segments" case).
+    """
+    if '**/' not in pattern:
+        return fnmatch.fnmatch(rel_path, pattern)
+
+    prefix, suffix = pattern.split('**/', 1)
+    # Zero-segment case: ** matches nothing, so collapse it away
+    if fnmatch.fnmatch(rel_path, prefix + suffix):
+        return True
+    # One-or-more segment case: fnmatch * already crosses /
+    return fnmatch.fnmatch(rel_path, pattern)
+
+


_glob_match() uses fnmatch.fnmatch() on the full path string, but fnmatch allows * to match path separators. That makes patterns like **/.apm/instructions/*.instructions.md also match deeper paths such as .apm/instructions/subdir/x.instructions.md, which differs from glob semantics and could cause over-inclusion. Consider switching to a segment-aware matcher (e.g., compare path parts, or reuse the utils.exclude recursive matcher logic) so * does not cross directory boundaries.

Suggested change

"""Match a forward-slash relative path against a glob pattern with ** support.

fnmatch treats * as matching everything (including /), so ** works for

one-or-more path segments. However, ** is also supposed to match *zero*

segments (e.g. ``**/*.md`` should match ``readme.md`` at the root).

We handle that by also trying the pattern with the leading ``**/``

stripped (the "zero-segments" case).

"""

if '**/' not in pattern:

return fnmatch.fnmatch(rel_path, pattern)

prefix, suffix = pattern.split('**/', 1)

# Zero-segment case: ** matches nothing, so collapse it away

if fnmatch.fnmatch(rel_path, prefix + suffix):

return True

# One-or-more segment case: fnmatch * already crosses /

return fnmatch.fnmatch(rel_path, pattern)

"""Match a forward-slash relative path against a glob pattern.

This matcher is segment-aware: ``*`` and ``?`` only match within a single

path segment, while ``**`` matches zero or more complete segments. This

preserves standard glob semantics for patterns such as

``**/.apm/instructions/*.instructions.md`` so that the final ``*`` does

not match through nested directories.

Args:

rel_path (str): Relative path using forward slashes.

pattern (str): Glob pattern using forward slashes.

Returns:

bool: True when the path matches the pattern, otherwise False.

"""

path_parts: List[str] = [part for part in rel_path.split("/") if part]

pattern_parts: List[str] = [part for part in pattern.split("/") if part]

memo: Dict[tuple[int, int], bool] = {}

def _match(path_index: int, pattern_index: int) -> bool:

"""Recursively match path and pattern segments."""

key = (path_index, pattern_index)

if key in memo:

return memo[key]

if pattern_index == len(pattern_parts):

result = path_index == len(path_parts)

memo[key] = result

return result

current_pattern = pattern_parts[pattern_index]

if current_pattern == "**":

result = _match(path_index, pattern_index + 1)

if not result and path_index < len(path_parts):

result = _match(path_index + 1, pattern_index)

memo[key] = result

return result

if path_index >= len(path_parts):

memo[key] = False

return False

result = (

fnmatch.fnmatchcase(path_parts[path_index], current_pattern)

and _match(path_index + 1, pattern_index + 1)

)

memo[key] = result

return result

return _match(0, 0)

Copilot · 2026-04-23T09:36:56Z

+        # Match files against the requested patterns
+        for file_name in files:
+            file_path = current / file_name
+            rel_str = str(file_path.relative_to(base_path)).replace(os.sep, '/')


os.walk() does not guarantee ordering for the files list. Since the previous glob.glob() path was deterministically sorted, this can introduce nondeterministic discovery order across platforms/filesystems. Consider iterating for file_name in sorted(files): (and/or sorting the final valid_files) to keep results stable.

Copilot · 2026-04-23T09:36:56Z

+# Directory names unconditionally skipped during file-tree traversal.
+# These never contain APM primitives or user source files and can be
+# very large (e.g. node_modules, .git objects).  Used by discovery,
+# compilation, and content hashing to avoid expensive walks.
+# NOTE: .apm is intentionally absent -- it is where primitives live.
+# ---------------------------------------------------------------------------
+DEFAULT_SKIP_DIRS: frozenset = frozenset({
+    ".git",
+    "node_modules",
+    "__pycache__",
+    ".pytest_cache",
+    ".venv",
+    "venv",
+    ".tox",
+    "build",
+    "dist",
+    ".mypy_cache",
+    "apm_modules",
+})


The comment says DEFAULT_SKIP_DIRS is used by "discovery, compilation, and content hashing", but in this PR it is only referenced from primitives/discovery.py (no other DEFAULT_SKIP_DIRS usages found). Either update the compilation/content-hashing walkers to use this shared constant, or adjust the comment to avoid promising behavior that isn't implemented.

Copilot · 2026-04-23T09:36:56Z

+    def test_cached_glob_respects_exclude_patterns(self):
+        """_cached_glob should not return files under excluded directories."""
+        from apm_cli.compilation.context_optimizer import ContextOptimizer
+
+        # Create files in included and excluded dirs
+        (self.base / "src").mkdir()
+        (self.base / "src" / "app.py").touch()
+        (self.base / "vendor" / "lib").mkdir(parents=True)
+        (self.base / "vendor" / "lib" / "dep.py").touch()
+
+        optimizer = ContextOptimizer(
+            base_dir=str(self.base),
+            exclude_patterns=["vendor"],
+        )
+
+        matches = optimizer._cached_glob("**/*.py")
+        match_strs = [m.replace("\\", "/") for m in matches]
+
+        self.assertTrue(any("src/app.py" in m for m in match_strs))
+        self.assertFalse(any("vendor" in m for m in match_strs))
+


These tests assume ContextOptimizer._cached_glob() filters via a pre-built file list and respects exclude_patterns, but the current implementation uses glob.glob(..., recursive=True) and does not filter out excluded directories. As written, this test will fail unless ContextOptimizer is updated accordingly (or the test is rewritten to match the actual behavior/entry point being changed).

Copilot · 2026-04-23T09:36:56Z

+    def test_directory_files_cache_skips_default_dirs(self):
+        """_directory_files_cache must not include files from DEFAULT_SKIP_DIRS."""
+        from apm_cli.compilation.context_optimizer import ContextOptimizer
+
+        (self.base / "src").mkdir()
+        (self.base / "src" / "ok.py").touch()
+        (self.base / "node_modules" / "pkg").mkdir(parents=True)
+        (self.base / "node_modules" / "pkg" / "bad.js").touch()
+        (self.base / "__pycache__").mkdir()
+        (self.base / "__pycache__" / "mod.pyc").touch()
+
+        optimizer = ContextOptimizer(base_dir=str(self.base))
+        optimizer._analyze_project_structure()
+        all_files = [str(f) for files in optimizer._directory_files_cache.values() for f in files]
+
+        self.assertTrue(any("ok.py" in s for s in all_files))
+        self.assertFalse(any("node_modules" in s for s in all_files))
+        self.assertFalse(any("__pycache__" in s for s in all_files))
+
+    def test_directory_files_cache_skips_custom_excludes(self):
+        """_directory_files_cache must also respect user-supplied exclude_patterns."""
+        from apm_cli.compilation.context_optimizer import ContextOptimizer
+
+        (self.base / "src").mkdir()
+        (self.base / "src" / "ok.py").touch()
+        (self.base / "Binaries" / "Win64").mkdir(parents=True)
+        (self.base / "Binaries" / "Win64" / "huge.dll").touch()
+
+        optimizer = ContextOptimizer(
+            base_dir=str(self.base),
+            exclude_patterns=["Binaries"],
+        )
+        optimizer._analyze_project_structure()
+        all_files = [str(f) for files in optimizer._directory_files_cache.values() for f in files]
+
+        self.assertTrue(any("ok.py" in s for s in all_files))
+        self.assertFalse(any("Binaries" in s for s in all_files))
+


ContextOptimizer does not define _directory_files_cache anywhere (searching the codebase finds no such attribute), so these tests will raise AttributeError. Either update ContextOptimizer to expose/maintain the cache being asserted here, or change the test to assert against the actual data structure it uses (e.g., _file_list_cache / _directory_cache, depending on the intended contract).

Copilot · 2026-04-23T09:36:57Z


+### Changed
+
+- `find_primitive_files()` now uses `os.walk` with early directory pruning instead of `glob.glob(recursive=True)`, so `compilation.exclude` patterns prevent traversal into expensive subtrees (e.g. large game-engine asset trees). Adds shared `DEFAULT_SKIP_DIRS` constant used by both discovery and compilation.


Changelog entry does not follow the repo's Keep-a-Changelog rule of one line per PR ending with a PR number (e.g., (#123)), and it also claims DEFAULT_SKIP_DIRS is used by compilation even though this PR introduces no DEFAULT_SKIP_DIRS usage outside primitives/discovery.py. Please add the PR reference and align the wording with what actually changed (or update compilation to use the shared constant).

Roozi489 · 2026-04-23T10:35:17Z

@microsoft-github-policy-service agree

Replace glob.glob(recursive=True) in find_primitive_files() with an os.walk loop that prunes directories before descending into them. This allows compilation.exclude patterns (and the new DEFAULT_SKIP_DIRS constant) to short-circuit traversal into large subtrees -- e.g. a UE5 game repo with 265 K files no longer hangs because node_modules / build / apm_modules are skipped at the first encountered node. Changes: - constants.py: add DEFAULT_SKIP_DIRS frozenset - discovery.py: os.walk + early prune; thread exclude_patterns into traversal; add segment-aware _glob_match() helper that prevents * from crossing path separators (preserves real glob semantics); apply file-level exclude per file; sort os.walk file lists for deterministic discovery - test_discovery_walk.py: new unit tests covering prune behaviour, exclude_patterns (directory + file level), symlink rejection, and segment-aware ** / * pattern matching Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Copilot · 2026-04-23T12:05:33Z

+DEFAULT_SKIP_DIRS: frozenset = frozenset({
+    ".git",
+    "node_modules",
+    "__pycache__",
+    ".pytest_cache",
+    ".venv",
+    "venv",
+    ".tox",
+    "build",
+    "dist",
+    ".mypy_cache",
+    "apm_modules",
+})


DEFAULT_SKIP_DIRS is annotated as an unparameterized frozenset, which loses type information for static checking. Consider annotating it as frozenset[str] (and optionally typing.Final) so callers get correct types and IDE help.

@copilot apply changes based on this feedback

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings April 23, 2026 09:30

Roozi489 requested review from danielmeppiel and sergio-sisternes-epam as code owners April 23, 2026 09:30

Copilot started reviewing on behalf of Roozi489 April 23, 2026 09:31 View session

Roozi489 mentioned this pull request Apr 23, 2026

perf(compile): cache file walk and fix placement for narrow patterns #871

Open

Copilot AI reviewed Apr 23, 2026

View reviewed changes

Roozi489 force-pushed the perf/discovery-prune branch from 0f9973b to 487d464 Compare April 23, 2026 10:45

Roozi489 requested a review from Copilot April 23, 2026 11:59

Copilot started reviewing on behalf of Roozi489 April 23, 2026 12:00 View session

Copilot AI reviewed Apr 23, 2026

View reviewed changes

Use platform-aware case match in _glob_match

313eaad

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(discovery): prune excluded subtrees during traversal#870

perf(discovery): prune excluded subtrees during traversal#870
Roozi489 wants to merge 2 commits intomicrosoft:mainfrom
Roozi489:perf/discovery-prune

Roozi489 commented Apr 23, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 23, 2026

Uh oh!

Copilot AI Apr 23, 2026

Uh oh!

Copilot AI Apr 23, 2026

Uh oh!

Copilot AI Apr 23, 2026

Uh oh!

Copilot AI Apr 23, 2026

Uh oh!

Copilot AI Apr 23, 2026

Uh oh!

Copilot AI Apr 23, 2026

Uh oh!

Roozi489 commented Apr 23, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Copilot AI Apr 23, 2026

Uh oh!

Roozi489 Apr 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

-    """Match a forward-slash relative path against a glob pattern with ** support.
-    fnmatch treats * as matching everything (including /), so ** works for
-    one-or-more path segments.  However, ** is also supposed to match *zero*
-    segments (e.g. ``**/*.md`` should match ``readme.md`` at the root).
-    We handle that by also trying the pattern with the leading ``**/``
-    stripped (the "zero-segments" case).
-    """
-    if '**/' not in pattern:
-        return fnmatch.fnmatch(rel_path, pattern)
-    prefix, suffix = pattern.split('**/', 1)
-    # Zero-segment case: ** matches nothing, so collapse it away
-    if fnmatch.fnmatch(rel_path, prefix + suffix):
-        return True
-    # One-or-more segment case: fnmatch * already crosses /
-    return fnmatch.fnmatch(rel_path, pattern)
+    """Match a forward-slash relative path against a glob pattern.
+    This matcher is segment-aware: ``*`` and ``?`` only match within a single
+    path segment, while ``**`` matches zero or more complete segments. This
+    preserves standard glob semantics for patterns such as
+    ``**/.apm/instructions/*.instructions.md`` so that the final ``*`` does
+    not match through nested directories.
+    Args:
+        rel_path (str): Relative path using forward slashes.
+        pattern (str): Glob pattern using forward slashes.
+    Returns:
+        bool: True when the path matches the pattern, otherwise False.
+    """
+    path_parts: List[str] = [part for part in rel_path.split("/") if part]
+    pattern_parts: List[str] = [part for part in pattern.split("/") if part]
+    memo: Dict[tuple[int, int], bool] = {}
+    def _match(path_index: int, pattern_index: int) -> bool:
+        """Recursively match path and pattern segments."""
+        key = (path_index, pattern_index)
+        if key in memo:
+            return memo[key]
+        if pattern_index == len(pattern_parts):
+            result = path_index == len(path_parts)
+            memo[key] = result
+            return result
+        current_pattern = pattern_parts[pattern_index]
+        if current_pattern == "**":
+            result = _match(path_index, pattern_index + 1)
+            if not result and path_index < len(path_parts):
+                result = _match(path_index + 1, pattern_index)
+            memo[key] = result
+            return result
+        if path_index >= len(path_parts):
+            memo[key] = False
+            return False
+        result = (
+            fnmatch.fnmatchcase(path_parts[path_index], current_pattern)
+            and _match(path_index + 1, pattern_index + 1)
+        )
+        memo[key] = result
+        return result
+    return _match(0, 0)


		### Changed

		- `find_primitive_files()` now uses `os.walk` with early directory pruning instead of `glob.glob(recursive=True)`, so `compilation.exclude` patterns prevent traversal into expensive subtrees (e.g. large game-engine asset trees). Adds shared `DEFAULT_SKIP_DIRS` constant used by both discovery and compilation.

Conversation

Roozi489 commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

Roozi489 commented Apr 23, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Copilot AI Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

Roozi489 Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Roozi489 commented Apr 23, 2026 •

edited

Loading