perf(discovery): prune excluded subtrees during traversal #870
Roozi489 wants to merge 2 commits into microsoft:main
Pull request overview
This PR aims to improve primitive-file discovery performance by replacing glob.glob(recursive=True) with an os.walk traversal that can prune excluded directories early, reducing time spent in large excluded subtrees.
Changes:
- Reworked find_primitive_files() to use os.walk() and prune directories using DEFAULT_SKIP_DIRS + exclude_patterns.
- Added a shared DEFAULT_SKIP_DIRS constant for unconditional directory skipping during traversals.
- Added a new unit test module covering the new walk-based discovery helpers and expected behaviors.
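The pruning idea in the list above can be sketched as follows. This is a minimal illustration and not the PR's code: `SKIP_DIRS`, `walk_pruned`, and `demo_walk_pruned` are hypothetical names, and the skip set is a shortened stand-in for DEFAULT_SKIP_DIRS. The only mechanism it relies on is that os.walk (with the default topdown=True) honors in-place edits to its `dirs` list.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical, shortened stand-in for the shared DEFAULT_SKIP_DIRS constant.
SKIP_DIRS = frozenset({".git", "node_modules", "__pycache__"})


def walk_pruned(base, skip_dirs=SKIP_DIRS):
    """Yield files under base without ever descending into skipped directories."""
    for root, dirs, files in os.walk(base):
        # Slice assignment mutates the list that os.walk itself iterates,
        # so removed names are never visited (needs the default topdown=True).
        dirs[:] = sorted(d for d in dirs if d not in skip_dirs)
        for name in sorted(files):
            yield Path(root) / name


def demo_walk_pruned():
    """Build a tiny tree and return the relative paths that survive pruning."""
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        (base / "src").mkdir()
        (base / "src" / "a.md").touch()
        (base / "node_modules" / "pkg").mkdir(parents=True)
        (base / "node_modules" / "pkg" / "b.md").touch()
        return [p.relative_to(base).as_posix() for p in walk_pruned(base)]
```

The saving comes from the walk never listing anything below a skipped directory, rather than listing everything and filtering afterwards.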
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/apm_cli/primitives/discovery.py | Switches primitive discovery from glob.glob to os.walk, adds _glob_match + early directory pruning hooks, and threads exclude_patterns into local discovery. |
| src/apm_cli/constants.py | Introduces DEFAULT_SKIP_DIRS shared skip set for traversal pruning. |
| tests/unit/primitives/test_discovery_walk.py | Adds tests for the new walk-based discovery behavior and related helpers (also asserts behavior for ContextOptimizer). |
| CHANGELOG.md | Adds an Unreleased entry describing the discovery perf change and new shared constant. |
src/apm_cli/primitives/discovery.py

    # Match files against the requested patterns
    for file_name in files:
        file_path = current / file_name
        rel_str = str(file_path.relative_to(base_path)).replace(os.sep, '/')
        for pattern in patterns:
            if _glob_match(rel_str, pattern):
                all_files.append(file_path)
                break

exclude_patterns are only used to prune directories and are no longer applied to matched files. This changes behavior from the previous implementation (callers used to run should_exclude() on each candidate) and means file-level excludes like **/*.instructions.md will no longer work unless they exclude an entire directory. Consider filtering matched files with should_exclude(file_path, base_path, exclude_patterns) before adding/returning them, while keeping directory pruning for perf.
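A rough sketch of what this comment asks for, combining directory pruning with a per-file exclude check. Everything here is illustrative: `is_excluded` is a deliberately simplified, hypothetical stand-in for the project's should_exclude() (whose real matching rules are not shown in this conversation), and `find_files` and `demo` are made-up names.

```python
import fnmatch
import os
import tempfile
from pathlib import Path


def is_excluded(rel_path, exclude_patterns):
    """Hypothetical, simplified stand-in for should_exclude()."""
    return any(fnmatch.fnmatchcase(rel_path, pat) for pat in exclude_patterns)


def find_files(base, exclude_patterns, filter_files=True):
    """Walk base, pruning excluded directories and optionally excluded files."""
    found = []
    for root, dirs, files in os.walk(base):
        rel_root = Path(root).relative_to(base).as_posix()
        prefix = "" if rel_root == "." else rel_root + "/"
        # Directory pruning keeps the performance win.
        dirs[:] = sorted(
            d for d in dirs if not is_excluded(prefix + d, exclude_patterns)
        )
        for name in sorted(files):
            rel = prefix + name
            # The file-level check restores excludes that target single files.
            if filter_files and is_excluded(rel, exclude_patterns):
                continue
            found.append(rel)
    return found


def demo(filter_files):
    """Run find_files on a small tree with one directory and one file exclude."""
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        (base / "src").mkdir()
        (base / "src" / "keep.md").touch()
        (base / "src" / "skip.instructions.md").touch()
        (base / "vendor" / "lib").mkdir(parents=True)
        (base / "vendor" / "lib" / "dep.md").touch()
        patterns = ["vendor", "src/*.instructions.md"]
        return find_files(base, patterns, filter_files)
```

With `filter_files=False` the walk still skips `vendor`, but `src/skip.instructions.md` is returned, which is the regression the comment describes.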
src/apm_cli/primitives/discovery.py

    """Match a forward-slash relative path against a glob pattern with ** support.

    fnmatch treats * as matching everything (including /), so ** works for
    one-or-more path segments. However, ** is also supposed to match *zero*
    segments (e.g. ``**/*.md`` should match ``readme.md`` at the root).
    We handle that by also trying the pattern with the leading ``**/``
    stripped (the "zero-segments" case).
    """
    if '**/' not in pattern:
        return fnmatch.fnmatch(rel_path, pattern)

    prefix, suffix = pattern.split('**/', 1)
    # Zero-segment case: ** matches nothing, so collapse it away
    if fnmatch.fnmatch(rel_path, prefix + suffix):
        return True
    # One-or-more segment case: fnmatch * already crosses /
    return fnmatch.fnmatch(rel_path, pattern)

_glob_match() uses fnmatch.fnmatch() on the full path string, but fnmatch allows * to match path separators. That makes patterns like **/.apm/instructions/*.instructions.md also match deeper paths such as .apm/instructions/subdir/x.instructions.md, which differs from glob semantics and could cause over-inclusion. Consider switching to a segment-aware matcher (e.g., compare path parts, or reuse the utils.exclude recursive matcher logic) so * does not cross directory boundaries.
Suggested change:

    """Match a forward-slash relative path against a glob pattern.

    This matcher is segment-aware: ``*`` and ``?`` only match within a single
    path segment, while ``**`` matches zero or more complete segments. This
    preserves standard glob semantics for patterns such as
    ``**/.apm/instructions/*.instructions.md`` so that the final ``*`` does
    not match through nested directories.

    Args:
        rel_path (str): Relative path using forward slashes.
        pattern (str): Glob pattern using forward slashes.

    Returns:
        bool: True when the path matches the pattern, otherwise False.
    """
    path_parts: List[str] = [part for part in rel_path.split("/") if part]
    pattern_parts: List[str] = [part for part in pattern.split("/") if part]
    memo: Dict[tuple[int, int], bool] = {}

    def _match(path_index: int, pattern_index: int) -> bool:
        """Recursively match path and pattern segments."""
        key = (path_index, pattern_index)
        if key in memo:
            return memo[key]

        if pattern_index == len(pattern_parts):
            result = path_index == len(path_parts)
            memo[key] = result
            return result

        current_pattern = pattern_parts[pattern_index]
        if current_pattern == "**":
            result = _match(path_index, pattern_index + 1)
            if not result and path_index < len(path_parts):
                result = _match(path_index + 1, pattern_index)
            memo[key] = result
            return result

        if path_index >= len(path_parts):
            memo[key] = False
            return False

        result = (
            fnmatch.fnmatchcase(path_parts[path_index], current_pattern)
            and _match(path_index + 1, pattern_index + 1)
        )
        memo[key] = result
        return result

    return _match(0, 0)

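The over-matching described in this comment can be reproduced in isolation. The snippet below is only a demonstration, not part of the PR: `segment_match` is a hypothetical helper that handles patterns without `**`, and the pattern used is the "zero-segment" form that the original _glob_match() would try after stripping the leading `**/`.

```python
import fnmatch


def segment_match(rel_path, pattern):
    """Compare segment by segment; only handles patterns without '**'."""
    path_parts = rel_path.split("/")
    pattern_parts = pattern.split("/")
    if len(path_parts) != len(pattern_parts):
        return False
    return all(
        fnmatch.fnmatchcase(part, pat)
        for part, pat in zip(path_parts, pattern_parts)
    )


PATTERN = ".apm/instructions/*.instructions.md"
DIRECT = ".apm/instructions/x.instructions.md"
NESTED = ".apm/instructions/subdir/x.instructions.md"

# Whole-string matching: '*' is free to swallow the '/' in 'subdir/x'.
whole_string_nested = fnmatch.fnmatchcase(NESTED, PATTERN)

# Segment-wise matching: the nested path has one segment too many.
segment_nested = segment_match(NESTED, PATTERN)
segment_direct = segment_match(DIRECT, PATTERN)
```

Here `whole_string_nested` is True while `segment_nested` is False, which is the difference between fnmatch on the full string and glob-style matching per path segment.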
src/apm_cli/primitives/discovery.py

    # Match files against the requested patterns
    for file_name in files:
        file_path = current / file_name
        rel_str = str(file_path.relative_to(base_path)).replace(os.sep, '/')

os.walk() does not guarantee ordering for the files list. Since the previous glob.glob() path was deterministically sorted, this can introduce nondeterministic discovery order across platforms/filesystems. Consider iterating for file_name in sorted(files): (and/or sorting the final valid_files) to keep results stable.
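One way to get a stable order, sketched under the assumption that sorting by name is acceptable for discovery. `walk_sorted` and `demo_walk_sorted` are hypothetical names used only for this illustration.

```python
import os
import tempfile
from pathlib import Path


def walk_sorted(base):
    """Return relative file paths in an order independent of the filesystem."""
    results = []
    for root, dirs, files in os.walk(base):
        # Sorting dirs in place fixes the order in which subdirectories
        # are visited on later iterations.
        dirs.sort()
        # Sorting files fixes the order within each directory.
        for name in sorted(files):
            results.append((Path(root) / name).relative_to(base).as_posix())
    return results


def demo_walk_sorted():
    """Create files in a scrambled order and list them via walk_sorted."""
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        (base / "sub").mkdir()
        (base / "sub" / "c.md").touch()
        (base / "b.md").touch()
        (base / "a.md").touch()
        return walk_sorted(base)
```

Because the walk is top-down, files in a directory are listed before files in its subdirectories, so the result is ordered per directory rather than as one globally sorted list.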
src/apm_cli/constants.py

    # Directory names unconditionally skipped during file-tree traversal.
    # These never contain APM primitives or user source files and can be
    # very large (e.g. node_modules, .git objects). Used by discovery,
    # compilation, and content hashing to avoid expensive walks.
    # NOTE: .apm is intentionally absent -- it is where primitives live.
    # ---------------------------------------------------------------------------
    DEFAULT_SKIP_DIRS: frozenset = frozenset({
        ".git",
        "node_modules",
        "__pycache__",
        ".pytest_cache",
        ".venv",
        "venv",
        ".tox",
        "build",
        "dist",
        ".mypy_cache",
        "apm_modules",
    })

The comment says DEFAULT_SKIP_DIRS is used by "discovery, compilation, and content hashing", but in this PR it is only referenced from primitives/discovery.py (no other DEFAULT_SKIP_DIRS usages found). Either update the compilation/content-hashing walkers to use this shared constant, or adjust the comment to avoid promising behavior that isn't implemented.
tests/unit/primitives/test_discovery_walk.py

    def test_cached_glob_respects_exclude_patterns(self):
        """_cached_glob should not return files under excluded directories."""
        from apm_cli.compilation.context_optimizer import ContextOptimizer

        # Create files in included and excluded dirs
        (self.base / "src").mkdir()
        (self.base / "src" / "app.py").touch()
        (self.base / "vendor" / "lib").mkdir(parents=True)
        (self.base / "vendor" / "lib" / "dep.py").touch()

        optimizer = ContextOptimizer(
            base_dir=str(self.base),
            exclude_patterns=["vendor"],
        )

        matches = optimizer._cached_glob("**/*.py")
        match_strs = [m.replace("\\", "/") for m in matches]

        self.assertTrue(any("src/app.py" in m for m in match_strs))
        self.assertFalse(any("vendor" in m for m in match_strs))

These tests assume ContextOptimizer._cached_glob() filters via a pre-built file list and respects exclude_patterns, but the current implementation uses glob.glob(..., recursive=True) and does not filter out excluded directories. As written, this test will fail unless ContextOptimizer is updated accordingly (or the test is rewritten to match the actual behavior/entry point being changed).
tests/unit/primitives/test_discovery_walk.py

    def test_directory_files_cache_skips_default_dirs(self):
        """_directory_files_cache must not include files from DEFAULT_SKIP_DIRS."""
        from apm_cli.compilation.context_optimizer import ContextOptimizer

        (self.base / "src").mkdir()
        (self.base / "src" / "ok.py").touch()
        (self.base / "node_modules" / "pkg").mkdir(parents=True)
        (self.base / "node_modules" / "pkg" / "bad.js").touch()
        (self.base / "__pycache__").mkdir()
        (self.base / "__pycache__" / "mod.pyc").touch()

        optimizer = ContextOptimizer(base_dir=str(self.base))
        optimizer._analyze_project_structure()
        all_files = [str(f) for files in optimizer._directory_files_cache.values() for f in files]

        self.assertTrue(any("ok.py" in s for s in all_files))
        self.assertFalse(any("node_modules" in s for s in all_files))
        self.assertFalse(any("__pycache__" in s for s in all_files))

    def test_directory_files_cache_skips_custom_excludes(self):
        """_directory_files_cache must also respect user-supplied exclude_patterns."""
        from apm_cli.compilation.context_optimizer import ContextOptimizer

        (self.base / "src").mkdir()
        (self.base / "src" / "ok.py").touch()
        (self.base / "Binaries" / "Win64").mkdir(parents=True)
        (self.base / "Binaries" / "Win64" / "huge.dll").touch()

        optimizer = ContextOptimizer(
            base_dir=str(self.base),
            exclude_patterns=["Binaries"],
        )
        optimizer._analyze_project_structure()
        all_files = [str(f) for files in optimizer._directory_files_cache.values() for f in files]

        self.assertTrue(any("ok.py" in s for s in all_files))
        self.assertFalse(any("Binaries" in s for s in all_files))

ContextOptimizer does not define _directory_files_cache anywhere (searching the codebase finds no such attribute), so these tests will raise AttributeError. Either update ContextOptimizer to expose/maintain the cache being asserted here, or change the test to assert against the actual data structure it uses (e.g., _file_list_cache / _directory_cache, depending on the intended contract).
CHANGELOG.md

    ### Changed

    - `find_primitive_files()` now uses `os.walk` with early directory pruning instead of `glob.glob(recursive=True)`, so `compilation.exclude` patterns prevent traversal into expensive subtrees (e.g. large game-engine asset trees). Adds shared `DEFAULT_SKIP_DIRS` constant used by both discovery and compilation.

Changelog entry does not follow the repo's Keep-a-Changelog rule of one line per PR ending with a PR number (e.g., (#123)), and it also claims DEFAULT_SKIP_DIRS is used by compilation even though this PR introduces no DEFAULT_SKIP_DIRS usage outside primitives/discovery.py. Please add the PR reference and align the wording with what actually changed (or update compilation to use the shared constant).
@microsoft-github-policy-service agree
Replace glob.glob(recursive=True) in find_primitive_files() with an os.walk loop that prunes directories before descending into them. This allows compilation.exclude patterns (and the new DEFAULT_SKIP_DIRS constant) to short-circuit traversal into large subtrees -- e.g. a UE5 game repo with 265K files no longer hangs because node_modules / build / apm_modules are skipped at the first encountered node.

Changes:
- constants.py: add DEFAULT_SKIP_DIRS frozenset
- discovery.py: os.walk + early prune; thread exclude_patterns into traversal; add segment-aware _glob_match() helper that prevents * from crossing path separators (preserves real glob semantics); apply file-level exclude per file; sort os.walk file lists for deterministic discovery
- test_discovery_walk.py: new unit tests covering prune behaviour, exclude_patterns (directory + file level), symlink rejection, and segment-aware ** / * pattern matching

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
0f9973b to 487d464
src/apm_cli/constants.py

    DEFAULT_SKIP_DIRS: frozenset = frozenset({
        ".git",
        "node_modules",
        "__pycache__",
        ".pytest_cache",
        ".venv",
        "venv",
        ".tox",
        "build",
        "dist",
        ".mypy_cache",
        "apm_modules",
    })

DEFAULT_SKIP_DIRS is annotated as an unparameterized frozenset, which loses type information for static checking. Consider annotating it as frozenset[str] (and optionally typing.Final) so callers get correct types and IDE help.
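A sketch of the annotation this comment suggests. The set contents are copied from the snippet above; `is_skipped` is a hypothetical helper added only to show a typed caller. It assumes Python 3.9 or newer, where the builtin frozenset can be subscripted; on older versions typing.FrozenSet[str] is the equivalent spelling.

```python
from typing import Final

# Final tells type checkers the name must not be rebound;
# frozenset[str] records the element type (Python 3.9+).
DEFAULT_SKIP_DIRS: Final[frozenset[str]] = frozenset({
    ".git",
    "node_modules",
    "__pycache__",
    ".pytest_cache",
    ".venv",
    "venv",
    ".tox",
    "build",
    "dist",
    ".mypy_cache",
    "apm_modules",
})


def is_skipped(dir_name: str) -> bool:
    """Hypothetical caller: a checker now knows the members are str."""
    return dir_name in DEFAULT_SKIP_DIRS
```

The change affects static analysis only; the runtime value is the same frozenset as before.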
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
APM compilation was too slow on large workspaces with many generated files (for example, an Unreal Engine project, where it took more than 30 minutes). These changes take excluded directories into account when scanning folders, which in my case (together with #871) made apm compile at least 10x faster.
Replace glob.glob(recursive=True) in find_primitive_files() with os.walk and early directory pruning so compilation.exclude patterns prevent traversal into large subtrees. Adds DEFAULT_SKIP_DIRS constant. 125 new unit tests.