perf(compile): cache file walk and fix placement for narrow patterns #871
Roozi489 wants to merge 3 commits into microsoft:main
Replace glob.glob(recursive=True) in find_primitive_files() with an os.walk loop that prunes directories before descending into them. This allows compilation.exclude patterns (and the new DEFAULT_SKIP_DIRS constant) to short-circuit traversal into large subtrees -- e.g. a UE5 game repo with 265K files no longer hangs because node_modules / build / apm_modules are skipped at the first encountered node.

Changes:
- constants.py: add DEFAULT_SKIP_DIRS frozenset (replaces duplicate inline sets in discovery.py and context_optimizer.py)
- discovery.py: os.walk + early prune; thread exclude_patterns into traversal so callers no longer need a post-filter pass; add _glob_match() helper for ** zero-segment matching
- test_discovery_walk.py: new unit tests covering prune behaviour, exclude_patterns, symlink rejection, and ** pattern matching

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
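The "** zero-segment matching" mentioned for the _glob_match() helper can be illustrated with a small segment-wise matcher. This is only a sketch under assumptions: the name `glob_match_sketch` is hypothetical, and the PR's actual helper may differ in signature and edge-case handling.

```python
from fnmatch import fnmatchcase
from functools import lru_cache


def glob_match_sketch(rel_path: str, pattern: str) -> bool:
    """Match a '/'-separated relative path against a glob pattern.

    Hypothetical illustration, not the PR's _glob_match(): '**' matches
    zero or more whole path segments; every other pattern segment is
    matched against exactly one path segment with fnmatch rules.
    """
    path_parts = tuple(rel_path.split("/"))
    pat_parts = tuple(pattern.split("/"))

    @lru_cache(maxsize=None)
    def match(pi: int, si: int) -> bool:
        if pi == len(pat_parts):
            # Pattern exhausted: succeed only if the path is exhausted too.
            return si == len(path_parts)
        if pat_parts[pi] == "**":
            # Either '**' consumes zero segments, or it consumes one
            # segment and stays active for the next one.
            return match(pi + 1, si) or (
                si < len(path_parts) and match(pi, si + 1)
            )
        if si == len(path_parts):
            return False
        return fnmatchcase(path_parts[si], pat_parts[pi]) and match(pi + 1, si + 1)

    return match(0, 0)
```

With this sketch, `**/*.instructions.md` matches both a file at the repository root and one nested several directories deep, while a plain `*.py` does not cross a `/`.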
Build a _directory_files_cache during _analyze_project_structure() so that _cached_glob(), _find_matching_directories(), and _directory_matches_pattern() all work from the same in-memory file list instead of issuing repeated os.walk / iterdir() calls against disk.

Key changes in context_optimizer.py:
- _analyze_project_structure: move dirs[:] pruning BEFORE the depth / exclusion checks so os.walk never descends into excluded subtrees; populate _directory_files_cache[dir] = [file, ...] for later use
- _cached_glob: replaces glob.glob(cwd=base_dir) with a scan of _directory_files_cache using _glob_match() from discovery.py
- _find_matching_directories: fast path for ** patterns derives directory hits from the cached glob set (no iterdir()); slow path for non-recursive patterns iterates cached files
- _calculate_optimization_stats: rewrite O(N^2) efficiency loop to O(N) using pre-computed pattern_dir_sets from _pattern_cache
- _optimize_low_distribution_placement: go straight to _find_minimal_coverage_placement (lowest common ancestor) instead of the pollution-scored candidate search that biased toward root; fixes instruction files for narrow applyTo globs landing at ./ when all matching files live under a specific subtree
- Drop local DEFAULT_EXCLUDED_DIRNAMES; use DEFAULT_SKIP_DIRS from constants (introduced in perf/discovery-prune)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
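The caching idea in this commit can be sketched as one pruned walk that records file names per directory, followed by glob queries answered from memory. The function names, the `SKIP_DIRS` subset, and the use of `fnmatchcase` as the default matcher are all assumptions made for illustration; the PR itself uses _glob_match() and DEFAULT_SKIP_DIRS, whose exact contents are not shown here.

```python
import os
from fnmatch import fnmatchcase

# Illustrative subset only; the real DEFAULT_SKIP_DIRS lives in constants.py.
SKIP_DIRS = frozenset({".git", "node_modules", "__pycache__"})


def build_directory_files_cache(base_dir):
    """Walk base_dir once and record the file names found in each directory."""
    cache = {}
    for current, dirs, files in os.walk(base_dir):
        # Prune in place first, so os.walk never descends into skipped subtrees.
        dirs[:] = sorted(d for d in dirs if d not in SKIP_DIRS)
        rel_dir = os.path.relpath(current, base_dir).replace(os.sep, "/")
        cache[rel_dir] = sorted(files)
    return cache


def cached_glob(cache, pattern, match=fnmatchcase):
    """Answer a glob query from the cache without touching the disk.

    The default matcher lets '*' cross '/', which is a simplification;
    pass a stricter matcher for segment-aware semantics.
    """
    hits = []
    for rel_dir, files in cache.items():
        for name in files:
            rel_path = name if rel_dir == "." else f"{rel_dir}/{name}"
            if match(rel_path, pattern):
                hits.append(rel_path)
    return sorted(hits)
```

Because every later query reads the same dictionary, the cost of the filesystem walk is paid once per compile rather than once per pattern.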
@Roozi489 please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.
Pull request overview
This PR improves apm compile performance and correctness by caching filesystem traversal results and adjusting low-distribution instruction placement to use lowest-common-ancestor coverage, reducing repeated os.walk/iterdir() work and avoiding incorrect root placements for narrow applyTo patterns.
Changes:
- Switch primitive discovery `find_primitive_files()` from `glob.glob(recursive=True)` to `os.walk` with early directory pruning and shared skip dirs.
- Add `DEFAULT_SKIP_DIRS` constant and use it to prune traversal in both primitives discovery and compilation analysis.
- Update `ContextOptimizer` to cache per-directory file lists and use them for glob matching, directory matching, and stats computation; adjust low-distribution placement to use minimal-coverage (LCA) placement.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/unit/primitives/test_discovery_walk.py | Adds unit tests for the new walk-based discovery helpers and verifies ContextOptimizer’s cached glob behavior. |
| src/apm_cli/primitives/discovery.py | Implements _glob_match and updates find_primitive_files() to walk/prune instead of glob.glob(). |
| src/apm_cli/constants.py | Introduces DEFAULT_SKIP_DIRS to centralize unconditional traversal skips. |
| src/apm_cli/compilation/context_optimizer.py | Adds _directory_files_cache and rewires _cached_glob / matching / stats; changes single-point placement to use LCA. |
| CHANGELOG.md | Documents the behavior/perf changes under Unreleased. |
| Strategy: Place at the lowest common ancestor of all matching directories. | ||
| This is the most specific directory that still provides full hierarchical | ||
| coverage, avoiding pollution of unrelated subtrees. | ||
| """ | ||
| candidates = self._generate_all_candidates(matching_directories, instruction) | ||
| # Find the deepest directory that covers all matches | ||
| minimal_coverage = self._find_minimal_coverage_placement(matching_directories) | ||
| if minimal_coverage and minimal_coverage in self._directory_cache: | ||
| return [minimal_coverage] |
_optimize_single_point_placement() now uses lowest-common-ancestor placement, which is the crux of the bugfix described in the PR. There are existing tests for single-point placement and for LCA-at-root sibling coverage, but there is no test asserting that a non-root LCA is chosen when multiple matching directories exist under a deep subtree (the regression you mention, e.g. Engine/Plugins/PCG*/**/*). Add a unit test that constructs such a tree and asserts the placement directory is that subtree's LCA, not the project root.
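As a rough illustration of what such a test would check, the following sketch computes the lowest common ancestor of a set of relative directories. `lowest_common_ancestor` is a hypothetical stand-in, not the PR's _find_minimal_coverage_placement(), and the plugin directory names in the usage note are invented for the example.

```python
from pathlib import PurePosixPath


def lowest_common_ancestor(directories):
    """Return the deepest directory that is an ancestor of, or equal to, every input.

    Hypothetical helper for illustration; inputs are '/'-separated paths
    relative to the project root. Returns None for an empty input.
    """
    parts_list = [PurePosixPath(d).parts for d in directories]
    if not parts_list:
        return None
    common = []
    # zip stops at the shortest path, so the result is never deeper than any input.
    for segments in zip(*parts_list):
        if len(set(segments)) != 1:
            break
        common.append(segments[0])
    # No shared prefix means only the project root covers every match.
    return PurePosixPath(*common) if common else PurePosixPath(".")
```

A regression test in this spirit would build two matching directories such as `Engine/Plugins/PCGAlpha/Source` and `Engine/Plugins/PCGBeta/Content` and assert that the placement is `Engine/Plugins`, while two unrelated top-level directories should still resolve to the root.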
| ### Changed | ||
|
|
||
| - `find_primitive_files()` now uses `os.walk` with early directory pruning instead of `glob.glob(recursive=True)`, so `compilation.exclude` patterns prevent traversal into expensive subtrees (e.g. large game-engine asset trees). Adds shared `DEFAULT_SKIP_DIRS` constant used by both discovery and compilation. | ||
| - `ContextOptimizer` now builds a `_directory_files_cache` during `_analyze_project_structure()` and uses it for `_cached_glob`, `_find_matching_directories`, and `_directory_matches_pattern`, eliminating repeated `os.walk` / `iterdir()` calls. Directory pruning is moved before the depth check so excluded subtrees are never descended into. Stats loop rewritten from O(N^2) per-directory inheritance walk to O(N) with pre-computed pattern sets. Low-distribution placement now uses `_find_minimal_coverage_placement` directly instead of a pollution-scored candidate search that incorrectly biased toward the project root. |
The new changelog bullets under ## [Unreleased] do not follow the repo format rules: each entry should end with a PR reference like (#<number>), and be one logical change per line. Please add the appropriate PR number(s) to these entries (and consider splitting the long ContextOptimizer bullet if needed).
| # Match files against the requested patterns | ||
| for file_name in files: | ||
| file_path = current / file_name | ||
| rel_str = str(file_path.relative_to(base_path)).replace(os.sep, '/') | ||
| for pattern in patterns: | ||
| if _glob_match(rel_str, pattern): | ||
| all_files.append(file_path) | ||
| break |
find_primitive_files() iterates files from os.walk() without sorting, which can make primitive discovery order nondeterministic across filesystems. Because local primitives with the same name are conflict-resolved by first-seen order, this can lead to different winners on different machines. Sort files (and ideally return a consistently sorted valid_files) to preserve deterministic behavior similar to the old glob.glob() path ordering.
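One possible shape for the deterministic ordering suggested here is to sort both the pruned directory list and the file list at each level of the walk. This is a sketch only: `find_files_sorted`, the `SKIP_DIRS` subset, and the use of `fnmatchcase` in place of the PR's _glob_match() are assumptions for illustration.

```python
import os
from fnmatch import fnmatchcase
from pathlib import Path

# Illustrative subset only; the real DEFAULT_SKIP_DIRS lives in constants.py.
SKIP_DIRS = frozenset({".git", "node_modules", "__pycache__"})


def find_files_sorted(base_dir, patterns):
    """Walk base_dir with pruning and return matches in a stable order."""
    base_path = Path(base_dir)
    found = []
    for current, dirs, files in os.walk(base_path):
        # Sorting dirs in place fixes the traversal order; filtering prunes it.
        dirs[:] = sorted(d for d in dirs if d not in SKIP_DIRS)
        # Sorting files fixes the order within each directory.
        for file_name in sorted(files):
            file_path = Path(current) / file_name
            rel_str = file_path.relative_to(base_path).as_posix()
            if any(fnmatchcase(rel_str, p) for p in patterns):
                found.append(file_path)
    return found
```

Sorting inside the walk keeps first-seen conflict resolution stable across filesystems without needing a separate sort of the final list.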
| def _should_skip_directory(dir_path: str) -> bool: | ||
| """Check if a directory should be skipped during scanning. | ||
|
|
||
| Args: | ||
| dir_path (str): Directory path to check. | ||
|
|
||
| Returns: | ||
| bool: True if directory should be skipped, False otherwise. | ||
| """ | ||
| skip_patterns = { | ||
| '.git', | ||
| 'node_modules', | ||
| '__pycache__', | ||
| '.pytest_cache', | ||
| '.venv', | ||
| 'venv', | ||
| '.tox', | ||
| 'build', | ||
| 'dist', | ||
| '.mypy_cache' | ||
| } | ||
|
|
||
| dir_name = os.path.basename(dir_path) | ||
| return dir_name in skip_patterns | ||
| return dir_name in DEFAULT_SKIP_DIRS |
_should_skip_directory() is now unused (no references in the repo) and duplicates the DEFAULT_SKIP_DIRS check already done inline during traversal. Consider removing it to avoid dead code, or reintroduce a call site if it is meant to be the canonical skip check.
| # These never contain APM primitives or user source files and can be | ||
| # very large (e.g. node_modules, .git objects). Used by discovery, | ||
| # compilation, and content hashing to avoid expensive walks. | ||
| # NOTE: .apm is intentionally absent -- it is where primitives live. |
The comment for DEFAULT_SKIP_DIRS says these directories "never contain APM primitives", but apm_modules/ can contain dependency primitives (it is just intentionally excluded from local traversal). Reword the comment to reflect that these dirs are skipped because they are not relevant for project-source analysis / local primitive discovery. Also consider updating the docs default-exclusions list (e.g. reference/cli-commands.md) since venv/ is now an unconditional skip.
| # These never contain APM primitives or user source files and can be | |
| # very large (e.g. node_modules, .git objects). Used by discovery, | |
| # compilation, and content hashing to avoid expensive walks. | |
| # NOTE: .apm is intentionally absent -- it is where primitives live. | |
| # These directories are not relevant to project-source analysis or local | |
| # primitive discovery and may be very large (e.g. node_modules, .git | |
| # objects). Used by discovery, compilation, and content hashing to avoid | |
| # expensive walks. | |
| # NOTE: .apm is intentionally absent because local project primitives may | |
| # live there and should still be discovered. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@microsoft-github-policy-service agree |
Depends on #870.
Build a single _directory_files_cache during project analysis and use it for all subsequent glob matching, eliminating repeated os.walk/iterdir() calls. Also fixes instruction files with narrow applyTo globs (e.g. Engine/Plugins/PCG*/**/*) incorrectly landing at ./ instead of their target subtree -- _optimize_low_distribution_placement now uses _find_minimal_coverage_placement (lowest common ancestor) instead of a pollution-scored search that biased toward root. Stats loop rewritten from O(N^2) to O(N).