Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of SnapshotSubset #9706

Closed
Eric-Arellano opened this issue May 6, 2020 · 1 comment · Fixed by #9779
Closed

Improve performance of SnapshotSubset #9706

Eric-Arellano opened this issue May 6, 2020 · 1 comment · Fixed by #9779

Comments

@Eric-Arellano
Copy link
Contributor

As first discovered in #9702 (comment), calling await Get[Snapshot](SnapshotSubset(original_snapshot.digest, PathGlobs(original_snapshot.files) with 10k files in the snapshot takes 35 seconds.

To reproduce, use this diff and run ./v2 test src/python/pants/core/util_rules/strip_source_roots_test.py:

diff --git a/src/python/pants/core/util_rules/strip_source_roots_test.py b/src/python/pants/core/util_rules/strip_source_roots_test.py
index 77d4d28eb..fbd72b0e0 100644
--- a/src/python/pants/core/util_rules/strip_source_roots_test.py
+++ b/src/python/pants/core/util_rules/strip_source_roots_test.py
@@ -13,8 +13,9 @@ from pants.core.util_rules.strip_source_roots import (
 )
 from pants.core.util_rules.strip_source_roots import rules as strip_source_root_rules
 from pants.engine.addresses import Address
-from pants.engine.fs import EMPTY_SNAPSHOT
+from pants.engine.fs import EMPTY_SNAPSHOT, PathGlobs, Snapshot, SnapshotSubset
 from pants.engine.internals.scheduler import ExecutionError
+from pants.engine.rules import RootRule
 from pants.engine.selectors import Params
 from pants.engine.target import Sources as SourcesField
 from pants.testutil.option.util import create_options_bootstrapper
@@ -24,7 +25,7 @@ from pants.testutil.test_base import TestBase
 class StripSourceRootsTest(TestBase):
     @classmethod
     def rules(cls):
-        return (*super().rules(), *strip_source_root_rules())
+        return (*super().rules(), *strip_source_root_rules(), RootRule(SnapshotSubset))
 
     def get_stripped_files(
         self,
@@ -64,6 +65,15 @@ class StripSourceRootsTest(TestBase):
             "project_test/example.py"
         ]
 
+        import time
+
+        input_snapshot = self.make_snapshot_of_empty_files(f"f{n}.txt" for n in range(10_000))
+        start = time.time()
+        self.request_single_product(
+            Snapshot, SnapshotSubset(input_snapshot.digest, PathGlobs(input_snapshot.files))
+        )
+        raise ValueError(round(time.time() - start, 3))
+
         # Unrecognized source root
         unrecognized_source_root = "no-source-root/example.txt"
         assert get_stripped_files_for_snapshot([unrecognized_source_root]) == ["example.txt"]
cosmicexplorer added a commit that referenced this issue Jun 23, 2020
[ci skip-jvm-tests]

### Problem

Fixes #9706.

This PR blocks #9781.

### Solution

This PR is split into two sets of commits:
1. Add a [criterion](https://github.com/bheisler/criterion.rs) benchmark to test the issue described in #9706.
  - We add a benchmark for both the merge and subset operations.
2. Move many higher-level operations on `Snapshot` into a separate file `snapshot_ops.rs`, managed by a `SnapshotOps` trait.
  - Improve the performance of the snapshot `subset()` operation as described below.

In more depth:

#### Subsetting
Snapshot subsetting currently (with a glob) will have to recursively walk all subdirectories of a digest. When we select with the glob `src/**`, we needlessly traverse all files recursively within the directory `src/`, when really all we need to do is just select the single `src/` DirectoryNode from the containing Directory and create a new Directory proto from that, which is much, much less work.

The above was performed by partially reimplementing glob matching logic to support early termination in `snapshot_ops.py`. Notably, this solution requires no extra caches.

### Result

This output was produced by running `./build-support/bin/native/cargo bench --manifest-path=src/rust/engine/Cargo.toml --package store` on the first commit, then the last commit:
```
materialize_directory/materialize_directory
                        time:   [79.315 ms 82.801 ms 90.041 ms]
                        change: [+85.021% +102.67% +126.80%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high severe

snapshot_subset/wildcard
                        time:   [1.0343 ms 1.0643 ms 1.0909 ms]
                        change: [-99.882% -99.875% -99.866%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe

snapshot_merge/snapshot_merge
                        time:   [1.2689 s 1.3645 s 1.4771 s]
                        change: [+17.873% +23.450% +31.007%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
```
@Eric-Arellano
Copy link
Contributor Author

Thanks @cosmicexplorer for fixing this!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant