
Simplify glob matching and directly match in-memory globs as Patterns. #7402

Merged (4 commits) on Mar 20, 2019

4 participants
@stuhood (Member) commented Mar 19, 2019

### Problem

#7299 aligned the behaviour of in-memory (eager) and on-disk (lazy) glob matching, but unfortunately caused a serious performance regression in detection of file owners.

The on-disk (lazy) implementation of glob matching from `GlobMatchingImplementation::expand` was imposing about a 30x overhead when matching globs in memory... and unfortunately, we do that more often than I had realized.

### Solution

In two independently reviewable commits:

  1. Simplify `GlobMatchingImplementation::expand` based on the realization that the `PathGlob` caching we had been doing was a holdover from when we memoized `PathGlob`s in the graph, and would never hit due to the unique trailing components of each `PathGlob`.
  2. Introduce and use `PathGlobs::matches(Vec<PathBuf>) -> bool` to replace `MemFS::expand` with an eager operation, and expand `FilespecTest` to confirm that this redundant implementation does not go out of sync.
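The eager matching operation introduced in commit 2 can be illustrated with a small Python sketch (hypothetical helper names; this is not the Pants `Pattern` implementation): each glob is translated to a regex in which `**` may cross directory separators while `*` may not, and candidate paths are then filtered entirely in memory, with no filesystem traversal.

```python
import re

def glob_to_regex(pattern):
    # Sketch translation (not the actual Pants Pattern semantics):
    # '**' may cross '/' separators, '*' may not.
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out) + r"\Z")

def matches(pattern, paths):
    # Eagerly filter in-memory paths against the compiled pattern.
    regex = glob_to_regex(pattern)
    return [p for p in paths if regex.match(p)]
```

For example, `matches('foo*/bar', ['foofighters/bar', 'foofighters/baz/bar'])` keeps only the first path, consistent with the single-star semantics exercised in the tests discussed below.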

### Result

`GlobMatchingImplementation::expand` is ~3x faster (representing a 20% speedup for `BuildGraph` hydration in a large repository), and owners detection for a command like:

```
time ./pants --changed-diffspec='2c9c338cd7^..2c9c338cd7' list
```

... runs at approximately the speed that it did before #7299.

@stuhood stuhood marked this pull request as ready for review Mar 19, 2019

@stuhood stuhood added this to the 1.15.x milestone Mar 20, 2019

@stuhood stuhood force-pushed the twitter:stuhood/glob-matching-perf branch from e40e397 to 813b7cf Mar 20, 2019

@stuhood stuhood changed the title Simplify glob matching. Simplify glob matching and directly match in-memory globs as Patterns. Mar 20, 2019

@stuhood (Author, Member) commented Mar 20, 2019

This is reviewable: the commits are independent.

@cosmicexplorer (Contributor) left a comment

Question about why the test results are different, but this is a fantastic change: it makes glob matching significantly easier to read and extend, and the in-memory glob matching in `PathGlobs::matches` is a really nice operation to have.


```rust
expansion.outputs.extend(path_stats.clone());
let mut sources = Vec::new();
let mut roots = Vec::new();
```

@cosmicexplorer (Contributor) commented Mar 20, 2019

`.zip()` in general makes me more uncomfortable than it should, but one way to rewrite this would be to use a vec of tuples and avoid the zipping below:

```rust
let mut sources_with_roots = Vec::new();
for pgie in include {
  let source = Arc::new(pgie.input);
  for path_glob in pgie.globs {
    sources_with_roots.push((
      source.clone(),
      self.expand_single(result.clone(), exclude.clone(), path_glob),
    ));
  }
  // ...
  let matching_inputs = sources_with_roots.iter();
  // ...
}
```

This feels more clear to me, since the sources and roots are coming from the same place (and it feels easier to make edits to). It's not necessary at all.

@stuhood (Author, Member) replied Mar 20, 2019

I thought about doing this initially, but I think that the issue was that the futures in `sources_with_roots` would need to move into the `join_all`... and that would mean unzipping it anyway.
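The point above, that joining the results forces an unzip of the tuple vec anyway, has roughly this shape in a Python sketch (the values are placeholders standing in for the Rust sources and expansion futures):

```python
# Build a vec of (source, root) tuples as suggested in the review comment
# (placeholder values; in the Rust code these are Arc<..> sources and futures).
sources_with_roots = [("src_a", 1), ("src_b", 2), ("src_c", 3)]

# Joining the roots together still requires separating the two halves first,
# i.e. the tuple vec gets unzipped before the join step regardless.
sources, roots = map(list, zip(*sources_with_roots))
```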

Resolved review thread on src/rust/engine/fs/src/lib.rs (outdated)
```rust
/// NB: This implementation is independent from VFS::expand, and must be kept in sync via unit
/// tests. The lazy filesystem traversal in VFS::expand is (currently) too expensive to use for
/// in-memory matching (such as via MemFS).
///
```

@cosmicexplorer (Contributor) commented Mar 20, 2019

I don't know if I see why this needs to be kept in sync with `VFS::expand`. To me it seems perfectly natural to have a method that matches globs against literal paths, and to have that be much faster / in-memory compared to expanding those globs against an arbitrary filesystem. I would at least make it clear that this method does its matching quickly and in memory -- this comment is a little confusing to me.

If this method is extremely fast in comparison, one way to check the correctness of glob matching is to run this `.matches()` method on every path returned from `VFS::expand`, every time we expand a glob against the VFS, and raise some "internal error" message if there is a dissonance. I think this could be a good idea, but that might just be me.

Also, I think it's actually `GlobMatchingImplementation::expand()` now?
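The cross-checking idea suggested here could be sketched as follows (hypothetical names: `expand_results` stands in for the lazy expansion's output and `matches_fn` for the eager matcher; this is not Pants code):

```python
def check_glob_consistency(expand_results, matches_fn, globs):
    # Every path returned by the (lazy) filesystem expansion should also be
    # accepted by the (eager) in-memory matcher; a dissonance between the two
    # implementations is an internal error.
    for path in expand_results:
        if not matches_fn(globs, path):
            raise AssertionError(
                "internal error: expansion returned %r but the eager "
                "matcher rejected it for globs %r" % (path, globs)
            )
    return True
```

Such a check would only be worth running if the eager matcher is cheap relative to the expansion itself, which is exactly the performance property this PR establishes.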

@illicitonion (Contributor) commented Mar 20, 2019

It's not clear to me what needs to be kept in sync here, or what unit tests ensure it is. Can you flesh out this comment?

@stuhood (Author, Member) replied Mar 20, 2019

Expanded it a bit further to reference the test that keeps these in sync.

```diff
@@ -47,27 +62,27 @@ def test_matches_single_star_4_neg(self):
     self.assert_rule_match('foo*/bar', ('foofighters/baz/bar',), negate=True)

   def test_matches_double_star_0(self):
-    self.assert_rule_match('**', ('a/b/c', 'a'))
+    self.assert_rule_match('**', ('a/b/c', 'b'))
```

@cosmicexplorer (Contributor) commented Mar 20, 2019

Why are these results different?

@stuhood (Author, Member) replied Mar 20, 2019

This changed because, when it comes to real files on disk, `a` and `a/b/c` cannot both be files... `a` would need to be a directory. There is at least one other change for this reason in here.
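The file/directory constraint described here can be checked mechanically; a small sketch (hypothetical helper, not Pants code) that verifies no path in a set is also a directory prefix of another:

```python
def could_all_be_files(paths):
    # On a real filesystem, a file path cannot also be a directory prefix of
    # another file path: if 'a/b/c' is a file, then 'a' and 'a/b' must be
    # directories, so 'a' cannot itself appear as a file.
    prefixes = set()
    for p in paths:
        parts = p.split("/")
        for i in range(1, len(parts)):
            prefixes.add("/".join(parts[:i]))
    return not any(p in prefixes for p in paths)
```

Under this sketch, `could_all_be_files(["a/b/c", "a"])` is false while `could_all_be_files(["a/b/c", "b"])` is true, which is why the test data was updated.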

@illicitonion (Contributor) left a comment

Maybe we should stop using `PathBuf` and start using a `Vec<String>`, given how many times we're having to unwrap across the `String`/`PathBuf` boundary...

```rust
  path_stats,
  globs,
} = exp;
let result = Arc::new(Mutex::new(IndexSet::default()));
```

@illicitonion (Contributor) commented Mar 20, 2019

It seems strange having an `IndexSet` inside a `Mutex`; the `Mutex` suggests it will be written to from multiple threads, but the `IndexSet` suggests we care about the ordering of writes.

What ordering are we aiming for? It feels like we should probably have each future return a `Vec`, and then merge them into an `IndexSet` one at a time, root by root?

@stuhood (Author, Member) replied Mar 20, 2019

Good idea!

So: while it would be very functional to recursively merge `Vec`s as output from these functions, it felt like that would likely result in a bunch of allocation that could potentially be more expensive than the lock acquisition. I expect that we'll do further refactoring in here now that it is simplified, so I left this using the `Mutex` accumulator, but switched it to collecting into a `Vec` and sorting/deduping at the end (since I expect that dupes will be rare).
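The accumulator strategy described above (a lock-guarded collection extended by concurrent workers, with a single sort/dedupe pass at the end) can be sketched in Python; this is an illustration of the shape of the approach, not the Rust implementation:

```python
import threading

def collect_sorted_deduped(producers):
    # Workers extend one lock-guarded list; the result is sorted and
    # deduplicated once at the end, which is cheap when duplicates are rare.
    results = []
    lock = threading.Lock()

    def worker(produce):
        items = produce()
        with lock:
            results.extend(items)

    threads = [threading.Thread(target=worker, args=(p,)) for p in producers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    results.sort()
    # Drop adjacent equal entries after sorting (a simple dedupe).
    return [x for i, x in enumerate(results) if i == 0 or x != results[i - 1]]
```

The trade-off is exactly the one discussed: one shared accumulator avoids per-future allocations, at the cost of lock acquisitions during the parallel phase.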

@illicitonion (Contributor) replied Mar 20, 2019

Sounds good :)

The other option here would be to accumulate in an `Arc<Mutex<BTreeSet>>` and then just `result.into_iter().collect()` - I have no real preference :)


@illicitonion (Contributor) left a comment

Thanks!

@ity (Contributor) commented Mar 20, 2019

I did a brief read of this after you had already addressed comments - looking good to me!

@stuhood stuhood merged commit 8d64796 into pantsbuild:master Mar 20, 2019

1 check passed: continuous-integration/travis-ci/pr (The Travis CI build passed)

@stuhood stuhood deleted the twitter:stuhood/glob-matching-perf branch Mar 20, 2019

stuhood added a commit that referenced this pull request Mar 20, 2019

Simplify glob matching and directly match in-memory globs as Patterns. (#7402)
