New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move file owners computation into the engine and make lighter #6790

Merged
merged 3 commits into from Nov 20, 2018

Conversation

Projects
None yet
3 participants
@stuhood
Member

stuhood commented Nov 18, 2018

Problem

#5579 added an optimization to avoid hydrating targets, of which a significant portion was later rolled back by #5636. #5636 reasoned that it was necessary to fully construct the graph in order to detect when targets have been deleted.

The solution used in #5636 though, was heavy-handed. By requesting TransitiveHydratedTargets, we fully constructed the graph of dependencies and then later also constructed the dependent graph (in DependentGraph). Moreover, fully hydrating the targets meant unnecessarily fully expanding their source globs.

Solution

Create a find_owners rule that consumes an OwnersRequest to compute the owning targets of some sources (and, optionally, their transitive dependents). Moving this logic into an @rule allows us to directly consume the symbol_table_contraint, without a wrapping collection (as in #5579).

Additionally, switch back to requesting target adaptor subclasses while building the DependentGraph. The graph-validation that use to occur via requesting TransitiveHydratedTargets now occurs as an explicit step during construction of a DependentGraph. A TODO there explains an existing weakness of change-detection in the engine.

Result

./pants --changed-diffspec='49cc12e5cccc6164a2c4aa83c23300ad7254836f^..49cc12e5cccc6164a2c4aa83c23300ad7254836f' --changed-include-dependees=direct --glob-expansion-failure=ignore list

is 30 percent faster in the pantsbuild pants repo, primarily due to not expanding source globs.

@stuhood stuhood requested review from kwlzn, jsirois and illicitonion Nov 19, 2018

@illicitonion

Generally looks great, but a few things worth a little discussion - thanks for putting this together

def dependents_of_addresses(self, addresses):
"""Given an iterable of addresses, yield all of those addresses dependents."""
seen = set(addresses)

This comment has been minimized.

@illicitonion

illicitonion Nov 19, 2018

Contributor

This feels like the kind of thing that should probably have a consistent order, for downstream caching purposes? Maybe an OrderedSet?

This comment has been minimized.

@stuhood

stuhood Nov 19, 2018

Member

Rather than ordering these, I think I'll go ahead and just sort the output for consumers.

This comment has been minimized.

@stuhood

stuhood Nov 19, 2018

Member

Nevermind... will use an OrderedSet.

def transitive_dependents_of_addresses(self, addresses):
"""Given an iterable of addresses, yield all of those addresses dependents, transitively."""
addresses_to_visit = set(addresses)

This comment has been minimized.

@illicitonion

illicitonion Nov 19, 2018

Contributor

Possibly worth separating this into addresses_to_visited and visited_addresses so that we don't keep re-visiting already-visited addresses. Possibly not.

addresses = dependents.difference(addresses_to_visit)
addresses_to_visit.update(dependents)
return self.dependents_of_addresses(addresses_to_visit)

This comment has been minimized.

@illicitonion

illicitonion Nov 19, 2018

Contributor

This last call should always no-op, right? Skip it and just return addresses_to_visit?

This comment has been minimized.

@stuhood

stuhood Nov 19, 2018

Member

Going to leave this and the previous comment as is, because I only lightly touched this logic while moving it, and would rather not touch it more aggressively here.

def transitive_dependents_of_addresses(self, addresses):
"""Given an iterable of addresses, yield all of those addresses dependents, transitively."""
addresses_to_visit = set(addresses)

This comment has been minimized.

@illicitonion

illicitonion Nov 19, 2018

Contributor

Again, ordered?

# NB: Deleted files can only be matched against the 'filespec' (ie, `PathGlobs`) for a target,
# so we don't actually call `fileset.matches` here.
target_sources = target_kwargs.get('sources', None)
if target_sources and any_matches_filespec(sources_set, target_sources.filespec):

This comment has been minimized.

@illicitonion

illicitonion Nov 19, 2018

Contributor

I don't love that this means we have two implementations of glob matching (one in Python, one in Rust) which do disagree for some excludes... (Python does pure regex, Rust does git-style ignore).

At a brief inspection, owners finding and the interpreter cache are the only things which use the python way - can we remove the python implementation (or only use it for the interpreter cache), and somehow delegate to the rust globs here? It shouldn't be a crazy thing to expose through the ffi...

This comment has been minimized.

@stuhood

stuhood Nov 19, 2018

Member

This was an existing organization: I'll add a TODO here.

def find_owners(symbol_table, address_mapper, owners_request):
sources_set = set(owners_request.sources)
dirs_set = {dirname(source) for source in sources_set}

This comment has been minimized.

@illicitonion

illicitonion Nov 19, 2018

Contributor

Ordered (and the line above)

@@ -296,6 +389,58 @@ class HydratedTargets(Collection.of(HydratedTarget)):
"""An intransitive set of HydratedTarget objects."""
class OwnersRequest(datatype([
('sources', tuple),
('include_dependees', str),

This comment has been minimized.

@illicitonion

illicitonion Nov 19, 2018

Contributor

Can we make this an enum-like thing, or override its __new__ to only allow three values (none, direct, transitive or something)?

This comment has been minimized.

@stuhood

stuhood Nov 19, 2018

Member

I just looked into doing this, and while it's definitely easier to do now, given that these values come from a global option, it would currently cause a fair amount of churn to enumify it without making things redundant. Leaving for now.

# If the OwnersRequest does not require dependees, then we're done: otherwise, find dependees.
if owners_request.include_dependees not in ('direct', 'transitive'):
# NB: A yield without a LHS represents an early return.

This comment has been minimized.

@illicitonion

illicitonion Nov 19, 2018

Contributor

Because this is non-obvious, I'd prefer the rest of the function to be in an else block (which is very against my ordinary code style - boo Python!)

This comment has been minimized.

@jsirois

jsirois Nov 19, 2018

Member

An alternative is:

yield last
return  # Never reached

But I think I agree with @illicitonion here with the added benefit of the assert owners_request.include_dependees == 'transitive' probably being un-needed at that point.

This comment has been minimized.

@stuhood

stuhood Nov 19, 2018

Member

It's not possible to mix yield and return... I'll nest the remainder of the body.

This comment has been minimized.

@jsirois

jsirois Nov 20, 2018

Member

It is when return returns nothing, it ends the generator. Right now the generator is only early return because the controlling code in nodes.rs stops its loop when it receives its 1st non Get*. It could continue to advance the generator in this code and the python generator machinery would not stop early but continue on to the Gets.

For example:

>>> def coroutine(exit_early):
...   if exit_early:
...     yield 42
...   yield 37
...   x = yield
...   y = yield
...   yield x * y
...   yield 1/137
... 
>>> cr_early = coroutine(True)
>>> next(cr_early)
42
>>> next(cr_early)
37
>>> next(cr_early)
>>> cr_early.send(3)
>>> cr_early.send(5)
15
>>> next(cr_early)
0
>>> next(cr_early)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

Versus an early return forced by the controlled code:

>>> def coroutine(exit_early):
...   if exit_early:
...     yield 42
...     return
...   yield 37
...   x = yield
...   y = yield
...   yield x * y
...   yield 1/137
... 
>>> cr_early = coroutine(True)
>>> next(cr_early)
42
>>> next(cr_early)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

Versus the syntax error you're thinking of:

>>> def coroutine(exit_early):
...   if exit_early:
...     yield 42
...     return 3
...   yield 37
...   x = yield
...   y = yield
...   yield x * y
...   yield 1/137
... 
  File "<stdin>", line 4
SyntaxError: 'return' with argument inside generator

This comment has been minimized.

@stuhood

stuhood Nov 20, 2018

Member

Ahhh. You are completely correct. Sorry.

@@ -417,6 +562,9 @@ def create_legacy_graph_tasks(symbol_table):
"""Create tasks to recursively parse the legacy graph."""
symbol_table_constraint = symbol_table.constraint()
partial_find_owners = functools.partial(find_owners, symbol_table)
partial_find_owners.__name__ = find_owners.__name__

This comment has been minimized.

@illicitonion

illicitonion Nov 19, 2018

Contributor

Maybe extract a helper for this?

This comment has been minimized.

@jsirois

jsirois Nov 19, 2018

Member

Done already by stdlib: functools.update_wrapper(functools.partial(find_owners, symbol_table), find_owners).

@@ -430,6 +578,17 @@ def create_legacy_graph_tasks(symbol_table):
Get(HydratedField, BundlesField),
]
),
TaskRule(

This comment has been minimized.

@illicitonion

illicitonion Nov 19, 2018

Contributor

When will this TaskRule thing go away? It feels like it's here because find_owners isn't decorated with @rule? Is that because params aren't properly injectable or something?

I don't like how much of this duplicates information from the body of the function itself...

This comment has been minimized.

@stuhood

stuhood Nov 19, 2018

Member

Is that because params aren't properly injectable or something?

Correct. This is here because of symbol_table_constraint, which is the same underlying problem as #4535. Need a sealed universe of types that are legal in a particular position: in this case, we manually build up and pass around the set of types that can legally be parsed from a BUILD file. I'll add a comment on #4535 to that effect.

@@ -417,6 +562,9 @@ def create_legacy_graph_tasks(symbol_table):
"""Create tasks to recursively parse the legacy graph."""
symbol_table_constraint = symbol_table.constraint()
partial_find_owners = functools.partial(find_owners, symbol_table)
partial_find_owners.__name__ = find_owners.__name__

This comment has been minimized.

@jsirois

jsirois Nov 19, 2018

Member

Done already by stdlib: functools.update_wrapper(functools.partial(find_owners, symbol_table), find_owners).

@classmethod
def changed_files(cls, scm, changes_since=None, diffspec=None):
"""Determines the files changed according to SCM/workspace and options."""
workspace =ScmWorkspace(scm)

This comment has been minimized.

@jsirois

jsirois Nov 19, 2018

Member

s/=ScmWorkspace/= ScmWorkspace/

def changed_files(cls, scm, changes_since=None, diffspec=None):
"""Determines the files changed according to SCM/workspace and options."""
workspace =ScmWorkspace(scm)
diffspec = diffspec or diffspec

This comment has been minimized.

@jsirois

jsirois Nov 19, 2018

Member

No-op - kill.

changed_addresses = change_calculator.changed_target_addresses(changed_request)
request = OwnersRequest(sources=tuple(changed_files),
include_dependees=str(changed_request.include_dependees))
changed_addresses, = session.product_request(BuildFileAddresses, [request])

This comment has been minimized.

@jsirois

jsirois Nov 19, 2018

Member

Not for this change, but it seems like a session.product_request_single should exist to avoid the awkwardness of the more explicit changed_addresses = ...; assert len(changed_addresses) == 1; changed_address = changed_addresses[0] and the visual subtlety of a trailing comma singleton tuple assert/extract.

def transitive_dependents_of_addresses(self, addresses):
"""Given an iterable of addresses, yield all of those addresses dependents, transitively."""
addresses_to_visit = set(addresses)
while 1:

This comment has been minimized.

@jsirois

jsirois Nov 19, 2018

Member

Its hard to swallow 1 vs True when, for example, self.dependents_of_addresses isn't hoisted into a local var to avoid __dict__ lookup on each call. Its a partial optimization making things harder to read and the likely bigger optimization is skipped.

This comment has been minimized.

@stuhood

stuhood Nov 19, 2018

Member

This is an artifact of the previous implementation, but agreed. Will fix.

)
)
def _inject_target(self, target_adaptor):

This comment has been minimized.

@jsirois

jsirois Nov 19, 2018

Member

It would be nice to hoist _inject_target and _validate into classmethods / staticmethods that prepared the dicts to keep the factory / instance line sharp.

This comment has been minimized.

@stuhood

stuhood Nov 19, 2018

Member

Agreed, but this was the previous impl... going to leave for now.

sources_set = set(owners_request.sources)
dirs_set = {dirname(source) for source in sources_set}
# Walk up the buildroot looking for targets that would conceivably claim changed sources.
candidate_specs = tuple(AscendantAddresses(directory=d) for d in dirs_set)

This comment has been minimized.

@jsirois

jsirois Nov 19, 2018

Member

This is not used by owns_any_source. How about moving it together with it's one use to yield hydrated_targets. IE: Move owns_any_source down in-between hydrated_targets = ... and direct_owners = ....

# If the OwnersRequest does not require dependees, then we're done: otherwise, find dependees.
if owners_request.include_dependees not in ('direct', 'transitive'):
# NB: A yield without a LHS represents an early return.

This comment has been minimized.

@jsirois

jsirois Nov 19, 2018

Member

An alternative is:

yield last
return  # Never reached

But I think I agree with @illicitonion here with the added benefit of the assert owners_request.include_dependees == 'transitive' probably being un-needed at that point.

@stuhood

This comment has been minimized.

Member

stuhood commented Nov 19, 2018

Hey gang, thanks for the review! I failed to preserve it via the commits, but one thing to note is that about 80% of this is moved code (DependentGraph moved between files, and the body of find_owners came from SourceMapper), which is the reason for some of the idioms. I changed some things, but didn't want to change all of it.

There is a lot of good feedback, but to reduce churn and lower risk, I'd prefer to avoid doing a total overhaul here.

@illicitonion

This comment has been minimized.

Contributor

illicitonion commented Nov 19, 2018

I think the sortedness is an important thing to address before merging, because I don't like the idea of nodes in the engine being non-deterministic.

I'd also like to remove the python filespec implementation pretty eagerly (if not before merging, in a very fast follow), again because of functional inconsistency...

The others, happy to hold off on :)

@stuhood

This comment has been minimized.

Member

stuhood commented Nov 19, 2018

Have addressed all of the ordering issues, and left some of the other comments unaddressed. Based on @illicitonion's comments though, I'm going to go ahead and merge this on green.

@stuhood

This comment has been minimized.

Member

stuhood commented Nov 19, 2018

I'd also like to remove the python filespec implementation pretty eagerly (if not before merging, in a very fast follow), again because of functional inconsistency...

Agreed, but this split has existed for a long time. Opened #6795

@illicitonion

Thanks!

@stuhood stuhood merged commit 13a1ce8 into pantsbuild:master Nov 20, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@stuhood stuhood deleted the twitter:stuhood/find-owners-rule branch Nov 20, 2018

@stuhood stuhood added this to the 1.11.x milestone Nov 20, 2018

stuhood added a commit that referenced this pull request Nov 20, 2018

Move file owners computation into the engine and make lighter (#6790)
The solution used in #5636 though, was heavy-handed. By requesting `TransitiveHydratedTargets`, we fully constructed the graph of dependencies and then later _also_ constructed the dependent graph (in `DependentGraph`). Moreover, fully hydrating the targets meant unnecessarily fully expanding their source globs.

Create a `find_owners` rule that consumes an `OwnersRequest` to compute the owning targets of some sources (and, optionally, their transitive dependents). Moving this logic into an `@rule` allows us to directly consume the symbol_table_contraint, without a wrapping collection (as in #5579).

Additionally, switch back to requesting target adaptor subclasses while building the `DependentGraph`. The graph-validation that use to occur via requesting `TransitiveHydratedTargets` now occurs as an explicit step during construction of a `DependentGraph`. A TODO there explains an existing weakness of change-detection in the engine.

```
./pants --changed-diffspec='49cc12e5cccc6164a2c4aa83c23300ad7254836f^..49cc12e' --changed-include-dependees=direct --glob-expansion-failure=ignore list
```
is 30 percent faster in the pantsbuild pants repo, primarily due to not expanding source globs.

wisechengyi added a commit that referenced this pull request Nov 20, 2018

Move file owners computation into the engine and make lighter (#6790)
### Problem

#5579 added an optimization to avoid hydrating targets, of which a significant portion was later rolled back by #5636. #5636 reasoned that it was necessary to fully construct the graph in order to detect when targets have been deleted.

The solution used in #5636 though, was heavy-handed. By requesting `TransitiveHydratedTargets`, we fully constructed the graph of dependencies and then later _also_ constructed the dependent graph (in `DependentGraph`). Moreover, fully hydrating the targets meant unnecessarily fully expanding their source globs.

### Solution

Create a `find_owners` rule that consumes an `OwnersRequest` to compute the owning targets of some sources (and, optionally, their transitive dependents). Moving this logic into an `@rule` allows us to directly consume the symbol_table_contraint, without a wrapping collection (as in #5579).

Additionally, switch back to requesting target adaptor subclasses while building the `DependentGraph`. The graph-validation that use to occur via requesting `TransitiveHydratedTargets` now occurs as an explicit step during construction of a `DependentGraph`. A TODO there explains an existing weakness of change-detection in the engine.

### Result

```
./pants --changed-diffspec='49cc12e5cccc6164a2c4aa83c23300ad7254836f^..49cc12e' --changed-include-dependees=direct --glob-expansion-failure=ignore list
```
is 30 percent faster in the pantsbuild pants repo, primarily due to not expanding source globs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment