New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for reusing Graph node values if their inputs haven't changed #6059

Merged
merged 8 commits into from Jul 6, 2018

Conversation

Projects
None yet
2 participants
@stuhood
Copy link
Member

stuhood commented Jul 2, 2018

Problem

As described in #4558, we currently completely delete Nodes from the Graph when their inputs have changed.

One concrete case where this is problematic is that all Snapshots in the graph end up with a dependency on the scandir outputs of all of their parent directories, because we need to expand symlinks recursively from the root when consuming a Path (in order to see whether any path component on the way down is a symlink). This means that changes anywhere above a Snapshot invalidate that Snapshot, and changes at the root of the repo invalidate all Snapshots (although 99% of the syscalls they depend on are not invalidated, having no dependencies of their own).

But this case is just one of many cases affected by the current implementation: there are many other times where we re-compute more than we should due to the current Node invalidation strategy.

Solution

Implement node "dirtying", as described on #4558.

There are a few components to this work:

  • In addition to being Entry::cleared (which will force a Node to re-run), a Node may be Entry::dirtyed. A "dirty" Node is eligible to be "cleaned" if its dependencies have not changed since it was dirtied.
  • Each Node records a Generation value that acts as proxy for "my output has changed". The Node locally maintains this value, and when a Node re-runs for any reason (either due to being dirtied or cleared), it compares its new output value to its old output value to determine whether to increment the Generation.
  • Each Node records the Generation values of the dependencies that it used to run, at the point when it runs. When a dirtied Node is deciding whether to re-run, it compares the previous generation values of its dependencies to their current dependency values: if they are equal, then the Node can be "cleaned": ie, its previous value can be used without re-running it.

This patch also expands the testing of Graph to differentiate dirtying a Node from clearing it, and confirms that the correct Nodes re-run in each of those cases.

Result

Cleaning all Nodes involved in ./pants list :: after touching pants.ini completes 6 times faster than recomputing them from scratch (56 seconds vs 336 seconds in our repository). More gains are likely possible by implementing the performance improvement(s) described on #6013.

Fixes #4558 and fixes #4394.

@stuhood stuhood force-pushed the twitter:stuhood/dirty-nodes branch from 1f2363c to 8e0e2b7 Jul 2, 2018

@stuhood stuhood force-pushed the twitter:stuhood/dirty-nodes branch from c1cafb1 to 1494abf Jul 3, 2018

@stuhood stuhood changed the title WIP: Add support for reusing Graph node values if their inputs haven't changed Add support for reusing Graph node values if their inputs haven't changed Jul 3, 2018

@stuhood

This comment has been minimized.

Copy link
Member

stuhood commented Jul 3, 2018

This depends on #6061 and #6013, so the reviewable portion starts after the "Remove inaccurate __eq__ implementations." commit.

stuhood added a commit that referenced this pull request Jul 3, 2018

Remove and prevent inaccurate __eq__ implementations on datatype (#6061)
### Problem

Overriding `__eq__` on `datatype` violates structural equality, and the assumptions that people have about `datatype` instances. Moreover, it is very error prone in the presence of #6059, which will be using `__eq__` to determine whether objects have changed.

### Solution

Make it impossible to override `__eq__` accidentally on `datatype` by attaching a canary to the `__eq__` method definition on the generated class. Remove a few `__eq__` implementations that violated structural equality and were thus causing issues in #6059.

### Result

`datatypes` behave as expected more frequently, and bugs are avoided. There is no noticeable impact on performance in my testing (likely these were overridden back when _every_ object ended up memoized, rather than just `Key` instances).
@stuhood

This comment has been minimized.

Copy link
Member

stuhood commented Jul 3, 2018

Now just depends on #6013: the reviewable portion begins after "Add an invalidation unit test to replace some integration-level tests...". Individual commits should be (relatively) useful.

@stuhood stuhood force-pushed the twitter:stuhood/dirty-nodes branch from 1494abf to 23b4051 Jul 3, 2018

@illicitonion
Copy link
Contributor

illicitonion left a comment

Looks very nice!

/// generation is recorded on the consuming edge, and can later used to determine whether the
/// inputs to a node have changed.
///
/// Unlike the RunToken (which is incremented whenever a node re-runs), the Generation is only

This comment has been minimized.

@illicitonion

illicitonion Jul 3, 2018

Contributor

I'd be tempted to name these RunGeneration and OutputGeneration or something similarly contrasting.

This comment has been minimized.

@stuhood

stuhood Jul 6, 2018

Member

They have very different uses, and aren't really related to one another at that level... so would rather keep the existing separation.

NotStarted {
run_token: RunToken,
generation: Generation,
previous_result: Option<Result<N::Item, N::Error>>,

This comment has been minimized.

@illicitonion

illicitonion Jul 3, 2018

Contributor

This could be a shared-across-instances Arc<Mutex<Vec<Result<N::Item, N::Error>> to get more than n-1 re-use of generations, but unclear what usage patterns are actually going to be to know which is going to be more efficient. (If people flip up and back a lot, a Vec saves a lot of computation; if they tend to make linear changes, the Mutex and linear scan may be more costly)

(An Arc<Mutex<HashMap<Vec<Generation>, Result<N::Item, N::Error>>>> would come to mind, except for how we learnt about the minimum sizes of HashMaps)

This comment has been minimized.

@stuhood

stuhood Jul 6, 2018

Member

As discussed in Slack, I'd like to hold off on storing multiple values in this first edition. We can easily add it later if we're not seeing any unexpected memory pressure due to just holding one.

let run_token = run_token.next();
match entry_key {
&EntryKey::Valid(ref n) => {
let context2 = context.clone_for(entry_id);

This comment has been minimized.

@illicitonion

illicitonion Jul 3, 2018

Contributor

We've generally been using context2 to mean "clone of context which I was required to make because of annoying move semantics"; possibly call this context_for_entry_id?

This comment has been minimized.

@stuhood

stuhood Jul 6, 2018

Member

Fixed.

// NB: The unwrap here avoids a clone and a comparison: see the method docs.
(
generation,
previous_result.unwrap_or_else(|| {

This comment has been minimized.

@illicitonion

illicitonion Jul 3, 2018

Contributor

previous_result.expect("A node cannot be marked clean without a previous result.")?

@@ -152,7 +160,7 @@ impl fmt::Debug for Value {
}
}

#[derive(Debug, Clone)]
#[derive(Clone, Debug, Eq, PartialEq)]

This comment has been minimized.

@illicitonion

illicitonion Jul 3, 2018

Contributor

I think this can also be Copy?

This comment has been minimized.

@stuhood

stuhood Jul 3, 2018

Member

Throw contains a Value with a python exception in it, so it's good to be explicit about copies.

@@ -487,6 +508,7 @@ impl<N: Node> InnerGraph<N> {
self.nodes.get(node)
}

// TODO: Now that we never delete Entries, we should consider making this infalliable.

This comment has been minimized.

@illicitonion

illicitonion Jul 3, 2018

Contributor

We're probably going to want to start deleting entries again at some point for memory reasons, just in a more principled/less aggressive way

run_token: RunToken,
generation: Generation,
start_time: Instant,
waiters: Vec<oneshot::Sender<Result<(N::Item, Generation), N::Error>>>,

This comment has been minimized.

@illicitonion

illicitonion Jul 3, 2018

Contributor

Does this signature suggest that Errors are always a new generation, even if they're the same? Possibly this should be a oneshot::Sender<(Result<N::Item, N::Error>, Generation)> so that identical errors don't trigger downstream invalidations?

future::ok(()).to_boxed()
} else {
// The Node needs to (re-)run!
let context = context2.clone();

This comment has been minimized.

@illicitonion

illicitonion Jul 3, 2018

Contributor

Definitely don't re-use the name context here, as it's actually different from context in the enclosing scope

This comment has been minimized.

@stuhood

stuhood Jul 6, 2018

Member

Fixed.

.map(move |res| (res, generation))
.to_boxed();
}
_ => {}

This comment has been minimized.

@illicitonion

illicitonion Jul 3, 2018

Contributor

Comment in here that we're falling through to the second match?

enum EntryState<N: Node> {
NotStarted(RunToken),
// A node that has either been explicitly cleared, or has not yet started Running for the first
// time. In this state there is no need for a dirty bit because the generation is either in its

This comment has been minimized.

@stuhood

stuhood Jul 3, 2018

Member

Note to self: the Generation is not incremented when a node is cleared: only the RunToken.

@stuhood stuhood force-pushed the twitter:stuhood/dirty-nodes branch from 23b4051 to 73d017c Jul 3, 2018

stuhood added some commits Jun 24, 2018

Record `Generation` values per Node, and increment them by comparing …
…a Node's result to a stored `previous_result`.
Record dependency Generations when we complete a Node, for comparison…
… after we dirty a Node.

TODO: Storing these in a Vec on the state is potentially less efficient
than storing them on the edges in the graph, since that avoids a bunch of
extra allocation.
Make invalidation test more useful by having the context record which…
… Nodes ran, and having Nodes include a Context.id in their output.

@stuhood stuhood force-pushed the twitter:stuhood/dirty-nodes branch from 73d017c to c42e376 Jul 3, 2018

@stuhood

This comment has been minimized.

Copy link
Member

stuhood commented Jul 3, 2018

This no longer has dependencies, and is definitely reviewable. Thanks!

@stuhood

This comment has been minimized.

Copy link
Member

stuhood commented Jul 3, 2018

Note to self: two bugs to fix here.

  1. Files that do not exist do not always seem to be invalidated correctly (this is causing the wiki page target failure in travis)
  2. The invalidate_randomly test can occasionally experience nodes being concurrently invalidated. Likely this calls for the same kind of retry that we use in Scheduler::execute.
Add an `invalidate_all` method to invalidate all filesystem nodes. Ea…
…gerly prune edges for cleared entries, and lazily prune edges for dirtied entries.

@stuhood stuhood force-pushed the twitter:stuhood/dirty-nodes branch from f2f0a76 to 9d090ac Jul 6, 2018

stuhood added some commits Jul 6, 2018

Retry `create` while concurrent invalidations are occurring. The Sche…
…duler already accounts for this, and it seems reasonable to not bake retry into the Graph.

@stuhood stuhood merged commit 14a3f94 into pantsbuild:master Jul 6, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@stuhood stuhood deleted the twitter:stuhood/dirty-nodes branch Jul 6, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment