
Clean the graph speculatively, and cancel nodes when interest is lost #11308

Merged: 4 commits into pantsbuild:master from stuhood/cancel-cleaning on Dec 15, 2020

Conversation

stuhood commented Dec 15, 2020

Problem

#11290 describes a case where attempting to clean a (relatively) long-running uncacheable node (a test that has failed) causes us to run the test twice. This is caused by graph cleaning: the attempt to clean the graph requests all of the node's previous dependencies in parallel, but waits until they have all returned before determining whether the node could be cleaned. Concretely, this meant that although the sources of the test had changed, we would re-run the (uncacheable) failed test to completion before noticing the changed sources, and then run it a second time with the changed sources.

Solution

As mentioned on #11290, [other monadic build systems](https://neilmitchell.blogspot.com/2020/11/data-types-for-build-system-dependencies.html) have taken the approach of sequencing requests for dependencies during cleaning, rather than requesting them in parallel. In this case, that would mean sequentially requesting the sources of the node before running the test. But because Pants uses future-based concurrency aggressively, it is more challenging for us to preserve the structure of an `@rule` `MultiGet` all the way down into a single request to the `Graph`: without custom tuple-based combinators, we would lose type safety when we needed to request a `Vec<Node<_>>` rather than `future::join_all`ing a series of type-safe requests.

Instead, this change takes a different strategy and [speculates](https://en.wikipedia.org/wiki/Speculative_execution) while cleaning. All dependencies are still requested in parallel, but we now fail fast as soon as we detect a mismatch, and all speculative branches that were incorrect are canceled. When compared to sequentially requesting dependencies, this should almost always be faster (as long as the total number of processes being cleaned is smaller than the constraints we have on concurrent processes: `--process-execution-local-parallelism`). The case where more tests are being speculated on than there are concurrency slots is also interesting, but less important than per-test latency; followup work can improve our prioritization of processes to address that case (see #11303).
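
As an illustration of the fail-fast shape, here is a minimal sketch assuming the `futures` crate; the function and its inputs are illustrative, not the engine's cleaning code. As soon as one dependency check reports a mismatch, returning early drops the remaining in-flight checks, which is what cancels them.

```rust
// Assumed dependency: futures = "0.3".
use futures::stream::{FuturesUnordered, StreamExt};

/// Resolves to true only if every dependency check reports a match, and
/// resolves as soon as the first mismatch arrives: returning early drops the
/// stream, which drops (and thereby cancels) every check still in flight.
async fn all_deps_match<F>(checks: Vec<F>) -> bool
where
    F: std::future::Future<Output = bool>,
{
    let mut in_flight: FuturesUnordered<F> = checks.into_iter().collect();
    while let Some(matched) = in_flight.next().await {
        if !matched {
            // Fail fast: the remaining speculative checks are canceled here.
            return false;
        }
    }
    true
}

fn main() {
    use futures::future::ready;
    // Three already-resolved checks; the single mismatch short-circuits.
    let checks = vec![ready(true), ready(false), ready(true)];
    assert!(!futures::executor::block_on(all_deps_match(checks)));
}
```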

In order to cancel work in the `Graph` based on interest, we add an `AsyncValue` type that implements a model where, if all "waiters" for the computed value go away, the work is canceled. This interest-based cancellation also represents 90% of the work for #11252.
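
A minimal sketch of the interest-based model, assuming tokio 1.x (this is not the actual `AsyncValue`, which stores a `Weak<watch::Receiver<...>>` and an abort sender, but the shape is the same): the work races its computation against the moment the last receiver goes away.

```rust
// Assumed dependency: tokio = { version = "1", features = ["full"] }.
use std::time::Duration;
use tokio::sync::watch;

#[tokio::main]
async fn main() {
    let (tx, rx) = watch::channel::<Option<u32>>(None);

    // The "work": races the real computation against loss of interest.
    let work = tokio::spawn(async move {
        tokio::select! {
            // Every receiver has been dropped: nobody is waiting, so cancel.
            _ = tx.closed() => None,
            // The (slow) computation itself.
            _ = tokio::time::sleep(Duration::from_secs(60)) => {
                let _ = tx.send(Some(42));
                Some(42)
            }
        }
    });

    // Dropping the only receiver withdraws interest and cancels the work.
    drop(rx);
    assert_eq!(work.await.unwrap(), None);
}
```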

Result

Fixes #11290, and should improve the latency of all types of incremental rebuilds.

stuhood commented Dec 15, 2020

The commits are useful to review independently. The first two deal with `AsyncValue`, which adds support for "interest-based" cancellation; the latter two switch to speculatively cleaning nodes.

/// ties together the generation of a node with specifics of edges would require careful
/// consideration of locking (probably it would require merging the EntryState locks and Graph
/// locks, or working out something clever).
///
stuhood (Author):

This comment had been stale for a while: we do clean up stale edges (see clear_edges).

TContext::new(graph.clone()).with_delays(delays)
};

// Start a run, but cancel it well before the delayed middle node can complete.
Contributor:

Instead of relying on the timing to allow this test to run successfully, why not introduce a node that is the Node Graph equivalent of `futures::future::pending` (i.e., a Future that never resolves)? Then the node that is delaying for two seconds would just be a node that never completes.

stuhood (Author):

Possibly... the harness doesn't have infrastructure for that right now. A downside of using tasks that wait forever, though, is that when those tests fail they deadlock rather than failing with an error. So even when I do want to test that something "has not happened", I like to use a timeout to avoid hanging tests.
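
A minimal sketch of that timeout pattern (using `tokio::time::timeout` and `std::future::pending` as stand-ins; this is not the graph test harness itself): the timeout turns "still pending" into a normal assertion instead of a hung test.

```rust
use std::time::Duration;

#[tokio::test]
async fn never_completes_within_bound() {
    // Stand-in for work we expect NOT to finish (e.g. a canceled node).
    let never_finishes = std::future::pending::<()>();
    // If it somehow completes, the test fails; if it stays pending, the
    // timeout fires and the test passes instead of deadlocking.
    let result = tokio::time::timeout(Duration::from_secs(2), never_finishes).await;
    assert!(result.is_err());
}
```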

Contributor:

You could then use this PR, which adds an annotation you can put on any test to give it a timeout: tokio-rs/tokio#1823

The Tokio maintainers declined to accept my PR, but it might be useful for this use case.

Contributor:

But I agree with you: my comment is moot if the graph testing harness does not support indefinitely pending nodes.

]
.into_iter()
.collect::<HashMap<_, _>>();
let delay = Duration::from_millis(2000);
Contributor:

Similar comment about using a Node that never completes.

stuhood merged commit 0bcbfe9 into pantsbuild:master on Dec 15, 2020
stuhood deleted the stuhood/cancel-cleaning branch on December 15, 2020 at 20:38
///
/// A cancellable value computed by one sender, and broadcast to multiple receivers.
///
/// Supports canceling the work associated with the pool either:
Contributor:

"the pool" makes an appearance twice here and no-where else. Can the comment terminology be brought more in line with the code nouns and verbs in this same file?

stuhood (Author):

Shoot. Yea: I struggled with naming this struct, and that was one of the previous names.

pub struct AsyncValue<T: Clone + Send + Sync + 'static> {
item_receiver: Weak<watch::Receiver<Option<T>>>,
// NB: Stored only for drop.
#[allow(dead_code)]
Contributor:

Does renaming to _abort_sender work instead of having to use an attribute?
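
For reference, a sketch of what that suggestion could look like (the `oneshot::Sender<()>` type is an assumption, since the excerpt does not show the sender's type): a leading underscore is generally enough to silence the `dead_code` lint for an intentionally unread field, so the attribute becomes unnecessary.

```rust
use std::sync::Weak;
use tokio::sync::{oneshot, watch};

pub struct AsyncValue<T: Clone + Send + Sync + 'static> {
    item_receiver: Weak<watch::Receiver<Option<T>>>,
    // NB: stored only so that dropping the AsyncValue drops the sender;
    // the underscore prefix silences the "field is never read" warning.
    _abort_sender: oneshot::Sender<()>,
}
```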

generation,
previous_result,
}
| EntryState::Running {
Contributor:

Checking my understanding: This is the fallthrough mentioned above - Running but cancelled (the pending value was dropped); thus, spawning a new execution is in fact the desired result.

stuhood (Author):

Correct!
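
A self-contained sketch of that fallthrough (hypothetical types; the `live` flag stands in for "the running entry's value still has waiters"):

```rust
enum EntryState {
    NotStarted,
    // `live` stands in for "the running work still has interested waiters".
    Running { live: bool },
    Completed,
}

fn needs_fresh_spawn(state: &EntryState) -> bool {
    match state {
        // Still running with live waiters: attach to the existing work.
        EntryState::Running { live: true } => false,
        // Never started, or running but already canceled because every
        // waiter went away: fall through and spawn a new execution.
        EntryState::NotStarted | EntryState::Running { live: false } => true,
        EntryState::Completed => false,
    }
}

fn main() {
    assert!(needs_fresh_spawn(&EntryState::Running { live: false }));
    assert!(!needs_fresh_spawn(&EntryState::Running { live: true }));
}
```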

pub fn is_running(&self) -> bool {
match *self.state.lock() {
EntryState::Running { .. } => true,
EntryState::Completed { .. } | EntryState::NotStarted { .. } => false,
Contributor:

Maybe just `_`: you shouldn't be forced to edit this arm when you add a new non-Running state. Or even just use `matches!(...)` to leave off the arm altogether.

stuhood (Author):

For a small number of states, I've found that it is helpful to force thinking about their relevance to any particular match. It's unlikely, but if we were to introduce a `RunningUnderLock` or something, we'd want to consider it here.
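
For comparison, a minimal sketch of the two styles under discussion, using a stand-in enum rather than the engine's `EntryState`:

```rust
enum EntryState {
    NotStarted,
    Running,
    Completed,
}

// Exhaustive: adding a new variant forces this match to be revisited.
fn is_running_exhaustive(state: &EntryState) -> bool {
    match state {
        EntryState::Running => true,
        EntryState::NotStarted | EntryState::Completed => false,
    }
}

// `matches!`: shorter, but silently treats any future variant as "not running".
fn is_running_matches(state: &EntryState) -> bool {
    matches!(state, EntryState::Running)
}

fn main() {
    assert!(is_running_exhaustive(&EntryState::Running));
    assert!(!is_running_matches(&EntryState::NotStarted));
}
```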

if dependency_ids.len() != previous_dep_generations.len() {
// If we don't have the same number of current dependencies as there were generations
// previously, then they cannot match.
return true;
Contributor:

It was a bit jarring reading this early return. Is there any reason not to lift the `dependency_ids` collection and the length check (with its early return) out of the `generation_matches` binding block?

stuhood (Author):

Mostly that we'd need to re-acquire the lock to get access to the Entrys again after the check.
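
A hypothetical sketch of that shape (a std `Mutex` and a plain map stand in for the graph internals; this is not the engine's code): both the length check and the per-dependency comparison happen under a single lock acquisition, so lifting the early return out of the block would mean taking the lock twice.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

struct Inner {
    // dependency id -> current generation (stand-in for the graph's entries)
    dep_generations: HashMap<u32, u64>,
}

fn deps_have_changed(inner: &Mutex<Inner>, previous: &[(u32, u64)]) -> bool {
    let guard = inner.lock().unwrap();
    // If we don't have the same number of current dependencies as there were
    // generations previously, then they cannot match.
    if guard.dep_generations.len() != previous.len() {
        return true;
    }
    // Still under the same guard: compare each remembered generation.
    !previous
        .iter()
        .all(|(id, gen)| guard.dep_generations.get(id) == Some(gen))
}

fn main() {
    let inner = Mutex::new(Inner {
        dep_generations: [(1, 7), (2, 3)].into_iter().collect(),
    });
    assert!(!deps_have_changed(&inner, &[(1, 7), (2, 3)]));
    assert!(deps_have_changed(&inner, &[(1, 7)]));
}
```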

stuhood added a commit to stuhood/pants that referenced this pull request Dec 15, 2020
stuhood added a commit to stuhood/pants that referenced this pull request Dec 15, 2020
stuhood added a commit that referenced this pull request Dec 15, 2020
… (cherrypick of #11308) (#11311)

stuhood added a commit that referenced this pull request Dec 15, 2020
… (cherrypick of #11308) (#11310)

stuhood added a commit that referenced this pull request Dec 16, 2020
### Problem

As described in #11252, we don't currently cancel work running in the `Graph` when `Scheduler::execute` observes `Session::cancelled` (which is set when a client is going away or has gone away due to `Ctrl+C`).

### Solution

With the new interest-based cancellation from #11308, cancelling work on the `Graph` only requires the caller to "go away". So removing the call to `spawn` in `Scheduler::execute` causes `Graph` tasks to be canceled when the loop exits.

### Result

Killing the client of a `pantsd` run with `Ctrl+C` (or in any other way) causes the serverside work to be canceled. Fixes #11252.
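
The mechanism behind that change, as a minimal sketch (assuming tokio 1.x; unrelated to the actual `Scheduler` code): a future that is only awaited, never `tokio::spawn`ed, is dropped as soon as its caller stops polling it, and dropping it is what cancels the work.

```rust
use std::time::Duration;

#[tokio::main]
async fn main() {
    // Stand-in for serverside Graph work that would take a long time.
    let work = async {
        tokio::time::sleep(Duration::from_secs(60)).await;
        println!("never printed: the future is dropped before it completes");
    };

    tokio::select! {
        _ = work => {}
        // Stand-in for "the client went away": when this branch wins, `work`
        // is dropped (and therefore canceled) without ever being spawned.
        _ = tokio::time::sleep(Duration::from_millis(10)) => {}
    }
}
```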
