
Clean the graph speculatively, and cancel nodes when interest is lost #11308

Merged: 4 commits into pantsbuild:master from stuhood/cancel-cleaning on Dec 15, 2020

Conversation

stuhood commented Dec 15, 2020

Problem

#11290 describes a case where attempting to clean a (relatively) long-running uncacheable node (a test that has failed) causes us to run the test twice. This is caused by graph cleaning: the attempt to clean the graph requests all of the node's previous dependencies in parallel, but waits until they have all returned before determining whether the node could be cleaned. Concretely, this meant that although the sources of the test had changed, we would re-run the (uncacheable) failed test to completion before noticing the changed sources, and then run it a second time with the changed sources.

Solution

As mentioned on #11290, [other monadic build systems](https://neilmitchell.blogspot.com/2020/11/data-types-for-build-system-dependencies.html) have taken the approach of sequencing requests for dependencies during cleaning, rather than requesting them in parallel. In this case, that would mean sequentially requesting the sources of the node before running the test. But because Pants uses future-based concurrency aggressively, it is more challenging for us to preserve the structure of an `@rule` `MultiGet` all the way down into a single request to the `Graph`: without custom tuple-based combinators, we would lose type safety when we needed to request a `Vec<Node<_>>` rather than `future::join_all`ing a series of type-safe requests.

Instead, this change takes a different strategy and [speculates](https://en.wikipedia.org/wiki/Speculative_execution) while cleaning. All dependencies are still requested in parallel, but we now fail fast as soon as we detect a mismatch, and all speculative branches that were incorrect are canceled. When compared to sequentially requesting dependencies, this should almost always be faster (as long as the total number of processes being cleaned is smaller than the constraints we have on concurrent processes: `--process-execution-local-parallelism`). The case where more tests are being speculated on than there are concurrency slots is also interesting, but less important than per-test latency; followup work can improve our prioritization of processes to address that case (see #11303).
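
As an illustration of the fail-fast shape, here is a minimal sketch assuming the `futures` crate; the function and its inputs are illustrative, not the engine's cleaning code. As soon as one dependency check reports a mismatch, returning early drops the remaining in-flight checks, which is what cancels them.

```rust
// Assumed dependency: futures = "0.3".
use futures::stream::{FuturesUnordered, StreamExt};

/// Resolves to true only if every dependency check reports a match, and
/// resolves as soon as the first mismatch arrives: returning early drops the
/// stream, which drops (and thereby cancels) every check still in flight.
async fn all_deps_match<F>(checks: Vec<F>) -> bool
where
    F: std::future::Future<Output = bool>,
{
    let mut in_flight: FuturesUnordered<F> = checks.into_iter().collect();
    while let Some(matched) = in_flight.next().await {
        if !matched {
            // Fail fast: the remaining speculative checks are canceled here.
            return false;
        }
    }
    true
}

fn main() {
    use futures::future::ready;
    // Three already-resolved checks; the single mismatch short-circuits.
    let checks = vec![ready(true), ready(false), ready(true)];
    assert!(!futures::executor::block_on(all_deps_match(checks)));
}
```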

In order to cancel work in the `Graph` based on interest, we add an `AsyncValue` type that implements a model where, if all "waiters" for the computed value go away, the work is canceled. This interest-based cancellation also represents 90% of the work for #11252.
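
A minimal sketch of the interest-based model, assuming tokio 1.x (this is not the actual `AsyncValue`, which stores a `Weak<watch::Receiver<...>>` and an abort sender, but the shape is the same): the work races its computation against the moment the last receiver goes away.

```rust
// Assumed dependency: tokio = { version = "1", features = ["full"] }.
use std::time::Duration;
use tokio::sync::watch;

#[tokio::main]
async fn main() {
    let (tx, rx) = watch::channel::<Option<u32>>(None);

    // The "work": races the real computation against loss of interest.
    let work = tokio::spawn(async move {
        tokio::select! {
            // Every receiver has been dropped: nobody is waiting, so cancel.
            _ = tx.closed() => None,
            // The (slow) computation itself.
            _ = tokio::time::sleep(Duration::from_secs(60)) => {
                let _ = tx.send(Some(42));
                Some(42)
            }
        }
    });

    // Dropping the only receiver withdraws interest and cancels the work.
    drop(rx);
    assert_eq!(work.await.unwrap(), None);
}
```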

Result

Fixes #11290, and should improve the latency of all types of incremental rebuilds.

stuhood commented Dec 15, 2020

The commits are useful to review independently. The first two deal with `AsyncValue`, which adds support for "interest-based" cancellation; the latter two switch to speculatively cleaning nodes.

/// ties together the generation of a node with specifics of edges would require careful
/// consideration of locking (probably it would require merging the EntryState locks and Graph
/// locks, or working out something clever).
///
stuhood (Author):

This comment had been stale for a while: we do clean up stale edges (see clear_edges).

TContext::new(graph.clone()).with_delays(delays)
};

// Start a run, but cancel it well before the delayed middle node can complete.
Contributor:

Instead of relying on the timing to allow this test to run successfully, why not introduce a node that is the Node Graph equivalent of `futures::future::pending` (i.e., a Future that never resolves)? Then the node that is delaying for two seconds would just be a node that never completes.

stuhood (Author):

Possibly... the harness doesn't have infrastructure for that right now. A downside of using tasks that wait forever, though, is that when those tests fail they deadlock rather than failing with an error. So even when I do want to test that something "has not happened", I like to use a timeout to avoid hanging tests.
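
A minimal sketch of that timeout pattern (using `tokio::time::timeout` and `std::future::pending` as stand-ins; this is not the graph test harness itself): the timeout turns "still pending" into a normal assertion instead of a hung test.

```rust
use std::time::Duration;

#[tokio::test]
async fn never_completes_within_bound() {
    // Stand-in for work we expect NOT to finish (e.g. a canceled node).
    let never_finishes = std::future::pending::<()>();
    // If it somehow completes, the test fails; if it stays pending, the
    // timeout fires and the test passes instead of deadlocking.
    let result = tokio::time::timeout(Duration::from_secs(2), never_finishes).await;
    assert!(result.is_err());
}
```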

Contributor:

You could then use this PR, which adds an annotation you can put on any test to give it a timeout: tokio-rs/tokio#1823

The Tokio maintainers declined to accept my PR, but it might be useful for this use case.

Contributor:

But I agree with you: my comment is moot if the graph testing harness does not support indefinitely pending nodes.

]
.into_iter()
.collect::<HashMap<_, _>>();
let delay = Duration::from_millis(2000);
Contributor:

Similar comment about using a Node that never completes.

stuhood merged commit 0bcbfe9 into pantsbuild:master on Dec 15, 2020
stuhood deleted the stuhood/cancel-cleaning branch on December 15, 2020 at 20:38
///
/// A cancellable value computed by one sender, and broadcast to multiple receivers.
///
/// Supports canceling the work associated with the pool either:
Contributor:

"the pool" makes an appearance twice here and no-where else. Can the comment terminology be brought more in line with the code nouns and verbs in this same file?

stuhood (Author):

Shoot. Yea: I struggled with naming this struct, and that was one of the previous names.

pub struct AsyncValue<T: Clone + Send + Sync + 'static> {
item_receiver: Weak<watch::Receiver<Option<T>>>,
// NB: Stored only for drop.
#[allow(dead_code)]
Contributor:

Does renaming to _abort_sender work instead of having to use an attribute?
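
For reference, a sketch of what that suggestion could look like (the `oneshot::Sender<()>` type is an assumption, since the excerpt does not show the sender's type): a leading underscore is generally enough to silence the `dead_code` lint for an intentionally unread field, so the attribute becomes unnecessary.

```rust
use std::sync::Weak;
use tokio::sync::{oneshot, watch};

pub struct AsyncValue<T: Clone + Send + Sync + 'static> {
    item_receiver: Weak<watch::Receiver<Option<T>>>,
    // NB: stored only so that dropping the AsyncValue drops the sender;
    // the underscore prefix silences the "field is never read" warning.
    _abort_sender: oneshot::Sender<()>,
}
```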

generation,
previous_result,
}
| EntryState::Running {
Contributor:

Checking my understanding: This is the fallthrough mentioned above - Running but cancelled (the pending value was dropped); thus, spawning a new execution is in fact the desired result.

stuhood (Author):

Correct!
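
A self-contained sketch of that fallthrough (hypothetical types; the `live` flag stands in for "the running entry's value still has waiters"):

```rust
enum EntryState {
    NotStarted,
    // `live` stands in for "the running work still has interested waiters".
    Running { live: bool },
    Completed,
}

fn needs_fresh_spawn(state: &EntryState) -> bool {
    match state {
        // Still running with live waiters: attach to the existing work.
        EntryState::Running { live: true } => false,
        // Never started, or running but already canceled because every
        // waiter went away: fall through and spawn a new execution.
        EntryState::NotStarted | EntryState::Running { live: false } => true,
        EntryState::Completed => false,
    }
}

fn main() {
    assert!(needs_fresh_spawn(&EntryState::Running { live: false }));
    assert!(!needs_fresh_spawn(&EntryState::Running { live: true }));
}
```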

pub fn is_running(&self) -> bool {
match *self.state.lock() {
EntryState::Running { .. } => true,
EntryState::Completed { .. } | EntryState::NotStarted { .. } => false,
Contributor:

Maybe just `_`: you shouldn't be forced to edit this arm when you add a new non-Running state. Or even just use `matches!(...)` to leave off the arm altogether.

stuhood (Author):

For a small number of states, I've found that it is helpful to force thinking about their relevance to any particular match. It's unlikely, but if we were to introduce a `RunningUnderLock` or something, we'd want to consider it here.
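
For comparison, a minimal sketch of the two styles under discussion, using a stand-in enum rather than the engine's `EntryState`:

```rust
enum EntryState {
    NotStarted,
    Running,
    Completed,
}

// Exhaustive: adding a new variant forces this match to be revisited.
fn is_running_exhaustive(state: &EntryState) -> bool {
    match state {
        EntryState::Running => true,
        EntryState::NotStarted | EntryState::Completed => false,
    }
}

// `matches!`: shorter, but silently treats any future variant as "not running".
fn is_running_matches(state: &EntryState) -> bool {
    matches!(state, EntryState::Running)
}

fn main() {
    assert!(is_running_exhaustive(&EntryState::Running));
    assert!(!is_running_matches(&EntryState::NotStarted));
}
```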

if dependency_ids.len() != previous_dep_generations.len() {
// If we don't have the same number of current dependencies as there were generations
// previously, then they cannot match.
return true;
Contributor:

It was a bit jarring reading this early return. Is there any reason not to lift the `dependency_ids` collection and the length check (with its early return) out of the `generation_matches` binding block?

stuhood (Author):

Mostly that we'd need to re-acquire the lock to get access to the Entrys again after the check.
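
A hypothetical sketch of that shape (a std `Mutex` and a plain map stand in for the graph internals; this is not the engine's code): both the length check and the per-dependency comparison happen under a single lock acquisition, so lifting the early return out of the block would mean taking the lock twice.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

struct Inner {
    // dependency id -> current generation (stand-in for the graph's entries)
    dep_generations: HashMap<u32, u64>,
}

fn deps_have_changed(inner: &Mutex<Inner>, previous: &[(u32, u64)]) -> bool {
    let guard = inner.lock().unwrap();
    // If we don't have the same number of current dependencies as there were
    // generations previously, then they cannot match.
    if guard.dep_generations.len() != previous.len() {
        return true;
    }
    // Still under the same guard: compare each remembered generation.
    !previous
        .iter()
        .all(|(id, gen)| guard.dep_generations.get(id) == Some(gen))
}

fn main() {
    let inner = Mutex::new(Inner {
        dep_generations: [(1, 7), (2, 3)].into_iter().collect(),
    });
    assert!(!deps_have_changed(&inner, &[(1, 7), (2, 3)]));
    assert!(deps_have_changed(&inner, &[(1, 7)]));
}
```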

stuhood added a commit to stuhood/pants that referenced this pull request Dec 15, 2020
stuhood added a commit to stuhood/pants that referenced this pull request Dec 15, 2020
stuhood added a commit that referenced this pull request Dec 15, 2020
… (cherrypick of #11308) (#11311)

stuhood added a commit that referenced this pull request Dec 15, 2020
… (cherrypick of #11308) (#11310)

stuhood added a commit that referenced this pull request Dec 16, 2020
### Problem

As described in #11252, we don't currently cancel work running in the `Graph` when `Scheduler::execute` observes `Session::cancelled` (which is set when a client is going away or has gone away due to `Ctrl+C`).

### Solution

With the new interest-based cancellation from #11308, cancelling work on the `Graph` only requires the caller to "go away". So removing the call to `spawn` in `Scheduler::execute` causes `Graph` tasks to be canceled when the loop exits.

### Result

Killing the client of a `pantsd` run with `Ctrl+C` (or in any other way) causes the serverside work to be canceled. Fixes #11252.
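
The mechanism behind that change, as a minimal sketch (assuming tokio 1.x; unrelated to the actual `Scheduler` code): a future that is only awaited, never `tokio::spawn`ed, is dropped as soon as its caller stops polling it, and dropping it is what cancels the work.

```rust
use std::time::Duration;

#[tokio::main]
async fn main() {
    // Stand-in for serverside Graph work that would take a long time.
    let work = async {
        tokio::time::sleep(Duration::from_secs(60)).await;
        println!("never printed: the future is dropped before it completes");
    };

    tokio::select! {
        _ = work => {}
        // Stand-in for "the client went away": when this branch wins, `work`
        // is dropped (and therefore canceled) without ever being spawned.
        _ = tokio::time::sleep(Duration::from_millis(10)) => {}
    }
}
```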
