Clean the graph speculatively, and cancel nodes when interest is lost #11308
Conversation
…odes in the Graph. [ci skip-build-wheels]
…when all waiters go away. [ci skip-build-wheels]
…ncels the rest. [ci skip-build-wheels]
The commits are useful to review independently. The first two commits deal with
/// ties together the generation of a node with specifics of edges would require careful
/// consideration of locking (probably it would require merging the EntryState locks and Graph
/// locks, or working out something clever).
///
This comment had been stale for a while: we do clean up stale edges (see `clear_edges`).
TContext::new(graph.clone()).with_delays(delays)
};

// Start a run, but cancel it well before the delayed middle node can complete.
Instead of relying on the timing to allow this test to run successfully, why not introduce a node that is the Node Graph equivalent of `futures::future::pending` (i.e., a Future that never resolves)? Then the node that is delaying for two seconds would just be a node that never completes.
Possibly... the harness doesn't have infrastructure for that right now. A downside of using tasks that wait forever, though, is that when those tests fail they deadlock rather than failing with an error. So even when I do want to test that something "has not happened", I like to use a timeout to avoid hanging tests.
Then you could use this PR, which adds an annotation you can put on any test to give it a timeout: tokio-rs/tokio#1823. The Tokio maintainers declined to accept my PR, but it might be useful for this use case.
But I agree with you, my comment is moot if the graph testing harness does not support indefinitely pending nodes.
]
.into_iter()
.collect::<HashMap<_, _>>();
let delay = Duration::from_millis(2000);
Similar comment about using a Node that never completes.
///
/// A cancellable value computed by one sender, and broadcast to multiple receivers.
///
/// Supports canceling the work associated with the pool either:
“the pool” makes an appearance twice here and nowhere else. Can the comment terminology be brought more in line with the code nouns and verbs in this same file?
Shoot. Yea: I struggled with naming this struct, and that was one of the previous names.
pub struct AsyncValue<T: Clone + Send + Sync + 'static> {
  item_receiver: Weak<watch::Receiver<Option<T>>>,
  // NB: Stored only for drop.
  #[allow(dead_code)]
Does renaming to `_abort_sender` work instead of having to use an attribute?
  generation,
  previous_result,
}
| EntryState::Running {
Checking my understanding: This is the fallthrough mentioned above - Running but cancelled (the pending value was dropped); thus, spawning a new execution is in fact the desired result.
Correct!
pub fn is_running(&self) -> bool {
  match *self.state.lock() {
    EntryState::Running { .. } => true,
    EntryState::Completed { .. } | EntryState::NotStarted { .. } => false,
Maybe just `_`. You shouldn't be forced to edit this arm when you add a new non-Running state; or even just use `matches!(...)` to leave off the arm altogether.
For a small number of states, I've found that it is helpful to force thinking about their relevance to any particular match. It's unlikely, but if we were to introduce a `RunningUnderLock` or something, we'd want to consider that here.
if dependency_ids.len() != previous_dep_generations.len() {
  // If we don't have the same number of current dependencies as there were generations
  // previously, then they cannot match.
  return true;
It was a bit jarring reading this early return. Is there any reason not to lift the `dependency_ids` collection and length check with early return out of the `generation_matches` binding block?
Mostly that we'd need to re-acquire the lock to get access to the `Entry`s again after the check.
… (cherrypick of #11308) (#11311)
… (cherrypick of #11308) (#11310)
Problem

As described in #11252, we don't currently cancel work running in the `Graph` when `Scheduler::execute` observes `Session::cancelled` (which is set when a client goes/is-going away due to `Ctrl+C`).

Solution

With the new interest-based cancellation from #11308, cancelling work on the `Graph` only requires the caller to "go away". So removing the call to `spawn` in `Scheduler::execute` causes `Graph` tasks to be canceled when the loop exits.

Result

Killing the client of a `pantsd` run with `Ctrl+C` (or in any other way) causes the serverside work to be canceled. Fixes #11252.

[ci skip-build-wheels]
Problem

#11290 describes a case where attempting to clean a (relatively) long-running uncacheable node (a test that has failed) causes us to run the test twice. This is caused by graph cleaning: the attempt to clean the graph requests all of the node's previous dependencies in parallel, but waits until they have all returned before determining whether the node could be cleaned or not. Concretely, this meant that although the sources of the test had changed, we would re-run the (uncacheable) failed test again to completion before noticing the changed sources, and running it a second time with the changed sources.

Solution

As mentioned on #11290, other monadic build systems have taken the approach of sequencing requests for dependencies during cleaning, rather than requesting them in parallel. In this case, that would mean sequentially requesting the sources of the node before running the test. But because Pants uses future based concurrency aggressively, it is more challenging for us to preserve the structure of an `@rule` `MultiGet` all the way down into a single request to the `Graph`: without custom tuple-based combinators, we would lose type safety when we needed to request a `Vec<Node<_>>` rather than `future::join_all`ing a series of type-safe requests.

Instead, this change takes a different strategy, and speculates while cleaning. All dependencies are still requested in parallel, but we now fail fast as soon as we detect a mismatch, and all speculative branches that were incorrect are canceled. When compared to sequentially requesting dependencies, this should almost always be faster (as long as the total number of processes being cleaned is smaller than the constraints we have on concurrent processes: `--process-execution-local-parallelism`). The case where more tests are being speculated on than there are concurrency slots is also interesting, but less important than per-test latency -- followup work can improve our prioritization of processes to improve that case (see #11303).

In order to cancel work in the `Graph` based on interest, we add an `AsyncValue` type that implements a model where if all "waiters" for the computed value go away, the work is canceled. This interest-based cancellation also represents 90% of the work for #11252.

Result

Fixes #11290, and should improve the latency of all types of incremental rebuilds.