Don't rerun uncachable nodes if they are dirtied while running. #9452

hrfuller · 2020-04-02T02:19:06Z

Problem

After landing #9318 it was observed that, with invalidation present, goals which modify the workspace like fmt were causing goal rules, which have side effects and shouldn't run more than once per session to run up to 8 times (8 is the number of allowed retries in the scheduler). This was codified in #9415 and a work around was merged. This PR fixes the underlying issue which is that goal_rules, which are represented by uncacheable nodes, should not be re-run ever in the same session.

Solution

If a goal rules dependencies are invalidated while it is running, we now try to rerun them a number of times before giving up, and bubbling an Error up to the scheduler / user. A different error is used to distinguish from the scheduler retries around invalidated deps. We also don't dirty the dependents of uncacheable nodes when walking the inverted graph to invalidate nodes from fs events. This is because if we dirty them while they are running they will re-run, the scheduler will think they failed and will try to re-run them, which would cause their deps (which are uncacheable) to also re-run.

Result

Running ./v2 --experimental-fs-watcher fmt some/target/:: only runs the fmt once, and doesn't sporadically error. Additionally, changing files in the workspace manually, say while tests are running, won't cause them to print console output multiple times.

- Retry dependencies of uncacheable nodes a few times to get a result until we are exhausted from trying to many times. - Bubble uncacheable node retry errors up to the user, tell them things were chaning to much. - Don't propagate dirtiness past uncacheable nodes when invalidating from changed roots. Otherwise dirty dependents of uncacheable nodes will need to re-run. [ci skip-jvm-tests] # No JVM changes made.

Remove execution option override from tests. [ci skip-jvm-tests] # No JVM changes made.

stuhood

Thanks!

src/rust/engine/src/core.rs

src/rust/engine/graph/src/lib.rs

src/rust/engine/graph/src/entry.rs

stuhood · 2020-04-02T02:54:52Z

src/rust/engine/graph/src/node.rs

@@ -37,7 +37,7 @@ pub trait Node: Clone + Debug + Display + Eq + Hash + Send + 'static {
  ///
  /// If the node result is cacheable, return true.
  ///
-  fn cacheable(&self, context: &Self::Context) -> bool;
+  fn cacheable(&self) -> bool;


This change is surprising, and looks like it required a lot of refactoring of the tests... it also removes some flexibility, because I could imagine needing to use the context to determine whether a Node should be cached (as the tests did).

If this was intended to make it easier to make the change to Graph::walk, it might be better to go the predicate route?

I could imagine needing to use the context to determine whether a Node should be cached (as the tests did).

It seems like the only reason this trait method was taking a context was to make the graph tests simpler by not needing to store cacheability information on the TNode. I can't think of a reason why we couldn't store some unknown future state needed to determine a nodes cacheability on the real NodeKey, like we do currently with the Task cacheability field.
I considered keeping the trait method signature the same, but I couldn't figure out a good way to have a Context available from either the scheduler, the watcher or the graph. The latter two of which exist as fields on the Context. That seems like it would require much larger refactoring of the source to resolve. I would rather prioritize simplicity in the code over simplicity of the test harness. The combination of those two things made me go this route. I'm fairly happy with how little I had to change the way TNodes are instantiated.

If this was intended to make it easier to make the change to Graph::walk, it might be better to go the predicate route?

I see these as separate concerns. Even if we go the predicate route we still need a reference to the context in the watcher and scheduler, or Graph, which I can't see a way of doing without lots of refactors. The predicate issue is more about building flexibility into Graph::walk which I can see the benefit of.

src/rust/engine/graph/src/tests.rs

# Delete this line to force CI to run the JVM tests. [ci skip-jvm-tests] # No JVM changes made.

… types # Delete this line to force CI to run the JVM tests. [ci skip-jvm-tests] # No JVM changes made.

# Delete this line to force CI to run the JVM tests. [ci skip-jvm-tests] # No JVM changes made.

stuhood

Thanks!

src/rust/engine/src/nodes.rs

src/rust/engine/graph/src/lib.rs

…ook up nodes. # Delete this line to force CI to run the JVM tests. [ci skip-jvm-tests] # No JVM changes made.

src/rust/engine/graph/src/lib.rs

# Delete this line to force CI to run the JVM tests. [ci skip-jvm-tests] # No JVM changes made.

stuhood · 2020-04-03T18:56:27Z

src/rust/engine/graph/src/tests.rs

+    let mut delays = HashMap::new();
+    delays.insert(TNode::new(0), delay_for_root);
+    TContext::new(graph.clone())
+      .with_uncacheable(uncacheable)


Mm: I guess that the uncacheable dict is unused now.

No it's still used. When auto generating dependencies we check the dict to see if the dependency should be uncacheable.

https://github.com/pantsbuild/pants/pull/9452/files/8a76bb14e7bc2f7e4f7e9b7f6244216983141b9e#diff-70e2bbe341f104db40d0d2101091d382R754-R760

…sbuild#9452) * Don't rerun uncachable nodes if they are dirtied while running. - Retry dependencies of uncacheable nodes a few times to get a result until we are exhausted from trying too many times. - Bubble uncacheable node retry errors up to the user, tell them things were chaning to much. - Don't propagate dirtiness past uncacheable nodes when invalidating from changed roots. Otherwise dirty dependents of uncacheable nodes will need to re-run. * enable the engine fs watcher by default, now that it won't cause issues. Remove execution option override from tests. * use reference to self in stop_walk_predicate closure * invalidate often enough that test doesn't flake

@hrfuller

* Port to tokio 0.2, and to stdlib futures for fs and task_executor (#9071) We're on an older version of tokio, which doesn't smoothly support usage of async/await. Switch to tokio 0.2, which supports directly spawning and awaiting (via its macros) stdlib futures, which is an important step toward being able to utilize async/await more broadly. Additionally, port the `fs` and `task_executor` crates to stdlib futures. Finally, transitively fixup for the new APIs: in particular, since both `task_executor` and `tokio` now consume stdlib futures to spawn tasks, we switch all relevant tests and main methods to use the `tokio::main` and `tokio::test` macros, which annotate async methods and spawn a runtime to allow for `await`ing futures inline. Progress toward more usage of async/await! * Add notify fs watcher to engine. (#9318) * Use the notify crate to implement an `InvalidationWatcher` for Graph operations. * Make watch async, and watcher log to pantsd.log. Relativize paths returned from notify to the build_root. Refactor invalidate method to be an associated method on the InvalidationWatcher. * Respond to feedback. * Use spawn on io pool instead of custom future impl * Write python fs tests * Relativize paths to invalidate to build root * invalidate nodes with parent paths. * Comments * Add rust tests. Make some things public so we can use them in tests. Use canonical path to build root for relativizing changed paths. * Refactor Python tests. Return watch errors as core::Failure all the way to user. Move task executor onto invalidation watcher. Move test_support trait impl into test_support mod. * use futures lock on watcher * Platform specific watching behavior. On Darwin recursively watch the build root at startup. On Linux watch individual directory roots. Co-authored-by: Stu Hood <stuhood@gmail.com> * Ignore notify events for pants_ignore patterns. (#9406) * Create a git ignorer on the context object. Adjust all call sites which create a posix fs to pass in an ignorer. * Ignore fsevents from files that match pants_ignore patterns. * Always pass is_dir = false to ignorer to avoid stat-ing every path the event watch thread sees. * Add a feature gate to disable the engine fs watcher introduced in #9318 (#9416) * Add a feature gate to disable the engine fs watcher introduced in #9318 by default, to mitigate issues seen in #9415 until a fix is in place. * Don't rerun uncachable nodes if they are dirtied while running. (#9452) * Don't rerun uncachable nodes if they are dirtied while running. - Retry dependencies of uncacheable nodes a few times to get a result until we are exhausted from trying too many times. - Bubble uncacheable node retry errors up to the user, tell them things were chaning to much. - Don't propagate dirtiness past uncacheable nodes when invalidating from changed roots. Otherwise dirty dependents of uncacheable nodes will need to re-run. * enable the engine fs watcher by default, now that it won't cause issues. Remove execution option override from tests. * use reference to self in stop_walk_predicate closure * invalidate often enough that test doesn't flake * Add a flag to prevent the FsEventService and watchman from starting (#9487) * add --watchman-enable flag * disable watchman when flag is false * Don't wait for the initial watchman event if we aren't using watchman. * check invalidation watcher liveness from scheduler service * Extract a `watch` crate. (#9635) The `watch` module directly accesses the `engine` crate's `Graph`, which makes it more challenging to test. Extract a `watch` crate which is used via an `trait Invalidatable` which is implemented for the engine's `Graph`, as well as independently in tests. [ci skip-jvm-tests] * Simplify Scheduler::execute and unify Graph retry (#9674) Both the `Graph` and the `Scheduler` implemented retry for requested Nodes, but the `Scheduler` implementation was pre-async-await and much more complicated. Unify the retry implementations into `Graph::get` (either roots or uncacheable nodes are retried), and simplify the `Scheduler`'s loop down to: ``` let maybe_display_handle = Self::maybe_display_initialize(&session); let result = loop { if let Ok(res) = receiver.recv_timeout(refresh_interval) { break Ok(res); } else if let Err(e) = Self::maybe_display_render(&session, &mut tasks) { break Err(e); } }; Self::maybe_display_teardown(session, maybe_display_handle); result ``` A single, more modern retry implementation (thanks @hrfuller!), and a cleaner `Scheduler::execute` loop. * Move file invalidation handling to rust (#9636) A few different kinds of file watching span the boundary between the `SchedulerService` and `FSEventService`: 1. pantsd invalidation globs - how `pantsd` detects that its implementing code or config has changed 2. pidfile - watches `pantsd`'s pidfile to ensure that the daemon exits if it loses exclusivity 3. graph invalidation - any files changing in the workspace should invalidate the engine's `Graph` 4. `--loop` - implemented directly in the `SchedulerService` Because of the semi-cyclic nature of the relationship between the `SchedulerService` and `FSEventService`, it's challenging to understand the interplay of these usecases. And, unsurprisingly, that lead to the `notify` crate implementation only satisfying one of them. The fundamental change in this PR is to add support for two new parameters to engine executions which are implemented by the `Graph`: * `poll: bool` - When `poll` is enabled, a `product_request` will wait for the requested Nodes to have changed from their last-observed values before returning. When a poll waits, an optional `poll_delay` is applied before it returns to "debounce" polls. * `timeout: Optional[Duration]` - When a `timeout` is set, a `product_request` will wait up to the given duration for the requested Node(s) to complete (including any time `poll`ing). These features are then used by: * `--loop`: uses `poll` (with a `poll_delay`, but without a `timeout`) to immediately re-run a `Goal` when its inputs have changed. * invalidation globs and pidfile watching: use `poll` (with no `poll_delay`) and `timeout` to block their `SchedulerService` thread and watch for changes to those files. The `FSEventService` and `SchedulerService` are decoupled, and each now interacts only with the `Scheduler`: `FSEventService` to push `watchman` events to the `Graph`, and the `SchedulerService` to pull invalidation information from the `Graph`. Because all events now flow through the `Graph`, the `notify` crate has reached feature parity with `watchman`. In followup changes we can remove the experimental flag, disable `watchman` (and thus the `FSEventService`) by default, and remove the dependency between `--loop` and `pantsd`. * pin tokio at exactly 0.2.13 * fix lint issues * fix mypy typing issues * Move away from the debounced notify watcher #9754 * Remove test that has raced file invalidation ever since the notify backend was added, but which will now fairly consistently lose that race. # Delete this line to force CI to run the JVM tests. [ci skip-jvm-tests] * As explained in the comment: we can no longer create duplicate parallel BUILD files and hope that pants does not notice them before we scan the directory again! # Delete this line to force CI to run the JVM tests. [ci skip-jvm-tests] Co-authored-by: Stu Hood <stuhood@gmail.com> Co-authored-by: Stu Hood <stuhood@twitter.com>

Henry Fuller added 2 commits April 1, 2020 19:03

enable the engine fs watcher by default, now that it won't cause issues.

859bd10

Remove execution option override from tests. [ci skip-jvm-tests] # No JVM changes made.

stuhood reviewed Apr 2, 2020

View reviewed changes

Henry Fuller added 4 commits April 2, 2020 00:58

Fix formatting and default option

a9d0265

# Delete this line to force CI to run the JVM tests. [ci skip-jvm-tests] # No JVM changes made.

spelling

52e8b75

# Delete this line to force CI to run the JVM tests. [ci skip-jvm-tests] # No JVM changes made.

Use generic throw instead of seperate filewatch and exhausted Failure…

1a6564c

… types # Delete this line to force CI to run the JVM tests. [ci skip-jvm-tests] # No JVM changes made.

Use predicate function for deciding when to stop walking graph.

8d2608e

# Delete this line to force CI to run the JVM tests. [ci skip-jvm-tests] # No JVM changes made.

hrfuller marked this pull request as ready for review April 2, 2020 19:59

hrfuller requested a review from stuhood April 2, 2020 20:00

stuhood approved these changes Apr 2, 2020

View reviewed changes

src/rust/engine/src/nodes.rs Outdated Show resolved Hide resolved

stuhood reviewed Apr 2, 2020

View reviewed changes

src/rust/engine/graph/src/lib.rs Outdated Show resolved Hide resolved

Pass a graph argument to the stop walk predicate so we don't always l…

1807c4b

…ook up nodes. # Delete this line to force CI to run the JVM tests. [ci skip-jvm-tests] # No JVM changes made.

stuhood reviewed Apr 3, 2020

View reviewed changes

src/rust/engine/graph/src/lib.rs Outdated Show resolved Hide resolved

Henry Fuller added 2 commits April 2, 2020 17:51

use reference to self in stop_walk_predicate closure

4142bd8

# Delete this line to force CI to run the JVM tests. [ci skip-jvm-tests] # No JVM changes made.

invalidate often enough that test doesn't flake

8a76bb1

# Delete this line to force CI to run the JVM tests. [ci skip-jvm-tests] # No JVM changes made.

hrfuller merged commit 4a89db6 into pantsbuild:master Apr 3, 2020

stuhood deleted the hfuller/fix-fmt-errors branch April 3, 2020 18:53

stuhood reviewed Apr 3, 2020

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Don't rerun uncachable nodes if they are dirtied while running. #9452

Don't rerun uncachable nodes if they are dirtied while running. #9452

hrfuller commented Apr 2, 2020 •

edited

Loading

stuhood left a comment

stuhood Apr 2, 2020

hrfuller Apr 2, 2020

stuhood left a comment

stuhood Apr 3, 2020

hrfuller Apr 3, 2020

hrfuller Apr 3, 2020

Don't rerun uncachable nodes if they are dirtied while running. #9452

Don't rerun uncachable nodes if they are dirtied while running. #9452

Conversation

hrfuller commented Apr 2, 2020 • edited Loading

Problem

Solution

Result

stuhood left a comment

Choose a reason for hiding this comment

stuhood Apr 2, 2020

Choose a reason for hiding this comment

hrfuller Apr 2, 2020

Choose a reason for hiding this comment

stuhood left a comment

Choose a reason for hiding this comment

stuhood Apr 3, 2020

Choose a reason for hiding this comment

hrfuller Apr 3, 2020

Choose a reason for hiding this comment

hrfuller Apr 3, 2020

Choose a reason for hiding this comment

hrfuller commented Apr 2, 2020 •

edited

Loading