Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't rerun uncachable nodes if they are dirtied while running. #9452

Merged
merged 9 commits into from
Apr 3, 2020

Conversation

hrfuller
Copy link
Member

@hrfuller hrfuller commented Apr 2, 2020

Problem

After landing #9318 it was observed that, with invalidation present, goals which modify the workspace like fmt were causing goal rules, which have side effects and shouldn't run more than once per session to run up to 8 times (8 is the number of allowed retries in the scheduler). This was codified in #9415 and a work around was merged. This PR fixes the underlying issue which is that goal_rules, which are represented by uncacheable nodes, should not be re-run ever in the same session.

Solution

If a goal rules dependencies are invalidated while it is running, we now try to rerun them a number of times before giving up, and bubbling an Error up to the scheduler / user. A different error is used to distinguish from the scheduler retries around invalidated deps. We also don't dirty the dependents of uncacheable nodes when walking the inverted graph to invalidate nodes from fs events. This is because if we dirty them while they are running they will re-run, the scheduler will think they failed and will try to re-run them, which would cause their deps (which are uncacheable) to also re-run.

Result

Running ./v2 --experimental-fs-watcher fmt some/target/:: only runs the fmt once, and doesn't sporadically error. Additionally, changing files in the workspace manually, say while tests are running, won't cause them to print console output multiple times.

Henry Fuller added 2 commits April 1, 2020 19:03
- Retry dependencies of uncacheable nodes a few times to get a result
  until we are exhausted from trying to many times.
- Bubble uncacheable node retry errors up to the user, tell them things
  were chaning to much.
- Don't propagate dirtiness past uncacheable nodes when invalidating
  from changed roots. Otherwise dirty dependents of uncacheable nodes
  will need to re-run.

[ci skip-jvm-tests]  # No JVM changes made.
Remove execution option override from tests.

[ci skip-jvm-tests]  # No JVM changes made.
Copy link
Sponsor Member

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

src/rust/engine/src/core.rs Outdated Show resolved Hide resolved
src/rust/engine/graph/src/lib.rs Outdated Show resolved Hide resolved
src/rust/engine/graph/src/entry.rs Outdated Show resolved Hide resolved
@@ -37,7 +37,7 @@ pub trait Node: Clone + Debug + Display + Eq + Hash + Send + 'static {
///
/// If the node result is cacheable, return true.
///
fn cacheable(&self, context: &Self::Context) -> bool;
fn cacheable(&self) -> bool;
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is surprising, and looks like it required a lot of refactoring of the tests... it also removes some flexibility, because I could imagine needing to use the context to determine whether a Node should be cached (as the tests did).

If this was intended to make it easier to make the change to Graph::walk, it might be better to go the predicate route?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could imagine needing to use the context to determine whether a Node should be cached (as the tests did).

It seems like the only reason this trait method was taking a context was to make the graph tests simpler by not needing to store cacheability information on the TNode. I can't think of a reason why we couldn't store some unknown future state needed to determine a nodes cacheability on the real NodeKey, like we do currently with the Task cacheability field.
I considered keeping the trait method signature the same, but I couldn't figure out a good way to have a Context available from either the scheduler, the watcher or the graph. The latter two of which exist as fields on the Context. That seems like it would require much larger refactoring of the source to resolve. I would rather prioritize simplicity in the code over simplicity of the test harness. The combination of those two things made me go this route. I'm fairly happy with how little I had to change the way TNodes are instantiated.

If this was intended to make it easier to make the change to Graph::walk, it might be better to go the predicate route?

I see these as separate concerns. Even if we go the predicate route we still need a reference to the context in the watcher and scheduler, or Graph, which I can't see a way of doing without lots of refactors. The predicate issue is more about building flexibility into Graph::walk which I can see the benefit of.

src/rust/engine/graph/src/tests.rs Outdated Show resolved Hide resolved
Henry Fuller added 4 commits April 2, 2020 00:58
# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]  # No JVM changes made.
# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]  # No JVM changes made.
… types

# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]  # No JVM changes made.
# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]  # No JVM changes made.
@hrfuller hrfuller marked this pull request as ready for review April 2, 2020 19:59
@hrfuller hrfuller requested a review from stuhood April 2, 2020 20:00
Copy link
Sponsor Member

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

src/rust/engine/src/nodes.rs Outdated Show resolved Hide resolved
…ook up nodes.

# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]  # No JVM changes made.
Henry Fuller added 2 commits April 2, 2020 17:51
# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]  # No JVM changes made.
# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]  # No JVM changes made.
@hrfuller hrfuller merged commit 4a89db6 into pantsbuild:master Apr 3, 2020
@stuhood stuhood deleted the hfuller/fix-fmt-errors branch April 3, 2020 18:53
let mut delays = HashMap::new();
delays.insert(TNode::new(0), delay_for_root);
TContext::new(graph.clone())
.with_uncacheable(uncacheable)
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mm: I guess that the uncacheable dict is unused now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it's still used. When auto generating dependencies we check the dict to see if the dependency should be uncacheable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hrfuller pushed a commit to twitter/pants that referenced this pull request May 11, 2020
…sbuild#9452)

* Don't rerun uncachable nodes if they are dirtied while running.
- Retry dependencies of uncacheable nodes a few times to get a result
  until we are exhausted from trying too many times.
- Bubble uncacheable node retry errors up to the user, tell them things
  were chaning to much.
- Don't propagate dirtiness past uncacheable nodes when invalidating
  from changed roots. Otherwise dirty dependents of uncacheable nodes
  will need to re-run.

* enable the engine fs watcher by default, now that it won't cause issues.
Remove execution option override from tests.

* use reference to self in stop_walk_predicate closure

* invalidate often enough that test doesn't flake
hrfuller pushed a commit that referenced this pull request May 15, 2020
* Port to tokio 0.2, and to stdlib futures for fs and task_executor (#9071)

We're on an older version of tokio, which doesn't smoothly support usage of async/await.

Switch to tokio 0.2, which supports directly spawning and awaiting (via its macros) stdlib futures, which is an important step toward being able to utilize async/await more broadly. Additionally, port the `fs` and `task_executor` crates to stdlib futures.

Finally, transitively fixup for the new APIs: in particular, since both `task_executor` and `tokio` now consume stdlib futures to spawn tasks, we switch all relevant tests and main methods to use the `tokio::main` and `tokio::test` macros, which annotate async methods and spawn a runtime to allow for `await`ing futures inline.

Progress toward more usage of async/await!

* Add notify fs watcher to engine. (#9318)

* Use the notify crate to implement an `InvalidationWatcher` for Graph operations.

* Make watch async, and watcher log to pantsd.log.
Relativize paths returned from notify to the build_root.
Refactor invalidate method to be an associated method on the
InvalidationWatcher.

* Respond to feedback.
* Use spawn on io pool instead of custom future impl
* Write python fs tests
* Relativize paths to invalidate to build root
* invalidate nodes with parent paths.
* Comments

* Add rust tests.
Make some things public so we can use them in tests.
Use canonical path to build root for relativizing changed paths.

* Refactor Python tests.
Return watch errors as core::Failure all the way to user.
Move task executor onto invalidation watcher.
Move test_support trait impl into test_support mod.

* use futures lock on watcher

* Platform specific watching behavior. On Darwin recursively watch the
build root at startup. On Linux watch individual directory roots.

Co-authored-by: Stu Hood <stuhood@gmail.com>

* Ignore notify events for pants_ignore patterns. (#9406)

* Create a git ignorer on the context object. Adjust all call sites which
create a posix fs to pass in an ignorer.

* Ignore fsevents from files that match pants_ignore patterns.

* Always pass is_dir = false to ignorer to avoid stat-ing every path the
event watch thread sees.

* Add a feature gate to disable the engine fs watcher introduced in #9318 (#9416)

* Add a feature gate to disable the engine fs watcher introduced in #9318
by default, to mitigate issues seen in #9415 until a fix is in place.

* Don't rerun uncachable nodes if they are dirtied while running. (#9452)

* Don't rerun uncachable nodes if they are dirtied while running.
- Retry dependencies of uncacheable nodes a few times to get a result
  until we are exhausted from trying too many times.
- Bubble uncacheable node retry errors up to the user, tell them things
  were chaning to much.
- Don't propagate dirtiness past uncacheable nodes when invalidating
  from changed roots. Otherwise dirty dependents of uncacheable nodes
  will need to re-run.

* enable the engine fs watcher by default, now that it won't cause issues.
Remove execution option override from tests.

* use reference to self in stop_walk_predicate closure

* invalidate often enough that test doesn't flake

* Add a flag to prevent the FsEventService and watchman from starting (#9487)

* add --watchman-enable flag
* disable watchman when flag is false
* Don't wait for the initial watchman event if we aren't using watchman.
* check invalidation watcher liveness from scheduler service

* Extract a `watch` crate. (#9635)

The `watch` module directly accesses the `engine` crate's `Graph`, which makes it more challenging to test.

Extract a `watch` crate which is used via an `trait Invalidatable` which is implemented for the engine's `Graph`, as well as independently in tests.

[ci skip-jvm-tests]

* Simplify Scheduler::execute and unify Graph retry (#9674)

Both the `Graph` and the `Scheduler` implemented retry for requested Nodes, but the `Scheduler` implementation was pre-async-await and much more complicated.

Unify the retry implementations into `Graph::get` (either roots or uncacheable nodes are retried), and simplify the `Scheduler`'s loop down to:
```
let maybe_display_handle = Self::maybe_display_initialize(&session);
let result = loop {
  if let Ok(res) = receiver.recv_timeout(refresh_interval) {
    break Ok(res);
  } else if let Err(e) = Self::maybe_display_render(&session, &mut tasks) {
    break Err(e);
  }
};
Self::maybe_display_teardown(session, maybe_display_handle);
result
```

A single, more modern retry implementation (thanks @hrfuller!), and a cleaner `Scheduler::execute` loop.

* Move file invalidation handling to rust (#9636)

A few different kinds of file watching span the boundary between the `SchedulerService` and `FSEventService`:
1. pantsd invalidation globs - how `pantsd` detects that its implementing code or config has changed
2. pidfile - watches `pantsd`'s pidfile to ensure that the daemon exits if it loses exclusivity
3. graph invalidation - any files changing in the workspace should invalidate the engine's `Graph`
4. `--loop` - implemented directly in the `SchedulerService`

Because of the semi-cyclic nature of the relationship between the `SchedulerService` and `FSEventService`, it's challenging to understand the interplay of these usecases. And, unsurprisingly, that lead to the `notify` crate implementation only satisfying one of them.

The fundamental change in this PR is to add support for two new parameters to engine executions which are implemented by the `Graph`:
* `poll: bool` - When `poll` is enabled, a `product_request` will wait for the requested Nodes to have changed from their last-observed values before returning. When a poll waits, an optional `poll_delay` is applied before it returns to "debounce" polls.
* `timeout: Optional[Duration]` - When a `timeout` is set, a `product_request` will wait up to the given duration for the requested Node(s) to complete (including any time `poll`ing).

These features are then used by:
* `--loop`: uses `poll` (with a `poll_delay`, but without a `timeout`) to immediately re-run a `Goal` when its inputs have changed.
* invalidation globs and pidfile watching: use `poll` (with no `poll_delay`) and `timeout` to block their `SchedulerService` thread and watch for changes to those files.

The `FSEventService` and `SchedulerService` are decoupled, and each now interacts only with the `Scheduler`: `FSEventService` to push `watchman` events to the `Graph`, and the `SchedulerService` to pull invalidation information from the `Graph`.

Because all events now flow through the `Graph`, the `notify` crate has reached feature parity with `watchman`.

In followup changes we can remove the experimental flag, disable `watchman` (and thus the `FSEventService`) by default, and remove the dependency between `--loop` and `pantsd`.

* pin tokio at exactly 0.2.13

* fix lint issues

* fix mypy typing issues

* Move away from the debounced notify watcher #9754

* Remove test that has raced file invalidation ever since the notify backend was added, but which will now fairly consistently lose that race.
# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]

* As explained in the comment: we can no longer create duplicate parallel BUILD files and hope that pants does not notice them before we scan the directory again!
# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]

Co-authored-by: Stu Hood <stuhood@gmail.com>
Co-authored-by: Stu Hood <stuhood@twitter.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants