Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry for filesystem changes more quickly, indefinitely, and with logging #10139

Merged
merged 5 commits into from Jun 24, 2020

Conversation

stuhood
Copy link
Sponsor Member

@stuhood stuhood commented Jun 23, 2020

Problem

As described in #10081, there are a few different concerns around retrying nodes when the filesystem changes:

  1. whether we have retried enough times
  2. whether we've made it clear to the user that we're retrying
  3. whether we retry immediately upon change, or only after something has completed

Solution

To address each of the above points, we retry:

  1. indefinitely
  2. with logging
  3. immediately upon invalidation of nodes by:
    • removing the Running::dirty flag, and moving a Running node to NotStarted to trigger invalidation
    • "aborting"/dropping ongoing work for a canceled attempt using futures::Abortable

Additionally, improve the Display and user_facing_name implementations for Node to allow for better log messages from the Graph.

Result

A message like:

12:39:52.93 [INFO] Completed: Building requirements.pex
15:21:58.95 [INFO] Filesystem changed during run: retrying `Test` in 500ms...
15:21:58.96 [INFO] Filesystem changed during run: retrying `@rule coordinator_of_tests((OptionsBootstrapper(args=['--no-v1', '--v2', '--no-process-execution-use-local-cache', 'test', 'tests/python/pants_test/util:'], env={'PANTS_DEV': '1'}, config=ChainedConfig(['/Users/stuhood/src/pants/pants.toml', '/Users/stuhood/.pants.rc'])), WrappedTestFieldSet(field_set=PythonTestFieldSet(address=Address(tests/python/pants_test/util, argutil), origin=SiblingAddresses(directory='tests/python/pants_test/util'), sources=<class 'pants.backend.python.target_types.PythonTestsSources'>(alias='sources', sanitized_raw_value=('test_argutil.py',), default=('test_*.py', '*_test.py', 'tests.py', 'conftest.py')), timeout=pants.backend.python.target_types.PythonTestsTimeout(alias='timeout', value=None, default=None), coverage=pants.backend.python.target_types.PythonCoverage(alias='coverage', value=('pants.util.argutil',), default=None)))))` in 500ms...
12:39:57.17 [INFO] Completed: Building requirements.pex with 1 requirement: dataclasses==0.6

...is rendered immediately when a file that a test depends on is invalidated.

Fixes #10081.

@stuhood stuhood marked this pull request as ready for review June 23, 2020 20:12
@stuhood
Copy link
Sponsor Member Author

stuhood commented Jun 23, 2020

The commits are useful to review independently, but the first two overlap a little bit. Thanks!

Copy link
Contributor

@Eric-Arellano Eric-Arellano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!


let delay_for_root = Duration::from_millis(2000);
let delay_for_mid = Duration::from_millis(2000);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mid?

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The middle of three.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, maybe say middle. I spent a solid minute trying to figure out what mid could be an acronym for.

@@ -22,6 +22,7 @@
def launch_file_toucher(f):
"""Launch a loop to touch the given file, and return a function to call to stop and join it."""
if not os.path.isfile(f):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this was intentional.

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guess which file I was smoke testing with, heh.

inner.unsafe_entry_for_id(entry_id).node().clone()
};
info!(
"Filesystem changed during run: retrying `{}` in {:?}...",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To reduce the risk of truncation:

Suggested change
"Filesystem changed during run: retrying `{}` in {:?}...",
"Filesystem changed: retrying `{}` in {:?}...",

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if it would be worth rewording so that the time appears before the node description also. I'm concerned some nodes may be really long.

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's important to mention "during run", because as you said, this is independent from --loop... also, I want to send the signal that this is tied specifically to the thing that is re-running (it is: if a file unrelated to the thing that was running is changed, you won't see this).

Regarding truncation: these messages are not truncated. Perhaps they should be, but.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, fair. Thanks

NodeKey::MultiPlatformExecuteProcess(mp_epr) => mp_epr.0.user_facing_name(),
NodeKey::DigestFile(DigestFile(File { path, .. })) => {
Some(format!("Fingerprinting: {}", path.display()))
}
NodeKey::DownloadedFile(..) => None,
NodeKey::DownloadedFile(ref d) => Some(format!("Downloading: {}", d.0)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for improving these. They look great.

@@ -1040,14 +1040,16 @@ impl NodeKey {
fn user_facing_name(&self) -> Option<String> {
match self {
NodeKey::Task(ref task) => task.task.display_info.desc.as_ref().map(|s| s.to_owned()),
NodeKey::Snapshot(_) => Some(format!("{}", self)),
NodeKey::Snapshot(ref s) => Some(format!("Snapshotting: {}", s.0)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Snapshotting might not mean much to non-Pants devs. I can't think of a better name though. More something to consider medium term.

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The content of the PathGlobs object should help... it reads as Snapshotting: PathGlobs(globs=['thing.py'], ..)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Speaking of which, the PathGlobs help is very verbose. In a followup, I think it'd be great if it only showed PathGlobs(['foo', '!ignore']) and not the other fields.

@stuhood stuhood force-pushed the stuhood/faster-indefinite-retry branch from 948257e to 495299c Compare June 23, 2020 22:22
@stuhood
Copy link
Sponsor Member Author

stuhood commented Jun 23, 2020

Note: unfortunately, rendering just the Node::user_facing_name as the Display of a Node ran afoul of #7907, by breaking the cycle detection tests. They (rightly) confirm that we actually render the name of the cyclic target, which @rule(desc=..) cannot do currently. See the updated example in the description.

# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]
…ying.

# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]
# Delete this line to force CI to run the JVM tests.
[ci skip-jvm-tests]
@stuhood stuhood force-pushed the stuhood/faster-indefinite-retry branch from 495299c to 8fe1bae Compare June 24, 2020 00:57
@stuhood stuhood merged commit 5a059c3 into pantsbuild:master Jun 24, 2020
@stuhood stuhood deleted the stuhood/faster-indefinite-retry branch June 24, 2020 02:21
stuhood added a commit to stuhood/pants that referenced this pull request Jul 1, 2020
stuhood added a commit to stuhood/pants that referenced this pull request Jul 1, 2020
stuhood added a commit to stuhood/pants that referenced this pull request Jul 9, 2020
stuhood added a commit to stuhood/pants that referenced this pull request Jul 16, 2020
hrfuller pushed a commit to twitter/pants that referenced this pull request Jul 29, 2020
…ging (pantsbuild#10139)

As described in pantsbuild#10081, there are a few different concerns around retrying nodes when the filesystem changes:
1. whether we have retried enough times
2. whether we've made it clear to the user that we're retrying
3. whether we retry immediately upon change, or only after something has completed

To address each of the above points, we retry:
1. indefinitely
2. with logging
3. immediately upon invalidation of nodes by:
    * removing the `Running::dirty` flag, and moving a `Running` node to `NotStarted` to trigger invalidation
    * "aborting"/dropping ongoing work for a canceled attempt using `futures::Abortable`

Additionally, improve the `Display` and `user_facing_name` implementations for `Node` to allow for better log messages from the `Graph`.

A message like:
```
12:39:52.93 [INFO] Completed: Building requirements.pex
15:21:58.95 [INFO] Filesystem changed during run: retrying `Test` in 500ms...
15:21:58.96 [INFO] Filesystem changed during run: retrying `@rule coordinator_of_tests((OptionsBootstrapper(args=['--no-v1', '--v2', '--no-process-execution-use-local-cache', 'test', 'tests/python/pants_test/util:'], env={'PANTS_DEV': '1'}, config=ChainedConfig(['/Users/stuhood/src/pants/pants.toml', '/Users/stuhood/.pants.rc'])), WrappedTestFieldSet(field_set=PythonTestFieldSet(address=Address(tests/python/pants_test/util, argutil), origin=SiblingAddresses(directory='tests/python/pants_test/util'), sources=<class 'pants.backend.python.target_types.PythonTestsSources'>(alias='sources', sanitized_raw_value=('test_argutil.py',), default=('test_*.py', '*_test.py', 'tests.py', 'conftest.py')), timeout=pants.backend.python.target_types.PythonTestsTimeout(alias='timeout', value=None, default=None), coverage=pants.backend.python.target_types.PythonCoverage(alias='coverage', value=('pants.util.argutil',), default=None)))))` in 500ms...
12:39:57.17 [INFO] Completed: Building requirements.pex with 1 requirement: dataclasses==0.6
```
...is rendered immediately when a file that a test depends on is invalidated.

Fixes pantsbuild#10081.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Retry for filesystem changes should either use both time and attempts, or be infinite
2 participants