Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry for filesystem changes should either use both time and attempts, or be infinite #10081

Closed
stuhood opened this issue Jun 17, 2020 · 2 comments · Fixed by #10139
Closed
Assignees
Labels
Milestone

Comments

@stuhood
Copy link
Sponsor Member

stuhood commented Jun 17, 2020

In #9920 I changed Graph::get to backoff for a time period in order to account for "many small invalidations occurring rapidly", as is the case with fmt. But there is another case that using time alone does not account for, and that's long running tests: our 60 second timeout is significantly shorter than some integration tests, which means that a file change invalidating a test that runs for longer than 60 seconds will immediately fail with:

Exception: Exhausted retries while waiting for the filesystem to stabilize.

We should either use both time and an attempt count, or consider removing the timeout here entirely and just retrying indefinitely while files are changing.

@stuhood stuhood added the bug label Jun 17, 2020
@stuhood stuhood added this to the 1.29.x milestone Jun 17, 2020
@stuhood
Copy link
Sponsor Member Author

stuhood commented Jun 20, 2020

Hm. I suppose that another aspect of this that might be important is that we don't currently cancel the ongoing attempt when a file is changed. If we did (and if that's possible with uncacheable nodes?) we could retry more quickly. It probably doesn't affect the choice of infinite vs bounded retry though.

@stuhood
Copy link
Sponsor Member Author

stuhood commented Jun 22, 2020

Discussing this with @Eric-Arellano , some more expectations that folks might have:

  1. a user's first expectation might be that tests run to completion based on a global snapshot of the repository
  2. if that's not going to be the case, we should definitely be logging, and restarting asap

Given that 2. is the reality for the near term (we lazily consume the filesystem, and so while we can try to give the illusion of having snapshotted the entire thing, in reality, there will be restarts), we should add the logging, and we should restart immediately.

With logging and retry that doesn't wait for the previous attempt to complete, I think that infinite retries are sane.

@stuhood stuhood self-assigned this Jun 23, 2020
stuhood added a commit that referenced this issue Jun 24, 2020
…ging (#10139)

### Problem

As described in #10081, there are a few different concerns around retrying nodes when the filesystem changes:
1. whether we have retried enough times
2. whether we've made it clear to the user that we're retrying
3. whether we retry immediately upon change, or only after something has completed

### Solution

To address each of the above points, we retry:
1. indefinitely
2. with logging
3. immediately upon invalidation of nodes by:
    * removing the `Running::dirty` flag, and moving a `Running` node to `NotStarted` to trigger invalidation
    * "aborting"/dropping ongoing work for a canceled attempt using `futures::Abortable`

Additionally, improve the `Display` and `user_facing_name` implementations for `Node` to allow for better log messages from the `Graph`.

### Result

A message like:
```
12:39:52.93 [INFO] Completed: Building requirements.pex
15:21:58.95 [INFO] Filesystem changed during run: retrying `Test` in 500ms...
15:21:58.96 [INFO] Filesystem changed during run: retrying `@rule coordinator_of_tests((OptionsBootstrapper(args=['--no-v1', '--v2', '--no-process-execution-use-local-cache', 'test', 'tests/python/pants_test/util:'], env={'PANTS_DEV': '1'}, config=ChainedConfig(['/Users/stuhood/src/pants/pants.toml', '/Users/stuhood/.pants.rc'])), WrappedTestFieldSet(field_set=PythonTestFieldSet(address=Address(tests/python/pants_test/util, argutil), origin=SiblingAddresses(directory='tests/python/pants_test/util'), sources=<class 'pants.backend.python.target_types.PythonTestsSources'>(alias='sources', sanitized_raw_value=('test_argutil.py',), default=('test_*.py', '*_test.py', 'tests.py', 'conftest.py')), timeout=pants.backend.python.target_types.PythonTestsTimeout(alias='timeout', value=None, default=None), coverage=pants.backend.python.target_types.PythonCoverage(alias='coverage', value=('pants.util.argutil',), default=None)))))` in 500ms...
12:39:57.17 [INFO] Completed: Building requirements.pex with 1 requirement: dataclasses==0.6
```
...is rendered immediately when a file that a test depends on is invalidated.

Fixes #10081.
hrfuller pushed a commit to twitter/pants that referenced this issue Jul 29, 2020
…ging (pantsbuild#10139)

As described in pantsbuild#10081, there are a few different concerns around retrying nodes when the filesystem changes:
1. whether we have retried enough times
2. whether we've made it clear to the user that we're retrying
3. whether we retry immediately upon change, or only after something has completed

To address each of the above points, we retry:
1. indefinitely
2. with logging
3. immediately upon invalidation of nodes by:
    * removing the `Running::dirty` flag, and moving a `Running` node to `NotStarted` to trigger invalidation
    * "aborting"/dropping ongoing work for a canceled attempt using `futures::Abortable`

Additionally, improve the `Display` and `user_facing_name` implementations for `Node` to allow for better log messages from the `Graph`.

A message like:
```
12:39:52.93 [INFO] Completed: Building requirements.pex
15:21:58.95 [INFO] Filesystem changed during run: retrying `Test` in 500ms...
15:21:58.96 [INFO] Filesystem changed during run: retrying `@rule coordinator_of_tests((OptionsBootstrapper(args=['--no-v1', '--v2', '--no-process-execution-use-local-cache', 'test', 'tests/python/pants_test/util:'], env={'PANTS_DEV': '1'}, config=ChainedConfig(['/Users/stuhood/src/pants/pants.toml', '/Users/stuhood/.pants.rc'])), WrappedTestFieldSet(field_set=PythonTestFieldSet(address=Address(tests/python/pants_test/util, argutil), origin=SiblingAddresses(directory='tests/python/pants_test/util'), sources=<class 'pants.backend.python.target_types.PythonTestsSources'>(alias='sources', sanitized_raw_value=('test_argutil.py',), default=('test_*.py', '*_test.py', 'tests.py', 'conftest.py')), timeout=pants.backend.python.target_types.PythonTestsTimeout(alias='timeout', value=None, default=None), coverage=pants.backend.python.target_types.PythonCoverage(alias='coverage', value=('pants.util.argutil',), default=None)))))` in 500ms...
12:39:57.17 [INFO] Completed: Building requirements.pex with 1 requirement: dataclasses==0.6
```
...is rendered immediately when a file that a test depends on is invalidated.

Fixes pantsbuild#10081.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant