Retry for filesystem changes should either use both time and attempts, or be infinite #10081

stuhood · 2020-06-17T16:41:32Z

In #9920 I changed Graph::get to backoff for a time period in order to account for "many small invalidations occurring rapidly", as is the case with fmt. But there is another case that using time alone does not account for, and that's long running tests: our 60 second timeout is significantly shorter than some integration tests, which means that a file change invalidating a test that runs for longer than 60 seconds will immediately fail with:

Exception: Exhausted retries while waiting for the filesystem to stabilize.

We should either use both time and an attempt count, or consider removing the timeout here entirely and just retrying indefinitely while files are changing.

The text was updated successfully, but these errors were encountered:

stuhood · 2020-06-20T03:52:57Z

Hm. I suppose that another aspect of this that might be important is that we don't currently cancel the ongoing attempt when a file is changed. If we did (and if that's possible with uncacheable nodes?) we could retry more quickly. It probably doesn't affect the choice of infinite vs bounded retry though.

stuhood · 2020-06-22T22:08:28Z

Discussing this with @Eric-Arellano , some more expectations that folks might have:

a user's first expectation might be that tests run to completion based on a global snapshot of the repository
if that's not going to be the case, we should definitely be logging, and restarting asap

Given that 2. is the reality for the near term (we lazily consume the filesystem, and so while we can try to give the illusion of having snapshotted the entire thing, in reality, there will be restarts), we should add the logging, and we should restart immediately.

With logging and retry that doesn't wait for the previous attempt to complete, I think that infinite retries are sane.

…ging (#10139) ### Problem As described in #10081, there are a few different concerns around retrying nodes when the filesystem changes: 1. whether we have retried enough times 2. whether we've made it clear to the user that we're retrying 3. whether we retry immediately upon change, or only after something has completed ### Solution To address each of the above points, we retry: 1. indefinitely 2. with logging 3. immediately upon invalidation of nodes by: * removing the `Running::dirty` flag, and moving a `Running` node to `NotStarted` to trigger invalidation * "aborting"/dropping ongoing work for a canceled attempt using `futures::Abortable` Additionally, improve the `Display` and `user_facing_name` implementations for `Node` to allow for better log messages from the `Graph`. ### Result A message like: ``` 12:39:52.93 [INFO] Completed: Building requirements.pex 15:21:58.95 [INFO] Filesystem changed during run: retrying `Test` in 500ms... 15:21:58.96 [INFO] Filesystem changed during run: retrying `@rule coordinator_of_tests((OptionsBootstrapper(args=['--no-v1', '--v2', '--no-process-execution-use-local-cache', 'test', 'tests/python/pants_test/util:'], env={'PANTS_DEV': '1'}, config=ChainedConfig(['/Users/stuhood/src/pants/pants.toml', '/Users/stuhood/.pants.rc'])), WrappedTestFieldSet(field_set=PythonTestFieldSet(address=Address(tests/python/pants_test/util, argutil), origin=SiblingAddresses(directory='tests/python/pants_test/util'), sources=<class 'pants.backend.python.target_types.PythonTestsSources'>(alias='sources', sanitized_raw_value=('test_argutil.py',), default=('test_*.py', '*_test.py', 'tests.py', 'conftest.py')), timeout=pants.backend.python.target_types.PythonTestsTimeout(alias='timeout', value=None, default=None), coverage=pants.backend.python.target_types.PythonCoverage(alias='coverage', value=('pants.util.argutil',), default=None)))))` in 500ms... 12:39:57.17 [INFO] Completed: Building requirements.pex with 1 requirement: dataclasses==0.6 ``` ...is rendered immediately when a file that a test depends on is invalidated. Fixes #10081.

…ging (pantsbuild#10139) As described in pantsbuild#10081, there are a few different concerns around retrying nodes when the filesystem changes: 1. whether we have retried enough times 2. whether we've made it clear to the user that we're retrying 3. whether we retry immediately upon change, or only after something has completed To address each of the above points, we retry: 1. indefinitely 2. with logging 3. immediately upon invalidation of nodes by: * removing the `Running::dirty` flag, and moving a `Running` node to `NotStarted` to trigger invalidation * "aborting"/dropping ongoing work for a canceled attempt using `futures::Abortable` Additionally, improve the `Display` and `user_facing_name` implementations for `Node` to allow for better log messages from the `Graph`. A message like: ``` 12:39:52.93 [INFO] Completed: Building requirements.pex 15:21:58.95 [INFO] Filesystem changed during run: retrying `Test` in 500ms... 15:21:58.96 [INFO] Filesystem changed during run: retrying `@rule coordinator_of_tests((OptionsBootstrapper(args=['--no-v1', '--v2', '--no-process-execution-use-local-cache', 'test', 'tests/python/pants_test/util:'], env={'PANTS_DEV': '1'}, config=ChainedConfig(['/Users/stuhood/src/pants/pants.toml', '/Users/stuhood/.pants.rc'])), WrappedTestFieldSet(field_set=PythonTestFieldSet(address=Address(tests/python/pants_test/util, argutil), origin=SiblingAddresses(directory='tests/python/pants_test/util'), sources=<class 'pants.backend.python.target_types.PythonTestsSources'>(alias='sources', sanitized_raw_value=('test_argutil.py',), default=('test_*.py', '*_test.py', 'tests.py', 'conftest.py')), timeout=pants.backend.python.target_types.PythonTestsTimeout(alias='timeout', value=None, default=None), coverage=pants.backend.python.target_types.PythonCoverage(alias='coverage', value=('pants.util.argutil',), default=None)))))` in 500ms... 12:39:57.17 [INFO] Completed: Building requirements.pex with 1 requirement: dataclasses==0.6 ``` ...is rendered immediately when a file that a test depends on is invalidated. Fixes pantsbuild#10081.

stuhood added the bug label Jun 17, 2020

stuhood added this to the 1.29.x milestone Jun 17, 2020

stuhood self-assigned this Jun 23, 2020

stuhood mentioned this issue Jun 23, 2020

Retry for filesystem changes more quickly, indefinitely, and with logging #10139

Merged

stuhood closed this as completed in #10139 Jun 24, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Retry for filesystem changes should either use both time and attempts, or be infinite #10081

Retry for filesystem changes should either use both time and attempts, or be infinite #10081

stuhood commented Jun 17, 2020 •

edited

stuhood commented Jun 20, 2020

stuhood commented Jun 22, 2020

Retry for filesystem changes should either use both time and attempts, or be infinite #10081

Retry for filesystem changes should either use both time and attempts, or be infinite #10081

Comments

stuhood commented Jun 17, 2020 • edited

stuhood commented Jun 20, 2020

stuhood commented Jun 22, 2020

stuhood commented Jun 17, 2020 •

edited