Retry for filesystem changes should either use both time and attempts, or be infinite #10081
Hm. I suppose that another aspect of this that might be important is that we don't currently cancel the ongoing attempt when a file is changed. If we did (and if that's possible with uncacheable nodes?) we could retry more quickly. It probably doesn't affect the choice of infinite vs bounded retry, though.
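To make the cancellation idea above concrete, here is a hypothetical, std-only sketch of cancelling an in-flight attempt when a file changes (the PR that eventually landed used `futures::Abortable` for this instead). The `AbortHandle`/`AbortRegistration` names mirror the `futures` crate's pairing, but this implementation is illustrative, not the engine's actual code:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Marker error for an attempt that was cancelled mid-run.
#[derive(Debug, PartialEq)]
pub struct Aborted;

/// Held by the graph; flipped when the filesystem invalidates the node.
pub struct AbortHandle(Arc<AtomicBool>);

impl AbortHandle {
    pub fn new() -> (AbortHandle, AbortRegistration) {
        let flag = Arc::new(AtomicBool::new(false));
        (AbortHandle(flag.clone()), AbortRegistration(flag))
    }

    /// Called when a watched file changes while the node is running.
    pub fn abort(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
}

/// Held by the running attempt; checked before doing (or committing) work.
pub struct AbortRegistration(Arc<AtomicBool>);

impl AbortRegistration {
    /// Run `work` only if the attempt has not been aborted.
    pub fn run<T>(&self, work: impl FnOnce() -> T) -> Result<T, Aborted> {
        if self.0.load(Ordering::SeqCst) {
            Err(Aborted)
        } else {
            Ok(work())
        }
    }
}
```

The design point is that cancellation is cooperative: the graph only flips a flag, and the attempt's work is dropped at the next checkpoint rather than pre-empted.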
Discussing this with @Eric-Arellano, some more expectations that folks might have:
Given logging, and retry that doesn't wait for the previous attempt to complete, I think that infinite retries are sane.
…ging (#10139)

### Problem

As described in #10081, there are a few different concerns around retrying nodes when the filesystem changes:
1. whether we have retried enough times
2. whether we've made it clear to the user that we're retrying
3. whether we retry immediately upon change, or only after something has completed

### Solution

To address each of the above points, we retry:
1. indefinitely
2. with logging
3. immediately upon invalidation of nodes, by:
   * removing the `Running::dirty` flag, and moving a `Running` node to `NotStarted` to trigger invalidation
   * "aborting"/dropping ongoing work for a canceled attempt using `futures::Abortable`

Additionally, improve the `Display` and `user_facing_name` implementations for `Node` to allow for better log messages from the `Graph`.

### Result

A message like:
```
12:39:52.93 [INFO] Completed: Building requirements.pex
15:21:58.95 [INFO] Filesystem changed during run: retrying `Test` in 500ms...
15:21:58.96 [INFO] Filesystem changed during run: retrying `@rule coordinator_of_tests((OptionsBootstrapper(args=['--no-v1', '--v2', '--no-process-execution-use-local-cache', 'test', 'tests/python/pants_test/util:'], env={'PANTS_DEV': '1'}, config=ChainedConfig(['/Users/stuhood/src/pants/pants.toml', '/Users/stuhood/.pants.rc'])), WrappedTestFieldSet(field_set=PythonTestFieldSet(address=Address(tests/python/pants_test/util, argutil), origin=SiblingAddresses(directory='tests/python/pants_test/util'), sources=<class 'pants.backend.python.target_types.PythonTestsSources'>(alias='sources', sanitized_raw_value=('test_argutil.py',), default=('test_*.py', '*_test.py', 'tests.py', 'conftest.py')), timeout=pants.backend.python.target_types.PythonTestsTimeout(alias='timeout', value=None, default=None), coverage=pants.backend.python.target_types.PythonCoverage(alias='coverage', value=('pants.util.argutil',), default=None)))))` in 500ms...
12:39:57.17 [INFO] Completed: Building requirements.pex with 1 requirement: dataclasses==0.6
```
...is rendered immediately when a file that a test depends on is invalidated. Fixes #10081.
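The "moving a `Running` node to `NotStarted`" part of the solution above can be sketched as a small state transition. This is a hypothetical illustration with invented names (`NodeState`, `invalidate`), not the engine's actual types; the point is that invalidation drops the in-flight attempt entirely rather than setting a `dirty` flag on it, so the next poll starts (and logs) a fresh run immediately:

```rust
/// Illustrative node lifecycle states (names are hypothetical).
#[derive(Debug, PartialEq)]
enum NodeState {
    NotStarted,
    Running,
    Completed,
}

/// Filesystem change invalidates a node: any state collapses to NotStarted.
fn invalidate(state: NodeState) -> NodeState {
    match state {
        // An in-flight attempt is abandoned outright rather than flagged dirty,
        // which is what allows the retry to begin immediately.
        NodeState::Running => NodeState::NotStarted,
        // A completed result is stale once an input file changes.
        NodeState::Completed => NodeState::NotStarted,
        // A node that never started has nothing to invalidate.
        NodeState::NotStarted => NodeState::NotStarted,
    }
}
```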
In #9920 I changed `Graph::get` to back off for a time period in order to account for "many small invalidations occurring rapidly", as is the case with `fmt`. But there is another case that using time alone does not account for, and that's long-running tests: our 60 second timeout is significantly shorter than some integration tests, which means that a file change invalidating a test that runs for longer than 60 seconds will immediately fail. We should either use both time and an attempt count, or consider removing the timeout here entirely and just retrying indefinitely while files are changing.
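The first option above (combining time and an attempt count) can be sketched as follows. This is a hypothetical illustration, not the actual `Graph::get` signature: the idea is to give up only when *both* budgets are exhausted, so a single attempt that outlives the time window still gets its remaining attempts:

```rust
use std::time::{Duration, Instant};

/// Retry `attempt` until it succeeds, giving up only when BOTH the time
/// budget and the attempt budget are exhausted (names are illustrative).
fn retry_with_budget<T, E>(
    max_elapsed: Duration,
    max_attempts: u32,
    mut attempt: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let deadline = Instant::now() + max_elapsed;
    let mut tries = 0;
    loop {
        match attempt() {
            Ok(value) => return Ok(value),
            Err(err) => {
                tries += 1;
                // A long-running test that blows past `deadline` still gets
                // its full attempt count; a flapping fast test still gets its
                // full time window.
                if tries >= max_attempts && Instant::now() >= deadline {
                    return Err(err);
                }
            }
        }
    }
}
```

The other option from the issue, retrying indefinitely while files keep changing, would simply drop the give-up branch and rely on cancellation plus logging to keep the loop observable.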