[RunAllTests] Try to fix/workaround rest of #2844: Add retry mechanism when running tests #3969
Add while loop for running tests, too.
Fix loop & exit code handling for test runs.
Finish TODOs.
@vinitamurthi could you PTAL as a reviewer since this is tied pretty strongly into developer workflow/happiness?
Break test to verify CI workflow failure conditions.
Note that I just added a commit to trigger a real failure in StateFragmentLocalTest to make sure the failure is properly represented in CI results.
Looks like the failure worked as expected. Reverting now so that the PR is passing CI.
Revert intentional breakage since CI failures were verified.
Overall this LGTM. One question though: does it make sense to retry only failures? From my understanding, this is retrying all targets, right?
Unassigning @vinitamurthi since they have already approved the PR.
Hi @BenHenning, this PR is ready to be merged. Please address any remaining comments prior to merging, and feel free to merge this PR once the CI checks pass and you're happy with it. Thanks!
@vinitamurthi you're correct that it will retry all targets, but Bazel caches passing results so it won't re-run those (unless you pass a specific flag to force it to re-run, or a flag that implies that, like "runs_per_test").
Thanks also for the review @vinitamurthi! Going ahead and merging this since nothing seems outstanding, and it'd be really good to get this merged for improved CI stability.
Explanation
Try to work around the last issue in #2844: a SIGSEGV that is sometimes thrown when running StateFragmentTest/StateFragmentLocalTest.
While it would be preferable to fix this issue, doing so will likely be extremely time-consuming since it'll require digging into a JVM bug. After some effort to find a root cause, I haven't been able to come up with a single dependable way to work around the issue. It mostly went away with previous mitigations, but after adding test sharding it seems to come up a lot.
This solution introduces the same while-loop retry mechanism that we already use for builds (which was effective in addressing #3789). Bazel makes this work well since passing tests won't be re-run (their results are cached; see the CI results for this PR). Given that the issue seems to happen less often when StateFragmentTest/StateFragmentLocalTest runs by itself, this is a reasonable outcome (since it'll generally result in running just those tests for runs 2-5).
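For readers who want to see the shape of the retry loop, here is a minimal sketch in Python. It is an illustration only, not the workflow code changed in this PR: the names `run_command` and `run_with_retries` are hypothetical, and the limit of 5 attempts is inferred from the "runs 2-5" wording above.

```python
import subprocess


def run_command(command):
    """Run a command and return its exit code (0 means success)."""
    return subprocess.run(command).returncode


def run_with_retries(command, max_attempts=5, runner=run_command):
    """Re-run `command` until it succeeds or `max_attempts` is reached.

    Returns the exit code of the last attempt, so a run that never
    passes still reports a failure to the caller (and thus to CI).
    `runner` is a hypothetical hook that makes the loop testable
    without invoking a real test command.
    """
    exit_code = 1  # assume failure until an attempt succeeds
    attempt = 0
    while attempt < max_attempts:
        attempt += 1
        exit_code = runner(command)
        if exit_code == 0:
            break  # success: stop retrying
        print(f"Attempt {attempt}/{max_attempts} failed with exit code {exit_code}")
    return exit_code


# Hypothetical usage (the target pattern is an assumption):
#   sys.exit(run_with_retries(["bazel", "test", "//..."]))
```

Because the same command is re-issued on every attempt, a retry of this kind relies on the test runner's result cache to avoid repeating work for targets that already passed.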
That being said, I'm not a complete fan of this solution since it will:
While the second outcome is definitely worse, it also seems to happen much less often so it seems like a worthwhile trade-off for the stability benefits that we get from this fix.
#3970 was filed to track the long-term fix of the underlying SIGSEGV so that this mechanism isn't needed for CI runs to reliably pass.
Essential Checklist
For UI-specific PRs only
N/A -- infrastructure change