[RunAllTests] Try to fix/workaround rest of #2844: Add retry mechanism when running tests #3969

Merged: BenHenning merged 5 commits into develop from add-workaround-for-sigsegv on Oct 27, 2021

BenHenning
Sponsor Member

@BenHenning BenHenning commented Oct 25, 2021

Explanation

Try to work around the last issue in #2844: a SIGSEGV that is sometimes thrown when running StateFragmentTest/StateFragmentLocalTest.

While it would be preferable to fix this issue, doing so will likely be extremely time-consuming since it'll require digging into a JVM bug. After some effort to find a root cause, I haven't been able to come up with a single dependable way to work around the issue. It mostly went away with previous mitigations, but after adding test sharding it seems to come up a lot.

This solution introduces the same while-loop retry mechanism that we use for builds (which was effective in addressing #3789). Bazel makes this work well since passing tests won't be re-run (their results are cached; see the CI results for this PR). Given that the issue seems to happen less often when StateFragmentTest/StateFragmentLocalTest runs by itself, this is a reasonable outcome (since it'll generally result in running just those tests for runs 2-5).
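The retry loop described above might look roughly like the following. This is a minimal sketch rather than the workflow script from this PR: the helper name `run_with_retries` is hypothetical, and only the cap of 5 attempts comes from the description.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not taken from the PR): run a command up to
# max_attempts times and stop at the first success. Returns 0 on success,
# otherwise the exit code of the last failed attempt.
run_with_retries() {
  local max_attempts="$1"
  shift
  local attempt=1
  local exit_code=0
  while [ "$attempt" -le "$max_attempts" ]; do
    echo "Attempt ${attempt}/${max_attempts}: $*" >&2
    if "$@"; then
      return 0
    else
      # Inside the else branch, $? still holds the failed command's status.
      exit_code=$?
    fi
    attempt=$((attempt + 1))
  done
  # Propagate the last failure so CI still reports the run as failed.
  return "$exit_code"
}
```

A call such as `run_with_retries 5 bazel test //...` would then retry the test invocation up to 5 times (the target pattern here is only an illustration). Returning the last exit code keeps a genuine failure visible after the final attempt.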

That being said, I'm not a complete fan of this solution since it will:

  • Result in true flakes being discovered less often, though fortunately the same retry mechanism also applies to the develop branch
  • Potentially cause false flakes to be missed, which could result in them being checked in (similarly, this should also result in a more stable situation since develop won't fail from them, but it does mean legitimate issues could be missed)
  • Result in failures taking longer to be reported since they'll be retried up to 5 times

While the second outcome is definitely worse, it also seems to happen much less often, so this seems like a worthwhile trade-off for the stability benefits that we get from this fix.

#3970 was filed to track the long-term fix of the underlying SIGSEGV so that this mechanism isn't needed for CI runs to reliably pass.

Essential Checklist

  • The PR title and explanation each start with "Fix #bugnum: " (If this PR fixes part of an issue, prefix the title with "Fix part of #bugnum: ...".)
  • Any changes to scripts/assets files have their rationale included in the PR explanation.
  • The PR follows the style guide.
  • The PR does not contain any unnecessary code changes from Android Studio (reference).
  • The PR is made from a branch that's not called "develop" and is up-to-date with "develop".
  • The PR is assigned to the appropriate reviewers (reference).

For UI-specific PRs only

N/A -- infrastructure change

Add while loop for running tests, too.
Fix loop & exit code handling for test runs.
Finish TODOs.
@BenHenning BenHenning marked this pull request as ready for review October 26, 2021 22:05
@BenHenning
Sponsor Member Author

@vinitamurthi could you PTAL as a reviewer since this is tied pretty strongly into developer workflow/happiness?

Break test to verify CI workflow failure conditions.
@BenHenning
Sponsor Member Author

Note that I just added a commit to trigger a real failure in StateFragmentLocalTest to make sure the failure is properly represented in CI results.

@BenHenning
Sponsor Member Author

Looks like the failure worked as expected. Reverting now so that the PR is passing CI.

Revert intentional breakage since CI failures were verified.
@vinitamurthi
Contributor

Overall this LGTM. One question though -- does it make sense to retry only failures? From my understanding, this is retrying all targets, right?

@oppiabot

oppiabot bot commented Oct 27, 2021

Unassigning @vinitamurthi since they have already approved the PR.

@oppiabot oppiabot bot added the PR: LGTM label Oct 27, 2021
@oppiabot

oppiabot bot commented Oct 27, 2021

Hi @BenHenning, this PR is ready to be merged. Please address any remaining comments prior to merging, and feel free to merge this PR once the CI checks pass and you're happy with it. Thanks!

@BenHenning
Sponsor Member Author

> Overall this LGTM. One question though -- does it make sense to retry only failures? From my understanding, this is retrying all targets, right?

@vinitamurthi you're correct that it will retry all targets, but Bazel caches passing results so it won't re-run those (unless you pass a specific flag to force it to re-run, or a flag that implies that, like "runs_per_test").
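To illustrate why retrying every target ends up re-executing only the failures, here is a toy model of that behavior. It does not invoke Bazel and is not how Bazel's cache is implemented; the target names, the `execute_target` stand-in, and the string-based "cache" are all hypothetical.

```shell
#!/usr/bin/env bash

# Toy model only: imitates cached passing results with a plain string.
passed_cache=" "   # space-delimited list of targets that already passed
executed_log=""    # every real execution, in order, for inspection
flaky_runs=0

# Hypothetical stand-in for running one test target: "flaky_test" fails on
# its first execution and passes afterwards; other targets always pass.
execute_target() {
  executed_log="${executed_log}$1 "
  if [ "$1" = "flaky_test" ]; then
    flaky_runs=$((flaky_runs + 1))
    [ "$flaky_runs" -ge 2 ]
  fi
}

# Request every target, skipping those with a cached pass. Returns non-zero
# if any target failed during this invocation.
run_all_targets() {
  local result=0
  local target
  for target in "$@"; do
    case "$passed_cache" in
      *" ${target} "*) continue ;;  # cached pass: not executed again
    esac
    if execute_target "$target"; then
      passed_cache="${passed_cache}${target} "
    else
      result=1
    fi
  done
  return "$result"
}

# Retry the full target list until it passes or 5 attempts are used.
attempt=1
until run_all_targets test_a flaky_test test_b || [ "$attempt" -ge 5 ]; do
  attempt=$((attempt + 1))
done
echo "attempts: ${attempt}"
echo "executed: ${executed_log}"
```

In this toy run the second attempt executes only `flaky_test`; the two targets that already passed are skipped even though the whole list was requested again.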

@BenHenning
Sponsor Member Author

Thanks also for the review @vinitamurthi! Going ahead and merging this since nothing seems outstanding, and it'd be really good to get this merged for improved CI stability.

@BenHenning BenHenning merged commit 762eba4 into develop Oct 27, 2021
@BenHenning BenHenning deleted the add-workaround-for-sigsegv branch October 27, 2021 21:18