[RunAllTests] Try to fix/workaround rest of #2844: Add retry mechanism when running tests #3969

BenHenning · 2021-10-25T21:07:21Z

Explanation

Try to work around the last issue in #2844: a SIGSEGV that sometimes occurs when running StateFragmentTest/StateFragmentLocalTest.

While it would be preferable to fix this issue, doing so will likely be extremely time-consuming since it'll require digging into a JVM bug. After some effort to find a root cause, I haven't been able to come up with a single dependable way to work around the issue. It mostly went away with previous mitigations, but after adding test sharding it seems to come up much more often.

This solution introduces the same while-loop retry mechanism as we do for builds (which was effective in addressing #3789). Bazel makes this work well since passing tests won't be re-run (their results are cached--see the CI results for this PR). Given that the issue seems to happen less often when StateFragmentTest/StateFragmentLocalTest runs by itself, this is a reasonable outcome (since it'll generally result in running just those tests for runs 2-5).
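As a rough illustration of the while-loop retry idea (a minimal sketch, not the actual contents of unit_tests.yml), the hypothetical `retry` helper below re-runs a command until it succeeds or an attempt limit is reached, and returns the exit code of the last failed attempt so that CI still reports the failure. The function name and its arguments are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not taken from the workflow file): run a command up to
# max_attempts times, stopping at the first success.
retry() {
  local max_attempts="$1"
  shift
  local attempt=1
  local exit_code=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Using "if" keeps a failing command from aborting the script under "set -e".
    if "$@"; then
      return 0
    else
      exit_code=$?
    fi
    echo "Attempt $attempt of $max_attempts failed with exit code $exit_code" >&2
    attempt=$((attempt + 1))
  done
  # Every attempt failed: surface the last exit code to the caller.
  return "$exit_code"
}
```

In a workflow step this could be invoked as `retry 5 bazel test <targets>`, where the target list is whatever the job already runs. Because passing results are cached, later attempts would generally only re-execute the targets that failed.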

That being said, I'm not a complete fan of this solution since it will:

Result in true flakes being discovered less often, though fortunately the same retry mechanism also applies to the develop branch
Allow false flakes to be missed, which could result in them being checked in (similarly, this should also result in a more stable situation since develop won't fail from them, but it does mean legitimate issues could be missed)
Result in failures taking longer to be reported since they'll be retried up to 5 times

While the second outcome is definitely worse, it also appears to happen much less often, so this seems like a worthwhile trade-off for the stability benefits that we get from this fix.

#3970 was filed to track the long-term fix of the underlying SIGSEGV so that this mechanism isn't needed for CI runs to reliably pass.

Essential Checklist

The PR title and explanation each start with "Fix #bugnum: " (If this PR fixes part of an issue, prefix the title with "Fix part of #bugnum: ...".)
Any changes to scripts/assets files have their rationale included in the PR explanation.
The PR follows the style guide.
The PR does not contain any unnecessary code changes from Android Studio (reference).
The PR is made from a branch that's not called "develop" and is up-to-date with "develop".
The PR is assigned to the appropriate reviewers (reference).

For UI-specific PRs only

N/A -- infrastructure change

Add while loop for running tests, too.

Fix loop & exit code handling for test runs.

Finish TODOs.

BenHenning · 2021-10-26T22:06:18Z

@vinitamurthi could you PTAL as a reviewer since this is tied pretty strongly into developer workflow/happiness?

Break test to verify CI workflow failure conditions.

BenHenning · 2021-10-26T22:07:51Z

Note that I just added a commit to trigger a real failure in StateFragmentLocalTest to make sure the failure is properly represented in CI results.

BenHenning · 2021-10-27T06:56:53Z

Looks like the failure worked as expected. Reverting now so that the PR is passing CI.

Revert intentional breakage since CI failures were verified.

vinitamurthi · 2021-10-27T07:47:01Z

Overall this LGTM. One question though -- does it make sense to retry only failures? From my understanding, this is retrying all targets, right?

oppiabot · 2021-10-27T07:50:11Z

Unassigning @vinitamurthi since they have already approved the PR.

oppiabot · 2021-10-27T07:50:15Z

Hi @BenHenning, this PR is ready to be merged. Please address any remaining comments prior to merging, and feel free to merge this PR once the CI checks pass and you're happy with it. Thanks!

BenHenning · 2021-10-27T21:17:34Z

Overall this LGTM. One question though -- does it make sense to retry only failures? From my understanding, this is retrying all targets, right?

@vinitamurthi you're correct that it will retry all targets, but Bazel caches passing results so it won't re-run those (unless you pass a specific flag to force it to re-run, or a flag that implies that, like "runs_per_test").
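To illustrate why retrying every target is cheap when passing results are cached, here is a toy simulation (it does not invoke Bazel). The `passed` variable stands in for the result cache, and the `run_target` stub, the target names, and the "fails only on its first run" behavior are all invented for this example.

```shell
#!/usr/bin/env bash
# Toy stand-in for cached test results: names of targets that already passed.
passed=""
executions=0  # how many times a target was actually executed
flaky_runs=0  # how many times the flaky target was executed

# Hypothetical stub: "flaky_test" fails on its first execution only.
run_target() {
  executions=$((executions + 1))
  if [ "$1" = "flaky_test" ]; then
    flaky_runs=$((flaky_runs + 1))
    [ "$flaky_runs" -gt 1 ]
    return
  fi
  return 0
}

# Run every requested target, skipping those recorded as passed.
# Returns non-zero if any executed target failed.
run_targets() {
  local rc=0
  local target
  for target in "$@"; do
    case " $passed " in
      *" $target "*) continue ;;  # cached pass: not re-run
    esac
    if run_target "$target"; then
      passed="$passed $target"
    else
      rc=1
    fi
  done
  return "$rc"
}

# Retry the full target list up to 5 times, as the PR's loop does.
attempt=1
until run_targets stable_test_a stable_test_b flaky_test || [ "$attempt" -ge 5 ]; do
  attempt=$((attempt + 1))
done
echo "attempts=$attempt executions=$executions"
```

Running this prints `attempts=2 executions=4`: the first attempt executes all three targets, and the second attempt executes only the one that failed.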

BenHenning · 2021-10-27T21:18:02Z

Thanks also for the review @vinitamurthi! Going ahead and merging this since nothing seems outstanding, and it'd be really good to get this merged for improved CI stability.

BenHenning added 2 commits October 25, 2021 14:01

Update unit_tests.yml

e0e04fb

Add while loop for running tests, too.

Update unit_tests.yml

c0c3503

Fix loop & exit code handling for test runs.

BenHenning mentioned this pull request Oct 26, 2021

StateFragment(Local)Test sometimes fail with SIGSEGV #3970

Open

Update unit_tests.yml

ebab4a1

Finish TODOs.

BenHenning marked this pull request as ready for review October 26, 2021 22:05

BenHenning requested a review from vinitamurthi October 26, 2021 22:05

BenHenning assigned vinitamurthi Oct 26, 2021

Update StateFragmentLocalTest.kt

fcb0a2a

Break test to verify CI workflow failure conditions.

Update StateFragmentLocalTest.kt

8c62300

Revert intentional breakage since CI failures were verified.

vinitamurthi approved these changes Oct 27, 2021


oppiabot bot unassigned vinitamurthi Oct 27, 2021

oppiabot bot added the PR: LGTM label Oct 27, 2021

oppiabot bot assigned BenHenning Oct 27, 2021

BenHenning merged commit 762eba4 into develop Oct 27, 2021

BenHenning deleted the add-workaround-for-sigsegv branch October 27, 2021 21:18

BenHenning mentioned this pull request Nov 1, 2021

Fixes #3827: [Portuguese] Translated text overlap #3925

Merged

6 tasks
