
Retry failed parallel test sequentially #1856

Merged 12 commits from parallel-test2 into staging on Dec 16, 2019

Conversation

@ailisp (Member) commented Dec 12, 2019

This is a hacky workaround for failing tests in the parallel runner.
Given:

  1. every time, 1-3 tests fail
  2. most failures are timeouts.

So set a threshold of 5: if the number of failed tests is <= the threshold, retry the failed tests sequentially; if they then pass, the run is still considered passing. If more than the threshold number of tests fail, fail the run to avoid spending too much time on retries. This should hopefully reduce flakiness and benefit existing PRs by making timeouts less likely.
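Roughly, the retry step works like the sketch below. This is only an illustration of the approach described above, not the actual scripts/parallel_run_tests.py code; the names fails and RERUN_THRESHOLD appear in the review snippet further down, while the helper name and the way tests are invoked are assumptions.

import subprocess
import sys

RERUN_THRESHOLD = 5  # at most this many failed tests are retried


def retry_failed_sequentially(fails):
    """Hypothetical helper: rerun tests that failed in the parallel pass.

    `fails` is assumed to be a list of shell commands for the failed tests.
    """
    if len(fails) > RERUN_THRESHOLD:
        # Too many failures: retrying them all would take too long,
        # so report them and fail the build outright.
        for f in fails:
            print(f)
        sys.exit(1)

    still_failing = []
    for test in fails:
        # Rerun the failed test alone, with no parallel competition for
        # CPU/IO, so its timeout behaves like a local sequential run.
        if subprocess.run(test, shell=True).returncode != 0:
            still_failing.append(test)

    if still_failing:
        for f in still_failing:
            print(f)
        sys.exit(1)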

ailisp added 5 commits Dec 12, 2019
@ailisp (Member, Author) commented Dec 12, 2019

Also fixed the coverage upload; the latest report is uploaded to https://codecov.io/github/nearprotocol/nearcore/commit/07cea8efaefe29a714e7e9b2bb4e070546777a47

Codecov doesn't comment here, I think because there are no file changes that cause a change in coverage.

Updated: the Codecov bot comments now. Weird.

@ailisp marked this pull request as ready for review Dec 12, 2019
@ailisp requested review from nearmax, vgrichina and bowenwang1996 and removed the request for vgrichina Dec 12, 2019
Review thread on scripts/parallel_run_tests.py (outdated):
for f in fails:
    print(f)
exit(1)
if len(fails) <= RERUN_THRESHOLD:

@bowenwang1996 (Member) commented Dec 12, 2019

I feel like rerunning 5 times is a bit too much. If it only passes 1 out of 5 times, we should not consider it as passing.

@ailisp (Member, Author) commented Dec 12, 2019

It's not rerun 5 times; it allows at most five tests to fail. Those failed tests are then retried once, without parallel execution. If they pass this time => pass; if they still don't pass => fail.

@ailisp (Member, Author) commented Dec 12, 2019

5 is based on historical results: when all tests pass locally, there are 0-3 tests that time out in CI. If I run each of these timed-out tests separately on CI, they pass.

@codecov (bot) commented Dec 12, 2019

Codecov Report

❗️ No coverage uploaded for pull request base (staging@d781c41).
The diff coverage is n/a.


@@            Coverage Diff             @@
##             staging    #1856   +/-   ##
==========================================
  Coverage           ?   86.59%           
==========================================
  Files              ?      166           
  Lines              ?    30839           
  Branches           ?        0           
==========================================
  Hits               ?    26705           
  Misses             ?     4134           
  Partials           ?        0


@nearmax (Collaborator) left a comment

I strongly think we should not be doing this change. Instead of writing an automatic retry for our tests, we should figure out and fix the flaky tests and the scripts that run them.

If anything, we should actually have the opposite script: one that runs each of our tests 10 times and fails the whole run if any test fails at least once, because we have a large number of tests that are non-deterministic due to parallelism, and if there is a bug, parallel code might fail only very rarely. This is actually a common practice for testing parallel code when model checking or static analysis is not available.

@ailisp (Member, Author) commented Dec 13, 2019

I strongly think we should not be doing this change. Instead of writing an automatic retry for our tests, we should figure out and fix the flaky tests and the scripts that run them.

I have a rationale for this: if every test is run sequentially, it is very unlikely to fail, just as when you run a single test separately on your local machine, because computing resources are adequate. But we cannot really run all tests sequentially; that makes CI take too long, and it doesn't scale (I tried this parallel script on a 32 vCPU instance and it's extremely fast).

Looking at almost all the failures of #1832: the failures are in ~4 tests and are caused by timeouts. The timeout values were chosen based on running the tests separately locally, so it makes sense to rerun them separately on the weaker-than-local CI machine. If a test passes on this weaker machine, that still proves it works on the local machine. (If it passes in the first, parallel pass, which proves it can succeed under even more limited resources, that's even better.)
Making tests run this way also gives every timeout value an actual criterion: no integration test should time out when run with 4 vCPUs, an SSD without other significant reads/writes happening, and enough RAM (RAM can be a limit too).

Identifying the tests that become flaky due to resource limitations is possible, but all of these options sound like considerably more work. I can think of:

  1. schedule IO-bound tasks so they don't run together with other IO-bound tasks, and CPU-bound tasks not with other CPU-bound tasks
  2. attach N disks to the CI runner if it is going to run N tests in parallel, and limit the RAM for each test container to 1/N as well; then, if any test fails, increase all timeouts as needed, because every parallel test is now supposed to get the same, measurable amount of IO and CPU resources (see the sketch after this list)
  3. manually identify IO-bound tests, skip the parallel run for them, and put them directly into the "non parallel run" group. But this doesn't seem to give the benefit of the automatic retry in a sequential run
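For option 2, a rough sketch of what the per-container resource split could look like; the docker invocation, image name, mount paths, and the concrete numbers are all assumptions for illustration, not anything this PR implements:

import subprocess

N_PARALLEL = 4            # parallel slots on the runner (assumption)
TOTAL_RAM_MB = 16 * 1024  # total RAM on the CI machine (assumption)


def run_test_isolated(test_cmd, slot):
    """Run one test in a container capped at 1/N of the RAM and one vCPU,
    with a dedicated disk mounted for that parallel slot (placeholder
    image name and mount path)."""
    mem_limit_mb = TOTAL_RAM_MB // N_PARALLEL
    return subprocess.run([
        "docker", "run", "--rm",
        f"--memory={mem_limit_mb}m",     # RAM limited to 1/N
        "--cpus=1",                      # one vCPU per test
        "-v", f"/mnt/disk{slot}:/data",  # dedicated disk per slot
        "nearcore-test-image",           # placeholder image name
        "bash", "-c", test_cmd,
    ])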

If anything, we should actually have the opposite script: one that runs each of our tests 10 times and fails the whole run if any test fails at least once, because we have a large number of tests that are non-deterministic due to parallelism, and if there is a bug, parallel code might fail only very rarely. This is actually a common practice for testing parallel code when model checking or static analysis is not available.

Parallel testing is needed, but running each test X times in CI is considerably expensive. We have the nightly runner that keeps running the integration tests over and over again; I think that is the right place to keep flaky tests monitored. In CI, my proposal is to get a reasonably good result as fast as possible: this PR makes CI faster, shows all failures, and makes timeouts less likely, without trying to hide the problem (if you agree that running a test separately isn't hiding the problem).

@ailisp (Member, Author) commented Dec 13, 2019

attach N disks to the CI runner if it is going to run N tests in parallel, and limit the RAM for each test container to 1/N as well; then, if any test fails, increase all timeouts as needed, because every parallel test is now supposed to get the same, measurable amount of IO and CPU resources

Actually, this may be what truly isolated tests should look like. What do you think?

@nearmax (Collaborator) commented Dec 13, 2019

Okay, then we should have a script that runs all tests multiple times each, sequentially and then have it run daily even if it takes hours to complete.

@ailisp (Member, Author) commented Dec 13, 2019

Okay, then we should have a script that runs all tests multiple times each, sequentially and then have it run daily even if it takes hours to complete.

We have that in the nightly runner, though the issue is that it lists all non-CI tests as "basic" rather than breaking them out per test. But I can display them separately / run them 10 times there:

https://nightly.neartest.com/run/5007890837b264d362b75837ae76b31cff3464c7_191212_225534

On a different topic: I just found that the gcloud instance's disk benchmarks at only 1/20 of my local machine's. Given comparable CPU and RAM, this could really be a pain.

@ailisp force-pushed the parallel-test2 branch from af43666 to 7736a48 Dec 14, 2019
@ailisp (Member, Author) commented Dec 14, 2019

@nearmax Added a script that runs the tests 10 times sequentially too. PTAL
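For reference, the run-10-times idea amounts to something like the sketch below. This is an illustration of the suggestion from the earlier comment, not the actual script added in this PR; the helper name and the way tests are invoked are assumptions.

import subprocess
import sys

RUNS = 10  # run each test this many times, per the suggestion above


def run_each_test_repeatedly(tests):
    """Hypothetical helper: run every test sequentially RUNS times and
    fail the whole job if any single run fails, so that rare bugs in
    parallel code surface. `tests` is assumed to be a list of shell
    commands."""
    for test in tests:
        for i in range(RUNS):
            if subprocess.run(test, shell=True).returncode != 0:
                print(f"{test} failed on run {i + 1}/{RUNS}")
                sys.exit(1)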

ailisp added 4 commits Dec 14, 2019
@nearmax (Collaborator) left a comment

@ailisp added the automerge label Dec 16, 2019
@nearprotocol-bulldozer bot merged commit cc33034 into staging Dec 16, 2019
2 checks passed: gitlab-ci, reallyfastci
@nearprotocol-bulldozer bot deleted the parallel-test2 branch Dec 16, 2019