go/oasis-test-runner: parallel job execution runs the same scenarios multiple times or misses some scenarios #3104

tjanez · 2020-07-13T12:32:13Z

SUMMARY

To get test results faster, it is possible to parallelize their execution by specifying the --parallel.job_count and --parallel.job_index flags, as is done in our Buildkite CI script:

oasis-core/.buildkite/scripts/test_e2e.sh

Lines 64 to 65 in 6c6e703

    
    ${BUILDKITE_PARALLEL_JOB_COUNT:+--parallel.job_count ${BUILDKITE_PARALLEL_JOB_COUNT}} \
    ${BUILDKITE_PARALLEL_JOB:+--parallel.job_index ${BUILDKITE_PARALLEL_JOB}} \

However, the oasis-test-runner currently doesn't correctly partition the jobs to different runners.

ISSUE TYPE

Bug Report

ACTUAL RESULTS

For example, running:

.buildkite/scripts/test_e2e.sh --test e2e/runtime/byzantine/.* --parallel.job_count 3 --parallel.job_index 2

resulted in the following two scenarios being run:

e2e/runtime/byzantine/executor-honest
e2e/runtime/byzantine/executor-straggler

Fuller output:

+ WORKDIR=/home/tadej/Oasis/oasis-core
+ runtime_target=default
+ [[ '' == \i\n\t\e\l\-\s\g\x ]]
+ [[ '' != '' ]]
+ node_binary=/home/tadej/Oasis/oasis-core/go/oasis-node/oasis-node
+ test_runner_binary=/home/tadej/Oasis/oasis-core/go/oasis-test-runner/oasis-test-runner
+ [[ '' != '' ]]
+ ias_mock=true
+ set +x
+ /home/tadej/Oasis/oasis-core/go/oasis-test-runner/oasis-test-runner --basedir.no_cleanup --e2e.node.binary /home/tadej/Oasis/oasis-core/go/oasis-node/oasis-node --e2e/runtime.client.binary_dir /home/tadej/Oasis/oasis-core/target/default/debug --e2e/runtime.runtime.binary_dir /home/tadej/Oasis/oasis-core/target/default/debug --e2e/runtime.runtime.loader /home/tadej/Oasis/oasis-core/target/default/debug/oasis-core-runtime-loader --e2e/runtime.tee_hardware '' --e2e/runtime.ias.mock=true --remote-signer.binary /home/tadej/Oasis/oasis-core/go/oasis-remote-signer/oasis-remote-signer --log.level info --test 'e2e/runtime/byzantine/.*' --parallel.job_count 3 --parallel.job_index 2
level=info module=test-runner caller=root.go:352 ts=2020-07-13T10:51:34.823787934Z msg="skipping test case (assigned to different parallel job)" test=e2e/runtime/byzantine/merge-wrong run_id=0
level=info module=test-runner caller=root.go:352 ts=2020-07-13T10:51:34.823817532Z msg="skipping test case (assigned to different parallel job)" test=e2e/runtime/byzantine/merge-honest run_id=0
level=info module=test-runner caller=root.go:367 ts=2020-07-13T10:51:34.823826608Z msg="running test case" test=e2e/runtime/byzantine/executor-honest run_id=0
[ ... output trimmed ...]
level=info module=test-runner caller=root.go:425 ts=2020-07-13T10:51:34.823832891Z msg="passed test case" test=e2e/runtime/byzantine/executor-honest run_id=0
level=info module=test-runner caller=root.go:352 ts=2020-07-13T10:51:34.823838913Z msg="skipping test case (assigned to different parallel job)" test=e2e/runtime/byzantine/merge-straggler run_id=0
level=info module=test-runner caller=root.go:352 ts=2020-07-13T10:51:34.823844475Z msg="skipping test case (assigned to different parallel job)" test=e2e/runtime/byzantine/executor-wrong run_id=0
level=info module=test-runner caller=root.go:367 ts=2020-07-13T10:51:34.823850193Z msg="running test case" test=e2e/runtime/byzantine/executor-straggler run_id=0
[ ... output trimmed ...]
level=info module=test-runner caller=root.go:425 ts=2020-07-13T10:51:34.823855891Z msg="passed test case" test=e2e/runtime/byzantine/executor-straggler run_id=0

Running the same command again resulted in the following two scenarios being run:

e2e/runtime/byzantine/executor-straggler
e2e/runtime/byzantine/merge-wrong

Fuller output:

+ WORKDIR=/home/tadej/Oasis/oasis-core
+ runtime_target=default
+ [[ '' == \i\n\t\e\l\-\s\g\x ]]
+ [[ '' != '' ]]
+ node_binary=/home/tadej/Oasis/oasis-core/go/oasis-node/oasis-node
+ test_runner_binary=/home/tadej/Oasis/oasis-core/go/oasis-test-runner/oasis-test-runner
+ [[ '' != '' ]]
+ ias_mock=true
+ set +x
+ /home/tadej/Oasis/oasis-core/go/oasis-test-runner/oasis-test-runner --basedir.no_cleanup --e2e.node.binary /home/tadej/Oasis/oasis-core/go/oasis-node/oasis-node --e2e/runtime.client.binary_dir /home/tadej/Oasis/oasis-core/target/default/debug --e2e/runtime.runtime.binary_dir /home/tadej/Oasis/oasis-core/target/default/debug --e2e/runtime.runtime.loader /home/tadej/Oasis/oasis-core/target/default/debug/oasis-core-runtime-loader --e2e/runtime.tee_hardware '' --e2e/runtime.ias.mock=true --remote-signer.binary /home/tadej/Oasis/oasis-core/go/oasis-remote-signer/oasis-remote-signer --log.level info --test 'e2e/runtime/byzantine/.*' --parallel.job_count 3 --parallel.job_index 2
level=info module=test-runner caller=root.go:352 ts=2020-07-13T10:52:12.53867671Z msg="skipping test case (assigned to different parallel job)" test=e2e/runtime/byzantine/executor-honest run_id=0
level=info module=test-runner caller=root.go:352 ts=2020-07-13T10:52:12.538721487Z msg="skipping test case (assigned to different parallel job)" test=e2e/runtime/byzantine/executor-wrong run_id=0
level=info module=test-runner caller=root.go:367 ts=2020-07-13T10:52:12.538734912Z msg="running test case" test=e2e/runtime/byzantine/executor-straggler run_id=0
[ ... output trimmed ...]
level=info module=test-runner caller=root.go:425 ts=2020-07-13T10:52:12.538745432Z msg="passed test case" test=e2e/runtime/byzantine/executor-straggler run_id=0
level=info module=test-runner caller=root.go:352 ts=2020-07-13T10:52:12.538757833Z msg="skipping test case (assigned to different parallel job)" test=e2e/runtime/byzantine/merge-straggler run_id=0
level=info module=test-runner caller=root.go:352 ts=2020-07-13T10:52:12.538768428Z msg="skipping test case (assigned to different parallel job)" test=e2e/runtime/byzantine/merge-honest run_id=0
level=info module=test-runner caller=root.go:367 ts=2020-07-13T10:52:12.538779146Z msg="running test case" test=e2e/runtime/byzantine/merge-wrong run_id=0
[ ... output trimmed ...]
level=info module=test-runner caller=root.go:425 ts=2020-07-13T10:52:12.538789464Z msg="passed test case" test=e2e/runtime/byzantine/merge-wrong run_id=0

EXPECTED RESULTS

Running the same command for parallel job execution should always result in the same set of scenarios being run.
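The two runs above both used job index 2, yet they share only executor-straggler, so separate runners cannot be relied on to split the scenarios cleanly. A minimal sketch of a check for this property is given here; checkCoverage and its inputs are hypothetical and are not part of oasis-test-runner. It only inspects the names listed in all, and the per-job assignments in main are made up for illustration.

```go
package main

import "fmt"

// checkCoverage returns an error unless every scenario in all is assigned to
// exactly one job. This helper is hypothetical and only illustrates the
// expected property; it is not taken from oasis-test-runner.
func checkCoverage(all []string, perJob [][]string) error {
	// Count how many jobs picked each scenario.
	counts := make(map[string]int, len(all))
	for _, job := range perJob {
		for _, name := range job {
			counts[name]++
		}
	}

	// Walk the slice (not the map) so the report order is stable.
	var problems []string
	for _, name := range all {
		if counts[name] != 1 {
			problems = append(problems, fmt.Sprintf("%s assigned %d times", name, counts[name]))
		}
	}
	if len(problems) > 0 {
		return fmt.Errorf("bad partition: %v", problems)
	}
	return nil
}

func main() {
	all := []string{"executor-honest", "executor-straggler", "merge-wrong"}

	// Made-up assignments: two runners picked the same scenario and nobody
	// picked executor-honest.
	perJob := [][]string{
		{"executor-straggler"},
		{"executor-straggler"},
		{"merge-wrong"},
	}
	fmt.Println(checkCoverage(all, perJob))
}
```

A check like this would only be meaningful if it collected the scenarios actually run by every job index of one CI build.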


ptrus · 2020-07-14T06:49:10Z

This is probably because we iterate over a map of scenarios (for which Go deliberately randomizes the iteration order).

See:

oasis-core/go/oasis-test-runner/cmd/root.go

Lines 280 to 289 in c56a606

    
    for scName, scenario := range common.GetScenarios() {
    	var match bool
    	match, err = regexp.MatchString(name, scName)
    	if err != nil {
    		return fmt.Errorf("root: bad scenario name regexp: %w", err)
    	}
    	if match {
    		matched[scenario] = true
    		anyMatched = true
    	}

I think this just means that scenarios will be grouped into different parallel jobs on each run, not that some cases will be missed or skipped. If we want to ensure a stable order, one possible fix (I think) is to sort this slice once it's built:

oasis-core/go/oasis-test-runner/cmd/root.go

Line 302 in c56a606

toRun = append(toRun, scenario)
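A minimal sketch of the sorting idea, under stated assumptions: the partition helper is hypothetical, and the round-robin rule (position modulo job count) is assumed for illustration rather than taken from root.go. The point is only that sorting the names before partitioning makes every runner see the same order, regardless of how the map was iterated.

```go
package main

import (
	"fmt"
	"sort"
)

// partition returns the scenario names assigned to job jobIndex out of
// jobCount jobs. The round-robin rule is an assumption for illustration;
// the real assignment logic in root.go may differ.
func partition(scenarios map[string]bool, jobCount, jobIndex int) []string {
	// Map iteration order is randomized, so collect the keys first...
	names := make([]string, 0, len(scenarios))
	for name := range scenarios {
		names = append(names, name)
	}
	// ...and sort them so that every runner works from the same order.
	sort.Strings(names)

	var assigned []string
	for i, name := range names {
		if i%jobCount == jobIndex {
			assigned = append(assigned, name)
		}
	}
	return assigned
}

func main() {
	// Scenario names as they appear in the logs of this issue.
	scenarios := map[string]bool{
		"e2e/runtime/byzantine/executor-honest":    true,
		"e2e/runtime/byzantine/executor-straggler": true,
		"e2e/runtime/byzantine/executor-wrong":     true,
		"e2e/runtime/byzantine/merge-honest":       true,
		"e2e/runtime/byzantine/merge-straggler":    true,
		"e2e/runtime/byzantine/merge-wrong":        true,
	}
	// Repeated runs print the same two scenarios for job index 2.
	fmt.Println(partition(scenarios, 3, 2))
}
```

With a stable order, the sets for job indices 0, 1 and 2 are disjoint and together cover all matched scenarios.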

kostko · 2020-07-14T07:10:24Z

Yeah we should sort the scenarios.

tjanez added the c:testing and c:bug labels Jul 13, 2020

tjanez self-assigned this Jul 13, 2020

tjanez mentioned this issue Jul 14, 2020

go/oasis-test-runner/cmd: Sort scenarios for correct parallel execution #3107

Merged

tjanez closed this as completed in #3107 Jul 14, 2020
