Concurrent rr recordings behave worse than serialized ones #1905

jdm · 2016-11-24T20:39:39Z

STR:

clone https://github.com/servo/servo/
run ./mach build --release
run ./mach test-wpt --release --chaos --processes 1 tests/wpt/mozilla/tests/mozilla/referrer-policy/no-referrer/http-rp/same-origin/http-http/iframe-tag/
3 tests should finish running without incident, then the set of tests will be repeated until an unexpected result is found. Let it run at least once more, then interrupt it.
run ./mach test-wpt --release --chaos --processes 2 tests/wpt/mozilla/tests/mozilla/referrer-policy/no-referrer/http-rp/same-origin/http-http/iframe-tag/

Expected:
The tests should take less time to run, and no unexpected timeouts should be encountered.

Actual:

  ▶ TIMEOUT [expected OK] /_mozilla/mozilla/referrer-policy/no-referrer/http-rp/same-origin/http-http/iframe-tag/generic.no-redirect.http.html
  │
  │ VMware, Inc.
  │ Gallium 0.4 on softpipe
  └ 3.3 (Core Profile) Mesa 12.0.1


Keno · 2016-11-25T00:20:29Z

Does this run under separate rr processes or multiple children under the same rr process?

jdm · 2016-11-25T00:25:31Z

Separate repository processes.

jdm · 2016-11-25T00:25:56Z

Sorry, phone autocorrect. rr processes.

Keno · 2016-11-25T00:26:32Z

How many cores does your machine have? rr spawns a number of threads to do data compression, so this could easily be explained by that.

rocallahan · 2016-11-25T00:32:25Z

I don't see how.

rocallahan · 2016-11-25T05:18:47Z

I get slightly different results:

[roc@glory servo]$ ./mach test-wpt --timeout-multiplier=10 --release --chaos --processes 1 tests/wpt/mozilla/tests/mozilla/referrer-policy/no-referrer/http-rp/same-origin/http-http/iframe-tag/
Running 3 tests in web-platform-tests

Ran 3 tests finished in 25.0 seconds.
  • 3 ran as expected. 0 tests skipped.

Running 3 tests in web-platform-tests

Ran 6 tests finished in 22.0 seconds.
  • 6 ran as expected. 0 tests skipped.

Running 3 tests in web-platform-tests

  Main thread got signal
  [6/3] /_mozilla/mozilla/referrer-policy/no-referrer/http-rp/same-origin/http-http/iframe-tag/generic.keep
mach interrupted by signal or user action. Stopping.
[roc@glory servo]$ ./mach test-wpt --timeout-multiplier=10 --release --chaos --processes 2 tests/wpt/mozilla/tests/mozilla/referrer-policy/no-referrer/http-rp/same-origin/http-http/iframe-tag/
Running 3 tests in web-platform-tests

Ran 3 tests finished in 8.0 seconds.
  • 3 ran as expected. 0 tests skipped.

Running 3 tests in web-platform-tests

  ▶ TIMEOUT [expected OK] /_mozilla/mozilla/referrer-policy/no-referrer/http-rp/same-origin/http-http/iframe-tag/generic.no-redirect.http.html
  │ 
  │ VMware, Inc.
  │ Gallium 0.4 on softpipe
  │ 3.3 (Core Profile) Mesa 12.0.1
  │ 
  │ VMware, Inc.
  │ Gallium 0.4 on softpipe
  └ 3.3 (Core Profile) Mesa 12.0.1

  [6/3] No tests running.

rocallahan · 2016-11-25T05:19:49Z

So --processes 2 is actually a lot faster, except that it sometimes times out anyway. Or something.

rocallahan · 2016-11-25T05:20:14Z

And I would have expected --timeout-multiplier=10 to fully prevent timeouts here.

rocallahan · 2016-11-25T05:28:37Z

and now I can't reproduce the problem at all :-(.

[roc@glory servo]$ ./mach test-wpt --timeout-multiplier=10 --release --chaos --processes 2 tests/wpt/mozilla/tests/mozilla/referrer-policy/no-referrer/http-rp/same-origin/http-http/iframe-tag/
Running 3 tests in web-platform-tests

Ran 3 tests finished in 23.0 seconds.
  • 3 ran as expected. 0 tests skipped.

Running 3 tests in web-platform-tests

Ran 6 tests finished in 18.0 seconds.
  • 6 ran as expected. 0 tests skipped.

Running 3 tests in web-platform-tests

Ran 9 tests finished in 10.0 seconds.
  • 9 ran as expected. 0 tests skipped.

Running 3 tests in web-platform-tests

Ran 12 tests finished in 11.0 seconds.
  • 12 ran as expected. 0 tests skipped.

Running 3 tests in web-platform-tests

Ran 15 tests finished in 8.0 seconds.
  • 15 ran as expected. 0 tests skipped.

Running 3 tests in web-platform-tests

Ran 18 tests finished in 8.0 seconds.
  • 18 ran as expected. 0 tests skipped.

Running 3 tests in web-platform-tests

Ran 21 tests finished in 12.0 seconds.
  • 21 ran as expected. 0 tests skipped.

rocallahan · 2016-11-25T05:31:27Z

If I remove the timeout-multiplier I can reproduce a timeout. From timestamps in the rr dump I can see the three tests taking 3, 14 and 3 seconds. What is the default timeout value?

jdm · 2016-11-25T06:02:18Z

10s, according to https://github.com/w3c/wptrunner/blob/master/wptrunner/wpttest.py#L5.
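
For reference, the arithmetic behind these timeouts can be sketched as follows. This is only a sketch: it assumes the effective limit is the default timeout multiplied by `--timeout-multiplier`, and `timed_out` is a made-up helper, not part of wptrunner. The durations are the 3, 14 and 3 seconds read from the rr dump.

```python
DEFAULT_TIMEOUT_S = 10  # default per-test timeout quoted in this thread


def timed_out(durations_s, timeout_multiplier=1):
    """Return, per test, whether its duration exceeds the effective limit.

    Assumption: effective limit = default timeout * timeout multiplier.
    """
    limit = DEFAULT_TIMEOUT_S * timeout_multiplier
    return [d > limit for d in durations_s]


# Durations observed under rr, in seconds.
observed = [3, 14, 3]

print(timed_out(observed))                         # no multiplier
print(timed_out(observed, timeout_multiplier=10))  # --timeout-multiplier=10
```

Under that assumption only the 14-second run exceeds the limit without a multiplier, and none do with a multiplier of 10.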

rocallahan · 2016-11-25T07:02:39Z

It looks to me like generic.no-redirect.http.html is just pretty close to the timeout edge when run under rr. One thing I notice is that most syscalls appear to not be buffered.

rocallahan · 2016-11-25T07:04:01Z

Exactly none of them, in fact.

rocallahan · 2016-11-25T07:06:08Z

Looks like glibc changes in Fedora 25 are the culprit for that.

rocallahan · 2016-11-25T07:13:00Z

Ah no, it's just that --chaos does rr record -n -c10000 which is performance-killing. Are those really needed? -c10000 is the worst; is it not possible to reproduce bugs without it, even after lots of runs? rr chaos mode tries to randomize context switch intervals without increasing execution time too much, so if possible one should just let it do its thing.

jdm · 2016-11-25T08:01:21Z

That's good to know!

rocallahan · 2016-11-25T09:01:56Z

I'm not sure what provoked your original bug report here. It is possible that when executing multiple tests under rr concurrently, sometimes two tests will be randomly assigned the same CPU which would lengthen the runtime of both tests, meaning that individual test times could increase (compared to running them sequentially) and trigger a timeout, even though overall test throughput should be no worse than running tests sequentially. If you still want to pursue this I suggest modifying the harness to report the run-times of individual tests to give us a better idea of what's going on.
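
A rough illustration of that hypothesis, as a sketch only: the model below assumes tests pinned to the same CPU share it equally, and the 6-second test durations are invented for the example (only the 10-second timeout comes from this thread). `shared_cpu_wall_times` is a hypothetical helper, not something rr or the harness provides.

```python
TIMEOUT_S = 10.0  # default per-test timeout mentioned earlier in the thread


def shared_cpu_wall_times(work_s):
    """Wall-clock finish time of each test when all share one CPU equally.

    work_s[i] is the CPU time test i needs when it has the CPU to itself.
    """
    order = sorted(range(len(work_s)), key=lambda i: work_s[i])
    finish = [0.0] * len(work_s)
    now = 0.0            # elapsed wall-clock time
    done = 0.0           # CPU time each still-running test has received
    remaining = len(work_s)
    for i in order:
        # Each running test gets 1/remaining of the CPU, so delivering
        # (work_s[i] - done) more CPU seconds takes that many times longer.
        now += (work_s[i] - done) * remaining
        done = work_s[i]
        finish[i] = now
        remaining -= 1
    return finish


work = [6.0, 6.0]  # invented: two tests needing 6 s of CPU each

sequential = list(work)               # run one after another: 6 s each
concurrent = shared_cpu_wall_times(work)

print("sequential:", sequential, "total", sum(sequential))
print("same CPU:  ", concurrent, "total", max(concurrent))
print("timeouts when sharing:", [t > TIMEOUT_S for t in concurrent])
```

In this model both runs finish all the work in the same total time, but each individual test takes twice as long when sharing a CPU and crosses the 10-second limit.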
