Concurrent rr recordings behave worse than serialized ones #1905

Open
jdm opened this issue Nov 24, 2016 · 17 comments

@jdm

jdm commented Nov 24, 2016

STR:

  1. clone https://github.com/servo/servo/
  2. run ./mach build --release
  3. run ./mach test-wpt --release --chaos --processes 1 tests/wpt/mozilla/tests/mozilla/referrer-policy/no-referrer/http-rp/same-origin/http-http/iframe-tag/
  4. 3 tests should finish running without incident, then the set of tests will be repeated until an unexpected result is found. Let it run at least once more, then interrupt it.
  5. run ./mach test-wpt --release --chaos --processes 2 tests/wpt/mozilla/tests/mozilla/referrer-policy/no-referrer/http-rp/same-origin/http-http/iframe-tag/

Expected:
The tests should take less time to run, and no unexpected timeouts should be encountered.

Actual:

  ▶ TIMEOUT [expected OK] /_mozilla/mozilla/referrer-policy/no-referrer/http-rp/same-origin/http-http/iframe-tag/generic.no-redirect.http.html
  │
  │ VMware, Inc.
  │ Gallium 0.4 on softpipe
  └ 3.3 (Core Profile) Mesa 12.0.1

@Keno
Member

Keno commented Nov 25, 2016

Does this run under separate rr processes or multiple children under the same rr process?

@jdm
Author

jdm commented Nov 25, 2016

Separate repository processes.

@jdm
Author

jdm commented Nov 25, 2016

Sorry, phone autocorrect. rr processes.

@Keno
Member

Keno commented Nov 25, 2016

How many cores does your machine have? rr spawns a number of threads to do data compression, so this could easily be explained by that.

@rocallahan
Collaborator

I don't see how.

@rocallahan
Collaborator

I get slightly different results:

[roc@glory servo]$ ./mach test-wpt --timeout-multiplier=10 --release --chaos --processes 1 tests/wpt/mozilla/tests/mozilla/referrer-policy/no-referrer/http-rp/same-origin/http-http/iframe-tag/
Running 3 tests in web-platform-tests

Ran 3 tests finished in 25.0 seconds.
  • 3 ran as expected. 0 tests skipped.

Running 3 tests in web-platform-tests

Ran 6 tests finished in 22.0 seconds.
  • 6 ran as expected. 0 tests skipped.

Running 3 tests in web-platform-tests

  Main thread got signal
  [6/3] /_mozilla/mozilla/referrer-policy/no-referrer/http-rp/same-origin/http-http/iframe-tag/generic.keep
mach interrupted by signal or user action. Stopping.
[roc@glory servo]$ ./mach test-wpt --timeout-multiplier=10 --release --chaos --processes 2 tests/wpt/mozilla/tests/mozilla/referrer-policy/no-referrer/http-rp/same-origin/http-http/iframe-tag/
Running 3 tests in web-platform-tests

Ran 3 tests finished in 8.0 seconds.
  • 3 ran as expected. 0 tests skipped.

Running 3 tests in web-platform-tests

  ▶ TIMEOUT [expected OK] /_mozilla/mozilla/referrer-policy/no-referrer/http-rp/same-origin/http-http/iframe-tag/generic.no-redirect.http.html
  │ 
  │ VMware, Inc.
  │ Gallium 0.4 on softpipe
  │ 3.3 (Core Profile) Mesa 12.0.1
  │ 
  │ VMware, Inc.
  │ Gallium 0.4 on softpipe
  └ 3.3 (Core Profile) Mesa 12.0.1

  [6/3] No tests running.

@rocallahan
Collaborator

So --processes 2 is actually a lot faster, except that it sometimes times out anyway. Or something.

@rocallahan
Collaborator

And I would have expected --timeout-multiplier=10 to fully prevent timeouts here.

@rocallahan
Collaborator

and now I can't reproduce the problem at all :-(.

[roc@glory servo]$ ./mach test-wpt --timeout-multiplier=10 --release --chaos --processes 2 tests/wpt/mozilla/tests/mozilla/referrer-policy/no-referrer/http-rp/same-origin/http-http/iframe-tag/
Running 3 tests in web-platform-tests

Ran 3 tests finished in 23.0 seconds.
  • 3 ran as expected. 0 tests skipped.

Running 3 tests in web-platform-tests

Ran 6 tests finished in 18.0 seconds.
  • 6 ran as expected. 0 tests skipped.

Running 3 tests in web-platform-tests

Ran 9 tests finished in 10.0 seconds.
  • 9 ran as expected. 0 tests skipped.

Running 3 tests in web-platform-tests

Ran 12 tests finished in 11.0 seconds.
  • 12 ran as expected. 0 tests skipped.

Running 3 tests in web-platform-tests

Ran 15 tests finished in 8.0 seconds.
  • 15 ran as expected. 0 tests skipped.

Running 3 tests in web-platform-tests

Ran 18 tests finished in 8.0 seconds.
  • 18 ran as expected. 0 tests skipped.

Running 3 tests in web-platform-tests

Ran 21 tests finished in 12.0 seconds.
  • 21 ran as expected. 0 tests skipped.

@rocallahan
Collaborator

If I remove the timeout-multiplier I can reproduce a timeout. From timestamps in the rr dump I can see the three tests taking 3, 14 and 3 seconds. What is the default timeout value?

@jdm
Author

jdm commented Nov 25, 2016

@rocallahan
Collaborator

It looks to me like generic.no-redirect.http.html is just pretty close to the timeout edge when run under rr. One thing I notice is that most syscalls appear to not be buffered.

@rocallahan
Collaborator

Exactly none of them, in fact.

@rocallahan
Collaborator

Looks like glibc changes in Fedora 25 are the culprit for that.

@rocallahan
Collaborator

Ah no, it's just that --chaos does rr record -n -c10000, which is performance-killing (-n disables the syscall buffer, which is why none of the syscalls were buffered). Are those flags really needed? -c10000 is the worst; is it not possible to reproduce bugs without it, even after lots of runs? rr chaos mode tries to randomize context switch intervals without increasing execution time too much, so if possible one should just let it do its thing.

@jdm
Author

jdm commented Nov 25, 2016

That's good to know!

@rocallahan
Collaborator

I'm not sure what provoked your original bug report here. It is possible that when executing multiple tests under rr concurrently, sometimes two tests will be randomly assigned the same CPU which would lengthen the runtime of both tests, meaning that individual test times could increase (compared to running them sequentially) and trigger a timeout, even though overall test throughput should be no worse than running tests sequentially. If you still want to pursue this I suggest modifying the harness to report the run-times of individual tests to give us a better idea of what's going on.
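
As a rough illustration of the per-test timing suggested above, here is a minimal Python sketch. It is not the Servo or wptrunner harness code: the names time_command and time_all are hypothetical, and the demo commands are placeholder sleeps standing in for tests launched under rr. It only shows how wall-clock time for each command could be recorded separately while several run at once, so that individual slowdowns become visible even when total throughput improves.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def time_command(cmd):
    """Run one command and return (cmd, wall-clock seconds, exit code)."""
    start = time.monotonic()
    proc = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return cmd, time.monotonic() - start, proc.returncode


def time_all(cmds, processes):
    """Run the commands with at most `processes` at once, timing each one."""
    # Threads are enough here: each worker just blocks on its child process.
    with ThreadPoolExecutor(max_workers=processes) as pool:
        return list(pool.map(time_command, cmds))


if __name__ == "__main__":
    # Placeholder workloads (assumption); a real harness would start each
    # test under rr here instead of sleeping.
    demo = [[sys.executable, "-c", "import time; time.sleep(0.2)"] for _ in range(3)]
    for processes in (1, 2):
        for cmd, seconds, code in time_all(demo, processes):
            print(f"processes={processes} exit={code} {seconds:.2f}s")
```

Comparing the per-command times between the processes=1 and processes=2 runs would show whether individual tests get slower under concurrency, which is the case that could push a test past its timeout.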
