Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make test fails for most tests on a Parallels VM #1818

Open
stepthom opened this issue Sep 21, 2016 · 18 comments
Open

make test fails for most tests on a Parallels VM #1818

stepthom opened this issue Sep 21, 2016 · 18 comments

Comments

@stepthom
Copy link

stepthom commented Sep 21, 2016

I didn't see Parallels either way on list of supported VMs on the wiki page (https://github.com/mozilla/rr/wiki/Building-And-Installing), so I'm not sure if I'm barking up the wrong tree by trying to use rr in Parallels...

I'm running Parallels 9. I have PMU virtualization enabled. Guest OS is Ubuntu 14.04. I cloned the HEAD of rr's master branch, currently b0ae677. Built fine, but make test is failing for most the tests:

[dev@ubuntu:~/rr/obj (master)] $ make test
Running tests...
Test project /home/dev/rr/obj
        Start   1: 64bit_child
  1/1832 Test   #1: 64bit_child ...............................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.36 sec
        Start   2: 64bit_child-no-syscallbuf
  2/1832 Test   #2: 64bit_child-no-syscallbuf .................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.35 sec
        Start   3: _llseek
  3/1832 Test   #3: _llseek ...................................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.24 sec
        Start   4: _llseek-no-syscallbuf
  4/1832 Test   #4: _llseek-no-syscallbuf .....................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.25 sec
        Start   5: accept
  5/1832 Test   #5: accept ....................................................   Passed    0.33 sec
        Start   6: accept-no-syscallbuf
  6/1832 Test   #6: accept-no-syscallbuf ......................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.30 sec
        Start   7: alarm
  7/1832 Test   #7: alarm .....................................................***Failed  Error regular expression found in output. Regex=[FAILED]  1.27 sec
        Start   8: alarm-no-syscallbuf
  8/1832 Test   #8: alarm-no-syscallbuf .......................................***Failed  Error regular expression found in output. Regex=[FAILED]  1.24 sec
        Start   9: alarm2
  9/1832 Test   #9: alarm2 ....................................................***Failed  Error regular expression found in output. Regex=[FAILED]  1.24 sec
        Start  10: alarm2-no-syscallbuf
 10/1832 Test  #10: alarm2-no-syscallbuf ......................................***Failed  Error regular expression found in output. Regex=[FAILED]  1.23 sec
        Start  11: alsa_ioctl
 11/1832 Test  #11: alsa_ioctl ................................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.23 sec
        Start  12: alsa_ioctl-no-syscallbuf
 12/1832 Test  #12: alsa_ioctl-no-syscallbuf ..................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.24 sec
        Start  13: arch_prctl
 13/1832 Test  #13: arch_prctl ................................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.23 sec
        Start  14: arch_prctl-no-syscallbuf
 14/1832 Test  #14: arch_prctl-no-syscallbuf ..................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.26 sec
        Start  15: async_segv_ignored
 15/1832 Test  #15: async_segv_ignored ........................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.25 sec
        Start  16: async_segv_ignored-no-syscallbuf
 16/1832 Test  #16: async_segv_ignored-no-syscallbuf ..........................   Passed    0.16 sec
        Start  17: at_threadexit
 17/1832 Test  #17: at_threadexit .............................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.31 sec
        Start  18: at_threadexit-no-syscallbuf
 18/1832 Test  #18: at_threadexit-no-syscallbuf ...............................***Failed  Error regular expression found in output. Regex=[FAILED]  0.27 sec
        Start  19: bad_ip
 19/1832 Test  #19: bad_ip ....................................................   Passed    0.17 sec
        Start  20: bad_ip-no-syscallbuf
 20/1832 Test  #20: bad_ip-no-syscallbuf ......................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.24 sec
        Start  21: bad_syscall
 21/1832 Test  #21: bad_syscall ...............................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.24 sec
        Start  22: bad_syscall-no-syscallbuf
 22/1832 Test  #22: bad_syscall-no-syscallbuf .................................***Failed  Error regular expression found in output. Regex=[FAILED]  0.23 sec
        Start  23: barrier

If I look in Testing/Temporary/LastTest.log.tmp, I see things like:

Start testing: Sep 21 12:13 EDT
----------------------------------------------------------
1/1832 Testing: 64bit_child
1/1832 Test: 64bit_child
Command: "/bin/bash" "/home/dev/rr/src/test/basic_test.run" "-b" "/home/dev/rr" "/home/dev/rr/obj" "64bit_child"
Directory: /home/dev/rr/obj
"64bit_child" start time: Sep 21 12:13 EDT
Output:
----------------------------------------------------------
/home/dev/rr/src/test/util.sh: line 219: 18001 Aborted                 (core dumped) _RR_TRACE_DIR="$workdir" rr $GLOBAL_OPTIONS replay -a $replayflags > replay.out 2> replay.err
Test '64bit_child' FAILED: : error during replay:
--------------------------------------------------
[FATAL /home/dev/rr/src/ReplaySession.cc:847:check_ticks_consistency() errno: SUCCESS]
 (task 18002 (rec:17999) at time 104)
 -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: open'; expected 3086, got 4294970382
[FATAL /home/dev/rr/src/log.cc:283:emergency_debug() errno: SUCCESS] (session doesn't look interactive, aborting emergency debugging)
--------------------------------------------------
replay.out:
--------------------------------------------------
--------------------------------------------------
Test 64bit_child failed, leaving behind /tmp/rr-test-64bit_child-P3OscNkkp
To replay the failed test, run
  _RR_TRACE_DIR=/tmp/rr-test-64bit_child-P3OscNkkp rr replay
<end of output>
Test time =   0.36 sec
----------------------------------------------------------
Test Fail Reason:
Error regular expression found in output. Regex=[FAILED]
"64bit_child" end time: Sep 21 12:13 EDT
"64bit_child" time elapsed: 00:00:00
----------------------------------------------------------

The output of perf list is:

[dev@ubuntu:~/rr/obj (master)] $ perf list

List of pre-defined events (to be used in -e):
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  cache-references                                   [Hardware event]
  cache-misses                                       [Hardware event]
  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
  ref-cycles                                         [Hardware event]

  cpu-clock                                          [Software event]
  task-clock                                         [Software event]
  page-faults OR faults                              [Software event]
  context-switches OR cs                             [Software event]
  cpu-migrations OR migrations                       [Software event]
  minor-faults                                       [Software event]
  major-faults                                       [Software event]
  alignment-faults                                   [Software event]
  emulation-faults                                   [Software event]
  dummy                                              [Software event]

  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-store-misses                             [Hardware cache event]
  L1-dcache-prefetch-misses                          [Hardware cache event]
  L1-icache-load-misses                              [Hardware cache event]
  LLC-loads                                          [Hardware cache event]
  LLC-stores                                         [Hardware cache event]
  LLC-prefetches                                     [Hardware cache event]
  dTLB-load-misses                                   [Hardware cache event]
  dTLB-store-misses                                  [Hardware cache event]
  iTLB-loads                                         [Hardware cache event]
  iTLB-load-misses                                   [Hardware cache event]
  branch-loads                                       [Hardware cache event]
  branch-load-misses                                 [Hardware cache event]

  branch-instructions OR cpu/branch-instructions/    [Kernel PMU event]
  branch-misses OR cpu/branch-misses/                [Kernel PMU event]
  bus-cycles OR cpu/bus-cycles/                      [Kernel PMU event]
  cache-misses OR cpu/cache-misses/                  [Kernel PMU event]
  cache-references OR cpu/cache-references/          [Kernel PMU event]
  cpu-cycles OR cpu/cpu-cycles/                      [Kernel PMU event]
  instructions OR cpu/instructions/                  [Kernel PMU event]
  mem-loads OR cpu/mem-loads/                        [Kernel PMU event]
  mem-stores OR cpu/mem-stores/                      [Kernel PMU event]
  stalled-cycles-frontend OR cpu/stalled-cycles-frontend/ [Kernel PMU event]

  rNNN                                               [Raw hardware event descriptor]
  cpu/t1=v1[,t2=v2,t3 ...]/modifier                  [Raw hardware event descriptor]
   (see 'man perf-list' on how to encode it)

  mem:<addr>[:access]                                [Hardware breakpoint]

  [ Tracepoints not available: Permission denied ]

Also:

[dev@ubuntu:~/rr/obj (master)] $ dmesg | grep PMU
[    5.040752] Performance Events: PEBS fmt1+, 16-deep LBR, IvyBridge events, full-width counters, Intel PMU driver.
[    5.253574] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
@rocallahan
Copy link
Collaborator

expected 3086, got 4294970382

That looks ... bad. I haven't seen anything like that before. I suspect a bug in Parallel's PMU virtualization. I'd focus on recording and replaying the simple test and use strace or add logging to PerfCounters.cc to get a detailed log of what's happening. Also you could dump the results of rr dump somewhere so we can check if we got reasonable recording results.

@khuey
Copy link
Collaborator

khuey commented Sep 21, 2016

Those values are different by exactly 2^31, fwiw.

@rocallahan
Copy link
Collaborator

I guess you could try masking off the high bit and see how far that gets you...

@stepthom
Copy link
Author

@rocallahan Sorry to be a noob, but I just downloaded rr yesterday and know basically nothing about it or the codebase or how things work. I'm not sure how to perform the actions you suggested. I'm very happy to try if given a bit more direction.

One additional piece of information: I noticed today that each test sometimes passes, sometimes not. Seems to be some nondeterminism involved somehow. (Maybe that's expected, I'm not sure.)

@khuey
Copy link
Collaborator

khuey commented Sep 22, 2016

rr uses performance counters to establish a timeline of program execution. This is then used during replay to deliver asynchronous events (e.g. SIGALRM) to the tracee at the same point it was delivered by the kernel during recording. The code that interfaces with the performance counter APIs is in src/PerfCounters.cc.

@rocallahan is suggesting that you modify read_counter. If the counter values reported by Parallels is off by some fixed amount you may be able to compensate for it there.

@sidkshatriya
Copy link
Contributor

As pointed out by @rickard-von-essen in machyve/xhyve#91 (comment)

Just a note about Parallels and the bug you link. The author uses a ~4 y old version of Parallels Desktop for Mac so I wouldn't assume that it the current state of PD PMU virtualization.

@rickard-von-essen
Copy link

I can confirm that this looks the same on Parallels Desktop 12 for Mac Pro ed 12.0.2 (41353). With the virtualisation type set to Parallels.

@rickard-von-essen
Copy link

$ dmesg|grep PMU
[    0.009541] Performance Events: PEBS fmt2+, 16-deep LBR, Haswell events, full-width counters, Broken PMU hardware detected, using software events only.
$ uname -a
Linux fedora-24-x86-64.shared 4.5.5-300.fc24.x86_64 #1 SMP Thu May 19 13:05:32 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

@legal90 Do you know if there is anyone at Parallels who would like to dive into this?

@legal90
Copy link

legal90 commented Oct 25, 2016

Hi everyone!
Yes, unfortunately, we have some issues with PMU in Linux guests in Parallels Desktop 12 and earlier. Our engineers at Parallels are working on that and it should be fixed in one of the upcoming updates for Parallels Desktop. I will let you know here when it is fixed.
Sorry for the inconvenience 😞

@Gumix
Copy link

Gumix commented Aug 22, 2017

Hi! PMU virtualization for Linux guests is fixed in Parallels Desktop 13. Let me know if you have any issues.

@sidkshatriya
Copy link
Contributor

@Gumix That is great news! Does this mean that rr passes all tests when run in a Linux guest on Parallels Desktop 13?

@Gumix
Copy link

Gumix commented Aug 22, 2017

@sidkshatriya Well, most of them passed :) I'm not familiar with rr, so don't know how to investigate these failures, but I hope that they're not caused by virtualization (don't see anything strange in internal logs).

97% tests passed, 54 tests failed out of 2117

The following tests FAILED:
         42 - chew_cpu (Failed)
         43 - chew_cpu-no-syscallbuf (Failed)
        322 - prctl_caps (Failed)
        568 - strict_priorities (Failed)
        588 - syscallbuf_timeslice2 (Failed)
        600 - thread_yield (Failed)
        656 - async_signal_syscalls (Timeout)
        658 - async_signal_syscalls2 (Timeout)
        660 - async_signal_syscalls_siginfo (Failed)
        667 - block_clone_interrupted-no-syscallbuf (Failed)
        682 - breakpoint_overlap (Failed)
        692 - check_lost_interrupts (Failed)
        828 - string_instructions_async_signals (Failed)
        830 - string_instructions_async_signals_shared (Timeout)
        831 - string_instructions_async_signals_shared-no-syscallbuf (Failed)
        840 - syscallbuf_signal_blocking_read (Failed)
        877 - watchpoint_at_sched-no-syscallbuf (Failed)
        884 - async_signal_syscalls_100 (Timeout)
        886 - async_signal_syscalls_1000 (Timeout)
        887 - async_signal_syscalls_1000-no-syscallbuf (Failed)
        912 - break_time_slice (Failed)
        913 - break_time_slice-no-syscallbuf (Failed)
        938 - deliver_async_signal_during_syscalls (Timeout)
        1044 - syscallbuf_timeslice_250 (Failed)
        1045 - syscallbuf_timeslice_250-no-syscallbuf (Failed)
        1100 - chew_cpu-32 (Failed)
        1604 - stack_growth_syscallbuf-32 (Failed)
        1626 - strict_priorities-32 (Failed)
        1627 - strict_priorities-32-no-syscallbuf (Failed)
        1646 - syscallbuf_timeslice2-32 (Failed)
        1714 - async_signal_syscalls-32 (Timeout)
        1718 - async_signal_syscalls_siginfo-32 (Failed)
        1721 - async_usr1-32-no-syscallbuf (Failed)
        1724 - block_clone_interrupted-32 (Failed)
        1725 - block_clone_interrupted-32-no-syscallbuf (Failed)
        1750 - check_lost_interrupts-32 (Failed)
        1752 - clone_interruption-32 (Failed)
        1768 - daemon_read-32 (Failed)
        1803 - ignored_async_usr1-32-no-syscallbuf (Failed)
        1868 - seccomp_signals-32 (Failed)
        1886 - string_instructions_async_signals-32 (Failed)
        1889 - string_instructions_async_signals_shared-32-no-syscallbuf (Timeout)
        1898 - syscallbuf_signal_blocking_read-32 (Failed)
        1934 - watchpoint_at_sched-32 (Failed)
        1935 - watchpoint_at_sched-32-no-syscallbuf (Failed)
        1937 - watchpoint_before_signal-32-no-syscallbuf (Failed)
        1943 - async_signal_syscalls_100-32-no-syscallbuf (Failed)
        1944 - async_signal_syscalls_1000-32 (Timeout)
        1970 - break_time_slice-32 (Failed)
        1978 - checkpoint_async_signal_syscalls_1000-32 (Timeout)
        1983 - checkpoint_prctl_name-32-no-syscallbuf (Failed)
        1996 - deliver_async_signal_during_syscalls-32 (Timeout)
        2045 - reverse_alarm-32-no-syscallbuf (Failed)
        2102 - syscallbuf_timeslice_250-32 (Failed)

There are many errors like:

----------------------------------------------------------
/rr/src/test/util.sh: line 210: 31571 Aborted (core dumped)
_RR_TRACE_DIR="$workdir" test-monitor $TIMEOUT record.err rr $GLOBAL_OPTIONS record $LIB_ARG $RECORD_ARGS "$exe" $exeargs > record.out 2> record.err
Test 'syscallbuf_timeslice_250_32' FAILED: : error during recording:
--------------------------------------------------
[FATAL /rr/src/PerfCounters.cc:762:read_ticks() errno: SUCCESS]
 (task 31580 (rec:31580) at time 177)
 -> Assertion `!counting_period || interrupt_val <= adjusted_counting_period' failed to hold. Detected 741 ticks, expected no more than 685
warning: remote target does not support file transfer, attempting to access files from local filesystem.
[FATAL /rr/src/log.cc:356:emergency_debug() errno: SUCCESS] Can't resume execution from invalid state

@rocallahan
Copy link
Collaborator

What Linux kernel version are you using in the guest?

Showing us the errors from a few more of the tests might be helpful.

@Gumix
Copy link

Gumix commented Aug 23, 2017

What Linux kernel version are you using in the guest?

4.11.8-300.fc26.x86_64

Showing us the errors from a few more of the tests might be helpful.

  • 31 of 54 are "Assertion ... Detected ... ticks, expected no more than ...".
  • 11 Timeouts with empty output.
  • 8 "Error regular expression found in output. Regex=[FAILED]"
  • 3 "Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: kill'; expected 104428, got 107821"
  • 1 "FAILED: rip not found"

@rocallahan
Copy link
Collaborator

@Gumix
Copy link

Gumix commented Aug 24, 2017

Here are the results with 4.12.8-300.fc26.x86_64:

99% tests passed, 29 tests failed out of 2117

The following tests FAILED:
         13 - alarm-no-syscallbuf (Failed)
         42 - chew_cpu (Failed)
         43 - chew_cpu-no-syscallbuf (Failed)
        202 - keyctl (Failed)
        203 - keyctl-no-syscallbuf (Failed)
        568 - strict_priorities (Failed)
        667 - block_clone_interrupted-no-syscallbuf (Failed)
        692 - check_lost_interrupts (Failed)
        804 - reverse_step_threads (Failed)
        828 - string_instructions_async_signals (Failed)
        876 - watchpoint_at_sched (Failed)
        913 - break_time_slice-no-syscallbuf (Failed)
        931 - cont_signal-no-syscallbuf (Failed)
        1100 - chew_cpu-32 (Failed)
        1101 - chew_cpu-32-no-syscallbuf (Failed)
        1260 - keyctl-32 (Failed)
        1261 - keyctl-32-no-syscallbuf (Failed)
        1724 - block_clone_interrupted-32 (Failed)
        1725 - block_clone_interrupted-32-no-syscallbuf (Failed)
        1750 - check_lost_interrupts-32 (Failed)
        1802 - ignored_async_usr1-32 (Failed)
        1886 - string_instructions_async_signals-32 (Timeout)
        1887 - string_instructions_async_signals-32-no-syscallbuf (Failed)
        1888 - string_instructions_async_signals_shared-32 (Timeout)
        1899 - syscallbuf_signal_blocking_read-32-no-syscallbuf (Failed)
        1937 - watchpoint_before_signal-32-no-syscallbuf (Failed)
        1970 - break_time_slice-32 (Failed)
        1971 - break_time_slice-32-no-syscallbuf (Failed)
        2045 - reverse_alarm-32-no-syscallbuf (Failed)

Full log is here.

@rocallahan
Copy link
Collaborator

Thanks. It looks like PMU interrupts are sometimes being dropped :-(.

If you're sure you were running with 4.12.8 then that means interrupts are being dropped further down the stack, in Parallels or possibly by MacOS itself.

It looks like the chew_cpu test is reproducing this reliably.

@Gumix
Copy link

Gumix commented Aug 25, 2017

I will check it. BTW, Linux perf subsystem configures APIC to "NMI" delivery mode, but the hypervisor intercepts it and writes "Normal" delivery mode to the real APIC for PMI. It may lead to some inaccuracy in the results.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants