t/pmc/nci.t tests are fragile and broken by design #808

leto opened this Issue Aug 9, 2012 · 1 comment


None yet

2 participants

leto commented Aug 9, 2012

Currently the NCI tests in t/pmc/nci.t are fragile and are blocking the merging of the very important threads branch.

Originally it was thought that the threads branch on Darwin/PPC was the only platform that broke the NCI tests, but it has come to light that the test failures can be reproduced on various "slow" machines as well as clang with address sanitation turned on.

The fact that the NCI tests use "sleep" currently is very broken and fragile and only works on "faster" machines.
For more details, see http://lists.parrot.org/pipermail/parrot-dev/2012-August/007106.html

The NCI tests should be refactored so they do not depend on peppering various "sleep" calls in each test.
(done in 56c96dd)

I'm with Andy that sleep() is the real problem here.
1st the tests are way too fragile, and
2nd sleep() itself got more instable in the threads branch.

Esp. this section loops in while. How can we ensure that the alarm signal
is ever delivered, that the timer thread is not also waiting? Is sleep disturbed by another alarm?

thread.c-    ASSERT_ARGS(Parrot_thread_wait_for_notification)
thread.c-#ifdef PARROT_HAS_THREADS
thread.c:    LOCK(interp->sleep_mutex);
thread.c-    while (interp->wake_up == 0)
thread.c:        COND_WAIT(interp->sleep_cond, interp->sleep_mutex);
thread.c-    interp->wake_up = 0;
thread.c:    UNLOCK(interp->sleep_mutex);
thread.c-    Parrot_alarm_wait_for_next_alarm(interp);

The backtraces of the sleep-alarm deadlock are:

2304 Thread_2503
  2304 start
    2304 main
      2304 Parrot_api_run_bytecode
        2304 Parrot_pf_execute_bytecode_program
          2304 Parrot_cx_begin_execution
            2304 Parrot_cx_outer_runloop
              2304 Parrot_thread_wait_for_notification
                2304 _pthread_cond_wait
                  2304 __semwait_signal
                    2304 __semwait_signal
2304 Thread_2603
  2304 _pthread_start
    2304 Parrot_alarm_runloop
      2304 _pthread_cond_wait
        2304 __semwait_signal
          2304 __semwait_signal

e.g. If I remove the sleep 0.01 line in t/pmc/nci_37.pasm the nci tests succeed.
On linux amd64 also repro with tsan. See http://pastebin.com/gKRDTkh3

cd ~/bin
wget http://build.chromium.org/p/client.tsan/binaries/tsan-r4356-amd64-linux-self-contained.sh
chmod +x tsan-r4356-amd64-linux-self-contained.sh
cd - # threads branch

tsan-r4356-amd64-linux-self-contained.sh ./parrot t/pmc/nci_37.pasm
tsan-r4356-amd64-linux-self-contained.sh ./parrot t/pmc/task.t

So the sleep implementation with threads is racy.
rurban assumes the sleep thread blocks signals which arrive during the sleep, and the sleep loop never finishes.
I think that there's a design limitation that only one signal can be accepted per thread.

@rurban rurban added a commit that referenced this issue Aug 10, 2012
@rurban rurban [GH #808] Remove sleep calls in nci.t, because of signal deadlocks wi…
…th parrot threads

Even without sleep calls the tests succeed. But since it loops until the resuilt arrives, let
it busy loop a bit longer.
Note: This is a hack. sleep on threads should be fixed instead.
rurban commented Sep 2, 2012

remaining deadlock fixed with e1d4c06 in the threads branch.

@rurban rurban closed this Sep 2, 2012
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment