Skip to content

Infinite loop in nunit 3.9 #2761

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
algol-fi opened this issue Mar 15, 2018 · 19 comments
Closed

Infinite loop in nunit 3.9 #2761

algol-fi opened this issue Mar 15, 2018 · 19 comments

Comments

@algol-fi
Copy link

I am running a test suite with nunit 3.9 using nunit3-console, that randomly blocks at the end of the run. All of the tests derive from a base class that is declared with [Parallelizable(ParallelScope.All)] attribute.

I managed to break into nunit and the problem is that sometimes the NonParallel worker is stuck inside the OnEndOfShift() method of ParallelWorkItemDispatcher. The _topLevelWorkItem state is WorkItemState.Complete, and after all of the shifts have been run their queues are empty. No other worker is changing the state of the _topLevelWorkItem. Given this situation, the following code inside OnEndOfShift() is an endless loop:

// Shift has ended but all work may not yet be done
while (_topLevelWorkItem.State != WorkItemState.Complete)
{
    // This will return null if all queues are empty.
    nextShift = SelectNextShift();
    if (nextShift != null)
    {
        ShiftStarting?.Invoke(nextShift);
        nextShift.Start();
        return;
    }
}

The name of the _topLevelWorkItem is the name of the dll containing all the tests.
I was not able to recreate the problem with a small set of tests but I can make investigations about variable values, execution stacks, or order of events, if you need info.

@CharliePoole
Copy link
Member

CharliePoole commented Mar 15, 2018

The code you posted doesn't seem to be able to cause an infinite loop, at least not if the value of _topLevelWorkItem.State is WorkItemState.Complete as you say. In that case, the loop would exit immediately.

OTOH, if you meant to say it's not Complete, and no other tests are running, this would definitely lock things up.

It's interesting that the non-parallel worker is the one having problems. Do you have any non-parallel tests, or do they all run in parallel? What happens if you add a non-parallel test?

Using --trace:Info would give us more information about what's happening as execution switches from one worker to another, without being quite so voluminous as --trace:Debug.

@algol-fi
Copy link
Author

Sorry, I messed things up, I meant that the state is WorkItemState.Running. I don't have any non-parallel test, as I mentioned in the first message, every test class inherits from a base class that has the [Parallelizable(ParallelScope.All)] attribute. I can try adding a single non-parallel test to see if this changes the outcome.

@CharliePoole
Copy link
Member

I figured. 😄 Yes, try that. My theory is that an unused non-parallel worker may be a special case, since the workers are all created up front, whether needed or not. Otherwise, the info trace will give more info as to what was running immediately before the hangup.

@algol-fi
Copy link
Author

I tried to add a non parallel test but I managed to replicate the hang, so it makes no difference.

However, here's something I noticed in the logs (I emitted a couple of additional messages in the code to help me understand what was going on). When the tests ends correctly the sequence I see regarding the dll is:

18:33:09.107 Debug [ 5] Dispatcher: Using NonParallel strategy for Company.QA.Product.Api.Tests.dll
18:33:09.112 Info  [ 9] TestWorker: NonParallelWorker executing Company.QA.Product.Api.Tests.dll
18:33:09.142 Info  [ 9] Dispatcher: Top level work Item Company.QA.Product.Api.Tests.dll state: Running, owned by NonParallelWorker
18:35:57.603 Debug [11] Dispatcher: Using Direct strategy for Company.QA.Product.Api.Tests.dll OneTimeTearDown
18:35:57.603 Info  [11] WorkItem: Workitem Company.QA.Product.Api.Tests.dll completed by managed thread ParallelWorker#2

So it seems that even if the work item is executed by NonParallelWorker the OneTimeTearDown and the completion are performed by a ParallelWorker. I guess that this is expected behaviour.
When the execution remains stuck, the last two lines are not present, so no OneTimeTearDown and no log about the completion of the work item. As far as I understood, since we have a number of nested namespaces, after all the tests in a namespace are completed, the namespace itself is completed and a OneTimeTearDown is invoked for the namespace. It seems that something is blocking this roll-up process. I'll keep investigating.

@CharliePoole
Copy link
Member

The dispatching of the OneTimeTearDown is normal behavior.

Assuming that you have no specific ParallelizationAttribute at the assembly level, we kick it off on the non-parallel queue. To avoid unnecessary scheduling of and switching of shifts for the one-time teardown, NUnit remembers that you had no specific preference and uses the Direct strategy, that is, executes the one-time teardown on the thread that activated the one-time teardown. That would be whatever thread ran the last test, causing the higher level item to be complete.

It's possible you have discovered a race condition in that logic. When the last actual test completes, that "rollup" would normally start. The one-time teardown for each containing suite, starting with the fixture itself, would need to execute. You should check the sequence of those calls. DId anything happen in the teardowns to hang the proccess?

@rprouse
Copy link
Member

rprouse commented Mar 16, 2018

Possible duplicate of #2764

@herebebeasties
Copy link

See also #2810.

@jnm2
Copy link
Contributor

jnm2 commented May 28, 2018

I caught a hang live on my machine that I've seen about 2% of the time on AppVeyor. Saved a minidump and attached VS to it!

Going to do my best to leave the process running in case one of you wants me to try something.

@jnm2

This comment has been minimized.

@jnm2

This comment has been minimized.

@jnm2

This comment has been minimized.

@jnm2
Copy link
Contributor

jnm2 commented May 28, 2018

/cc @CharliePoole for insight.

@CharliePoole
Copy link
Member

@jnm2 There's not much specific that we can say from the info posted.

In general, it looks like this...

  1. The event pump thread is doing what it should do... waiting for events. It will be shut down when the run finishes. Normally there is only one thread but if you were running NUnit tests of the event pump, then there may be a test thread. If so, that may be the problem.

  2. The seven threads that have nothing to do are also doing what they need to do... waiting for the active thread to either terminate the run or create some new work items to execute.

  3. The eighth thread is likely the culprit here. Everything is waiting on it but nothing is happening.

I have only been able to debug situations like this by looking at how we got to this place - that is at what tests were running on what threads before we got to the situation. An Info-level trace usually has the right information.

@dukearena
Copy link

dukearena commented Jul 6, 2018

I managed to reproduce the issue @algol-fi wrote here.
Note that it can be reproduced also with Nunit 3.10.1.

The problem lies in the fact that some of the ParallelWorkers die during execution and before completion, and this seems to block the roll-up process he mentioned.
This is also proved by logs reading this in the middle of the file:

ParallelWorker#1 stopping - 8 WorkItems processed.

Looking at the code, I noticed this block:

try
{
BeforeTest(context);
context.CurrentResult = innerCommand.Execute(context);
}
catch (Exception ex)
{
#if THREAD_ABORT
if (ex is ThreadAbortException)
Thread.ResetAbort();
#endif
context.CurrentResult.RecordException(ex);
}
finally
{
if (context.ExecutionStatus != TestExecutionStatus.AbortRequested)
AfterTest(context);
}

According to MSDN, ThreadAbortException is a special exception that always rethrows even if caught, unless Thread.ResetAbort() is called. (reference: https://msdn.microsoft.com/en-us/library/system.threading.threadabortexception.aspx#Remarks).

So, if a test has a TimeoutAttribute that expires during execution, it is caught by the catch block and the ResetAbort() method is called, but if the timeout expires during a TearDown, the exception is thrown during the AfterTest method execution, and nobody cancels the ThreadAbort, so eventually the thread is closed.

You can find an example reproducing the issue here: https://github.com/dukearena/nunit-issue-2761-confirm
Instruction to recreate the issue can be found into the repository home page.

@CharliePoole
Copy link
Member

One note about parallel workers "dying" - In the course of execution, it is normal for an entire new set of parallel workers to be created and then terminated.

For example, when a non-parallel fixture starts executing, NUnit does not yet know whether some of the tests in that fixture will run in parallel. If they do call for parallel execution, we don't want them to run in parallel with tests from other fixtures, since this is a non-parallel fixture. Therefore, we create an entire new set of workers to run the tests within the fixture. When the fixture completes, all those workers are stopped. There's a bit of inefficiency in doing this, but it isn't large. I originally planned to optimize it should the need arise but it never did.

So the line in the log reading

ParallelWorker#1 stopping - 8 WorkItems processed.

is probably due to this. If the thread had merely been killed due to an error, I don't think you would get a message at all. However, I'd need to see other entries in the log around that time to be sure.

@jnm2
Copy link
Contributor

jnm2 commented Jul 6, 2018

Ah, this sounds similar to #2328!

@CharliePoole
Copy link
Member

@dukearena Regarding abort during teardown, see #352.

I created this issue four years ago. It has been deprioritized a few times and nobody has ever taken it on as a task. I continue to feel that high priority was the right call back in 2014.

It seems to me, however, that we could rather easily fix the problem of not calling ResetAbort as you called out in this issue.

@dukearena
Copy link

@CharliePoole regarding the logs, this is a more detailed explanation:
the issue comes from the fact that the fixture is not yet completed but a test has ended due to the Thread.Abort() when AfterTest function is running.
That's why the sample project contains more than 8 tests per fixture; it needs to close all the workers while some test still has to be executed.

So, TimeoutAttribute closes the thread and, since nobody is calling Thread.ResetAbort() in that point, the exception is thrown again and again (even if catched) until this level:

private void TestWorkerThreadProc()
{
_running = true;
try
{
while (_running)
{
_currentWorkItem = WorkQueue.Dequeue();
if (_currentWorkItem == null)
break;
log.Info("{0} executing {1}", _workerThread.Name, _currentWorkItem.Name);
_currentWorkItem.TestWorker = this;
// During this Busy call, the queue state may be saved.
// This gives us a new set of queues, which are initially
// empty. The intention is that only children of the current
// executing item should make use of the new set of queues.
// TODO: If we had a separate NonParallelTestWorker, it
// could simply create the isolated queue without any
// worrying about competing workers.
Busy(this, _currentWorkItem);
// Because we execute the current item AFTER the queue state
// is saved, its children end up in the new queue set.
_currentWorkItem.Execute();
// This call may result in the queues being restored. There
// is a potential race condition here. We should not restore
// the queues unless all child items have finished.
Idle(this, _currentWorkItem);
++_workItemCount;
}
}
finally
{
log.Info("{0} stopping - {1} WorkItems processed.", Name, _workItemCount);
}
}

So, even if the fixture is not completed, it exits the while block and stops the worker without restarting it anymore.

Adding a catch block in TestWorker.cs helped me to find this out

@CharliePoole
Copy link
Member

@dukearena OK, I see the point now. In any case, I think your fix takes care of the problem. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants