Only restore isolated queues when current queues are completed #2445

ChrisMaddock · 2017-09-17T12:42:23Z

What I believe was happening - RestoreQueues was being called while the OneTimeTearDown was waiting to be executed. RestoreQueues was then removing the current queues, and the as-yet-unexecuted OneTimeTearDown was being lost.

I added a Guard to RestoreQueues() - as I can't think of any situation where we would want to restore the queues while the existing queues still contain work. Hopefully in future that would give us a crash, rather than a hang. I then added the check to OnEndOfShift(), so the queues are only popped from the save-state-stack if all current work has been completed.

This is a new area to me - would appreciate someone checking this is all following what's expected here!

CharliePoole

Sorry I haven't gotten to this sooner or looked more deeply into it. As I stated, I"m suspicious of a hidden race condition but I'll have to leave it to you to do the thinking. 😄

CharliePoole · 2017-09-17T16:39:39Z

src/NUnitFramework/framework/Internal/Execution/ParallelWorkItemDispatcher.cs

@@ -242,7 +244,7 @@ private void RestoreQueues()

        private void OnEndOfShift(object sender, EventArgs ea)
        {
-            if (_isolationLevel > 0)
+            if (_isolationLevel > 0 && Shifts.All(q => !q.HasWork))


I don't actually understand what happened here. I know I added code to fix an earlier bug that made exactly this test using a different, non-Linq implementation. But that change seems to have disappeared!

Anyway, this is definitely needed.

However... I think there is more to it. It's possible for all the queues to be empty while there is still unfinished work because some fixture is running but has not yet queued its child tests. This is handled in the immediately following code for the purpose of not terminating the appication...

// Shift has ended but all work may not yet be done while (_topLevelWorkItem.State != WorkItemState.Complete) { // This will fail if there is no work - all queues empty. // In that case, we just continue the loop until either // a shift is started or all the work is complete. if (StartNextShift()) return; }

I don't have time to step through it right now, but I'm worried that there's a race condition buried here, where a new test gets queued just as you are making your test.

You are using a different condition here than in your guard. I tried switching this to Queues.All(q => q.IsEmpty) and it works some of the time, but we need to add a null check when dequeuing and the guard failed one time making me think there is a race condition.

I also tried by changing the guard to Shifts.All(q => !q.HasWork) but that doesn't work.

I am running out of time today, I will try to get back to looking at it tonight.

@CharliePoole - yes, that sounds feasible. I'm not totally sure how to deal with that one. The whole saving and restoring queues logic seems difficult to multi-thread - it wasn't entirely clear what the side-effects of dealing with #2428 would be here either.

@rprouse - Aha! I umm'd and arr'd over which condition to use, and seem to have ended up with one of each! I thought the two should be functionally equivalent here - interested as to what's not working by switching them around...

but we need to add a null check when dequeuing

@rprouse - what did you mean by that? Are you possibly seeing #2428? That needs to be dealt with as a separate issue, imo.

rprouse

This is a tough one to review so I pulled your branch and added a test project to the solution with the following test,

{
[TestFixture]
public class NunitParallelizationTest
{
    [Parallelizable(ParallelScope.All)]
    [Test]
    public void Test([Range(1, 10)] int x)
    {
        TestContext.WriteLine($"Test {x}");
    }

    [OneTimeSetUp]
    public void OneTimeSetup()
    {
        TestContext.WriteLine("OneTimeSetup");
    }

    [OneTimeTearDown]
    public void OneTimeTearDown()
    {
        TestContext.WriteLine("OneTimeTearDown");
    }
}

Running with your Guard in place, but your fix commented out, it crashes AND hangs every time for me and the OneTimeTearDown is not run.

Running with the Guard AND your fix, it runs every time and the OneTimeSetup and OneTimeTearDown are run correctly.

I also tried it without OneTimeSetup and OneTimeTearDown and get the same results. I am going to try a few other cases of parallelization, but so far looks good to me.

The only thing, do you think there is a better way to add tests for this? You are testing the guard, but not the fix. Do you think the above code would be a valid enough test if better named?

rprouse

Continuing to test and I've managed to get it to crash/hang again by adding in an STA queue. Test code is,

[TestFixture]
public class NunitParallelizationTest
{
    [Parallelizable(ParallelScope.All)]
    [Test]
    public void Test([Range(1, 10)] int x)
    {
        TestContext.WriteLine($"Test {x}");
    }

    [OneTimeSetUp]
    public void OneTimeSetup()
    {
        TestContext.WriteLine("OneTimeSetup");
    }

    [OneTimeTearDown]
    public void OneTimeTearDown()
    {
        TestContext.WriteLine("OneTimeTearDown");
    }
}

[TestFixture]
[Parallelizable(ParallelScope.All)]
public class NunitParallelizationTest2
{
    [Test]
    public void Test([Range(1, 10)] int x)
    {
        TestContext.WriteLine($"Test {x}");
    }

    [OneTimeSetUp]
    public void OneTimeSetup()
    {
        TestContext.WriteLine("OneTimeSetup");
    }

    [OneTimeTearDown]
    public void OneTimeTearDown()
    {
        TestContext.WriteLine("OneTimeTearDown");
    }
}

[TestFixture]
[Parallelizable(ParallelScope.Children)]
public class NunitParallelizationTest3
{
    [Test]
    public void Test([Range(1, 10)] int x)
    {
        TestContext.WriteLine($"Test {x}");
    }

    [OneTimeSetUp]
    public void OneTimeSetup()
    {
        TestContext.WriteLine("OneTimeSetup");
    }

    [OneTimeTearDown]
    public void OneTimeTearDown()
    {
        TestContext.WriteLine("OneTimeTearDown");
    }
}

[TestFixture]
[RequiresThread(System.Threading.ApartmentState.STA)]
public class NunitParallelizationTestSTA
{
    [Test]
    [Parallelizable(ParallelScope.All)]
    public void Test([Range(1, 10)] int x)
    {
        TestContext.WriteLine($"Test {x}");
    }

    [OneTimeSetUp]
    public void OneTimeSetup()
    {
        TestContext.WriteLine("OneTimeSetup");
    }

    [OneTimeTearDown]
    public void OneTimeTearDown()
    {
        TestContext.WriteLine("OneTimeTearDown");
    }
}

Testing fix, more to come...

CharliePoole · 2017-09-17T17:20:57Z

Nice test! I think you have hit the race condition I was talking about.

Unfortunately, when we test a queue for work available, we only get the work that has already been queued. In the case of final termination of the app, we are able to test the top level work item for completion. In this case, where an NP fixture is running tests in parallel, we would need to test for completion of the particular fixture that caused the queue to be saved in the first place. This isn't an enormous change but it's a bit more than one might have thought.

@ChrisMaddock I'm not looking at the code, so don't trust any of the following. 😈

When we save the queues, we always do it because we are starting a particular composite work item. I think we should save that work item along with the queues, perhaps in a parallel stack. If the queues all become empty but that work item has not yet completed, we should actually not end the shift. Instead, we should continue until there is really no more work to be done.

That paragraph was probably easier to write than the code will be. 😭

ChrisMaddock · 2017-09-17T22:24:16Z

Replying to everything bit by bit. 🙂

The only thing, do you think there is a better way to add tests for this? You are testing the guard, but not the fix. Do you think the above code would be a valid enough test if better named?

@rprouse - Yes, I agree. I did a similar thing to debug. I wasn't too sure how better to test this, other than smoke test it - I wondered if using a variety of paralellization modes in the NUnit might give us a better smoke-test coverage there.

I didn't add the precise test, as it's race-conditiony, and only one of many possible combinations of parallelisation attributes which may cause problems. I'm not sure how best to approach this - the specific issue here I believe is a combination of more tests than workers, test cases in parallel, AND the containing fixture in Non-Parallel. It feels to me like any combination of these things could potentially cause us issues in the future - and it would be great if we could test all that somehow, I'm just not too sure how to do it yet...

Continuing to test and I've managed to get it to crash/hang again by adding in an STA queue.

Good find! 😞

When we save the queues, we always do it because we are starting a particular composite work item. I think we should save that work item along with the queues, perhaps in a parallel stack. If the queues all become empty but that work item has not yet completed, we should actually not end the shift. Instead, we should continue until there is really no more work to be done.

@CharliePoole - that's an interesting idea. ParallelWorkItemDispatcher already saves the top level work item to figure out completion, so now we're saving and restoring queues, it seems fairly logical to replicate that behaviour down the tree.

Both - afraid I'm pretty flat out for the next few days, and likely won't have a chance to come back to this for a bit. Please feel free to move it forward in my absence!

CharliePoole · 2017-09-27T00:44:53Z

@ChrisMaddock I think my fix to #2454, PR #2461, will at least partially fix this. Looking at your guard statement, it may be better than what I did. If you don't have time to get back to this one, I can work on it after or in conjunction with #2461 now that I'm back from my trip.

ChrisMaddock · 2017-09-27T09:38:41Z

@CharliePoole - oh, this relates to my comment on #2461 - great minds! 😄

Would appreciate you having a look at this - sorry, had too much going on the last few weeks to revisit this, and I think it could use a fix ASAP.

CharliePoole · 2017-09-27T12:51:17Z

@ChrisMaddock Per @rprouse #2438 is fixed now. I'm going back to bed but I'll check this later to see if any of the changes need to go into a re-design.

ChrisMaddock added 2 commits September 17, 2017 13:22

Only restore isolated queues when current queues are completed

b4cc488

Test for exception

a339b4a

ChrisMaddock requested review from rprouse and CharliePoole September 17, 2017 12:42

ChrisMaddock mentioned this pull request Sep 17, 2017

NUnit 3.8+ does not finish running tests #2395

Closed

CharliePoole reviewed Sep 17, 2017

View reviewed changes

rprouse approved these changes Sep 17, 2017

View reviewed changes

rprouse requested changes Sep 17, 2017

View reviewed changes

ChrisMaddock mentioned this pull request Sep 22, 2017

[Parallelizable] hangs after a few tests #2438

Closed

rprouse pushed a commit to rprouse/NUnit.Parallelism.Integration.Tests that referenced this pull request Sep 27, 2017

Project to test nunit/nunit#2445 and related issues

7381656

ChrisMaddock mentioned this pull request Sep 27, 2017

Ensure queues are not started when they are supposed to be paused or stopped #2461

Merged

ChrisMaddock closed this Oct 5, 2017

ChrisMaddock deleted the issue-2438 branch October 5, 2017 08:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Only restore isolated queues when current queues are completed #2445

Only restore isolated queues when current queues are completed #2445

ChrisMaddock commented Sep 17, 2017

CharliePoole left a comment

CharliePoole Sep 17, 2017

CharliePoole Sep 17, 2017

rprouse Sep 17, 2017

ChrisMaddock Sep 17, 2017

ChrisMaddock Sep 17, 2017

rprouse left a comment

rprouse left a comment

CharliePoole commented Sep 17, 2017

ChrisMaddock commented Sep 17, 2017

CharliePoole commented Sep 27, 2017

ChrisMaddock commented Sep 27, 2017

CharliePoole commented Sep 27, 2017

Only restore isolated queues when current queues are completed #2445

Only restore isolated queues when current queues are completed #2445

Conversation

ChrisMaddock commented Sep 17, 2017

CharliePoole left a comment

Choose a reason for hiding this comment

CharliePoole Sep 17, 2017

Choose a reason for hiding this comment

CharliePoole Sep 17, 2017

Choose a reason for hiding this comment

rprouse Sep 17, 2017

Choose a reason for hiding this comment

ChrisMaddock Sep 17, 2017

Choose a reason for hiding this comment

ChrisMaddock Sep 17, 2017

Choose a reason for hiding this comment

rprouse left a comment

Choose a reason for hiding this comment

rprouse left a comment

Choose a reason for hiding this comment

CharliePoole commented Sep 17, 2017

ChrisMaddock commented Sep 17, 2017

CharliePoole commented Sep 27, 2017

ChrisMaddock commented Sep 27, 2017

CharliePoole commented Sep 27, 2017