Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only restore isolated queues when current queues are completed #2445

Closed
wants to merge 2 commits into from

Conversation

ChrisMaddock
Copy link
Member

Fixes #2438.

What I believe was happening - RestoreQueues was being called while the OneTimeTearDown was waiting to be executed. RestoreQueues was then removing the current queues, and the as-yet-unexecuted OneTimeTearDown was being lost.

I added a Guard to RestoreQueues() - as I can't think of any situation where we would want to restore the queues while the existing queues still contain work. Hopefully in future that would give us a crash, rather than a hang. I then added the check to OnEndOfShift(), so the queues are only popped from the save-state-stack if all current work has been completed.

This is a new area to me - would appreciate someone checking this is all following what's expected here!

Copy link
Contributor

@CharliePoole CharliePoole left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I haven't gotten to this sooner or looked more deeply into it. As I stated, I"m suspicious of a hidden race condition but I'll have to leave it to you to do the thinking. 😄

@@ -242,7 +244,7 @@ private void RestoreQueues()

private void OnEndOfShift(object sender, EventArgs ea)
{
if (_isolationLevel > 0)
if (_isolationLevel > 0 && Shifts.All(q => !q.HasWork))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't actually understand what happened here. I know I added code to fix an earlier bug that made exactly this test using a different, non-Linq implementation. But that change seems to have disappeared!

Anyway, this is definitely needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However... I think there is more to it. It's possible for all the queues to be empty while there is still unfinished work because some fixture is running but has not yet queued its child tests. This is handled in the immediately following code for the purpose of not terminating the appication...

            // Shift has ended but all work may not yet be done
            while (_topLevelWorkItem.State != WorkItemState.Complete)
            {
                // This will fail if there is no work - all queues empty.
                // In that case, we just continue the loop until either
                // a shift is started or all the work is complete.
                if (StartNextShift())
                    return;
            }

I don't have time to step through it right now, but I'm worried that there's a race condition buried here, where a new test gets queued just as you are making your test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are using a different condition here than in your guard. I tried switching this to Queues.All(q => q.IsEmpty) and it works some of the time, but we need to add a null check when dequeuing and the guard failed one time making me think there is a race condition.

I also tried by changing the guard to Shifts.All(q => !q.HasWork) but that doesn't work.

I am running out of time today, I will try to get back to looking at it tonight.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CharliePoole - yes, that sounds feasible. I'm not totally sure how to deal with that one. The whole saving and restoring queues logic seems difficult to multi-thread - it wasn't entirely clear what the side-effects of dealing with #2428 would be here either.

@rprouse - Aha! I umm'd and arr'd over which condition to use, and seem to have ended up with one of each! I thought the two should be functionally equivalent here - interested as to what's not working by switching them around...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but we need to add a null check when dequeuing

@rprouse - what did you mean by that? Are you possibly seeing #2428? That needs to be dealt with as a separate issue, imo.

Copy link
Member

@rprouse rprouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a tough one to review so I pulled your branch and added a test project to the solution with the following test,

{
[TestFixture]
public class NunitParallelizationTest
{
    [Parallelizable(ParallelScope.All)]
    [Test]
    public void Test([Range(1, 10)] int x)
    {
        TestContext.WriteLine($"Test {x}");
    }

    [OneTimeSetUp]
    public void OneTimeSetup()
    {
        TestContext.WriteLine("OneTimeSetup");
    }

    [OneTimeTearDown]
    public void OneTimeTearDown()
    {
        TestContext.WriteLine("OneTimeTearDown");
    }
}

Running with your Guard in place, but your fix commented out, it crashes AND hangs every time for me and the OneTimeTearDown is not run.

Running with the Guard AND your fix, it runs every time and the OneTimeSetup and OneTimeTearDown are run correctly.

I also tried it without OneTimeSetup and OneTimeTearDown and get the same results. I am going to try a few other cases of parallelization, but so far looks good to me.

The only thing, do you think there is a better way to add tests for this? You are testing the guard, but not the fix. Do you think the above code would be a valid enough test if better named?

Copy link
Member

@rprouse rprouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Continuing to test and I've managed to get it to crash/hang again by adding in an STA queue. Test code is,

[TestFixture]
public class NunitParallelizationTest
{
    [Parallelizable(ParallelScope.All)]
    [Test]
    public void Test([Range(1, 10)] int x)
    {
        TestContext.WriteLine($"Test {x}");
    }

    [OneTimeSetUp]
    public void OneTimeSetup()
    {
        TestContext.WriteLine("OneTimeSetup");
    }

    [OneTimeTearDown]
    public void OneTimeTearDown()
    {
        TestContext.WriteLine("OneTimeTearDown");
    }
}

[TestFixture]
[Parallelizable(ParallelScope.All)]
public class NunitParallelizationTest2
{
    [Test]
    public void Test([Range(1, 10)] int x)
    {
        TestContext.WriteLine($"Test {x}");
    }

    [OneTimeSetUp]
    public void OneTimeSetup()
    {
        TestContext.WriteLine("OneTimeSetup");
    }

    [OneTimeTearDown]
    public void OneTimeTearDown()
    {
        TestContext.WriteLine("OneTimeTearDown");
    }
}

[TestFixture]
[Parallelizable(ParallelScope.Children)]
public class NunitParallelizationTest3
{
    [Test]
    public void Test([Range(1, 10)] int x)
    {
        TestContext.WriteLine($"Test {x}");
    }

    [OneTimeSetUp]
    public void OneTimeSetup()
    {
        TestContext.WriteLine("OneTimeSetup");
    }

    [OneTimeTearDown]
    public void OneTimeTearDown()
    {
        TestContext.WriteLine("OneTimeTearDown");
    }
}

[TestFixture]
[RequiresThread(System.Threading.ApartmentState.STA)]
public class NunitParallelizationTestSTA
{
    [Test]
    [Parallelizable(ParallelScope.All)]
    public void Test([Range(1, 10)] int x)
    {
        TestContext.WriteLine($"Test {x}");
    }

    [OneTimeSetUp]
    public void OneTimeSetup()
    {
        TestContext.WriteLine("OneTimeSetup");
    }

    [OneTimeTearDown]
    public void OneTimeTearDown()
    {
        TestContext.WriteLine("OneTimeTearDown");
    }
}

Testing fix, more to come...

@CharliePoole
Copy link
Contributor

Nice test! I think you have hit the race condition I was talking about.

Unfortunately, when we test a queue for work available, we only get the work that has already been queued. In the case of final termination of the app, we are able to test the top level work item for completion. In this case, where an NP fixture is running tests in parallel, we would need to test for completion of the particular fixture that caused the queue to be saved in the first place. This isn't an enormous change but it's a bit more than one might have thought.

@ChrisMaddock I'm not looking at the code, so don't trust any of the following. 😈

When we save the queues, we always do it because we are starting a particular composite work item. I think we should save that work item along with the queues, perhaps in a parallel stack. If the queues all become empty but that work item has not yet completed, we should actually not end the shift. Instead, we should continue until there is really no more work to be done.

That paragraph was probably easier to write than the code will be. 😭

@ChrisMaddock
Copy link
Member Author

Replying to everything bit by bit. 🙂

The only thing, do you think there is a better way to add tests for this? You are testing the guard, but not the fix. Do you think the above code would be a valid enough test if better named?

@rprouse - Yes, I agree. I did a similar thing to debug. I wasn't too sure how better to test this, other than smoke test it - I wondered if using a variety of paralellization modes in the NUnit might give us a better smoke-test coverage there.

I didn't add the precise test, as it's race-conditiony, and only one of many possible combinations of parallelisation attributes which may cause problems. I'm not sure how best to approach this - the specific issue here I believe is a combination of more tests than workers, test cases in parallel, AND the containing fixture in Non-Parallel. It feels to me like any combination of these things could potentially cause us issues in the future - and it would be great if we could test all that somehow, I'm just not too sure how to do it yet...

Continuing to test and I've managed to get it to crash/hang again by adding in an STA queue.

Good find! 😞

When we save the queues, we always do it because we are starting a particular composite work item. I think we should save that work item along with the queues, perhaps in a parallel stack. If the queues all become empty but that work item has not yet completed, we should actually not end the shift. Instead, we should continue until there is really no more work to be done.

@CharliePoole - that's an interesting idea. ParallelWorkItemDispatcher already saves the top level work item to figure out completion, so now we're saving and restoring queues, it seems fairly logical to replicate that behaviour down the tree.


Both - afraid I'm pretty flat out for the next few days, and likely won't have a chance to come back to this for a bit. Please feel free to move it forward in my absence!

@CharliePoole
Copy link
Contributor

@ChrisMaddock I think my fix to #2454, PR #2461, will at least partially fix this. Looking at your guard statement, it may be better than what I did. If you don't have time to get back to this one, I can work on it after or in conjunction with #2461 now that I'm back from my trip.

rprouse pushed a commit to rprouse/NUnit.Parallelism.Integration.Tests that referenced this pull request Sep 27, 2017
@ChrisMaddock
Copy link
Member Author

@CharliePoole - oh, this relates to my comment on #2461 - great minds! 😄

Would appreciate you having a look at this - sorry, had too much going on the last few weeks to revisit this, and I think it could use a fix ASAP.

@CharliePoole
Copy link
Contributor

@ChrisMaddock Per @rprouse #2438 is fixed now. I'm going back to bed but I'll check this later to see if any of the changes need to go into a re-design.

@ChrisMaddock ChrisMaddock deleted the issue-2438 branch October 5, 2017 08:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants