-
Notifications
You must be signed in to change notification settings - Fork 723
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Only restore isolated queues when current queues are completed #2445
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I haven't gotten to this sooner or looked more deeply into it. As I stated, I"m suspicious of a hidden race condition but I'll have to leave it to you to do the thinking. 😄
@@ -242,7 +244,7 @@ private void RestoreQueues() | |||
|
|||
private void OnEndOfShift(object sender, EventArgs ea) | |||
{ | |||
if (_isolationLevel > 0) | |||
if (_isolationLevel > 0 && Shifts.All(q => !q.HasWork)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't actually understand what happened here. I know I added code to fix an earlier bug that made exactly this test using a different, non-Linq implementation. But that change seems to have disappeared!
Anyway, this is definitely needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However... I think there is more to it. It's possible for all the queues to be empty while there is still unfinished work because some fixture is running but has not yet queued its child tests. This is handled in the immediately following code for the purpose of not terminating the appication...
// Shift has ended but all work may not yet be done
while (_topLevelWorkItem.State != WorkItemState.Complete)
{
// This will fail if there is no work - all queues empty.
// In that case, we just continue the loop until either
// a shift is started or all the work is complete.
if (StartNextShift())
return;
}
I don't have time to step through it right now, but I'm worried that there's a race condition buried here, where a new test gets queued just as you are making your test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are using a different condition here than in your guard. I tried switching this to Queues.All(q => q.IsEmpty)
and it works some of the time, but we need to add a null check when dequeuing and the guard failed one time making me think there is a race condition.
I also tried by changing the guard to Shifts.All(q => !q.HasWork)
but that doesn't work.
I am running out of time today, I will try to get back to looking at it tonight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@CharliePoole - yes, that sounds feasible. I'm not totally sure how to deal with that one. The whole saving and restoring queues logic seems difficult to multi-thread - it wasn't entirely clear what the side-effects of dealing with #2428 would be here either.
@rprouse - Aha! I umm'd and arr'd over which condition to use, and seem to have ended up with one of each! I thought the two should be functionally equivalent here - interested as to what's not working by switching them around...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a tough one to review so I pulled your branch and added a test project to the solution with the following test,
{
[TestFixture]
public class NunitParallelizationTest
{
[Parallelizable(ParallelScope.All)]
[Test]
public void Test([Range(1, 10)] int x)
{
TestContext.WriteLine($"Test {x}");
}
[OneTimeSetUp]
public void OneTimeSetup()
{
TestContext.WriteLine("OneTimeSetup");
}
[OneTimeTearDown]
public void OneTimeTearDown()
{
TestContext.WriteLine("OneTimeTearDown");
}
}
Running with your Guard in place, but your fix commented out, it crashes AND hangs every time for me and the OneTimeTearDown is not run.
Running with the Guard AND your fix, it runs every time and the OneTimeSetup and OneTimeTearDown are run correctly.
I also tried it without OneTimeSetup and OneTimeTearDown and get the same results. I am going to try a few other cases of parallelization, but so far looks good to me.
The only thing, do you think there is a better way to add tests for this? You are testing the guard, but not the fix. Do you think the above code would be a valid enough test if better named?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Continuing to test and I've managed to get it to crash/hang again by adding in an STA queue. Test code is,
[TestFixture]
public class NunitParallelizationTest
{
[Parallelizable(ParallelScope.All)]
[Test]
public void Test([Range(1, 10)] int x)
{
TestContext.WriteLine($"Test {x}");
}
[OneTimeSetUp]
public void OneTimeSetup()
{
TestContext.WriteLine("OneTimeSetup");
}
[OneTimeTearDown]
public void OneTimeTearDown()
{
TestContext.WriteLine("OneTimeTearDown");
}
}
[TestFixture]
[Parallelizable(ParallelScope.All)]
public class NunitParallelizationTest2
{
[Test]
public void Test([Range(1, 10)] int x)
{
TestContext.WriteLine($"Test {x}");
}
[OneTimeSetUp]
public void OneTimeSetup()
{
TestContext.WriteLine("OneTimeSetup");
}
[OneTimeTearDown]
public void OneTimeTearDown()
{
TestContext.WriteLine("OneTimeTearDown");
}
}
[TestFixture]
[Parallelizable(ParallelScope.Children)]
public class NunitParallelizationTest3
{
[Test]
public void Test([Range(1, 10)] int x)
{
TestContext.WriteLine($"Test {x}");
}
[OneTimeSetUp]
public void OneTimeSetup()
{
TestContext.WriteLine("OneTimeSetup");
}
[OneTimeTearDown]
public void OneTimeTearDown()
{
TestContext.WriteLine("OneTimeTearDown");
}
}
[TestFixture]
[RequiresThread(System.Threading.ApartmentState.STA)]
public class NunitParallelizationTestSTA
{
[Test]
[Parallelizable(ParallelScope.All)]
public void Test([Range(1, 10)] int x)
{
TestContext.WriteLine($"Test {x}");
}
[OneTimeSetUp]
public void OneTimeSetup()
{
TestContext.WriteLine("OneTimeSetup");
}
[OneTimeTearDown]
public void OneTimeTearDown()
{
TestContext.WriteLine("OneTimeTearDown");
}
}
Testing fix, more to come...
Nice test! I think you have hit the race condition I was talking about. Unfortunately, when we test a queue for work available, we only get the work that has already been queued. In the case of final termination of the app, we are able to test the top level work item for completion. In this case, where an NP fixture is running tests in parallel, we would need to test for completion of the particular fixture that caused the queue to be saved in the first place. This isn't an enormous change but it's a bit more than one might have thought. @ChrisMaddock I'm not looking at the code, so don't trust any of the following. 😈 When we save the queues, we always do it because we are starting a particular composite work item. I think we should save that work item along with the queues, perhaps in a parallel stack. If the queues all become empty but that work item has not yet completed, we should actually not end the shift. Instead, we should continue until there is really no more work to be done. That paragraph was probably easier to write than the code will be. 😭 |
Replying to everything bit by bit. 🙂
@rprouse - Yes, I agree. I did a similar thing to debug. I wasn't too sure how better to test this, other than smoke test it - I wondered if using a variety of paralellization modes in the NUnit might give us a better smoke-test coverage there. I didn't add the precise test, as it's race-conditiony, and only one of many possible combinations of parallelisation attributes which may cause problems. I'm not sure how best to approach this - the specific issue here I believe is a combination of more tests than workers, test cases in parallel, AND the containing fixture in Non-Parallel. It feels to me like any combination of these things could potentially cause us issues in the future - and it would be great if we could test all that somehow, I'm just not too sure how to do it yet...
Good find! 😞
@CharliePoole - that's an interesting idea. ParallelWorkItemDispatcher already saves the top level work item to figure out completion, so now we're saving and restoring queues, it seems fairly logical to replicate that behaviour down the tree. Both - afraid I'm pretty flat out for the next few days, and likely won't have a chance to come back to this for a bit. Please feel free to move it forward in my absence! |
@ChrisMaddock I think my fix to #2454, PR #2461, will at least partially fix this. Looking at your guard statement, it may be better than what I did. If you don't have time to get back to this one, I can work on it after or in conjunction with #2461 now that I'm back from my trip. |
@CharliePoole - oh, this relates to my comment on #2461 - great minds! 😄 Would appreciate you having a look at this - sorry, had too much going on the last few weeks to revisit this, and I think it could use a fix ASAP. |
@ChrisMaddock Per @rprouse #2438 is fixed now. I'm going back to bed but I'll check this later to see if any of the changes need to go into a re-design. |
Fixes #2438.
What I believe was happening - RestoreQueues was being called while the OneTimeTearDown was waiting to be executed. RestoreQueues was then removing the current queues, and the as-yet-unexecuted OneTimeTearDown was being lost.
I added a Guard to
RestoreQueues()
- as I can't think of any situation where we would want to restore the queues while the existing queues still contain work. Hopefully in future that would give us a crash, rather than a hang. I then added the check toOnEndOfShift()
, so the queues are only popped from the save-state-stack if all current work has been completed.This is a new area to me - would appreciate someone checking this is all following what's expected here!