Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing regression that caused NUnit to be sensitive to windows timer resolution #2233

Merged
merged 2 commits into from Jun 22, 2017

Conversation

indy-singh
Copy link
Contributor

@indy-singh indy-singh commented Jun 8, 2017

Issue #2217 covers this regression in detail.

Copy link
Contributor

@jnm2 jnm2 left a comment

Before we fix this I'd like to have a good idea of why it was important. Thread.Sleep(1) forces a context switch and Thread.Sleep(0) does not force a context switch. We may be losing the ability for the event pump to process an event which has complex side effects on something else which wasn't documented here when the code was first written. The original change from 0 to 1 should have come with a test demonstrating the necessity, but I think we should at least try to figure it out and add that test now if possible. There may be a better way to solve the original problem than either Thread.Sleep(1) or Thread.Sleep(0).

I'd also love if you could write a timing test which fails before the change and passes after the change. That will help keep us from getting into the very same kind of scenario again.

@CharliePoole
Copy link
Member

CharliePoole commented Jun 8, 2017

@jnm2 You're right... that's why we did it. Events were not being sent, which was quite noticeable in the GUI. A better way that doesn't wait 15ms would be a wonderful thing. 😄

@mintsoft
Copy link

mintsoft commented Jun 8, 2017

I'd also love if you could write a timing test which fails before the change and passes after the change. That will help keep us from getting into the very same kind of scenario again.

Hmm, I think that might be quite hard given that VS sets the timer interval to 1; the tests would pass when executed in VS. However when ran on a machine with nothing running, they would fail.

I guess one option would be to reset the timer interval to default in the setup of the test, it does mean calling an unmanaged, undocumented kernel function in ntdll though, probably not portable to mono!

@CharliePoole
Copy link
Member

CharliePoole commented Jun 8, 2017

IMO, such a test is not a unit test. If we had one it should be elsewhere.

@CharliePoole
Copy link
Member

CharliePoole commented Jun 9, 2017

@jnm2 Were you about to offer an alternative to Sleep?

@jnm2
Copy link
Contributor

jnm2 commented Jun 9, 2017

@mintsoft

I guess one option would be to reset the timer interval to default in the setup of the test, it does mean calling an unmanaged, undocumented kernel function in ntdll though, probably not portable to mono!

Yes, I see your point. Next best thing: can you please document in a comment above the line exactly why it should not be Thread.Sleep(1)? We should have that, no matter what fix we land on.

@jnm2
Copy link
Contributor

jnm2 commented Jun 9, 2017

@CharliePoole

I'm still interested in getting a unit test of the original problem, if that's possible.

Were you about to offer an alternative to Sleep?

Yes, but I want to understand the problem. Now that you said GUI, I only have more questions though.

Thread.Sleep and this event pump were not being run on a GUI thread, were they? I believe Thread.Sleep only pumps a few whitelisted messages. If this code was causing GUI to lock up it's a perfect candidate for moving to a background thread.
(If Thread.Sleep was being used as a poor man's Application.DoEvents to synchronously pump messages and then continue, all the more reason to move it to a background thread. DoEvents is dangerous because it causes reentry in GUI logic. I can tell many war stories.)

@CharliePoole
Copy link
Member

CharliePoole commented Jun 9, 2017

I just refreshed my memory by doing a quick search. Bottom line is that many articles say your best bet is to use Sleep(1) to avoid starving the consumer thread in a situation like this. That's clearly why we made the change, as @oznetmaster points out.

I'm inclined to accept the change back to Sleep(0), which we had clearly been using for a long time before we changed it. It may be that Windows load balancing takes care of the problem for us automatically. We could (later) even experiment with eliminating the sleep entirely. Since we just did a release, we do have some time to play with this change in master.

@CharliePoole
Copy link
Member

CharliePoole commented Jun 9, 2017

I'm a little suspicious here since the Travis failure is related to timing.

@CharliePoole
Copy link
Member

CharliePoole commented Jun 9, 2017

No, this is many threads away from any gui. We're in the framework here. The point is only that these events are consumed by runners, including any gui, which use them to show progress. If the event pump thread is starved of time, then it appears to the runner that nothing is happening in the tests.

Of course, all of this was invented back in the world of single-core machines. In most cases today, we have multiple threads running at the same time, not just pretending. However, we may occasionally have to run on a single processor even today.

@mintsoft
Copy link

mintsoft commented Jun 9, 2017

Of course, all of this was invented back in the world of single-core machines. In most cases today, we have multiple threads running at the same time, not just pretending. However, we may occasionally have to run on a single processor even today.

One option, would be to only sleep(1) on a single core machine? If you're using a single core machine, you can kiss parallel test performance goodbye so the 15ms pause is probably tolerable; the moment you have more than 1 core, then it becomes irrelevant.

@oznetmaster
Copy link
Contributor

oznetmaster commented Jun 9, 2017

On a single core machine, if any of the tests do blocking I/O, then parallel test performance can be significant.

@mintsoft
Copy link

mintsoft commented Jun 9, 2017

On a single core machine, if any of the tests do blocking I/O, then parallel test performance can be significant.

That's true, however my point is that it doesn't feel like a use case which should be prioritised? Plus if the IO is stalled, would you notice a 15ms pause in that execution?

@CharliePoole
Copy link
Member

CharliePoole commented Jun 9, 2017

@singh400 Have you looked at the failing test?

@indy-singh
Copy link
Contributor Author

indy-singh commented Jun 9, 2017

@CharliePoole Can not get them to fail on my machine.

@CharliePoole
Copy link
Member

CharliePoole commented Jun 9, 2017

Failure is only on linux and only on the .NET 2.0 build. Is anyone else able to duplicate it?

jnm2
jnm2 approved these changes Jun 22, 2017
Copy link
Contributor

@jnm2 jnm2 left a comment

Sorry, I forgot to come back to this. Comments documenting the reason 1 is bad were all I was waiting for.

@CharliePoole
Copy link
Member

CharliePoole commented Jun 22, 2017

Travis failure is an unrelated intermittent error. Restarted.

@CharliePoole
Copy link
Member

CharliePoole commented Jun 22, 2017

I'm approving this change. I would actually prefer to see the Sleep removed entirely and the high priority setting on the event pump thread removed. However, this will improve performance and we can change it further later.

@rprouse I don't see this as a hotfix item myself. The problem seems to have existed since 3.4.0. Of course, if you do a hotfix for another reason, this or a follow-on larger fix should go in it.

@CharliePoole CharliePoole merged commit 7bdec91 into nunit:master Jun 22, 2017
@mintsoft
Copy link

mintsoft commented Jun 22, 2017

Thanks for this @singh400

@indy-singh
Copy link
Contributor Author

indy-singh commented Jun 22, 2017

Thanks for merging this PR @CharliePoole . How does one match up the correct build on MyGet? I presume this PR was automatically built and pushed to MyGet?

MyGet currently lists (https://www.myget.org/feed/nunit/package/nuget/NUnit) the following builds that occurred today:-

  • 3.8.0-dev-04109
  • 3.8.0-dev-04110

The AppVeyor out suggests that this PR was built as NUnit.3.8.0-ci-04074-pr-2233 except I can not locate that on MyGet?

Cheers,
Indy

@CharliePoole
Copy link
Member

CharliePoole commented Jun 22, 2017

Those builds on appveyor with "ci" are all prior to merge. The build after merge is "dev", this one being https://ci.appveyor.com/project/CharliePoole/nunit/build/3.8.0-dev-04110

@mintsoft
Copy link

mintsoft commented Jul 7, 2017

@CharliePoole Is there a release schedule for the next version of NUnit? I'm wondering when we can expect to upgrade to NUnit 3 again

@CharliePoole
Copy link
Member

CharliePoole commented Jul 7, 2017

@rprouse That's a question for you. 😄

@rprouse
Copy link
Member

rprouse commented Jul 11, 2017

@mintsoft we try to release quarterly and the next release is scheduled for end-Aug, https://github.com/nunit/nunit/milestone/30

@indy-singh
Copy link
Contributor Author

indy-singh commented Jul 11, 2017

@rprouse I'd hope that a regression of this type would be severe enough to issue a hot fix (v3.7.2 maybe?).

@indy-singh
Copy link
Contributor Author

indy-singh commented Jul 23, 2017

@rprouse Bump - I'd like an answer to whether or not a hot-fix will be released. Currently v3.7.1 is having an non-trivial impact on our CI server. I'd rather not move to a semi-official release from MyGet if that can be helped.

@mintsoft
Copy link

mintsoft commented Jul 24, 2017

Same here @singh400 I'd like to upgrade to NUnit3 again 👍

@rprouse
Copy link
Member

rprouse commented Jul 24, 2017

Are there features in 3.7 that you need that would prevent you from using an earlier 3.x release that doesn't have this problem?

3.8 is due out next month. I might be willing to ship a bit early, but I would prefer not to release a hot fix this far out.

@mintsoft
Copy link

mintsoft commented Aug 15, 2017

Are there features in 3.7 that you need that would prevent you from using an earlier 3.x release that doesn't have this problem?

Personally, no there aren't; I just don't really want to incur the pain of upgrading platform wide more than I have to.

@indy-singh
Copy link
Contributor Author

indy-singh commented Aug 17, 2017

Are there features in 3.7 that you need that would prevent you from using an earlier 3.x release that doesn't have this problem?

No. Just that I was hoping that once the regression was identified and fixed that a hotfix would be released without delay. Especially since I proved that this bug has the potential to impact CI flow by adding minutes to test execution.

But given that the next scheduled release is due at the end of this month, we might as well wait for it.

@jnm2
Copy link
Contributor

jnm2 commented Sep 14, 2018

Related: #3019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants