Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing SocketException caused by --explore #223

Merged
merged 3 commits into from
May 25, 2017

Conversation

jnm2
Copy link
Collaborator

@jnm2 jnm2 commented May 10, 2017

Fixes #225, and probably #171 and #219 and nunit/nunit#2027 and #175 and #31 as well so we should check them.

Remote caller == console runner
Callee == agent

Cause of the bug

Every remoting call has three necessary stages:

  1. Transfer the information necessary to make the call from the caller to the callee over the communication channel. (For example, instance ID and marshaled parameters.)
  2. Perform the requested code execution in the callee.
  3. Transfer the result of the code execution from the callee to the caller over the communication channel. (For instance, return value and successful completion or exception.)

The agent's main thread starts the remoting TcpChannel listener. Then it blocks, waiting for a stop signal from another thread.

The agent's TcpChannel receives calls from the console runner and handles the three stages of each call on a threadpool thread.

The last call is the console runner telling the agent to shut down. The agent's code implementing the stop command is executed during stage 2. The agent can't shut down before stage 3 is complete without causing a SocketException in the remote caller which expects all three stages to complete normally for every call. So what the agent does is set the stop signal to unblock the agent's main thread and let the main thread handle the shutdown.

As soon as the agent's main thread is unblocked, the very next thing on its agenda is shutting itself down. Sometimes it shuts itself down so fast that the third stage of the stop call on the background thread is not complete. The main thread closes the socket while the background thread is still in the process of responding to the remote caller. This is the race condition @lsem was talking about.

So the stop signal really isn't useful on its own.

The repeatable repro (the test commit)

For some reason, none of us could reliably reproduce this race condition unless we:

  • Run the console runner and agent from a folder which does not contain nunit.framework.dll
  • Use the --explore option
  • Have tests with dynamic test cases requiring execution of the user code at discovery time, when that code involves accessing TestContext.Current.

As far as I can tell, these are not related logically to the actual bug. They only shift timings.
If it doesn't repro the first time, it almost certainly will the second time. My test runs it up to three times.
I did see it pass all tests in AppVeyor once which it shouldn't have. On @tom-dudley's and my machines it repros every time in every runner we tried. (I tried build.cake, VS runner, and ReSharper.)

The fix commit

Swallowing SocketExceptions is the least palatable option because it foists responsibility for the agent's internal problem onto the remote caller. Also you don't want to miss a legit and identical SocketException if do you lose your network connection. I will continue to assume this option is not on the table.

So what we need is to be sure that the three stages of the stop call have finished before shutting the TcpChannel down.

Thread.Delay(1000) is no guarantee. Due to thread scheduling or network delays, the stop call response may take an arbitrarily long time.

After researching and stepping through .NET Framework code in the debugger, it is evident we must shut down no sooner than this line completing:
http://referencesource.microsoft.com/#System.Runtime.Remoting/channels/tcp/tcpserverchannel.cs,613

Since we want to shut down as soon as possible after that line completes, we need a hook or event of some sort to let us know. As far as I can tell this leaves us with only one possible option: implement an IServerChannelSink which spies on all remoting calls and decorates the TcpServerTransportSink.

That is what the newly added ObservableServerChannelSinkProvider does. Every time you create a TcpChannel now you have an option to pass a CurrentMessageCounter which it will then use to keep track of how many calls are in flight at any given moment. Before the TcpServerTransportSink sees each message, the counter is incremented, and the counter is not decremented until the TcpServerTransportSink is completely finished all three stages of the call.

The CurrentMessageCounter also provides a method allowing you to wait until the number of in-flight messages is down to zero. So after the main thread receives the shutdown signal from the background thread, the main thread now waits for all concurrent messages (not just the stop message) to finish before shutting down.

This solution seems to be working well. There is now an absolute guarantee that every existing remote call has finished using the socket before we close the socket.

Edge cases

  • If the remote caller tries to send a remoting call after the stop call has finished, the remote caller will (and should) still get a SocketException.

  • If the remote caller sends more remoting calls overlapping with the stop call in such a way that the number of concurrent messages being handled does not get back to zero, the agent will not shut down until there are no more remote calls in flight. This would be bizarre behavior from the remote caller, but if it happened, this would likely be a useful feature rather than a bug.

@jnm2
Copy link
Collaborator Author

jnm2 commented May 19, 2017

@rprouse Rebased on master which enabled me to drop the commit "Enable people who have only VS2017 to build the repo."

@CharliePoole
Copy link
Collaborator

If the changes for building with only VS2017 make it require 2017, I feel we have a problem.

@rprouse
Copy link
Member

rprouse commented May 19, 2017

@CharliePoole it does not require 2017. I've been using undocumented hacks to ensure that isn't the case.

@jnm2 jnm2 force-pushed the explore_socketexception branch 2 times, most recently from fd8c9e7 to b0dacf0 Compare May 20, 2017 02:14
@jnm2
Copy link
Collaborator Author

jnm2 commented May 20, 2017

@CharliePoole @rprouse @ChrisMaddock @tom-dudley

Everything is ready! I would appreciate each of your reviews on this one if you can spare the time.

It will probably make the most sense to look at each of the commits on its own. The tests required a bunch of helper files unrelated to the fix.

@@ -133,7 +125,8 @@ public static int Main(string[] args)
log.Info("Initializing Services");
engine.Initialize();

Channel = ServerUtilities.GetTcpChannel();
// Owns the channel used for communications with the agency and with clients
var testAgencyServer = engine.Services.GetService<TestAgency>();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have been starting two separate channels for no good reason which caused no small amount of confusion when prototyping the fix at 1 am last night. =D

@ChrisMaddock
Copy link
Member

Awesome! 😄 I'll come back to this next week if you don't mind, after I've got the stuff I wanted to do sorted for the 3.8 framework, and when I can run some assemblies from work against it.

Not being even slightly familiar with .NET remoting, would you mind adding a summary of what the problem was, and the fix you've chosen? That would make it much easier to look through, from my point of view. 🙂

@jnm2
Copy link
Collaborator Author

jnm2 commented May 20, 2017

Sure thing! I updated with a detailed description.

@ChrisMaddock
Copy link
Member

ChrisMaddock commented May 20, 2017

Perfect - thank you!

This also links to #175. 🙂

Copy link
Contributor

@tom-dudley tom-dudley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Just some minor style issues.

@@ -44,8 +44,9 @@ public static class ServerUtilities
/// <param name="name">The name of the channel to create.</param>
/// <param name="port">The port number of the channel to create.</param>
/// <param name="limit">The rate limit of the channel to create.</param>
/// <param name="currentMessageCounter">An optional counter to provide the ability to wait for all current messages</param>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't optional for this method

/// <returns>The specified <see cref="TcpChannel"/> or null if it cannot be found and created</returns>
public static TcpChannel GetTcpChannel( string name, int port )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra spaces in various method calls go against the Coding Standard

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The entire file is this way so I followed the local style. I'd like nothing more than to clean up the coding style in the file though. I'll do it in an isolated commit so it isn't so noisy. (@nunit/core-team is this the right approach?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me. 🙂 Great to see that .NET 1.0 workaround gone!

@jnm2 jnm2 force-pushed the explore_socketexception branch 2 times, most recently from 13ba00d to aecb14a Compare May 20, 2017 19:37
@jnm2
Copy link
Collaborator Author

jnm2 commented May 20, 2017

I didn't want to mix coding style with the fix, so I added a third commit to address @tom-dudley's style concerns. Hopefully this makes ServerUtilities.cs much more readable too.

@jnm2
Copy link
Collaborator Author

jnm2 commented May 20, 2017

@ChrisMaddock

This also links to #175.

Do you want me to go ahead and link this PR to autoclose all five issues (#171, #175, #219, #225 and nunit/nunit#2027), or will we handle them on a case-by-case basis?

@jnm2 jnm2 force-pushed the explore_socketexception branch 2 times, most recently from d36f1a6 to 0a7bc22 Compare May 20, 2017 19:52
@jnm2 jnm2 force-pushed the explore_socketexception branch from 0a7bc22 to 33e75f2 Compare May 20, 2017 19:58
@jnm2
Copy link
Collaborator Author

jnm2 commented May 20, 2017

I noticed that I stopped using On.cs so I removed it.

Copy link
Member

@rprouse rprouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, great explanation.

@@ -103,7 +96,7 @@
<None Include="..\..\nunit.snk">
<Link>nunit.snk</Link>
</None>
<None Include="packages.config" />
<None Include="project.json" />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've got an alternate fix in for the build issue which has been merged. It will conflict with your approach.


return new TcpChannel( props, clientProvider, serverProvider );
return new TcpChannel(props, clientProvider, serverProvider);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks good. Like @ChrisMaddock says, nice to see the .NET 1.0 hack removed, the code is much cleaner. I'm still not sure why we need a TcpChannel here where an IpcChannel would do and would likely be quicker. Not really part of this PR, just thinking out loud 😄

@rprouse
Copy link
Member

rprouse commented May 22, 2017

Annoying, comments on individual commits don't show up here even though they were review comments. No matter, it wasn't a requested change, just a comment that I wonder why we are still using TcpChannel rather than switch to IpcChannel since it would be faster. Seems to me that using TcpChannel for running agents on a different machine is overkill until we do run code on a different machine and even then, I have a feeling that we will drop remoting and use a straight TCP socket or writer like we do in the Xamarin runner since remoting is a done technology that isn't available in many recent platforms.

No need to change that here, just thinking out loud 😄

@rprouse rprouse merged commit e5c3022 into nunit:master May 25, 2017
@JackUkleja
Copy link

Excellent work @jnm2.

While the pain of debugging, reproing and writing unit tests for a 'Heisenbug' is still fresh in your mind 😄 you may be interested taking a look at program called CHESS from MSR.

CHESS actually allows you to specify the order of thread interleaving to remove the non-determinism from these sorts of heisenbugs.

On a slight tangent, but I have contacted the team at MSR about it (because its currently mothballed and not updated to support VS 2017) and they are willing to dump it on github and change the license if people are willing to take it on. @CharliePoole could you see something like this being integrated into NUnit one day?

@ChrisMaddock
Copy link
Member

Sorry I didn't manage to get back to this earlier @jnm2. Solves all the problems I was seeing at work - great to have a fix in for this. Thanks!

@jnm2
Copy link
Collaborator Author

jnm2 commented May 27, 2017

Dang, sorry for ignoring yall! I didn't get GitHub notifications for any of the comments on this PR for the last six days. It's not the first time. (Does anyone know why this happens? It's only PRs.)

@jnm2
Copy link
Collaborator Author

jnm2 commented May 27, 2017

Huh, I may have had the PR open on another machine for six days... guess I'll have to be more careful.

@jnm2
Copy link
Collaborator Author

jnm2 commented May 27, 2017

@rprouse

Annoying, comments on individual commits don't show up here even though they were review comments.

Fyi they are all showing up here for me, all under the Conversation tab.

@JackUkleja That looks really cool! Thanks for the info.

@ChrisMaddock
Copy link
Member

@jnm2 - it would be good to close all the suspected duplicate issues that should be solved by this - maybe with a note that we can reopen them if it turns out to be a different bug. Shall I leave that with you? I don't want to steal your glory! 😄

@jnm2
Copy link
Collaborator Author

jnm2 commented May 29, 2017

@ChrisMaddock It was a group effort! 😄 But sure, I can do that.

@jnm2
Copy link
Collaborator Author

jnm2 commented May 29, 2017

@ChrisMaddock I didn't know what I was in for 😆 In chronological order:

Duplicates

nunit/nunit#514
nunit/nunit#916
#31 (from nunit/nunit#1671)
#225 (from nunit/nunit#1834) (PR #223)
nunit/nunit#2027
#171 (PR #175)
nunit/nunit#2036
#219

Similar to

#44 (from nunit/nunit#1627)
nunit/nunit#1824

Appears in

nunit/nunit#1045
nunit/nunit#1656

@ChrisMaddock
Copy link
Member

Awesome, thanks! 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SocketException thrown by nunit3-console.exe --explore option
6 participants