Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Running multiple Bus tests causes error code 3 #151

Closed
mwpowellhtx opened this issue Nov 4, 2017 · 24 comments
Closed

Running multiple Bus tests causes error code 3 #151

mwpowellhtx opened this issue Nov 4, 2017 · 24 comments

Comments

@mwpowellhtx
Copy link

Taken individually, InProc, IPv4, or IPv6, the tests all run just fine. But run in parallel causes problems.

2017.11.04 18:36:17.392   ERROR Process C:\Users\Michael\AppData\Local\JetBrains\Installations\ReSharperPlatformVs14\JetBrains.ReSharper.TaskRunner.CLR45.x64.exe:13608 exited with code '3'.

However, by contrast, the pipeline tests are running just fine, individually, or altogether in parallel (same set as above).

Generally, I am following the same approach as the C code, closer to the C++ code, probably even a bit "simpler" due no smart pointer gymnastics going on. I am also not bothering to vet any of the protocol/peer stuff, just focusing on messaging, the socket protocols, etc.

I'll have my repo committed this weekend so you can take a look.

@mwpowellhtx
Copy link
Author

Similar issue with req/rep taken in parallel.

@mwpowellhtx
Copy link
Author

Ditto survey.

@gdamore
Copy link
Contributor

gdamore commented Nov 5, 2017

So I'm not sure what "error code 3" is, as you're not running my verbatim test programs. (NNG error code 3 means "invalid", which can be many different things.)

Having said that, some of these tests make use of "fixed" (as in hard-coded) addresses and will definitely not play well with other instances of the same test running. In some cases, even other instances of different tests won't play well, unless parameters are supplied to the test to select non-conflicting port numbers. (With IPC we don't have port numbers, but we use numeric values to give each test a unique address.)

I'm fairly confident that any poor interaction between different test instances is just this address conflict. (Inproc should be 100% free of these though.)

@mwpowellhtx
Copy link
Author

My tests are not using hard coded addresses. Each individual unit test is calling for a different address, with the strategy being this:

internal static class PortExtensionMethods
{
    private const int MinPort = 10000;
    private const int MaxPort = 10999;

    private static int? _port;

    private static int GetPort(int delta)
    {
        _port = _port ?? 0;
        _port += delta;
        // ReSharper disable once PossibleInvalidOperationException
        return (_port = Max(MinPort, Min(_port.Value, MaxPort))).Value;
    }

    public static string WithPort(this string addr, int delta = 1)
    {
        return $"{addr}:{GetPort(delta)}";
    }

    public static string BuildAddress(this SocketAddressFamily family)
    {
        var uuid = Guid.NewGuid();
        // ReSharper disable once SwitchStatementMissingSomeCases
        switch (family)
        {
            case InProcess:
                return $"inproc://{uuid}";

            case InterProcess:
                return $"ipc://pipe/{uuid}";

            case IPv4:
                return $"tcp://127.0.0.1".WithPort();

            case IPv6:
                return $"tcp://[::1]".WithPort();

            case Unspecified:
            default:
                throw new ArgumentException($"Address invalid for family '{family}'", nameof(family));
        }
    }
}

IPC and InProc both utilize a UUID for uniqueness. Where as TCP IPv4 and IPv6 both utilize a different port.

Which address is subsequently used for both listen and dial protocol operations.

@gdamore
Copy link
Contributor

gdamore commented Nov 5, 2017

At this point I don't have any ideas. I will write a test program showing multiple concurrent uses just to prove that it isn't nng.

Failing that, I'd need your code -- I still feel pretty strongly that somehow or other address reuse is to blame. There are some relatively straight-forward ways to verify this, but they do involve either adding printf's in the code, or looking at it with a debugger.

@mwpowellhtx
Copy link
Author

I'm not buying it. You can read the code here for yourself. Since posting this draft addressing strategy, I updated it slightly to account for each test fixture. So each test fixture gets its own range from a unique base, etc.

Even the InProcess stuff is unique, based on UUID with every iteration:

   That_Bus1_delivers_message_to_both_Bus2_and_Bus3 [0:00.284] Success
Running protocol tests for address family 'InProcess'.
Testing using address 'inproc://94fd531d-ba57-4dd1-a4af-f3e419aa0a71'.
Given three 'Nanomsg2.Sharp.Protocols.Bus.LatestBusSocket' instances.
  And messages can be delivered.
    And receive times out.
    And Bus1 delivers message to both Bus2 and Bus3.
   That_Bus2_delivers_message_to_Bus1_and_Bus3_times_out [0:00.240] Success
Running protocol tests for address family 'InProcess'.
Testing using address 'inproc://29949971-54f2-4391-ac32-d748bf68b9e2'.
Given three 'Nanomsg2.Sharp.Protocols.Bus.LatestBusSocket' instances.
  And messages can be delivered.
    And receive times out.
    And Bus2 delivers message to Bus1, Bus3 times out.

But which even the InProcess stuff falls over when run concurrently with the IPv4 and IPv6. All of the default tests are InProc, consistent with your C tests, and my C++ tests. The only difference is the Test Address from InProc to IPv4 to IPv6 Bus test.

image

So methinks there is something, perhaps SOCKADDR confusion, going on. Of course, I can't say for sure, and I don't want to white box test your code line for line.

@gdamore
Copy link
Contributor

gdamore commented Nov 5, 2017

I doubt there's SOCKADDR confusion happening. These aren't used for IPC, and unfortunately the error cases I've received from you are too "vague" from the point of nng for me to reasonably know what the heck is going on.

I'm wondering if there is something going on that I don't understand -- for example you indicated that inproc tests fail when run in parallel... that makes me thing that maybe I'm misunderstanding something fairly fundamental in the platform. For example, are the tests actually run in separate processes, or do they share address spaces? I'd have though you had separate .exe files and were running each one separately, but if that's not the case then a bunch of other things may be called into doubt.

(This would be reflective of my general ignorance about C# and .NET.)

@gdamore
Copy link
Contributor

gdamore commented Nov 5, 2017

Generally speaking, on most platforms at least, nng tries to provide a stack backtrace when it aborts in panic. Having that information would be far less "vague" than just "error code 3."

@mwpowellhtx
Copy link
Author

I'm not sure what else I can tell you there. I'm just copying what has been reported to me to the clipboard.

I could be wrong, but I seem to recall that the VS test runner is a multi-process runner. It can be "trained" to run sequentially, but I'm not sure what that helps by doing so.

I pushed my repo this morning, at your convenience.

@mwpowellhtx
Copy link
Author

So I'm not sure what "error code 3" is, as you're not running my verbatim test programs. (NNG error code 3 means "invalid", which can be many different things.)

Respectfully, you need to drop this "verbatim" expectation. That's unrealistic, considering these efforts, and efforts like it, will be wrappers. Heck, if you don't want it wrapped, just say so. But I think NNG would be helped by the exposure.

As long as the calling conventions are correct, and so far, I think I've demonstrated far and away that they are, or corrected when they weren't, what's left is usage, however awkward or clumsy it may be.

@mwpowellhtx
Copy link
Author

@gdamore So I was curious, was it multi process. No, it is not. And I will prove it. However, the test runner is multi threaded. You said it should be re-entrant, but obviously, it is not. I don't know what that means to you, or for NNG, but it is what it is as far as I'm concerned. I think it is unrealistic to expect the host application not be at least threaded.

  BusTests (3 tests) [0:07.035] Aborted
   That_Bus1_delivers_message_to_both_Bus2_and_Bus3 [0:00.000] Aborted
   That_Bus2_delivers_message_to_Bus1_and_Bus3_times_out [0:03.634] Inconclusive: Test not run
   That_default_socket_correct [0:03.401] Inconclusive: Test not run
Current process Id: 10308
Managed thread Id: 13
Running protocol tests for address family 'IPv6'.
Testing using address 'tcp://[::1]:10352'.
Given three 'Nanomsg2.Sharp.Protocols.Bus.LatestBusSocket' instances.
  IPv4BusTests (3 tests) [0:21.820] Aborted
   That_Bus1_delivers_message_to_both_Bus2_and_Bus3 [0:00.000] Aborted
   That_Bus2_delivers_message_to_Bus1_and_Bus3_times_out [0:11.026] Inconclusive: Test not run
   That_default_socket_correct [0:10.793] Inconclusive: Test not run
Current process Id: 10308
Managed thread Id: 12
Running protocol tests for address family 'IPv4'.
Testing using address 'tcp://127.0.0.1:10368'.
Given three 'Nanomsg2.Sharp.Protocols.Bus.LatestBusSocket' instances.
  IPv6BusTests (3 tests) [0:42.819] Aborted
   That_Bus1_delivers_message_to_both_Bus2_and_Bus3 [0:00.000] Aborted
   That_Bus2_delivers_message_to_Bus1_and_Bus3_times_out [0:21.409] Inconclusive: Test not run
   That_default_socket_correct [0:21.410] Inconclusive: Test not run
Current process Id: 10308
Managed thread Id: 9
Running protocol tests for address family 'InProcess'.
Testing using address 'inproc://037efff8-2b92-417e-a14c-62736239e01d'.
Given three 'Nanomsg2.Sharp.Protocols.Bus.LatestBusSocket' instances.

@gdamore
Copy link
Contributor

gdamore commented Nov 5, 2017 via email

@mwpowellhtx
Copy link
Author

I documented, it's definitely not multi-process. It is multi-thread.

@mwpowellhtx
Copy link
Author

As for simplest repro case, that's up to you to figure out. My suggestions would be to include these elements: multiple-transports involved, definitely multi-threading involved. You may even be able to reproduce using the same boilerplate unit testing.

As a back drop, I have not diverged into brand new unit tests. I am using the same patterns for testing, the same sequence of operations, etc, that you have done in the C code, and that I modeled in the C++ code, in the C# code. So there shouldn't be any great mysteries there.

@gdamore
Copy link
Contributor

gdamore commented Nov 5, 2017 via email

@mwpowellhtx
Copy link
Author

Like hell I am not using the same pattern. The boilerplate code is patterned after the same test. The ONLY difference is that I run 2+ concurrently. OF THE SAME BOILERPLATE TESTS. That is the ONLY difference.

@gdamore
Copy link
Contributor

gdamore commented Nov 5, 2017 via email

@mwpowellhtx
Copy link
Author

You aren't hearing what I've said. The test itself hasn't changed.

We're here because NNG apparently has a problem with at least two elements: 1) multi-threading, and/or 2) support for multiple concurrent transports.

As long as the boilerplate tests run against a homogenous transport, the tests run fine. But when heterogeneous transports are involved, and/or threading, it's a problem.

Literally instead of just running one boilerplate test fixture, I select 2+ to run. Falls over immediately.

Which is why I'm asking whether NNG is re-entrant (read: thread-safe).

@mwpowellhtx
Copy link
Author

Can I suggest that you refactor whatever static tables you are maintaining into a self-contained struct? You're already doing that, more or less, with the Message concept, and even with the AIO concept, which is seen opaquely. Along the same lines. That way whatever state is maintained travels with the caller's context. Perhaps an nng_start(...) to link up with the ``nng_fini()`. That sort of thing. I would personally have more confidence that NNG was re-entrant and thread-safe under the hood that way.

@gdamore
Copy link
Contributor

gdamore commented Nov 5, 2017 via email

@mwpowellhtx
Copy link
Author

To be clear, in this case I am not calling nng_fini. I'm not sure why you attribute the assumption that I am for simply mentioning it.

I understand that the steps in your tests are undocumented. There are no ivory towers here, though. I was able to reverse engineer my focused message pipe unit tests, for example.

@gdamore
Copy link
Contributor

gdamore commented Nov 6, 2017

I've done a bunch more analysis.

I'm closing this bug, as the real issue is #150 -- there is a bug in the Windows IPC code. Stay tuned.

@gdamore gdamore closed this as completed Nov 6, 2017
@mwpowellhtx
Copy link
Author

I'm just not buying it. This has nothing whatsoever to do with IPC. Again, run individually, Bus and Pipeline tests, for instance, run successfully.

image

    BusTests.That_Bus1_delivers_message_to_both_Bus2_and_Bus3 [0:00.274] Success
Current process Id: 12984
Managed thread Id: 9
Running protocol tests for address family 'IPv4'.
Testing using address 'tcp://127.0.0.1:10368'.
Given three 'Nanomsg2.Sharp.Protocols.Bus.LatestBusSocket' instances.
  And messages can be delivered.
    And receive times out.
    And Bus1 delivers message to both Bus2 and Bus3.
    BusTests.That_Bus2_delivers_message_to_Bus1_and_Bus3_times_out [0:00.248] Success
Current process Id: 12984
Managed thread Id: 9
Running protocol tests for address family 'IPv4'.
Testing using address 'tcp://127.0.0.1:11368'.
Given three 'Nanomsg2.Sharp.Protocols.Bus.LatestBusSocket' instances.
  And messages can be delivered.
    And receive times out.
    And Bus2 delivers message to Bus1, Bus3 times out.
    PipelineTests.Load_balancing_works [0:00.405] Success
Current process Id: 3180
Managed thread Id: 9
Running protocol tests for address family 'IPv4'.
Testing using address 'tcp://127.0.0.1:10304'.
    PipelineTests.Push_can_send_Pull_can_receive [0:00.024] Success
Current process Id: 3180
Managed thread Id: 9
Running protocol tests for address family 'IPv4'.
Testing using address 'tcp://127.0.0.1:11304'.

However, run together, different threads, they fall over immediately.

2017.11.06 13:42:40.306   ERROR Process C:\Users\Michael\AppData\Local\JetBrains\Installations\ReSharperPlatformVs14\JetBrains.ReSharper.TaskRunner.CLR45.x64.exe:10572 exited with code '3'.

image

    BusTests.That_Bus1_delivers_message_to_both_Bus2_and_Bus3 [0:00.000] Aborted
Current process Id: 10572
Managed thread Id: 13
Running protocol tests for address family 'IPv4'.
Testing using address 'tcp://127.0.0.1:10368'.
Given three 'Nanomsg2.Sharp.Protocols.Bus.LatestBusSocket' instances.
    PipelineTests.Load_balancing_works [0:00.000] Aborted
Current process Id: 10572
Managed thread Id: 9
Running protocol tests for address family 'IPv4'.
Testing using address 'tcp://127.0.0.1:10304'.

@gdamore
Copy link
Contributor

gdamore commented Nov 6, 2017

Please see the multistress test for a test in C that demonstrates bus, pair0, pub/sub, and req/rep all running simultaneously. This issue is not reproducible for me, and I have no time to debug your environment. If you want to submit a bug report demonstrating the problem in C, please feel free to do so in another bug. This issue is closed, however.

@nanomsg nanomsg locked and limited conversation to collaborators Nov 6, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants