Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nng_listen occasionally fails when reusing ipc socket on Windows #1175

Open
jeikabu opened this issue Jan 23, 2020 · 9 comments
Open

nng_listen occasionally fails when reusing ipc socket on Windows #1175

jeikabu opened this issue Jan 23, 2020 · 9 comments
Labels

Comments

@jeikabu
Copy link
Contributor

jeikabu commented Jan 23, 2020

NNG & Platform details.

nng v1.2.4 Win10

Expected Behavior

After calling nng_close() on a socket listening on "ipc://", can immediately open another socket and listen on same pipe.

Actual Behavior

Works on Linux. On Windows, nng_listen() fails with EADDRINUSE after 10-500x iterations.

Steps to Reproduce

int
main(int argc, char** argv) {
    nng_socket socket0, socket1;
    int i;
    const char* url = "ipc://test_socket";
    for (int i = 0; i < 1000; i++) {
        assert(nng_pair0_open(&socket0) == 0);
        assert(nng_listen(socket0, url, NULL, 0) == 0);
        assert(nng_pair0_open(&socket1) == 0);
        assert(nng_dial(socket1, url, NULL, 0) == 0);
        assert(nng_close(socket1) == 0);
        assert(nng_close(socket0) == 0);
    }
    
    return 0;
}
@gdamore
Copy link
Contributor

gdamore commented Jan 23, 2020

I'm assuming you're using actual Windows and not WSL.

IPC on Windows is implemented using named pipes. I wonder if the OS has some limitation here -- I don't think I'm doing anything particularly asynchronously -- but I will look into it.

@gdamore
Copy link
Contributor

gdamore commented Jan 23, 2020

I suspect that what is happening is that while I'm calling DisconnectNamedPipe -- you might have clients on the far side that are still holding that point in the "namespace" active. This is different from how it works on UNIX or with TCP. If this is actually what is happening, I'm not sure I'll be able to do much about it.

@gdamore
Copy link
Contributor

gdamore commented Jan 23, 2020

Docs about this are here: https://docs.microsoft.com/en-us/windows/win32/api/namedpipeapi/nf-namedpipeapi-disconnectnamedpipe

Essentially you have to wait for the clients to see the pipe error, and close the pipe.

This does mean that a malicious client can hold the pipe indefinitely, and prevent future server instances from listening. This is a design defect in Named Pipes in Windows, IMO. (But again, Windows wasn't designed for multi-user use, really, and this API reflects that -- mostly processes are assumed to be cooperative and not actively hostile.)

@jeikabu
Copy link
Contributor Author

jeikabu commented Jan 24, 2020

Yes, this is actual Windows, not WSL.
Tried looking into this and this seems to be what's going on:

Under normal circumstances, e.g. pair1_pipe_recv_cb receives an aio error and calls nni_pipe_close.(). nni_pipe_close() kicks off nni_reap(&p->p_reap, (nni_cb) pipe_destroy, p) and when it gets into ipctran_pipe_fini(), refcnt drops to 0. Then nni_listener_reap() -> nni_listener_destroy() and ipctran_ep_fini() cleans everything up.

Things go awry if nng_close() kicks off nni_reap(&l->l_reap, (nni_cb) nni_listener_reap, l) before pipe_destroy. In this case when it gets to ipctran_ep_fini, refcnt != 0 and things aren't cleaned up. Next iteration around the loop CreateNamedPipe() fails with EADDRINUSE.

In the above sample, you can prevent this by delaying nng_close() with nng_msleep(10) or whatever.

jake-ruyi pushed a commit to jeikabu/nng.NETCore that referenced this issue Jan 26, 2020
- Starting with nng 1.2 NNG_OPT_LOCADDR fails unless the listener is started
- Add several missing defines
- ws protocol uses ephemeral ports
- Disable ipc tests to workaround bug nanomsg/nng#1175
jeikabu added a commit to jeikabu/nng.NETCore that referenced this issue Jan 28, 2020
- Starting with nng 1.2 NNG_OPT_LOCADDR fails unless the listener is started
- Add several missing defines
- ws protocol uses ephemeral ports
- Disable ipc tests to workaround bug nanomsg/nng#1175
- SocketTests work with protocols using ephemeral ports (tcp/ws)
jeikabu added a commit to jeikabu/nng.NETCore that referenced this issue Jan 28, 2020
- Starting with nng 1.2 NNG_OPT_LOCADDR fails unless the listener is started
- Add several missing defines
- ws protocol uses ephemeral ports
- Disable ipc tests to workaround bug nanomsg/nng#1175
- SocketTests work with protocols using ephemeral ports (tcp/ws)
- Duplicate listeners on same url will only fail when not using ephemeral ports (":0")
jeikabu added a commit to jeikabu/nng.NETCore that referenced this issue Jan 28, 2020
- Starting with nng 1.2 NNG_OPT_LOCADDR fails unless the listener is started
- Add several missing defines
- ws protocol uses ephemeral ports
- Disable ipc tests to workaround bug nanomsg/nng#1175
- SocketTests work with protocols using ephemeral ports (tcp/ws)
- Duplicate listeners on same url will only fail when not using ephemeral ports (":0")
@gdamore
Copy link
Contributor

gdamore commented Feb 25, 2020

Your last comment is relevant. I'll have to think hard about properly fixing this safely -- it probably means we need to change the way we tear down the listener -- these pipes probably need to add to the reference count on the listener, so that the listener isn't fully discarded until the last pipe closes.

This is quite unlike POSIX, where the service address (the file name) is only used at rendezvous, and the two sockets can happily stay connected without impacting the ability to delete and create the socket in the file system.

I might not get to this for 1.3, but only because I'm trying hard to get a 1.3 out the door soonish.

@gdamore
Copy link
Contributor

gdamore commented Dec 30, 2021

Closing both sockets should clean this up properly. I'm still trying to figure out what resource is left behind.

@gdamore
Copy link
Contributor

gdamore commented Dec 30, 2021

Actually I may have found it... the conn_aio should have been closed too.

@gdamore
Copy link
Contributor

gdamore commented Dec 30, 2021

I've published what I think might be a fix for this (I can see circumstances where possibly this impacts TCP as well btw.)

Please have a look at my fix1175 branch and give it a shot. It's not easy for me to test Windows right this second (because I don't have any decent toolchain for Windows ARM).

@alzix
Copy link
Contributor

alzix commented Oct 31, 2022

tested the changes proposed in #1562 on Windows 10 x64 - the issue persists

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants