Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

http_server deadlocks on shutdown #1598

Closed
L0PiTaL opened this issue Jun 15, 2022 · 7 comments · Fixed by #1604
Closed

http_server deadlocks on shutdown #1598

L0PiTaL opened this issue Jun 15, 2022 · 7 comments · Fixed by #1604

Comments

@L0PiTaL
Copy link
Contributor

L0PiTaL commented Jun 15, 2022

Describe the bug
Probably is not a bug, but an issue on my code, but I cannot figure out what should be done.

So, before calling it a bug, I would like to know if the shutdown code is correct, or I missed some other function.

Expected behavior
After receiving a particular GET query, the server must shut down.

Actual Behavior
The server deadlocks in nng_http_server_stop

** Environment Details **

  • latest master (41ce9fb)
  • Windows 10
  • C++
  • Static library

Additional context
The issue is a lot more visible when using the VPN client AstrillVPN.
With it, it happens like 80% of the times, even if it is not enabled (just being installed).

To Reproduce

The following code deadlocks.
All error handling has been removed for clarity. No errors are thrown.

The code deadlocks when calling nng_http_server_stop().
It hangs waiting for closing of all connections (in http_server_stop(), line nni_cv_wait(&s->cv); )

While the reaper thread trying to close the connections is waiting for completion of the task on conn->rd_aio
(in nni_http_conn_fini(), line nni_aio_stop(conn->rd_aio);

Thank you for any input you can provide, as I am out of ideas about what to do here....
Best regards,
Ruben

#include <nng/nng.h>
#include <nng/supplemental/http/http.h>

#include <iostream>
#include <string>
#include <thread>
#include <chrono>
#include <atomic>

std::atomic<bool> callback_done(false);

static void handler_callback(nng_aio *io) {
    // Actual handler work and response generation
    callback_done = true;
    
    nng_http_res *nng_res=NULL;
    nng_http_res_alloc(&nng_res);
    
    // It doesn't matter if it is just a redirection or an actual HTML page.
    
    //nng_http_res_set_status(nng_res, 302);
    //nng_http_res_set_header(nng_res, "Location", "https://www.google.com");
    //nng_http_res_set_header(nng_res, "Connection", "close");
    
    nng_http_res_set_status(nng_res, 200);
    std::string body = "<html><body>Hello World</body></html>";
    nng_http_res_copy_data(nng_res, body.c_str(), body.size());
    
        
    // Return the response
    nng_aio_set_output(io, 0, nng_res);
    
    // And finish aio
    nng_aio_finish(io, 0);
}

int main (int argc, char *argv[]) {
    // Setup server
    // ------------
    
    // Create server
    nng_http_server *server = nullptr;
    {
        nng_url *server_url=nullptr;
        nng_url_parse(&server_url, "http://127.0.0.1:8080");
    
        nng_http_server_hold( &server, server_url);
        nng_url_free(server_url);
    }
    
    // Create and add handler
    nng_http_handler *nng_handler=nullptr;
    nng_http_handler_alloc(&nng_handler, "/callback", &handler_callback);
    nng_http_server_add_handler(server, nng_handler);
        
    // Start server   
    nng_http_server_start(server);
    
    // Normal Server Work
    // ------------------
    
    // Wait for the callback to be called
    while (!callback_done) {
        std::this_thread::sleep_for(std::chrono::milliseconds(20));
    }
    
    // Shutdown server
    // ---------------
    
    // Stop server
    nng_http_server_stop(server); // HERE it deadlocks waiting for completion of the task on conn->rd_aio
    
    // Remove handler
    nng_http_server_del_handler(server, nng_handler);
    
    // Destroy server and handler
    nng_http_server_release(server);
    nng_http_handler_free(nng_handler);
    
    return 0;
}

L0PiTaL added a commit to NetsoftHoldings/nng that referenced this issue Jun 17, 2022
If CancelIoEx fails, it does not call the read/write callbacks, so the AIOs are never finished, leading to a deadlock when closing a TCP server.
L0PiTaL added a commit to NetsoftHoldings/nng that referenced this issue Jun 17, 2022
If CancelIoEx fails, it does not call the read/write callbacks, so the AIOs are never finished, leading to a deadlock when closing a TCP server.
@L0PiTaL
Copy link
Contributor Author

L0PiTaL commented Jun 18, 2022

Ok, I have tracked down the issue.
It is due to CancelIoEx returning an error, so it does not perform the last call to the callback, thus the aio doesn't finish.

I have added a PR solving this (#1599), but I have only focused on win_tcpconn.c, but similar constructs with CancelIoEx also appear in several other files, which I guess will suffer of the same issue.
If you consider this solution correct, I can go ahead and modify also the rest, but first I wanted to have your confirmation on the solution.

@gdamore
Copy link
Contributor

gdamore commented Jun 20, 2022

Thanks for your work to track this down. Please see my reply. I'm concerned that you may be observing behavior that is not the root cause, and addressing that symptom may expose us to further races/corruption. Please see my comments on that PR. If I have time I'll try to dig in a bit more deeply as well, although I'm not really a Windows expert. Thanks again for your efforts here.

@L0PiTaL
Copy link
Contributor Author

L0PiTaL commented Jun 21, 2022

@gdamore
I reply you here, so we can keep the conversation in one place.

This is exactly why I wanted to share the code with you prior to make any other changes, as I didn't really know how to handle the error.

I concerned what the semantics here of CancelIoEx are. If this fails, does it mean that the I/O is still pending?

I really don't know. The documentation for CancelIoEx is quite thin....

It would be useful to know the error code... is it ERROR_NOT_FOUND? If so, why? Is this happening because cancel is racing against finish?

In this particular case, yes, the error is ERROR_NOT_FOUND. I don't know why....

The code I showed above is the complete code leading to this issue.
Can you confirm it is performing the shutdown correctly? Maybe I have missed something which is causing this issue, or there is something else I need to clear before or something....

Summarizing the problem:
The issue is that the program hungs on nng_http_server_stop, waiting for all the aios to be closed, but one (conn->rd_aio) never finishes.
This is because the CancelIoEx call fails (with ERROR_NOT_FOUND), thus the aio is still waiting for more callbacks.
I don't know what causes that error.

Maybe a better solution would be to check for ERROR_NOT_FOUND prior to clear all the AIOs, instead of doing it on any error...
But yes, I agree with you that this is just a "patch", not solving the root cause....

L0PiTaL added a commit to NetsoftHoldings/nng that referenced this issue Jun 21, 2022
If CancelIoEx fails, it does not call the read/write callbacks, so the AIOs are never finished, leading to a deadlock when closing a TCP server.
@L0PiTaL
Copy link
Contributor Author

L0PiTaL commented Jul 11, 2022

@gdamore
One thing I have seen, is that even when it has already called CancelIoEx with a result of ERROR_NOT_FOUND, when closing Chrome it still triggers the callback... This leads to the crash I mentioned in the PR, which made me close it as not a valid solution.

So I am completely lost here.....

@gdamore
Copy link
Contributor

gdamore commented Jul 11, 2022

It strikes me that there can be a race here. CancelIoEx is getting called while the callback itself is pending/dispatched. So from the I/O perspective, there is nothing left to cancel.

Really, CancelIoEx shouldn't fail, and if it does, it should be fine, because it should indicate that I/O is already being completed. That was the original design intent I believe.

I'll need some time to figure this out. Teardown is always the hard part of these problems.

L0PiTaL added a commit to NetsoftHoldings/nng that referenced this issue Jul 13, 2022
If CancelIoEx fails, it does not call the read/write callbacks, so the AIOs are never finished, leading to an infinite wait when closing a TCP server.
@L0PiTaL
Copy link
Contributor Author

L0PiTaL commented Jul 13, 2022

@gdamore

Really, CancelIoEx shouldn't fail, and if it does, it should be fine, because it should indicate that I/O is already being completed. That was the original design intent I believe.

I think so, but it fails while the completion is not called.

I have created a PR with a different workaround, which should be safer than before, which consists on pushing a 0 byte completion packet to the pending aio, only if CancelIoEx fails with not found BUT there are still aios pending.

@L0PiTaL
Copy link
Contributor Author

L0PiTaL commented Jul 14, 2022

@gdamore

You were completely right. It wasn't related to CancelIoEx, but the root cause was somewhere else, in the setsockopt function call.

The PR solves this issue, and now I can close the HTTP server without any other issue.

Thanks for your help and support!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants