Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Client connection is lost with no obvious reason #174

Closed
zllvm opened this issue Nov 29, 2022 · 10 comments
Closed

Client connection is lost with no obvious reason #174

zllvm opened this issue Nov 29, 2022 · 10 comments

Comments

@zllvm
Copy link

zllvm commented Nov 29, 2022

I implemented an integration test that uses TcpClient and TcpServer. This test fails sporadically because the client loses the connection (e.g., disconnecting from 127.0.0.1:12345 due to connection lost) when trying to PollSocket, though the server is up and running.

I've treated this problem by increasing ConnectionLostEvaluationIntervalMs so that PollSocket is not called. Then the test is passed, and the client communicates correctly with the server.

The implementation of PollSocket does not provide an apparent reason for the connection loss:

private bool PollSocket()
{
    try
    {
        if (_client.Client == null || !_client.Client.Connected)
            return false;

        /* pear to the documentation on Poll:
         * When passing SelectMode.SelectRead as a parameter to the Poll method it will return 
         * -either- true if Socket.Listen(Int32) has been called and a connection is pending;
         * -or- true if data is available for reading; 
         * -or- true if the connection has been closed, reset, or terminated; 
         * otherwise, returns false
         */
        if (!_client.Client.Poll(0, SelectMode.SelectRead))
            return true;

        var buff = new byte[1];
        var clientSentData = _client.Client.Receive(buff, SocketFlags.Peek) != 0;
        return clientSentData; //False here though Poll() succeeded means we had a disconnect!
    }
    catch (SocketException)
    {
        return false;
    }
}

Thereby, it could be good to include some logging.

I'd like to know if I misunderstood some concepts or if there're flaws in the PollSocket implementation?

@jchristn
Copy link
Owner

Do you have TCP keepalives enabled on either client or server? I don't suspect there is an issue in the implementation, but if you can point to one, I'd love a PR with a fix!

@zllvm
Copy link
Author

zllvm commented Nov 30, 2022

I have keepalive enabled on the client but not on the test server. I've found another issue where you explain that keepalives can bring odd behavior. I can disable them and see if it helps.

It is worth mentioning that I cannot reproduce this issue on my windows pc, and it fails in the ci pipe that executes in a Linux environment (mcr.microsoft.com/dotnet/sdk:6.0-alpine). Therefore, it seems more likely to be something related to keepalives.

This issue can be closed, I suppose. And if I find a problem in the library code that I can fix, I'll do a PR as you suggest. Thanks!

@jchristn
Copy link
Owner

Hi @zllvm TCP keepalives are the bane of my existence (at least with these libraries...). I have unfortunately zero control over them, and they only exist to solve a very niche problem (cable removed from switch or router) and it often doesn't even do a good job of solving that, not to mention they seemingly introduce random problems everywhere. Please let me know if disabling them helps. Please re-open if the problem continues with keepalives disabled! Cheers, Joel

@zllvm
Copy link
Author

zllvm commented Dec 2, 2022

Hi @jchristn! Aw, I understand your pain!

It works much better with keepalives disabled, though it still disconnects occasionally. I started to debug it by increasing logging in your library. It turns out that it fails on:

var clientSentData = _client.Client.Receive(buff, SocketFlags.Peek) != 0;

with System.Net.Sockets.SocketException (110): Operation timed out

After that, the connection is forcefully closed by ConnectedMonitor. Nevertheless, I'm pretty sure that the connection is still alive.

If I understand it correctly, Receive is called in a non-blocking mode and therefore throws SocketException when there's no data according to the microsoft doc. In this document, it's suggested to check the Available property to know if there's data.

Thereby, I realized that I don't fully comprehend the logic in PollSocket:

  • If no data is available at a specific moment, how could Receive indicate a connection problem?
  • Can we be sure that SocketException on Receive always implies a connection loss?

Likewise, what do you think about configuring the ConnectedMonitor task so it can be optionally disabled?

@zllvm
Copy link
Author

zllvm commented Dec 7, 2022

I've done the following change in PollSocket method:

catch (SocketException ex)
{
    Logger?.Invoke($"{Header}poll socket from {ServerIpPort} failed with ex = {ex}");
    return ex.SocketErrorCode == SocketError.TimedOut;
}

After that, I deployed it to the dev server and started a TCP connection that has been up and running for two days with no issues. I could test it by running various commands on the open connection, which worked perfectly. The log entry that I've added $"{Header}poll socket from {ServerIpPort} failed with ex = {ex}" appeared several time times (~20) in the monitoring system with Operation timed out exception.

Evidently, PollSocket introduces a harmful behavior by disconnecting an existing connection. I don't know if it is possible to make it fully reliable. If not, then it should be allowed to override or disable it.

What is your opinion on this, @jchristn? Could we re-open this issue?

@Energiz0r
Copy link

FYI: We are having the exact same issue in our environment here.
Works stable when testing this on windows, but as soon as we deploy this to our Ubuntu server, disconnections start with Timeout 110 error.

@Espen-Kalhagen-Element-Logic

FYI: We are having the exact same issue in our environment here. Works stable when testing this on windows, but as soon as we deploy this to our Ubuntu server, disconnections start with Timeout 110 error.

I work with Energiz0r.
We need to use SuperSimpleTcp with the changes suggested by @zllvm, returning true if the exception is a timeout exception, to achieve stable tcp connections.

@jchristn jchristn reopened this Jan 18, 2023
@jchristn
Copy link
Owner

Hi, I will take a look today. Cheers

@jchristn
Copy link
Owner

Hi, could you please try v3.0.7?

NuGet: https://www.nuget.org/packages/SuperSimpleTcp/3.0.7
Commit: 825c017

@jchristn
Copy link
Owner

Please re-open if this doesn't address the issue. Cheers!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants